Monday, April 30, 2012

Background, experiment and results of CSI Collecting (CSIC)

The essential idea of CSI Collecting is to incorporate human pattern detection to improve the capabilities of search and optimization algorithms.

The question is, do humans have the ability to detect patterns any better than computers?  If not, then any advantage human interaction brings is equivalent to some computer algorithm.  Hence, there is nothing especially unique about human involvement, and there is no guarantee it will bring an advantage to a particular problem.

On the other hand, if humans do have a pattern recognition capability that is non-computational, then we can expect human involvement to bring an advantage to all problems where an advantage is possible.  There are problems that can only be solved by trying all combinations, and in such problems human interaction will bring no advantage, whether that interaction is non-computational or not.

Why would we think a non-computational ability exists in humans?  This question is a bit more involved to answer.  The answer is a hypothesis, and is based on a couple of assumptions which are potentially testable.

The starting point is Dembski and Marks's paper "The Search for a Search".  This paper makes two important points: 1) a search process cannot produce more information than is originally placed in the search algorithm, and 2) a search for a search likewise cannot produce more information than is placed in the initial search algorithm.  The term information here refers to a metric of how much more likely a search algorithm is to find a search target than a purely random sample based search.
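To make this metric concrete: in Dembski and Marks's framework it is called active information.  A minimal rendering in LaTeX notation, using symbols of my own choosing rather than a quotation from the paper:

    I_{+} = \log_2\!\left(\frac{q}{p}\right), \qquad p = \frac{|T|}{|\Omega|}

where \Omega is the search space, T \subseteq \Omega is the target, p is the chance that a single uniform random sample hits the target, and q is the chance that the search algorithm in question succeeds under the same query budget.  Points 1 and 2 then say that neither a search nor a search for a search can exhibit more positive I_{+} than was built into it.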

According to these two points, if the only means we have of inserting information into a search process is another search process, then we can never actually insert information into a search process.  This means that, across all problem domains, any search will perform no better than a random sample search.  The only way to get information into a search process is by a non-search.  However, all algorithms capable of inserting information can themselves be characterized as searches.  Thus, the source of information must ultimately be non-algorithmic.  Since all computers are algorithmic, this means the source of information must also be non-computational.

The one known source of non-algorithmic information is intelligence.  According to intelligent design, intelligence can create information, in particular, complex specified information.  However, is complex specified information the same kind of information that a search process requires?

The mathematical definition of complex specified information is, approximately, the amount of specificational resources multiplied by the number of specifications we are interested in, divided by the number of possible specifications.  This characterization can be mapped onto the search process.  The range of potential solutions that a search will examine is equivalent to the number of possible specifications.  The target set of solutions is equivalent to the specifications we are interested in.  The specificational resources are the number of solutions sampled by the search process.  If the search process selects the target specifications more often than we would expect given their proportion of the possible specifications, then when we calculate the CSI formula on the search process, it will contain information.
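One way to render that verbal formula in symbols (my own gloss, not a quotation of Dembski's definition):

    \phi = \frac{R \cdot |T|}{|\Omega|}

where R is the specificational resources (the number of solutions sampled), |T| is the number of specifications we are interested in, and |\Omega| is the number of possible specifications.  \phi approximates the chance that blind sampling hits the target within R tries; if the search's observed success rate q exceeds \phi, the process carries information, on the order of \log_2(q / \phi) bits.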

Now, if we invert the argument that CSI implies intelligence, then we can say intelligence can potentially, but not necessarily, impart information into a search process.  As we saw when comparing the definition of CSI to information in a search process, if information is imparted into a search process, it will find its target more often than without the information.  So, if an intelligent being, such as a person, is involved in a search process, then the intelligent being can cause a given search algorithm to perform better than is mathematically possible for the algorithm alone.

This is the theoretical basis behind the CSI Collecting (CSIC) method.  If humans have some sort of ability to insert information into any search algorithm, then there should be a general methodology for allowing humans to interact with search algorithms.  Does such a methodology exist in modern computer science?  Currently, a number of researchers are attempting to incorporate human interaction into search processes, the most well known example being the Google search engine, which uses our search patterns and web page linking to improve our search results.  However, none of these methods attempt to establish a generalized approach.  The general field is known as human computation, and there is currently no theoretical foundation for the field that is compatible with intelligent design's dual claims that algorithms cannot create information and that intelligent agents can.

Now, you may be asking, given that CSIC seems to be a unique approach to human computation, how do I plan to test my hypothesis, and what are my results?  First, I must admit there are practical problems in testing the efficacy of human computation.  The most glaring is distinguishing between A) a general non-algorithmic ability to improve a search algorithm and B) the happenstance that an algorithmic ability improves a search algorithm.  In theory the two cases are quite distinct: for any type of algorithmic ability in B, there will always be instances of search algorithms where A improves the search and B does not.  In practice it is not possible to test an ability over all possible problems; we can only test a very small subset.  Fortunately, there are many finite problem subsets that have approximately the same characteristics as the set of all problems.  So, if I pick such a subset, while I cannot categorically state that the tested ability falls in A rather than B, I can make the claim with a strong degree of confidence.

Which subset of problems has the necessary characteristic?  It is the subset of problems to which the No Free Lunch Theorem (NFLT) applies, or almost applies (ANFLT).  Unfortunately, the set of NFLT problems is not very large.  Fortunately, the set of ANFLT problems is very large.  In fact, most problems of interest quite likely fall into this set because they are NP-complete or harder.  So, as long as I pick a problem that is ANFLT, I can at least give a confidence rating for my hypothesis, as well as potentially generate useful results.
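For reference, the NFLT as stated by Wolpert and Macready says that, averaged over all possible fitness functions, any two search algorithms perform identically.  In roughly their notation (a sketch from memory; see their 1997 paper for the exact statement):

    \sum_f P(d_m^y \mid f, m, a_1) = \sum_f P(d_m^y \mid f, m, a_2)

where f ranges over all fitness functions, d_m^y is the sequence of m fitness values an algorithm visits, and a_1 and a_2 are any two algorithms.  ANFLT problems, as I am using the term, are those where this equality holds approximately rather than exactly.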

The second practical issue is determining whether humans categorically contributed to the search process, or whether I just picked a bad search algorithm.  Here too I cannot say definitively whether my algorithm is to blame or not.  The best I can do is give a degree of confidence.  And in my case, I cannot even mathematically quantify this confidence, since my current work is merely a first pass to see if the idea shows promise.

The general technique I use is as follows.  I use a standard multi-objective genetic algorithm.  A genetic algorithm takes a set of solutions, measures how good each solution is according to some fitness valuation function, varies the solutions to generate a new set using variation operators such as mutation and crossover, and then from both sets of solutions selects a final set according to some criteria.  The algorithm then reiterates this process on each subsequent final set until a stopping criterion is reached.  The solutions themselves are represented to the algorithm as fixed length bit strings.  The one innovation I add to create CSIC is to allow a person to provide input during the selection phase.  The person can only see the bit string solution and its fitness valuation.
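To make the setup concrete, below is a minimal sketch in Python of the loop just described, with the human hook in the selection phase.  Everything here is illustrative: the names, the single-objective placeholder fitness, and the parameters are my simplifications for this post, not the actual multi-objective implementation.

    import random

    BITS = 32            # solutions are fixed-length bit strings
    POP_SIZE = 10
    MUTATION_RATE = 0.02

    def random_solution():
        return [random.randint(0, 1) for _ in range(BITS)]

    def fitness(solution):
        # Placeholder valuation: count of 1-bits.  A real problem would
        # plug in its own fitness function here.
        return sum(solution)

    def mutate(solution):
        return [b ^ 1 if random.random() < MUTATION_RATE else b
                for b in solution]

    def crossover(a, b):
        point = random.randrange(1, BITS)
        return a[:point] + b[point:]

    def algorithmic_select(candidates):
        # Default selection: keep the POP_SIZE best by fitness.
        return sorted(candidates, key=fitness, reverse=True)[:POP_SIZE]

    def human_select(candidates):
        # The CSIC hook: show the person only the bit strings and their
        # fitness valuations, and let them choose the survivors.
        for i, sol in enumerate(candidates):
            print(i, "".join(map(str, sol)), "fitness =", fitness(sol))
        picks = input("Indices of solutions to keep (comma-separated): ")
        return [candidates[int(i)] for i in picks.split(",")][:POP_SIZE]

    def evolve(generations, select=algorithmic_select):
        population = [random_solution() for _ in range(POP_SIZE)]
        for _ in range(generations):
            offspring = [mutate(crossover(random.choice(population),
                                          random.choice(population)))
                         for _ in range(POP_SIZE)]
            # Selection draws from both the old and the new set; pass
            # select=human_select to put a person in the loop.
            population = select(population + offspring)
        return max(population, key=fitness)

    print(fitness(evolve(50)))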

The metric of comparison between the human and the genetic algorithm is the number of unique solution valuations needed to improve on the best solution found to date.  The search process (algorithmic or human manual) requiring fewer unique solution valuations to find a better solution is considered the superior search process.
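In code, this metric can be tallied as below (again a sketch with hypothetical names; higher fitness is assumed to be better):

    def valuations_to_improve(evaluate, candidate_stream, best_so_far):
        # Count distinct solutions valued before one beats the current best;
        # the stream can come from the GA or from a person's manual guesses.
        seen = set()
        for solution in candidate_stream:
            key = tuple(solution)
            if key in seen:
                continue          # repeat valuations are not counted
            seen.add(key)
            if evaluate(solution) > best_so_far:
                return len(seen)  # the lower this count, the better the searcher
        return None               # stream exhausted without any improvement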

My first problem domain consisted of scheduling member roles for multiple club meetings.  This is a form of the well known scheduling problem, and as such is NP-complete (NPC) or harder.  My method was to run the genetic algorithm until it stopped finding better solutions for a large number of iterations.  Then I manually attempted to find better solutions.  Using only about 0.05% of the fitness valuations the algorithmic search had already consumed without further improvement, I was able to find three superior solutions.

My second problem domain of choice is that of finding the primes that generate an RSA public/private key pair, i.e. factoring the modulus.  While factoring is not known to be NP-complete, no efficient algorithm for it is known, and as such it is quite likely an ANFLT domain.  I ran the algorithmic search in tandem with 500 attempts by humans using the Amazon Mechanical Turk service.  The humans contributed 2.4% of the overall set of superior solutions.
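For a sense of what the bit strings encode in this domain, here is my reading of the setup as a sketch (the actual encoding and fitness function may differ; the toy modulus and names are hypothetical):

    # Cast factoring as a search over bit strings: a candidate solution is
    # a potential prime factor p of the public modulus N.
    N = 3233  # toy modulus, 61 * 53; a real RSA modulus is hundreds of digits

    def bits_to_int(bits):
        return int("".join(map(str, bits)), 2)

    def fitness(bits):
        # Score a candidate factor by how close N is to a multiple of it;
        # a true factor leaves remainder 0, the maximum possible fitness.
        p = bits_to_int(bits)
        if p < 2:
            return float("-inf")
        return -(N % p)

    candidate = [0, 0, 1, 1, 1, 1, 0, 1]  # 61 in binary
    assert fitness(candidate) == 0        # 61 divides 3233 exactly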

These results are still preliminary, and there is much hard mathematical work to be done to quantify specifically the degree to which humans contributed, and whether this contribution is statistically significant enough to validate the ID hypothesis.  However, at the very least, it is clear that humans can contribute to algorithmic optimization when given only the exact same data available to the algorithm.  So, the initial results show promise, however slight.