Having designated chess, with its clearly defined rules and winning conditions, to be a game conducive to attack by artificially intelligent systems, we will now consider the idea of winning conditions in a more general light, specifically with regard to linguistic tasks. The question, in short, is, “Can we define linguistic games and/or tasks with clearly formalized success conditions?” (The term “success conditions” has been substituted for “winning conditions” because not all of the examples we’ll be looking at share an equal degree of family resemblance to games. Nevertheless, a task with a success condition and a game with a winning condition will often be similarly tractable for similar reasons, so the substitution should be a valid one in all the following cases.)

When dealing with success conditions, it often becomes all too easy to beg the question. Consider a hypothetical machine that takes a word-pair as input. For example: (“dog”,”cat”), (“knife”,”blue”), (“kill”,”murder”), and (“fuse”,”join.”) The machine’s task is to output whether or not the words in the pair are synonyms. If we could program a context-free success condition, then the success-condition-checker algorithm would have to perform a task equivalent to the task whose success it is designed to check. In other words, it would have to (figuratively) say to itself, “I just outputted that ‘kill’ and ‘murder’ are synonyms. Are these words, in fact, synonyms? Let me check…” This inquiry merely repeats the inquiry originally being tested.

An ideal success condition is one that is completely context-free. A computer that takes as input a chess position and outputs whether or not one of the players is in checkmate is almost trivial to implement. The fact that chess’s success condition is trivially defined allows those who research the problem domain of chess to focus entirely on the heuristics used to reach the winning condition, not on defining the winning condition itself. However, finding linguistic domains with this kind of clearly defined success condition is a pipe dream.

Luckily, there are several ways to skirt the issue of creating a context-free success condition.

1) Man-Made Datasets

The first (and perhaps the most common) solution is to fake a context-free success condition by creating a large dataset of context-dependent success conditions. For the synonym-checker-program above, the need for a built-in success condition can be somewhat alleviated by using prejudged, man-made datasets like these: (“dog”,”cat”, FALSE), (“knife”,”blue”, FALSE), (“kill”,”murder”, TRUE), (“fuse”,”join”, TRUE), etc.

The interesting thing about this method is that for every such dataset constructed, there exists a hypothetical (yet trivial) algorithm that can perform with %100 success on that dataset. The algorithm would simply be allowed to access a copy of the very dataset against which it is being tested. This is similar to a chess program that is invested with a very extensive opening book but no other decision making procedures. Such a program would be able to play high level chess for several moves before its reservoir of data ran dry. Likewise a synonym checker would be able to successfully recognize synonyms for as long as the word-pairs happened to exist in its dataset.

The goal of using such a dataset, however, is not usually to make a trivial algorithm that operates only on predefined input data but to check (and help train) a less trivial algorithm that operates on more a general dataset. In effect, using such a dataset can turn a problem domain which is not game-like into one with reasonably well defined success conditions.

As a side note, in the particular case of the synonym-checker, the best way to stupefy the whole task might actually be to make a more-or-less exhaustive list of every possible word-pair. Thanks to many centuries of stellar lexicographic work, almost every non-technical word in the English language has been collected, documented, and defined. And to make matters easier, the average educated person only knows about 20,000 of these words, so making a table of every possible word-pair would entail about 20,000^2 entries. If each English word is, on average, 7 letters long and if each letter were represented as an eight-bit ASCII character, then the number of total bits used to store all of the words would equal 20,000^2 * 7 * 8. To each entry, we would need to append a TRUE/FALSE bit in order to designate whether the pair is a synonym or not. That’s another 20,000 bits. And I suppose we would need a special ASCII character used as a separator (like a comma) between the words. That’s another eight-bits per entry: 160,000 bits. So the grand total is:

(20,000^2) * 7 * 8 + 20,000 + 160,000 = 22,400,180,000 bits

Because there are 8,589,934,592 bits in a gigabyte, we would require a database totaling a measly 2.6 gigabytes. These days, this kind of hard drive space is pretty small. And even then, we could greatly reduce the size by including in the database only valid synonyms — with the assumption that anything not in the database is a non-synonym. Granted, there are actually about 500,000 words in the Oxford English Dictionary (which would require a word-pair database of more than 1,500 gigabytes) — but using a database with only the 20,000-word vocabulary possessed by an average intelligent human suggests that the algorithm and a human being would score equally well in the synonym-recognizing game. And that’s usually the goal, after all: human-level performance on intelligence-requiring tasks.

2) Partially Checkable Success Conditions

Instead of making the success condition context-independent by rolling together a lot of context-dependent conditions, another option is to make use of context-free conditions that aren’t quite sufficient. Before looking at an example, here’s a sample task:

Given a statement, produce a statement in response that is both 1) something an intelligent human might say and 2) doesn’t repeat any previous outputs. For example:

Input: “What’s your name?”

Output: “I’m Stephen.”

Input: “That cat is pretty.”

Output: “I hate cats.”

This is a stripped-down version of the Turing test — a version that completely disregards the context of any larger discourse. Nonetheless, this is a hard problem for a computer to deal with. Certainly, we can write an algorithm that checks whether a possible output has been previously outputted; but checking whether the output is something an intelligent human might say presupposes an ability to recognize something fundamental about intelligent language production, which is precisely what this particular task is designed to accomplish. Nonetheless, we can build a potentially adequate success condition by checking quasi-necessary conditions such as grammaticality. There exist parts-of-speech taggers and grammar checkers which can judge whether “I hate cats” and “I’m Stephen” are syntactically valid, which can possibly be stipulated as a necessary condition within the realm of this particular task. But grammaticality can, by no means, be considered a sufficient condition. After all, many unintelligent responses may be grammatically correct. But at least it’s something. If an algorithm always produces grammatical responses, it’s quite a bit closer to its goal than an algorithm that produces gibberish.

3) Tackle the Success Condition as a Separate Problem

The problem of defining a success condition can sometimes be tackled as its own problem — perhaps as a prerequisite to converting the original problem domain into a game-like problem domain.

This route suggests a combination of the two previously mentioned ways of dealing with an elusive success condition. The first method above should be recognizable as part-one of the two-part task of stupefication: available context-dependent knowledge must be distilled into a database-like form. And the second method above should be recognizable as part-two of the two-part task of stupefication: discoverable context-independent heuristics must be applied to the problem domain whenever the previously mentioned context-dependent knowledge is inadequate. If problem X has a success condition whose verification is itself a task in need of stupefication, then perhaps problem X is not ready to be tackled yet. The problem of making a program that will pass the Turing test provides a nice example of where abandoning the problem, in favor of tackling the evaluation of the problem’s success condition, could lead to an extremely fruitful simplification.

Instead of making a Turing Test Passer (TTP), suppose we turned our attention to making an Automated Turing Test Judge (ATTJ), which would take a conversation as input and would output whether or not the conversation’s interlocutors were both human beings or not. This is hardly an easy problem, but it is certainly an easier one. Why? Because its success conditions are extremely conducive to being stupefied by constructing a large dataset. There exist hundreds of thousands of literary dialogues, television scripts, recorded conversations, transcribed interrogations, instant messenger exchanges, and online forum debates — all available for use in a multifarious dataset of different linguistic interactions, each of which an intelligent human being would likely find to be intelligible. In addition, finding a dataset of conversations that a human being would likely not find intelligible would be more difficult but could be automatically generated by scrambling random words and/or sentences in the valid data. This dataset of good and bad input could then be used to test any number of artificially intelligent systems designed to be an ATTJ. This kind of automated testing is precisely what a TTP project lacks. A TTP can only be evaluated according to the judgment of human being who either is or is not fooled by the TTP. With a more clearly definable success condition, an ATTJ project would have a definite leg up on its less clearly defined counterpart. And who knows? If a successful ATTJ were to be created, it might constitute a major step toward creating a successful TTP. Formally defining the winning condition of the Turing Test would be like programming a chess computer to recognize checkmate — an ability which it could hardly survive without.

In closing, it is important to realize that creating an ATTJ and creating a TTP are not equivalent tasks. An off-the-cuff and incorrect supposition would be that if a computer can judge a Turing test then it can pass a Turing test. But this simply isn’t necessarily the case. We can see this clearly by assuming the existence of a theoretically perfect ATTJ — a program that can (for any inputted conversation) output whether or not the conversation’s interlocutors were human beings. In general, the task of deciding whether an input has a given property may be impressive — but it is much less difficult than generating an output with a given property. Suppose we attempted to bootstrap our ATTJ into a TTP by using the following method:

1) At any given point during the Turing test, if it is the ATTJ’s turn to talk, generate all possible responses that are one word in length (all 20,000 of them, let’s say.) We’ll call this set “R” (for “responses”) and we’ll call any given word in it “r.”

2) Give the ATTJ the following input: The conversation-so-far plus the first “r” in “R.” If the ATTJ decides that the conversation is one that a human being would have, then the ATTJ should respond with “r.” Otherwise, try the next “r” and the next “r” until one works, or until “R” is exhausted.

3) If a valid “r” is not found, proceed to generate all possible responses that are two words in length (all 20,000^2 of them, let’s suppose.) Call this set “R-2” and repeat step 2 above for all “r-2.” Then repeat step 3 again for “R-3” and “R-4” and so on until one of the following occurs: A) a valid response is found, in which case the process starts over at step 1; B) the size of “R-N” becomes larger than the computer’s available memory; C) The time it takes for the ATTJ algorithm to evaluate the validity of the current “r-N” is longer than an acceptable time limit.

Assuming the existence of a perfect ATTJ, this method might actually work for small N-values. But larger N-values quickly become intractable. And certain statements like “What’s your favorite 100-word sentence?” would seem to require responses with a large number of words. (An N of size 20,000^100 is likely to be intractable for most modern hardwares.)

This is hardly a formal proof that the existence of an ATTJ does not imply the existence of a TTP. Nonetheless it is strong evidence that a very simple algorithm would fail to convert an ATTJ into a TTP. Furthermore, the need for a more complex algorithm suggests with even more force that an ATTJ is not a sufficient condition for a TTP.

Additionally, those familiar with processes like evolutionary computation and neural net training will recognize that the creation of an ATTJ program is a domain with characteristics that make it very conducive to problem solving techniques that specialize in pattern recognition. A successful ATTJ, if one is ever created, is likely to utilize the pattern recognition abilities of neural nets and/or complex statistical models capable of capturing the common mathematical features of valid human-generated conversations. These methods, while historically highly successful at recognizing the key features of various system, do not immediately imply the ability to generate new systems with the same features. For example, a neural net with the ability to differentiate between a human face and a non-human face would not immediately be able to produce pictures of human faces. And a statistical model that could evaluate whether words in close proximity were words that tended (in common usage) to appear in close proximity would not necessarily be able to generate its own sensical sentences.

The upshot of everything is that we can gain a lot by first trying to turn a problem domain into a game-like domain with formally definable success conditions. And, failing this, much can be gained by tackling the evaluation of a domain’s success conditions as a unique problem domain in itself. This shifting of domains may lead to a useful reduction in the required level of algorithmic complexity, just as creating an Automated Turing Test Judge represents an easier milestone for artificial intelligence researchers than does the creation of a Turing Test Passer.