
Boxes like the one above will be familiar, and perhaps frustrating, to anyone who logs on to websites these days. The text is often difficult to read, and I doubt that I am alone in regularly misreading it and guessing wrong. The test is, of course, meant to be hard. The idea is to present text that is barely legible, so that humans can decipher it but computers cannot. To do this, the text is deliberately distorted to the point where it often takes some effort to figure it out.
Such a box is known as a CAPTCHA, which stands for “Completely Automated Public Turing test to tell Computers and Humans Apart”. In general, tests of this type are known as reverse Turing tests, after the mathematical genius and computing pioneer Alan Turing.

Sculpture of Alan Turing with apple in Manchester, England
In the course of his tragically short life, Alan Turing laid the foundations for the new field of computer science, broke the code of the German naval Enigma machine, thus saving countless lives in the Second World War, and came up with the test that bears his name. Even though computers were little more than a concept in his time, Turing was deeply interested in the potential of computing machines, and how far they could go toward thinking like humans. Rather than ask ‘can computers think?’, a question which he thought was mired in semantic problems, he came up with the idea that, if a computer could fool a human being into believing that it was communicating with another human, it could be said to be demonstrating ‘artificial’ human intelligence. He proposed that a test could be devised to determine whether a computer met this criterion, which became popularly known as the Turing Test.
In 1952 he was convicted of gross indecency for a homosexual relationship, which was illegal in the UK at the time. In order to avoid prison, he was compelled to undergo hormonal injections to suppress his libido. He also lost his security clearance, which ended his government cryptographic work. On 8 June 1954, he was found dead with traces of cyanide in his system and a half-eaten apple next to his bed (the bite out of Apple’s logo is rumoured to be a tribute to Turing). The death was ruled a suicide. Turing was 41 years old. Time magazine named him one of the 100 most influential people of the 20th century.
Designing a test to reliably tell computers from humans is precisely the problem facing internet security analysts. In the early days of the World Wide Web, internet mail sites such as Hotmail and Yahoo! found that spammers could program computers to set up multiple accounts and use them to send spam. Similarly, blogging sites and internet forums were plagued by spammers plying their wares via comments and posts.
In 2000, Yahoo! started using CAPTCHAs designed by Luis von Ahn and Manuel Blum to counter the spam threat. Computers could not read the images, and thus were not able to complete the log-in process. Von Ahn and Blum had designed a Turing test that stopped spammers in their tracks. The idea was soon taken on by many other popular websites.

Luis von Ahn, virtually
Luis von Ahn has become a highly influential figure in web thinking. Discover magazine named him one of the fifty best brains in science, and in 2006 he was awarded a prestigious MacArthur Fellowship (also known as the Genius Award).
His principal area of study is the field of human computation. Turning traditional thinking on its head, von Ahn is interested in how humans can help computers to think. For example, humans look at an image and can immediately make out its constituent elements, whereas to a computer it is just a series of zeros and ones. Ideally a computer could tag photos uploaded to the internet so that humans could search for them by keyword, but there is currently no way that computers can do this reliably. Von Ahn’s idea is to find ways to use human thinking power to achieve tasks like this. For example, he came up with the concept for the Google Image Labeler, which provides tags for Google images by means of an online game. Two players work as a team, independently describing a series of photos, and are awarded points if they both come up with the same description. Google uses the information gained from this game to tag its stock of images more accurately so they can be found by its search engine.
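The matching rule behind a game like this can be sketched in a few lines of Python. This is only an illustration of the idea described above, not Google's implementation: the function names, the points value and the sample labels are all invented for the example.

```python
def normalise(label: str) -> str:
    """Lower-case and trim a label so 'Dog ' and 'dog' count as the same word."""
    return label.strip().lower()


def agreed_labels(player_a: list[str], player_b: list[str]) -> list[str]:
    """Return the labels that both players typed independently, sorted."""
    labels_a = {normalise(label) for label in player_a}
    labels_b = {normalise(label) for label in player_b}
    return sorted(labels_a & labels_b)


def score_round(player_a: list[str], player_b: list[str],
                points_per_match: int = 100) -> int:
    """Award points for each agreed label (the points value is an assumption)."""
    return points_per_match * len(agreed_labels(player_a, player_b))


# Hypothetical round: only labels typed by both players become tags.
tags = agreed_labels(["Dog", "grass", "ball"], ["puppy", "dog ", "Ball"])
```

Because the two players cannot see each other's answers, a label they both choose is likely to describe the picture, which is what makes it worth keeping as a tag.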
In the light of his breakthroughs in human computation, von Ahn reconsidered the enormous aggregate human brain power being used to decode millions of his CAPTCHAs each day. It occurred to him that while millions of people every day were having to decode ambiguous text purely as part of a log-in process, projects to convert old texts to digitised formats were foundering due to the shortcomings of their text recognition software. He decided to use his Genius Award grant to design a system which would allow CAPTCHAs to be usefully employed to assist in the conversion of old texts to digitised formats.

The limitations of optical character recognition (OCR)
The largest text digitisation project assisted by von Ahn’s process, which is known as reCAPTCHA, is run by the Open Content Alliance. The non-profit organisation has already converted over one million documents to digital format, all provided free of charge via its website. To digitise the documents, employees scan the pages of out-of-copyright texts, and the scanned images are then sent for optical character recognition (OCR) analysis. The analysis is about 90% accurate for newer books, but falls off to 60% for older texts. The unidentifiable text is then broken down into individual words and sent to participating websites as reCAPTCHAs. The user is shown two words: one which has already been identified and acts as a control, and one which is still unidentified. Both have been stretched and distorted in ways which perplex computers but not people. The person logging in then enters both words. Correct recognition of the control word provides the login authorisation, while the user’s input for the other word is recorded. Each unidentified word is served up a number of times until a consensus is reached as to what the word is. The process has a success rate of around 99.1%. The agreed answer is then returned and inserted into the scanned text.
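The control-word check and the consensus step described above can be sketched roughly as follows in Python. This is a simplified illustration, not the actual reCAPTCHA code: the class and function names, the minimum of three votes, the 75% agreement threshold and the choice to record a guess only when the control word is correct are all assumptions made for the example.

```python
from collections import Counter


def normalise(word: str) -> str:
    """Ignore case and surrounding spaces when comparing typed words."""
    return word.strip().lower()


class UnknownWord:
    """Collects human guesses for one word the OCR software could not read."""

    def __init__(self, image_id: str, min_votes: int = 3,
                 min_agreement: float = 0.75):
        self.image_id = image_id            # hypothetical identifier
        self.min_votes = min_votes          # assumed threshold
        self.min_agreement = min_agreement  # assumed threshold
        self.votes = Counter()

    def record(self, guess: str) -> None:
        self.votes[normalise(guess)] += 1

    def consensus(self):
        """Return the agreed word, or None if there is no consensus yet."""
        total = sum(self.votes.values())
        if total == 0 or total < self.min_votes:
            return None
        word, count = self.votes.most_common(1)[0]
        return word if count / total >= self.min_agreement else None


def check_submission(control_answer: str, control_guess: str,
                     unknown: UnknownWord, unknown_guess: str) -> bool:
    """Authorise the login on the control word; note the other guess."""
    if normalise(control_guess) != normalise(control_answer):
        return False                 # failed the test, guess is discarded
    unknown.record(unknown_guess)    # trusted enough to count as a vote
    return True


# Hypothetical use: three users who pass the control agree on "upon".
word = UnknownWord("scan-001")
for typed in ("upon", "Upon", "upon "):
    check_submission("morning", "morning", word, typed)
```

The website only ever needs to know the control word; the unidentified word is served again to other users until `consensus()` returns an answer.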
The reCAPTCHA service is provided free of charge to participating websites, which now include Facebook, Twitter, Craigslist, StumbleUpon and many more. Around 160 books’ worth of words are digitised every day through the process. ReCAPTCHA is also being used to digitise the entire archive of the New York Times, all the way back to 1851, which will be posted online with free access. Von Ahn’s goal is to digitise every out-of-copyright book and provide them all free of charge on the internet. This will keep us all busy logging in for a while, though. At the current rate of progress, the job will take 400 years to complete.

