Tuesday, August 26, 2008

Will reCAPTCHA Save Humanity?

Probably not, but it will make us more productive during the tedious process of proving our humanhood to some random web server.

Anyone who's using the internet these days for anything more involved than reading news knows those annoying sets of garbled characters that appear at the end of checkout and registration pages.


Those are called CAPTCHAs (an acronym for "completely automated public Turing test to tell computers and humans apart"), and they are designed to test that we, the users of the web site, are in fact human. Now, I don't disagree with the motivation behind CAPTCHAs. Today's Web is infested with crawling bots and other malicious agents running around causing all kinds of mayhem. But it just seems like a terrible waste of time for something that ought to be straightforward (how many times have you mistyped a CAPTCHA and had to enter it over and over again until the server acknowledged your humanhood?).

Luis von Ahn, one of the inventors of the CAPTCHA, said in a recent interview that by his estimate people spend an average of 10 seconds solving one of those CAPTCHA puzzles. Multiply that by the number of CAPTCHAs solved daily (approx. 200 million) and you arrive at the staggering figure of roughly 500,000 hours per day worldwide. Astonishing when you think about it in these terms.
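For anyone who wants to check the arithmetic, here is a quick sketch using only the two estimates quoted above (10 seconds per CAPTCHA, about 200 million CAPTCHAs a day). The variable and function names are mine, chosen for illustration.

```python
# Rough estimates quoted in the interview; both are approximations.
SECONDS_PER_CAPTCHA = 10
CAPTCHAS_PER_DAY = 200_000_000


def hours_per_day(seconds_each: int, count: int) -> float:
    """Total human time spent per day, converted from seconds to hours."""
    return seconds_each * count / 3600


total_hours = hours_per_day(SECONDS_PER_CAPTCHA, CAPTCHAS_PER_DAY)
print(f"{total_hours:,.0f} hours per day")  # prints "555,556 hours per day"
```

That works out to a bit over 550,000 hours, which is in line with the "approximately 500,000" figure once you allow for how rough the inputs are.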

That's exactly why he came up with the brilliant idea of the reCAPTCHA.

There is a growing number of libraries and archives that are working on digitizing their entire collections. The process involves scanning the printed document (book, newspaper, magazine, historical document, etc.) to an image file, and then running OCR (Optical Character Recognition) software that tries to recognize the words in the scanned image and turn them into a searchable text document. It turns out that OCR software can't "read" every document with 100% accuracy. That's where the power of crowd-sourcing comes in.

reCAPTCHA takes the words that OCR software couldn't decipher (and which therefore can't be "read" by malicious robots either) and displays them to humans. Each reCAPTCHA actually consists of two words: one that was successfully recognized by the computer and one that wasn't. The known word checks that you are human, and your answer for the other one counts as a vote on what it says. Each image is also shown to several people to verify the accuracy of the transcription. If they agree, the transcription is considered accurate and is added to the text it originally came from.

Currently more than 40,000 web sites worldwide are using reCAPTCHA technology, including some you might have heard of like Ticketmaster, Facebook and Craigslist. There are plug-ins for WordPress, Joomla, Drupal and many other popular web applications, as well as APIs for PHP, ASP.NET, Java, Perl, Ruby, etc. The implementation is simple and there are lots of resources available for site developers.
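To make the two-word idea a little more concrete, here is a toy sketch in Python. This is not reCAPTCHA's actual code or API. The function names, the sample words and the three-vote threshold are all assumptions of mine, just to illustrate the logic: trust a guess only if the same person got the known word right, and accept a transcription only once enough people agree.

```python
from collections import Counter
from typing import Iterable, Optional


def normalize(word: str) -> str:
    """Ignore case and stray whitespace when comparing answers."""
    return word.strip().lower()


def passes_challenge(control_answer: str, control_truth: str) -> bool:
    """Treat the user as human if the known (control) word was typed correctly."""
    return normalize(control_answer) == normalize(control_truth)


def accepted_transcription(guesses: Iterable[str], min_votes: int = 3) -> Optional[str]:
    """Return the most common reading of the unknown word, or None if too few agree.

    min_votes is a made-up threshold for illustration only.
    """
    counts = Counter(normalize(g) for g in guesses)
    if not counts:
        return None
    word, votes = counts.most_common(1)[0]
    return word if votes >= min_votes else None


# Hypothetical submissions: (answer for the known word, guess for the unknown word)
control_truth = "morning"
submissions = [
    ("morning", "overlooks"),
    ("Morning", "overlooks"),
    ("mourning", "overlocks"),   # failed the known word, so this guess is ignored
    ("morning", "overlooks "),
    ("morning", "overbooks"),
]

trusted = [guess for control, guess in submissions
           if passes_challenge(control, control_truth)]
print(accepted_transcription(trusted))  # prints "overlooks"
```

Three of the four trusted guesses agree, so "overlooks" is accepted. With fewer matching votes the function returns None and the word would simply be shown to more people.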

Next time you run into those squiggly characters when commenting on a blog, or signing up for an online email account, make sure your time is not wasted on a standard CAPTCHA text. Let the site owners know your time could be spent on saving humanity, or at least its written word.
