Google Ngrams: in the beginning was the word search
October 3rd, 2011 by Ville Miettinen
Long, long ago, before Android, G+ and self-driving cars, Google had one simple mission: to organize the world’s information and make it universally accessible and easy to use.
Nowadays, the big-friendly-search-giant sometimes seems more interested in irritating Mark Zuckerberg than promoting universal knowledge. But, just occasionally, Google gets back to basics.
In 2004 Google started digitizing books. Since then, 15 million volumes have been scanned and converted by OCR software into Google’s virtual library. Recently, Harvard scholars Erez Lieberman Aiden and Jean-Baptiste Michel decided to try to turn this literary data-mountain into something “useful and accessible”. The result is the Google Ngram Viewer: a tool that searches and graphs the frequency of words contained in over 5 million books. Basically, you type in a word and get back a pretty-yet-educational chart of, say, religion vs science or drinking habits through the ages. As Aiden and Michel enthusiastically demonstrated in a recent TED talk, it’s surprisingly addictive.
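For the curious, here is a rough Python sketch of the kind of counting a chart like this implies: for each year, what share of all the words printed are the word you typed in? This is only an illustration, not Google's code. The function name `yearly_frequency` and the two-line toy corpus are invented for the example, and dividing by the total word count per year is an assumption about how such a chart would be normalized.

```python
from collections import defaultdict


def yearly_frequency(corpus, word):
    """Return {year: share of that year's words that equal `word`}.

    `corpus` is a list of (year, text) pairs. Tokenization is a plain
    lowercase whitespace split, which is far cruder than a real tool.
    """
    hits = defaultdict(int)    # occurrences of `word` per year
    totals = defaultdict(int)  # all words per year
    target = word.lower()
    for year, text in corpus:
        tokens = text.lower().split()
        totals[year] += len(tokens)
        hits[year] += tokens.count(target)
    # Skip years with no words at all to avoid dividing by zero.
    return {y: hits[y] / totals[y] for y in sorted(totals) if totals[y]}


# Toy corpus, made up for illustration (hypothetical data).
toy_corpus = [
    (1800, "the ufe of reason is the use of man"),
    (1900, "the use of science is the use of reason"),
]
```

On this toy corpus, `yearly_frequency(toy_corpus, "use")` gives 1/9 for 1800 and 2/9 for 1900, while "ufe" shows up only in 1800. Plot those numbers against the years and you have a very small Ngram chart.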
So, where’s the crowdsourcing angle in all this? Well, interestingly, the OCR quality of Google’s scanned books varies widely. Aiden and Michel found only about a third of the texts were good enough to use for Ngrams. So, how about a little help from the crowd? Google already uses reCAPTCHA to help decipher old New York Times issues, so why not go further? As we know from Digitalkoot, people are amazingly good at reading difficult text and love the chance to contribute to worthwhile projects, especially if they are presented in an entertaining form.
The crowd might even be able to tackle handwritten documents, and correct OCR errors (just try charting use vs ufe to see the problems OCR software has telling “s” and “f” apart). Imagine a GoogleBookHunt crowdsourced game – factual and fun, innovative and informative. It’s the beft of all possible worlds.