Mountain View (USA) - Google has added an application, called Tesseract, for optical character recognition (OCR) to its impressive collection of free software. It is a program that can be used to convert the text contained in an image, typically obtained by means of a scanner, into characters that can be understood by a word processor.
The engine behind Tesseract was originally created by HP, which stopped developing it in 1995, even though it was considered one of the best OCR programs of its time. About two years ago, HP donated the code to the University of Nevada, Las Vegas (UNLV), which has been fixing bugs ever since. A few months ago Google took on sponsorship of the initiative, and it now says that the program "is stable enough to be republished as open source".
Tesseract, however, still suffers from some important limitations. The first is that it supports only the English language (no Italian spell checker, so to speak). The second is that it cannot preserve page layout, such as columns and tables. The third is that it recognizes text printed on gray or colored paper poorly; in other words, it performs best only on classic black-on-white text. By Google's own admission, Tesseract is far less accurate than the best OCR packages on the market today.
It should be kept in mind that, although the UNLV developers have patched the code here and there, the technology behind Tesseract has remained essentially the same as it was ten years ago.
However, Google claims that Tesseract is "far more accurate than any open source OCR out there". Moreover, its license allows anyone to improve it and integrate it into other applications, which is no small thing.
Big G has promised that it will continue to work on the software, and to that end it is hiring experts in OCR-related technologies.
That Google is interested in OCR is not surprising: Big G makes heavy use of this technology to digitize books (see Google Book Search). Moreover, as a search engine, it has a particular interest in accelerating the transition of all human knowledge to digital formats that its spiders can index.
Tesseract is currently available only in source code form, which can be downloaded from this SourceForge.net page.