Following up on my earlier blog entry, Google Books Killing Public Domain, written in late 2006, I found an interesting paper by Luc Vincent from Google, presented at ICDAR 2007, called Google Book Search: Document Understanding on a Massive Scale.
The paper covers the challenges encountered when doing OCR and how to analyse and understand the results from the scanned documents. It's a very difficult topic, including seemingly small things like page ordering or chapter detection. The publication also introduces the ongoing work on the Tesseract OCR engine and the OCRopus framework, both released as free software. It's still very much beta software, but it's nice to see Google releasing some parts of their software.
Besides all the positives, there is one small negative point:
"We believe we can help by making some large chunks of our (out of copyright) data available to the Document Analysis research community."
This part reminds me of my old blog entry about the public domain books scanned by Google and becoming proprietary works again… Why don't they release all the public domain datasets to everyone, without the current restrictive license? That would be easier and could yield more interesting results (scientific or not), just like the datasets available from Wikipedia.