
You can scan a 500-page book in 20 minutes with your phone, non-destructively. It's not fun, though, and the quality will not be ideal, especially if the text also contains graphics. For text-only books, quality matters little as long as the OCR can read the text.
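For a sense of the pace that figure implies, here is a back-of-the-envelope sketch in Python. The function name and the two-pages-per-photo case are assumptions made for illustration, not taken from any particular app.

```python
def seconds_per_shot(pages: int, minutes: float, pages_per_shot: int = 1) -> float:
    """Average time available per photo to get through `pages` in `minutes`."""
    # Ceiling division: a leftover odd page still needs its own photo.
    shots = -(-pages // pages_per_shot)
    return minutes * 60 / shots


if __name__ == "__main__":
    # One page per photo.
    print(seconds_per_shot(500, 20))
    # Assumed alternative: each photo captures a two-page spread.
    print(seconds_per_shot(500, 20, pages_per_shot=2))
```

So 20 minutes leaves roughly two and a half seconds per page, or about five seconds per shot if each photo covers a two-page spread, and that includes turning the page.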


I think that's ambitious. I've scanned a handful of books, and although I don't know the page counts, it has always been a multi-hour project. The editing afterwards, marking the text on each page and adjusting the warp, tends to be especially time-consuming. I even had a mounted tripod and a remote to standardise the shots.


Voice Dream Scanner on iOS recognizes pages and text and performs OCR. You just go snap snap snap.

For text, that is sufficient. If you also want to capture the layout or graphics, things get much more complicated. Even just getting headings requires post-processing. So yeah, producing a proper ebook and just extracting the text are two different problems.


vFlat Scan [0] automates the unwarping.

[0] https://www.vflat.com/


There's also https://scantailor.org/ (and a maintained fork at https://github.com/4lex4/scantailor-advanced ) which semi-automates unwarping and other corrective tasks in scanned books.


I just tried this with a thumb-held warped page and Tesseract for OCR--worked pretty well! (Tesseract by itself struggles with warped pages.)
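For reference, a minimal sketch of driving the Tesseract command line from Python. It assumes the `tesseract` binary is installed and on PATH, and that the page image has already been unwarped by some other tool. The function names and the file name `page_flat.png` are hypothetical.

```python
import subprocess


def build_tesseract_cmd(image_path: str, lang: str = "eng", psm: int = 3) -> list[str]:
    """Build a Tesseract CLI call that writes recognised text to stdout."""
    # "stdout" as the output base makes Tesseract print instead of writing a file.
    # --psm 3 is Tesseract's default fully automatic page segmentation.
    return ["tesseract", image_path, "stdout", "-l", lang, "--psm", str(psm)]


def ocr_page(image_path: str, lang: str = "eng", psm: int = 3) -> str:
    """Run Tesseract on one (already flattened) page image and return the text."""
    result = subprocess.run(
        build_tesseract_cmd(image_path, lang, psm),
        capture_output=True,
        text=True,
        check=True,  # raise CalledProcessError if Tesseract exits with an error
    )
    return result.stdout


if __name__ == "__main__":
    # Hypothetical file name; point this at your own unwarped page image.
    print(ocr_page("page_flat.png"))
```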


Graphics or any kind of symbols. I tried this with an out of print edition of an economics textbook from the 70s: the scans were fine but I never found any OCR that could even semi-faithfully reproduce the equations, nor anything in a table.

It occurs to me that a place where scans and first-pass OCRs of this stuff are shared with the intention of version controlling the resultant epub would be handy, such that small PRs could be applied cumulatively and the result be versioned (and therefore referenced). That's considerable work and legal risk, though.


That's not too far removed from what Distributed Proofreaders did/does.



