You can scan a 500-page book in 20 minutes with your phone, non-destructively. Th...

suddenclarity · on April 9, 2023

I think that's ambitious. I've scanned a handful of books, and although I don't know the page counts, they've always been multi-hour projects. The editing afterwards, marking the text on each page and adjusting for warp, tends to be especially time-consuming. I even had a mounted tripod and a remote to standardise the shots.
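For scale, here is a rough sketch of the time budget the headline claim implies. It only covers taking the photos, not the editing, and the two-page-spread case is my own assumption rather than something the title states. `seconds_per_shot` is a made-up helper name.

```python
def seconds_per_shot(pages: int, minutes: float, pages_per_shot: int = 1) -> float:
    """Average time available per photo to get through `pages` in `minutes`."""
    shots = pages / pages_per_shot  # number of photos needed
    return minutes * 60 / shots


# One page per photo: 500 photos in 20 minutes.
print(seconds_per_shot(500, 20))
# Assumed alternative: one photo per two-page spread, so 250 photos.
print(seconds_per_shot(500, 20, pages_per_shot=2))
```

That comes to about 2.4 seconds per photo for single pages, or about 4.8 seconds per spread, including turning the page.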

leobg · on April 9, 2023

Voice Dream Scanner on iOS recognizes pages and text and performs OCR. You just go snap snap snap.

For plain text, that is sufficient. If you also want to capture the layout or graphics, things get much more complicated. Even just getting headings requires post-processing. So yeah, producing a proper ebook and just extracting the text are two different problems.

layer8 · on April 9, 2023

vFlat Scan [0] automates the unwarping.

[0] https://www.vflat.com/

tomodachi94 · on April 9, 2023

There's also https://scantailor.org/ (and a maintained fork at https://github.com/4lex4/scantailor-advanced ) which semi-automates unwarping and other corrective tasks in scanned books.

beej71 · on April 9, 2023

I just tried this with a thumb-held warped page and Tesseract for OCR--worked pretty well! (Tesseract by itself struggles with warped pages.)

hkt · on April 9, 2023

Graphics, or any kind of symbols. I tried this with an out-of-print edition of an economics textbook from the 70s: the scans were fine, but I never found any OCR that could even semi-faithfully reproduce the equations, or anything in a table.

It occurs to me that a place where scans and first-pass OCRs of this stuff are shared with the intention of version controlling the resultant epub would be handy, such that small PRs could be applied cumulatively and the result be versioned (and therefore referenced). That's considerable work and legal risk, though.
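To make the "small PRs" idea concrete, here is a minimal sketch of what one correction to a first-pass OCR could look like as a diff. It assumes the OCR output is kept as plain text lines. The sample sentences, the file name `chapter1.txt`, and the helper `make_patch` are all hypothetical; only Python's standard `difflib` is used.

```python
import difflib

# Hypothetical first-pass OCR output with a typical "rn" for "m" misread.
ocr_v1 = [
    "The rnarginal cost curve slopes upward.\n",
    "Demand is shown in Table 3.\n",
]
# The same passage after a proofreader's fix.
ocr_v2 = [
    "The marginal cost curve slopes upward.\n",
    "Demand is shown in Table 3.\n",
]


def make_patch(old, new, name="chapter1.txt"):
    """Return a unified diff between two versions of a text, as one string."""
    diff = difflib.unified_diff(old, new, fromfile=f"a/{name}", tofile=f"b/{name}")
    return "".join(diff)


patch = make_patch(ocr_v1, ocr_v2)
print(patch)
```

A one-line patch like this is easy to review on its own, which is what would let many small corrections accumulate into a versioned text.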

wiml · on April 9, 2023

That's not too far removed from what Distributed Proofreaders did/does.