
Best OCR scan app for IPA?


Where can I find "the" best OCR app or script for scanning morpholinguistic texts (etymologies)? What apps are universities using to OCR linguistic texts and lexicographies? And has anyone sorted out which IPA symbols, punctuation marks, diacritics, etc. systematically scan as errors?

None at all. OCR technologies on the market operate on the basis of a "language", and no language uses IPA. This is annoying, and I would be happy if someone knew the technology well enough to create a Unicode-scoped OCR routine.

I use OCR every day for a variety of languages, but never for IPA, for the reasons panini listed. I wouldn't trust it anyway: many symbols differ only slightly, and those differences vary from font to font. (It's hard enough for me to figure out what some symbols are supposed to be, especially when non-standard symbols are being used.)

If you want to take on this project seriously, though, you could look into customizing OCR software:

* ABBYY FineReader is one of the best on the market and works for a number of languages. The Windows (not Mac!) version also lets you specify your own character set, so that is in principle an option. However, it will always do better with a limited character set than if you throw in the whole IPA and hope for the best. Additionally, the quality of the OCR depends in part on the training data and, for many languages, on dictionaries (roughly, spellcheck), which would be lacking if you used a custom character set.
* Another option would be to program this yourself, which could include training data and any customization you'd like. A good starting point would be: https://github.com/tesseract-ocr/
* You could also try a DIY workaround where you find something close to accurate and then correct individual characters. Assuming that you have very consistent scans (that's crucial), you might find that the errors are consistent too. For example (this is probably not true), you might find that there's no "q" in your data but that ø is sometimes recognized as "q". You'd have to hope that most of the contrasts are represented differently (probably only the case for a relatively small set of possible IPA symbols in the data you're considering) and then basically do find-and-replace to fix the forms. That is, you would have to proofread a subset of the data to find consistent patterns, and then attempt to apply the corrections automatically to the rest. That might save some time, or it might just be messy, so it depends on your specific project. I can imagine it working out well enough, with some luck, for an IPA list of words in a single language (or very closely related languages), but probably not for a broader data set.

In short, I don't know how reliable this would be, given that the details are important and there is no easy way to verify that the results are correct. When reading English paragraphs you can note spelling errors or words that don't make sense in context, but unless you already know all of these transcriptions you wouldn't know whether they're right or wrong. In this case I'd advise typing them all out by hand. It would even take more time just to correct imperfect OCR! My guess is also that very few data sets are so long that solving IPA OCR would actually save you time over just typing them out by hand.
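
For what it's worth, the proofread-a-subset-then-replace workaround from the last bullet could be sketched roughly like this in Python. This is only an illustration under strong assumptions: the function names (`learn_substitutions`, `apply_substitutions`) and the sample words are made up, errors are assumed to be strictly one character for one character, and a substitution is kept only if the proofread sample shows it being wrong in exactly the same way every time.

```python
from collections import Counter, defaultdict


def learn_substitutions(pairs, min_count=2):
    """Learn one-to-one character fixes from (ocr_text, corrected_text) pairs.

    Assumption: OCR errors are single-character swaps, so each pair has the
    same length. Pairs of different length are skipped, because this sketch
    cannot align inserted or dropped characters (e.g. lost diacritics).
    """
    seen = defaultdict(Counter)
    for ocr, gold in pairs:
        if len(ocr) != len(gold):
            continue
        for o, g in zip(ocr, gold):
            seen[o][g] += 1

    table = {}
    for o, targets in seen.items():
        # Keep a fix only if this OCR character always maps to the same
        # corrected character; ambiguous cases are left for manual review.
        if len(targets) == 1:
            (g, n), = targets.items()
            if g != o and n >= min_count:
                table[o] = g
    return table


def apply_substitutions(text, table):
    """Apply the learned single-character replacements to unproofread text."""
    return text.translate(str.maketrans(table))


# Hypothetical proofread sample: "q" never occurs in the real data,
# but the OCR engine keeps producing it where the scan has "ø".
proofread = [("bqn", "bøn"), ("sqt", "søt"), ("lqs", "løs")]
table = learn_substitutions(proofread)
print(table)                              # {'q': 'ø'}
print(apply_substitutions("dqd", table))  # død
```

Anything the sample shows as sometimes right and sometimes wrong is deliberately left out of the table, so the result would still need spot-checking by hand.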

