Linguist Forum

Specializations => Computational Linguistics => Topic started by: ABBY on August 14, 2015, 11:49:28 PM

Title: Statistical sentence suggestion model like spell checking
Post by: ABBY on August 14, 2015, 11:49:28 PM

There are already spell-checking models that suggest correct spellings based on a corpus of correctly spelled words. Can the granularity be increased from letters to words, so that we get phrase suggestions as well? That is, if an incorrect phrase is entered, the model should suggest the nearest correct phrase from a corpus of correct phrases, having of course been trained on a list of valid phrases.

Are there any Python libraries that already provide this functionality? If not, how should I proceed, given an existing large gold-standard phrase corpus, to get statistically relevant suggestions?

Note: this is different from a spell checker, since a spell checker's alphabet is finite, whereas in a phrase corrector each "letter" of the alphabet is itself a word, so the alphabet is theoretically infinite. We can, however, limit the number of words by drawing them from a phrase bank.
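
To make the idea concrete, here is a minimal sketch of "spell checking with words as letters": an edit distance computed over whole words, used to pick the nearest phrase from a phrase bank. The function names and the three-phrase bank are invented for illustration, and a real corpus would need something faster than a linear scan.

```python
def word_edit_distance(a, b):
    """Levenshtein distance where the unit is a whole word, not a letter."""
    a, b = a.lower().split(), b.lower().split()
    prev = list(range(len(b) + 1))  # distance from empty phrase to b[:j]
    for i, word_a in enumerate(a, start=1):
        curr = [i]  # distance from a[:i] to the empty phrase
        for j, word_b in enumerate(b, start=1):
            cost = 0 if word_a == word_b else 1
            curr.append(min(prev[j] + 1,          # delete a word
                            curr[j - 1] + 1,      # insert a word
                            prev[j - 1] + cost))  # substitute a word
        prev = curr
    return prev[-1]


def suggest_phrase(query, phrase_bank):
    """Return the phrase in the bank with the smallest word-level distance."""
    return min(phrase_bank, key=lambda p: word_edit_distance(query, p))


# Hypothetical phrase bank, standing in for a gold-standard corpus.
PHRASE_BANK = ["thank you very much", "see you later", "how are you doing"]

print(suggest_phrase("thank you much", PHRASE_BANK))  # thank you very much
```

This treats every word as an atomic symbol, so a misspelled word counts as a full substitution; combining it with an ordinary letter-level spell checker would be a natural next step.
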
Title: Re: Statistical sentence suggestion model like spell checking
Post by: Daniel on August 15, 2015, 01:43:05 PM
A background assumption in linguistics is that language consists of a finite lexicon and infinite combinations. So it makes sense to have a word-dictionary, but a phrase-dictionary is not as easily applicable. You could come up with a list of the most common phrases, but that would then require some knowledge about syntactic structure (why do we have "red book" but not "the red"?).

The standard approach to this is to use N-grams, usually bi-grams or tri-grams, where the words next to a given word are taken into account. This is in fact used by spellcheckers and grammar checkers, at least in some programs. Your phone's spelling corrector will be based not only on whether the word is spelled correctly but also based on whether it is near other words that are often found with it. If a collocation (two+ words near each other that are of high probability) is identified, that might replace the one you're currently typing, rather than a correctly spelled word that does not fit in context.
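
A minimal sketch of that context effect, using only the Python standard library: bigram counts from a toy corpus decide which candidate word best follows the previous word. The corpus, the candidate list, and the function names are made up for illustration; a real corrector would combine this with a spelling-similarity score.

```python
from collections import Counter


def bigram_counts(sentences):
    """Count adjacent word pairs across a list of sentences."""
    counts = Counter()
    for sentence in sentences:
        words = sentence.lower().split()
        counts.update(zip(words, words[1:]))
    return counts


def pick_by_context(prev_word, candidates, counts):
    """Choose the candidate most often seen right after prev_word."""
    return max(candidates, key=lambda w: counts[(prev_word, w)])


# Toy corpus (an assumption, standing in for a large training corpus).
TOY_CORPUS = [
    "i read the red book",
    "she bought a red book",
    "he saw a red bike",
    "the red book fell",
]

counts = bigram_counts(TOY_CORPUS)

# "boot" is a correctly spelled word, but it never follows "red" here.
print(pick_by_context("red", ["boot", "book", "bike"], counts))  # book
```
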

More generally, N-gram models are used in Machine Translation as the "language model": after the translation model actually translates the content from the source language to the target language, the language model then cleans up the output so that it sounds more like natural language in the target language, (mostly) independently of the translation process itself and the source language.

So as a starting point, you can look up information about "N-grams" and work from there. You'll easily find a lot of information. Note that this method does not take syntactic structure into account in any way. "The red" and "red book" are (potentially) equally likely N-grams (="phrases"), even though one corresponds to a constituent in syntactic theory and the other does not. In this way it's somewhat naive, but in the end it works out quite well.
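
As a rough illustration of that point, the sketch below extracts N-grams by sliding a window over the words, with no notion of constituents, so "the red" and "red book" are both counted simply because the words are adjacent. The `ngrams` helper and the one-line corpus are assumptions for the example (NLTK offers a similar utility, but this version uses only the standard library).

```python
from collections import Counter


def ngrams(words, n):
    """Return all runs of n adjacent words as tuples."""
    return list(zip(*(words[i:] for i in range(n))))


# Tiny made-up corpus; any tokenized text would do.
corpus = "the red book is on the red table near the red book".split()

unigrams = Counter(corpus)
bigrams = Counter(ngrams(corpus, 2))

# Both pairs are ordinary bigrams to the model, constituent or not.
print(bigrams[("the", "red")])   # 3
print(bigrams[("red", "book")])  # 2

# Estimated probability of "book" given the previous word "red".
print(bigrams[("red", "book")] / unigrams["red"])
```
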

I don't have a specific answer to how you can implement this in Python, but the first thing that comes to mind is NLTK:
http://www.nltk.org/

(I am an active user of this type of technology and have some information about the theory behind it but I don't actually program it myself, not beyond a few very basic experiments.)

To see how the idea of N-grams works (and where it comes from), take a look at this paper:
http://www.mt-archive.info/50/SciAm-1949-Weaver.pdf
What's especially interesting is the chart on the last page, which the whole paper builds up to. The chart shows what looking at language as adjacency-correspondences or collocations (=N-grams) does, in contrast to using a dictionary, even regarding spelling. (Note that some spell-checkers also implement letter-based N-grams, such as in Optical Character Recognition [OCR] software that tries to convert a scanned image back into text, knowing that, for example, a vowel may be more common after a consonant than after a vowel, allowing it to better guess even unknown words like people's names.)
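
The letter-based variant can be sketched the same way. The example below trains letter-bigram counts on a handful of names and uses them to prefer a plausible reading of an unknown name over an implausible one. The name list, the add-one smoothing, and the assumed alphabet size of 28 (26 letters plus start and end markers) are choices made for this illustration, not how any particular OCR package works.

```python
import math
from collections import Counter


def letter_bigrams(word):
    """Adjacent letter pairs, with ^ and $ marking word start and end."""
    padded = "^" + word.lower() + "$"
    return list(zip(padded, padded[1:]))


def train(words):
    """Count letter pairs and how often each letter starts a pair."""
    pair_counts, first_counts = Counter(), Counter()
    for word in words:
        for a, b in letter_bigrams(word):
            pair_counts[(a, b)] += 1
            first_counts[a] += 1
    return pair_counts, first_counts


def log_score(word, pair_counts, first_counts, alphabet_size=28):
    """Sum of add-one-smoothed log probabilities of each letter pair."""
    return sum(
        math.log((pair_counts[(a, b)] + 1) / (first_counts[a] + alphabet_size))
        for a, b in letter_bigrams(word)
    )


# Hypothetical training words; "marika" itself is not among them.
NAMES = ["maria", "martin", "marina", "karina", "mario", "tamara"]
pair_counts, first_counts = train(NAMES)

# Two candidate readings of a smudged, unknown name.
candidates = ["marika", "mqrika"]
best = max(candidates, key=lambda w: log_score(w, pair_counts, first_counts))
print(best)  # marika
```

Neither candidate is in the word list, yet the model still prefers "marika" because "ma" and "ar" are common letter pairs in the training names while "mq" and "qr" never occur.
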