Linguist Forum
Specializations => Computational Linguistics => Topic started by: Nemi on August 17, 2018, 10:42:44 AM

Dear all,
I am currently working on a research project involving a Twitter corpus. I'd like to analyze the emotional tendency of tweets containing specific keywords. Right now I'm compiling lists of frequent collocations, and here is my question:
I compiled frequency lists of collocates for my keywords, which are fine. However, log-likelihood scores give me much more interesting results, to be honest. They surface more hashtags and emotional adjectives, whereas plain frequency (obviously) lists "and" and "I" at the top.
As I am a noob regarding statistics, I'm not sure whether I can use the log-likelihood score to analyze a subcorpus without comparing it to something else. (I know that the MI score has shortcomings, as it ranks less common words too highly; that's why I ruled it out.) I have the feeling that the log-likelihood results would be a pitfall, since I am not measuring against the whole corpus.
When analyzing collocations, if I just want to know what people mostly type around a keyword (not in comparison to anything, not even the whole corpus itself!), is it sufficient just to go by frequency, or would a corpus linguist cringe? The scores measure probability, but since my subcorpus already targets my keyword, is the score even applicable?
Best regards
Nemi

I can't answer this from the technical perspective of a computational linguist, but I'll try to comment briefly.
Statistically:
When we use a statistical test, we are usually comparing two things: we want to show (roughly) that one is significantly more likely than another. There are much more complex statistical tests, but they all boil down to that core. This means you can either compare two values to see whether the difference between them is significant, or check whether one value differs significantly from an expected value (e.g., 0, as when an unexpected event is observed at all, or 50%, as in a coin toss). Since the things you will be comparing are similar, the comparison is usually unitless, so in that sense a log transform may not be a problem at all. It might, of course, interfere with the mathematics of a particular statistical test, so beware of that. But in principle this may be fine, if you pick an appropriate test. (Note that your intuitions about statistical distributions are not evidence, so you shouldn't use logs just because you think the results look better; rely instead on a statistical test to tell you whether the results are 'interesting', that is, significant in the technical sense.)
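To make the "observed vs. expected" idea concrete, here is a minimal sketch of the log-likelihood (G²) statistic as it is typically used for corpus comparison (Dunning 1993): a word's observed frequency in a subcorpus is compared against what would be expected if the word were equally likely in a reference corpus. All counts below are invented for illustration.

```python
import math

def log_likelihood(freq_a, size_a, freq_b, size_b):
    """Log-likelihood (G2) score for a word occurring freq_a times in a
    corpus of size_a tokens vs. freq_b times in a corpus of size_b tokens."""
    # Expected frequencies under the null hypothesis that the word
    # is equally likely in both corpora.
    total = size_a + size_b
    expected_a = size_a * (freq_a + freq_b) / total
    expected_b = size_b * (freq_a + freq_b) / total
    g2 = 0.0
    for obs, exp in ((freq_a, expected_a), (freq_b, expected_b)):
        if obs > 0:  # by convention, 0 * log(0) contributes nothing
            g2 += obs * math.log(obs / exp)
    return 2 * g2

# Toy numbers: suppose "angry" appears 40 times in a 10,000-token
# subcorpus and 25 times in a 100,000-token reference corpus.
score = log_likelihood(40, 10_000, 25, 100_000)
print(round(score, 2))
```

The point for the original question is visible in the signature: the statistic needs two corpora (or a corpus and a reference) to define the expected frequencies, which is exactly why it is hard to interpret on a subcorpus alone.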
Regarding frequency and log likelihood for words in a corpus, you probably already know about Zipf's law: https://en.wikipedia.org/wiki/Zipf%27s_law
But that partially addresses what you're observing: the relationship between a word's rank and its frequency is typically linear on a log-log scale. So there may be some sound motivation behind your decision.
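For what it's worth, here is a small sketch of what "linear on a log-log scale" means. The frequencies below are idealized (f = C / rank), so the least-squares fit recovers the Zipfian slope of exactly -1; real corpus counts would only approximate this.

```python
import math

# Idealized Zipfian frequencies for the top 50 ranks: f = 1000 / rank.
ranks = range(1, 51)
freqs = [1000 / rank for rank in ranks]

log_rank = [math.log(r) for r in ranks]
log_freq = [math.log(f) for f in freqs]

# Ordinary least-squares slope of log(freq) against log(rank);
# Zipf's law predicts a slope near -1.
n = len(log_rank)
mean_x = sum(log_rank) / n
mean_y = sum(log_freq) / n
slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(log_rank, log_freq))
         / sum((x - mean_x) ** 2 for x in log_rank))
print(round(slope, 2))  # -1.0
```

With real data you would replace the idealized list with your own rank-ordered frequency counts and see how close the fitted slope comes to -1.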
The best answer is to consult current sources (e.g., journal articles) doing something like what you're trying to do, and then use the same (or slightly adjusted) methodology for your project. That's how you're most likely to get published, at least. If you don't have a good reason for doing something else, that's where to start.
There are various approaches to dealing with frequencies in corpora (that's really the whole field of corpus linguistics!), and there are a number of ways to try to standardize values, find a balance between frequent words and frequent pairings to look at interesting patterns of attraction (e.g. https://en.wikipedia.org/wiki/Collostructional_analysis), etc. Don't reinvent the wheel if there's already a method out there, and that way your results will be comparable anyway.
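As one illustration of how different association measures rank collocates differently, here is a hypothetical sketch that scores the neighbors of a keyword both by raw co-occurrence frequency and by pointwise mutual information (PMI). The tiny "tweets" are invented, and the PMI normalization is deliberately rough (co-occurrence counts divided by total tokens, ignoring window overlap), so treat this as a toy, not a reference implementation.

```python
import math
from collections import Counter

# Invented toy data for illustration only.
tweets = [
    "i love this movie so much",
    "i love my dog",
    "this movie was terrible",
    "love wins and love heals",
    "my dog loves this movie",
]

tokens = [w for t in tweets for w in t.split()]
word_freq = Counter(tokens)
total = len(tokens)

keyword = "love"
window = 2  # collocates within 2 words on either side

# Count co-occurrences of each word with the keyword inside the window.
cooc = Counter()
for t in tweets:
    words = t.split()
    for i, w in enumerate(words):
        if w == keyword:
            for j in range(max(0, i - window), min(len(words), i + window + 1)):
                if j != i:
                    cooc[words[j]] += 1

def pmi(collocate):
    # Rough PMI: log2 of observed vs. independence-expected co-occurrence.
    p_xy = cooc[collocate] / total
    p_x = word_freq[keyword] / total
    p_y = word_freq[collocate] / total
    return math.log2(p_xy / (p_x * p_y))

by_freq = cooc.most_common(3)
by_pmi = sorted(cooc, key=pmi, reverse=True)[:3]
print("top by frequency:", by_freq)
print("top by PMI:", by_pmi)
```

Even on this toy data the two rankings diverge: frequency favors common function words like "i", while PMI favors words that occur almost exclusively near the keyword, which is exactly the rare-word bias mentioned earlier in the thread. Established measures (log-likelihood, t-score, collostructional strength) each strike this balance differently.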

Daniel, thanks a lot for your comment! :) You're right, I will search more for already existing measures. I wish I had more mathematical background; maybe it's also time for me to read more about statistics in general. I was too tempted to use log-likelihood just because the results were amazing, but the more I think about it, the more I'm convinced that it is not statistically valid here. It would be if I used the whole corpus; maybe I will look into that.
Thanks!

Your hesitation is appropriate.
However, what is interesting about Zipf's law is that it applies regardless of the particular data set you're looking at (though only approximately on smaller data sets). So it's possible there are reasons for using it even on small data sets in your approach. But yes, look into existing research, and make sure there is a principled reason for approaching it that way.