Recent Posts

Computational Linguistics / Re: Frequency and log-likelihood applied to a subcorpus
« Last post by Daniel on August 17, 2018, 11:39:58 AM »
Your hesitation is appropriate.

However, what is interesting about Zipf's Law is that it applies regardless of the particular data set you're looking at. (Obviously more approximately on smaller data sets.) So, it's possible there could be reasons for using it even on small data sets in your approach. But yes, look into existing research, and make sure there is a principled reason for approaching it that way.
Computational Linguistics / Re: Frequency and log-likelihood applied to a subcorpus
« Last post by Nemi on August 17, 2018, 11:37:25 AM »
Daniel, thank you a lot for your comment!  :) You're right, will search more for already existing measures. I wish I had more mathematical background, maybe it's also time for me to read more about statistics in general. I was too tempted to use log just because the results were amazing, but the more I think about it, I'm convinced that it is not statistically relevant here. It would be, if I used the whole corpus, maybe I will look into that.
Computational Linguistics / Re: Frequency and log-likelihood applied to a subcorpus
« Last post by Daniel on August 17, 2018, 11:04:15 AM »
I can't answer this from the technical perspective of a computational linguist, but I'll try to comment briefly.

When we use a statistical test, we are usually comparing two things. We want to show (roughly) that one is significantly more likely than another. There are much more complex statistical tests, but they all boil down to basically that core. This means that you can either compare two values to see whether one is bigger than the other (whether the difference between them is significant), or test whether one value is significantly different from an expected value (e.g., 0, for the absence of an event or the observation of an unexpected one; 50%, as in a coin toss; etc.). Since the things you will be comparing will be similar, the comparison is usually unit-less. So in that sense, a log-transform may not be a problem at all. Of course it might mess up the statistical test mathematically, so beware of that. But in principle this may be fine, if you pick an appropriate statistical test. (Note that your intuitions about statistical distributions are irrelevant, so you shouldn't use logs just because you think they look better. Rely instead on a statistical test to tell you whether the results are 'interesting', that is, significant, in a technical sense.)
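To make the "compare a value to an expected value" case concrete, here is a minimal sketch of an exact two-sided binomial test, using the coin toss (50%) as the expected value. The function name `binomial_two_sided_p` and the counts in the usage lines are made up for illustration; this is one simple test among many, not a recommendation for any particular corpus study.

```python
from math import comb


def binomial_two_sided_p(successes, trials, expected=0.5):
    """Exact two-sided binomial p-value.

    Sums the probability of every outcome that is no more likely than the
    observed one, assuming each trial succeeds with probability `expected`.
    """
    def pmf(k):
        # Probability of exactly k successes in `trials` trials.
        return comb(trials, k) * expected**k * (1 - expected) ** (trials - k)

    observed = pmf(successes)
    # Small relative tolerance so floating-point ties count as "equally likely".
    tail = sum(pmf(k) for k in range(trials + 1) if pmf(k) <= observed * (1 + 1e-9))
    return min(1.0, tail)


# Hypothetical counts: 60 heads vs. 70 heads in 100 tosses of a supposedly fair coin.
print(binomial_two_sided_p(60, 100))
print(binomial_two_sided_p(70, 100))
```

A small p-value says the observed count would be surprising if the expected value were true; it says nothing about whether the numbers "look" interesting.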

Regarding frequency and log likelihood for words in a corpus, you probably already know about Zipf's law.
That partially addresses what you're observing: the relationship between a word's frequency and its frequency rank is typically linear on a logarithmic scale. So there may be some sound motivation behind your decision.
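A rough way to check this on your own data is to rank words by frequency and fit a straight line to log(frequency) against log(rank). The sketch below assumes a plain list of tokens; the helper names `rank_frequency` and `loglog_slope` are invented for this example, and a simple least-squares fit is only a quick look, not a rigorous test of Zipf's law.

```python
import math
from collections import Counter


def rank_frequency(tokens):
    """Return (rank, word, frequency) triples, most frequent word first."""
    counts = Counter(tokens)
    return [
        (rank, word, freq)
        for rank, (word, freq) in enumerate(counts.most_common(), start=1)
    ]


def loglog_slope(frequencies):
    """Least-squares slope of log(frequency) against log(rank).

    `frequencies` must be sorted from most to least frequent and needs at
    least two entries. A slope near -1 is the classic Zipfian pattern.
    """
    xs = [math.log(rank) for rank in range(1, len(frequencies) + 1)]
    ys = [math.log(freq) for freq in frequencies]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    numerator = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    denominator = sum((x - mean_x) ** 2 for x in xs)
    return numerator / denominator


# Toy token list, purely illustrative; a real check needs a real corpus.
tokens = "the cat saw the dog and the dog saw the cat".split()
table = rank_frequency(tokens)
print(table)
print(loglog_slope([freq for _, _, freq in table]))
```

On small data sets the fit will be noisy, so treat the slope as a rough indication only.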

The best answer is to consult current sources (e.g., journal articles) doing something like what you're trying to do, and then using the same (or slightly adjusted) methodology for your project. That's how you're likely to get published, at least. If you don't have a good reason for doing something else, that's where to start.

There are various approaches to dealing with frequencies in corpora (that's really the whole field of corpus linguistics!), and there are a number of ways to try to standardize values, find a balance between frequent words and frequent pairings to look at interesting patterns of attraction, etc. Don't reinvent the wheel if there's already a method out there, and that way your results will be comparable anyway.
Linguist's Lounge / Re: Introduction Thread
« Last post by Nemi on August 17, 2018, 10:50:25 AM »
Hi to all! My name is Nemi and I just started to dive into Corpus Linguistics. Currently I'm doing some research on Twitter. As I am pretty new to all this (my former field was philosophy), I have a lot of questions and was very happy when I discovered this forum. I hope my questions are not too basic, and I apologize in advance if they are :) Thank you for reading and nice to meet all of you!
Computational Linguistics / Frequency and log-likelihood applied to a subcorpus
« Last post by Nemi on August 17, 2018, 10:42:44 AM »
Dear all,

I am currently involved in a research project regarding a Twitter Corpus. I'd like to analyze the emotional tendency of tweets with specific keywords. Right now I'm compiling lists of frequent collocations and here is my question:

I compiled frequency lists of collocations for my keywords, which are fine. However, log-likelihood scores give me much more interesting results, to be honest. They contain more hashtags and emotional adjectives, whereas frequency is (obviously) listing "and" and "I" at the beginning.
As I am a noob regarding statistics, however, I'm not sure if I can use the log-likelihood score to analyze a subcorpus without comparing it to something else. (I know that the MI-score has shortcomings, as it ranks less common words more highly. That's why I ruled it out.) I have the feeling that the results for log-likelihood would be a pitfall, since it is not measuring the whole corpus.

When analyzing collocations, if I just want to know what people mostly type around a keyword, not in comparison to anything (not even the whole corpus itself!), is it sufficient just to go by frequency or would a corpus linguist cringe? The scores measure probability, but since I already have my subcorpus targeting my keyword, is the score even applicable?
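For reference, the contrast between raw frequency and log-likelihood can be sketched as follows. This is the common 2x2 contingency-table form of the log-likelihood (G2) score, written from scratch; the function name `log_likelihood` and all counts are hypothetical, and a given corpus tool may compute its score differently. Note that the function needs counts from outside the keyword window, which is exactly the comparison the question is about.

```python
import math


def log_likelihood(near, near_total, rest, rest_total):
    """G2 score for one candidate collocate.

    near       -- occurrences of the word close to the keyword
    near_total -- all tokens close to the keyword
    rest       -- occurrences of the word everywhere else
    rest_total -- all tokens everywhere else
    """
    table = [[near, near_total - near], [rest, rest_total - rest]]
    row_sums = [near_total, rest_total]
    col_sums = [near + rest, (near_total - near) + (rest_total - rest)]
    total = near_total + rest_total

    g2 = 0.0
    for i in range(2):
        for j in range(2):
            observed = table[i][j]
            # Count expected if the word were spread evenly over both parts.
            expected = row_sums[i] * col_sums[j] / total
            if observed > 0:  # a zero cell contributes nothing to the sum
                g2 += observed * math.log(observed / expected)
    return 2 * g2


# Made-up counts: "and" is frequent everywhere, "#happy" clusters near the keyword.
print(log_likelihood(500, 10_000, 50_000, 1_000_000))  # "and"
print(log_likelihood(40, 10_000, 100, 1_000_000))      # "#happy"
```

With these invented numbers, "and" wins on raw frequency (500 vs. 40) but scores about zero on log-likelihood, because it is no more common near the keyword than anywhere else.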

Best regards
Linguist's Lounge / Re: Is Burushaski, at its core, an Indo-European language?
« Last post by Daniel on August 17, 2018, 09:06:33 AM »
That's a fringe argument. Similar ideas have been proposed for decades, and they just aren't conclusive, because if there is such a relationship, it's too distant to demonstrate beyond a reasonable doubt. I'm not particularly opposed to the idea, but just about any theory of Burushaski's relationships is as good as any other at this point. So all you get from papers like that is some limited evidence in favor of one argument, and it's really just "circumstantial" (in the legal sense): it doesn't show the relationship beyond a reasonable doubt. It would merely be compatible with the explanation and, if the explanation is correct, is probably a remaining trace of the original relationship.

In the end, questions like this are interesting, but there's a reason they haven't been answered. Of course Burushaski is related to something-- that should surprise no one. But we haven't yet been able to show which living language(s) that would be.

There are three reasons to remain skeptical:
1. If this had been shown beyond a reasonable doubt, it would be big news. The fact that linguists have not reached consensus is telling.
2. Indo-European is an easy and lazy possibility. Given the extreme time depth (something like 10,000 years or more?), there are many, many other viable possibilities, and the Indo-European-centrism is just an artifact of the sociology of the field. It's no more likely than any other family to be related to Burushaski, but there has been a huge amount of research trying to link those up, so in a sense this is almost evidence against that particular possibility. It might be right, but why not also look just as hard at a possible connection to Tungusic or whatever other families haven't gotten that much attention. In the end, if you look hard enough for patterns, you'll find something that looks like a pattern, but that doesn't mean it's really evidence, especially when it's weak.
3. Clear evidence of relationships comes from widespread, systematic correspondences in languages. Pointing out individual features (e.g., pronominal paradigms) that happen to look similar in two languages leaves open the very real possibility of coincidence, or even borrowing. When we see one similarity, but an absence of other corresponding similarities, we should be skeptical. There is a reasonable argument for an ancient relationship between Indo-European and Uralic based on some pronominal forms, for example, and while I wouldn't completely reject the possibility, I'll remain skeptical until we know more.

It's good to think about these issues in terms of two subtly but importantly different questions:
1. What is our best guess at the moment?
2. Should we accept that guess as a probable fact?

There's a "why not" argument for thinking Burushaski might be related to Indo-European, and that's not entirely unreasonable. Maybe! But a "why not" or "best guess" argument is no reason to assume that the idea is correct or that the issue is resolved.

The problem is that many people looking at these issues want answers, rather than more questions or just interesting discussions. And all we have most of the time is complicated details, not conclusions.

To frame this from a slightly different perspective, consider the macro-family theories such as Eurasiatic or Nostratic. On the one hand, the current iterations of those theories are probably wrong and do not have enough evidence to back them up. However, I personally like them in the sense of giving me a vague intuitive idea of what the past might have been like. So I think that something like that (I don't even mind calling it "Eurasiatic"), in a very vague sense (that is, plus or minus several language families, unknown at this point), is probably a reasonable understanding. But I am not saying in any sense either that (1) "Eurasiatic" as a narrow hypothesis is correct, or that (2) we have enough evidence to reject alternative explanations.

In short, the quality of explanation corresponds to the availability of data. There's nothing fundamentally wrong with having a working understanding of a problem as your current best guess, but there is something wrong with taking that to the next step. It's the difference between "maybe" or "I wonder", and "Scientists have discovered that Burushaski is Indo-European!"

It's fine to be interested in these questions, but it comes with the risk of never finding definite answers.

So, what do I think? I think I don't know. Actually, I know I don't know. And at this point for the case of Burushaski, the evidence is too weak to even be leaning one way or another as a working hypothesis. It's related to something, surely, and at some time depth (maybe very extreme, even undetectably so), but Indo-European is not really a better explanation, given available data, than anything else, at least not by much of a margin. The most likely alternative explanation, given even only the data in that paper, would be ancient contact between the families. So, at this point, we don't know. That explanation might turn out to eventually be correct, but I wouldn't bet on it yet.
Linguist's Lounge / Is Burushaski, at its core, an Indo-European language?
« Last post by Voynichologist on August 17, 2018, 05:50:27 AM »
So, guys, what do you think about the thesis that Burushaski is, at its core, an Indo-European language?
Semantics and Pragmatics / Re: Help me pick the right words
« Last post by Paul Basileus on August 16, 2018, 07:11:04 PM »
Well, many thanks to you, I'll follow your advice
Semantics and Pragmatics / Re: Help me pick the right words
« Last post by Daniel on August 16, 2018, 07:03:55 PM »
This is a forum for linguistics, which is the scientific study of language, not about language learning or proofreading. There are many English learning or English usage forums on the internet. In this case, maybe your best feedback would just come from asking (potential) players, on a gaming forum.

As for the question, I don't think it matters so much because players will get used to the names, which will have specific meanings in the game anyway.

As for a little semantic analysis, the words you have picked do seem a little odd to me with "quality" in the middle. Try listing them out to think about semantic sets of properties. You have old/modern, and rare/common as clear pairs of semantic sets, but the medium category doesn't fit. This isn't really a question of translation, but meaning. Once you know the right meaning you can find the right word. But again, a gaming forum is probably the better place for this discussion.
Semantics and Pragmatics / Help me pick the right words
« Last post by Paul Basileus on August 16, 2018, 06:25:15 PM »
Hello everyone! Well, I've made a game called Antiquaria (a hidden object game whose plot revolves around antiques) and there are a few thousand items in it. Their names are meant to begin with adjectives that must:
1. give players a hint about the items’ price
2. make players understand that there are 3 groups of items that differ in price
3. sound good (that’s why we don’t use such words as cheap and broken)
What do you think, are the adjectives below well chosen? Propose your ideas, we will be glad to get your suggestions.
1. High price:
2. Medium price: Quality
3. Low price:

I need the help of native English speakers to find out whether the words I chose are right

P.S.: I'm a newbie here and don't know if I chose the right place on this forum for this post