Cluster Analysis for Questionnaire data


My wife is writing her Master's thesis and needs to perform cluster analysis on her data. The data is basically a list of questionnaire responses. There are 7 questions, and each question may be answered with a discrete value from 0 to 4. In other words, there will be one 7D vector per questionnaire participant, and each dimension of the vector is an integer between 0 and 4. The distance metric between two vectors is the following:
d(v1, v2) = Sum over i in [0, 6] of [v1(i) == v2(i) ? 0 : 1], or put simply, 1 unit per differing dimension.
I am a computer scientist and I am familiar with clustering methods, but I am not sure what kind of cluster analysis is commonly used in linguistics for this kind of data. What method would you recommend?
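
To make the metric above concrete: it is the Hamming distance between two response vectors. Here is a minimal Python sketch of it. The function names (`hamming_distance`, `pairwise_distances`) and the three sample responses are made up for illustration and are not taken from the actual thesis data.

```python
from itertools import combinations


def hamming_distance(v1, v2):
    """Count the positions at which two equal-length response vectors differ."""
    if len(v1) != len(v2):
        raise ValueError("response vectors must have the same length")
    return sum(1 for a, b in zip(v1, v2) if a != b)


def pairwise_distances(responses):
    """Return {(i, j): distance} for every pair of participants with i < j."""
    return {
        (i, j): hamming_distance(responses[i], responses[j])
        for i, j in combinations(range(len(responses)), 2)
    }


# Hypothetical data: 3 participants, 7 questions, answers in 0..4.
responses = [
    (0, 1, 2, 3, 4, 0, 1),
    (0, 1, 2, 3, 4, 0, 2),
    (4, 3, 2, 1, 0, 4, 3),
]
distances = pairwise_distances(responses)
```

A table of pairwise distances like this could then be handed to any clustering method that accepts precomputed dissimilarities, such as agglomerative hierarchical clustering or k-medoids. Which of those is customary in linguistics is exactly what I am unsure about.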

Thanks for any comment.

You should ask statisticians about this, because it's fairly high-level statistics compared to what is common in linguistics, and although some linguists might know the answer, you probably won't find them here.

However, I do know a bit about statistics, and I would just ask why she is trying to do it that way. What is the point in quantifying variation in the responses overall? Wouldn't it be more useful to look for particular patterns?

Best practice for significance testing is to have a single hypothesis in mind and then test for exactly that hypothesis, rather than fishing for any sort of (probably coincidental) patterns in the data. You're more likely to find noise (coincidences) if you look at the data too broadly or let a computer find patterns for you.
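
To put a rough number on that risk, here is a small Python sketch. It rests on simplifying assumptions of my own: every test uses a 0.05 significance level, the tests are independent (which comparisons between questionnaire items generally are not), and no real effect exists. The function name `familywise_error_rate` is just a label I chose.

```python
from math import comb


def familywise_error_rate(num_tests, alpha=0.05):
    """Probability of at least one false positive among independent tests.

    Assumes every null hypothesis is true and the tests are independent.
    """
    return 1.0 - (1.0 - alpha) ** num_tests


# With 7 questions there are C(7, 2) = 21 possible pairs of questions to compare.
num_pairs = comb(7, 2)
risk_single = familywise_error_rate(1)        # one planned test
risk_all_pairs = familywise_error_rate(num_pairs)  # testing every pair
```

Under those assumptions, one planned test has a 5% chance of a spurious "significant" result, while testing all 21 pairs pushes the chance of at least one spurious result to roughly two in three.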

There are some textbooks (and other resources) specifically for how to do statistics in linguistics. But first you need to figure out what question you're asking. Then you can figure out how to test it statistically.

I'm not sure what kind of background either of you has, but if you're not used to significance testing in general (for example, a t-test, an ANOVA, etc.), then you should probably start with an introductory class (or the equivalent, even if that means reading a textbook or just Wikipedia on your own). You can also find specific information in research papers that use a methodology similar to the project's. That's a fairly safe way to do it.

One approach linguists often use is mixed-effects models, where you have one main target variable (for example, the pronunciation of a certain sound) but also include in the model the other ways in which the individuals in your sample vary, so that correlations in that data don't cause problems. For example, if you have boys and girls, but the boys are ages 4-8 and the girls are ages 6-10, you could try to balance that out using a more complex model. Mixed-effects modeling is sort of like that (but read about the details beyond my oversimplified example here!), and it allows you to build a complex model.

My best advice for you would honestly be to start over entirely (you can keep the data!) and find a much simpler specific question (or several questions) to test statistically.

