
Corpus linguistic research on rare phenomena


For a research project, I will have to analyze a corpus of British Parliamentary debates for a rare phenomenon that cannot easily be found by searching for particular (POS) tags, words, etc. Does anyone have tips on how to construct the corpus (size-wise, taking into account that the phenomenon is rare but that I will probably have to identify it manually), or know of any articles dealing with this problem?

Semantics (and pragmatics)!

The primary limitation of any automated corpus searching is determining precise meanings of words in context. So if you can design a study where you look at a semantic issue in detail, it will be something you'd look at manually. One thing that can make something rare is if it applies only to one or several words.

There is actually quite a bit of research these days on small corpora, rare phenomena, etc., so you can find some if you search for it. But the details will vary based on the specific problem. (If you look beyond English, that will be a real challenge, of course. The same goes for dialects, historical data, and acquisition data; small corpora of that kind are something I'm researching myself.)

It's unclear to me from your post whether you're doing a class project, research for publication, etc., and whether you already have a topic in mind or if you are still deciding. The topic will determine some of how you search for it.

Thank you very much for your reply. I am doing research for publication. The phenomenon that I will study is criticism of metaphorically presented argumentative moves in British parliamentary debates. While there is no shortage of metaphorical expressions in the debates, criticism of metaphors is quite hard to find. So the corpus cannot be too small, because then chances are that I won't find anything (besides, the debates themselves are long texts, and I want to study a number of them). On the other hand, it will be difficult to process too much text, because I won't be able to search for criticism of metaphor (semi-)automatically... Do you think that establishing a list of relevant lexical items for the most relevant conceptual metaphors in politics, and only looking for those, might be a solution (although the downside will be that I might miss some very interesting cases)?

Interesting topic, and difficult project.

You actually have a very large corpus, even if you don't choose to use all of it. One way that some research is done is to just use a selection of results. This can be done using an automated search or a manual search. So you could simply read ALL of the transcripts until you find 10 or 100 or 1000 examples, and then stop. That wouldn't be comprehensive, but it would be effectively randomized, so it would still be representative (at least of the section of texts you looked at).

I can't think of any other obvious ways to get examples like this, because other search methods will probably be biased toward either one type of metaphor or one type of reply. (For example, you could look for instances of the word "metaphor" to find metalinguistic commentary, but that would limit your results and discussion to instances where the person replying was very direct and even technical in the commentary. Or you could try to find one speaker who used metaphors often and look for examples just from their speech, but that is also not representative, unless the speaker is interesting in their own right, for example a relevant historical figure.)

I think there are four reasonable approaches here:
1. Choose a small sample (either a sub-section of the corpus, or better just whatever it takes to reach a certain number of examples). [I have seen published research like this, even for phenomena that are easier to identify, such as a syntactic construction.]
2. Find any examples and then extrapolate until you find more similar examples. This would be fastest, but it wouldn't be very representative because of what you might miss.
3. Spend a LOT of time going through a LOT of data, or get funding to hire research assistants to do the same, etc.
4. Use a shortcut, such as political commentary or another source that helps you identify metaphors. Maybe there are some existing sources that document some instances. Still, it would be hard for these to be representative.

I'm assuming that (1) will be the best option, unless the others stand out to you as useful.
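
To make option (1) a bit more concrete, here is a rough Python sketch of "read until you reach a certain number of examples". It is only an illustration under assumptions of mine: the debates are read in a shuffled order (so the stopping point doesn't depend on how the archive happens to be sorted), and `count_examples` is a hypothetical stand-in for your own manual reading and annotation of one debate. The function name and the debate IDs are made up too.

```python
import random
from typing import Callable, Iterable, List, Tuple


def sample_until_quota(
    debate_ids: Iterable[str],
    count_examples: Callable[[str], int],
    quota: int,
    seed: int = 0,
) -> Tuple[List[str], int]:
    """Read debates in a shuffled order until `quota` examples are found.

    `count_examples` is a hypothetical callback: it stands for a human
    reading one debate and reporting how many relevant cases it contains.
    Returns the debates that were read and the number of examples found.
    """
    order = list(debate_ids)
    random.Random(seed).shuffle(order)  # fixed seed so the order can be reported

    read: List[str] = []
    found = 0
    for debate_id in order:
        read.append(debate_id)
        found += count_examples(debate_id)
        if found >= quota:
            break  # stop as soon as the target number of examples is reached
    return read, found
```

If you record which debates were read, you can also report roughly how many examples turned up per debate, which gives a sense of how rare the phenomenon is in the part of the corpus you covered.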

Note, of course, that genre and context will change the rate of metaphor usage. So you can probably skip over large portions of the corpus that are more procedural or where there is little interaction between speakers, if there is an easy way to identify that. You could attempt to automate a search for passages where there is a medium-sized turn by one speaker followed by a reply from another speaker, within a more flexible part of the proceedings. And from there read each instance to see what comes up. But I don't know how much time that would really save you.
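
For what it's worth, the turn-based filter could look something like the Python sketch below. It assumes the transcripts have already been split into speaker turns, which is not a given, and the word-count thresholds are arbitrary placeholders for "medium-sized" that you would have to tune. The `Turn` type and the function name are hypothetical. It only produces candidate passages to read; it does not identify metaphors or criticism.

```python
from typing import List, NamedTuple, Tuple


class Turn(NamedTuple):
    speaker: str
    text: str


def candidate_exchanges(
    turns: List[Turn],
    min_words: int = 50,   # placeholder lower bound for a "medium-sized" turn
    max_words: int = 400,  # placeholder upper bound
) -> List[Tuple[Turn, Turn]]:
    """Return (turn, reply) pairs worth reading manually.

    A pair qualifies when the first turn is medium-sized and the
    immediately following turn comes from a different speaker.
    """
    pairs = []
    for first, reply in zip(turns, turns[1:]):
        n_words = len(first.text.split())
        if first.speaker != reply.speaker and min_words <= n_words <= max_words:
            pairs.append((first, reply))
    return pairs
```

Whether this saves time depends on how many turn pairs survive the filter; if most of them do, you are back to reading nearly everything.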

