Establishing a thematically framed corpus: how?
Hi fellow linguists,
Say I want to build a corpus of about 200 headlines about global warming, taken from various popular science magazines (New Scientist, MIT Technology Review, etc.). Which collection method do you recommend, besides the tedious option of collecting them by hand? Would you recommend using a web crawler?
Any help is welcome!
200? Just collect them by hand. You won't save any time writing code to do it for you.
200,000? Then you would save time.
Headlines are also much easier to collect than full-text articles, and probably easier to deal with in terms of copyright and access issues.
Regardless, a "corpus of 200 headlines" is a very limited amount of data, at least compared to how the term "corpus" is usually used in research. So I'm not sure I fully understand the question.
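If you do end up needing far more than 200 headlines, here is a rough sketch of what the automated route could look like. It assumes the magazine publishes an RSS 2.0 feed (I have not checked whether the ones you mention do), and the keyword list, the function name `extract_headlines`, and the sample feed are all made up for illustration. It only parses feed text you already have; it does not download anything.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical keyword list; adjust it to the theme of your corpus.
KEYWORDS = ("global warming", "climate change")


def extract_headlines(rss_xml, keywords=KEYWORDS):
    """Return the item titles of an RSS 2.0 document that mention any keyword."""
    root = ET.fromstring(rss_xml)
    # Case-insensitive match on any of the keywords, taken literally.
    pattern = re.compile("|".join(re.escape(k) for k in keywords), re.IGNORECASE)
    headlines = []
    # Only <item><title> elements are headlines; the channel's own <title> is skipped.
    for title in root.iterfind("./channel/item/title"):
        text = (title.text or "").strip()
        if text and pattern.search(text):
            headlines.append(text)
    return headlines


# Invented sample feed standing in for a real one.
SAMPLE_FEED = """<rss version="2.0"><channel>
<title>Example feed</title>
<item><title>Global warming speeds up glacier loss</title></item>
<item><title>New battery chemistry announced</title></item>
<item><title>How climate change reshapes coastlines</title></item>
</channel></rss>"""

if __name__ == "__main__":
    for headline in extract_headlines(SAMPLE_FEED):
        print(headline)
```

For a real feed you would fetch the XML first (for example with `urllib.request.urlopen`) and pass the decoded text to the function. Keep in mind that feeds usually list only recent items, so this would not reach back into a magazine's archive, and you should check each site's terms of use before collecting anything automatically.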