Linguist Forum

Specializations => Computational Linguistics => Topic started by: Jess on May 01, 2018, 05:01:54 AM

Title: establishing a thematically-framed corpus: HOW?
Post by: Jess on May 01, 2018, 05:01:54 AM
hi fellow linguists,

Say I want to build a corpus of about 200 headlines about global warming, taken from various popular science magazines (New Scientist, MIT Technology Review, etc.). Which method of collection do you recommend, besides the tedious option of collecting them "by hand"? Would you recommend using a web crawler?

any help is welcome!
cheers, jess
Title: Re: establishing a thematically-framed corpus: HOW?
Post by: Daniel on May 01, 2018, 09:24:06 AM
200? Just collect them by hand. You won't save any time writing code to do it for you.
200,000? Then you would save time.

Headlines are also much easier to collect than full text articles, and probably easier to deal with in terms of copyright/access issues.

Regardless, a "corpus of 200 headlines" is a very limited amount of data, at least compared to how the term "corpus" is usually used in research. So I'm not sure I fully understand the question.
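
For the larger case, where writing code would pay off, here is a rough sketch of what automated headline collection might look like. It assumes the magazines expose RSS feeds, which is not confirmed anywhere in this thread; the sample feed, the keyword list, and the function name extract_headlines are all made up for illustration. It only shows the parse-and-filter step, using Python's standard library.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample feed, invented for illustration only.
# A real feed would be downloaded from a magazine's RSS URL.
SAMPLE_RSS = """<rss version="2.0">
  <channel>
    <title>Example Science Magazine</title>
    <item><title>Global warming is speeding up ice loss</title></item>
    <item><title>New battery design doubles phone life</title></item>
    <item><title>How climate change reshapes coastlines</title></item>
  </channel>
</rss>"""

# Assumed topic keywords; adjust to however the theme is defined.
KEYWORDS = ("global warming", "climate change")


def extract_headlines(rss_text, keywords=KEYWORDS):
    """Return the <item> titles in an RSS document that mention any keyword."""
    root = ET.fromstring(rss_text)
    headlines = []
    for item in root.iter("item"):
        title = (item.findtext("title") or "").strip()
        # Case-insensitive substring match against each keyword.
        if any(keyword in title.lower() for keyword in keywords):
            headlines.append(title)
    return headlines


for headline in extract_headlines(SAMPLE_RSS):
    print(headline)
```

With real feeds, the text passed to extract_headlines could come from urllib.request.urlopen(feed_url).read(), looped over one URL per magazine. Plain keyword matching will miss on-topic headlines that use other wording, so for a corpus of 200 a manual check of the results would still be needed.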