Author Topic: Interlinear gloss data format standards?  (Read 1734 times)

Offline MalFet

  • Global Moderator
  • Serious Linguist
  • *****
  • Posts: 282
  • Country: us
Interlinear gloss data format standards?
« on: July 30, 2014, 10:27:24 PM »
Dear hivemind (especially jkpate!):

Is there anything resembling an open standard for corpus data, especially with interlinear glossing?

After years of frustration, I've finally reached my breaking point with the language documentation tools on the market. I think I can find a bit of money to get some new software made, but I'll still need to make most of the major structural decisions myself. To that end, I'd prefer not to reinvent the wheel. But, this certainly isn't my area of expertise, and I find myself (unsurprisingly) disoriented in a sea of proposals and data ontologies. I can read an XML schema well enough, but I don't have any sense of what people actually working with this stuff have and haven't been using successfully.

I suspect my needs aren't that specialized...quite the contrary; I'm surprised there's not already a clear standard here. Linguists and herding cats, I suppose. If anyone has had positive experiences with a data model, though, I'd love to hear about it. Especially important are:
* Comprehensive markup, including time alignment with audio files
* Compatibility with modern tools (XML beats the weird LaTeX-like command tags that SIL uses, for example)
* Flexibility (These tools will be used for both linguistic analysis and more conventional translation projects, so it would be nice to have multiple tiers of analysis that can be used flexibly)
* Human readability (we'll probably use something like git for data versioning/collaboration, and at least early on most of the commits will be handled manually off of raw data)
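
A minimal sketch of how those four requirements could fit together in one record, assuming Python 3.9+ and its standard `xml.etree.ElementTree` module. Every element and attribute name here (`phrase`, `tier`, `item`, `translation`, `start`, `end`, `audio`) is hypothetical and not taken from any published schema, the file name `session01.wav` is made up, and the Spanish example is only illustrative. The output puts one item per line so that diffs in something like git stay readable.

```python
import xml.etree.ElementTree as ET


def build_phrase(phrase_id, audio, start_ms, end_ms, tiers, translation):
    """Build one time-aligned phrase with parallel word-level tiers.

    `tiers` maps a tier name to a list of strings, one per word.
    All names are hypothetical; this is not an existing standard.
    """
    lengths = {len(items) for items in tiers.values()}
    if len(lengths) > 1:
        raise ValueError("word-level tiers must have the same number of items")

    # Time alignment lives on the phrase as millisecond offsets into the audio.
    phrase = ET.Element("phrase", id=phrase_id, audio=audio,
                        start=str(start_ms), end=str(end_ms))
    # Each analysis tier is a sibling, so a project can add or drop tiers freely.
    for name, items in tiers.items():
        tier = ET.SubElement(phrase, "tier", type=name)
        for index, text in enumerate(items, start=1):
            item = ET.SubElement(tier, "item", n=str(index))
            item.text = text
    free = ET.SubElement(phrase, "translation", lang="en")
    free.text = translation
    return phrase


def to_xml(phrase):
    """Serialize with indentation so each item sits on its own line."""
    ET.indent(phrase)  # available from Python 3.9
    return ET.tostring(phrase, encoding="unicode")


example = build_phrase(
    "p1", "session01.wav", 0, 1850,
    {
        "words": ["los", "gatos", "duermen"],
        "morphemes": ["los", "gat-o-s", "duerm-en"],
        "gloss": ["DET.M.PL", "cat-M-PL", "sleep-3PL"],
    },
    "the cats sleep",
)

if __name__ == "__main__":
    print(to_xml(example))
```

Keeping the tiers as parallel lists aligned by position is the simplest possible choice; a real format would more likely give each item an id so that other annotations can point at it.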

I can get more specific as required. All help much appreciated!

Offline jkpate

  • Forum Regulars
  • Linguist
  • *
  • Posts: 130
  • Country: us
    • American English
    • jkpate.net
Re: Interlinear gloss data format standards?
« Reply #1 on: July 30, 2014, 11:56:36 PM »
Have you looked at the Nite XML Toolkit out of Edinburgh? It allows you to annotate several different layers of structure, and each annotation can point to times in an audio or video file, or even to other annotations. It also comes with a nice query language in first-order logic (although, with pointers, the annotation graph is an arbitrary graph, so some queries can be quite slow). There is pretty good documentation (and the project is still being maintained and developed), and several well-known corpora use this toolkit.

I don't have any personal experience creating annotations using these tools, but, as an end-user of data in this format, the produced annotations are very nice.
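
To make the general idea concrete, here is a rough Python sketch of stand-off annotation, where an annotation either carries its own times or points at other annotations. This is not the Nite XML Toolkit's data model, file format, or API; the `Annotation` class, the `time_span` helper, the layer names, and the timings are all made up for illustration. Because pointers can form an arbitrary graph, the traversal keeps a `seen` set so that a cycle cannot loop forever.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Tuple


@dataclass
class Annotation:
    """One node in a hypothetical stand-off annotation graph."""
    id: str
    layer: str
    label: str
    start: Optional[float] = None  # seconds into the media file, if timed directly
    end: Optional[float] = None
    points_to: List[str] = field(default_factory=list)  # ids of other annotations


def time_span(ann_id: str,
              store: Dict[str, Annotation]) -> Optional[Tuple[float, float]]:
    """Return the (start, end) covered by an annotation.

    Directly timed annotations report their own times; untimed ones
    inherit the span of everything reachable through their pointers.
    """
    seen, stack, spans = set(), [ann_id], []
    while stack:
        current = stack.pop()
        if current in seen:  # guard against cycles and shared children
            continue
        seen.add(current)
        ann = store[current]
        if ann.start is not None and ann.end is not None:
            spans.append((ann.start, ann.end))
        else:
            stack.extend(ann.points_to)
    if not spans:
        return None
    return (min(s[0] for s in spans), max(s[1] for s in spans))


store = {
    "w1": Annotation("w1", "words", "the", start=0.0, end=0.42),
    "w2": Annotation("w2", "words", "cats", start=0.42, end=0.95),
    # A syntax-layer node with no times of its own, only pointers.
    "np1": Annotation("np1", "syntax", "NP", points_to=["w1", "w2"]),
}

if __name__ == "__main__":
    print(time_span("np1", store))  # (0.0, 0.95)
```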
« Last Edit: July 31, 2014, 12:05:20 AM by jkpate »
All models are wrong, but some are useful - George E P Box

Offline MalFet

Re: Interlinear gloss data format standards?
« Reply #2 on: July 31, 2014, 03:18:15 AM »
The Nite XML toolkit looks absolutely beautiful. I'll dig into the nitty-gritty over the next few days, but the design overview sounds like exactly what I wanted...better than what I wanted, actually, because there's all sorts of stuff I would have never come up with on my own. There's even a query syntax!

Thanks for your help.

Offline freknu

  • Forum Regulars
  • Serious Linguist
  • *
  • Posts: 397
  • Country: fi
    • Ostrobothnian (Norse)
Re: Interlinear gloss data format standards?
« Reply #3 on: July 31, 2014, 03:23:42 AM »
If you're going to use something like XML — and although I don't personally like XML very much, it is indeed a stable and mature format that is used widely — then it's all about the structure ... you can throw any idea about making any software (right now) straight out the window!

Making a graphical interface to work with your XML formats is going to be the easy part.

However, I'm afraid this is not something I'm that educated in. Standards in linguistics, or in science generally, are a bit sketchy territory for me. They're not exactly easy to find, and even where standards exist, they often vary from school to school or person to person.

Certain aspects of data can be much more standardised, e.g. GIS and so on, while other things barely have any standards at all.

Searching for anything that could be useful I found:

XCES
http://xml.coverpages.org/xces.html

Natural Language Processing with Python
http://www.nltk.org/book/

ULex
http://www.lrec-conf.org/proceedings/lrec2012/pdf/905_Paper.pdf

But I have no idea how well any of them work or if it's what you need.

(EDIT)

Or just go with jkpate's suggestion :D

Offline MalFet

Re: Interlinear gloss data format standards?
« Reply #4 on: July 31, 2014, 08:14:15 PM »
Quote from: freknu on July 31, 2014, 03:23:42 AM
> If you're going to use something like XML — and although I don't personally like XML very much, it is indeed a stable and mature format that is used widely — then it's all about the structure ... you can throw any idea about making any software (right now) straight out the window!
>
> Making a graphical interface to work with your XML formats is going to be the easy part.

Indeed, at least relatively speaking, which is why I'm hoping somebody else has done the hard part for me already!

XML can be cumbersome because of how verbose it is, but that actually solves exactly the problem I'm dealing with here. There is no shortage of language documentation tools, but each one has a lot of baggage, ossified from the cognitive and methodological habits of its creators. I have no intention of designing tools that will work for all people at all times, but if I keep the UI relatively thin and build on flexible data models, others can more easily package their own (extensible and interchangeable) solutions as they see fit.

The reality is that most of the tools that currently exist were designed before most people had heard of the internet. These days, my typical informant is a subsistence goat herder with a smartphone.

Thanks for the links!