Dear hivemind (especially jkpate!):
Is there anything resembling an open standard for corpus data, especially with interlinear glossing?
After years of frustration, I've finally reached my breaking point with the language documentation tools on the market. I think I can find a bit of money to get some new software made, but I'll still need to make most of the major structural decisions myself. To that end, I'd prefer not to reinvent the wheel. But this certainly isn't my area of expertise, and I find myself (unsurprisingly) disoriented in a sea of proposals and data ontologies. I can read an XML schema well enough, but I don't have any sense of which of these things people actually working with corpus data have and haven't been using successfully.
I suspect my needs aren't that specialized; quite the contrary, I'm surprised there's not already a clear standard here. Linguists and herding cats, I suppose. If anyone has had positive experiences with a particular data model, though, I'd love to hear about it. Especially important are:
* Comprehensive markup, including time alignment with audio files
* Compatibility with modern tools (XML beats the weird LaTeX-like backslash markers that SIL's tools use, for example)
* Flexibility (these tools will be used for both linguistic analysis and more conventional translation projects, so it would be nice to have multiple tiers of analysis that can be applied flexibly)
* Human readability (we'll probably use something like git for data versioning and collaboration, and at least early on most commits will be made by hand against the raw data)
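To make the tier idea concrete, here's a rough sketch of the sort of structure I imagine, with time-aligned phrases, a morpheme tier, and a free translation tier. Every element and attribute name here is invented for illustration; it's not drawn from any existing schema:

```xml
<!-- Hypothetical structure for illustration only, not an existing standard -->
<text audio="session01.wav">
  <phrase id="p1" start="0.000" end="2.310">
    <tier type="transcription">ni-li-soma kitabu</tier>
    <tier type="morphemes">
      <m form="ni" gloss="1SG"/>
      <m form="li" gloss="PST"/>
      <m form="soma" gloss="read"/>
      <m form="kitabu" gloss="book"/>
    </tier>
    <tier type="translation" lang="en">I read a book</tier>
  </phrase>
</text>
```

Something in this shape would diff cleanly under git, and extra tiers (phonetic, part-of-speech, notes) could be added per project without breaking the others. If an existing standard already covers this, that's exactly what I'm hoping to hear about.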
I can get more specific as required. All help much appreciated!