Thursday, February 23, 2012

Boosting corpus size for endangered languages

I've written several previous posts about methods for wrangling Bible translation material into Translation Editor and then into FLEx.  For an endangered language where there is an existing Bible translation, the amount of corpus material that can be added is fairly astonishing.

I ran the stats on the Copala Triqui project today, and I think the numbers tell the story. With about half of the New Testament 'wrangled' into FLEx, the current project has around 100,000 words in it.


If we don't want to look at the translated material, we can filter it out using the Choose Texts menu:


(BTW, I made some mistake in the genre type for the first few bits of Matthew that I experimented with, so they don't show up under the Bible genre. I could easily uncheck these as well, and at some point I will figure out what I have done wrong :-) )

If you filter out all the Bible material, the corpus is about 9,000 words:


A brief note -- Triqui is written in two orthographies (a practical one and a phonetic one).  The stats keep track of how many words are in each orthography as well as the total overall.
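For readers who want to see the idea behind these numbers outside of FLEx, here is a minimal Python sketch of the same kind of tally: words per orthography, a total, and an optional genre filter. This is not how FLEx computes its statistics internally. The `texts` list, the genre and orthography labels, and the `word_counts` function are all made up for illustration, and the placeholder tokens are not Triqui words.

```python
from collections import Counter

# Hypothetical toy corpus: (genre, orthography, text).
# Placeholder tokens only; this is not data from the actual project.
texts = [
    ("Bible", "practical", "w1 w2 w3"),
    ("Bible", "phonetic", "w1 w2"),
    ("Narrative", "practical", "w4 w5"),
    ("Narrative", "phonetic", "w4"),
]

def word_counts(texts, exclude_genres=()):
    """Count whitespace-separated tokens per orthography, skipping excluded genres."""
    counts = Counter()
    for genre, orthography, text in texts:
        if genre in exclude_genres:
            continue  # same idea as unchecking a genre in a text chooser
        counts[orthography] += len(text.split())
    # Add the overall total alongside the per-orthography counts.
    counts["total"] = sum(counts.values())
    return dict(counts)

print(word_counts(texts))
print(word_counts(texts, exclude_genres=("Bible",)))
```

Splitting on whitespace is a rough stand-in for real word segmentation, so counts from a sketch like this would not necessarily match what FLEx reports.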
