Wednesday, May 30, 2012

Bulk editing Cordova Zapotec entries

Continuing my work on the enormous Cordova dictionary of colonial Zapotec.

My general goal is to have the Lexeme Form of each verb show the verb root.  Cordova's usual practice was to cite a verb in the habitual aspect for the 1st person singular.

The habitual has several allomorphs, written in the following way by Cordova (with my best guess at the intended phonemic form):
  • <to> /ru-/
  • <ti> /ri-/
  • <t> /r-/
It is also sometimes written as <te>, though I'm not sure about /re-/ as an allomorph of the habitual in modern Valley Zapotec.

The first person is usually written as 
  • <a> /=a/  after a consonant
  • <ya> /=ya/ after a vowel


The version of the database that we have inherited from Thom Smith Stark often has these separated from the root as follows:


to+chìba-ya ticha-pitào,  'bendezir algo o consagrar' ('bless something or consecrate')




So the stem should be

chiba


with /to-/ and /-ya/ stripped away.  This entry also shows that the /-ya/ is not necessarily final in the entry, since Cordova often includes a typical object along with the verb.  Here the object is ticha pitào 'word of God'.
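The stripping described above can be sketched in Python. This is only an illustration of the idea, not what FLEx does internally; the helper name `lexeme_form` is made up, and it assumes the entry already carries the + and - separators. Note that it keeps any accent marks on the root as written.

```python
import re

# Habitual prefixes as marked in the analyzed entries: to+, ti+, t+.
HABITUAL_PREFIX = re.compile(r"^t[oi]?\+")
# First person singular suffix, -a or -ya, at the end of the verb word.
FIRST_SG_SUFFIX = re.compile(r"-y?a$")

def lexeme_form(citation: str) -> str:
    """Return the bare root of an analyzed citation form (hypothetical helper)."""
    # The verb is the first word; a typical object may follow it.
    verb = citation.split()[0]
    verb = HABITUAL_PREFIX.sub("", verb)
    verb = FIRST_SG_SUFFIX.sub("", verb)
    return verb

print(lexeme_form("to+chìba-ya ticha-pitào"))
```

Working on the first word only is what keeps the object (ticha-pitào) from interfering with the search for the suffix.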

I've worked through most of the verbs in the 5000 imported Cordova entries at this point.  

My first step was to copy all the information in the original entry to the Citation Form field, so that I always have the original form available.  Then I work on the Lexeme Form field to a.) remove the habitual aspect prefixes, b.) remove the 1st singular suffix, and c.) put the information about completive and potential aspects into special fields.

The procedure uses the Bulk Edit function of FLEx, generally searching for various allomorphs of the habitual and 1sg and replacing them with nothing.  This is easiest for the entries that have Thom Smith Stark's analysis, where + follows the prefix and - precedes the suffix.  I can search for entries with to+, ti+, t+, -a, and -ya pretty easily.

Here are some screen shots, first filtering to find all the examples of the pattern.  The search uses regular expressions, so ^ means at the beginning of the record and the \ makes the following + be interpreted literally as + (not some function).
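To see why the backslash matters, here is a small Python sketch comparing the escaped and unescaped patterns on two forms from this post. Python's `re` module is used only as a stand-in; I'm assuming the FLEx search treats ^ and \+ the same way for these simple cases.

```python
import re

analyzed = "to+chìba-ya ticha-pitào"  # entry with the to+ prefix marked
noun = "tola"                          # plain noun that happens to start with "to"

escaped = re.compile(r"^to\+")   # literal "to+" at the start of the record
unescaped = re.compile(r"^to+")  # "t" followed by one or more "o"

escaped_hits = [bool(escaped.search(s)) for s in (analyzed, noun)]
unescaped_hits = [bool(unescaped.search(s)) for s in (analyzed, noun)]

print(escaped_hits)    # [True, False]
print(unescaped_hits)  # [True, True]
```

Without the backslash, the pattern also picks up tola, which is exactly the kind of entry that should be left alone.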

Here is the bulk replace setup screen:

And here is an example of an entry after the Bulk Replace has removed the to- prefix.


(I also changed the part of speech for all items with the to+ pattern to Verb.)

More difficult are the entries where Thom did not do the analysis.   It's not correct to remove every initial to sequence, since some of the matching items are just nouns that start with to.   For example, the noun tola  'sin'  shouldn't be changed to la with a to prefix.


What I tried here was searching for the pattern to...a or to...ya, then inspecting the results to make sure that the Spanish gloss seems to be a verbal form (generally cited in either the infinitive or the past participle form).  I changed the part of speech for all of the good instances to Verb, and made a few manual changes to other parts of speech when I could figure them out.
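A rough Python sketch of that candidate search follows. The function name `is_candidate` and the form tochibaya are invented for illustration, and the pattern is only my approximation of the to...(y)a search. It flags candidates for inspection and does not decide anything on its own.

```python
import re

# First word starts with "to" and ends in "a" (which also covers "ya").
CANDIDATE = re.compile(r"^to\S+a$")

def is_candidate(lexeme: str) -> bool:
    """Flag an unanalyzed entry as a possible to...(y)a verb (hypothetical helper)."""
    words = lexeme.split()
    if not words or "+" in words[0]:
        # Empty, or already analyzed with the + separator.
        return False
    return CANDIDATE.match(words[0]) is not None

for form in ["tochibaya", "tola", "ticha"]:  # "tochibaya" is a made-up form
    print(form, is_candidate(form))
```

The noun tola is flagged here too, which is why each hit still has to be checked against the Spanish gloss by hand.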


After the valid instances of verbal to...(y)a  were identified, I filtered the data to show only verbs and then used the same Bulk Replace method to delete the prefixes and suffixes from the Lexeme Form.
