Friday, March 1, 2013

More sophisticated regular expression searching in the Córdova

I am not a computational linguist by any means, but I have slowly been learning enough about regular expressions to do some useful things with FLEx.  One thing I just learned is how to use regular expressions in search and replace operations in FLEx.  (The FLEx help menus are not very explicit on this.)

In a FLEx search and replace function (in Bulk Edit, for example), each part of the pattern that is enclosed in parentheses saves the text it matched, called a capture.  You can refer to a capture with $ followed by its number.  So the material matched by the first set of parentheses is $1, the material matched by the second set is $2, and so on.  Here is an example of how I used this in the Colonial Valley Zapotec database.
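To see numbered captures at work outside FLEx, here is a minimal sketch in Python.  It is only an illustration, not FLEx itself: Python's re module writes \1 and \2 in the replacement where FLEx writes $1 and $2, and the sample string is made up.

```python
import re

# Two capture groups: the (\w+) before the hyphen is group 1,
# the (\w+) after the hyphen is group 2.
pattern = re.compile(r"(\w+)-(\w+)")

# Python writes \1 and \2 where FLEx writes $1 and $2.
# Here the two captures are put back in the opposite order.
swapped = pattern.sub(r"\2-\1", "left-right")
print(swapped)
```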


Córdova normally cites a verb in the 1st person habitual.  Depending on the verb, the habitual prefix might be any of the allomorphs /ti/, /to/, or /te/.  The verb root will usually be four to eight letters long, and the first person form will end in /a/.

So if the form cited is tichapa, I would like it to be segmented ti+chap-a.

The "Find what" on the first line sets up a first capture group, which is the prefix, made up of t plus either e, i, or o.  (Square brackets enclose a set of alternatives, any one of which may match.)  Because this whole first unit is between parentheses, it is capture group one, which I can refer to as $1 in the "Replace With" line below.

I want to replace it with the same thing, followed by a + to show the boundary.
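This first step, taken on its own, can be sketched in Python.  This is an illustration rather than the FLEx dialog: the ^ anchor is my assumption, so that only a word-initial prefix matches, and \1 stands in for $1.

```python
import re

# Group 1: "t" followed by exactly one of e, i, or o, at the start of the word.
prefix = re.compile(r"^(t[eio])")

# Put the captured prefix back, followed by "+" to mark the boundary.
marked = prefix.sub(r"\1+", "tichapa")
print(marked)
```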

The next capture group is a run of letters (shown by \w, meaning any word-forming character), and I have set the number of them to between 4 and 8.  (On second thought, perhaps the lower number should have been three…)
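The effect of the 4 to 8 range can be checked with a small Python sketch.  The helper name fits and the test strings are hypothetical, and note that \w in Python also accepts digits and the underscore, not only letters.

```python
import re

# \w{4,8} matches a run of at least 4 and at most 8 word-forming characters.
root = re.compile(r"\w{4,8}")

def fits(s: str) -> bool:
    # fullmatch succeeds only if the pattern covers the entire string
    return root.fullmatch(s) is not None

# A 3-character string is too short, "chap" fits, a 9-character string is too long.
print(fits("abc"), fits("chap"), fits("abcdefghi"))
```

Changing the 4 to a 3 in the pattern would let three-letter roots through as well.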

Since this is the second capture group, I can refer to it by $2, in the "Replace With" line, and this time I replace it with the same thing, followed by a hyphen.
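Putting the pieces together, the whole find and replace might look something like this in Python.  This is a sketch under assumptions, not the exact pattern I typed into FLEx: the anchors ^ and $, the literal final a, and the function name segment are all mine, and \1 and \2 stand in for $1 and $2.

```python
import re

# Assumed pattern: prefix (group 1), root of 4 to 8 characters (group 2),
# then the first person ending "a". The anchors and the literal "a" are
# assumptions for this illustration.
verb = re.compile(r"^(t[eio])(\w{4,8})a$")

def segment(form: str) -> str:
    # \1+ restores the prefix plus a boundary marker;
    # \2- restores the root plus a hyphen; the final "a" is written back literally.
    return verb.sub(r"\1+\2-a", form)

print(segment("tichapa"))
```

A form that does not fit the pattern is returned unchanged, which makes it easy to spot the entries that still need attention.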

This is a first attempt at using regular expressions with FLEx, but I think I can already see how they are going to make it possible to accomplish more sophisticated data manipulation as we try to get the Córdova diccionario into a format that we can understand.
