Morphological Inflection Engine

This computational morphological inflection engine began as a commercial project: generating the list of all the inflected forms for each lemma of a Greek or Latin vocabulary. However, given the nature of the Classical languages and the many applications a more scientific work might offer, I have conceived a more sophisticated structure for it, adopting a diachronic approach from the ground up. The general structure is currently implemented and I am working on Latin as a prototype for the future product, but this software, like all the other areas of the system, has been designed from the start to handle both Greek and Latin.

The Traditional 'Grid' Model And A New Perspective

Automatic inflection systems are usually built for modern languages with strictly synchronic structures, by defining a sort of ideal "grid" where each row is a word and each column a position in the inflectional paradigm. For instance, we could imagine the following rows for English nouns:

lemma     singular       plural
sample    sample (+0)    samples (+s)
ax        ax (+0)        axes (+es)
city      city (+0)      cities (y > i + es)
wolf      wolf (+0)      wolves (f > v + es)
wharf     wharf (+0)     wharves / wharfs
...       ...            ...

In this structure we could link to each cell a set of rules that take the word as input and transform it into the required inflected form: e.g. starting from the lemma, add nothing for the singular; for the plural, add -s for most words, but -es for words ending in ch, s, sh, x or z, and also for words in -y or -f (which additionally change -y into -i and -f into -v), etc.
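The per-cell rules just described can be sketched in a few lines of Python. This is only an illustration of the 'grid' idea, not code from the engine: the names CELLS, F_TO_V, F_BOTH and inflect are hypothetical, and the word lists cover only the sample rows above.

```python
import re

# Hypothetical word lists for the -f nouns in the sample table.
F_TO_V = {"wolf"}    # plural only in -ves
F_BOTH = {"wharf"}   # both -ves and -fs are attested

def singular_rule(lemma):
    # Singular cell: the lemma itself (zero ending).
    return [lemma]

def plural_rule(lemma):
    # Plural cell: a small cascade of surface rules.
    if lemma in F_BOTH:
        return [lemma[:-1] + "ves", lemma + "s"]
    if lemma in F_TO_V:
        return [lemma[:-1] + "ves"]
    if re.search(r"[^aeiou]y$", lemma):        # city -> cities
        return [lemma[:-1] + "ies"]
    if re.search(r"(ch|sh|s|x|z)$", lemma):    # ax -> axes
        return [lemma + "es"]
    return [lemma + "s"]                       # sample -> samples

# The 'grid': one rule set per column (cell) of the paradigm.
CELLS = {"singular": singular_rule, "plural": plural_rule}

def inflect(lemma):
    # Fill one row of the grid for the given lemma.
    return {cell: rule(lemma) for cell, rule in CELLS.items()}

for word in ["sample", "ax", "city", "wolf", "wharf"]:
    print(word, inflect(word))
```

Even in this tiny case the plural cell already needs word lists for its special cases, which hints at how quickly such cells grow.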

This kind of "flat" structure is often enough for modern languages in a synchronic perspective, but even a practical approach to Classical languages requires much more complex inflection strategies. Every student of Greek or Latin has dealt with dialectal, stylistic or historical variants, which a purely synchronic approach classifies as 'exceptions'; they are often so many that their lists fill many pages of a grammar, giving the impression of a wild or even nonsensical variety. A simple 'grid' structure would not scale well with so many 'exceptions' and with a much richer set of forms (a single Latin adjective typically includes 36 forms against the single form of an English adjective, and a standard Latin verb counts about 600 forms). You can get the idea by interactively playing with one of the first stages of my morphological engine, the Template Generation Rules, which define the ideal 'grid' to be filled with inflected forms.

Further, each 'cell' of the grid would have to include not only all the different formations for that paradigm position, but also all the special treatments of peculiar words, which a student usually finds annotated in a vocabulary (e.g. the defective paradigm of virus, pluralia or singularia tantum, vestigia like familias in fixed formulas, etc.). With so many inflections branching from a single 'cell', a software program based on a grid structure would quickly become a nightmare. The effort becomes even less attractive when we consider the redundancy it would imply in its procedures, since the same phenomena often recur in many different 'cells'. To take a trivial example, think about the variants with o/u in words like servos/servus, servom/servum, donom/donum, volgus/vulgus, relinquont/relinquunt, etc.: for each cell and each branch in it we would have to replicate the same variants. Of course, this happens because historically the same processes operate throughout the system of the language, thus affecting various morphological classes; but in a flat grid structure limited to surface forms, such hidden forces are never recognized as such, only through their various effects.
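To make the redundancy concrete, here is a minimal Python sketch in which the o/u alternation is stated once and applied to any form, whatever cell it belongs to. It is a deliberately simplified toy, not the engine's actual rules: the patterns in O_TO_U and the function classical_spelling are hypothetical, they work on bare spelling only, and they ignore vowel length, so they would over-apply on real data.

```python
import re

# Hypothetical, simplified spelling patterns (older form -> classical form).
# They ignore vowel length and phonetic context beyond what is shown.
O_TO_U = [
    (re.compile(r"o(s|m|nt)$"), r"u\1"),      # servos -> servus, relinquont -> relinquunt
    (re.compile(r"^vol(?=[^aeiou])"), "vul"),  # volgus -> vulgus
]

def classical_spelling(form):
    # Apply every pattern in turn, regardless of the form's paradigm cell.
    for pattern, replacement in O_TO_U:
        form = pattern.sub(replacement, form)
    return form

# Forms taken from different 'cells' (noun cases, a verb form).
older_forms = ["servos", "servom", "donom", "volgus", "relinquont"]
for form in older_forms:
    print(form, "->", classical_spelling(form))
```

In a grid, each of these pairs would have to be listed separately in its own cell; here the same two patterns cover a nominative, an accusative, a neuter and a verb form.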

As a simple analogy, think of a shape drawn in an ancient black and white mosaic: the artist's lines define a coherent shape, but the ideal grid superimposed on the drawing fragments it into small pieces. If one looks at just one piece at a time, without seeing the full shape it is part of, it is hard to understand why even adjacent pieces sometimes come out black instead of white, or vice versa. In this analogy the historical processes shaping the surface word forms are the artist's drawing, and the small tesserae are the single cells of the 'grid' model described above. Of course, it is much easier to predict whether a tessera should be black or white from the full drawing than by defining empirical rules from the observation of single cells. The cells might indeed show some detectable pattern rather than a random distribution, being the effect of such hidden processes, but a change in perspective is much more effective.

Thus, when asked to build a system capable of inflecting all the lemmata of a Greek or Latin vocabulary, I decided to adopt this change of perspective, not only for its practical benefits (this started as a commercial project), but also because of its much higher potential in scientific applications. The 'grid' model might seem simpler to implement at first, but implementing it in a Classical scenario would quickly make us regret this apparent simplicity. Conversely, the model I have conceived and am implementing is much more complex when it comes to defining its general structure, but once operational it will prove much more powerful and easier to build and maintain; and, last but not least, it is surely much more attractive from a scholar's point of view!

Highlights
