Morphological Inflection Engine
The project of a computer morphological inflection engine started with a commercial
purpose: generating a list of all the inflected forms for each lemma of a
Greek or Latin vocabulary. However, given the nature of the Classical languages
and the many applications a more scientific work might offer, I've conceived
a more sophisticated structure for it, adopting a diachronic approach from the
ground up. Currently the general structure has been implemented and I'm working on
Latin as a prototype for the future product, but this software, like all the other
areas of the system, has been designed from the start to be ready for both Greek
and Latin.
The Traditional 'Grid' Model And A New Perspective
Usually, automatic inflection systems are built for modern languages with strictly
synchronic structures, by defining a sort of ideal "grid" where each row is a word
and each column a position in the inflectional paradigm: for instance, we could
imagine the following rows for English nouns:
| lemma | singular | plural |
|---|---|---|
| sample | sample (+0) | samples (+s) |
| ax | ax (+0) | axes (+es) |
| city | city (+0) | cities (y > i + es) |
| wolf | wolf (+0) | wolves (f > v + es) |
| wharf | wharf (+0) | wharves / wharfs |
| ... | ... | ... |
In this structure we could link to each cell a set of rules which take the word
as input and transform it into the required inflected form: e.g. add zero to the
lemma for the singular; for the plural, add -s for most words, but -es for words
ending in ch, s, sh, x, z, and also for some words ending in y or f (additionally
transforming -y into -i and -f into -v), etc.
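As a rough illustration of this 'grid' idea (not the engine's actual code), the
following Python sketch attaches one rule function to each cell of the English noun
grid shown above. All the names here (`GRID`, `inflect`, `F_TO_VES`, `ALSO_REGULAR`)
are hypothetical, and the two word lists are just the samples from the table, since
the -f > -ves change is lexical rather than predictable:

```python
# A minimal sketch of the flat 'grid' model: one rule function per cell.
# Every name and word list below is an illustrative assumption.

VOWELS = "aeiou"
F_TO_VES = {"wolf", "wharf"}   # words taking f > v + es (lexical, not predictable)
ALSO_REGULAR = {"wharf"}       # words that also allow the plain -s plural


def singular(lemma):
    """Singular cell: add zero to the lemma."""
    return [lemma]


def plural(lemma):
    """Plural cell: return every accepted plural form of the lemma."""
    if lemma.endswith(("ch", "s", "sh", "x", "z")):
        return [lemma + "es"]
    if len(lemma) > 1 and lemma.endswith("y") and lemma[-2] not in VOWELS:
        return [lemma[:-1] + "ies"]          # y > i + es
    if lemma in F_TO_VES:
        forms = [lemma[:-1] + "ves"]         # f > v + es
        if lemma in ALSO_REGULAR:
            forms.append(lemma + "s")
        return forms
    return [lemma + "s"]


# The grid itself: each column of the table is bound to its rule.
GRID = {"singular": singular, "plural": plural}


def inflect(lemma):
    """Fill one row of the grid for the given lemma."""
    return {cell: rule(lemma) for cell, rule in GRID.items()}


if __name__ == "__main__":
    for word in ("sample", "ax", "city", "wolf", "wharf"):
        print(word, inflect(word))
```

Even in this tiny case the plural cell already needs branches and word lists of its
own, which gives a first hint of how such cells grow once real exceptions appear.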
This kind of "flat" structure is often enough for modern languages in a synchronic
perspective, but even a practical approach to Classical languages requires much more
complex inflection strategies. Every student of Greek or Latin has dealt with dialectal,
stylistic or historical variants, classified as 'exceptions' by a purely synchronic
approach, often so many that their lists fill many pages of a grammar, giving the
impression of a wild or even nonsensical variety. A simple 'grid' structure would
not scale well with so many 'exceptions' and a much richer set of forms
(a single Latin adjective typically includes 36 forms against the single form
of English, and a standard Latin verb counts about 600 forms). You can get the idea
by interactively playing with one of the first stages of my morphological engine,
the Template Generation Rules, which define the ideal
'grid' to be filled with inflected forms.
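To give a concrete idea of the size of such a grid, the sketch below builds an empty
template for a Latin adjective. It assumes that the 36 forms mentioned above come from
crossing three genders, two numbers and six cases in the positive degree; the function
name and the layout of the template are hypothetical, and do not reproduce the actual
Template Generation Rules:

```python
# A minimal sketch of template generation for a Latin adjective.
# Assumption: 3 genders x 2 numbers x 6 cases = 36 cells (positive degree only).
from itertools import product

GENDERS = ("masculine", "feminine", "neuter")
NUMBERS = ("singular", "plural")
CASES = ("nominative", "genitive", "dative",
         "accusative", "vocative", "ablative")


def adjective_template():
    """Return an empty grid: one cell per (gender, number, case) combination.

    Each cell holds a list, because a single paradigm position may be
    filled by several variant forms.
    """
    return {cell: [] for cell in product(GENDERS, NUMBERS, CASES)}


if __name__ == "__main__":
    template = adjective_template()
    print(len(template), "cells")
    print(next(iter(template)))
```

Each cell is a list rather than a single string precisely because of the variants
discussed here: one paradigm position may need to hold several forms at once.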
Further, each 'cell' of the grid would be required to include not only all the different
formations for that paradigm position, but also all the special treatments of peculiar
words, which a student usually finds annotated in a vocabulary (e.g. the defective
paradigm of virus, pluralia or singularia tantum, vestigia like familias in fixed
formulas, etc.). With so many inflections branching from a single 'cell', a software
program based on a grid structure would quickly become a nightmare. This effort
becomes even less attractive when considering the redundancy it would imply in
its procedures, since the same phenomena often recur in many different
'cells'. To take a trivial example, think about the variants with o/u in words like
servos/servus, servom/servum, donom/donum, volgus/vulgus, relinquont/relinquunt,
etc.: for each cell and each branch in it we would have to replicate the same variants.
Of course, this happens because historically the same processes operate
throughout the system of the language, thus affecting various morphological classes;
but in a flat grid structure limited to surface forms, such hidden
forces operating under the cover are not recognized as such, but only through
their various effects.
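The redundancy can be made tangible with a small sketch: instead of listing the o/u
pair in every cell, a single set of sound-change rules is applied to whatever form any
cell produces. The two rules below are deliberately simplified assumptions made for
this illustration (they ignore vowel quantity, so they are not safe on real data), and
names like `SOUND_CHANGES` and `classical_form` are hypothetical:

```python
# A minimal sketch: one shared set of sound changes instead of per-cell variants.
# The rules are simplified assumptions and ignore vowel quantity.
import re

SOUND_CHANGES = [
    # o > u in a final syllable before s, m, nt: servos > servus
    (re.compile(r"o(s|m|nt)$"), r"u\1"),
    # o > u before l + consonant: volgus > vulgus
    (re.compile(r"ol(?=[^aeioul])"), "ul"),
]


def classical_form(archaic):
    """Apply every sound change, in order, to an archaic form."""
    form = archaic
    for pattern, replacement in SOUND_CHANGES:
        form = pattern.sub(replacement, form)
    return form


def variants(archaic):
    """Return the archaic form, plus its classical outcome when it differs."""
    classical = classical_form(archaic)
    return [archaic] if classical == archaic else [archaic, classical]


if __name__ == "__main__":
    for word in ("servos", "servom", "donom", "volgus", "relinquont", "rosa"):
        print(word, "->", variants(word))
```

The same two rules serve a nominative, an accusative, a neuter noun and a verb form
alike, with no knowledge of which 'cell' each word came from.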
As a simple analogy, you could think of a shape drawn in an ancient black and
white mosaic: the artist's lines define a coherent shape, but the ideal grid superimposed
on the drawing breaks it into small pieces. If one looks at just one piece at
a time without seeing the full shape it belongs to, it is hard to understand
why some pieces, even when adjacent, come out black instead of white, or
vice versa. In this analogy the historical processes shaping the surface word forms
are the artist's drawing, and the small tesserae are the single cells of the 'grid'
model described above. Of course, it would be much easier to predict whether a tessera
should be black or white from the full drawing than by deriving empirical
rules from the observation of single cells. The cells might indeed show some
detectable pattern rather than a random distribution, being the effect of such hidden
processes, but a change in perspective would be much more effective.
Thus, when asked to build a system capable of inflecting all the lemmata of a Greek
or Latin vocabulary I decided to adopt this change of perspective, not only for
its practical benefits (this started as a commercial project), but also because
of its much higher potential in scientific applications. The 'grid' model might
seem simpler to implement at the start, but implementing it in a Classical scenario
would quickly make us regret this apparent simplicity. Conversely, the model I've
conceived and am implementing is much more complex when it comes to defining its general
structure, but once operational it will prove much more powerful and easier to build
and maintain; and, last but not least, it is surely much more attractive from a scholar's
point of view!