Description Tables and Heuristic Conversion
A description table in Theuth is the building block of textual conversions;
it is just an XML file describing the encoding used by an arbitrarily encoded
font or any other similar encoding system.
Font description tables do not say things like "convert code X of font A into
code Y of font B": they are designed for minimal effort, and such an approach would
require you to create a separate table for each conversion (from A to B, from B to A, etc.).
Instead, a table just describes each font code in terms of entities: for instance, you
describe Unicode U+03AC as Greek letter alpha (a 'segmental' entity) + acute accent
(a 'suprasegmental' entity, essentially a diacritic). Now say we have another encoding,
for instance the one corresponding to the SuperGreek font, defining code 97 as alpha
and code 118 as acute accent (a zero-width character superimposed on the previous character);
Theuth will be able to convert between the two in both directions:
- Unicode to SuperGreek: when Theuth finds Unicode U+03AC it first decomposes
it into its entities according to its description table for Unicode, thus getting
alpha and acute accent; then it looks in the target font description
table (here SuperGreek) for the best character(s) representing the same entities: it
does not find a precombined character, so it just collects the one representing alpha
(97) and the one representing acute accent (118). Thus, it effectively converts
code 940 (Unicode alpha + acute) into codes 97, 118 (SuperGreek alpha and acute).
Notice that in this approach we do not have a conversion table telling us to convert
940 into 97 + 118, but two independent tables which just describe each character of
both encodings: it is the converter software which finds out the best conversion for
each specific case.
- SuperGreek to Unicode: when Theuth finds codes 97 and 118 it knows it must
grab them together as 118 is a 'superposable entity' (a diacritic): it thus decomposes
both into entities getting alpha from 97 and acute accent from 118,
and then looks in the Unicode description table for the best match. It finds 2 matches,
the precomposed U+03AC and its equivalent sequence U+03B1 U+0301, and opts for the former.
Thus, SuperGreek codes 97, 118 become Unicode 940.
In both examples the same set of entities, alpha + acute accent, is converted into its proper encoding.
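The two conversions above can be sketched in a few lines of Python. This is only an illustrative sketch, not Theuth's actual implementation: real description tables are XML files, whereas here they are assumed to be plain dictionaries mapping a code to its set of entities, and the names SUPRASEGMENTALS, decompose, encode and convert are hypothetical.

```python
from typing import Dict, FrozenSet, List

# Hypothetical in-memory stand-in for a description table: code -> entities.
Table = Dict[int, FrozenSet[str]]

# Assumption: the entities that count as suprasegmental (diacritics).
SUPRASEGMENTALS = frozenset({"acute"})

UNICODE: Table = {
    0x03B1: frozenset({"alpha"}),
    0x0301: frozenset({"acute"}),
    0x03AC: frozenset({"alpha", "acute"}),  # precomposed, decimal 940
}
SUPERGREEK: Table = {
    97: frozenset({"alpha"}),
    118: frozenset({"acute"}),
}


def decompose(codes: List[int], table: Table) -> List[FrozenSet[str]]:
    """Turn source codes into entity groups, one group per character."""
    groups: List[FrozenSet[str]] = []
    for code in codes:
        entities = table[code]
        # A code made only of diacritics is merged into the previous group.
        if groups and entities <= SUPRASEGMENTALS:
            groups[-1] = groups[-1] | entities
        else:
            groups.append(entities)
    return groups


def encode(entities: FrozenSet[str], table: Table) -> List[int]:
    """Find target codes for one entity group, preferring a precomposed code."""
    for code, described in table.items():
        if described == entities:
            return [code]
    # No precomposed code: emit the segmental part, then each diacritic.
    parts = [entities - SUPRASEGMENTALS]
    parts += [frozenset({d}) for d in sorted(entities & SUPRASEGMENTALS)]
    result: List[int] = []
    for part in parts:
        for code, described in table.items():
            if described == part:
                result.append(code)
                break
    return result


def convert(codes: List[int], source: Table, target: Table) -> List[int]:
    """Convert codes by going through entities, never through a pair table."""
    return [c for group in decompose(codes, source) for c in encode(group, target)]


print(convert([0x03AC], UNICODE, SUPERGREEK))  # Unicode to SuperGreek
print(convert([97, 118], SUPERGREEK, UNICODE))  # SuperGreek to Unicode
```

Note that neither table mentions the other: adding a third encoding to this sketch would only require one more dictionary, not one table per pair of encodings.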
It may also be the case that not all the entities match: for instance a font A may have a
precomposed code for alpha + acute + short, a font B may have
a precomposed code only for alpha + acute but a superposable character
code for short; a font C may only have codes for alpha and acute.
In any case the converter will find the best match, so that even in the worst case (from A to C)
the conversion produces a partial match rather than losing the original character altogether
(alpha + acute + short becomes alpha + acute, so only short is lost
rather than the whole set of entities).
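A minimal sketch of such a best-match search follows, under stated assumptions: the tables are plain Python dictionaries rather than Theuth's XML files, all code numbers for fonts A, B and C are invented for the example, and the greedy strategy (always take the code covering the most remaining entities) is just one simple way to get the behaviour described above, not necessarily the one Theuth uses.

```python
from typing import Dict, FrozenSet, List, Set, Tuple

# Hypothetical tables: code -> entities. All code numbers are invented.
FONT_A: Dict[int, FrozenSet[str]] = {
    200: frozenset({"alpha", "acute", "short"}),  # fully precomposed
}
FONT_B: Dict[int, FrozenSet[str]] = {
    210: frozenset({"alpha", "acute"}),  # precomposed alpha + acute
    211: frozenset({"short"}),           # superposable short
}
FONT_C: Dict[int, FrozenSet[str]] = {
    97: frozenset({"alpha"}),
    118: frozenset({"acute"}),
}


def best_match(
    entities: FrozenSet[str], table: Dict[int, FrozenSet[str]]
) -> Tuple[List[int], Set[str]]:
    """Greedily cover the entities; return (codes, entities that were lost)."""
    remaining = set(entities)
    codes: List[int] = []
    while remaining:
        # Only codes that describe nothing beyond what is still needed.
        candidates = [(c, e) for c, e in table.items() if e <= remaining]
        if not candidates:
            break  # nothing in the target font can represent the rest
        # Prefer the code covering the most entities; ties keep table order.
        code, covered = max(candidates, key=lambda item: len(item[1]))
        codes.append(code)
        remaining -= covered
    return codes, remaining


source = frozenset({"alpha", "acute", "short"})
for name, table in (("A", FONT_A), ("B", FONT_B), ("C", FONT_C)):
    codes, lost = best_match(source, table)
    print(name, codes, "lost:", sorted(lost))
```

In this sketch font A yields its single precomposed code, font B yields the precomposed alpha + acute followed by the superposable short, and font C yields alpha and acute while reporting short as lost.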