Description Tables and Heuristic Conversion

A description table in Theuth is the building block of textual conversions; it is just an XML file describing the encoding used by an arbitrarily encoded font or any other similar encoding system.

Font description tables do not say something like "convert code X of font A into code Y of font B". They are designed for minimal effort, and that approach would require a separate table for each conversion (from A to B, from B to A, etc.). Instead, a table just describes each font code in terms of entities: for instance, Unicode U+03AC is described as the Greek letter alpha (a 'segmental' entity) + acute accent (a 'suprasegmental' entity, essentially a diacritic). Now say we have another encoding, for instance the one corresponding to the SuperGreek font, which defines code 97 as alpha and code 118 as acute accent (a zero-width character superimposed on the previous character); Theuth will be able to convert between the two in both directions:

  • Unicode to SuperGreek: when Theuth finds Unicode U+03AC (decimal 940), it first decomposes it into its entities according to its description table for Unicode, thus getting alpha and acute accent. It then looks in the target font description table (here SuperGreek) for the best character(s) representing the same entities. It does not find a precomposed character, so it just collects the one representing alpha (97) and the one representing acute accent (118). Thus, it effectively converts code 940 (Unicode alpha + acute) into codes 97, 118 (SuperGreek alpha and acute). Notice that in this approach there is no conversion table telling us to convert 940 into 97 + 118, only two independent tables which each describe the characters of one encoding: it is the converter software which works out the best conversion for each specific case.
  • SuperGreek to Unicode: when Theuth finds codes 97 and 118, it knows it must take them together, as 118 is a 'superposable entity' (a diacritic). It thus decomposes both into entities, getting alpha from 97 and acute accent from 118, and then looks in the Unicode description table for the best match. It finds two matches, the precomposed U+03AC and its equivalent sequence U+03B1 U+0301, and opts for the former. Thus, SuperGreek codes 97, 118 become Unicode 940.
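The two directions described above can be sketched in a few lines of Python. This is only an illustration of the idea, not Theuth's actual implementation: the in-memory tables stand in for the XML description tables, and the names `decompose`, `encode`, `convert` and `SUPRASEGMENTALS` are assumptions made for this example.

```python
from typing import Dict, FrozenSet, List

# Hypothetical in-memory description tables: each code is described only
# by the entities it represents. Real Theuth tables are XML files.
Table = Dict[int, FrozenSet[str]]

UNICODE: Table = {
    0x03B1: frozenset({"alpha"}),
    0x0301: frozenset({"acute"}),
    0x03AC: frozenset({"alpha", "acute"}),  # precomposed alpha + acute
}
SUPERGREEK: Table = {
    97: frozenset({"alpha"}),
    118: frozenset({"acute"}),  # zero-width, superposed on the previous character
}

# Entities treated as diacritics in this sketch (an assumption).
SUPRASEGMENTALS = frozenset({"acute"})


def decompose(codes: List[int], table: Table) -> List[FrozenSet[str]]:
    """Turn source codes into entity groups, attaching diacritics to the previous group."""
    groups: List[FrozenSet[str]] = []
    for code in codes:
        entities = table[code]
        if groups and entities <= SUPRASEGMENTALS:
            groups[-1] = groups[-1] | entities
        else:
            groups.append(entities)
    return groups


def encode(entities: FrozenSet[str], table: Table) -> List[int]:
    """Greedily pick target codes: base characters first, then the widest coverage."""
    remaining = set(entities)
    out: List[int] = []
    while remaining:
        candidates = [(c, e) for c, e in table.items() if e and e <= remaining]
        if not candidates:
            break  # nothing in the target table represents what is left
        code, covered = max(
            candidates,
            key=lambda item: (not item[1] <= SUPRASEGMENTALS, len(item[1])),
        )
        out.append(code)
        remaining -= covered
    return out


def convert(codes: List[int], source: Table, target: Table) -> List[int]:
    """Convert codes by way of entities, with no direct source-to-target table."""
    result: List[int] = []
    for group in decompose(codes, source):
        result.extend(encode(group, target))
    return result


print(convert([0x03AC], UNICODE, SUPERGREEK))   # [97, 118]
print(convert([97, 118], SUPERGREEK, UNICODE))  # [940]
```

Preferring the candidate that covers the most entities is what makes the precomposed U+03AC win over the sequence U+03B1 U+0301 in this sketch; the real converter may rank matches differently.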

In both examples the same set of entities, alpha + acute accent, is converted into its proper encoding. It may also be the case that not all the entities match: for instance, a font A may have a precomposed code for alpha + acute + short; a font B may have a precomposed code only for alpha + acute, plus a superposable character code for short; a font C may only have codes for alpha and acute. In every case the converter finds the best match, so that even in the worst case (from A to C) the conversion produces a partial match rather than losing the original character altogether: alpha + acute + short becomes alpha + acute, so only short is lost rather than the whole set of entities.
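The partial-match behaviour can be illustrated with a small self-contained sketch. The code values for fonts A, B and C and the function name `best_match` are invented for this example; only the entity names come from the text above.

```python
from typing import Dict, FrozenSet, List, Set, Tuple

Table = Dict[int, FrozenSet[str]]

# Entities treated as diacritics in this sketch (an assumption).
SUPRASEGMENTALS = frozenset({"acute", "short"})

# Hypothetical fonts; all code values are made up for illustration.
FONT_A: Table = {200: frozenset({"alpha", "acute", "short"})}
FONT_B: Table = {210: frozenset({"alpha", "acute"}), 211: frozenset({"short"})}
FONT_C: Table = {97: frozenset({"alpha"}), 118: frozenset({"acute"})}


def best_match(entities: FrozenSet[str], table: Table) -> Tuple[List[int], Set[str]]:
    """Return the chosen target codes and the entities that could not be represented."""
    remaining = set(entities)
    codes: List[int] = []
    while remaining:
        candidates = [(c, e) for c, e in table.items() if e and e <= remaining]
        if not candidates:
            break  # the target font has nothing for the leftover entities
        # Base characters first, then the code covering the most entities.
        code, covered = max(
            candidates,
            key=lambda item: (not item[1] <= SUPRASEGMENTALS, len(item[1])),
        )
        codes.append(code)
        remaining -= covered
    return codes, remaining


source_entities = FONT_A[200]  # alpha + acute + short
print(best_match(source_entities, FONT_B))  # ([210, 211], set())
print(best_match(source_entities, FONT_C))  # ([97, 118], {'short'})
```

Returning the unmatched entities alongside the codes makes the loss explicit: converting from A to C keeps alpha and acute and reports short as the only entity dropped.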

