Text Extraction - Sample
Here you can see how a text from a PHI cd-rom is extracted and re-encoded for Chiron analysis
using Proteus, the software I created for browsing, reading and extracting texts from such cd-roms.
In this sample I first open the Thesaurus Linguae Graecae database in the left Corpus pane,
where you can see all the files on the cd. I then locate Homer, browse his works and read
the Iliad in the main pane. The text encoding conversion components are the same ones used
by several other software tools of mine, such as Theuth,
Ibis or Cadmus.
Here you can see the text both in its original Beta Code form and in its converted form (Unicode
text in this case).
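To give an idea of what such a conversion involves, here is a minimal Python sketch. It is not
the code used by Proteus: the function name beta_to_unicode and the tables are hypothetical, and
only a small subset of Beta Code is covered (plain letters, the "*" capital marker, breathings,
accents, diaeresis and iota subscript), assuming the uppercase letter convention of TLG files.
Formatting escapes, punctuation codes and numbered sigma variants are left untouched.

```python
import re
import unicodedata

# Beta Code letters mapped to lowercase Greek (subset: the basic alphabet only).
LETTERS = {
    "A": "α", "B": "β", "G": "γ", "D": "δ", "E": "ε", "Z": "ζ",
    "H": "η", "Q": "θ", "I": "ι", "K": "κ", "L": "λ", "M": "μ",
    "N": "ν", "C": "ξ", "O": "ο", "P": "π", "R": "ρ", "S": "σ",
    "T": "τ", "U": "υ", "F": "φ", "X": "χ", "Y": "ψ", "W": "ω",
}

# Beta Code diacritics mapped to Unicode combining characters.
DIACRITICS = {
    ")": "\u0313",   # smooth breathing
    "(": "\u0314",   # rough breathing
    "/": "\u0301",   # acute
    "\\": "\u0300",  # grave
    "=": "\u0342",   # circumflex (perispomeni)
    "+": "\u0308",   # diaeresis
    "|": "\u0345",   # iota subscript
}


def beta_to_unicode(beta):
    """Convert a simple Beta Code string to precomposed (NFC) Unicode Greek."""
    out = []
    i, n = 0, len(beta)
    while i < n:
        ch = beta[i]
        if ch == "*":
            # Capital letter: its diacritics come between '*' and the letter.
            i += 1
            marks = []
            while i < n and beta[i] in DIACRITICS:
                marks.append(DIACRITICS[beta[i]])
                i += 1
            if i < n and beta[i].upper() in LETTERS:
                out.append(LETTERS[beta[i].upper()].upper())
                i += 1
            out.extend(marks)
            continue
        if ch.upper() in LETTERS:
            out.append(LETTERS[ch.upper()])
        elif ch in DIACRITICS:
            out.append(DIACRITICS[ch])
        else:
            out.append(ch)  # spaces, punctuation and unsupported codes pass through
        i += 1
    text = "".join(out)
    # A sigma not followed by another letter or combining mark is word-final.
    text = re.sub(r"σ(?![\w\u0300-\u036f])", "ς", text)
    # Compose base letters and combining marks into single code points.
    return unicodedata.normalize("NFC", text)


print(beta_to_unicode("MH=NIN A)/EIDE QEA\\"))  # μῆνιν ἄειδε θεὰ
```

The sketch emits combining characters first and lets NFC normalization pick the precomposed
code points, which avoids a large lookup table for every letter and diacritic combination.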
Just to show some other features, I then browse the word forms index of the TLG cd-rom and find a
specific word in it; once found, double clicking on it takes you to the details of its occurrences for
each work as stored in the TLG index. This feature is useful when studying the correlation between
textual frequency and the phonological extent of words (cf. the well-known observations made by Zipf),
which was used as one of the several parameters in defining the behaviour of
Chiron's syntactic analysis.
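As an illustration of the kind of measurement meant here, the following Python sketch correlates
word length with the logarithm of word frequency. It is only a sketch under stated assumptions,
not what Chiron does: the function names are hypothetical, the number of letters is used as a
crude stand-in for phonological extent, and the counts in the usage example are invented toy
values, not figures taken from the TLG index.

```python
import math
import unicodedata


def letter_count(word):
    """Count the letters of a word, ignoring accents and breathings."""
    decomposed = unicodedata.normalize("NFD", word)
    return sum(1 for ch in decomposed if unicodedata.category(ch).startswith("L"))


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long number lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    dev_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    dev_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    if dev_x == 0 or dev_y == 0:
        raise ValueError("correlation is undefined when one variable is constant")
    return cov / (dev_x * dev_y)


def length_frequency_correlation(counts):
    """Correlate word length with log frequency for a word -> count mapping."""
    words = list(counts)
    lengths = [letter_count(w) for w in words]
    log_freqs = [math.log(counts[w]) for w in words]
    return pearson(lengths, log_freqs)


# Invented toy counts, for demonstration only.
toy_counts = {"καί": 1000, "δέ": 800, "ἄνθρωπος": 10, "πολυφλοίσβοιο": 2}
r = length_frequency_correlation(toy_counts)
print(f"length vs. log-frequency correlation: {r:.2f}")
```

A negative coefficient means that the more frequent words tend to be the shorter ones, which is
the tendency Zipf observed.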
Also, to show the usage of other indexes, I browse the bottom Classification pane, which lists the
several different classification types of all the works included in the TLG cd.
Finally I proceed to extract a work, in this sample the Iliad. I choose an output
folder and the output format (RTF: specifying no options, as here, will apply Unicode output encoding,
but you could also choose any other encoding, even a non-standard one as typically conveyed by some
fonts like Greek, SuperGreek, etc.). Leaving out partial export and the joining of hyphenated word forms
(which is not required for a poetic work like this), I then choose to split the work so that
each output file will contain a single book. Finally I choose to add some markings to the output
text: without delving into details, I just pick the "poetry" preset, which adds a couple of
marks with a step of 5 lines. I then click the Export button and after a few seconds I have
my RTF texts with Unicode encoding. I just have to open these texts and save them as plain
Unicode text, and Chiron will be able to parse and analyze them.
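The splitting and marking options can be pictured with a small Python sketch. It is purely
illustrative and does not reproduce Proteus or its output format: the function names, the
(book, text) input pairs, the right-aligned number prefix and the file naming pattern are all
assumptions, and the files are written directly as UTF-8 plain text rather than RTF.

```python
from pathlib import Path


def mark_lines(lines, step=5):
    """Prefix every step-th line with its number; pad the other lines."""
    marked = []
    for number, text in enumerate(lines, start=1):
        prefix = f"{number:>4} " if number % step == 0 else " " * 5
        marked.append(prefix + text)
    return marked


def split_by_book(verses):
    """Group (book, text) pairs into a book -> list of lines mapping."""
    books = {}
    for book, text in verses:
        books.setdefault(book, []).append(text)
    return books


def export_books(verses, out_dir, step=5):
    """Write one UTF-8 text file per book and return the paths written."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    paths = []
    for book, lines in split_by_book(verses).items():
        path = out / f"book_{book:02d}.txt"  # hypothetical naming pattern
        path.write_text("\n".join(mark_lines(lines, step)) + "\n", encoding="utf-8")
        paths.append(path)
    return paths
```

For example, export_books([(1, "first line"), (1, "second line"), (2, "another line")], "out")
would write out/book_01.txt and out/book_02.txt, each restarting its line count at 1.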