Text Extraction - Sample

Here you can see how a text from a PHI CD-ROM is extracted and re-encoded so that it can be fed to Chiron for analysis. This is done with Proteus, the software I have created for browsing, reading and extracting texts from such CD-ROMs.

In this sample I first open the Thesaurus Linguae Graecae database in the left Corpus pane, where you can see all the files on the CD. I then locate Homer, browse his works and read the Iliad in the main pane. The text encoding conversion components are the same ones used by several of my other software tools, such as Theuth, Ibis and Cadmus. Here you can see the text both in its original Beta Code form and in its converted form (Unicode text in this case).
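To give a rough idea of what a Beta Code to Unicode conversion involves, here is a minimal Python sketch. It is not the code of the conversion components mentioned above: the function name `beta_to_unicode` is hypothetical, and the sketch covers only the basic letters, the asterisk for capitals, the common diacritics and the final sigma, leaving out the many other escape codes of real TLG data.

```python
import re
import unicodedata

# Basic Beta Code letters and their lowercase Greek equivalents.
LETTERS = dict(zip("ABGDEZHQIKLMNCOPRSTUFXYW",
                   "αβγδεζηθικλμνξοπρστυφχψω"))

# Beta Code diacritics mapped to Unicode combining marks.
MARKS = {
    ")": "\u0313",   # smooth breathing
    "(": "\u0314",   # rough breathing
    "/": "\u0301",   # acute
    "\\": "\u0300",  # grave
    "=": "\u0342",   # circumflex (perispomeni)
    "+": "\u0308",   # diaeresis
    "|": "\u0345",   # iota subscript
}


def beta_to_unicode(beta):
    """Convert a simplified subset of Beta Code to precomposed Unicode."""
    out = []
    upper = False   # set by '*', cleared by the next letter
    pending = []    # diacritics typed between '*' and the capital letter
    for ch in beta.upper():
        if ch == "*":
            upper = True
        elif ch in MARKS:
            # After '*' the marks precede the letter, so hold them back.
            (pending if upper else out).append(MARKS[ch])
        elif ch in LETTERS:
            if upper:
                out.append(LETTERS[ch].upper())
                out.extend(pending)
                pending, upper = [], False
            else:
                out.append(LETTERS[ch])
        else:
            out.append(ch)  # spaces, punctuation, unsupported codes
    # A sigma not followed by a letter or combining mark is word-final.
    text = re.sub(r"σ(?![\w\u0300-\u036f])", "ς", "".join(out))
    # Compose base letters and marks into precomposed characters.
    return unicodedata.normalize("NFC", text)


print(beta_to_unicode("MH=NIN A)/EIDE QEA\\"))
```

Building the text with combining marks and leaving the composition to NFC normalization keeps the mapping tables small, at the cost of relying on whatever precomposed characters Unicode happens to define.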

Just to show some other features, I then browse the word forms index of the TLG CD-ROM and look up a specific word in it; once it is found, double-clicking it takes you to the details of its occurrences in each work, as stored in the TLG index. This feature is useful when studying the correlation between textual frequency and the phonological extent of a word (cf. the well-known observations made by Zipf), which was used as one of several parameters in defining the behaviour of Chiron's syntactic analysis.
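The kind of frequency-versus-length comparison mentioned here can be sketched in a few lines of Python. This is only an illustration under simplifying assumptions: it counts word forms from a list of tokens rather than reading the TLG index, it uses the number of characters as a crude stand-in for phonological extent, and the function name `mean_frequency_by_length` is hypothetical.

```python
import unicodedata
from collections import Counter, defaultdict


def mean_frequency_by_length(words):
    """Return {word length: mean frequency of the forms of that length}."""
    # Normalize so that a precomposed letter counts as one character.
    counts = Counter(unicodedata.normalize("NFC", w) for w in words)
    by_length = defaultdict(list)
    for form, frequency in counts.items():
        by_length[len(form)].append(frequency)
    return {n: sum(f) / len(f) for n, f in sorted(by_length.items())}


sample = ["kai"] * 3 + ["de"] * 2 + ["mhnin"]
print(mean_frequency_by_length(sample))
```

If Zipf's observation holds for a corpus, the mean frequency should tend to fall as the length grows; a toy sample like the one above is of course far too small to show anything.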

Also, to show the use of other indexes, I browse the bottom Classification pane, which lists the different classification types for all the works included in the TLG CD.

Finally I proceed to extract a work, in this sample the Iliad. I choose an output folder and the output format (RTF: specifying no options, as here, applies Unicode output encoding, but you could also choose any other encoding, even a non-standard one of the kind typically tied to fonts such as Greek, SuperGreek, etc.). Leaving out partial export and the joining of hyphenated word forms (which is not required for a poetic work like this), I then choose to split the work so that each output file contains a single book. Finally I choose to add some markings to the output text: without delving into details, I just pick the "poetry" preset, which adds a couple of marks with a step of 5 lines. I then click the Export button and after a few seconds I have my RTF texts with Unicode encoding. I just have to open these texts and save them as plain Unicode text, and Chiron will be able to parse and analyze them.
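The splitting and marking steps can be pictured with a small Python sketch. It does not reproduce the actual output of Proteus or of its "poetry" preset: the mark format (a line number in the margin every 5 lines), the file naming and the function names `mark_lines` and `split_into_books` are all assumptions made for illustration.

```python
def mark_lines(lines, step=5):
    """Put the line number in the margin of every step-th line."""
    width = len(str(len(lines)))  # margin wide enough for the last number
    marked = []
    for number, line in enumerate(lines, start=1):
        label = str(number) if number % step == 0 else ""
        marked.append(f"{label:>{width}}  {line}")
    return marked


def split_into_books(work, step=5):
    """Map each book of a work to its own output file.

    `work` maps a book number to the list of its lines; the result maps a
    hypothetical file name to the marked text of that book.
    """
    return {
        f"book{book:02d}.txt": "\n".join(mark_lines(lines, step))
        for book, lines in sorted(work.items())
    }


for line in mark_lines(["a", "b", "c", "d", "e", "f"]):
    print(line)
```

Restarting the numbering in each book matches the usual way of citing poetry by book and line.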