Words Words Words

Tagging word etymology with Python and HTML.

 

Update to Word Root Searches

In my last post, I covered the techniques I was using to deal with failed lookups: removing suffixes and searching for root words.

My initial list of suffixes was modest:

  • -ed
  • -ing
  • -ly
  • -es
  • -ies

But even this got complicated, since I had to handle suffixes preceded by consonants differently from suffixes preceded by vowels, and the script still missed a lot of words.

I expanded this list after watching the script roll through a whole block of text and noting similarities among the words that were not being found in the Online Etymology Dictionary. These included words like:

  • genealogical (root: genealogy)
  • observing (root: observe)
  • shuffling (root: shuffle)

and so on. From each cluster of words I derived the missing suffix checks that I needed to add to my code. The (significantly expanded) list is as follows:

  • -ed
  • -ing
  • -ly
  • -es
  • -ies
  • -er
  • -XXed
  • -en
  • -s
  • -est
  • -ied
  • -ail
  • -ation
  • -ian
  • -ist
  • -ism
  • -ual
  • -iness
  • -liness
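As an aside, the same checks could in principle be expressed as a data-driven table rather than a chain of conditionals. This is not the code I'm using (the actual implementation follows below); it's just a compact Python 3 sketch of the idea, and the suffix-to-replacement pairs shown are illustrative, not the full WWW list:

```python
# Sketch of table-driven suffix stripping: each entry maps a suffix
# to the strings that replace it when generating candidate root words.
# The pairs below are illustrative, not the full list used by WWW.
SUFFIX_RULES = {
    'ied': ['y'],        # occupied -> occupy
    'ies': ['y'],        # berries -> berry
    'ing': ['', 'e'],    # observing -> observ, observe
    'ed':  ['', 'e'],    # tired -> tir, tire
    'ly':  [''],         # quickly -> quick
    'es':  ['', 'e'],    # shuffles -> shuffl, shuffle
}

def candidate_roots(word):
    candidates = []
    # Check longer suffixes first so '-ied' wins over '-ed'.
    for suffix in sorted(SUFFIX_RULES, key=len, reverse=True):
        if word.endswith(suffix) and len(word) > len(suffix):
            for replacement in SUFFIX_RULES[suffix]:
                candidates.append(word[:-len(suffix)] + replacement)
            break
    return candidates

print(candidate_roots('occupied'))   # ['occupy']
print(candidate_roots('observing'))  # ['observ', 'observe']
```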

Seeing this horrible nest of if/elif/else statements gave me a renewed appreciation for the complexity of English. Seeing how many "special case" suffixes still fell through the cracks of the case statement, in spite of its complexity, made me realize just how complicated the language mechanism in our brains must be.

To add to the complication, I had to add checks on the length of the word, to make sure that the word was longer than the suffix! (Checking for a five-letter suffix on a four-letter word would raise an exception...)
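The distinction is that Python slicing truncates silently, while negative indexing past the start of a short string raises an IndexError. A quick illustrative snippet (not from the WWW code):

```python
word = "bed"

# Slicing out of range is safe: it just returns what's available.
assert word[-6:] == "bed"

# But indexing a position that doesn't exist raises IndexError,
# which is why the word-length check has to come first.
try:
    word[-4]
except IndexError as e:
    print("caught:", e)
```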

Here is the full suffix check as it currently stands:

def get_root_word(self,the_word):

    print ""
    print "Looking for roots of %s..."%(the_word)


    # pick out synonyms from the synset that are basically the same word (share first 3 letters)
    # (and also strip them of part of speech,
    #  n., v., and so on...)
    #
    # synsets look like:
    # swing.n.04
    # 
    # so use .split('.')[0] (first token before .)
    #
    try:
        full_synset = Word(the_word).synsets
        full_synset_strings = [syn.name().split('.')[0] for syn in full_synset]
    except:
        full_synset_strings = []

    # only keep the suggested synset 
    # if the first three letters of the synset
    # match the first three letters of the original word
    # (synset returns lots of diff words...)
    synset = []
    for sss in full_synset_strings:
        if sss[:3] == the_word[:3]:
            synset.append(sss)


    # first try removing any common suffixes
    if len(the_word)>4:

        # -ed
        if the_word[-2:]=='ed':

            # -XXed to -X
            # wrapped to wrap, begged to beg
            if the_word[-4]==the_word[-3]:
                synset.insert(0,the_word[:-3])

            # -ied 
            # occupied to occupy
            elif the_word[-3:]=='ied':
                synset.insert(0,the_word[:-3]+"y")

            else:

                # -ed to -
                # consonant, more likely, so prepend
                synset.insert(0,the_word[:-2])

                # -ed to -e
                # tired to tire
                synset.append(the_word[:-1])

        # -en
        if the_word[-2:]=='en':
            # -en to -
            # quicken to quick
            synset.insert(0,the_word[:-2])

            # -en to -e
            # shaven to shave
            synset.append(the_word[:-1])

        # -er
        if the_word[-2:]=='er':
            # -er to -
            # thicker to thick
            synset.insert(0,the_word[:-2])

            # -er to -e
            # shaver to shave
            synset.append(the_word[:-1])

        # -est
        if the_word[-3:]=='est':
            # -est to -
            # brightest to bright
            synset.insert(0,the_word[:-3])

            # -est to -e
            # widest to wide
            synset.append(the_word[:-2])

        # -ing
        if the_word[-3:]=='ing':
            # -ing to -
            synset.insert(0,the_word[:-3])
            # -gging to -g
            # -nning to -n
            synset.append(the_word[:-4])
            # -ing to -e
            synset.append(the_word[:-3]+"e")

        # -ly
        if the_word[-2:]=='ly':
            # -ly to -
            synset.insert(0,the_word[:-2])


    # end if len>4


    # -s/-es
    if the_word[-1:]=='s':

        # -liness
        if len(the_word)>6:
            if the_word[-6:]=='liness':
                # -liness to -
                # friendliness to friend
                synset.insert(0,the_word[:-6])

            # -iness
            elif the_word[-5:]=='iness':
                # -iness to -y
                # happiness to happy
                synset.insert(0,the_word[:-5]+"y")

        # -ies 
        # -es
        if the_word[-2:]=='es':
            if the_word[-3:]=='ies':
                # -ies to -y
                synset.insert(0,the_word[:-3]+"y")
            else:
                # -es to -
                synset.insert(0,the_word[:-2])
                # -es to -e
                synset.append(the_word[:-1])

        # -s to -
        else: 
            synset.insert(0,the_word[:-1])


    if len(the_word)>5:
        if the_word[-5:]=='ation':
            # -ation to -ate
            # accumulation to accumulate
            synset.insert(0,the_word[:-5]+"ate")


    if synset:
        print "  Trying these: %s"%( ", ".join(synset) )

    return synset

Creating New Pelican Templates

The Words Words Words (WWW) library uses the Online Etymology Dictionary and HTML to color-tag each word in a body of text based on its etymology and root language. This means that the output of the WWW scripts is usually a directory full of HTML files (one HTML file per chapter).

I had to figure out what to do with this HTML, and how to embed it in Pelican pages. (Pelican is the Python static site generator that I am using to create the WWW project page.)

The Plugin Approach

I began by writing a Pelican plugin that would allow me to use Liquid tags to include HTML files into a Markdown document, like this:

{% include_html 'some_html_file.html' %}

The problem with this approach, though, is that the plugin imports the HTML file as one big string in the final Markdown document, holding it in memory while the rest of the document is constructed. This makes the plugin approach unbearably slow: generating the website content with more than one book would take upwards of an hour, and even minor changes required re-making the entire site. That was neither acceptable nor scalable, since my goal was to add a large number of books.

The Template Approach

I hit upon a solution when I delved into the template features of Pelican. Pelican allows you to define new templates, and you can inject large HTML documents directly using Jinja templating syntax. This means you can create a template like dummy.html in the templates/ directory of your theme, and insert HTML documents using include statements:

{% include '_includes/some_html_file.html' %}

Then you can edit your pelicanconf.py file to tell Pelican about your new template:

TEMPLATE_PAGES = {}
TEMPLATE_PAGES['dummy.html'] = 'custom/path/to/dummy/index.html'

The key is the name of the template file; the value is the custom URL path that you want the template to have.

The Pelican Solution

I still had a problem, however: a given book might have upwards of 50 chapters, meaning 50 HTML files. Still not a scalable solution.

But Python came to the rescue! I was able to use Python to accomplish three tasks:

  • Create a Python script to automatically create a new HTML template file for each book chapter;
  • Create a central index file for each book, with buttons for each book chapter;
  • Populate the TEMPLATE_PAGES dictionary automatically.

This last bullet is possible because the config file is written in... Python!

Creating HTML Template Files

In my templates directory, I have a script that automatically creates an HTML file that has the theme's header and footer, and populates the page content with the HTML files generated by WWW scripts. Here is my script make_dubliners.py, for creating pages for each chapter of James Joyce's Dubliners:

for im1 in range(15):
    i = im1+1
    filename = "dubliners%d.html"%(i)

    content = ""
    content += "{% extends 'base.html' %}\n"
    content += "{% block title %}Dubliners &mdash; {{ SITENAME }}{% endblock %}\n"
    content += "{% block content %}\n\n"

    content += "{% include '_includes/"
    content += filename 
    content += "' %}\n\n"

    content += "{% endblock %}\n"


    print "writing html file %s..."%(filename)
    with open(filename,'w') as f:
        f.write(content)

print "done"

Populating TEMPLATE_PAGES

Here is how I automatically populated the TEMPLATE_PAGES variable for James Joyce's book Dubliners:

TEMPLATE_PAGES['dubliners.html'] = 'dubliners/index.html'
for im1 in range(15):
    i = im1+1
    key = 'dubliners%d.html'%(i)
    val = 'dubliners/%d/index.html'%(i)
    TEMPLATE_PAGES[key] = val
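The loop produces sixteen entries in all: the index page plus one per chapter. Here is a standalone check (in Python 3) of what the dictionary ends up containing:

```python
# Reproduce the TEMPLATE_PAGES loop and verify its contents.
TEMPLATE_PAGES = {}
TEMPLATE_PAGES['dubliners.html'] = 'dubliners/index.html'
for im1 in range(15):
    i = im1 + 1
    key = 'dubliners%d.html' % i
    val = 'dubliners/%d/index.html' % i
    TEMPLATE_PAGES[key] = val

# One index page plus 15 chapter pages.
assert len(TEMPLATE_PAGES) == 16
assert TEMPLATE_PAGES['dubliners7.html'] == 'dubliners/7/index.html'
```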

Searching for Word Roots

Some of the most recent improvements to the Words Words Words (WWW) code have been in how it deals with a failure to find a word in the Online Etymology Dictionary. Some of these failures are due to a lack of etymology information (the name "Eliza", for example). But other failures are because we are looking for a conjugated verb, or a past tense form, or a noun-made-adverb, etc.

For this reason, we can greatly improve our tag coverage with a few tricks. This code is in the file etymology/EtymologyCSV.py in the repository, and is in the method EtymologyCSV::get_root_word.

def get_root_word(self,the_word):

    print ""
    print "Looking for roots of %s..."%(the_word)

I use two methods in my code:

  • TextBlob Synsets - this uses the TextBlob library to look for similar words, which often include root words.

  • Common Suffixes - this tests for common suffixes, removes them, and creates a list of the resulting (possible) root words.

TextBlob Synsets

The first thing to do is to use TextBlob, a Python library, to search for the word's "synsets" - sets of similar words. While these synsets are often scattershot and include a wide range of dissimilar words, they can sometimes contain the unconjugated form of a verb, or a form without a suffix.

To get the synsets, you have to create a TextBlob word:

In [4]: from textblob import Word

In [7]: w = Word("looking")

In [8]: print w.synsets
[Synset('look.n.02'), Synset('looking.n.02'), Synset('look.v.01'), Synset('look.v.02'), Synset('look.v.03'), Synset('search.v.02'), Synset('front.v.01'), Synset('attend.v.02'), Synset('look.v.07'), Synset('expect.v.03'), Synset('look.v.09'), Synset('count.v.08'), Synset('looking.s.01')]

You can see the format of the synsets from the output. We can get the word by itself using .split('.')[0]:

    try:
        full_synset = Word(the_word).synsets
        full_synset_strings = [syn.name().split('.')[0] for syn in full_synset]
    except:
        full_synset_strings = []

Now we need a way of discarding irrelevant words in the synset. I found that requiring the first three letters to match was a sufficient criterion for almost every case.

    synset = []
    for sss in full_synset_strings:
        if sss[:3] == the_word[:3]:
            synset.append(sss)
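Applied to the synsets shown above for "looking", this filter keeps only the entries that actually start like the original word. A standalone Python 3 sketch (no TextBlob required; the strings are taken from the session output above):

```python
the_word = 'looking'
# Word stems from the earlier TextBlob session for "looking".
full_synset_strings = ['look', 'looking', 'look', 'look', 'look',
                       'search', 'front', 'attend', 'look', 'expect',
                       'look', 'count', 'looking']

# Keep only stems whose first three letters match the original word.
synset = [sss for sss in full_synset_strings if sss[:3] == the_word[:3]]
print(synset)  # 'search', 'front', 'attend', 'expect', 'count' are gone
```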

Common Suffixes

The next task to accomplish with the code was removing common suffixes to create additional (possible) root words, which could then be looked up in lieu of the original at the Online Etymology Dictionary.

A list of suffixes I checked for:

  • -ed
  • -ing
  • -ly
  • -es
  • -ies

There are two cases for removing suffixes: preceded by a consonant, and preceded by a vowel. The consonant case is more common, so these are added to the beginning of the list of possible root words. The vowel cases are added to the end.

This may seem hacky and may generate a few false positives, but it works surprisingly well without being overly intricate.

    # first try removing any common suffixes

    # -ed
    if the_word[-2:]=='ed':
        # -ed to -
        # consonant, more likely, so prepend
        synset.insert(0,the_word[:-2])

        # -ed to -e
        synset.append(the_word[:-1])

    # -ing
    if the_word[-3:]=='ing':
        # -ing to -
        synset.insert(0,the_word[:-3])
        # -gging to -g
        # -nning to -n
        synset.append(the_word[:-4])

    # -ly
    if the_word[-2:]=='ly':
        # -ly to -
        synset.insert(0,the_word[:-2])

    # -es
    if the_word[-2:]=='es':
        if the_word[-3:]=='ies':
            # -ies to -y
            synset.insert(0,the_word[:-3]+"y")
        else:
            # -es to -
            synset.insert(0,the_word[:-2])
            # -es to -e
            synset.append(the_word[:-1])

The Final Step

Once these lists of possible root words have been assembled, they are returned to the main portion of the code, where they are each looked up on the Online Etymology Dictionary.

    if synset:
        print "  Trying these: %s"%( ", ".join(synset) )

    return synset
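To close the loop, here is a hypothetical sketch (in Python 3) of how a caller might consume this list: try the original word first, then each candidate root, until one lookup succeeds. The `lookup` and `get_root_word` callables below are stand-ins for illustration, not the repository's actual methods:

```python
def lookup_with_roots(the_word, lookup, get_root_word):
    """Try the word itself first, then each candidate root word.

    `lookup` is a stand-in for the Online Etymology Dictionary query:
    it returns an etymology entry, or None on a failed lookup.
    `get_root_word` returns the candidate list built above.
    """
    entry = lookup(the_word)
    if entry is not None:
        return the_word, entry

    for candidate in get_root_word(the_word):
        entry = lookup(candidate)
        if entry is not None:
            return candidate, entry

    return None, None

# Toy example: only "tire" is "in the dictionary".
fake_dictionary = {'tire': '<etymology entry>'}
word, entry = lookup_with_roots(
    'tired',
    lookup=fake_dictionary.get,
    get_root_word=lambda w: ['tir', 'tire'],
)
print(word)  # 'tire'
```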