Words Words Words

Tagging word etymology with Python and HTML.

 


 


 

Parsing Gutenberg with Beautiful Soup

The starting point for the Words Words Words (WWW) project was the book - the original goal was to tag the etymology of every word of an entire book. Project Gutenberg was a natural source of material.

I parsed each Gutenberg book using a two-pass strategy.

In the first pass of the book, I extracted all of the text from the file and assembled word counts using the TextBlob library. This gave me a list of the unique words appearing in that book.

This list was then used to look up the etymology of each word on the Online Etymology Dictionary.

In the second pass of the book, I added formatting tags to each word based on its root language, then saved the modified HTML document to a new file that could be published on the WWW project website.

Pass 1: Extracting Text from a Gutenberg Book

The process of extracting all text from a Gutenberg book involves using BeautifulSoup to locate all of the paragraph tags, extracting the text from those tags, and counting the words in the resulting string.

Here is some code to do that, without counting words in the table of contents:

# ------
# page text
# <p>
# the texttags contain the text
# (soup is the BeautifulSoup object for the Gutenberg HTML file)
print "Turning HTML into text..."
texttags_all = soup.findAll('p',text=True)
texttags = []
for tta in texttags_all:
    # skip paragraphs that are part of the table of contents
    if 'class' in tta.attrs.keys():
        if tta.attrs['class']=='toc':
            continue
    texttags.append(tta)
print "len(texttags) =",len(texttags)

all_text = []
for tt in texttags:
    all_text.append(tt.string)

Now the variable all_text is a list of strings, one list element for each paragraph of the text. We can join each of these together into a single string, and create a TextBlob object from that large string.

s = " ".join(all_text)
s = unicode(s)

t = TextBlob(s)
print "done"

TextBlob is a text processing library that will give us access to some convenient functionality.

Now that list of words can be put in a data container (I am using a Pandas DataFrame):

print "Getting word counts..."
wc = t.word_counts
print "done"

import pandas as pd

words = pd.DataFrame([])

print "Populating words..."
for the_word in wc.keys():
    the_word = the_word.lower()
    d = {}
    d['word'] = the_word.encode('utf-8')
    d['word count'] = wc[the_word]

    words = words.append([d])

print "done"

Optionally, you can sort words according to their word counts:

print "Reindex according to word count ranking..."
words = words.sort(columns=['word count','word'],ascending=[False,True])
words.index = range(1,len(words)+1)
print "done"

Finally, I export the DataFrame to a CSV file. This CSV file then provides a convenient starting point for picking up where this script leaves off, which we will need to do for our second pass of the Gutenberg document.

print "Exporting to file..."
words.to_csv(self.words_csv_file,index=False,na_rep="")
print "done"

Intermediate Step: Looking Up Word Etymology

This step, which will be covered in a later post, involves iterating through each unique word, determining if the word has a root word, removing suffixes, finding unconjugated forms of verbs, etc.

This populates elements in the list with etymology information, when it can be found.
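
Although the details of the lookup are deferred to that later post, the shape of this step is roughly as follows. This is only a sketch: look_up_root_language is a stand-in for the Mechanize/Beautiful Soup code that queries the Online Etymology Dictionary, and get_root_word is the suffix-stripping helper described further down.

print "Looking up etymologies..."
root_langs = []
for the_word in words['word']:

    # try the word as-is first
    lang = look_up_root_language(the_word)

    # on a miss, fall back to the candidate root words
    # (suffixes stripped, unconjugated forms, synonyms)
    if lang is None:
        for candidate in self.get_root_word(the_word):
            lang = look_up_root_language(candidate)
            if lang is not None:
                break

    root_langs.append(lang)

words['root language'] = root_langs
print "done"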

Pass 2: Modifying Gutenberg HTML to Add Tags

Now that we have a list of each unique word in the book and its root language, we iterate through each paragraph again, but this time we check each word to see if etymology information is available for it. If so, we wrap the word in span tags.

My script works one chapter at a time.

First, I look for all h2 tags, throwing out the first one (which is the name of the author):

h2tags = [tt for tt in soup.findAll('h2')]
h2tags = h2tags[1:]

ich=1
for h2tag in h2tags:
    print "Tagging chapter heading",ich

    h2txt = h2tag.string

    new_body = []

    new_body.append(unicode(h2tag))

    chapter_file = self.name_+str(ich)+".html"

We are at the chapter heading. We search for the next sibling tag, which should be either a paragraph tag or another h2 tag for the next chapter heading:

    nexttag = h2tag.findNextSibling(['p','h2'])
    if nexttag is None:
        # nothing left in the document, 
        # so exit this loop
        break

    ip = 0
    while True:

        if nexttag is None:
            break
        if nexttag.name=='h2':
            # we're done with all the <p> tags 
            # in this chapter
            break

        ip += 1
        if ip%25==0:
            print "Paragraph",ip

Now we process the text in the paragraph. The strategy is to split sentences like

"Tagging word etymology with python."

into something like:

["Tagging","word","etymology","with","python"]

If a word etymology is found, the word is wrapped in span tags, like so:

[ ..., '<span class="latin">python</span>']

First, split the paragraph string.

Iterate through each word, and search for a root language in the word DataFrame.

If something is found, create the appropriate span tag.

try:
    split = nexttag.string.split()
except AttributeError:
    # no text
    split = []

if split != []:
    # words_w_lang holds the rows of the word DataFrame that have a root language;
    # languages_key maps the full language name to the short CSS class name
    for _,word_row in words_w_lang.iterrows():

        word = word_row['word']
        full_lang = word_row['root language']
        lang = languages_key[full_lang]

        for it,token in enumerate(split):
            if token.lower() == word.lower():
                split[it] = '<span class="' + lang + '">' + token + '</span>'

    new_html = ' '.join(split)

    new_body.append( new_html )

Once we've finished, we check to see whether there are any remaining p or h2 tags to process.

    # increment the tag now, 
    # and do a null check 
    # (if no tag, bail out)
    nexttag = nexttag.findNextSibling(['p','h2'])
    if nexttag is None:
        # nothing left in the document, 
        # so exit this loop
        break

Finally, this brings us to the end of the chapter (or of the book itself). At this point, we use all of the strings we have been assembling and appending to new_body to create a new BeautifulSoup document:

print "done with chapter"

print "Making some soup"
soup = BeautifulSoup(' '.join(new_body))
print "done"

We then write that soup to the chapter HTML file, and proceed to the next chapter:

print "Writing to file",chapter_file
with open(self.dest_dir+"/"+chapter_file,'w') as f:
    f.write(soup.prettify().encode('utf-8'))
print "done"

ich += 1

That's all there is to it.

Using Python to Crack the Etymology of Dubliners

The short story collection Dubliners by Irish author James Joyce has always been special to me. When I was a freshman in high school, my English teacher, Dr. Miller - Doctor Mike - assigned us Dubliners as our required reading. Joyce can be bewildering to graduate students, let alone a high school freshman, so needless to say, it was over my head. But the stories were seared into my memory, and I have returned to them again and again throughout my life. My changing understanding of each story is a reflection of the evolution of my perspective.

I was inspired by this post from the Ideas Illustrated blog, referenced in a post on the Johnson (language) blog in the Economist, to try my hand at tagging word etymology in the same way. My aim was to apply the technique to an entire book - or rather, to a set of short stories.

Specifically, I wanted to tag the word etymologies of James Joyce's Dubliners.

I set to work, and created code and a web page for the Words Words Words project on GitHub. Here's a description of the project from its web page:

Words Words Words uses a handful of Python libraries to do its primary tasks: parse text, look up words on a web page, extract and process the result, and convert the original text into HTML, color-coding each word in the process with its etymological root language.

  • To parse the text and extract unique words, I'm using the Natural Language Toolkit.
  • To scrape the web, I'm using Mechanize.
  • To obtain etymological root languages for words, I'm using the Online Etymology Dictionary.
  • To process the resulting HTML, I'm using Beautiful Soup.
  • To deal with all the data resulting from these tasks, I'm using Pandas.
  • To tag each word, I'm just using Python's built-in list and string types.
  • To pull all of the tagged HTML, CSS stylesheets, and JS together, I'm using Pelican (my preferred Python alternative to Ruby's Jekyll).

The result is something like what you see at the header of the landing page. Python is used to tag words with root languages using the Online Etymology Dictionary. Each of Joyce's short stories is tagged in this way, and each story is on its own page. Here is the table of contents for the tagged version of Dubliners.

Right away it's clear that the dominant language in Joyce's writing is French - the entire text is awash in purple words. And unlike the smatterings of green German words or pale yellow English words, the French words are not commonly-recurring articles or prepositions; they are the more complex words, like "priesthood" and "chalice," "scrupulous" and "mercy," containing the intellectual meat of the story.

There's also a steady smattering of Old Norse ("kitchen," "hand," "door," "road"), as well as the occasional word reaching way back in time to recall Old French roots, like "bazaar."

Somewhat surprising (to me, anyway) was the infrequency of Latin and Greek words. I suppose that tagging text about Stephen Dedalus might dip more heavily into those (Ulysses is on the list of books to tag next), but in Dubliners at least, words with Latin and Greek roots are somewhat rare.

Even Sanskrit shows up in Dubliners: in the roots of the word "tobacco."

Update to Word Root Searches

In my last post, I covered the techniques I was using to deal with failed lookups - removing suffixes and looking for root words.

My initial list of suffixes was modest:

  • -ed
  • -ing
  • -ly
  • -es
  • -ies

But even this got complicated, as I had to check for suffixes preceded by consonants versus vowels, and it still led to a lot of misses.

I expanded this list after watching the script roll through a whole block of text and taking note of similarities among the words that were not being found in the Online Etymology Dictionary. These included words like:

  • genealogical (root: genealogy)
  • observing (root: observe)
  • shuffling (root: shuffle)

and so on. From each cluster of words I derived the missing suffix checks that I needed to add to my code. The (significantly expanded) list is as follows:

  • -ed
  • -ing
  • -ly
  • -es
  • -ies
  • -er
  • -XXed
  • -en
  • -s
  • -est
  • -ied
  • -ail
  • -ation
  • -ian
  • -ist
  • -ism
  • -ual
  • -iness
  • -liness

Seeing this horrible nest of if/elif/else statements gave me a renewed sense of appreciation for the complexity of English. Seeing how many "special case" suffixes still led to words falling through the cracks, in spite of all that complexity, made me realize just how complicated the language mechanism in our brains can be.

To add to the complication, I had to add checks on the length of the word, to make sure that the word was longer than the suffix! (Checking for a five-letter suffix on a four-letter word can raise an exception...)
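
For example, an out-of-range slice is harmless, but the direct indexing used in the -XXed check below is not:

"beg"[-4:]    # slicing past the start just clamps: returns 'beg'
"beg"[-4]     # direct indexing raises IndexError: string index out of range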

Here is the full suffix check as it currently stands:

# (assumes: from textblob import Word, for the synset lookups)
def get_root_word(self,the_word):

    print ""
    print "Looking for roots of %s..."%(the_word)


    # pick out synonyms from the synset that are basically the same word (share first 3 letters)
    # (and also strip them of part of speech,
    #  n., v., and so on...)
    #
    # synsets look like:
    # swing.n.04
    # 
    # so use .split('.')[0] (first token before .)
    #
    try:
        full_synset = Word(the_word).synsets
        full_synset_strings = [syn.name().split('.')[0] for syn in full_synset]
    except:
        full_synset_strings = []

    # only keep the suggested synset 
    # if the first three letters of the synset
    # match the first three letters of the original word
    # (synset returns lots of diff words...)
    synset = []
    for sss in full_synset_strings:
        if sss[:3] == the_word[:3]:
            synset.append(sss)


    # first try removing any common suffixes
    if len(the_word)>4:

        # -ed
        if the_word[-2:]=='ed':

            # -XXed to -X
            # wrapped to wrap, begged to beg
            if the_word[-4]==the_word[-3]:
                synset.insert(0,the_word[:-3])

            # -ied 
            # occupied to occupy
            elif the_word[-3:]=='ied':
                synset.insert(0,the_word[:-3]+"y")

            else:

                # -ed to -
                # consonant, more likely, so prepend
                synset.insert(0,the_word[:-2])

                # -ed to -e
                # tired to tire
                synset.append(the_word[:-1])

        # -en
        if the_word[-2:]=='en':
            # -en to -
            # quicken to quick
            synset.insert(0,the_word[:-2])

            # -en to -e
            # shaven to shave
            synset.append(the_word[:-1])

        if the_word[-2:]=='er':
            # -er to -
            # thicker to thick
            synset.insert(0,the_word[:-2])

            # -er to -e
            # shaver to shave
            synset.append(the_word[:-1])

        # -est
        if the_word[-3:]=='est':
            # -est to -
            # brightest to bright
            synset.insert(0,the_word[:-3])

            # -est to -e
            # widest to wide
            synset.append(the_word[:-2])

        # -ing
        if the_word[-3:]=='ing':
            # -ing to -
            synset.insert(0,the_word[:-3])
            # -gging to -g
            # -nning to -n
            synset.append(the_word[:-4])
            # -ing to -e
            synset.append(the_word[:-3]+"e")

        # -ly
        if the_word[-2:]=='ly':
            # -ly to -
            synset.insert(0,the_word[:-2])


    # end if len>4


    # -s/-es
    if the_word[-1:]=='s':

        # -liness
        if len(the_word)>6:
            if the_word[-6:]=='liness':
                # -liness to -
                # friendliness to friend
                synset.insert(0,the_word[:-6])

            # -iness
            elif the_word[-5:]=='iness':
                # -iness to -y
                # happiness to happy
                synset.insert(0,the_word[:-5]+"y")

        # -ies 
        # -es
        if the_word[-2:]=='es':
            if the_word[-3:]=='ies':
                # -ies to -y
                synset.insert(0,the_word[:-3]+"y")
            else:
                # -es to -
                synset.insert(0,the_word[:-2])
                # -es to -e
                synset.append(the_word[:-1])

        # -s to -
        else: 
            synset.insert(0,the_word[:-1])


    if len(the_word)>5:
        if the_word[-5:]=='ation':
            # -ation to -ate
            # accumulation to accumulate
            synset.insert(0,the_word[:-5]+"ate")


    if synset != []:
        print "  Trying these: %s"%( ", ".join(synset) )

    return synset

Searching for Word Roots

Some of the most recent improvements to the Words Words Words (WWW) code have been in how it deals with a failure to find a word in the Online Etymology Dictionary. Some of these failures are due to a lack of etymology information (the name "Eliza", for example). But other failures are because we are looking for a conjugated verb, or a past tense form, or a noun-made-adverb, etc.

For this reason, we can greatly improve our tag coverage with a few tricks. This code is in the file etymology/EtymologyCSV.py in the repository, and is in the method EtymologyCSV::get_root_word.

def get_root_word(self,the_word):

    print ""
    print "Looking for roots of %s..."%(the_word)

I use two methods in my code:

  • TextBlob Synsets - this uses the TextBlob library to look for similar words, which often include root words.

  • Common Suffixes - this tests for common suffixes, removes them, and creates a list of the resulting (possible) root words.

TextBlob Synsets

The first thing to do is to use TextBlob, a Python library, to search for the word's "synsets" - sets of similar words. While these synsets are often scattershot and include a wide range of dissimilar words, they can sometimes contain the unconjugated form of a verb, or a form without a suffix.

To get the synsets, you have to create a TextBlob word:

In [4]: from textblob import Word

In [7]: w = Word("looking")

In [8]: print w.synsets
[Synset('look.n.02'), Synset('looking.n.02'), Synset('look.v.01'), Synset('look.v.02'), Synset('look.v.03'), Synset('search.v.02'), Synset('front.v.01'), Synset('attend.v.02'), Synset('look.v.07'), Synset('expect.v.03'), Synset('look.v.09'), Synset('count.v.08'), Synset('looking.s.01')]

You can see the format of the synsets from the output. We can get the word by itself using .split('.')[0]:

    try:
        full_synset = Word(the_word).synsets
        full_synset_strings = [syn.name().split('.')[0] for syn in full_synset]
    except:
        full_synset_strings = []

Now we need a way of discarding irrelevant words in the synset. I found that requiring the first three letters to match was sufficient for almost every case.

    synset = []
    for sss in full_synset_strings:
        if sss[:3] == the_word[:3]:
            synset.append(sss)

Common Suffixes

The next task to accomplish with the code was removing common suffixes to create additional (possible) root words, which could then be looked up in lieu of the original at the Online Etymology Dictionary.

A list of suffixes I checked for:

  • -ed
  • -ing
  • -ly
  • -es
  • -ies

There are two cases for removing suffixes: preceded by a consonant, and preceded by a vowel. The consonant case is more common, so these are added to the beginning of the list of possible root words. The vowel cases are added to the end.

This may seem hacky and may generate a few false positives, but it works surprisingly well without being overly intricate.

    # first try removing any common suffixes

    # -ed
    if the_word[-2:]=='ed':
        # -ed to -
        # consonant, more likely, so prepend
        synset.insert(0,the_word[:-2])

        # -ed to -e
        synset.append(the_word[:-1])

    # -ing
    if the_word[-3:]=='ing':
        # -ing to -
        synset.insert(0,the_word[:-3])
        # -gging to -g
        # -nning to -n
        synset.append(the_word[:-4])

    # -ly
    if the_word[-2:]=='ly':
        # -ly to -
        synset.insert(0,the_word[:-2])

    # -es
    if the_word[-2:]=='es':
        if the_word[-3:]=='ies':
            # -ies to -y
            synset.insert(0,the_word[:-3]+"y")
        else:
            # -es to -
            synset.insert(0,the_word[:-2])
            # -es to -e
            synset.append(the_word[:-1])

The Final Step

Once these lists of possible root words have been assembled, they are returned to the main portion of the code, where they are each looked up on the Online Etymology Dictionary.

    if synset != []:
        print "  Trying these: %s"%( ", ".join(synset) )

    return synset
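
As a quick illustration of what the suffix half of this produces (the synset contributions will vary from word to word):

# "tired"   -> the -ed branch prepends "tired"[:-2] = "tir"
#              and appends             "tired"[:-1] = "tire"
# "carries" -> the -ies branch prepends "carries"[:-3] + "y" = "carry"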
