Parsing Gutenberg with Beautiful Soup

The starting point for the Words Words Words (WWW) project was the book: the original goal was to tag the etymology of every word in an entire book. Project Gutenberg was a natural source of books to start with.

My procedure parses each Gutenberg book using a two-pass strategy.

In the first pass, I extract all text from the file and assemble word counts using the TextBlob library. This yields a list of the unique words appearing in the book.

This list is then used to look up the etymology of each word in the Online Etymology Dictionary.

In the second pass, I add formatting tags to each word based on its root language, then save the modified HTML document to a new file that can be published on the WWW project website.

Pass 1: Extracting Text from a Gutenberg Book

Extracting all text from a Gutenberg book involves using BeautifulSoup to locate all paragraph tags, extracting the text from each of them, and counting the words in the resulting string.

Here is some code to do that, skipping paragraphs that belong to the table of contents:

# ------
# page text
# <p>
# the texttags contain the text
print "Turning HTML into text..."
texttags_all = [tt for tt in soup.findAll('p',text=True)]
texttags = []
for tta in texttags_all:
    # skip paragraphs marked as table-of-contents entries
    if 'class' in tta.attrs.keys() and 'toc' in tta.attrs['class']:
        continue
    texttags.append(tta)
print "len(texttags) =",len(texttags)

all_text = []
for tt in texttags:
    all_text.append(tt.string)

Now the variable all_text is a list of strings, one list element for each paragraph of the text. We can join each of these together into a single string, and create a TextBlob object from that large string.

s = " ".join(all_text)
s = unicode(s)

t = TextBlob(s)
print "done"

TextBlob is a text processing library that will give us access to some convenient functionality.
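TextBlob's word_counts property behaves roughly like a case-insensitive bag of words. As a rough stand-in using only the standard library (an approximation for illustration, not TextBlob's actual tokenizer; shown in Python 3 syntax):

```python
import re
from collections import Counter

def word_counts(text):
    # Lowercase the text, pull out word tokens, and tally them --
    # roughly what TextBlob's word_counts gives us.
    tokens = re.findall(r"[a-z']+", text.lower())
    return Counter(tokens)

wc = word_counts("The quick brown fox jumps over the lazy dog. The dog sleeps.")
print(wc['the'])  # 3
print(wc['dog'])  # 2
```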

The resulting words and their counts can be put in a data container (I am using a Pandas DataFrame):

print "Getting word counts..."
wc = t.word_counts
print "done"

print "Populating words..."
rows = []
for the_word in wc.keys():
    the_word = the_word.lower()
    rows.append({'word': the_word.encode('utf-8'),
                 'word count': wc[the_word]})

# build the DataFrame in one pass instead of appending row by row
words = pd.DataFrame(rows)
print "done"

Optionally, you can sort words according to their word counts:

print "Reindex according to word count ranking..."
words = words.sort(columns=['word count','word'],ascending=[False,True])
words.index = range(1,len(words)+1)
print "done"

Finally, I export the DataFrame to a CSV file. This CSV file then provides a convenient starting point for picking up where this script leaves off, which we will need to do for our second pass of the Gutenberg document.

print "Exporting to file..."
words.to_csv(self.words_csv_file,index=False,na_rep="")
print "done"

Intermediate Step: Looking Up Word Etymology

This step, which will be covered in a later post, involves iterating through each unique word, determining if the word has a root word, removing suffixes, finding unconjugated forms of verbs, etc.

This populates elements in the list with etymology information, when it can be found.
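As a toy illustration of the root-word reduction, here is a sketch in Python 3 syntax. The suffix list and the length guard are illustrative assumptions only, not the actual lookup logic:

```python
# Common suffixes to try stripping -- an illustrative assumption,
# not the real etymology-lookup rules.
SUFFIXES = ['ing', 'ed', 'ly', 's']

def candidate_roots(word):
    """Yield the word itself plus forms with common suffixes stripped."""
    word = word.lower()
    yield word
    for suffix in SUFFIXES:
        # length guard avoids reducing short words to stubs
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            yield word[:-len(suffix)]

print(list(candidate_roots("walking")))  # ['walking', 'walk']
```

Each candidate would then be tried against the dictionary until a match is found.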

Pass 2: Modifying Gutenberg HTML to Add Tags

Now that we have a list of each unique word in the book and its root language, we iterate through each paragraph again, this time checking each word to see whether etymology information is available for it. If so, we wrap the word in span tags.
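The core of this tagging step can be sketched as a small function (Python 3 syntax; the root_language dictionary is a hypothetical stand-in for the words DataFrame the script actually uses):

```python
def tag_paragraph(text, root_language):
    """Wrap each word with a known root language in a <span> tag.

    root_language maps lowercase words to a CSS class name; it is a
    stand-in for the words DataFrame used by the script.
    """
    tokens = text.split()
    for i, token in enumerate(tokens):
        lang = root_language.get(token.lower())
        if lang is not None:
            tokens[i] = '<span class="%s">%s</span>' % (lang, token)
    return ' '.join(tokens)

html = tag_paragraph("Call me Ishmael", {'ishmael': 'hebrew'})
print(html)  # Call me <span class="hebrew">Ishmael</span>
```

As in the script itself, this exact-match comparison will miss a word with punctuation attached (e.g. "Ishmael,").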

My script works one chapter at a time.

First, I look for all h2 tags, throwing out the first one (which is the name of the author):

h2tags = [tt for tt in soup.findAll('h2')]
h2tags = h2tags[1:]

ich=1
for h2tag in h2tags:
    print "Tagging chapter heading",ich

    h2txt = h2tag.string

    new_body = []

    new_body.append(unicode(h2tag))

    chapter_file = self.name_+str(ich)+".html"

We are at the chapter heading. We search for the next sibling tag, which should be either a paragraph tag or another h2 tag for the next chapter heading:

    nexttag = h2tag.findNextSibling(['p','h2'])
    if nexttag is None:
        # nothing left in the document, 
        # so exit this loop
        break

    ip = 0
    while True:

        if nexttag is None:
            break
        if nexttag.name=='h2':
            # we're done with all the <p> tags 
            # in this chapter
            break

        ip += 1
        if ip%25==0:
            print "Paragraph",ip

Now we process the text in the paragraph. The strategy is to split sentences like

"Tagging word etymology with python."

into something like:

["Tagging","word","etymology","with","python"]

If a word etymology is found, the word is wrapped in span tags, like so:

[ ..., '<span class="latin">python</span>']

First, split the paragraph string. Then iterate through each word, searching for a root language in the words DataFrame. If one is found, wrap the word in the appropriate span tag.

try:
    split = nexttag.string.split()
except AttributeError:
    # no text
    split = []

if split != []:
    for _,word_row in words_w_lang.iterrows():

        word = word_row['word']
        full_lang = word_row['root language']
        lang = languages_key[full_lang]

        for it,token in enumerate(split):
            if token.lower() == word.lower():
                split[it] = '<span class="' + lang + '">' + token + '</span>'

    new_html = ' '.join(split)

    new_body.append( new_html )
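One caveat worth noting: a plain string split leaves punctuation attached to tokens, so "python." will not match the dictionary entry "python". A possible refinement (not part of the original script; Python 3 syntax) is to strip surrounding punctuation before comparing:

```python
import string

def normalize(token):
    # Strip leading/trailing punctuation so "python." matches "python";
    # a possible refinement, not what the script above does.
    return token.strip(string.punctuation).lower()

print(normalize("python."))  # python
```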

Once we've finished, we check to see whether there are any remaining p or h2 tags to process.

    # increment the tag now, 
    # and do a null check 
    # (if no tag, bail out)
    nexttag = nexttag.findNextSibling(['p','h2'])
    if nexttag is None:
        # nothing left in the document, 
        # so exit this loop
        break

Eventually this brings us to the end of the chapter. At this point, we use all of the strings we have been appending to new_body to create a new BeautifulSoup document:

print "done with chapter"

print "Making some soup"
soup = BeautifulSoup(' '.join(new_body))
print "done"

We then write that soup to the chapter HTML file, and proceed to the next chapter:

print "Writing to file",chapter_file
with open(self.dest_dir+"/"+chapter_file,'w') as f:
    f.write(soup.prettify().encode('utf-8'))
print "done"

ich += 1

That's all there is to it.