Using Python to Crack the Etymology of Dubliners

The short story collection Dubliners by Irish author James Joyce has always been special to me. When I was a freshman in high school, my English teacher, Dr. Miller - Doctor Mike - assigned us Dubliners as our required reading. Joyce can be bewildering to graduate students, let alone to a high school freshman, so needless to say, it was over my head. But the stories were seared into my memory, and I have returned to them again and again throughout my life. My changing understanding of each story reflects how my perspective has evolved.

I was inspired by this post from the Ideas Illustrated blog, referenced in a post on the Johnson (language) blog in the Economist, to try my hand at tagging word etymology in the same way. My aim was to apply the technique to an entire book - or rather, to a set of short stories.

Specifically, I wanted to tag the etymology of every word in James Joyce's Dubliners.

I set to work, and created code and a web page for the Words Words Words project on GitHub. Here's a description of the project from its web page:

Words Words Words uses a couple of Python libraries to do its primary tasks: parse text, look up words on a web page, extract and process the result, and convert the original text into HTML, color-coding each word in the process with its etymological root language.

  • To parse the text and extract unique words, I'm using the Natural Language Toolkit.
  • To scrape the web, I'm using Mechanize.
  • To obtain etymological root languages for words, I'm using the Online Etymology Dictionary.
  • To process the resulting HTML, I'm using Beautiful Soup.
  • To deal with all the data resulting from these tasks, I'm using Pandas.
  • To tag each word, I'm just using Python's built-in list and string types.
  • To pull all of the tagged HTML, CSS stylesheets, and JS together, I'm using Pelican (my preferred Python alternative to Ruby's Jekyll).
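To make the tagging step concrete, here is a minimal sketch of how text could be converted into color-coded HTML. It is not the project's actual code: the ROOT_LANGUAGE table is a hypothetical stand-in for results scraped from the Online Etymology Dictionary, a regular expression stands in for the Natural Language Toolkit tokenizer, and the function name tag_text and the CSS class names are assumptions made for illustration.

```python
import html
import re

# Hypothetical lookup table standing in for results scraped from the
# Online Etymology Dictionary; the entries are illustrative only.
ROOT_LANGUAGE = {
    "mercy": "french",
    "chalice": "french",
    "the": "english",
}


def tag_text(text, roots):
    """Wrap each known word in a <span> whose CSS class names its root language."""
    # Split into alternating runs of word and non-word characters so that
    # spacing and punctuation are preserved in the output.
    pieces = re.findall(r"[A-Za-z']+|[^A-Za-z']+", text)
    out = []
    for piece in pieces:
        language = roots.get(piece.lower())
        escaped = html.escape(piece)
        if language is None:
            # Unknown words and punctuation pass through untagged.
            out.append(escaped)
        else:
            out.append('<span class="{}">{}</span>'.format(language, escaped))
    return "".join(out)


tagged = tag_text("The chalice & mercy.", ROOT_LANGUAGE)
print(tagged)
```

A stylesheet would then assign a color to each class (for example, purple for french), which is what produces the color-coded pages.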

The result is something like what you see at the header of the landing page. Python is used to tag words with root languages using the Online Etymology Dictionary. Each of Joyce's short stories is tagged in this way, and each story is on its own page. Here is the table of contents for the tagged version of Dubliners.

Right away it's clear that the dominant language in Joyce's writing is French - the entire text is awash in purple words. And unlike the smatterings of green German words or pale yellow English words, the French words are not commonly recurring articles or prepositions; they are the more complex words, like "priesthood" and "chalice," "scrupulous" and "mercy," containing the intellectual meat of the story.

There's also a steady smattering of Old Norse ("kitchen," "hand," "door," "road"), as well as the occasional word reaching way back in time to recall Old French roots, like "bazaar."

Somewhat surprising (to me, anyway) was the infrequency of Latin and Greek words. I suppose that tagging text about Stephen Dedalus might dip more heavily into those (Ulysses is on the list of books to tag next), but in Dubliners at least, words with Latin and Greek roots are somewhat rare.

Even Sanskrit shows up in Dubliners: in the roots of the word "tobacco."
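Impressions like these about how often each root language appears could be checked with a simple tally. The sketch below is only an illustration: the tagged_words list is a handful of made-up pairs rather than real output, the function name language_shares is hypothetical, and collections.Counter stands in for the Pandas-based bookkeeping the project actually uses.

```python
from collections import Counter

# Hypothetical (word, root language) pairs; a real run would use the tags
# produced for a whole story. These few pairs are made up for illustration.
tagged_words = [
    ("the", "english"),
    ("chalice", "french"),
    ("mercy", "french"),
    ("the", "english"),
    ("scrupulous", "french"),
]


def language_shares(pairs):
    """Return each language's share of the tagged words, largest first."""
    counts = Counter(language for _, language in pairs)
    total = sum(counts.values())
    return [(language, count / total) for language, count in counts.most_common()]


shares = language_shares(tagged_words)
for language, share in shares:
    print("{:<10} {:.0%}".format(language, share))
```

Counting every occurrence (rather than unique words) matters here, since short English words like articles recur far more often than the longer French-rooted ones.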