In Part I, we applied NLTK to James Joyce's Ulysses and found some interesting features of Chapter 8, Lestrygonians. We started by analyzing characters and letter frequencies, and then moved on to words. In this notebook, we'll be looking at phrases.
In particular, we'll try to improve the part-of-speech tagger by looking at the text at the phrase level, and we'll also apply chunking algorithms that group words into phrases based on their parts of speech.
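As a tiny preview of that chunking, here's a sketch using NLTK's off-the-shelf tagger and a simple noun-phrase grammar of our own invention (the grammar is just illustrative, not the one we'll develop, and this assumes the standard NLTK tagger models are downloaded):
import nltk
# Tag the opening line of Lestrygonians, then chunk noun phrases
# with a simple hand-written grammar:
words = nltk.word_tokenize("Pineapple rock, lemon platt, butter scotch.")
tagged = nltk.pos_tag(words)
parser = nltk.RegexpParser(r"NP: {<DT>?<JJ>*<NN.*>+}")
print parser.parse(tagged)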
Let's start by importing our libraries.
# In case we want to plot something:
%matplotlib inline
from __future__ import division
import nltk, re
import numpy as np
# The io module makes unicode easier to deal with
import io
# Shortcut for printing a separator line:
def p():
    print "-"*20
file_contents = io.open('txt/08lestrygonians.txt','r').read()
print type(file_contents)
# Tokenize the chapter using the Punkt Tokenizer:
sentences = nltk.sent_tokenize(file_contents)
print len(sentences)
print sentences[:21]
Now that we've tokenized the text by sentence, we can set to work. The first useful task is to print out a sentence if it contains a given word. We could create a Text object and use its concordance('word') method, but that only prints a fixed-width window of characters around each match; it doesn't return the sentences for further processing.
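To see the difference, here's the concordance approach (a quick sketch; it prints its output rather than returning it):
t = nltk.Text(nltk.word_tokenize(file_contents))
# Prints a fixed-width character window around each match:
t.concordance('eye')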
Suppose we want to search for a word, like "eye", and we want to return the sentence that contains it, along with two sentences of context (the sentence before, and the sentence after).
We can do this by looping through each sentence, breaking it apart using a word tokenizer, and searching for the word of interest. If we find it, we add the prior sentence, current sentence, and next sentence to the list of instances.
small_sentences = sentences[:21]

def word_with_context(word,sentences):
    final_list = []
    for i,sentence in enumerate(sentences):
        if i>0 and i<(len(sentences)-1):
            words = nltk.word_tokenize(sentence)
            if word in words:
                # Store [previous, current, next] with newlines stripped:
                final_list.append( [re.sub('\n',' ',sentences[i-1]),
                                    re.sub('\n',' ',sentences[i]),
                                    re.sub('\n',' ',sentences[i+1])] )
    return final_list
for i in word_with_context('eyes',sentences):
    p()
    print '\n'.join(i)
This is a useful function that we can combine with some other conditions - such as searching a wordlist for words matching a certain pattern. Then we can pass a pattern, and get back each word matching our pattern, with three sentences of context. We'll need a wordlist first, which we can obtain by tokenizing each of our sentences.
wordlist = nltk.word_tokenize(file_contents)
wordlist = [w.lower() for w in wordlist]
english_words = [w for w in nltk.corpus.words.words('en') if w.islower()]
z1 = set(wordlist)
z2 = set(english_words)
print "Number of words in Chapter 8:",len(wordlist)
print "Number of unique words in Ch. 8:",len(z1)
print "Number of words in English dictionary:",len(z2)
print "Numer of words in Ch. 8 in English dictionary:",len( z1.intersection(z2) )
intersection = z1.intersection(z2)
# Since the intersection is a subset of z1, this is just z1 minus z2:
non_dictionary_words = z1.symmetric_difference(intersection)
print len(non_dictionary_words)
non_dictionary_words = sorted(list(non_dictionary_words))
print non_dictionary_words[110:125]
We now have a list of words that aren't found in an English dictionary provided by the NLTK corpus, so these have the potential to be interesting words. We'll use these results to print out some context for each word.
While we're at it, we can also get word counts of each of these words using a Text object:
text = nltk.Text(wordlist)
print "Number of occurences of",non_dictionary_words[115],":",text.count(non_dictionary_words[113])
result = word_with_context(non_dictionary_words[115],sentences)
print '\n'.join(result[0])
The phrase "woman's breasts full" is reminiscent of Lady Macbeth's speech from Macbeth, Act 1 Scene 5, when she discovers Duncan is staying the night (it has a somewhat, uh, different tone):
Stop up the access and passage to remorse,
That no compunctious visitings of nature
Shake my fell purpose, nor keep peace between
The effect and it! Come to my woman’s breasts,
And take my milk for gall, you murd'ring ministers,
Wherever in your sightless substances
You wait on nature’s mischief.
- Macbeth, Act 1, Scene 5
If instead we wanted to search for words matching a regular expression, we could write a similar function that takes a regular expression, searches each sentence for words matching that expression, and gathers the same context as word_with_context().
def re_with_context(rex,sentences):
    final_list = []
    for i,sentence in enumerate(sentences):
        if i>0 and i<(len(sentences)-1):
            words = nltk.word_tokenize(sentence)
            for word in words:
                if len(re.findall(rex,word))>0:
                    final_list.append( [re.sub('\n',' ',sentences[i-1]),
                                        re.sub('\n',' ',sentences[i]),
                                        re.sub('\n',' ',sentences[i+1])] )
    return final_list
for i,ss in enumerate(re_with_context(r'ood\b',sentences)):
    if i<25:
        p()
        print '\n'.join(ss)
Now we are able to pass words and regular expressions, and get a few sentences of context back in return. We can use various techniques to identify keywords, or provide keywords from a file or a list and iterate through them.
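For instance, a quick sketch (keywords.txt is a hypothetical file with one keyword per line):
# Hypothetical keyword file, one keyword per line:
keywords = [line.strip() for line in io.open('keywords.txt','r')]
for kw in keywords:
    for group in word_with_context(kw,sentences):
        p()
        print '\n'.join(group)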
We can also look for particular phonetic sounds, which often occur in groups (as we can see from the word searches above, many of the sentences are repeated because the "ood" pattern often shows up repeatedly over a few sentences).
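One way to suppress those repeats is to keep only the first context group for each matching sentence (a sketch, reusing re_with_context() from above):
seen = set()
for group in re_with_context(r'ood\b',sentences):
    # group[1] is the sentence containing the match:
    if group[1] not in seen:
        seen.add(group[1])
        p()
        print '\n'.join(group)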
We can also look for patterns across the chapters - something we haven't done yet, since we've been focusing on Chapter 8 alone, as a smaller and more manageable body of text.
First, let's expand on that context function, to print out N sentences of context:
def re_with_context(rex,sentences,n_sentences):
    final_list = []
    half = int(np.floor(n_sentences/2))
    for i,sentence in enumerate(sentences):
        if i>=half and i<(len(sentences)-half):
            words = nltk.word_tokenize(sentence)
            for word in words:
                if len(re.findall(rex,word))>0:
                    short_list = []
                    for s in sentences[i-half:i+half+1]:
                        short_list.append( re.sub(r'[\n\t]',' ',s) )
                    final_list.append(short_list)
    return final_list
for group in re_with_context('eyes',sentences,5):
    p()
    print '\n'.join(group)
If we want to start analyzing Ulysses as a whole and look for connections across chapters, we'll need objects to store data about each chapter, objects that will encapsulate much of the functionality laid out in Part I and Part II of these notebooks.
To design such an object, we would first want to define a UlyssesChapter class; a Lestrygonians object would then be an instance of it, constructed from a text file representing the chapter. The class would provide a number of methods to get useful lists, dictionaries, and sets.
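A minimal sketch of what that class might look like, reusing the imports and the word_with_context() function from above (the method names here are hypothetical):
class UlyssesChapter(object):
    """Encapsulates the per-chapter analysis from Parts I and II."""
    def __init__(self,filename):
        self.text = io.open(filename,'r',encoding='utf-8').read()
        self.sentences = nltk.sent_tokenize(self.text)
        self.wordlist = [w.lower() for w in nltk.word_tokenize(self.text)]
    def unique_words(self):
        return set(self.wordlist)
    def non_dictionary_words(self):
        english = set(w for w in nltk.corpus.words.words('en') if w.islower())
        return self.unique_words() - english
    def word_with_context(self,word):
        return word_with_context(word,self.sentences)

lestrygonians = UlyssesChapter('txt/08lestrygonians.txt')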
One example: building wordlists. Starting from a list of seed words like ['orange','yellow','green','blue','indigo','rose','violet'], matching each seed as a substring against the chapter's vocabulary expands the list into [u'blue', u'greenhouses', u'greens', u'penrose', u'orangepeels', u'bluecoat', u'orangegroves', u'greeny', u'yellow', u'bluey', u'yellowgreen', u'green', u'rose', u'blues', u'greenwich']
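A minimal sketch of that expansion, reusing the wordlist variable from earlier (matching seeds as substrings is an assumption about how the expansion works, based on the output above):
seed_words = ['orange','yellow','green','blue','indigo','rose','violet']
expanded = set()
for w in set(wordlist):
    for seed in seed_words:
        # Keep any vocabulary word containing a seed word as a substring:
        if seed in w:
            expanded.add(w)
print sorted(expanded)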
If we were to tag various sentences, based on the nouns, verbs, and actions they contained, their neighbor sentences, the chapter they're in, etc., it would be possible to tag different themes (e.g., the bar of soap, Paddy Dignam, ghosts) and plot their occurrences temporally throughout the novel.
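For single words, NLTK already offers a rough version of that temporal plot: a lexical dispersion plot, which marks the offset of every occurrence of a word across the text. A sketch using the Text object from above (the target words are just examples, lowercase because our wordlist was lowercased):
text.dispersion_plot(['soap','dignam','eyes','blood'])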
Tracking the appearance and disappearance of various characters (and indeed simply extracting a list of characters from the novel) would be marvelous.
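As a rough first pass at such a character list, we could try NLTK's named-entity chunker (a sketch that assumes the maxent NE chunker models are downloaded; it will certainly be noisy on Joyce's prose):
people = set()
for sentence in sentences:
    tagged = nltk.pos_tag(nltk.word_tokenize(sentence))
    # Collect subtrees that the chunker labels as PERSON:
    for subtree in nltk.ne_chunk(tagged).subtrees():
        if subtree.label()=='PERSON':
            people.add(' '.join(word for word,tag in subtree.leaves()))
print sorted(people)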