NOTE: The code covered in this post uses Python 3. The scripts can be converted to Python 2 with minimal effort, but the author would encourage any user of Python 2 to "put on your big kid pants" and make the switch to Python 3. Let's all make this painful, drawnout switch from Python 2 to Python 3 a thing of the past, shall we?
Table of Contents
 About the Five Letter Words
 The Usefulness of Five Letter Words
 WarmUp Exercises
 Get Words Function
 Euclidean Distance
 Mo Words, Mo Problems
 References
About the FiveLetter Words
In Volume 4, Facsimile 0 of Donald Knuth's Art of Computer Programming, in which Knuth covers graph theory, he introduces a list of fiveletter words as part of a data set useful in exploring graph theory and graph algorithms.
The list of words is part of the Stanford Graph Base, a set of data sets that are useful for studying graph theory and networks.
The first few words in the list are:
 which
 there
 their
 about
 would
 these
 other
 words
 could
 write
 first
 water
 after
and so on. There are 5,757 total words in the data set, including some common words (as the first few listed), as well as some less common words:
 osier
 roble
 rumba
 biffy
 pupal
This post is an introduction to the five letter words, and will give a few useful algorithms for analyzing the set of words.
The Usefulness of Five Letter Words
We are surrounded, always and everywhere, by language  the principal mechanism of thought, communication, and expression. Our latent familiarity with language makes data sets involving language extremely useful  unlike a data set about football scores, property crime, or human fatalities, we don't expend much effort understanding the nature of the data. Studying language also gives us a deeper understanding and appreciation for the complexity of language, for through our very familiarity with language, it can come to seem deceptively simple.
Five letter words, in particular, are short enough that they are familiar, and surround us, and yet long enough to have variety and lead to some very interesting properties. Five also happens to be a computationally convenient length.
WarmUp Exercises
In Knuth's AOCP, he presents the reader with several warmup exercises to get familiar with the list of words. We cover solutions to several of these exercises. Many of these exercises are creating algorithms that, while not utilizing graph theory themselves, can be utilized to construct interesting graphs. These exercises are written in Python.
Let us begin.
Get Words Function
Before starting any analysis of the five letter words, it is a good idea to create a function that will load the data set form a text file and load the result as a Python list. This function is given below:
"""
get_words.py
Utility method to load the SBG words
and retun them as a list of strings.
"""
def get_words():
# Load the file.
with open('sgbwords.txt','r') as f:
## This includes \n at the end of each line:
#words = f.readlines()
# This drops the \n at the end of each line:
words = f.read().splitlines()
return words
This is a straightforward use of the read()
and splitlines()
functions in Python.
Euclidean Distance
We begin with a calculation of the Eulcidean distance between words. We define the distance between two words, commonly called the "edit distance," based on the notion of a unit change, which is incrementing or decrementing a letter by one. Thus, the edit distance between "a" and "b" is 1, the edit distance between "e" and "g" is 2, and so on.
Euclidean Distance Code
Let's start with the code that does the calculation of the edit distance between two words:
import random, math, operator
from pprint import pprint
from get_words import get_words
random.seed(1337)
"""
euclidean_dist.py
Compute euclidean distance between 5letter words.
"""
def euclidean_distance(word1, word2):
v1 = word2vec(word1)
v2 = word2vec(word2)
return l2norm(v1,v2)
def l2norm(vec1, vec2):
radicand = [(v2v1)*(v2v1) for (v1,v2) in zip(vec1,vec2)]
return math.sqrt(sum(radicand))
def word2vec(word):
charvec = []
vec = []
for c in word:
charvec.append(c)
vec.append(ord(c)ord('a'))
return vec
def print_tuple(e):
print("Distance between {0:s} and {1:s} = {2:f}".format(*e))
if __name__=="__main__":
words = get_words()
eds = []
for i in range(100):
w1 = words[random.randint(1,5757)]
w2 = words[random.randint(1,5757)]
ed = euclidean_distance(w1,w2)
eds.append((w1,w2,ed))
sorted_eds = sorted(eds, key=operator.itemgetter(2))
for e in reversed(sorted_eds):
print_tuple(e)
Note that this script shares much in common with codes to create
Caesar ciphers, Affine ciphers, and the like  the heart of the script is
the word2vec()
function, which converts a fiveletter word into a fivecomponent
vector of numbers from 0 to 25. This is done using Python's ord()
function,
which returns the ordinal value of a character. The letter 'a' is 0, 'b' is 1, 'c' is 2,
and so on.
The code also implements an L2 norm calculation, which is the mathematical term for a Euclidean distance calculation. It computes the square root of the sum of the squares of the differences between each component of the vector. This is the standard distance formula from your high school algebra class, extended to higher dimensions:
Or, for the physicists out there, the dot product of two vectors. The L2 norm between two vectors \(\mathbf{v}_1\) and \(\mathbf{v}_2\) is commonly denoted:
Examples
To better illustrate what the Euclidean distance calculation looks like, let's look at some concrete examples of words that have an edit distance of 1:
there, these
right, sight
sound, round
might, night
might, light
along, among
In each case, we increment or decrement a single letter by 1, and the result is another fiveletter word in the list. Perhaps the most surprising result is how many pairs of common words have an edit distance of 1:
$ python diff_by_one.py
1075 words have a Euclidean distance of +/1.
That means nearly 20% of the words are within a single edit of another word.
If we look at words that have an edit distance of more than 1, we can see that some pairs of words have a single letter that changes by 2 units, while other pairs have two letters that differ by a single unit:
would, wound
right, tight
years, wears
never, lever
along, alone
night, light
paper, oboes
The last pair is an example of the latter.
Here are more examples of pairs of words with larger edit distances:

Distance of 3
there, where
would, world
words, woods
sound, pound
those, whose
house, horse

Distance of 4
about, cents
after, birds
right, night
think, thing
sound, wound
small, smell

Distance of 5
which, weigh
there, theme
other, steer
right, might
years, tears
place, space
place, piece
DifferentbyN Code
The code that performs the above calculations includes
diff_by_one.py
and diff_by_n.py
. Here is the former:
"""
diff_by_one.py
Donald Knuth, Art of Computer Programming, Volume 4 Facsimile 0
Exercise #28
Find pairs of SGB word vectors that differ by +/1.
"""
from get_words import get_words
from euclidean_distance import euclidean_distance
if __name__=="__main__":
words = get_words()
## To limit the output:
#words = words[:1000]
k = 0
off_by_one = []
for i in range(len(words)):
for j in range(i,len(words)):
d = euclidean_distance(words[i],words[j])
if(abs(d)==1):
k += 1
off_by_one.append((words[i],words[j]))
print("{0:s}, {1:s}".format(words[i],words[j]))
print("{0:d} words have a Euclidean distance of +/1.".format(k))
This is a nested for loop that examines all pairs of words. Note that
we want to avoid the pair (B,A) if we have already found/printed the pair
(A,B), so we use a nested for loop where the inner index starts at the
outer index. The core of the script is the euclidean_distance()
function,
covered above.
This algorithm takes \(O(N^2)\) time due to the nested for loops.
Likewise, here is code to generate pairs that differ by some amount \(n\). This code will only print 10 pairs for each \(n\), to cut down on running time.
"""
diff_by_n.py
Donald Knuth, Art of Computer Programming, Volume 4 Facsimile 0
Variation on Exercise #28
Find pairs of SGB word vectors that differ by +/n.
"""
from get_words import get_words
from euclidean_distance import euclidean_distance
def diff_by_n(n):
k = 0
off_by_one = []
for i in range(len(words)):
for j in range(i,len(words)):
d = euclidean_distance(words[i],words[j])
if(abs(d)==n):
k += 1
off_by_one.append((words[i],words[j]))
print("{0:s}, {1:s}".format(words[i],words[j]))
if k>10:
break
print("{0:d} words have a Euclidean distance of +/{0:d}.".format(k,n))
if __name__=="__main__":
words = get_words()
for n in [2,3,4,5]:
print(""*40)
print("Distance of {0:d}".format(n))
diff_by_n(n)
Mo Words, Mo Problems
We have a number of other interesting problems and codes to cover, including:
 Palindromes
 Number of unique words
 Word/letter statistics
 Words consisting of lexicographically sorted letters
 Words consisting of distinct letters
Each of these will be covered in a series of posts to follow. Then we'll move on to the business of graph theory, implementation of these five letter words as graphs, and utilization of software libraries designed for graphs and networks (expect some code in Java using Google's excellent Guava library).
You can follow all of this in my fiveletterwords repository on Github, and/or in the fiveletterwords repository on git.charlesreid1.com.
We have also provided additional information on the charlesreid1 wiki, at Five Letter Words, along with a whole bundle of notes from working through Donald Knuth's Art of Computer Programming at the Art of Computer Programming page of the wiki.
Expect more soon!
References

Knuth, Donald. The Art of Computer Programming. Upper Saddle River, NJ: AddisonWesley, 2008.

Knuth, Donald. The Stanford GraphBase: A Platform for Combinatorial Computing. New York: ACM Press, 1994. <http://wwwcsfaculty.stanford.edu/~knuth/sgb.html>

"Five Letter Words." Git repository, git.charlesreid1.com. Charles Reid. Updated 1 September 2017. <http://git.charlesreid1.com/cs/fiveletterwords>