charlesreid1.com blog

Five Letter Words: Part 2: More Five-Word Algorithms

Posted in Computer Science


NOTE: The code covered in this post uses Python 3. The scripts can be converted to Python 2 with minimal effort, but the author would encourage any user of Python 2 to "put on your big kid pants" and make the switch to Python 3. Let's all make this painful, drawn-out switch from Python 2 to Python 3 a thing of the past, shall we?


Introduction

In Five Letter Words: Part 1, we introduced Donald Knuth's list of five letter words, one of the data sets in the Stanford GraphBase, which Knuth covers in greater detail in his treatment of graph theory in Volume 4, Fascicle 0 of his magnum opus, The Art of Computer Programming.

In the section where Knuth introduces the set of words, he also gives readers several exercises to get to know the list of words. This multi-part series of posts (also see Five Letter Words: Part 1) is covering some of the solutions to these exercises, and expanding on them to illustrate some of the interesting and surprising properties of this data set.

Five-Letter Words with k Distinct Letters

In Exercise 27, Knuth asks the reader to make a list of words composed of a specific number of distinct letters (1, 2, 3, 4, or 5).

In the list of five-letter words, there are 0 words composed of a single letter, 4 words with two distinct letters (0.07%), 163 words with three distinct letters (2.8%), 1756 words with four distinct letters (30.5%), and 3834 words with five distinct letters (66.6%).

Here are a few examples:

  • Two distinct letters: mamma, ahhhh, esses, ohhhh
  • Three distinct letters: added, seems, sense, level, teeth
  • Four distinct letters: which, there, these, where, three
  • Five distinct letters: their, about, would, other, words

To find these, we can design an algorithm that does the following: split each string into characters, add them to a set data type (a set discards any duplicates), and get the size of the set. This will give us the number of unique letters in a given word, and we can use a list of lists to store all words with a specified number of unique letters.

Once again, we're using our get_words function, which we covered in Part 1. See get_words.py for that script.

distinct.py

"""
distinct.py

Donald Knuth, Art of Computer Programming, Volume 4 Fascicle 0
Exercise #27

How many SGB words contain exactly k distinct letters, for 1 <= k <= 5?
"""
from pprint import pprint
from get_words import get_words

if __name__=="__main__":
    words = get_words()

    lengths = [[] for i in range(5+1)]

    for word in words:
        k = len(set(word))
        lengths[k].append(word)

    for i in range(1,5+1):
        print("-"*40)
        print("Number of words with {0:d} distinct letters: {1:d}".format(i, len(lengths[i])))
        print(", ".join(lengths[i][0:5]))

The principal operation here is the statement that gets the length, k:

k = len(set(word))
lengths[k].append(word)

The operation of turning a word into a set is \(O(M)\), where M is the number of letters in the word, and the algorithm performs this operation once on each word, so overall the algorithm is \(O(N M)\); since \(M = 5\) is constant here, this is \(O(N)\), where N is the number of words.

The storage space used by the algorithm is also \(O(N)\): each word is stored in exactly one of the lists, since its number of distinct letters \(k\) takes a single value in \(\{ 1 \dots 5 \}\).

If we were dealing with a lot of words, and needed to save some space, we could represent the list of words with \(k\) distinct letters using five bit vectors, where each bit vector represents the words that are composed of \(k\) distinct letters, and has a length of \(N\), the number of words. A 0 would indicate the word is not in the set (is not composed of \(k\) letters), and a 1 would indicate the opposite.
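As a sketch of that idea (using a tiny hypothetical word list standing in for the full SGB set), each class of words can be packed into an integer bitmask, where bit i records whether words[i] has exactly k distinct letters:

```python
# Hypothetical miniature word list, standing in for the full SGB set.
words = ["mamma", "added", "which", "their", "ahhhh"]

# One integer bitmask per value of k; bit i of masks[k] is 1
# if words[i] contains exactly k distinct letters.
masks = {k: 0 for k in range(1, 6)}
for i, word in enumerate(words):
    k = len(set(word))
    masks[k] |= 1 << i

# Recover the words with exactly two distinct letters:
two_distinct = [w for i, w in enumerate(words) if (masks[2] >> i) & 1]
print(two_distinct)  # ['mamma', 'ahhhh']
```

Each of the five bitmasks occupies only N bits rather than N references, at the cost of an extra decoding step to get the words back out.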

But here, we keep it simple, since we have a small, known set of words.

Examining a Variation

While that's essentially all there is to this algorithm, and it takes all of 10 seconds to come up with the idea, there are some nuances and some bookkeeping details, as there are with the design of any algorithm.

For example, compare the following two approaches; Approach 1 is used in the program, Approach 2 is a less efficient approach:

    # Approach 1:
    for word in words:
        k = len(set(word))
        lengths[k].append(word)


    # Approach 2:
    for k in range(1,5+1):
        for word in words:
            if(len(set(word))==k):
                lengths[k].append(word)

While these are both \(O(N)\) runtime, the latter approach is inefficient: we loop over each word five times, and each time we perform the same operation (turning the letters of a word into a set).

Is there ever a case where we would want an approach like #2?

The short answer is, never.

To give a longer answer, let's consider a case where approach #2 might provide an advantage. Suppose we were considering a case where \(k\) could be larger - a list of 15-letter words, for example, so k could be up to 15 - and we were only interested in a particular value, or small set of values, of \(k\), like 3 and 4.
Approach 1 would store unnecessary intermediate results (the values of k for all words) and therefore use extra space, compared with approach #2 where we could change the for loop to for k in [3,4]:.

Even here, though, approach #2 results in unnecessary work, because approach #1 is still computationally more efficient by looping over the list of words only once, compared with approach #2, which would loop over the list of words twice.

We may further consider a case where approach #2 would give us an advantage, and that is the case where we are copying data into the list lengths, instead of just storing a reference to a string. Because we only deal with references in Python, we aren't making copies in the code given above. But because strings are immutable, we could conceivably be making copies if we stored word.upper() instead of word. Approach #2 would use less space, because it only considers the values of k that are of interest.

But even here, approach #1 requires only a small modification to wipe out the space advantage of approach #2: add an if statement before calling the append function: if k in [3,4]. Now the calculation of turning a word into a set of characters is performed only once for approach #1, and we don't end up storing unnecessary intermediate results.
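That modified version of approach #1 might look like the following sketch (the tiny word list, and the choice of 3 and 4 as the interesting values of k, are hypothetical):

```python
# Hypothetical stand-in word list; 3 and 4 are the only
# (hypothetical) values of k we care about.
words = ["mamma", "added", "which", "their"]
targets = [3, 4]

lengths = {k: [] for k in targets}
for word in words:
    k = len(set(word))        # computed once per word
    if k in targets:          # filter before storing
        lengths[k].append(word)

print(lengths)  # {3: ['added'], 4: ['which']}
```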

The take-home lesson: even if the core idea behind an algorithm is straightforward, there are still many ways to do it better or worse.

Lexicographic Ordering of Letters

Knuth points out that the word "first" contains letters that occur in lexicographic order. Exercise #30 of AOCP Volume 4 Fascicle 0 asks us to find the first and last such words in Knuth's set of five letter words.

To do this, we'll take each word and turn it into a list of characters. We'll then sort the characters, and turn the sorted list of characters back into a string. If the string constructed from sorted characters equals the original string, we have our word, formed from lexicographically ordered letters.
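Before looking at the full script, here is that core check in isolation, applied to Knuth's example word "first":

```python
# "first" is unchanged by sorting its characters, so its letters
# occur in lexicographic order; a word like "words" is not.
in_order = sorted("first") == list("first")
print(in_order)  # True
print(sorted("words") == list("words"))  # False
```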

We can also perform the reverse - search for words whose letters are in reverse lexicographic order. One such word is "spied". Implementing this task requires a bit more care, because reversed() returns a lazy iterator rather than a list, but we can get around this with the list() function, as we shall see shortly.

Five-Letter Words with Lexicographically Ordered Letters

Exercise 30 asks us to find the first and last word in the set of five letter words whose letters occur in sorted lexicographic order. We begin by sorting all of the words, and we find the first such word is "abbey", while the last such word is "pssst".

There are 105 total words that fit this description. As we might expect, a majority of them begin with letters at the beginning of the alphabet:

  • abbey
  • abbot
  • abhor
  • abort
  • abuzz
  • achoo
  • adder
  • adept
  • adios
  • adopt
  • aegis
  • affix
  • afoot
  • aglow
  • ahhhh
  • allot
  • allow
  • alloy
  • ammos
  • annoy
  • beefs
  • beefy
  • beeps
  • beers
  • beery
  • befit
  • begin
  • begot
  • bells
  • belly
  • below
  • berry
  • bills
  • billy
  • bitty
  • blowy
  • boors
  • boost
  • booty
  • bossy
  • ceils
  • cello
  • cells

The full output is here:

lexico output

The code to find these words is given below:

lexico.py

"""
lexico.py

Donald Knuth, Art of Computer Programming, Volume 4 Fascicle 0
Exercise #30

Each letter of the word "first" appears in correct lexicographic order.
Find the first and last such words in the SGB words.
"""
from get_words import get_words

def in_sorted_order(word):
    chars = list(word)
    if(str(chars)==str(sorted(chars))):
        return True
    else:
        return False

if __name__=="__main__":

    words = get_words()
    words = sorted(words)

    count = 0
    print("-"*40)
    print("ALL lexicographically sorted words:")
    for word in words:
        if(in_sorted_order(word)):
            print(word)
            count += 1
    print("{0:d} total.".format(count))

    print("-"*40)
    for word in words:
        if(in_sorted_order(word)):
            print("First lexicographically sorted word:")
            print(word)
            break

    words.reverse()

    print("-"*40)
    for word in words:
        if(in_sorted_order(word)):
            print("Last lexicographically sorted word:")
            print(word)
            break

The heart of the method here is the in_sorted_order() function, which performs the task described above. We take the word passed to the function (a string) and turn it into a list of characters using the list() function. We then compare the string representation of that list, str(chars), with the string representation of the sorted list of characters, str(sorted(chars)).

If the two match, sorting did not change the order of the characters, so the original string was already in lexicographic (alphabetic) order, and we return True. Otherwise, we return False.

Here's that method one more time:

def in_sorted_order(word):
    chars = list(word)
    if(str(chars)==str(sorted(chars))):
        return True
    else:
        return False

Five-Letter Words with Lexicographically Reversed Letters

There are significantly fewer five-letter words whose letters are in reverse lexicographic order - 37, compared to the 105 in sorted order. Here is the full list:

  • mecca
  • offed
  • ohhhh
  • plied
  • poked
  • poled
  • polka
  • skied
  • skiff
  • sniff
  • soled
  • solid
  • sonic
  • speed
  • spied
  • spiff
  • spoke
  • spoof
  • spook
  • spool
  • spoon
  • toked
  • toned
  • tonic
  • treed
  • tried
  • troll
  • unfed
  • upped
  • urged
  • vroom
  • wheee
  • wooed
  • wrong
  • yoked
  • yucca
  • zoned

The code to do this requires only minor modifications to the original, sorted order code.

To reverse the procedure, we just need to modify the in_sorted_order() function to reverse the sorted list of characters before we reassemble it into a string. We can feed the output of the call to sorted() to the reversed() function. However, reversed() returns a lazy iterator object - it does not enumerate the characters unless we force it to.

That's where the call to list() comes in handy - by passing a generator to list(), we force Python to enumerate the output of the reversed, sorted list generator. Then we turn the reversed, sorted list into a reversed, sorted string:

def in_reverse_sorted_order(word):
    chars = list(word)
    # Note: reversed returns a generator,
    # so we have to pass it to list()
    # to explicitly enumerate the reversed results.
    if(str(chars)==str(list(reversed(sorted(chars))))):
        return True
    else:
        return False

Meanwhile, the rest of the script can stay virtually the same.

reverse_lexico.py

"""
reverse_lexico.py

Donald Knuth, Art of Computer Programming, Volume 4 Fascicle 0
Variation on Exercise #30

Each letter of the word "spied" appears in reversed lexicographic order.
Find more words whose letters appear in reverse lexicographic order.
"""
from get_words import get_words

def in_reverse_sorted_order(word):
    chars = list(word)
    # Note: reversed returns a generator, 
    # so we have to pass it to list() 
    # to explicitly enumerate the reversed results.
    if(str(chars)==str(list(reversed(sorted(chars))))):
        return True
    else:
        return False

if __name__=="__main__":

    words = get_words()
    words = sorted(words)

    count = 0
    print("-"*40)
    print("ALL lexicographically reversed words:")
    for word in words:
        if(in_reverse_sorted_order(word)):
            print(word)
            count += 1
    print("{0:d} total.".format(count))

    print("-"*40)
    for word in words:
        if(in_reverse_sorted_order(word)):
            print("First reverse lexicographically sorted word:")
            print(word)
            break

    words.reverse()

    print("-"*40)
    for word in words:
        if(in_reverse_sorted_order(word)):
            print("Last reverse lexicographically sorted word:")
            print(word)
            break

Finding Palindromes

Palindromes are words or phrases that have a reflective property, namely, they spell the same thing forward and backward (e.g., "race car", or "Able was I ere I saw Elba", or "Doc, note I dissent - a fast never prevents a fatness. I diet on cod.").

In Exercise 29, Knuth asks the reader to perform a straightforward task - find the palindromes in the list of five letter words. (An example of one such word is "kayak".) But Knuth goes further, and points out that palindromes can also be formed from pairs of words. He gives the example "regal lager". He asks the reader to find all palindrome pairs as well.

When working on these exercises, we became curious about palindromic near-misses. How many words are almost palindromes? (Example: "going" is very close to a palindrome, if we just changed the n to an o or vice-versa.) In fact, we already have all the tools we need at our disposal, as we already covered a script to perform a Euclidean distance calculation.

We will cover Python code to find words that fit into each of these categories, and provide some interesting examples. (One of the most surprising things to us was just how many words meet these criteria!)

Palindromes

The first task is finding palindromes in the set of five letter words. There are 18 such words. They are given below:

  • level
  • refer
  • radar
  • madam
  • rotor
  • civic
  • sexes
  • solos
  • sagas
  • kayak
  • minim
  • tenet
  • shahs
  • stats
  • stets
  • kaiak
  • finif
  • dewed

The code to check whether a word is a palindrome consists of two simple logical tests: Is the character at position 0 equal to the character at position 4? Is the character at position 1 equal to the character at position 3? (The middle character mirrors itself, so it needs no test.) If both of these are true, the word is a palindrome. Here's the Python function to check if a word is a palindrome:

palindromes.py

def is_palindrome(word):
    test1 = word[0]==word[4]
    test2 = word[1]==word[3]
    if(test1 and test2):
        return True
    return False

and the main driver method, which actually runs the function on each word:

if __name__=="__main__":
    words = get_words()

    kp = 0
    palindromes = []

    # Check for palindromes
    for i in range(len(words)):
        if(is_palindrome(words[i])):
            kp += 1
            palindromes.append(words[i])

    print("-"*40)
    print("Palindromes: \n")
    print(", ".join(palindromes))
    print("There are {0:d} palindromes.".format(kp))

Palindrome Pairs

There are 34 palindromic pairs, if we disallow palindromes from being considered palindromic pairs with themselves. These are:

  • parts, strap
  • lived, devil
  • speed, deeps
  • sleep, peels
  • straw, warts
  • faced, decaf
  • spots, stops
  • fires, serif
  • lever, revel
  • smart, trams
  • ports, strop
  • pools, sloop
  • stool, loots
  • draws, sward
  • mined, denim
  • spins, snips
  • alley, yella
  • loops, spool
  • sleek, keels
  • repel, leper
  • snaps, spans
  • depot, toped
  • timed, demit
  • debut, tubed
  • laced, decal
  • stink, knits
  • regal, lager
  • tuber, rebut
  • remit, timer
  • pacer, recap
  • snoot, toons
  • namer, reman
  • hales, selah
  • tarps, sprat

The code to check for palindrome pairs is a little more involved, but also consists of a few logical tests to see if letters in one position of the first word match letters in another position of the second word:

palindromes.py

def is_palindrome_pair(word1,word2):
    test0 = word1[0]==word2[4]
    test1 = word1[1]==word2[3]
    test2 = word1[2]==word2[2]
    test3 = word1[3]==word2[1]
    test4 = word1[4]==word2[0]
    if(test0 and test1 and test2 and test3 and test4):
        return True
    return False

and the main driver method:

if __name__=="__main__":
    words = get_words()

    kpp = 0
    palindrome_pairs = []

    # Check for palindrome pairs
    for i in range(len(words)):
        for j in range(i,len(words)):
            if(is_palindrome_pair(words[i],words[j])):
                # Palindromes shouldn't count as palindrome pairs
                if(i != j):
                    kpp += 1
                    palindrome_pairs.append((words[i],words[j]))

    print("-"*40)
    print("Palindrome Pairs: \n")
    for pair in palindrome_pairs:
        print(", ".join(pair))
    print("There are {0:d} palindrome pairs.".format(kpp))

Near Palindromes

A near-palindrome is a word that would be a palindrome, if one of its letters were slightly modified. We use a "tolerance" parameter to specify how much modification we are willing to live with to consider a word a near-palindrome.

There are several ways to do this, but we'll keep it simple: we compute the total number of unit changes, summed over all characters, required to turn the word into a palindrome, and test whether that total is less than or equal to a specified parameter, tolerance.

For example, if our tolerance were 1, we would consider the words "going" and "moron" to be near-palindromes; if our tolerance were 2, we would consider the words "tsars" and "jewel" to be near-palindromes.
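We can check the "going" example by hand: the outer letters already match, and the inner pair 'o'/'n' differs by a single unit.

```python
# Mirror pairs in a five-letter word are positions (0, 4) and (1, 3).
word = "going"
d1 = abs(ord(word[0]) - ord(word[4]))  # 'g' vs 'g' -> 0
d2 = abs(ord(word[1]) - ord(word[3]))  # 'o' vs 'n' -> 1
print(d1 + d2)  # 1: one unit change away from a palindrome
```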

Here is the list of 37 off-by-one palindromes:

  • going
  • seeds
  • tight
  • trust
  • suits
  • sends
  • plump
  • slums
  • sighs
  • erase
  • serfs
  • soaps
  • sewer
  • soups
  • sever
  • slams
  • scabs
  • moron
  • ceded
  • scads
  • suets
  • fugue
  • seder
  • tryst
  • educe
  • twixt
  • tutus
  • shags
  • slims
  • abaca
  • anima
  • celeb
  • selfs
  • scuds
  • tikis
  • topos
  • rajas

and the list of off-by-two palindromes:

  • often
  • stars
  • sight
  • visit
  • towns
  • climb
  • flame
  • reads
  • sings
  • hatch
  • tends
  • naval
  • robot
  • reeds
  • cocoa
  • stout
  • spins
  • onion
  • sinks
  • edged
  • spurs
  • jewel
  • snaps
  • silks
  • nasal
  • theft
  • pagan
  • reefs
  • stirs
  • snips
  • tufts
  • truss
  • strut
  • spans
  • smelt
  • spars
  • flake
  • rusts
  • skims
  • sways
  • runts
  • tsars
  • tress
  • feted
  • rends
  • romps
  • cilia
  • ephod
  • fluke
  • reset
  • farad
  • peter
  • natal
  • thugs
  • newel
  • paean
  • emend
  • snoot
  • fiche
  • porno
  • flume
  • toons
  • roans
  • offen
  • klunk
  • feued
  • nihil
  • pavan
  • relet
  • heigh
  • revet
  • sicks
  • spoor

The check for near-palindromes follows the palindrome test fairly closely, except instead of checking whether the letters in two mirrored positions are equal, we check whether those two letters are within a certain specified distance of one another.

Here is the code for finding near-palindromes:

"""
near_palindromes.py

Donald Knuth, Art of Computer Programming, Volume 4 Fascicle 0
Variation on Exercise #29

Find SGB words that are near-palindromes
(edit distance of one or two letters away from a palindrome).
"""
from get_words import get_words
from euclidean_distance import euclidean_distance
from pprint import pprint

def is_near_palindrome(word,lo,hi):
    d1 = euclidean_distance(word[0],word[4])
    d2 = euclidean_distance(word[1],word[3])

    if( (d1+d2) > lo and (d1+d2) <= hi ):
        return True

    return False

if __name__=="__main__":
    words = get_words()

    knp = 0
    near_palindromes = []

    # Euclidean distance tolerance
    lo = 0.0
    hi = 1.0

    for i in range(len(words)):
        if(is_near_palindrome(words[i],lo,hi)):
            knp += 1
            near_palindromes.append(words[i])

    print("-"*40)
    print("Near Palindromes: \n")
    print(", ".join(near_palindromes))
    print("The number of near-palindromes is {0:d}".format(len(near_palindromes)))

References

  1. Knuth, Donald. The Art of Computer Programming. Upper Saddle River, NJ: Addison-Wesley, 2008.

  2. Knuth, Donald. The Stanford GraphBase: A Platform for Combinatorial Computing. New York: ACM Press, 1994. <http://www-cs-faculty.stanford.edu/~knuth/sgb.html>

  3. "Five Letter Words." Git repository, git.charlesreid1.com. Charles Reid. Updated 1 September 2017. <http://git.charlesreid1.com/cs/five-letter-words>

Five Letter Words: Part 1: Getting Familiar With The List

Posted in Computer Science




About the Five-Letter Words

In Volume 4, Fascicle 0 of Donald Knuth's The Art of Computer Programming, which covers graph theory, Knuth introduces a list of five-letter words as part of a data set useful for exploring graph theory and graph algorithms.

The list of words is part of the Stanford GraphBase, a collection of data sets useful for studying graph theory and networks.

The first few words in the list are:

  • which
  • there
  • their
  • about
  • would
  • these
  • other
  • words
  • could
  • write
  • first
  • water
  • after

and so on. There are 5,757 total words in the data set, including some common words (like the first few listed above), as well as some less common words:

  • osier
  • roble
  • rumba
  • biffy
  • pupal

This post is an introduction to the five letter words, and will give a few useful algorithms for analyzing the set of words.

The Usefulness of Five Letter Words

We are surrounded, always and everywhere, by language - the principal mechanism of thought, communication, and expression. Our latent familiarity with language makes data sets involving language extremely useful - unlike a data set about football scores, property crime, or human fatalities, we don't expend much effort understanding the nature of the data. Studying language also gives us a deeper understanding and appreciation for the complexity of language, for through our very familiarity with language, it can come to seem deceptively simple.

Five letter words, in particular, are short enough that they are familiar, and surround us, and yet long enough to have variety and lead to some very interesting properties. Five also happens to be a computationally convenient length.

Warm-Up Exercises

In AOCP, Knuth presents the reader with several warm-up exercises for getting familiar with the list of words. We cover solutions to several of these exercises here. Many of the exercises involve creating algorithms that, while not using graph theory themselves, can be used to construct interesting graphs. Our solutions are written in Python.

Let us begin.

Get Words Function

Before starting any analysis of the five letter words, it is a good idea to create a function that will load the data set from a text file and return the result as a Python list. This function is given below:

get_words.py

"""
get_words.py

Utility method to load the SGB words
and return them as a list of strings.
"""

def get_words():
    # Load the file.
    with open('sgb-words.txt','r') as f:
        ## This includes \n at the end of each line:
        #words = f.readlines()

        # This drops the \n at the end of each line:
        words = f.read().splitlines()

    return words

This is a straightforward use of the read() and splitlines() functions in Python.

Euclidean Distance

We begin with a calculation of the Euclidean distance between words. We define the distance between two words, referred to in this post as the "edit distance," based on the notion of a unit change, which is incrementing or decrementing a letter by one. Thus, the edit distance between "a" and "b" is 1, the edit distance between "e" and "g" is 2, and so on.

Euclidean Distance Code

Let's start with the code that does the calculation of the edit distance between two words:

euclidean_distance.py

import random, math, operator
from pprint import pprint
from get_words import get_words

random.seed(1337)

"""
euclidean_distance.py

Compute the Euclidean distance between 5-letter words.
"""

def euclidean_distance(word1, word2):
    v1 = word2vec(word1)
    v2 = word2vec(word2)
    return l2norm(v1,v2)

def l2norm(vec1, vec2):
    radicand = [(v2-v1)*(v2-v1) for (v1,v2) in zip(vec1,vec2)]
    return math.sqrt(sum(radicand))

def word2vec(word):
    # Convert each character to its ordinal value: 'a' -> 0, ..., 'z' -> 25.
    return [ord(c) - ord('a') for c in word]

def print_tuple(e):
    print("Distance between {0:s} and {1:s} = {2:f}".format(*e))

if __name__=="__main__":

    words = get_words()

    eds = []
    for i in range(100):
        w1 = words[random.randrange(len(words))]
        w2 = words[random.randrange(len(words))]
        ed = euclidean_distance(w1,w2)
        eds.append((w1,w2,ed))

    sorted_eds = sorted(eds, key=operator.itemgetter(2))

    for e in reversed(sorted_eds):
        print_tuple(e)

Note that this script shares much in common with codes to create Caesar ciphers, Affine ciphers, and the like - the heart of the script is the word2vec() function, which converts a five-letter word into a five-component vector of numbers from 0 to 25. This is done using Python's ord() function, which returns the ordinal value of a character. The letter 'a' is 0, 'b' is 1, 'c' is 2, and so on.

The code also implements an L2 norm calculation, which is the mathematical term for a Euclidean distance calculation. It computes the square root of the sum of the squares of the differences between each component of the vector. This is the standard distance formula from your high school algebra class, extended to higher dimensions:

$$ d = \sqrt{ (x_2 - x_1)^2 + (y_2 - y_1)^2} $$

Or, for the physicists out there: the square root of the dot product of the difference vector with itself. The L2 norm between two vectors \(\mathbf{v}_1\) and \(\mathbf{v}_2\) is commonly denoted:

$$ || \mathbf{v}_2 - \mathbf{v}_1 ||_2 $$
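As a sanity check, we can recompute two of the distances quoted below by hand; this compact restatement of word2vec() and l2norm() shows the whole calculation in a few lines:

```python
import math

def to_vec(word):
    # Map each letter to its ordinal value: 'a' -> 0, ..., 'z' -> 25.
    return [ord(c) - ord('a') for c in word]

def distance(w1, w2):
    # L2 norm of the difference between the two letter vectors.
    return math.sqrt(sum((b - a)**2 for a, b in zip(to_vec(w1), to_vec(w2))))

# "there"/"these" differ only at 'r' vs 's', one unit apart.
print(distance("there", "these"))  # 1.0
# "which"/"weigh" differ at 'h'/'e' (3 units) and 'c'/'g' (4 units).
print(distance("which", "weigh"))  # 5.0
```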

Examples

To better illustrate what the Euclidean distance calculation looks like, let's look at some concrete examples of words that have an edit distance of 1:

there, these
right, sight
sound, round
might, night
might, light
along, among

In each case, we increment or decrement a single letter by 1, and the result is another five-letter word in the list. Perhaps the most surprising result is how many pairs of common words have an edit distance of 1:

$ python diff_by_one.py
1075 words have a Euclidean distance of +/-1.

That means nearly 20% of the words are within a single edit of another word.

If we look at words that have an edit distance of more than 1, we can see that some pairs of words have a single letter that changes by 2 units, while other pairs spread the distance across several letters that each differ by a single unit:

would, wound
right, tight
years, wears
never, lever
along, alone
night, light
paper, oboes

The last pair, "paper" and "oboes", spreads the distance across four letters that each differ by one unit, giving a distance of \(\sqrt{1+1+1+1} = 2\).

Here are more examples of pairs of words with larger edit distances:

----------------------------------------
Distance of 3
there, where
would, world
words, woods
sound, pound
those, whose
house, horse
----------------------------------------
Distance of 4
about, cents
after, birds
right, night
think, thing
sound, wound
small, smell
----------------------------------------
Distance of 5
which, weigh
there, theme
other, steer
right, might
years, tears
place, space
place, piece

Different-by-N Code

IMPORTANT NOTE: On 2019-03-09 we revisited the problem set and solution, and discovered that we had misinterpreted the (much more interesting) original problem posed by Knuth. This will be rectified in a follow-up blog post!

Briefly, the mistake we made here was to interpret this problem's "different by +/-1" as asking for pairs of words whose total Hamming distance (or Euclidean distance) is exactly 1. This interpretation produces pairs like "might" and "night".

In fact, the problem Knuth posed asks for pairs of words in the SGB that are "different by +/-1 at each position," meaning each letter must differ by exactly one. An example of such a pair would be "rover" and "spuds".
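A sketch of the test for that stricter reading (not the code used in this post) would check every position:

```python
def differ_by_one_everywhere(w1, w2):
    # True only if every pair of letters differs by exactly one unit.
    return all(abs(ord(a) - ord(b)) == 1 for a, b in zip(w1, w2))

print(differ_by_one_everywhere("rover", "spuds"))  # True
print(differ_by_one_everywhere("might", "night"))  # False: four positions match exactly
```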

The code that performs the above calculations includes diff_by_one.py and diff_by_n.py. Here is the former:

diff_by_one.py

"""
diff_by_one.py

Donald Knuth, Art of Computer Programming, Volume 4 Fascicle 0
Exercise #28

Find pairs of SGB word vectors that differ by +/-1.

(See IMPORTANT NOTE here: https://charlesreid1.github.io/five-letter-words-part-1-getting-familiar-with-the-list.html)
"""
from get_words import get_words
from euclidean_distance import euclidean_distance

if __name__=="__main__":
    words = get_words()

    ## To limit the output:
    #words = words[:1000]

    k = 0
    off_by_one = []
    for i in range(len(words)):
        for j in range(i,len(words)):
            d = euclidean_distance(words[i],words[j])
            if(abs(d)==1):
                k += 1
                off_by_one.append((words[i],words[j]))
                print("{0:s}, {1:s}".format(words[i],words[j]))

    print("{0:d} words have a Euclidean distance of +/-1.".format(k))

This is a nested for loop that examines all pairs of words. Note that we want to avoid the pair (B,A) if we have already found/printed the pair (A,B), so we use a nested for loop where the inner index starts at the outer index. The core of the script is the euclidean_distance() function, covered above.

This algorithm takes \(O(N^2)\) time due to the nested for loops.
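The pair-enumeration pattern is easy to see in isolation with a tiny hypothetical list. Here the inner index starts at i + 1, a slight variant of the loop above that also skips the self-pair (A, A):

```python
items = ["a", "b", "c"]
pairs = [(items[i], items[j])
         for i in range(len(items))
         for j in range(i + 1, len(items))]
print(pairs)  # [('a', 'b'), ('a', 'c'), ('b', 'c')]
```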

Likewise, here is code to generate pairs that differ by some amount \(n\). This code will only print 10 pairs for each \(n\), to cut down on running time.

diff_by_n.py

"""
diff_by_n.py

Donald Knuth, Art of Computer Programming, Volume 4 Fascicle 0
Variation on Exercise #28

Find pairs of SGB word vectors that differ by +/-n.

(See IMPORTANT NOTE here: https://charlesreid1.github.io/five-letter-words-part-1-getting-familiar-with-the-list.html)
"""
from get_words import get_words
from euclidean_distance import euclidean_distance

def diff_by_n(words, n):
    k = 0
    off_by_n = []
    for i in range(len(words)):
        for j in range(i,len(words)):
            d = euclidean_distance(words[i],words[j])
            if(abs(d)==n):
                k += 1
                off_by_n.append((words[i],words[j]))
                print("{0:s}, {1:s}".format(words[i],words[j]))
        # Limit output: stop once we have found more than 10 pairs.
        if k>10:
            break

    print("{0:d} pairs have a Euclidean distance of +/-{1:d}.".format(k,n))


if __name__=="__main__":
    words = get_words()

    for n in [2,3,4,5]:
        print("-"*40)
        print("Distance of {0:d}".format(n))
        diff_by_n(words, n)

Mo Words, Mo Problems

We have a number of other interesting problems and codes to cover, including:

  • Palindromes
  • Number of unique words
  • Word/letter statistics
  • Words consisting of lexicographically sorted letters
  • Words consisting of distinct letters

Each of these will be covered in a series of posts to follow. Then we'll move on to the business of graph theory, implementation of these five letter words as graphs, and utilization of software libraries designed for graphs and networks (expect some code in Java using Google's excellent Guava library).

You can follow all of this in my five-letter-words repository on Github, and/or in the five-letter-words repository on git.charlesreid1.com.

We have also provided additional information on the charlesreid1 wiki, at Five Letter Words, along with a whole bundle of notes from working through Donald Knuth's Art of Computer Programming at the Art of Computer Programming page of the wiki.

Expect more soon!

References

  1. Knuth, Donald. The Art of Computer Programming. Upper Saddle River, NJ: Addison-Wesley, 2008.

  2. Knuth, Donald. The Stanford GraphBase: A Platform for Combinatorial Computing. New York: ACM Press, 1994. <http://www-cs-faculty.stanford.edu/~knuth/sgb.html>

  3. "Five Letter Words." Git repository, git.charlesreid1.com. Charles Reid. Updated 1 September 2017. <http://git.charlesreid1.com/cs/five-letter-words>

Euler's Theorem, the Totient Function, and Calculating Totients By Hand

Posted in Mathematics

permalink

Table of Contents

Introduction

Today we're going to delve into a little bit of number theory.

In number theory, we are usually dealing with modular arithmetic - expressions of the form:

$$ a \equiv b \mod m $$

or

$$ f(x) \equiv 0 \mod m $$

The mod indicates we're doing modular arithmetic: formally, we are working in an algebraic system called a ring, which consists of the integers 0 through \(m-1\).

An analogy to modular arithmetic is the way that the sine and cosine functions "wrap around":

$$ \sin \left( \dfrac{2 \pi}{3} \right) \equiv \sin \left( \dfrac{8 \pi}{3} \right) \equiv \sin \left( \dfrac{14 \pi}{3} \right) \equiv \dots $$

On a ring, this happens with the integers. So, for example,

$$ 2 \equiv 6 \equiv 10 \equiv \dots \mod 4 $$

Modular arithmetic uses the \(\equiv\) symbol, and not the \(=\) symbol, because we can't manipulate the left and right side using normal rules of algebra - solving equations on a ring requires some care.
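This wrap-around behavior is easy to see in Python, where the `%` operator reduces an integer to its residue on the ring (a quick sketch, not part of the original exercises):

```python
# 2, 6, 10, and 14 all differ by multiples of 4, so they are
# congruent to one another on the ring of integers mod 4.
residues = [x % 4 for x in (2, 6, 10, 14)]
print(residues)  # [2, 2, 2, 2]
```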

The value of \(m\) need not be prime, generally, but if it is, we have some special properties that hold.

Complete and Reduced Residue Systems

Consider the ring of integers \(\mod 10\), which consists of the numbers

$$ \{ 0, 1, 2, 3, 4, 5, 6, 7, 8, 9 \} $$

This is called the complete residue system mod 10. If we want to solve an equation like

$$ 2x \equiv 8 \mod 10 $$

we would normally just divide both sides by 2. But because of the "mod 10" we have to be a bit more careful. "Dividing by 2" really means multiplying both sides by some number that turns 2 into 1 - that is, by a multiplicative inverse of 2.

However, because 2 is a factor of 10, there is no number in the complete residue system that will yield 1 mod 10 when we multiply it by 2:

$$ \begin{eqnarray*} 0 * 2 & \equiv & 0 \mod 10 \\ 1 * 2 & \equiv & 2 \mod 10 \\ 2 * 2 & \equiv & 4 \mod 10 \\ 3 * 2 & \equiv & 6 \mod 10 \\ 4 * 2 & \equiv & 8 \mod 10 \\ 5 * 2 \equiv 10 & \equiv & 0 \mod 10 \\ 6 * 2 \equiv 12 & \equiv & 2 \mod 10 \\ 7 * 2 \equiv 14 & \equiv & 4 \mod 10 \\ 8 * 2 \equiv 16 & \equiv & 6 \mod 10 \\ 9 * 2 \equiv 18 & \equiv & 8 \mod 10 \end{eqnarray*} $$

The same difficulty appears if we try to solve an equation like

$$ 5x \equiv 8 \mod 10 $$

for the same reason - 5 is a factor of 10, so it has no inverse mod 10.

Contrast that with solving an equation like

$$ 3x \equiv 8 \mod 10 $$

which, because 3 does not share any factors with 10, means we can find a number such that 3 times that number yields 1 mod 10:

$$ 3 \times 7 \equiv 21 \equiv 1 \mod 10 $$

so we can solve the equation by multiplying both sides by 7, the inverse of 3:

$$ \begin{eqnarray*} 3 x & \equiv & 8 \mod 10 \\ (7 \times 3) x & \equiv & (7 \times 8) \mod 10 \\ x & \equiv & 56 \mod 10 \\ x & \equiv & 6 \mod 10 \end{eqnarray*} $$
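As an aside, Python 3.8+ can find these modular inverses directly: the three-argument built-in `pow()` accepts an exponent of -1, and raises a `ValueError` when no inverse exists. A small sketch of the solution above:

```python
# The inverse of 3 mod 10 is 7, since 3 * 7 = 21 = 1 (mod 10).
inv = pow(3, -1, 10)
x = (inv * 8) % 10   # multiply both sides of 3x = 8 (mod 10) by the inverse
print(inv, x)        # 7 6

# 2 shares the factor 2 with 10, so it has no inverse mod 10:
try:
    pow(2, -1, 10)
except ValueError:
    print("2 has no inverse mod 10")
```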

We can resolve this by creating a reduced residue system, which is a set of integers that have inverses mod 10. The reduced residue system consists of integers that

  • Have no common factors with the ring size \(m\)
  • Have no two elements that are congruent \(\mod m\)

So a reduced residue system \(\mod 10\) could be, for example,

$$ \{ 1, 3, 7, 9 \} $$

(other reduced residue systems are possible as well).
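A reduced residue system can be generated with a one-line filter over the complete residue system; this sketch uses `math.gcd` to test for common factors:

```python
from math import gcd

# Keep only the integers mod 10 that share no factors with 10.
reduced = [x for x in range(10) if gcd(x, 10) == 1]
print(reduced)  # [1, 3, 7, 9]
```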

It is important to note that we do not include 0, in general, because 0 shares all factors with \(m\) - that is, every number in the complete residue system divides 0, so the greatest common factor of \(0\) and \(m\) is \(m\) and not 1!

(The only exception to this rule is \(m=1\), but this is a trivial case, since every integer is congruent mod 1.)

The reduced residue system has the property that any number in the complete residue system can be generated from the reduced residue system via addition.

Further, the size of the reduced residue system can be expressed using a function called the Euler totient function, denoted \(\phi(m)\). The totient function quantifies the number of integers less than \(m\) that are relatively prime with \(m\) (that is, two numbers such that the greatest common factor, denoted with the shorthand \((a,m)\), is 1).

Euler's Totient Function

Euler's totient function, \(\phi(m)\), turns out to be an extremely useful quantity in number theory. It also provides a quantitative measure of how divisible a number is. Take the two numbers 960 and 961 as examples:

$$ \phi(960) = 256 \qquad \phi(961) = 930 $$

from this, we can see that 960 has many more factors than 961. Here are their prime factorizations:

$$ \begin{eqnarray*} 960 &=& 2^6 \times 3 \times 5 \\ 961 &=& 31^2 \end{eqnarray*} $$

This means that the ring of integers mod 960 will have far more congruences that cannot be solved compared to the ring of integers mod 961.
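The totient values quoted above are easy to verify with a brute-force count (a sketch, not an efficient method, but fine for numbers this small):

```python
from math import gcd

def phi(m):
    """Count the integers in 1..m that are relatively prime with m."""
    return sum(1 for a in range(1, m + 1) if gcd(a, m) == 1)

print(phi(960), phi(961))  # 256 930
```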

Historical Note: The notation \(\phi(n)\) was first used by the mathematician Carl Friedrich Gauss in his incredible book Disquisitiones Arithmeticae, an important historical textbook that focused on gathering all of the results known to that point about number theory in a single work. (Gauss did omit the parenthesis, however, writing the totient function simply as \(\phi n\).)

Calculating the Totient Function by Hand

It may be obvious that the totient function is simple to compute using a computer; but the question naturally arises: can we compute totient functions for large integers by hand?

It turns out we can - we just need to be able to factor the number in question. (Note that this requirement is true generally; see the section on RSA Cryptography in a post to follow).

Let's first illustrate some rules for computing the totient function of composite numbers with some simple examples.

Totient Property: Prime Power

The first useful property is computing the totient function of a number that is a prime number raised to some power. Let's take the simple example of \(81 = 9^2 = 3^4\). We know that any number that shares factors with 81 is a multiple of 3 less than or equal to 81, which is the set of numbers

$$ \{ 1 \times 3, 2 \times 3, 3 \times 3, \dots, 3^{4-1} \times 3 \} $$

and there are \(3^{4-1}\) of these numbers. Thus, of all of the integers from \(1\) to \(3^4\), there are \(3^3\) of them that are not relatively prime with \(3^4\). So the totient function can be written:

$$ \begin{eqnarray*} \phi(81) &=& \phi(3^4) = 3^{4} - 3^{4-1} \\ \phi(81) &=& 81 - 27 \\ \phi(81) &=& 54 \end{eqnarray*} $$

In general, if we are considering the totient function of a prime power \(p^k\), we can write the totient function as

$$ \phi(p^k) = p^{k} - p^{k-1} $$
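We can spot-check the prime power formula against a brute-force count:

```python
from math import gcd

def phi(m):
    """Count the integers in 1..m that are relatively prime with m."""
    return sum(1 for a in range(1, m + 1) if gcd(a, m) == 1)

# phi(p^k) = p^k - p^(k-1), checked for 81 = 3^4:
p, k = 3, 4
print(phi(p**k), p**k - p**(k - 1))  # 54 54
```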

Totient Property: Product of Primes

If we take a composite number like 20, we can split the totient function of 20 into the product of the totient function of the factors of 20:

$$ \begin{eqnarray*} 20 &=& 5 \times 4 \\ \phi(20) &=& \phi(5) \times \phi(4) \\ \phi(20) &=& 4 \times 2 = 8 \end{eqnarray*} $$

which is, indeed, the value of \(\phi(20)\). However, this does not hold generally, as we can see from computing the totient of 50:

$$ \begin{eqnarray*} 50 &=& 5 \times 10 \\ \phi(50) &\neq& \phi(5) \times \phi(10) \\ \phi(50) &\neq& 4 \times 4 \\ \phi(50) &\neq& 16 \end{eqnarray*} $$

In fact, the value of \(\phi(50)\) is 20, not 16! The problem is with our choice of factors, 5 and 10. These two numbers share a common factor of 5, meaning their totient functions do not account for all of the numbers that will be relatively prime with 50.

To fix this, we can further break down 50 into its prime factorization, and compute the totient function of those primes:

$$ \begin{eqnarray*} 50 &=& 2 \times 5^2 \\ \phi(50) &=& \phi(2) \times \phi(5^2) \\ \phi(50) &=& 1 \times ( 5^2 - 5 ) \\ \phi(50) &=& 1 \times 20 \\ \phi(50) &=& 20 \end{eqnarray*} $$

which gives us the correct result of 20.

Generalizing this rule, we can say that we can break down the totient function of a product of two numbers \(s \times t\) into the product of totient functions \(\phi(s)\) and \(\phi(t)\) only if the greatest common factor between \(s\) and \(t\) is 1.
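This coprimality condition is exactly what a brute-force count confirms (a quick sketch):

```python
from math import gcd

def phi(m):
    """Count the integers in 1..m that are relatively prime with m."""
    return sum(1 for a in range(1, m + 1) if gcd(a, m) == 1)

# phi(s*t) = phi(s) * phi(t) only when gcd(s, t) = 1:
print(phi(20) == phi(5) * phi(4))   # True:  gcd(5, 4) = 1
print(phi(50) == phi(5) * phi(10))  # False: gcd(5, 10) = 5
print(phi(50))                      # 20, not 16
```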

Totient Example

Let's consider the following example: suppose we wish to compute

$$ \phi( 280 ) $$

We can start by recognizing some factors of 280 - we can pull out the factors 7 and 4, and 2 and 5. These can be further factored to yield the prime factorization of 280:

$$ 280 = 2^3 \times 5 \times 7 $$

Now we can use the property that the totient function of an integer \(m\) can be expressed as the product of the totient functions of factors of \(m\), as long as those factors are pairwise relatively prime. So we can write \(\phi(280)\) as either of the following equivalent expressions:

$$ \begin{eqnarray*} \phi(280) &=& \phi(40) \times \phi(7) \\ \phi(280) &=& \phi( 2^3 ) \times \phi(5) \times \phi(7) \end{eqnarray*} $$

Using the second expression, we know that

$$ \begin{eqnarray*} \phi(2^3) &=& (2^3 - 2^2) = 4 \\ \phi(5) &=& 4 \\ \phi(7) &=& 6 \end{eqnarray*} $$

for an overall totient function value of

$$ \begin{eqnarray*} \phi(280) &=& 4 \times 4 \times 6 \\ \phi(280) &=& 96 \end{eqnarray*} $$
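The by-hand procedure generalizes into a short function: given a prime factorization, multiply the prime-power totients together. A sketch (the dict-of-exponents input format is my own choice here, not from the post):

```python
def phi_from_factors(factorization):
    """Totient from a prime factorization given as {prime: exponent}."""
    result = 1
    for p, k in factorization.items():
        result *= p**k - p**(k - 1)  # totient of the prime power p^k
    return result

# 280 = 2^3 * 5 * 7
print(phi_from_factors({2: 3, 5: 1, 7: 1}))  # 96
```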

which is indeed the correct result:

Totient function calculation with Wolfram Alpha

Tags:    mathematics    factors    number theory    euler   
