charlesreid1.com research & teaching

Five Letter Words: Part 3: Letter Coverage and Dynamic Programming

Posted in Computer Science

NOTE: The code covered in this post uses Python 3. The scripts can be converted to Python 2 with minimal effort, but the author would encourage any user of Python 2 to "put on your big kid pants" and make the switch to Python 3. Let's all make this painful, drawn-out switch from Python 2 to Python 3 a thing of the past, shall we?

Introduction

The letter/word coverage problem, as presented by Donald Knuth in Volume 4, Fascicle 0 of his masterpiece The Art of Computer Programming, is the problem of finding the minimum number of words from the collection of five-letter words whose letters "cover" (collectively include) the first N letters of the alphabet.

The problem has a couple of variations:

  • Provided a set of letters, search for the smallest number of words that cover those particular letters.
  • Given an integer \(N \leq 26\), search for the smallest number of words that cover the first N letters of the alphabet.
  • The same problem as above, but drawing from the first \(M\) words of the 5757 total five-letter words.

We will focus on the simplest variation: given the first \(N\) letters of the alphabet, find the shortest sequence of words that covers those \(N\) letters.

This is an example of a dynamic programming problem: a combinatorics problem that can be solved by breaking the overall problem down into smaller sub-problems, solving the sub-problems, and assembling the sub-problem solutions into an overall solution.

The procedure is as follows:

  • For each word \(w_i\), we begin by assuming that this word is the best solution on its own. This forms the base case/starting solution.
  • Next, we examine every prior word \(w_j, j<i\), and compare each pairing against using the word \(w_i\) by itself.
  • For each pair of words, we take the union (OR) of the character coverage bit vector for word \(w_i\) and the solution bit vector for word \(w_j\) (that is, the best-covered solution so far ending at word \(w_j\)).
  • (Note: for word \(w_i\), we will store one of these unions as the best-covered solution so far for word \(w_i\), but we aren't sure which one yet.)
  • For a given pair of words \(w_j\) and \(w_i\), we are considering extending the best solution ending at word \(w_j\) with the word \(w_i\). Adding \(w_i\) may or may not improve that solution, so we need to decide whether the extension is worthwhile.
  • Compute the number of letters covered by the union of \(w_i\) and the best solution so far for word \(w_j\) (by, e.g., summing the 1s in the combined bit vector).
  • Compute the number of words in the best solution so far for word \(w_j\), and add one to it (representing the new word \(w_i\) being added).
  • We are searching for the prior word \(w_j\) whose solution leads to the maximum number of 1s in the combined bit vector.
  • We break ties by picking the word \(w_j\) that minimizes the total number of words.
  • Once we find the best word \(w_j\), we save three things under word \(w_i\): the combined solution bit vector (the union for words \(w_i\) and \(w_j\)), the number of 1s in that bit vector, and the number of words in the solution so far.

Once we have gone through every word, we are ready to find the minimum. Do this by:

  • Search through the solutions for every word, and pick out the one that maximizes the number of 1s in the solution bit vector (or, rather, that has the required number of 1s) while minimizing the total number of words.
  • To recover the actual sequence of words, rather than just the minimum number of words, we save, for each word, the prior word that leads to the maximum number of 1s and minimum number of words. Then, at the end, we can backtrack through the words that compose the solution.

This is a bit complicated to explain in words, so we'll give a small example, then some pseudocode. Then we'll present the actual Python program that accomplishes this task.

A Simple Manual Example

Let's walk through an example manually to illustrate the approach:

Suppose we are considering 2-letter words taken from a 5-letter alphabet abcde. We can represent a given word as a binary string or bit vector: for example, the two-letter word aa would be represented by the bit vector 10000, ab would be represented by the bit vector 11000, etc.
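This representation is easy to compute. Here is a minimal sketch (the helper name word2bv is just for illustration; the full program later uses a word2bitvector function):

```python
def word2bv(word, n=5):
    # Bit vector of letter coverage over the first n letters (here, abcde)
    bv = [0] * n
    for c in word:
        i = ord(c) - ord('a')
        if 0 <= i < n:
            bv[i] = 1
    return bv

print("".join(map(str, word2bv("aa"))))  # 10000
print("".join(map(str, word2bv("ab"))))  # 11000
```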

Now let's consider a set of words, and step through the algorithm with them.

W0 = aa = 10000
W1 = ab = 11000
W2 = bc = 01100
W3 = aa = 10000
W4 = dd = 00010
W5 = de = 00011
W6 = bb = 01000

Now, we wish to write a dynamic program that will find the smallest set of words such that taking the union of each bit vector for each of the words in the set will yield the bit vector 11111. At each step, we seek the words that will maximize the number of 1s in the union of the bit vectors, while minimizing the number of words. We take the union of the "longest sequence of 1s" bit vector from the prior step, plus the bit vector from the current step.
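The union and the count-of-1s operations are both cheap; for example, with NumPy boolean arrays (which the final program uses), taking the union is a logical OR and counting the 1s is a sum:

```python
import numpy as np

w1 = np.array([1, 1, 0, 0, 0], dtype=bool)  # ab
w2 = np.array([0, 1, 1, 0, 0], dtype=bool)  # bc
union = np.logical_or(w1, w2)               # 11100
print(int(union.sum()))                     # 3 letters covered
```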

W0: aa

Start with word W0: this is the only bit vector, so it sets the starting "largest sequence of 1s" bit vector. We wish to maximize "largest sequence of 1s" and minimize number of words.

  • only W0 as solution is therefore \(10000\). The number of 1s is 1. The number of words is 1. (W0 SOLUTION)

W1: ab

Next is word W1: start by assuming W1 by itself sets the "largest sequence of 1s" bit vector, then compare against unions with prior solutions. We wish to maximize "largest sequence of 1s" and minimize number of words. Two options:

  • only W1 as solution is therefore \(11000\). The number of 1s is 2. The number of words is 1. (W1 SOLUTION)
  • union of W0 solution and W1 \(10000 \bigcup 11000 = 11000\). The number of 1s is 2. The number of words is 2.

W2: bc

Next is word W2: the "largest sequence of 1s" bit vector is the union that maximizes the number of 1s and minimizes the number of words. Three options:

  • only W2 as solution is \(01100\). The number of 1s is 2. The number of words is 1.
  • union of W0 solution and W2 \(10000 \bigcup 01100 = 11100\). The number of 1s is 3. The number of words is 2. (W2 SOLUTION)
  • union of W1 solution and W2 \(11000 \bigcup 01100 = 11100\). The number of 1s is 3. The number of words is 2.

W3: aa

Next is word W3: the "largest sequence of 1s" bit vector is the union that maximizes the number of 1s and minimizes the number of words. Four options:

  • only W3 as solution is \(10000\). The number of 1s is 1. The number of words is 1.
  • union of W0 solution and W3 \(10000 \bigcup 10000 = 10000\). The number of 1s is 1. The number of words is 2.
  • union of W1 solution and W3 \(11000 \bigcup 10000 = 11000\). The number of 1s is 2. The number of words is 2.
  • union of W2 solution and W3 \(11100 \bigcup 10000 = 11100\). The number of 1s is 3. The number of words is 3. (W3 SOLUTION)

W4: dd

Next is word W4: the "largest sequence of 1s" bit vector is the union that maximizes the number of 1s and minimizes the number of words. Five options:

  • only W4 as solution is \(00010\). The number of 1s is 1. The number of words is 1.
  • union of W0 solution and W4 \(10000 \bigcup 00010 = 10010\). The number of 1s is 2. The number of words is 2.
  • union of W1 solution and W4 \(11000 \bigcup 00010 = 11010\). The number of 1s is 3. The number of words is 2.
  • union of W2 solution and W4 \(11100 \bigcup 00010 = 11110\). The number of 1s is 4. The number of words is 3. (W4 SOLUTION)
  • union of W3 solution and W4 \(11100 \bigcup 00010 = 11110\). The number of 1s is 4. The number of words is 4.

W5: de

Next is word W5: the "largest sequence of 1s" bit vector is the union maximizing number of 1s and minimizing number of words. Six options:

  • only W5 as solution is \(00011\). The number of 1s is 2. The number of words is 1.
  • union of W0 solution and W5 \(10000 \bigcup 00011 = 10011\). The number of 1s is 3. The number of words is 2.
  • union of W1 solution and W5 \(11000 \bigcup 00011 = 11011\). The number of 1s is 4. The number of words is 2.
  • union of W2 solution and W5 \(11100 \bigcup 00011 = 11111\). The number of 1s is 5. The number of words is 3. (W5 SOLUTION)
  • union of W3 solution and W5 \(11100 \bigcup 00011 = 11111\). The number of 1s is 5. The number of words is 4.
  • union of W4 solution and W5 \(11110 \bigcup 00011 = 11111\). The number of 1s is 5. The number of words is 4.

W6: bb

Next is word W6: the "largest sequence of 1s" bit vector is the union maximizing number of 1s and minimizing number of words. Seven options:

  • only W6 as solution is \(01000\). The number of 1s is 1. The number of words is 1.
  • union of W0 solution and W6 \(10000 \bigcup 01000 = 11000\). The number of 1s is 2. The number of words is 2.
  • union of W1 solution and W6 \(11000 \bigcup 01000 = 11000\). The number of 1s is 2. The number of words is 2.
  • union of W2 solution and W6 \(11100 \bigcup 01000 = 11100\). The number of 1s is 3. The number of words is 3.
  • union of W3 solution and W6 \(11100 \bigcup 01000 = 11100\). The number of 1s is 3. The number of words is 4.
  • union of W4 solution and W6 \(11110 \bigcup 01000 = 11110\). The number of 1s is 4. The number of words is 4.
  • union of W5 solution and W6 \(11111 \bigcup 01000 = 11111\). The number of 1s is 5. The number of words is 4. (W6 SOLUTION)

(NOTE: We don't need to consider every possible combination of words W0 through W6. Each word's best solution can be written in terms of a single prior word's best solution, so we only need to consider one solution per word. The combinations that did not become solutions have already been considered and discarded, because they do not maximize the number of 1s while minimizing the number of words.)

Thus far, we have found a "local" solution for each word. We can now compare all of these local solutions to find a "global" solution. The global solution maximizes the number of 1s (meaning we can toss out any solutions with fewer than five 1s) and minimizes the total number of words (meaning our W5 solution gives us the global optimum).

Therefore our global solution is the W5 solution: five 1s, and 3 words. Backtracking, we see that the words W0, W2, and W5 cover all of the first five letters, with the minimum number of total words.

W0 = aa = 10000
W2 = bc = 01100
W5 = de = 00011

Pseudocode

Here is the pseudocode for the program. We utilize one function to compute the letter coverage bit vector for a single word, and the rest of the functionality will go in the main method:

function word2bitvector(word):
    initialize 26-element bit vector with 0s (one 0 per letter)
    for each letter in word:
        turn the bit for this letter to 1
    return bit vector

function main():

    // initialization step:
    initialize best coverage bit vector
    initialize maximum number of 1s (the number of letters N we wish to cover)
    initialize number of words in current solution
    initialize backtracking array (for constructing final solution)

    // outer loop fencepost step:
    set things up for word 0 (base case)

    // loop through each word
    for each word in words:
        // skip word 0 (base case)

        // inner loop fencepost step:
        initialize best solution for word i as word i by itself

        for each prior word j < i:
            compute the new potential best coverage bitvector
            compute the number of 1s in the new potential best coverage bit vector
            compute number of words in new potential best solution
            if this solution is better than current best solution:
                overwrite best solution with current solution

    // get solution:
    find maximum indices of vector of number of 1s 
    // (this is potentially multiple indices, representing multiple 
    //  solutions that satisfy the coverage we want)
    find minimum number of words corresponding to each of the coverage indices
    backtrack through solution indices

Python Code

The code for this solution can be found here: letter_coverage.py

This code is as follows:

Start with the word-to-bit vector function:

import numpy as np

def word2bitvector(word,N):
    """
    Turns a word into an N-element bit vector representing
    which of the first N letters of the alphabet the word covers.
    """
    bit_vector = [False,]*N
    for c in word:
        i = ord(c)-ord('a')
        try:
            bit_vector[i] = True
        except IndexError:
            pass
    return np.array(bit_vector)

We also implement a few helper methods: the first turns a boolean bit vector into a pretty string of 0s and 1s:

def printbv(bv):
    result = ""
    for bit in bv:
        if bit:
            result += "1"
        else:
            result += "0"
    return result

The second method is our all-important backtracking to obtain the actual sequence of words that achieves the desired coverage, instead of just a count of the minimum number of words it takes to cover the first \(N\) letters:

def btsolution(min_key, min_val, words, bt):
    """
    Reconstruct the sequence of words that gives maximum coverage and minimum word count.

    Input: minimum word key (index of last word), minimum value (number of words),
    list of words, backtracking list (prior word indices)

    Output: list of words
    """
    solution = []
    solution.append(words[min_key])
    prior_key = bt[min_key]
    while prior_key != -1:
        solution.append(words[prior_key])
        prior_key = bt[prior_key]
    return reversed(solution)
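As a quick sanity check of the backtracking idea, here is the function applied to the toy example from earlier (the function is redefined here so the snippet stands alone; the bt chain encodes W5 ← W2 ← W0):

```python
def btsolution(min_key, min_val, words, bt):
    # Walk the chain of prior-word indices until we hit -1 (no prior word)
    solution = []
    solution.append(words[min_key])
    prior_key = bt[min_key]
    while prior_key != -1:
        solution.append(words[prior_key])
        prior_key = bt[prior_key]
    return reversed(solution)

words = ["aa", "ab", "bc", "aa", "dd", "de", "bb"]
bt = [-1, -1, 0, 2, 2, 2, 5]  # prior-word indices from the manual example
print(list(btsolution(5, 3, words, bt)))  # ['aa', 'bc', 'de']
```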

Finally, we get to the meat of the method: the dynamic program. Start with some initialization. This is where we set the number of letters we want to cover, and limit the "vocabulary" if desired:

if __name__=="__main__":

    # Searching for words covering first N letters
    N = 13

    words = get_words()

    # If we want to restrict our search to the first M words,
    #words = words[:1000]

We begin with the initialization step:

    # Initialization:
    # ----------------

    # Store best coverage bitvectors for each word
    bestcoverage_bv = [np.array([False]*N) for k in range(len(words))]

    # Store number of 1s for best coverage vector for each word
    ones_bv = [-1]*len(words)

    # Store number of words in best solution for each word
    ws = [0]*len(words)

    # Store prior word for backtracking
    bt = [-1]*len(words)

Next comes the fencepost initialization step, where we initialize the solution for word 0:

    # Fencepost: Initial Step
    # Word 0
    # ----------------

    i = 0

    # Start with word 0
    wi = words[i]

    # Best letter coverage bit vector
    bestcoverage_bv[i] = word2bitvector(words[i],N)

    # Length of 1s
    ones_bv[i] = sum(bestcoverage_bv[i])

    # Number of words in best solution:
    ws[i] = 1

    # Backtracking: first word has no prior word
    bt[i] = -1

Next, we loop over each word \(w_i, i>0\):

    # Start by assuming the word by itself, 
    # and then examine each possible pairing
    for i in range(1,len(words)):
        wi = words[i]

        # Start with bitvector of word i's coverage
        wi_bv = word2bitvector(wi,N)

        # Fencepost: initial step
        # Word i by itself
        # Assume word i is the first word in the solution,
        # and if we find a better combination with prior word,
        # overwrite this solution.
        # ------------------------

        # Best coverage so far (first guess) is word i by itself
        bestcoverage_bv[i] = wi_bv

        # Count ones in (first guess) best bitvector
        ones_bv[i] = sum(bestcoverage_bv[i])

        # Number of words in new best solution:
        ws[i] = 1

        # Backtracking
        bt[i] = -1

        # Boolean: is this the first word in the sequence of solutions?
        first = True

We started by assuming that each word \(w_i\) provides a best solution by itself; the next step is to consider each pairing of \(w_i\) with prior words \(w_j\), and update our current solution if we find a better one:

        # Now loop over the rest of the words,
        # and look for a better solution.
        for j in reversed(range(0,i)):

            # Get the prior word
            wj = words[j]

            # Get best coverage bitvector 
            wj_bv = bestcoverage_bv[j]

            # (potential) new combined coverage vector
            bestcoverage_bv_i = np.logical_or(wi_bv, wj_bv)

            # Number of ones in (potential) new combined coverage vector
            ones_bv_i = sum(bestcoverage_bv_i)

            # Number of words in (potential) new best solution
            ws_i = ws[j]+1

            # If this solution is better than our current one,
            # overwrite the current solution.
            # (Better means, "more ones", or "same ones and fewer words".)

            if( (ones_bv_i > ones_bv[i]) or (ones_bv_i==ones_bv[i] and ws_i < ws[i]) ):
                bestcoverage_bv[i] = bestcoverage_bv_i
                ones_bv[i] = ones_bv_i
                ws[i] = ws_i
                bt[i] = j

                # This word now follows another word in the sequence of solutions
                first = False

            # It's tempting to stop early,
            # but what if we find the perfect 
            # solution right at the end?!?

Now that we have found the coverage for each word, and the corresponding number of words in that coverage solution, we find the solution that achieves the desired coverage while minimizing the number of words, so that we can construct the actual solution:

    # Okay, now actually get the solution.
    # The solution is the maximum of ones_bv and the minimum of ws
    # 
    # Start by finding the maximum(s) of ones_bv
    # Then check each corresponding index of ws
    ones_bv_indices = [k for k,v in enumerate(ones_bv) if v==max(ones_bv)]

    # Among the solutions with maximum coverage,
    # pick the one with the fewest words
    min_key = ones_bv_indices[0]
    for ix in ones_bv_indices[1:]:
        if(ws[ix] < ws[min_key]):
            min_key = ix
    min_val = ws[min_key]

    print("Min key: word "+str(min_key)+" = "+words[min_key])
    print("Min val: "+str(min_val)+" words to cover "+str(N)+" letters")

    pprint(list(btsolution(min_key, min_val, words, bt)))

Output and Timing

Let's take a look at some example output from the program. This program only considers the first 1,000 words in the five-letter word list:

$ time py letter_coverage.py
Takes 9 words to cover 15 letters
['which',
 'their',
 'about',
 'could',
 'after',
 'right',
 'think',
 'major',
 'level']

real    0m17.226s
user    0m17.090s
sys     0m0.087s

Here's the same program, considering all 5,757 words:

$ time py letter_coverage.py
Takes 9 words to cover 15 letters
['which',
 'their',
 'about',
 'could',
 'after',
 'right',
 'think',
 'major',
 'level']

real    9m29.619s
user    9m24.360s
sys 0m1.958s

Note that the algorithm is \(O(N^2)\), since it iterates over each word, and for each word, it examines each possible pairing with a preceding word. Thus, if we increase the number of words by a factor of 6, we expect the runtime to increase by a factor of 36, for an estimated runtime of \(36 \times 17 \mbox{ seconds} \approx 10 \mbox{ minutes}\), which is pretty close to what we see above.

Five Letter Words: Part 2: More Five-Word Algorithms

Posted in Computer Science

Introduction

In Five Letter Words: Part 1, we introduced Donald Knuth's list of five-letter words, one of the data sets in the Stanford GraphBase, which Knuth covers in greater detail in his treatment of graph theory in Volume 4, Fascicle 0 of his magnum opus, The Art of Computer Programming.

In the section where Knuth introduces the set of words, he also gives readers several exercises to get to know the list of words. This multi-part series of posts (also see Five Letter Words: Part 1) is covering some of the solutions to these exercises, and expanding on them to illustrate some of the interesting and surprising properties of this data set.

Five-Letter Words with k Distinct Letters

In Exercise 27, Knuth asks the reader to make a list of words composed of a specific number of distinct letters (1, 2, 3, 4, or 5).

In the list of five-letter words, there are 0 words composed of a single letter, 4 words with two distinct letters (0.07%), 163 words with three distinct letters (2.8%), 1756 words with four distinct letters (30.5%), and 3834 words with five distinct letters (66.5%).

Here are a few examples:

  • Two distinct letters: mamma, ahhhh, esses, ohhhh
  • Three distinct letters: added, seems, sense, level, teeth
  • Four distinct letters: which, there, these, where, three
  • Five distinct letters: their, about, would, other, words

To find these, we can design an algorithm that does the following: split each string into characters, add them to a set data type (a set discards any duplicates), and get the size of the set. This will give us the number of unique letters in a given word, and we can use a list of lists to store all words with a specified number of unique letters.
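The core of that algorithm is a one-liner; a minimal sketch (the helper name distinct_letters is just for illustration):

```python
def distinct_letters(word):
    # Sets discard duplicates, so the set's size is the distinct-letter count
    return len(set(word))

print(distinct_letters("mamma"))  # 2
print(distinct_letters("which"))  # 4
print(distinct_letters("their"))  # 5
```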

Once again, we're using our get_words function, which we covered in Part 1. See get_words.py for that script.

distinct.py

"""
distinct.py

Donald Knuth, Art of Computer Programming, Volume 4 Fascicle 0
Exercise #27

How many SGB words contain exactly k distinct letters, for 1 <= k <= 5?
"""
from pprint import pprint
from get_words import get_words

if __name__=="__main__":
    words = get_words()

    lengths = [[] for i in range(5+1)]

    for word in words:
        k = len(set(word))
        lengths[k].append(word)

    for i in range(1,5+1):
        print("-"*40)
        print("Number of words with {0:d} letters: {1:d}".format(i, len(lengths[i])))
        print(", ".join(lengths[i][0:5]))

The principal operation here is the statement that gets the length, k:

k = len(set(word))
lengths[k].append(word)

The operation of turning a word into a set is \(O(M)\), where M is the number of letters in the word, and the algorithm performs this operation on each word in sequence, so overall, the algorithm is \(O(N)\), where N is the number of words.

The storage space used by the algorithm is also \(O(N)\), since for each word, the number of distinct letters \(k \in \{ 0 \dots 5 \}\).

If we were dealing with a lot of words, and needed to save some space, we could represent the list of words with \(k\) distinct letters using five bit vectors, where each bit vector represents the words that are composed of \(k\) distinct letters, and has a length of \(N\), the number of words. A 0 would indicate the word is not in the set (is not composed of \(k\) letters), and a 1 would indicate the opposite.
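That space-saving representation might be sketched as follows (a sketch only; the variable names here are illustrative, not from the original script):

```python
words = ["which", "mamma", "level", "their", "added"]

# One boolean vector per k; entry j is True iff words[j] has k distinct letters
membership = {k: [len(set(w)) == k for w in words] for k in range(1, 6)}

print(membership[2])  # [False, True, False, False, False] -- only "mamma"
```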

But here, we keep it simple, since we have a small, known set of words.

Examining a Variation

While that's essentially all there is to this algorithm, and it takes all of 10 seconds to come up with the idea, there are some nuances and some bookkeeping details, as there are with the design of any algorithm.

For example, compare the following two approaches; Approach 1 is used in the program, Approach 2 is a less efficient approach:

    # Approach 1:
    for word in words:
        k = len(set(word))
        lengths[k].append(word)


    # Approach 2:
    for k in range(1,5+1):
        for word in words:
            if(len(set(word))==k):
                lengths[k].append(word)

While these are both \(O(N)\) runtime, the latter approach is inefficient: we loop over each word five times, and each time we perform the same operation (turning the letters of a word into a set).

Is there ever a case where we would want an approach like #2?

The short answer is, never.

To give a longer answer, let's consider a case where approach #2 might provide an advantage. Suppose \(k\) could be larger - a list of 15-letter words, for example, so \(k\) could be up to 15 - and we were only interested in a particular value, or a small set of values, of \(k\), like 3 and 4. Approach #1 would store unnecessary intermediate results (the values of \(k\) for all words) and therefore use extra space, compared with approach #2, where we could change the outer loop to for k in [3,4]:.

Even here, though, approach #2 results in unnecessary work, because approach #1 is still computationally more efficient by looping over the list of words only once, compared with approach #2, which would loop over the list of words twice.

We may further consider a case where approach #2 would give us an advantage, and that is the case where we are copying data into the list lengths, instead of just storing a reference to a string. Because we only deal with references in Python, we aren't making copies in the code given above. But because strings are immutable, we could conceivably be making copies if we stored word.upper() instead of word. Approach #2 would use less space, because it only considers the values of k that are of interest.

But even here, approach #1 requires only a small modification to wipe out the space advantage of approach #2: add an if statement before calling the append function: if k in [3,4]. Now the calculation of turning a word into a set of characters is performed only once for approach #1, and we don't end up storing unnecessary intermediate results.
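That modified approach #1 might look like the following sketch (the variable names keep and lengths are illustrative):

```python
words = ["which", "added", "seems", "their", "mamma"]
keep = {3, 4}  # only the k values of interest

lengths = {k: [] for k in keep}
for word in words:
    k = len(set(word))  # computed once per word
    if k in keep:       # store only the values of k we care about
        lengths[k].append(word)

print(lengths[3], lengths[4])
```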

The take-home lesson: even if the core idea behind an algorithm is straightforward, there are still many ways to do it better or worse.

Lexicographic Ordering of Letters

Knuth points out that the word "first" contains letters that occur in lexicographic order. Exercise #30 of AOCP Volume 4 Fascicle 0 asks us to find the first and last such words that occur in Knuth's set of five-letter words.

To do this, we'll take each word and turn it into a list of characters. We'll then sort the characters, and turn the sorted list of characters back into a string. If the string constructed from sorted characters equals the original string, we have our word, formed from lexicographically ordered letters.

We can also perform the reverse - search for words whose letters are in reverse lexicographic order. One such word is "spied". Implementing this task requires a bit more care, because of the fact that Python 3 returns generators where Python 2 would return lists, but we can get around this with the list() function, as we shall see shortly.

Five-Letter Words with Lexicographically Ordered Letters

Exercise 30 asks us to find the first and last word in the set of five letter words whose letters occur in sorted lexicographic order. We begin by sorting all of the words, and we find the first such word is "abbey", while the last such word is "pssst".

There are 105 total words that fit this description. As we might expect, a majority of them begin with letters at the beginning of the alphabet:

  • abbey
  • abbot
  • abhor
  • abort
  • abuzz
  • achoo
  • adder
  • adept
  • adios
  • adopt
  • aegis
  • affix
  • afoot
  • aglow
  • ahhhh
  • allot
  • allow
  • alloy
  • ammos
  • annoy
  • beefs
  • beefy
  • beeps
  • beers
  • beery
  • befit
  • begin
  • begot
  • bells
  • belly
  • below
  • berry
  • bills
  • billy
  • bitty
  • blowy
  • boors
  • boost
  • booty
  • bossy
  • ceils
  • cello
  • cells

The full output is here:

lexico output

The code to find these words is given below:

lexico.py

"""
lexico.py

Donald Knuth, Art of Computer Programming, Volume 4 Fascicle 0
Exercise #30

Each letter of the word "first" appears in correct lexicographic order.
Find the first and last such words in the SGB words.
"""
from get_words import get_words

def in_sorted_order(word):
    chars = list(word)
    if(str(chars)==str(sorted(chars))):
        return True
    else:
        return False

if __name__=="__main__":

    words = get_words()
    words = sorted(words)

    count = 0
    print("-"*40)
    print("ALL lexicographically sorted words:")
    for word in words:
        if(in_sorted_order(word)):
            print(word)
            count += 1
    print("{0:d} total.".format(count))

    print("-"*40)
    for word in words:
        if(in_sorted_order(word)):
            print("First lexicographically sorted word:")
            print(word)
            break

    words.reverse()

    print("-"*40)
    for word in words:
        if(in_sorted_order(word)):
            print("Last lexicographically sorted word:")
            print(word)
            break

The heart of the method here is the in_sorted_order() function: this performs the task, as described above. We take the word passed to the function (a string), and turn it into a list of characters using the list() function. We then compare the string representation of that character list against the string representation of the sorted character list, via the call str(sorted(chars)). (Note that str(chars) is not the word itself, but the printed form of the character list; since both sides of the comparison are built the same way, the test is still valid.)

If the two match, we have not affected the order of characters by sorting them in lexicographic (alphabetic) order, and therefore the original string was in sorted order, and we return True. Otherwise, we return False.

Here's that method one more time:

def in_sorted_order(word):
    chars = list(word)
    if(str(chars)==str(sorted(chars))):
        return True
    else:
        return False

Five-Letter Words with Lexicographically Reversed Letters

There are significantly fewer five-letter words whose letters are in reverse lexicographic order - 37, compared to the 105 in sorted order. Here is the full list:

  • mecca
  • offed
  • ohhhh
  • plied
  • poked
  • poled
  • polka
  • skied
  • skiff
  • sniff
  • soled
  • solid
  • sonic
  • speed
  • spied
  • spiff
  • spoke
  • spoof
  • spook
  • spool
  • spoon
  • toked
  • toned
  • tonic
  • treed
  • tried
  • troll
  • unfed
  • upped
  • urged
  • vroom
  • wheee
  • wooed
  • wrong
  • yoked
  • yucca
  • zoned

The code to do this requires only minor modifications to the original, sorted order code.

To reverse the procedure, we just need to modify the in_sorted_order() function to reverse the sorted list of characters before we reassemble it into a string. We can feed the output of the call to sorted() to the reversed() function. However, in Python 3, this returns a generator object, which is lazy - it does not automatically enumerate every character. Unless, of course, we force it to.

That's where the call to list() comes in handy - by passing a generator to list(), we force Python to enumerate the output of the reversed, sorted list generator. Then we turn the reversed, sorted list into a reversed, sorted string:

def in_reverse_sorted_order(word):
    chars = list(word)
    # Note: reversed returns a generator,
    # so we have to pass it to list()
    # to explicitly enumerate the reversed results.
    if(str(chars)==str(list(reversed(sorted(chars))))):
        return True
    else:
        return False

Meanwhile, the rest of the script can stay virtually the same.

reverse_lexico.py

"""
reverse_lexico.py

Donald Knuth, Art of Computer Programming, Volume 4 Fascicle 0
Variation on Exercise #30

Each letter of the word "spied" appears in reversed lexicographic order.
Find more words whose letters appear in reverse lexicographic order.
"""
from get_words import get_words

def in_reverse_sorted_order(word):
    chars = list(word)
    # Note: reversed returns a generator, 
    # so we have to pass it to list() 
    # to explicitly enumerate the reversed results.
    if(str(chars)==str(list(reversed(sorted(chars))))):
        return True
    else:
        return False

if __name__=="__main__":

    words = get_words()
    words = sorted(words)

    count = 0
    print("-"*40)
    print("ALL lexicographically reversed words:")
    for word in words:
        if(in_reverse_sorted_order(word)):
            print(word)
            count += 1
    print("{0:d} total.".format(count))

    print("-"*40)
    for word in words:
        if(in_reverse_sorted_order(word)):
            print("First reverse lexicographically sorted word:")
            print(word)
            break

    words.reverse()

    print("-"*40)
    for word in words:
        if(in_reverse_sorted_order(word)):
            print("Last reverse lexicographically sorted word:")
            print(word)
            break

Finding Palindromes

Palindromes are words or sets of words that have a reflective property, namely, they spell the same thing forward and reverse (e.g., "race car", or "Ere I was able, I saw Malta", or "Doc, note I dissent - a fast never prevents a fatness. I diet on cod.").

In Exercise 29, Knuth asks the reader to perform a straightforward task - find the palindromes in the list of five letter words. (An example of one such word is "kayak".) But Knuth goes further, and points out that palindromes can also be formed from pairs of words. He gives the example "regal lager". He asks the reader to find all palindrome pairs as well.

When working on these exercises, we became curious about palindromic near-misses. How many words are almost palindromes? (Example: "going" is very close to a palindrome, if we just changed the n to an o or vice-versa.) In fact, we already have all the tools we need at our disposal, as we already covered a script to perform a Euclidean distance calculation.

We will cover Python code to find words that fit into each of these categories, and provide some interesting examples. (One of the most surprising things to us was just how many words meet these criteria!)

Palindromes

The first task is finding palindromes in the set of five letter words. There are 18 such words. They are given below:

  • level
  • refer
  • radar
  • madam
  • rotor
  • civic
  • sexes
  • solos
  • sagas
  • kayak
  • minim
  • tenet
  • shahs
  • stats
  • stets
  • kaiak
  • finif
  • dewed

The code to check if a word is a palindrome consists of two simple logical tests: Is the character at position 0 equal to the character at position 4? Is the character at position 1 equal to the character at position 3? If both are true, the word is a palindrome. Here's the Python function to check if a word is a palindrome:

palindromes.py

def is_palindrome(word):
    test1 = word[0]==word[4]
    test2 = word[1]==word[3]
    if(test1 and test2):
        return True
    return False

and the main driver method, which actually runs the function on each word:

if __name__=="__main__":
    words = get_words()

    kp = 0
    palindromes = []

    # Check for palindromes
    for i in range(len(words)):
        if(is_palindrome(words[i])):
            kp += 1
            palindromes.append(words[i])

    print("-"*40)
    print("Palindromes: \n")
    print(", ".join(palindromes))
    print("There are {0:d} palindromes.".format(kp))

Palindrome Pairs

There are 34 palindromic pairs, if we disallow palindromes from being considered palindromic pairs with themselves. These are:

  • parts, strap
  • lived, devil
  • speed, deeps
  • sleep, peels
  • straw, warts
  • faced, decaf
  • spots, stops
  • fires, serif
  • lever, revel
  • smart, trams
  • ports, strop
  • pools, sloop
  • stool, loots
  • draws, sward
  • mined, denim
  • spins, snips
  • alley, yella
  • loops, spool
  • sleek, keels
  • repel, leper
  • snaps, spans
  • depot, toped
  • timed, demit
  • debut, tubed
  • laced, decal
  • stink, knits
  • regal, lager
  • tuber, rebut
  • remit, timer
  • pacer, recap
  • snoot, toons
  • namer, reman
  • hales, selah
  • tarps, sprat

The code to check for palindrome pairs is a little more involved, but also consists of a few logical tests to see if letters in one position of the first word match letters in another position of the second word:

palindromes.py

def is_palindrome_pair(word1,word2):
    test0 = word1[0]==word2[4]
    test1 = word1[1]==word2[3]
    test2 = word1[2]==word2[2]
    test3 = word1[3]==word2[1]
    test4 = word1[4]==word2[0]
    if(test0 and test1 and test2 and test3 and test4):
        return True
    return False

and the main driver method:

if __name__=="__main__":
    words = get_words()

    kpp = 0
    palindrome_pairs = []

    # Check for palindrome pairs
    for i in range(len(words)):
        for j in range(i,len(words)):
            if(is_palindrome_pair(words[i],words[j])):
                # Palindromes shouldn't count as palindrome pairs
                if(words[i] != words[j]):
                    kpp += 1
                    palindrome_pairs.append((words[i],words[j]))

    print("-"*40)
    print("Palindrome Pairs: \n")
    for pair in palindrome_pairs:
        print(", ".join(pair))
    print("There are {0:d} palindrome pairs.".format(kpp))

Near Palindromes

A near-palindrome is a word that would be a palindrome, if one of its letters were slightly modified. We use a "tolerance" parameter to specify how much modification we are willing to live with to consider a word a near-palindrome.

There are several ways to do this, but we'll keep it simple: we total the number of changes to all characters in the word required to make it a palindrome, and test whether that total is less than or equal to a specified parameter, tolerance.

For example, if our tolerance were 1, we would consider the words "going" and "moron" to be near-palindromes; if our tolerance were 2, we would consider the words "tsars" and "jewel" to be near-palindromes.
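To make the tolerance concrete, here is a minimal sketch of the tallying idea (the helper names char_distance() and palindrome_defect() are hypothetical; the actual script below uses the euclidean_distance() function from Part 1):

```python
def char_distance(a, b):
    # Number of unit changes between two letters, e.g. 'n' to 'o' is 1.
    return abs(ord(a) - ord(b))

def palindrome_defect(word):
    # Total per-position change needed to turn a five-letter
    # word into a palindrome: compare positions 0/4 and 1/3.
    return char_distance(word[0], word[4]) + char_distance(word[1], word[3])

print(palindrome_defect("kayak"))   # 0 - already a palindrome
print(palindrome_defect("going"))   # 1 - 'n' is one step from 'o'
print(palindrome_defect("jewel"))   # 2 - 'j' is two steps from 'l'
```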

Here is the list of 37 off-by-one palindromes:

  • going
  • seeds
  • tight
  • trust
  • suits
  • sends
  • plump
  • slums
  • sighs
  • erase
  • serfs
  • soaps
  • sewer
  • soups
  • sever
  • slams
  • scabs
  • moron
  • ceded
  • scads
  • suets
  • fugue
  • seder
  • tryst
  • educe
  • twixt
  • tutus
  • shags
  • slims
  • abaca
  • anima
  • celeb
  • selfs
  • scuds
  • tikis
  • topos
  • rajas

and the list of off-by-two palindromes:

  • often
  • stars
  • sight
  • visit
  • towns
  • climb
  • flame
  • reads
  • sings
  • hatch
  • tends
  • naval
  • robot
  • reeds
  • cocoa
  • stout
  • spins
  • onion
  • sinks
  • edged
  • spurs
  • jewel
  • snaps
  • silks
  • nasal
  • theft
  • pagan
  • reefs
  • stirs
  • snips
  • tufts
  • truss
  • strut
  • spans
  • smelt
  • spars
  • flake
  • rusts
  • skims
  • sways
  • runts
  • tsars
  • tress
  • feted
  • rends
  • romps
  • cilia
  • ephod
  • fluke
  • reset
  • farad
  • peter
  • natal
  • thugs
  • newel
  • paean
  • emend
  • snoot
  • fiche
  • porno
  • flume
  • toons
  • roans
  • offen
  • klunk
  • feued
  • nihil
  • pavan
  • relet
  • heigh
  • revet
  • sicks
  • spoor

The check for near-palindromes follows the palindrome test fairly closely, except instead of checking whether the letters in two positions are equal, we check whether those two letters are within a certain specified distance of one another.

Here is the code for finding near-palindromes:

"""
near_palindromes.py

Donald Knuth, Art of Computer Programming, Volume 4 Facsimile 0
Variation on Exercise #29

Find SGB words that are near-palindromes
(edit distance of one or two letters away from a palindrome).
"""
from get_words import get_words
from euclidean_distance import euclidean_distance
from pprint import pprint

def is_near_palindrome(word,lo,hi):
    d1 = euclidean_distance(word[0],word[4])
    d2 = euclidean_distance(word[1],word[3])

    if( (d1+d2) > lo and (d1+d2) <= hi ):
        return True

    return False

if __name__=="__main__":
    words = get_words()

    knp = 0
    near_palindromes = []

    # Euclidean distance tolerance
    lo = 0.0
    hi = 1.0

    for i in range(len(words)):
        if(is_near_palindrome(words[i],lo,hi)):
            knp += 1
            near_palindromes.append(words[i])

    print("-"*40)
    print("Near Palindromes: \n")
    print(", ".join(near_palindromes))
    print("The number of near-palindromes is {0:d}".format(len(near_palindromes)))

References

  1. Knuth, Donald. The Art of Computer Programming. Upper Saddle River, NJ: Addison-Wesley, 2008.

  2. Knuth, Donald. The Stanford GraphBase: A Platform for Combinatorial Computing. New York: ACM Press, 1994. <http://www-cs-faculty.stanford.edu/~knuth/sgb.html>

  3. "Five Letter Words." Git repository, git.charlesreid1.com. Charles Reid. Updated 1 September 2017. <http://git.charlesreid1.com/cs/five-letter-words>

Tags:    python    computer science    graphs    algorithms    art of computer programming    knuth    language   

Five Letter Words: Part 1: Getting Familiar With The List

Posted in Computer Science

permalink


Table of Contents

About the Five-Letter Words

In Volume 4, Facsimile 0 of Donald Knuth's Art of Computer Programming, in which Knuth covers graph theory, he introduces a list of five-letter words as part of a data set useful in exploring graph theory and graph algorithms.

The list of words is part of the Stanford Graph Base, a set of data sets that are useful for studying graph theory and networks.

The first few words in the list are:

  • which
  • there
  • their
  • about
  • would
  • these
  • other
  • words
  • could
  • write
  • first
  • water
  • after

and so on. There are 5,757 total words in the data set, including some common words (like the first few listed above), as well as some less common words:

  • osier
  • roble
  • rumba
  • biffy
  • pupal

This post is an introduction to the five letter words, and will give a few useful algorithms for analyzing the set of words.

The Usefulness of Five Letter Words

We are surrounded, always and everywhere, by language - the principal mechanism of thought, communication, and expression. Our latent familiarity with language makes data sets involving language extremely useful - unlike a data set about football scores, property crime, or human fatalities, we don't expend much effort understanding the nature of the data. Studying language also gives us a deeper understanding and appreciation for the complexity of language, for through our very familiarity with language, it can come to seem deceptively simple.

Five letter words, in particular, are short enough that they are familiar, and surround us, and yet long enough to have variety and lead to some very interesting properties. Five also happens to be a computationally convenient length.

Warm-Up Exercises

In Knuth's AOCP, he presents the reader with several warm-up exercises to get familiar with the list of words. We cover solutions to several of these exercises. Many of these exercises involve creating algorithms that, while not utilizing graph theory themselves, can be used to construct interesting graphs. These exercises are written in Python.

Let us begin.

Get Words Function

Before starting any analysis of the five letter words, it is a good idea to create a function that will load the data set from a text file and return the result as a Python list. This function is given below:

get_words.py

"""
get_words.py

Utility method to load the SGB words
and return them as a list of strings.
"""

def get_words():
    # Load the file.
    with open('sgb-words.txt','r') as f:
        ## This includes \n at the end of each line:
        #words = f.readlines()

        # This drops the \n at the end of each line:
        words = f.read().splitlines()

    return words

This is a straightforward use of the read() and splitlines() functions in Python.
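The difference between the two approaches mentioned in the comments can be seen with an in-memory file (a sketch using io.StringIO in place of sgb-words.txt):

```python
import io

# Simulate a small word file.
f = io.StringIO("which\nthere\ntheir\n")

# readlines() keeps the trailing newline on each line:
print(f.readlines())          # ['which\n', 'there\n', 'their\n']

f.seek(0)
# read().splitlines() drops the newlines:
print(f.read().splitlines())  # ['which', 'there', 'their']
```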

Euclidean Distance

We begin with a calculation of the Euclidean distance between words. We define the distance between two words (referred to loosely here as the "edit distance") based on the notion of a unit change, which is incrementing or decrementing a letter by one. Thus, the edit distance between "a" and "b" is 1, the edit distance between "e" and "g" is 2, and so on.

Euclidean Distance Code

Let's start with the code that does the calculation of the edit distance between two words:

euclidean_distance.py

import random, math, operator
from pprint import pprint
from get_words import get_words

random.seed(1337)

"""
euclidean_distance.py

Compute euclidean distance between 5-letter words.
"""

def euclidean_distance(word1, word2):
    v1 = word2vec(word1)
    v2 = word2vec(word2)
    return l2norm(v1,v2)

def l2norm(vec1, vec2):
    radicand = [(v2-v1)*(v2-v1) for (v1,v2) in zip(vec1,vec2)]
    return math.sqrt(sum(radicand))

def word2vec(word):
    vec = []
    for c in word:
        # 'a' -> 0, 'b' -> 1, ..., 'z' -> 25
        vec.append(ord(c) - ord('a'))
    return vec

def print_tuple(e):
    print("Distance between {0:s} and {1:s} = {2:f}".format(*e))

if __name__=="__main__":

    words = get_words()

    eds = []
    for i in range(100):
        w1 = words[random.randint(0,len(words)-1)]
        w2 = words[random.randint(0,len(words)-1)]
        ed = euclidean_distance(w1,w2)
        eds.append((w1,w2,ed))

    sorted_eds = sorted(eds, key=operator.itemgetter(2))

    for e in reversed(sorted_eds):
        print_tuple(e)

Note that this script shares much in common with codes to create Caesar ciphers, Affine ciphers, and the like - the heart of the script is the word2vec() function, which converts a five-letter word into a five-component vector of numbers from 0 to 25. This is done using Python's ord() function, which returns the ordinal value of a character; subtracting ord('a') maps the letter 'a' to 0, 'b' to 1, 'c' to 2, and so on.
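As a quick sanity check of this mapping, here is a condensed, self-contained sketch of the same conversion:

```python
def word2vec(word):
    # Map each letter to its 0-25 ordinal: 'a' -> 0, ..., 'z' -> 25.
    return [ord(c) - ord('a') for c in word]

print(word2vec("which"))   # [22, 7, 8, 2, 7]
print(word2vec("abcde"))   # [0, 1, 2, 3, 4]
```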

The code also implements an L2 norm calculation, which is the mathematical term for a Euclidean distance calculation. It computes the square root of the sum of the squares of the differences between each component of the vector. This is the standard distance formula from your high school algebra class, extended to higher dimensions:

$$ d = \sqrt{ (x_2 - x_1)^2 + (y_2 - y_1)^2} $$

Or, for the physicists out there, the square root of the dot product of the difference vector with itself. The L2 norm between two vectors \(\mathbf{v}_1\) and \(\mathbf{v}_2\) is commonly denoted:

$$ || \mathbf{v}_2 - \mathbf{v}_1 ||_2 $$

Examples

To better illustrate what the Euclidean distance calculation looks like, let's look at some concrete examples of words that have an edit distance of 1:

there, these
right, sight
sound, round
might, night
might, light
along, among

In each case, we increment or decrement a single letter by 1, and the result is another five-letter word in the list. Perhaps the most surprising result is how many pairs of common words have an edit distance of 1:

$ python diff_by_one.py
1075 words have a Euclidean distance of +/-1.

That means nearly 20% of the words are within a single edit of another word.

If we look at words that have an edit distance of exactly 2, we can see that some pairs of words have a single letter that changes by 2 units, while other pairs have four letters that each differ by a single unit (since \(\sqrt{1^2 + 1^2 + 1^2 + 1^2} = 2\)):

would, wound
right, tight
years, wears
never, lever
along, alone
night, light
paper, oboes

The last pair is an example of the latter.
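A distance of exactly 2 can arise from one letter changing by 2 units, or from four letters each changing by 1 unit. Both cases can be verified with a self-contained sketch that re-implements the distance calculation from above:

```python
import math

def word2vec(word):
    return [ord(c) - ord('a') for c in word]

def euclidean_distance(w1, w2):
    # Square root of the sum of squared component differences.
    return math.sqrt(sum((a - b)**2 for a, b in zip(word2vec(w1), word2vec(w2))))

# One letter changed by 2 units: 'l' -> 'n'
print(euclidean_distance("would", "wound"))   # 2.0

# Four letters each changed by 1 unit: sqrt(1+1+1+1) = 2
print(euclidean_distance("paper", "oboes"))   # 2.0
```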

Here are more examples of pairs of words with larger edit distances:

----------------------------------------
Distance of 3
there, where
would, world
words, woods
sound, pound
those, whose
house, horse
----------------------------------------
Distance of 4
about, cents
after, birds
right, night
think, thing
sound, wound
small, smell
----------------------------------------
Distance of 5
which, weigh
there, theme
other, steer
right, might
years, tears
place, space
place, piece

Different-by-N Code

The code that performs the above calculations includes diff_by_one.py and diff_by_n.py. Here is the former:

diff_by_one.py

"""
diff_by_one.py

Donald Knuth, Art of Computer Programming, Volume 4 Facsimile 0
Exercise #28

Find pairs of SGB word vectors that differ by +/-1.
"""
from get_words import get_words
from euclidean_distance import euclidean_distance

if __name__=="__main__":
    words = get_words()

    ## To limit the output:
    #words = words[:1000]

    k = 0
    off_by_one = []
    for i in range(len(words)):
        for j in range(i,len(words)):
            d = euclidean_distance(words[i],words[j])
            if(abs(d)==1):
                k += 1
                off_by_one.append((words[i],words[j]))
                print("{0:s}, {1:s}".format(words[i],words[j]))

    print("{0:d} words have a Euclidean distance of +/-1.".format(k))

This is a nested for loop that examines all pairs of words. Note that we want to avoid the pair (B,A) if we have already found/printed the pair (A,B), so we use a nested for loop where the inner index starts at the outer index. The core of the script is the euclidean_distance() function, covered above.

This algorithm takes \(O(N^2)\) time due to the nested for loops.
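An equivalent, more idiomatic way to visit each unordered pair exactly once is itertools.combinations (a quick sketch on a toy list):

```python
from itertools import combinations

words = ["there", "these", "where", "those"]

# combinations() yields each unordered pair exactly once,
# replacing the manual nested-index bookkeeping:
for w1, w2 in combinations(words, 2):
    print(w1, w2)
```

Note that combinations() skips the self-pair (i equal to j), which the distance test would filter out anyway, since a word is at distance 0 from itself.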

Likewise, here is code to generate pairs that differ by some amount \(n\). This code will only print 10 pairs for each \(n\), to cut down on running time.

diff_by_n.py

"""
diff_by_n.py

Donald Knuth, Art of Computer Programming, Volume 4 Facsimile 0
Variation on Exercise #28

Find pairs of SGB word vectors that differ by +/-n.
"""
from get_words import get_words
from euclidean_distance import euclidean_distance

def diff_by_n(words, n):
    k = 0
    off_by_n = []
    for i in range(len(words)):
        for j in range(i,len(words)):
            d = euclidean_distance(words[i],words[j])
            if(abs(d)==n):
                k += 1
                off_by_n.append((words[i],words[j]))
                print("{0:s}, {1:s}".format(words[i],words[j]))
        # Stop after printing 10+ pairs, to cut down on running time
        if k>10:
            break

    print("{0:d} words have a Euclidean distance of +/-{1:d}.".format(k,n))


if __name__=="__main__":
    words = get_words()

    for n in [2,3,4,5]:
        print("-"*40)
        print("Distance of {0:d}".format(n))
        diff_by_n(words, n)

Mo Words, Mo Problems

We have a number of other interesting problems and codes to cover, including:

  • Palindromes
  • Number of unique words
  • Word/letter statistics
  • Words consisting of lexicographically sorted letters
  • Words consisting of distinct letters

Each of these will be covered in a series of posts to follow. Then we'll move on to the business of graph theory, implementation of these five letter words as graphs, and utilization of software libraries designed for graphs and networks (expect some code in Java using Google's excellent Guava library).

You can follow all of this in my five-letter-words repository on GitHub, and/or in the five-letter-words repository on git.charlesreid1.com.

We have also provided additional information on the charlesreid1 wiki, at Five Letter Words, along with a whole bundle of notes from working through Donald Knuth's Art of Computer Programming at the Art of Computer Programming page of the wiki.

Expect more soon!

References

  1. Knuth, Donald. The Art of Computer Programming. Upper Saddle River, NJ: Addison-Wesley, 2008.

  2. Knuth, Donald. The Stanford GraphBase: A Platform for Combinatorial Computing. New York: ACM Press, 1994. <http://www-cs-faculty.stanford.edu/~knuth/sgb.html>

  3. "Five Letter Words." Git repository, git.charlesreid1.com. Charles Reid. Updated 1 September 2017. <http://git.charlesreid1.com/cs/five-letter-words>

Tags:    python    computer science    graphs    algorithms    art of computer programming    knuth    language