charlesreid1.com research & teaching

Let's Generate Permutations!

Posted in Computer Science

Generating Permutations

In today's post we're going to discuss the generation of permutations.

Often, in combinatorics problems, we are interested in how many different instances or configurations of a particular thing we can have (what we'll call "enumeration" or "counting"). However, that is different from wanting to actually see all of those configurations. Indeed, if we are counting something with an astronomical number of configurations, we don't want to try to list all of them.

However, as usual, Donald Knuth, who covers the topic of permutation generation in Volume 4A of his classic work, The Art of Computer Programming, uncovers a problem that is much more complicated and subtle than it initially appears.

Background: Radix Method for Generating Permutations

In Volume 4A, Knuth turns to the question of generating permutations for a given combinatoric system, opening with the problem of generating all possible tuples.

Algorithm M is the first algorithm Knuth presents, and it generates all possible tuples. To pose the problem a little more clearly, consider a combinatoric system that has \(n\) independent parts, each part having a number of possible states. We can completely specify the state of the combinatoric system by specifying an \(n\)-tuple:

$$ (a_1, a_2, \dots, a_n) $$

where each independent variable takes on one of its possible values \(0 \leq a_i < m_i\).

Knuth's Algorithm M starts by setting all a's to 0, and incrementing the right-most entry of the tuple (carrying if necessary).

This is equivalent to counting from 0 to \(N-1\) in a mixed-radix number system, or to labeling every possible outcome with a number between 0 and \(N-1\), where \(N = m_1 m_2 \cdots m_n\) is the total number of configurations.

This becomes more clear if we consider the \(n\)-tuple to be a number in a variable-radix system:

The number is \(\left[ a_1, a_2, \dots, a_n \right]\)

The radix is \(\left[ m_1, m_2, \dots, m_n \right]\)

By repeatedly adding 1 to the number \(\left[ a_1, a_2, \dots, a_n \right]\), we iterate through every possible tuple \((a_1, a_2, \dots, a_n)\) and therefore through every possible combination of the independent variables.
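To make the radix correspondence concrete, here is a short sketch (the function name is mine, not Knuth's) that converts a mixed-radix tuple into the integer label it represents:

```python
def tuple_to_int(a, m):
    """Convert mixed-radix digits a (most significant first)
    with radices m into the integer they represent."""
    value = 0
    for digit, radix in zip(a, m):
        value = value * radix + digit
    return value

# Radices [2, 3, 2]: 2*3*2 = 12 possible tuples, labeled 0..11
print(tuple_to_int([0, 0, 0], [2, 3, 2]))  # 0
print(tuple_to_int([1, 2, 1], [2, 3, 2]))  # 11
```

Adding 1 to the integer label and converting back is exactly one step of Algorithm M.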

Knuth's Algorithm M

Algorithm M (Mixed-radix generation). This algorithm visits all \(n\)-tuples that satisfy the number/radix expressions above, by repeatedly adding 1 to the mixed-radix number until overflow occurs. (Aux. variables \(a_0\) and \(m_0\) introduced for convenience only.)

M1. [Initialize.] Set \(a_j \leftarrow 0\) for \(0 \leq j \leq n\), and set \(m_0 \leftarrow 2\).

M2. [Visit.] Visit the \(n\)-tuple \((a_1, \dots, a_n)\). (The program that wants to examine all \(n\)-tuples now does its thing.)

M3. [Prepare to add one.] Set \(j \leftarrow n\).

M4. [Carry if necessary.] If \(a_j = m_j - 1\), set \(a_j \leftarrow 0\), \(j \leftarrow j-1\), and repeat this step.

M5. [Increase, unless done.] If \(j=0\), terminate the algorithm. Otherwise set \(a_j \leftarrow a_j + 1\) and go back to step M2.

Implementing Algorithm M

Unfortunately, this pseudocode takes some work to translate. Fortunately, that's already done in the method below.

The method below implements Algorithm M in Python to generate move sequences on a 4x4 Rubik's Cube (called the Rubik's Revenge, or RR for short). The RR cube has six faces that can each be rotated clockwise or counterclockwise by a single layer, denoted by the uppercase letters U, D, B, F, L, R (up, down, back, front, left, right, respectively) for clockwise rotations and U', D', B', F', L', R' for counterclockwise rotations, for 12 total moves.

However, on a 4x4 cube, we can also rotate two layers at a time. (That's the limit; moving three layers at a time is equivalent to a reversed rotation of the remaining single layer.) This type of rotation is denoted with a "w".

Thus, rotating the top two layers of each of the six faces clockwise is denoted Uw, Dw, Bw, Fw, Lw, Rw, and counterclockwise rotations are denoted Uw', Dw', Bw', Fw', Lw', Rw', for 12 additional moves.

We have one more type of move, in which only the second layer is rotated. This type of move is denoted with a 2 preceding the face whose second layer is being rotated, for six moves: 2U, 2D, 2B, 2F, 2L, 2R. The prime notation again denotes a counterclockwise rotation, for an additional six moves: 2U', 2D', 2B', 2F', 2L', 2R'. This yields another 12 moves.

There are 36 total moves that can be executed on the 4x4 Rubik's Revenge cube.
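The 36-move set described above can be built programmatically; a quick sketch (the variable names are mine, following the notation in the text):

```python
# Build the 36-move set for the 4x4 Rubik's Revenge cube.
faces = ['U', 'D', 'B', 'F', 'L', 'R']

single = faces                            # U, D, ...   (outer layer)
wide = [f + 'w' for f in faces]           # Uw, Dw, ... (top two layers)
second = ['2' + f for f in faces]         # 2U, 2D, ... (second layer only)

clockwise = single + wide + second
counterclockwise = [m + "'" for m in clockwise]

moves = clockwise + counterclockwise
print(len(moves))  # 36
```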

def algorithm_m(n):
    """
    Knuth's Algorithm M for tuple generation,
    via AOCP Volume 4 Fascicle 2.
    This is a generator that yields tuples
    generated using the variable-radix method.

    This generates ALL move sequences
    (here with a reduced four-move alphabet for illustration).
    Many of these are rotations of one another,
    so use the get_rotations() function
    (defined later) to eliminate redundant sequences.
    """
    moves = ['A','B','C','D']

    # M1 - Initialize
    a = [0,]*n
    m = [len(moves),]*n

    nvisits = 0
    while True:

        # M2 - visit
        move_sequence = " ".join([moves[aj] for aj in a])
        yield move_sequence

        nvisits += 1

        # M3 - prepare to add one
        j = n-1

        # M4 - carry if necessary
        # (guard j >= 0 so we never wrap around to a[-1])
        while j >= 0 and a[j] == m[j]-1:
            a[j] = 0
            j = j-1

        # M5 - increase, unless done
        if j < 0:
            break
        else:
            a[j] = a[j] + 1

Test Drive

Let's take a look at how Algorithm M looks when it is applied. No surprises here: Algorithm M generates each of the possible permutations in sequence.

from pprint import pprint

# (Algorithm M goes here) 

if __name__=="__main__":
    pprint(list(algorithm_m(3)))

and the result:

    ['A A A',
     'A A B',
     'A A C',
     'A A D',
     'A B A',
     'A B B',
     'A B C',
     'A B D',
     'A C A',
     'A C B',
     'A C C',
     'A C D',
     'A D A',
     'A D B',
     'A D C',
     'A D D',
     'B A A',
     'B A B',
     'B A C',
     'B A D',
     'B B A',
     'B B B',
     'B B C',
     'B B D',
     'B C A',
     'B C B',
     'B C C',
     'B C D',
     'B D A',
     'B D B',
     'B D C',
     'B D D',
     'C A A',
     'C A B',
     'C A C',
     'C A D',
     'C B A',
     'C B B',
     'C B C',
     'C B D',
     'C C A',
     'C C B',
     'C C C',
     'C C D',
     'C D A',
     'C D B',
     'C D C',
     'C D D',
     'D A A',
     'D A B',
     'D A C',
     'D A D',
     'D B A',
     'D B B',
     'D B C',
     'D B D',
     'D C A',
     'D C B',
     'D C C',
     'D C D',
     'D D A',
     'D D B',
     'D D C',
     'D D D']

What Other Ways Are There?

All of this may seem obvious or uninteresting, if you don't realize there are other ways of generating all possible \(n\)-tuples.

It's a bit easier to think about for binary numbers. Imagine you're trying to generate every possible 10-digit binary number. This means generating all binary numbers between \(0\) and \(2^{10}-1\).

Algorithm M, as we saw above, just counts from 0 to \(2^{10}-1\). But this can involve changing a large number of bits (for example, adding 1 to 001111111 results in 010000000, changing 8 digits). Knuth presents an alternative algorithm that only requires changing one bit to generate the next tuple, making the algorithm much faster.
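The single-change idea can be previewed with the standard binary reflected Gray code (a standard construction, not necessarily the exact algorithm Knuth presents), in which the \(i\)-th pattern is \(i \oplus (i \gg 1)\):

```python
def gray(i):
    """The i-th binary reflected Gray code value."""
    return i ^ (i >> 1)

# Consecutive codes differ in exactly one bit:
for i in range(15):
    diff = gray(i) ^ gray(i + 1)
    assert diff & (diff - 1) == 0  # a power of two => one bit changed
```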

More on that algorithm in a future blog post...

Five Letter Words: Part 3: Letter Coverage and Dynamic Programming

Posted in Computer Science

NOTE: The code covered in this post uses Python 3. The scripts can be converted to Python 2 with minimal effort, but the author would encourage any user of Python 2 to "put on your big kid pants" and make the switch to Python 3. Let's all make this painful, drawn-out switch from Python 2 to Python 3 a thing of the past, shall we?

Introduction

The letter/word coverage problem, as presented by Donald Knuth in Volume 4, Fascicle 0 of his masterpiece The Art of Computer Programming, is the problem of finding the minimum number of words from the collection of five-letter words that "cover" (collectively contain) the first N letters of the alphabet.

The problem has a couple of variations:

  • Provided a set of letters, search for the smallest number of words that cover those particular letters.
  • Given an integer \(N \leq 26\), search for the smallest number of words that cover the first N letters of the alphabet.
  • The same problem as above, but drawing from the first \(M\) words of the 5757 total five-letter words.

For the sake of simplicity, we will focus on the simplest problem: considering the first \(N\) letters of the alphabet, find the shortest sequence of words that will provide coverage of the first \(N\) letters of the alphabet.

This is an example of a dynamic programming problem: a combinatorics problem that can be solved by breaking the overall problem down into smaller sub-problems, solving the sub-problems, and assembling the sub-problem solutions into an overall solution.

The procedure is as follows:

  • For each word \(w_i\), we begin by assuming this word is the best solution on its own. This forms the base case/starting solution.
  • Next, examine all prior words \(w_j, j<i\), and compare each to using the word \(w_i\) by itself.
  • For each pair of words, take the union (OR) of the character coverage for word \(w_i\) and the solution bit vector for word \(w_j\) (that is, using the best-covered solution so far for word \(w_j\))
  • (Note: for word \(w_i\), we need to store one of these unions as the best-covered solution so far for word \(w_i\), but we aren't sure which one yet.)
  • For the given pair of words \(w_j\) and \(w_i\), we are looking at word \(w_j\) and considering the possibility of extending that with word \(w_i\). Adding \(w_i\) to the best solution so far may or may not improve the best solution, so we need to decide whether to add \(w_i\) to the best solution so far.
  • Compute the number of letters covered in the union of \(w_i\) and the best solution so far (by, e.g., summing up the 1s in the bit vector of \(w_i\) added to the bit vector representing the best solution so far for word \(w_j\))
  • Compute the number of words in the best solution so far for word \(w_j\), and add one to it (representing the new word \(w_i\) being added)
  • We are searching for the prior solution for word \(w_j\) that will lead to the maximum number of 1s in the bit vector
  • We break ties by picking the word \(w_j\) that will minimize the number of total words
  • Once we find the best word \(w_j\), we save the union bit vector for word \(w_i\) and word \(w_j\) under the word \(w_i\) combined solution bit vector; we save the length of 1s in the combined solution bit vector; and we save the number of words so far in that solution.

Once we have gone through every word, we are ready to find the minimum. Do this by:

  • Search through the solutions for every word, and pick out the one that maximizes the number of 1s in the solution bit vector (or, rather, that has the correct number of 1s in the bit vector) while also minimizing the total number of words.
  • To get the actual sequence of words, rather than just the minimum number of words, we save, for each word, the prior word that leads to the maximum number of 1s in the solution bit vector and the minimum number of words. Then, at the end, we can backtrack through the words that compose the solution.

This is a bit complicated to explain in words, so we'll give a small example, then some pseudocode. Then we'll present the actual Python program that accomplishes this task.

A Simple Manual Example

Let's walk through an example manually to illustrate the approach:

Suppose we are considering 2-letter words taken from a 5-letter alphabet abcde. We can represent a given word as a binary string or bit vector: for example, the two-letter word aa would be represented by the bit vector 10000, ab would be represented by the bit vector 11000, etc.
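As a sanity check, a tiny helper (a simplified stand-in for the word2bitvector function given later in this post) makes the encoding concrete:

```python
def bv(word, n=5):
    """Bit vector (as a string) of the letters covered by word,
    over the first n letters of the alphabet."""
    bits = ['0'] * n
    for c in word:
        bits[ord(c) - ord('a')] = '1'
    return ''.join(bits)

print(bv('aa'))  # 10000
print(bv('ab'))  # 11000
print(bv('de'))  # 00011
```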

Now let's consider a set of words, and step through the algorithm with them.

W0 = aa = 10000
W1 = ab = 11000
W2 = bc = 01100
W3 = aa = 10000
W4 = dd = 00010
W5 = de = 00011
W6 = bb = 01000

Now, we wish to write a dynamic program that will find the smallest set of words such that taking the union of each bit vector for each of the words in the set will yield the bit vector 11111. At each step, we seek the words that will maximize the number of 1s in the union of the bit vectors, while minimizing the number of words. We take the union of the "longest sequence of 1s" bit vector from the prior step, plus the bit vector from the current step.

W0: aa

Start with word W0: this is the only bit vector, so it sets the starting "largest sequence of 1s" bit vector. We wish to maximize "largest sequence of 1s" and minimize number of words.

  • only W0 as solution is therefore \(10000\). The number of 1s is 1. The number of words is 1. (W0 SOLUTION)

W1: ab

Next is word W1: we consider W1 by itself, and the union of W1 with the prior solution for W0. We wish to maximize the number of 1s and minimize the number of words. Two options:

  • only W1 as solution is therefore \(11000\). The number of 1s is 2. The number of words is 1. (W1 SOLUTION)
  • union of W0 solution and W1 \(10000 \bigcup 11000 = 11000\). The number of 1s is 2. The number of words is 2.

W2: bc

Next is word W2: the "largest sequence of 1s" bit vector is the union that maximizes the number of 1s and minimizes the number of words. Three options:

  • only W2 as solution is \(01100\). The number of 1s is 2. The number of words is 1.
  • union of W0 solution and W2 \(10000 \bigcup 01100 = 11100\). The number of 1s is 3. The number of words is 2. (W2 SOLUTION)
  • union of W1 solution and W2 \(11000 \bigcup 01100 = 11100\). The number of 1s is 3. The number of words is 2.

W3: aa

Next is word W3: the "largest sequence of 1s" bit vector is the union that maximizes the number of 1s and minimizes the number of words. Four options:

  • only W3 as solution is \(10000\). The number of 1s is 1. The number of words is 1.
  • union of W0 solution and W3 \(10000 \bigcup 10000 = 10000\). The number of 1s is 1. The number of words is 2.
  • union of W1 solution and W3 \(11000 \bigcup 10000 = 11000\). The number of 1s is 2. The number of words is 2.
  • union of W2 solution and W3 \(11100 \bigcup 10000 = 11100\). The number of 1s is 3. The number of words is 3. (W3 SOLUTION)

W4: dd

Next is word W4: the "largest sequence of 1s" bit vector is the union that maximizes the number of 1s and minimizes the number of words. Five options:

  • only W4 as solution is \(00010\). The number of 1s is 1. The number of words is 1.
  • union of W0 solution and W4 \(10000 \bigcup 00010 = 10010\). The number of 1s is 2. The number of words is 2.
  • union of W1 solution and W4 \(11000 \bigcup 00010 = 11010\). The number of 1s is 3. The number of words is 2.
  • union of W2 solution and W4 \(11100 \bigcup 00010 = 11110\). The number of 1s is 4. The number of words is 3. (W4 SOLUTION)
  • union of W3 solution and W4 \(11100 \bigcup 00010 = 11110\). The number of 1s is 4. The number of words is 4.

W5: de

Next is word W5: the "largest sequence of 1s" bit vector is the union maximizing number of 1s and minimizing number of words. Six options:

  • only W5 as solution is \(00011\). The number of 1s is 2. The number of words is 1.
  • union of W0 solution and W5 \(10000 \bigcup 00011 = 10011\). The number of 1s is 3. The number of words is 2.
  • union of W1 solution and W5 \(11000 \bigcup 00011 = 11011\). The number of 1s is 4. The number of words is 2.
  • union of W2 solution and W5 \(11100 \bigcup 00011 = 11111\). The number of 1s is 5. The number of words is 3. (W5 SOLUTION)
  • union of W3 solution and W5 \(11100 \bigcup 00011 = 11111\). The number of 1s is 5. The number of words is 4.
  • union of W4 solution and W5 \(11110 \bigcup 00011 = 11111\). The number of 1s is 5. The number of words is 4.

W6: bb

Next is word W6: the "largest sequence of 1s" bit vector is the union maximizing number of 1s and minimizing number of words. Seven options:

  • only W6 as solution is \(01000\). The number of 1s is 1. The number of words is 1.
  • union of W0 solution and W6 \(10000 \bigcup 01000 = 11000\). The number of 1s is 2. The number of words is 2.
  • union of W1 solution and W6 \(11000 \bigcup 01000 = 11000\). The number of 1s is 2. The number of words is 2.
  • union of W2 solution and W6 \(11100 \bigcup 01000 = 11100\). The number of 1s is 3. The number of words is 3.
  • union of W3 solution and W6 \(11100 \bigcup 01000 = 11100\). The number of 1s is 3. The number of words is 4.
  • union of W4 solution and W6 \(11110 \bigcup 01000 = 11110\). The number of 1s is 4. The number of words is 4.
  • union of W5 solution and W6 \(11111 \bigcup 01000 = 11111\). The number of 1s is 5. The number of words is 4. (W6 SOLUTION)

(NOTE: We don't need to consider every possible combination of W1, W2, W3, W4, W5, and W6; we only need to consider each word once, because each word's current solution can be written in terms of the prior word's solution, so we only need to consider solutions for each word. We've already considered the non-solutions and can therefore ignore them because they don't maximize number of 1s and minimize number of words.)

Thus far, we have found a "local" solution for each word. We can now compare all of these "local" solutions to find a "global" solution. The global solution maximizes the number of 1s found (meaning we can toss out any solutions that have fewer than five 1s), and minimizes the total number of words (meaning our W5 solution gives us the global optimum).

Therefore our global solution is the W5 solution: five 1s, and 3 words. Backtracking, we see that the words W0, W2, and W5 cover all of the first five letters with the minimum number of total words.

W0 = aa = 10000
W2 = bc = 01100
W5 = de = 00011
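We can verify this solution by OR-ing the three bit vectors, here written as Python binary literals with the leftmost bit standing for the letter a:

```python
# Verify that the example solution covers all five letters.
W0 = 0b10000  # aa
W2 = 0b01100  # bc
W5 = 0b00011  # de

union = W0 | W2 | W5
print(format(union, '05b'))  # 11111
```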

Pseudocode

Here is the pseudocode for the program. We utilize one function to compute the letter coverage bit vector for a single word, and the rest of the functionality will go in the main method:

function word2bitvector(word):
    initialize 26-element bit vector with 0s (one 0 per letter)
    for each letter in word:
        turn the bit for this letter to 1
    return bit vector

function main():

    // initialization step:
    initialize best coverage bit vector
    initialize maximum number of 1s (the number of letters N we wish to cover)
    initialize number of words in current solution
    initialize backtracking array (for constructing final solution)

    // outer loop fencepost step:
    set things up for word 0 (base case)

    // loop through each word
    for each word in words:
        // skip word 0 (base case)

        // inner loop fencepost step:
        initialize solution for word i by itself

        for each prior word j < i:
            compute the new potential best coverage bit vector
            compute the number of 1s in the new potential best coverage bit vector
            compute number of words in new potential best solution
            if this solution is better than current best solution:
                overwrite best solution with current solution

    // get solution:
    find maximum indices of vector of number of 1s 
    // (this is potentially multiple indices, representing multiple 
    //  solutions that satisfy the coverage we want)
    find minimum number of words corresponding to each of the coverage indices
    backtrack through solution indices

Python Code

The code for this solution can be found here: letter_coverage.py

This code is as follows:

Start with the word-to-bit vector function:

import numpy as np

def word2bitvector(word,N):
    """
    Turns a five-letter word into a bit vector representing
    its coverage of the first N letters of the alphabet.
    """
    bit_vector = [False,]*N
    for c in word:
        i = ord(c)-ord('a')
        try:
            bit_vector[i] = True
        except IndexError:
            # letters beyond the first N are ignored
            pass
    return np.array(bit_vector)

We also implement a few helper methods: the first turns a boolean bit vector into a pretty string of 0s and 1s:

def printbv(bv):
    result = ""
    for bit in bv:
        if bit:
            result += "1"
        else:
            result += "0"
    return result

The second method is our all-important backtracking to obtain the actual sequence of words that leads to the minimum coverage, instead of just getting a count of the minimum number of words that it takes to cover the first \(N\) letters:

def btsolution(min_key, min_val, words, bt):
    """
    Reconstruct the sequence of words that gives maximum coverage and minimum word count.

    Input: minimum word key (last word), minimum value (number of words), backtrack (prior word)

    Output: list of words
    """
    solution = []
    solution.append(words[min_key])
    prior_key = bt[min_key]
    while prior_key != -1:
        solution.append(words[prior_key])
        prior_key = bt[prior_key]
    return reversed(solution)

Finally, we get to the meat of the method: the dynamic program. Start with some initialization. This is where we set the number of letters we want to cover, and limit the "vocabulary" if desired:

if __name__=="__main__":

    # Searching for words covering first N letters
    N = 13

    words = get_words()

    # If we want to restrict our search to the first M letters,
    #words = words[:1000]

We begin with the initialization step:

    # Initialization:
    # ----------------

    # Store best coverage bitvectors for each word
    bestcoverage_bv = [np.array([False]*N) for k in range(len(words))]

    # Store number of 1s for best coverage vector for each word
    ones_bv = [-1]*len(words)

    # Store number of words in best solution for each word
    ws = [0]*len(words)

    # Store prior word for backtracking
    bt = [-1]*len(words)

Next comes the fencepost initialization step, where we initialize the solution for word 0:

    # Fencepost: Initial Step
    # Word 0
    # ----------------

    i = 0

    # Start with word 0
    wi = words[i]

    # Best letter coverage bit vector
    bestcoverage_bv[i] = word2bitvector(words[i],N)

    # Length of 1s
    ones_bv[i] = sum(bestcoverage_bv[i])

    # Number of words in best solution:
    ws[i] = 1

    # Backtracking: first word has no prior word
    bt[i] = -1

Next, we loop over each word \(w_i, i>0\):

    # Start by assuming the word by itself, 
    # and then examine each possible pairing
    for i in range(1,len(words)):
        wi = words[i]

        # Start with bitvector of word i's coverage
        wi_bv = word2bitvector(wi,N)

        # Fencepost: initial step
        # Word i by itself
        # Assume word i is the first word in the solution,
        # and if we find a better combination with prior word,
        # overwrite this solution.
        # ------------------------

        # Best coverage so far (first guess) is word i by itself
        bestcoverage_bv[i] = wi_bv

        # Count ones in (first guess) best bitvector
        ones_bv[i] = sum(bestcoverage_bv[i])

        # Number of words in new best solution:
        ws[i] = 1

        # Backtracking
        bt[i] = -1

        # Boolean: is this the first word in the sequence of solutions?
        first = True

We started by assuming that each word \(w_i\) provides a best solution by itself; the next step is to consider each pairing of \(w_i\) with prior words \(w_j\), and update our current solution if we find a better one:

        # Now loop over the rest of the words,
        # and look for a better solution.
        for j in reversed(range(0,i)):

            # Get the prior word
            wj = words[j]

            # Get best coverage bitvector 
            wj_bv = bestcoverage_bv[j]

            # (potential) new combined coverage vector
            bestcoverage_bv_i = np.logical_or(wi_bv, wj_bv)

            # Number of ones in (potential) new combined coverage vector
            ones_bv_i = sum(bestcoverage_bv_i)

            # Number of words in (potential) new best solution
            ws_i = ws[j]+1

            # If this solution is better than our current one,
            # overwrite the current solution.
            # (Better means, "more ones", or "same ones and fewer words".)

            if( (ones_bv_i > ones_bv[i]) or (ones_bv_i==ones_bv[i] and ws_i < ws[i]) ):
                bestcoverage_bv[i] = bestcoverage_bv_i
                ones_bv[i] = ones_bv_i
                ws[i] = ws_i
                bt[i] = j

                # This word now follows another word in the sequence of solutions
                first = False

            # It's tempting to stop early,
            # but what if we find the perfect 
            # solution right at the end?!?

Now that we have found the coverage for each word, and the corresponding number of words in that coverage solution, we find the solution that achieves the desired coverage while minimizing the number of words, so that we can construct the actual solution:

    # Okay, now actually get the solution.
    # The solution is the maximum of ones_bv and the minimum of ws
    # 
    # Start by finding the maximum(s) of ones_bv
    # Then check each corresponding index of ws
    ones_bv_indices = [k for k,v in enumerate(ones_bv) if v==max(ones_bv)]

    min_key = ones_bv_indices[0]
    min_val = ws[min_key]
    for ix in ones_bv_indices[1:]:
        if(ws[ix] < min_val):
            min_key = ix
            min_val = ws[ix]

    print("Min key: word "+str(min_key)+" = "+words[min_key])
    print("Min val: "+str(min_val)+" words to cover "+str(N)+" letters")

    pprint(list(btsolution(min_key, min_val, words, bt)))

Output and Timing

Let's take a look at some example output from the program. This run only considers the first 1,000 words in the five-letter word list:

$ time py letter_coverage.py
Takes 9 words to cover 15 letters
['which',
 'their',
 'about',
 'could',
 'after',
 'right',
 'think',
 'major',
 'level']

real    0m17.226s
user    0m17.090s
sys     0m0.087s

Here's the same program, considering all 5,757 words:

$ time py letter_coverage.py
Takes 9 words to cover 15 letters
['which',
 'their',
 'about',
 'could',
 'after',
 'right',
 'think',
 'major',
 'level']

real    9m29.619s
user    9m24.360s
sys 0m1.958s

Note that the algorithm is \(O(N^2)\), since it iterates over each word, and for each word, it examines each possible pairing with a preceding word. Thus, if we increase the number of words by a factor of 6, we expect the runtime to increase by a factor of 36, for an estimated runtime of \(36 \times 17 \mbox{ seconds} \approx 10 \mbox{ minutes}\), which is pretty close to what we see above.
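That back-of-the-envelope estimate can be checked directly (the exact factor is \((5757/1000)^2 \approx 33\), slightly below the rounded factor of 36 used above):

```python
t_small = 17.2                     # seconds for the first 1,000 words (from above)
factor = (5757 / 1000) ** 2        # O(N^2) scaling factor
t_est = t_small * factor

print(round(factor, 1))            # ~33.1
print(round(t_est / 60, 1))        # ~9.5 minutes, close to the observed 9m30s
```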

Five Letter Words: Part 2: More Five-Word Algorithms

Posted in Computer Science

Introduction

In Five Letter Words: Part 1, we introduced Donald Knuth's list of five-letter words, one of the data sets in the Stanford GraphBase, which is covered in greater detail in Knuth's coverage of graph theory in Volume 4, Fascicle 0 of his magnum opus, The Art of Computer Programming.

In the section where Knuth introduces the set of words, he also gives readers several exercises to get to know the list of words. This multi-part series of posts (also see Five Letter Words: Part 1) is covering some of the solutions to these exercises, and expanding on them to illustrate some of the interesting and surprising properties of this data set.

Five-Letter Words with k Distinct Letters

In Exercise 27, Knuth asks the reader to make a list of words composed of a specific number of distinct letters (1, 2, 3, 4, or 5).

In the list of five-letter words, there are 0 words composed of a single letter, 4 words with two distinct letters (0.07%), 163 words with three distinct letters (2.8%), 1756 words with four distinct letters (30.5%), and 3834 words with five distinct letters (66.6%).

Here are a few examples:

  • Two distinct letters: mamma, ahhhh, esses, ohhhh
  • Three distinct letters: added, seems, sense, level, teeth
  • Four distinct letters: which, there, these, where, three
  • Five distinct letters: their, about, would, other, words

To find these, we can design an algorithm that does the following: split each string into characters, add them to a set data type (a set discards any duplicates), and get the size of the set. This will give us the number of unique letters in a given word, and we can use a list of lists to store all words with a specified number of unique letters.

Once again, we're using our get_words function, which we covered in Part 1. See get_words.py for that script.

distinct.py

"""
distinct.py

Donald Knuth, Art of Computer Programming, Volume 4 Fascicle 0
Exercise #27

How many SGB words contain exactly k distinct letters, for 1 <= k <= 5?
"""
from pprint import pprint
from get_words import get_words

if __name__=="__main__":
    words = get_words()

    lengths = [[] for i in range(5+1)]

    for word in words:
        k = len(set(word))
        lengths[k].append(word)

    for i in range(1,5+1):
        print("-"*40)
        print("Number of words with {0:d} distinct letters: {1:d}".format(i, len(lengths[i])))
        print(", ".join(lengths[i][0:5]))

The principal operation here is the statement that gets the length, k:

k = len(set(word))
lengths[k].append(word)

The operation of turning a word into a set is \(O(M)\), where M is the number of letters in the word, and the algorithm performs this operation on each word in sequence, so overall, the algorithm is \(O(N)\), where N is the number of words.

The storage space used by the algorithm is also \(O(N)\): each word is stored exactly once, in the list corresponding to its number of distinct letters \(k \in \{ 1 \dots 5 \}\).

If we were dealing with a lot of words, and needed to save some space, we could represent the list of words with \(k\) distinct letters using five bit vectors, where each bit vector represents the words that are composed of \(k\) distinct letters, and has a length of \(N\), the number of words. A 0 would indicate the word is not in the set (is not composed of \(k\) letters), and a 1 would indicate the opposite.
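A sketch of that bit-vector representation, using Python integers as bit vectors (an illustration with a toy word list, not code from the post):

```python
# Toy word list standing in for the full SGB list.
words = ['which', 'mamma', 'added', 'their', 'seems']

# One integer bit vector per k; bit i set means words[i] has k distinct letters.
k_bv = [0] * (5 + 1)
for i, word in enumerate(words):
    k = len(set(word))
    k_bv[k] |= 1 << i

# Membership test: does words[3] ('their') have 5 distinct letters?
print(bool(k_bv[5] & (1 << 3)))  # True
```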

But here, we keep it simple, since we have a small, known set of words.

Examining a Variation

While that's essentially all there is to this algorithm, and it takes all of 10 seconds to come up with the idea, there are some nuances and some bookkeeping details, as there are with the design of any algorithm.

For example, compare the following two approaches; Approach 1 is used in the program, Approach 2 is a less efficient approach:

    # Approach 1:
    for word in words:
        k = len(set(word))
        lengths[k].append(word)


    # Approach 2:
    for k in range(1,5+1):
        for word in words:
            if(len(set(word))==k):
                lengths[k].append(word)

While these are both \(O(N)\) runtime, the latter approach is inefficient: we loop over each word five times, and each time we perform the same operation (turning the letters of a word into a set).

Is there ever a case where we would want an approach like #2?

The short answer is, never.

To give a longer answer, let's consider a case where approach #2 might provide an advantage. Suppose \(k\) could be larger - a list of 15-letter words, for example, so \(k\) could be up to 15 - and we were only interested in a particular value, or a small set of values, of \(k\), like 3 and 4.
Approach #1 would store unnecessary intermediate results (the buckets for every value of \(k\)) and therefore use extra space, compared with approach #2, where we could simply change the for loop to for k in [3,4]:.

Even here, though, approach #2 results in unnecessary work, because approach #1 is still computationally more efficient by looping over the list of words only once, compared with approach #2, which would loop over the list of words twice.

We may further consider a case where approach #2 would give us an advantage, and that is the case where we are copying data into the list lengths, instead of just storing a reference to a string. Because we only deal with references in Python, we aren't making copies in the code given above. But because strings are immutable, we could conceivably be making copies if we stored word.upper() instead of word. Approach #2 would use less space, because it only considers the values of k that are of interest.

But even here, approach #1 requires only a small modification to wipe out the space advantage of approach #2: add an if statement before calling the append function: if k in [3,4]. Now the calculation of turning a word into a set of characters is performed only once for approach #1, and we don't end up storing unnecessary intermediate results.
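A minimal, self-contained sketch of that modification, with a toy word list standing in for the SGB words (the variable names targets and lengths are illustrative):

```python
from collections import defaultdict

words = ["which", "there", "their", "about", "ahhhh", "eerie"]
targets = (3, 4)             # the values of k we care about
lengths = defaultdict(list)

for word in words:
    k = len(set(word))       # computed once per word
    if k in targets:         # skip uninteresting values of k
        lengths[k].append(word)
```

The set construction happens once per word, and only words with an interesting \(k\) are stored.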

The take-home lesson: even if the core idea behind an algorithm is straightforward, there are still many ways to do it better or worse.

Lexicographic Ordering of Letters

Knuth points out that the word "first" contains letters that occur in lexicographic order. Exercise #30 of AOCP Volume 4 Fascicle 0 asks us to find the first and last such word in Knuth's set of five-letter words.

To do this, we'll take each word and turn it into a list of characters. We'll then sort the characters, and turn the sorted list of characters back into a string. If the string constructed from sorted characters equals the original string, we have our word, formed from lexicographically ordered letters.

We can also perform the reverse - search for words whose letters are in reverse lexicographic order. One such word is "spied". Implementing this task requires a bit more care, because reversed() returns a lazy iterator rather than a list, but we can get around this with the list() function, as we shall see shortly.

Five-Letter Words with Lexicographically Ordered Letters

Exercise 30 asks us to find the first and last word in the set of five letter words whose letters occur in sorted lexicographic order. We begin by sorting all of the words, and we find the first such word is "abbey", while the last such word is "pssst".

There are 105 total words that fit this description. As we might expect, a majority of them begin with letters at the beginning of the alphabet:

  • abbey
  • abbot
  • abhor
  • abort
  • abuzz
  • achoo
  • adder
  • adept
  • adios
  • adopt
  • aegis
  • affix
  • afoot
  • aglow
  • ahhhh
  • allot
  • allow
  • alloy
  • ammos
  • annoy
  • beefs
  • beefy
  • beeps
  • beers
  • beery
  • befit
  • begin
  • begot
  • bells
  • belly
  • below
  • berry
  • bills
  • billy
  • bitty
  • blowy
  • boors
  • boost
  • booty
  • bossy
  • ceils
  • cello
  • cells

The full output is here:

lexico output

The code to find these words is given below:

lexico.py

"""
lexico.py

Donald Knuth, Art of Computer Programming, Volume 4 Fascicle 0
Exercise #30

Each letter of the word "first" appears in correct lexicographic order.
Find the first and last such words in the SGB words.
"""
from get_words import get_words

def in_sorted_order(word):
    chars = list(word)
    if(str(chars)==str(sorted(chars))):
        return True
    else:
        return False

if __name__=="__main__":

    words = get_words()
    words = sorted(words)

    count = 0
    print("-"*40)
    print("ALL lexicographically sorted words:")
    for word in words:
        if(in_sorted_order(word)):
            print(word)
            count += 1
    print("{0:d} total.".format(count))

    print("-"*40)
    for word in words:
        if(in_sorted_order(word)):
            print("First lexicographically sorted word:")
            print(word)
            break

    words.reverse()

    print("-"*40)
    for word in words:
        if(in_sorted_order(word)):
            print("Last lexicographically sorted word:")
            print(word)
            break

The heart of the method here is the in_sorted_order() function, which performs the task described above. We take the word passed to the function (a string) and turn it into a list of characters using the list() function. We then compare the string representation of that character list, str(chars), to the string representation of the sorted character list, str(sorted(chars)). (Note that str(chars) is the printed form of a list, not the original word itself - but since both sides of the comparison go through the same transformation, the test is still valid.)

If the two match, we have not affected the order of characters by sorting them in lexicographic (alphabetic) order, and therefore the original string was in sorted order, and we return True. Otherwise, we return False.

Here's that method one more time:

def in_sorted_order(word):
    chars = list(word)
    if(str(chars)==str(sorted(chars))):
        return True
    else:
        return False
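For what it's worth, the same check can be written more compactly (this variant is not used in the script above, but is equivalent): a word's letters are in sorted order exactly when sorting them changes nothing.

```python
def in_sorted_order(word):
    # A word's letters are in sorted order iff sorting them changes nothing.
    return word == "".join(sorted(word))
```

This avoids comparing string representations of lists entirely.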

Five-Letter Words with Lexicographically Reversed Letters

There are significantly fewer five-letter words whose letters are in reverse lexicographic order - 37, compared to the 105 in sorted order. Here is the full list:

  • mecca
  • offed
  • ohhhh
  • plied
  • poked
  • poled
  • polka
  • skied
  • skiff
  • sniff
  • soled
  • solid
  • sonic
  • speed
  • spied
  • spiff
  • spoke
  • spoof
  • spook
  • spool
  • spoon
  • toked
  • toned
  • tonic
  • treed
  • tried
  • troll
  • unfed
  • upped
  • urged
  • vroom
  • wheee
  • wooed
  • wrong
  • yoked
  • yucca
  • zoned

The code to do this requires only minor modifications to the original, sorted order code.

To reverse the procedure, we just need to modify the in_sorted_order() function to reverse the sorted list of characters before we reassemble it into a string. We can feed the output of the call to sorted() to the reversed() function. However, reversed() returns a lazy iterator object - it does not automatically enumerate every character. Unless, of course, we force it to.

That's where the call to list() comes in handy - by passing a generator to list(), we force Python to enumerate the output of the reversed, sorted list generator. Then we turn the reversed, sorted list into a reversed, sorted string:

def in_reverse_sorted_order(word):
    chars = list(word)
    # Note: reversed returns a generator,
    # so we have to pass it to list()
    # to explicitly enumerate the reversed results.
    if(str(chars)==str(list(reversed(sorted(chars))))):
        return True
    else:
        return False
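An equivalent variant (not used in the script below) sidesteps the reversed() iterator entirely by using sorted()'s reverse flag:

```python
def in_reverse_sorted_order(word):
    # sorted()'s reverse flag returns a list directly,
    # so no call to reversed() or list() is needed.
    return word == "".join(sorted(word, reverse=True))
```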

Meanwhile, the rest of the script can stay virtually the same.

reverse_lexico.py

"""
reverse_lexico.py

Donald Knuth, Art of Computer Programming, Volume 4 Fascicle 0
Variation on Exercise #30

Each letter of the word "spied" appears in reversed lexicographic order.
Find more words whose letters appear in reverse lexicographic order.
"""
from get_words import get_words

def in_reverse_sorted_order(word):
    chars = list(word)
    # Note: reversed returns a generator, 
    # so we have to pass it to list() 
    # to explicitly enumerate the reversed results.
    if(str(chars)==str(list(reversed(sorted(chars))))):
        return True
    else:
        return False

if __name__=="__main__":

    words = get_words()
    words = sorted(words)

    count = 0
    print("-"*40)
    print("ALL lexicographically reversed words:")
    for word in words:
        if(in_reverse_sorted_order(word)):
            print(word)
            count += 1
    print("{0:d} total.".format(count))

    print("-"*40)
    for word in words:
        if(in_reverse_sorted_order(word)):
            print("First reverse lexicographically sorted word:")
            print(word)
            break

    words.reverse()

    print("-"*40)
    for word in words:
        if(in_reverse_sorted_order(word)):
            print("Last reverse lexicographically sorted word:")
            print(word)
            break

Finding Palindromes

Palindromes are words or phrases that have a reflective property: they spell the same thing forward and backward (e.g., "race car", "Able was I ere I saw Elba", or "Doc, note: I dissent. A fast never prevents a fatness. I diet on cod.").

In Exercise 29, Knuth asks the reader to perform a straightforward task - find the palindromes in the list of five letter words. (An example of one such word is "kayak".) But Knuth goes further, and points out that palindromes can also be formed from pairs of words. He gives the example "regal lager". He asks the reader to find all palindrome pairs as well.

When working on these exercises, we became curious about palindromic near-misses. How many words are almost palindromes? (Example: "going" is very close to a palindrome, if we just changed the n to an o or vice-versa.) In fact, we already have all the tools we need at our disposal, as we already covered a script to perform a Euclidean distance calculation.

We will cover Python code to find words that fit into each of these categories, and provide some interesting examples. (One of the most surprising things to us was just how many words meet these criteria!)

Palindromes

The first task is finding palindromes in the set of five letter words. There are 18 such words. They are given below:

  • level
  • refer
  • radar
  • madam
  • rotor
  • civic
  • sexes
  • solos
  • sagas
  • kayak
  • minim
  • tenet
  • shahs
  • stats
  • stets
  • kaiak
  • finif
  • dewed

The code to check whether a word is a palindrome consists of two simple logical tests: Is the character at position 0 equal to the character at position 4? Is the character at position 1 equal to the character at position 3? (The middle character, at position 2, matches itself automatically.) If both tests pass, the word is a palindrome. Here's the Python function to check if a word is a palindrome:

palindromes.py

def is_palindrome(word):
    test1 = word[0]==word[4]
    test2 = word[1]==word[3]
    if(test1 and test2):
        return True
    return False

and the main driver method, which actually runs the function on each word:

if __name__=="__main__":
    words = get_words()

    kp = 0
    palindromes = []

    # Check for palindromes
    for i in range(len(words)):
        if(is_palindrome(words[i])):
            kp += 1
            palindromes.append(words[i])

    print("-"*40)
    print("Palindromes: \n")
    print(", ".join(palindromes))
    print("There are {0:d} palindromes.".format(kp))
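Since every word in the list has exactly five letters, hard-coding the index tests is fine. For words of arbitrary length, an equivalent slicing-based check (not part of the original script) gives the same result:

```python
def is_palindrome(word):
    # A word is a palindrome iff it equals its own reversal.
    return word == word[::-1]
```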

Palindrome Pairs

There are 34 palindromic pairs, if we disallow palindromes from being considered palindromic pairs with themselves. These are:

  • parts, strap
  • lived, devil
  • speed, deeps
  • sleep, peels
  • straw, warts
  • faced, decaf
  • spots, stops
  • fires, serif
  • lever, revel
  • smart, trams
  • ports, strop
  • pools, sloop
  • stool, loots
  • draws, sward
  • mined, denim
  • spins, snips
  • alley, yella
  • loops, spool
  • sleek, keels
  • repel, leper
  • snaps, spans
  • depot, toped
  • timed, demit
  • debut, tubed
  • laced, decal
  • stink, knits
  • regal, lager
  • tuber, rebut
  • remit, timer
  • pacer, recap
  • snoot, toons
  • namer, reman
  • hales, selah
  • tarps, sprat

The code to check for palindrome pairs is a little more involved, but also consists of a few logical tests to see if letters in one position of the first word match letters in another position of the second word:

palindromes.py

def is_palindrome_pair(word1,word2):
    test0 = word1[0]==word2[4]
    test1 = word1[1]==word2[3]
    test2 = word1[2]==word2[2]
    test3 = word1[3]==word2[1]
    test4 = word1[4]==word2[0]
    if(test0 and test1 and test2 and test3 and test4):
        return True
    return False

and the main driver method:

if __name__=="__main__":
    words = get_words()

    kpp = 0
    palindrome_pairs = []

    # Check for palindrome pairs
    for i in range(len(words)):
        for j in range(i,len(words)):
            if(is_palindrome_pair(words[i],words[j])):
                # Palindromes shouldn't count as palindrome pairs
                if(words[i] != words[j]):
                    kpp += 1
                    palindrome_pairs.append((words[i],words[j]))

    print("-"*40)
    print("Palindrome Pairs: \n")
    for pair in palindrome_pairs:
        print(", ".join(pair))
    print("There are {0:d} palindrome pairs.".format(kpp))
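As with single palindromes, the five hard-coded index tests can be collapsed into a single reversal comparison, which also generalizes beyond five-letter words (this variant is ours, not part of the original script):

```python
def is_palindrome_pair(word1, word2):
    # Two words form a palindrome pair iff one is the reversal of the other.
    return word1 == word2[::-1]
```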

Near Palindromes

A near-palindrome is a word that would be a palindrome, if one of its letters were slightly modified. We use a "tolerance" parameter to specify how much modification we are willing to live with to consider a word a near-palindrome.

There are several ways to do this, but we'll keep it simple: we consider the total amount of change, summed over all character positions, required to make a word a palindrome, and test whether that total is less than or equal to a specified parameter, tolerance.

For example, if our tolerance were 1, we would consider the words "going" and "moron" to be near-palindromes; if our tolerance were 2, we would consider the words "tsars" and "jewel" to be near-palindromes.

Here is the list of 37 off-by-one palindromes:

  • going
  • seeds
  • tight
  • trust
  • suits
  • sends
  • plump
  • slums
  • sighs
  • erase
  • serfs
  • soaps
  • sewer
  • soups
  • sever
  • slams
  • scabs
  • moron
  • ceded
  • scads
  • suets
  • fugue
  • seder
  • tryst
  • educe
  • twixt
  • tutus
  • shags
  • slims
  • abaca
  • anima
  • celeb
  • selfs
  • scuds
  • tikis
  • topos
  • rajas

and the list of off-by-two palindromes:

  • often
  • stars
  • sight
  • visit
  • towns
  • climb
  • flame
  • reads
  • sings
  • hatch
  • tends
  • naval
  • robot
  • reeds
  • cocoa
  • stout
  • spins
  • onion
  • sinks
  • edged
  • spurs
  • jewel
  • snaps
  • silks
  • nasal
  • theft
  • pagan
  • reefs
  • stirs
  • snips
  • tufts
  • truss
  • strut
  • spans
  • smelt
  • spars
  • flake
  • rusts
  • skims
  • sways
  • runts
  • tsars
  • tress
  • feted
  • rends
  • romps
  • cilia
  • ephod
  • fluke
  • reset
  • farad
  • peter
  • natal
  • thugs
  • newel
  • paean
  • emend
  • snoot
  • fiche
  • porno
  • flume
  • toons
  • roans
  • offen
  • klunk
  • feued
  • nihil
  • pavan
  • relet
  • heigh
  • revet
  • sicks
  • spoor

The check for near-palindromes follows the palindrome test fairly closely, except that instead of checking whether the letters in two positions are equal, we check whether those two letters are within a certain specified distance of one another.
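The euclidean_distance() function used below is imported from an earlier script in this series, which is not shown here. A minimal stand-in consistent with how it is used in this post (the distance between two single letters, measured by their positions in the alphabet) might look like:

```python
import math

def euclidean_distance(c1, c2):
    # Distance between two single characters, taken as the absolute
    # difference of their alphabet positions; in one dimension the
    # Euclidean distance reduces to an absolute difference.
    return math.sqrt((ord(c1) - ord(c2))**2)
```

This is an assumption about the helper's behavior, not a reproduction of the original module.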

Here is the code for finding near-palindromes:

"""
near_palindromes.py

Donald Knuth, Art of Computer Programming, Volume 4 Fascicle 0
Variation on Exercise #29

Find SGB words that are near-palindromes
(edit distance of one or two letters away from a palindrome).
"""
from get_words import get_words
from euclidean_distance import euclidean_distance
from pprint import pprint

def is_near_palindrome(word,lo,hi):
    d1 = euclidean_distance(word[0],word[4])
    d2 = euclidean_distance(word[1],word[3])

    if( (d1+d2) > lo and (d1+d2) <= hi ):
        return True

    return False

if __name__=="__main__":
    words = get_words()

    knp = 0
    near_palindromes = []

    # Euclidean distance tolerance
    lo = 0.0
    hi = 1.0

    for i in range(len(words)):
        if(is_near_palindrome(words[i],lo,hi)):
            knp += 1
            near_palindromes.append(words[i])

    print("-"*40)
    print("Near Palindromes: \n")
    print(", ".join(near_palindromes))
    print("The number of near-palindromes is {0:d}".format(len(near_palindromes)))

References

  1. Knuth, Donald. The Art of Computer Programming, Volume 4 Fascicle 0. Upper Saddle River, NJ: Addison-Wesley, 2008.

  2. Knuth, Donald. The Stanford GraphBase: A Platform for Combinatorial Computing. New York: ACM Press, 1994. <http://www-cs-faculty.stanford.edu/~knuth/sgb.html>

  3. "Five Letter Words." Git repository, git.charlesreid1.com. Charles Reid. Updated 1 September 2017. <http://git.charlesreid1.com/cs/five-letter-words>

Tags:    python    computer science    graphs    algorithms    art of computer programming    knuth    language