charlesreid1.com blog

Five Letter Words: Part 4: Revisiting Diff by One

Posted in Computer Science

permalink

About the Five-Letter Words

In Volume 4, Facsimile 0 of Donald Knuth's Art of Computer Programming, in which Knuth covers graph theory, he introduces a list of five-letter words as part of a data set useful in exploring graph theory and graph algorithms.

The list of words is part of the Stanford Graph Base, a set of data sets that are useful for studying graph theory and networks.

See Five Letter Words on the charlesreid1.com wiki for details.

Different by 1, Revisited

This post is revisiting an exercise from the above volume, Exercise 28:

Find pairs of SGB word vectors that differ by +/- 1 in each component.

In a prior blog post (Part 1), we had inerpreted the question as finding word vectors whose Euclidean distance differed by 1 total, which is the same as a Hamming distance of 1.

However, on revisiting the (more interesting) question actually being posed by the author, we find a different and more difficult problem.

As an example of what Knuth is asking for:

rover -> spuds

Each letter of the words are within an edit distance of 1, at each position.

There are 38 such pairs:

$ python diff_by_one_fixed.py
abaft babes
absit baths
adder beefs
ambit blahs
anger boffs
anode boned
bider chefs
bidet chefs
biffs cheer
ghost hints
hobos inapt
holds inker
honed inode
hoods inner
hoofs inner
hoots input
hoped inode
ingot johns
needs odder
needs offer
rider sheds
rifer sheds
rinds shoer
robed spade
robot snaps
robot spans
rover spuds
ruffs steer
runts stout
rusts strut
sheer tiffs
sheet tiffs
shout tints
sides theft
sneer toffs
splat tombs
spuds toter
stuns tutor
Found 38 pairs of words that differ by +/-1 in each component.

The approach we used was as follows:

  • Iterate over each word in the wordlist (use the first 1,000 words to keep it shorter for testing)
  • For each word:
  • Generate all variations that are within +/-1 using recursive backtracking (could also use algorithm to generate all 32 binary codes of length 5, where 0 = -1, 1 = +1)
  • For each of the 32 variations,
    • Check if the word is in the wordset (O(1) cost if using a hash table/set)
    • If so, add ordered pair (word1,word2) to a set of solutions (to avoid dupes)

Different by N, Revisited

We went back and modified the code to take a distance parameter d, but storage and compute cost, as well as the sparsity of the graph of shared bigrams and trigrams among these 5,000 words, means the number of pairs increases exponentially.

| Distance  | Number of pairs   | Walltime       |
|-----------|-------------------|----------------|
| 1         | 38                |     0.26 s     |
| 2         | 525               |     5.26 s     |
| 3         | 4982              |    38.87 s     |
| 4         | ?? (10^5 est.)    |  10 min (est.) |

You can find the diff_by_n.py script here: https://git.charlesreid1.com/cs/five-letter-words/

The output:

$ python diff_by_n.py
abaft babes
absit baths
adder beefs
ambit blahs
anger boffs
anode boned
bider chefs
bidet chefs
biffs cheer
ghost hints
..
Found 38 pairs of words that differ by +/-1 in each component.
Time: 0.2673 s

aback babel
aback cabal
abaft babes
abash cacti
abide baked
abide caged
abide caked
abler bands
abler bangs
abode caned
...
Found 525 pairs of words that differ by +/-2 in each component.
Time: 5.2617 s

abaca ceded
abaci babel
abaci cabal
abaci decaf
abaci decal
aback babel
aback cabal
aback decal
abaft babes
abaft bedew
...
Found 4982 pairs of words that differ by +/-3 in each component.
Time: 38.8743 s

centillion: a document search engine

Posted in Centillion

permalink

We're excited to announce the public release of centillion, a document search engine.

centillion is a search tool that can be used by any individual or organization to index Github repositories (including the content of markdown files), Google Drive folders (including the content of .docx files), and Disqus comment threads.

centillion is tested using Travis CI.

centillion was originally written for the NIH Data Commons effort (which recently concluded). centillion was built to facilitate information-finding in a project with hundreds of people at dozens of institutions generating a sea of email threads, Google Drive folders, markdown files, websites, and Github repositories.

centillion provided a single comprehensive way of searching across All The Things and earned the author many thanks from members across the Data Commons. It is the author's hope that centillion can prove equally useful for other organizations.

Under the hood centillion uses Flask (a web server microframework) and Whoosh (a Python-based search engine tool).

You can get a copy of the latest centillion release here: https://github.com/dcppc/centillion

You can find the latest centillion documentation here: http://nih-data-commons.us/centillion/

Tags:    python    centillion    search    search engine    google drive    github    flask   

Any Color You Like, As Long As It's 00ADD8

Posted in Go

permalink

A short post with some thoughts on how writing Go code has helped me learn to stop worrying and love auto-formatting tools.

Go code is terse. Not Python-terse, but terse. And unlike Java, you don't find yourself constantly resorting to the security blanket of objects - something that Python (mercifully) can go either way on.

I used Java when I taught computer science at South Seattle College, and remember telling students once that one day, students taking CSC 142/143 would be using Go instead of Java. These days, I'm not as certain of that, but given that Go's strengths are asynchronous programming (critical for taking advantage of multicore hardware) and tasks suited for the web, it isn't hard to imagine a "Go 2.0" that becomes a de-facto standard in school curricula.

Something else I like about Go is the way there is a toolchain that adheres to the Unix tooling philosophy: do one thing and do it well. Take gofmt as an example - this is a tool that autoformats Go code to conform to the Go standard spec. gofmt is a simple tool that does just one thing. This tool can be connected to various text editors with hooks, a la vimgo.

gofmt has taught me the value, and convenience, of embracing the norms and standards set by a language's community. Go recommends using tabs, for example, which early on I found a bit repulsive. Before I had vimgo set up, I was stubbornly using spaces instead of tabs in my Go code.

But then I set up vimgo so that, every time I saved a buffer containing Go code, it would run gofmt on the code, replacing all of the nitpicky details (like how many spaces between parentheses and variables, or wether == should be surrounded be spaces) and it just makes an executive decision.

Sure, it uses tabs instead of spaces, but once you start to work on code and save it and you see all of these details just handled, you quickly learn not to worry about it.

And the surprising thing, to me, was just how much overhead I was spending on those things. It adds up.

The gofmt executive decision strategy is similar to black, "The uncompromising Python code formatter," whose slogan is "Any color you like, as long as it's black."

While I really like black and would love to let it handle all of my Python code the way gofmt handles all of my Go code, the unfortunate reality is that Python, unlike Go, does not have an official standard, and if you automatically apply black formatting to all Python code, you can quickly wreak havoc on version-controlled code. You have to tread more lightly with black. I apply black more selectively by only applying it to .py files that are in specific project subdirectories.)

A slogan for gofmt could be, "Any color you like, as long as it's #00ADD8."

Wait, what? Where did #00ADD8 come from?

It's in the Go Brand Book. Prior to discovering this (the link was dropped in an unrelated discussion on the gonuts mailing list), I had no idea waht a brand book was. Turns out, this is very much a thing in marketing. Companies, projects, and organizations all have brand books that lay out the details of their marketing designs, branding, looks, everything down to the fonts and colors.

The Go brand book is short, but it does specify an official color for Golang: #00ADD8. It also covers critical details about how to depict the Go gopher, including the physics of gopher belly folds:

Extremely important details

There are some other branding books - the Coca Cola brand book. is simultaneously fascinating and terrible, in a late stage capitalism kind of way.

At any rate, at least the Go brand book is about something useful, and contains silly things like gophers.

Gopher specs

Tags:    go    golang    rosalind    bioinformatics    black    python    gofmt   

March 2022

How to Read Ulysses

July 2020

Applied Gitflow

September 2019

Mocking AWS in Unit Tests

May 2018

Current Projects