charlesreid1.com blog

Recursive Backtracking in Go for Bioinformatics Applications: 1. Counting Variations

Posted in Computational Biology

permalink

This is the first in a series of three blog posts describing our solution to a bioinformatics problem from Rosalind.info, Problem BA1(i) (Find most frequent words with mismatches in a string). To solve this problem and generate variations of a DNA string as required, we implemented a recursive backtracking method in the Go programming language.



Table of Contents



Problem Description

The task at hand is to take a given input strand of DNA, and generate variations from it that have up to \(d\) differences (a Hamming distance of \(d\)) in the codons (base pairs).

In part 1 of this series, we walk through the construction of an analytical formula to count the number of variations of a given DNA string that can be generated, given the constraints of the problem.

In part 2 of this series, we cover several techniques to generate variations on a DNA string, and present pseudocode for the recursive backtracking method that we use here.

In part 3 of this series, we will cover our implementation of the recursive backtracking method in the Go programming language.

Useful Functions

It's always useful to review some basic mathematics useful for combinatorics applications. We'll review the factorial and binomial functions, which will show up in our final formula for the total nubmer of variations we will be generating.

Factorial Function

The factorial function for an integer \(n\) is written \(n!\) and is defined for \(n \geq 1\) as:

$$ n! = n \cdot (n-1) \cdot \dots \cdot 2 \cdot 1 $$

for example, \(5!\) would be:

$$ 5! = 5 \times 4 \times 3 \times 2 \times 1 = 120 $$

Binomial Function

The binomial function has many applications in combinatorics. It is the number of ways of independently selecting \(k\) items from a set of \(n\) items, and is written:

$$ \binom{n}{k} = \dfrac{ n! }{ k! (n-k)! } $$



Counting Permutations

What we want is a formula to count the number of permutations

To derive a formula, it helps to think through the problem starting with smaller special cases, and generalize from there in terms of the problem parameters.

Deriving the Formula

The problem we're trying to solve is generating all perms with hamming distance less than or equal to d, but let's start with a simpler problem: generating all perms with hamming distance of exactly d.

Then we can just sum up over each d.

Start with a simple situation: string of dna with 3 codons. Case of hamming distance 0 too trivial, so start with case of hamming distance of 1.

There are two terms in our combinatorics formula that we need to think about:

  • Term 1: We have a certain number of codons to modify (this is fixed by the Hamming distance d that we pick). Term 1 counts up the number of ways of selecting which indices of the original DNA string to modify.

  • Term 2: Once we've picked out the indices we are going to modify, we have several variations for each index (4 total codons, so 3 variations). Term 2 is a count of the number of variations that are possible, given the choice of indices in the original DNA string to modify.

The approach here is to think about these two terms independently and separately. Each term has a formula to count the number of possibilities indexed by each. Then, because these are independent choices, the total number of combined choices is the product of these two terms.

Term 1: Picking DNA Indices

The first term in our formula for number of variations will be the term representing the number of ways of choosing which indices in the original DNA input string to edit.

Given a Hamming distance of \(d\), and the fact that we have one and only one edit (Hamming distance unit) per base pair, Term 1 counts the number of ways of picking \(d\) items from a set of \(n\) items. Order does not matter.

This problem is equivalent to having a row of \(n\) on/off switches, all in the off position, and counting the number of ways of throwing exactly \(d\) of them into the on position.

Likewise, it is equivalent to having \(d\) identical colored balls, and counting the number of ways of placing them into \(n\) slots, one ball per slot.

We can see how the problem has a kind of triangular structure. Returning to the scenario of \(n\) on/off switches:

  • If we have \(d = n\) switches to throw, or if we have \(d = 0\) switches to throw, in either case we have only 1 possible outcome.

  • If we have \(d = n-1\) switches to throw, or if we have \(d = 1\) switch to throw, either way we have \(n\) possible outcomes

  • If we have \(d = n-2\) or \(d = 2\) switches to throw, there are \(n (n-1)\) possible outcomes; etc.

In fact, this problem - choosing \(d\) things from a set of \(n\) things - is common enough that there is a special function just to describe it, and that's the binomial function (covered above).

The binomial function is defined as:

$$ \binom{n}{k} = \dfrac{ n! }{ k! (n-k)! } $$

In the scenarios posed above, the order of our choices did not matter - the balls were not numbered, the order in which we threw each switch did not affect the outcome.

If the order did matter, if the order in which the on/off switches were thrown mattered or if the balls that were placed into slots had sequential numbers on them, then we would need a different function - the expression above to count the number of outcomes would not have a \(k!\) in the denominator.

Term 1: Side Note on Ordering

If the order of the index choices does not matter, the \(k!\) term in the denominator must be included to cancel out double-counting in the situations where (for example) \(i\) is chosen first and \(j\) is chosen second, and then the situation where \(j\) is chosen first and \(i\) is chosen second.

If the \(k!\) term is present in the denominator, it says that the order in which items are selected does not matter, in which case we are generating combinations.

To count combinations, use the "n choose k" function. See the Combination article on Wolfram MathWorld.

If the \(k!\) term is not present in the denominator, it says that the order in which items are selected does matter, in which case we are generating permutations.

To count permutations, use the "n pick k" function. See the Permutation article on Wolfram MathWorld.

Term 2: Modifying DNA Codons

Once we've selected the \(d\) indices in the original DNA string that we are going to modify, we have to count the number of ways those base pairs can be modified.

We have \(d\) base pairs to modify, and \(c = 4\) total codons (ATGC). Each base pair that we are modifying has \(c-1\) possible codons that it we can swap it out with, and each choice is independent, so the number of possibile outcomes (Term 2) is:

$$ (c-1)^{d} $$



Final Counting Formula

To write the final formula for counting the number of variations \(V\) of a given DNA string of length \(n\) that are a Hamming distance of less than or equal to \(d\), with \(c\) possible codons (A, T, G, C), we will need to sum over Hamming distances from 0 to \(d\):

$$ V = \sum_{k = 0}^{d} \binom{n}{k} (c-1)^{k} $$

Implementing in Go

Now, let's look at how we would implement this counting formula in Go. This will be useful, since programs run much faster when they are able to allocate all the sapce they need in memory ahead of time. Counting the number of variations of our DNA input string will allow us to do just that.

Binomial and Factorial Functions in Go

We'll start with binomial and factorial functions in Go: continuing with our theme of recursion, we implement a recursive factorial function.

// Compute the factorial of an integer.
func Factorial(n int) int {
    if n < 2 {
        // base case
        return 1
    } else {
        // recursive case
        return n * Factorial(n-1)
    }
}

The factorial function will behave correctly for the case of \(n=1\) and \(n=0\), and will return 1 if \(n\) is negative (which is reasonable behavior for our purposes.)

The binomial function utilizes the factorial function:

// Returns value of the binomial coefficient Binom(n, k).
func Binomial(n, k int) int {

    result := 1

    // Since C(n, k) = C(n, n-k)
    if k > (n - k) {
        k = n - k
    }

    // Calculate value of:
    //
    // ( n * (n-1) * ... * (n-k+1) )
    // -----------------------------
    //   ( k * (k-1) * ... * 1 )
    // 
    for i := 0; i < k; i++ {
        result *= n - i
        result /= i + 1
    }

    return result
}

(Note that we might want to add some additional error checks to the Binomial() function.)

Variations Counting Function in Go

Now we can put everything together into a function to count the number of "Hamming neighbors" - variations on a given DNA string that are a Hamming distance of up to \(d\) away from the original DNA string.

To count the number of Hamming neighbors, we implement the formula above. We leave out the error checks on the parameter values here, for brevity.

The inputs are:

  • \(n\) - length of DNA input string
  • \(d\) - maximum Hamming distance
  • \(c\) - number of codons (4, ATGC)
// Given an input string of DNA of length n,
// a maximum Hamming distance of d,
// and a number of codons c, determine
// the number of Hamming neighbors of
// distance less than or equal to d
// using a combinatorics formula.
func CountHammingNeighbors(n, d, c int) (int, error) {

    // We require the following:
    // n > 0
    // d >= 0
    // c > 0

    // Use combinatorics to calculate number of variations
    nv := 0
    for dd := 0; dd <= d; dd++ {

        // Binomial(n,d) => number of ways we can
        //                  pick codons to edit
        next_term := Binomial(n, dd)

        // (c-1)^d => number of ways that the codons
        //            we picked to edit can be edited
        for j := 0; j < dd; j++ {
            next_term *= (c - 1)
        }
        nv += next_term
    }
    return nv, nil
}

We can run this with a few values of k and d to verify it returns the expected values:

For kmer AAA k = 3:
d = 0, count = 1
d = 1, count = 10
d = 2, count = 37
d = 3, count = 64

for a kmer of length 3, we can compute the first 3 values (1, 10, 37) by hand. The last value, when \(d = k\), is a special case where every base pair in the DNA strand can be changed to any codon. Since there are 4 possible codons, this leads to \(4^k = 2^{2k}\) possibilities.

For \(d = k = 3\), we have \(2^6 = 64\) possible DNA strings.

Now, moving on to \(k=5\):

For kmer AAAAA k = 5:
d = 0, count = 1
d = 1, count = 16
d = 2, count = 106
d = 3, count = 376
d = 4, count = 781
d = 5, count = 1024

We can calculate 1 and 16 by hand, verifying those two numbers. As before, the case of \(k = d = 5\) gives a total of \(4^5 = 2^{10} = 1024\) possible DNA strings.



Next: Recursive Backtracking in Go for Bioinformatics Applications: 2. Generating Variations

Tags:    go    golang    rosalind    computational biology    bioinformatics    recursion    backtracking    strings    combinatorics   

Basic Data Structures in Go: Maps

Posted in Computational Biology

permalink

Basic Data Structures in Go: Maps

Continuing with our series of blog posts on what we've been learning about Go in the process of solving problems on Rosalind.info, this post will cover how some basic data structures work in Go, and how we used each to solve problems from the Chapter 1 Rosalind problems.

Maps

The simplest way to describe a map is to say it is a structure useful for storing key-value pairs.

Before we walk through what maps look like in Go, let's talk about what a map is (in the data structure sense). And to do that, it's useful to talk a bit about mathematical functions and maps in the mathematical sense.

What is a map

The term "map" is taken from mathematics. A map is just a relationship - a general term, but still useful. Most mathematics courses deal with functions, which are formally defined as maps from one set of numbers onto another, such that one input corresponds to one output.

A (data structure) map, similarly, is a relationship between two sets - a key set, and a value set. Each key corresponds to only one value.

Maps are typically stored under the hood as either a hash map (which does not sort keys and has very fast O(1), or constant time, operations) or a tree map (which sorts keys using a binary tree and has slower O(log N) operations).

Map notation in Go

In Go, map types are denoted map[keyType]valueType.

For example, to map strings to integers, we would use a map of type map[string]int.

We create a map variable by declaring its type:

var m map[string]int

However, this does not allocate any space for the map, and trying to add keys to the map at this point would result in an error.

We need to actually allocate space for the map. In Go, you can allocate space for a map two ways: first, using Go's built-in make() function; and second, by creating and populating the map in one line.

Using make() with maps

To allocate space for a map, you can use the make() function and pass it the map type. This will actually create space in memory for the map, and allow you to add items or look up keys in the map.

var m map[string]int
m = make(map[string]int)

Creating and populating

If you want to create and populate the map in one line, you can specify the type, then have trailing brackets containing the items you want to add:

// Create an empty map
m := map[string]int{}
// Create and populate a map
m := map[string]int{"A":10, "T":15, "G":20, "C":25}

Zero values

One feature of maps that makes them really easy to work with is, if you try and look up a key, and the key does not exist in the map, the map will not raise an exception (which Python does), it will return the zero value of the value type.

For example, the zero value of the int type is 0, so if we create a map like m := map[string]int{"A":10} and we then look up a key that isn't in the map, like m["G"], Go will return 0.

Similarly, the zero value for booleans is false, so you can utilize the zero value behavior to create a set data structure using maps. By creating a map[keyType]bool map, you can use the boolean value to indicate membership of a key in the given set. Then, if you look up keys that do not exist in the map, Go will return the zero value of booleans by default, which will be false.

Easy iterating over maps

It is easy to iterate over maps using the range keyword in Go, which will return the keys (optionally, both keys and values) of the map in a loop:

var m map[string]int{
        "ABC":10, 
        "DEF":20, 
        "GHI":30
}

for k,v := range m {
    fmt.Println("Key:",k," --> Value:",v)
}

Example: Assembling kmer histogram

Here is an example using maps: this function assembles a kmer histogram from a strand of DNA base pairs. To do this, it loops over every codon in the DNA strand and increments a counter in a map. Finally, this histogram map is returned. (This function is useful for determining the most frequent kmer.)

Here's the whole function that helps solve Rosalind problem BA1B. We'll look at it piece by piece below:

// Return the histogram of kmers of length k 
// found in the given input
func KmerHistogram(input string, k int) (map[string]int,error) {

    result := map[string]int{}

    if len(input)<1 {
        err := fmt.Sprintf("Error: input string was not DNA. Only characters ATCG are allowed, you had %s",input)
        return result, errors.New(err)
    }

    // Number of substring overlaps
    overlap := len(input) - k + 1

    // If overlap < 1, we are looking
    // for kmers longer than our input
    if overlap<1 {
        return result,nil
    }

    // Iterate over each position,
    // extract the string,
    // increment the count.
    for i:=0; i<overlap; i++ {
        // Get the kmer of interest
        substr := input[i:i+k]

        // If it doesn't exist, the value is 0
        result[substr] += 1
    }

    return result,nil
}

Function Walkthrough

The first thing you'll notice is the comment style: there is a comment right before each function, which is common practice in Go, because comments that come before a function are picked up by godoc (the Go documentation tool) and turned into documentation.

// Return the histogram of kmers of length k 
// found in the given input
func KmerHistogram(input string, k int) (map[string]int,error) {

Go has internalized the idea that comments are a part of the documentation, so comments don't need to be formatted in any special way (like /// this or /** this */ business) to end up being picked up by godoc.

Next, we create an empty map (kmer strings to integer frequency counters) and stride over the entire input string with a window the size of the kmers we are interested in, adding or incrementing each corresponding counter in the map as we go.

The overlap variable is the number of possible kmers of length k in the entire input string, which requires a bit of algebra to gt the indices right:

    // Number of substring overlaps
    overlap := len(input) - k + 1

The for loop utilizes Go's slice notation to take a slice of the string (which does not require creating or duplicating any string data), and uses the extracted kmer as a key to add or increment the counter:

    // Iterate over each position,
    // extract the string,
    // increment the count.
    for i:=0; i<overlap; i++ {
        // Get the kmer of interest
        substr := input[i:i+k]

        // If it doesn't exist, the value is 0
        result[substr] += 1
    }

This is where the behavior of maps for non-existent keys comes in handy - in Go, if you ask for a key that does not exist in the map, the map will return the zero value of the specified type.

This statement:

        result[substr] += 1

can also be written as:

        result[substr] = result[substr] + 1

So, the first time a kmer (stored in substr) is encountered, the kmer will not exist as a key in the map, and the key lookup on the right hand side will return 0, incrementing the counter to 1 the first time the kmer is encountered.

Subsequent times the kmer is encountered, the value will be found and substituted on the right side so it will be incremented by 1.

Finally, when we return from the function, we can use a Python-like syntax of returning multiple values separated by commas, yet another great feature of Go:

    return result,nil
}

By convention, the return types of functions will include an error type at the very end, so if you had a function named ExampleFunction that returned three integers, the function definition would look like this:

func ExampleFunction() (int, int, int, error) {
    ...    
}

Additionally, we also see from the function's return statement above that we can use the reserved keyword nil to set a variable to a null value, and that the convention is to return nil in place of the error, so that in the calling function we can set up a structure like this:

a, b, c, err := ExampleFunction()

if err != nil {
    err := "Please read this important message"
    return -1,-1,-1,errors.New(err)
}

Summary

Maps are my favorite data structure, so I'm glad that they're easy to use in Go. Some important points about using maps in Go:

  • Declaring a variable as a map type does not allocate any space for the map; saying var m map[keyType]valueType and then trying to access keys will cause an exception.

  • To allcoate space for a map, use make(map[keyType]valueType) or instantiate with the {} notation, like m := map[keyType]valueType{"asdf":1}

  • To ask for a key, use the square bracket notation. To set a value for a key, use the square bracket notation on the left and assign a value on the right: m[my_key] = my_value

  • Asking for missing keys will return the zero value of whatever type the map values are.

  • Iterating over key-value pairs in a map using a for loop is easy using the built-in range keyword.

Addendum: Check if a Key is in a Map

Because of the default behavior of maps, where they return a zero value for keys that do not exist in the map, it is not immediately obvious how to differentiate between the case where a key is in a map already and has a zero value, versus the case where the key does not yet exist in the map and the zero value is only being returned because the key can't be found.

To resolve this, we can use two Go features: error-checking, and the underscore - a variable that serves as a one-way sink for information, and serves a similar purpose to the underscore in Python.

First, we mentioned above that various operations that can return errors will return an error type as the last return type, along with any other return values. This includes the operation of accessing a key. To assign the value and the error to variables at once:

v, err := m[my_key]

Now, to check if the key exists in the map, we are only concerned with the variable err and we don't really need the variable v. Instead of assigning v to a variable that we never use, and then having the Go compiler complain about it, we can use the underscore as a placeholder:

_, err := m[my_key]

Now, we just add a check for whether err is nil, and voila, we have our check of whether a key is in the map or not:

_, err := m[my_key]

if err != nil {
    // This key is missing from the map

} else {
    // This key already exists in the map

}

Tags:    go    golang    rosalind    computational biology    bioinformatics    maps   

Learning Bioinformatics with Go and Rosalind

Posted in Computational Biology

permalink

Learning Go with Rosalind

What is Rosalind?

Rosalind.info is a website with programming challenges, similar in spirit to Project Euler, but with a focus on bioinformatics.

Problems in the bioinformatics track are presented grouped by chapter, with several problems per chapter. The problems are designed like a coding competition, with problems providing structured input files and expecting structured output from each calculation. Each time you solve a problem, a unique input is generated, and you have a time limit in which to run your code to solve the problem.

What is Go?

Go is a programming language that is static, compiled, and concurrent. It is essentially intended as a modern replacement to C and C++, and is designed for more modern hardware, networks, and scale of code projects.

Go is a language that was invented in Google. It provides some very powerful features, and its design for concurrency is the primary feature that motivated me to learn Go. See this Go blog post on "Concurrency is not parallelism" for a more in-depth discussion of Go's concurrency design, and how that is different from (and more general and more powerful than) parallelism.

Initial Impression of Go

My initial impression of Go has been positive thus far.

Syntax

The first thing about Go that we will remark on is the syntax - it is unusual because it reverses the order that most programming languages use for variable and type declarations.

For example, in Java, you declare an integer array like this:

public int[5] arr;

while in Go, you would declare an analogous data structure like this:

var arr [5]int

While this reversal is a bit confusing at first, it is different enough from other languages that it easily sticks in your head - which is actually a welcome change from the stew of slightly-different-but-largely-overlapping syntax occurring across common programming languages.

Go also shares many of the features that make Python such an easy language to use, including lack of semicolons (yay!) and a useful selection of built-in methods.

Godoc

One of the handiest features of Go is godoc, a command-line utility that runs a local web server at http:8080 that serves up Go documentation. This includes documentation for the standard Go library, as well as documentation for any libraries found in the Go PATH.

Environment Variables

Speaking of the Go PATH, one confusing thing about getting started with Go is all of the environment variables that must be set for Go to find various things.

Basically, you need to have two directories for Go: the location of your Go tree, and the directory where you store Go packages and executables.

  • $GOROOT refers to the location of your Go tree. For example, this might be the directory ~/go.

  • $GOPATH refers to the location where your Go packages and executables are stored. For example, this might be ~/work.

Both of these variables should be set in your ~/.profile.

See Go/Installing on the charlesreid1 wiki for coverage of getting set up with Go.

Arrays and Slices

The first concept in Go that threw me off was the concept of arrays versus slices. This was principally due to the poor explanation and choice of examples given on the Go blog post on how slices work.

Arrays are chunks of memory that are set aside to store contiguous slots for data. To create an array, its size must be known in advance, making them less useful in practice. To resize an array, a new array must be created, and the old array copied into the new array.

Slices, on the other hand, can be thought of as tiny data structures containing pointers to array data. Slices can easily be resized by changing the data structure, without changing the underlying data.

There are also built-in make() and copy() functions, to allocate slices that have a specified capacity (or none, in which case the slice has a variable size) and copy data from arrays to slices or from slice to slice.

The confusion around such a fundamental data type makes Go more difficult to learn and to intuit about what's going on. While Go is clearly superior to C and Java in many ways, it also has some unfortunate stumbling blocks in its most basic data structure that many early learners are sure to have trouble with.

Learning a New Language

One of the things that stood out to me while learning Go was how different it was from learning a non-language systems technology (library) like Docker or docker-compose, or Ansible.

When you are learning a second or third programming language, you generally have a mental roadmap of concepts, and how they fit together and build upon each other. Furthermore, you already know, before ever starting to learn the language, what that roadmap of concepts generally looks like, from your prior learnings.

(If you pick up and compare any dozen books on various programming languages, you'll find certain core concepts that are repeated from book to book.)

Compare this with a technology like Ansible, an extremely powerful Python library that is used for automation of IT and compute infrastructure. While extremely powerful, Ansible is also extremely complex, and feels bewildering to learn because it requires all users (including seasoned experts in Unix shell scripting), to re-learn an entirely new way of doing things using Ansible's system of modules and YAML syntax.

Ansible has no conceptual roadmap. The Ansible documentation moves somewhat haphazardly through the many topics that must be covered but that don't fit together. Books that cover Ansible often follow the documentation's organization, or have a similarly confused organization of concepts. There is no denying it's a great piece of software, but it is very difficult to reason about and teach compared to a computer language.

Core Concepts

The first few Rosalind problems from Chapter 1 were an excellent warm-up to get to know Go, since the tasks were relatively easy and the main bit to work out was how to use the data structure correctly, rather than the algorithm.

Three data structures that we ended up utilizing to solve the Chapter 1 challenges using Go were:

  • Hash maps
  • Strings
  • Bit vectors

More coverage of how these data structures work in Go, plus details of how we used them to solve Rosalind probelms, to follow in the next blog post.

Tags:    go    golang    rosalind    computational biology    bioinformatics   

March 2022

How to Read Ulysses

July 2020

Applied Gitflow

September 2019

Mocking AWS in Unit Tests

May 2018

Current Projects