Building Snakemake Command Line Wrappers for Workflows

Posted Monday 01/21/2019 in Snakemake

NOTE: These ideas are implemented in the repository charlesreid1/2019-snakemake-cli.

Table of Contents

Basic Idea: Wrapping Snakemake API Calls
- 2018-snakemake-cli
- 2019-snakemake-cli
Turning Executables into Packages
End Result: Using bananas
Adding Travis CI Tests
Final Repository
Next Steps

Basic Idea: Wrapping Snakemake API Calls

2018-snakemake-cli

This blog post covers the implementation of an idea that was originally explored in a blog post from Titus Brown, Pydoit, snakemake, and workflows-as-applications.

That blog post implemented a basic command line wrapper around the Snakemake API to demonstrate how a Snakemake workflow could be turned into an executable.

Relevant code is in ctb/2018-snakemake-cli, but the basic idea is to implement a command line utility that takes two orthogonal sets of inputs: a workflow configuration file, and a parameter set.

./run <workflow-config> <workflow-params>

The run script is a Python executable file that parses arguments from the user.

Here is the main entrypoint of run:

#! /usr/bin/env python
"""
Execution script for snakemake workflows.
"""
import argparse
import os.path
import snakemake
import sys
import pprint
import json

thisdir = os.path.abspath(os.path.dirname(__file__))

def main(args):
    # 
    # ...see below...
    #

if __name__ == '__main__':
    parser = argparse.ArgumentParser(description='run snakemake workflows', usage='''run <workflow> <parameters> [<target>]
Run snakemake workflows, using the given workflow name & parameters file.
''')

    parser.add_argument('workflowfile')
    parser.add_argument('paramsfile')
    parser.add_argument('-n', '--dry-run', action='store_true')
    args = parser.parse_args()

    sys.exit(main(args))

The main() method uses the os module to look for the Snakefile, the config file, and the params file, then makes a call to the Snakemake API:

def main(args):
    #
    # ...find the snakefile...
    # ...find the config file...
    # ...find the params file...
    # 

    target = workflow_info['workflow_target']
    config = dict()

    print('--------')
    print('details!')
    print('\tsnakefile: {}'.format(snakefile))
    print('\tconfig: {}'.format(workflowfile))
    print('\tparams: {}'.format(paramsfile))
    print('\ttarget: {}'.format(target))
    print('--------')

    # run!!
    status = snakemake.snakemake(snakefile, 
                                 configfile=paramsfile,
                                 targets=[target], 
                                 printshellcmds=True,
                                 dryrun=args.dry_run, 
                                 config=config)

    if status: # translate "success" into shell exit code of 0
       return 0
    return 1

This call uses the provided parameters file to set the Snakemake configuration dictionary, but this can be overridden with the config dictionary. Additional argparser flags can be added, and the config dictionary contents modified based on the flags.

2019-snakemake-cli

We wanted to take this demo a step further, and add a few things to it:

Bundle the Snakefile and command line utility as an installable Python package with a setup.py
Implement Travis CI tests of the Snakemake workflow.

We implemented a bundled Snakemake workflow as a command line tool called bananas.

Turning Executables into Packages

We began with an executable script run and wished to turn it into an installable command line utility called bananas.

To do this, we moved the contents of run into a new file command.py in a new Python module called cli:

cli/
├── Snakefile
├── __init__.py
└── command.py

The Snakefile will contain the workflow. Here is the very simple workflow from ctb/2018-snakemake-cli. The named rules are specified by the workflow configuration file, while the parameters in {} are provided through the parameters file (or via command line flags).

cli/Snakefile:

name = config['name']

rule rulename1:
     input:
        "hello.txt"

rule target1:
     output:
        "hello.txt"
     shell:
        "echo hello {name} > {output}"

rule target2:
     output:
        "goodbye.txt"
     shell:
        "echo goodbye {name} > {output}"

NOTE: In this case we are bundling the Snakefile with the command line wrapper, and writing the command line wrapper to expect the Snakefile to be in the package. But we can modify the command line wrapper function (below) to look for the Snakefile in a local directory, allowing the user to provide Snakefiles and workflows to the command line wrapper.

The __init__.py file sets two important parameters: the name of the command line utility, and the version number:

cli/__init__.py:

_program = "bananas"
__version__ = "0.1.0"

The contents of command.py are similar to run and basically control how the command line utility runs:

cli/command.py:

"""
Command line interface driver for snakemake workflows
"""
import argparse
import os.path
import snakemake
import sys
import pprint
import json

from . import _program


thisdir = os.path.abspath(os.path.dirname(__file__))
parentdir = os.path.join(thisdir,'..')
cwd = os.getcwd()

def main(sysargs = sys.argv[1:]):

    parser = argparse.ArgumentParser(prog = _program, description='bananas: run snakemake workflows', usage='''bananas <workflow> <parameters> [<target>]

bananas: run snakemake workflows, using the given workflow name & parameters file.

''')

    parser.add_argument('workflowfile')
    parser.add_argument('paramsfile')
    parser.add_argument('-n', '--dry-run', action='store_true')
    parser.add_argument('-f', '--force', action='store_true')
    args = parser.parse_args(sysargs)

    # ...find the Snakefile...
    # ...find the config file...
    # ...find the params file...

    target = workflow_info['workflow_target']
    config = dict()

    print('--------')
    print('details!')
    print('\tsnakefile: {}'.format(snakefile))
    print('\tconfig: {}'.format(workflowfile))
    print('\tparams: {}'.format(paramsfile))
    print('\ttarget: {}'.format(target))
    print('--------')

    # run bananas!!
    status = snakemake.snakemake(snakefile, configfile=paramsfile,
                                 targets=[target], printshellcmds=True,
                                 dryrun=args.dry_run, forceall=args.force,
                                 config=config)

    if status: # translate "success" into shell exit code of 0
       return 0
    return 1


if __name__ == '__main__':
    main()

The last component here is to make the function in cli/command.py the entrypoint of a command line utility called bananas, which can be done via setup.py. This will put the executable bananas in the Python binaries folder when the package is installed.

setup.py:

from setuptools import setup, find_packages
import glob
import os

with open('requirements.txt') as f:
    required = [x for x in f.read().splitlines() if not x.startswith("#")]

# Note: the _program variable is set in __init__.py.
# it determines the name of the package/final command line tool.
from cli import __version__, _program

setup(name='bananas',
      version=__version__,
      packages=['cli'],
      test_suite='pytest.collector',
      tests_require=['pytest'],
      description='bananas command line interface',
      url='https://charlesreid1.github.io/2019-snakemake-cli',
      author='@charlesreid1',
      author_email='cmreid@ucdavis.edu',
      license='MIT',
      entry_points="""
      [console_scripts]
      {program} = cli.command:main
      """.format(program = _program),
      install_requires=required,
      include_package_data=True,
      keywords=[],
      zip_safe=False)

First, we grab the variables from __init__.py:

from cli import __version__, _program

Next we specify where our package lives, the cli directory:

setup(name='bananas',
        ...
        packages=['cli'],

and finally, we specify that we want to build a command line interface, with the entrypoint being the main() method of the cli/command.py file using entry_points:

setup(name='bananas',
        ...
        entry_points="""
[console_scripts]
{program} = cli.command:main
      """.format(program = _program),

End Result: Using bananas

The end result is a command line utility that bundles a Snakemake workflow. The repository contains some tests, so let's run through the quick start installation and run the tests.

Quick Start: Installing

Start by setting up a virtual environment:

virtualenv vp
source vp/bin/activate

Install required components, then install the package:

pip install -r requirements.txt
python setup.py build install

Now you should see bananas on your path:

which bananas

Quick Start: Running Tests

pytest

Quick Start: Running Examples

Change to the test/ directory and run tests with the example config and param files.

cd test

Run the hello workflow with Amy params:

rm -f hello.txt
bananas workflow-hello params-amy

Run the hello workflow with Beth params:

rm -f hello.txt
bananas workflow-hello params-beth

Run the goodbye workflow with Beth params:

rm -f goodbye.txt
bananas workflow-goodbye params-beth

Adding Travis CI Tests

To test or workflow, we break down the necessary tasks:

Use a Python environment
Install our requirements (snakemake)
Install bananas with setup.py
Run pytest

This is an easy Travis file to write, following the Travis docs.

.travis.yml:

language: python
python:
  - "3.5"
  - "3.6"
  #- "3.7-dev" # fails due to datrie build failure (snakemake dependency)

# command to install dependencies
install:
  - pip install -r requirements.txt
  - python setup.py build install

# command to run tests
script:
  - pytest

Final Repository

All of the code for this repository is in charlesreid1/2019-snakemake-cli.

See the v2.0 tag in case there are changes to the code that are not reflected in this blog post.

Next Steps

This demo provides a starting point for creating executable Snakemake workflows that are installable.

A few open question and directions:

Bundling the Snakefile vs. user-provided Snakefles
- There is obviously more utility and flexibility in letting the user provide Snakefiles.
- User-provided Snakefiles provide more ways for workflows to go wrong.
- Testing is either more difficult, or shifted to the workflow author.
- Bundled Snakefiles take the burden of writing the workflow off of the user, so they can focus on param/config files.
Kubernetes
- Can we make the command line wrapper work with a Kubernetes cluster?
- See charlesreid1/2019-snakemake-byok8s for proof of concept.
Applications
- How can we apply this concept?
- spacegraphcats
- eelpond
- dahak and dahak-taco

Tags: python bioinformatics workflows pipelines snakemake travis

Recursive Backtracking in Go for Bioinformatics Applications: 3. Go Implementation of Backtracking

Posted Thursday 01/03/2019 in Computational Biology

permalink

This is the third in a series of three blog posts describing our solution to a bioinformatics problem from Rosalind.info, Problem BA1(i) (Find most frequent words with mismatches in a string). To solve this problem and generate variations of a DNA string as required, we implemented a recursive backtracking method in the Go programming language.

Table of Contents

Problem Description
Recursive Backtracking Pseudocode
Recursive Backtracking: Go Implementation
- Visit Hamming Neighbors Function
  - Public, Non-Recursive Function
  - Private, Recursive Function
- Assemble Visit Variation Function
Tests
Final Code
Go Forth and Be Fruitful

Problem Description

The task at hand is to take a given input strand of DNA, and generate variations from it that have up to \(d\) differences (a Hamming distance of \(d\)) in the codons (base pairs).

In part 1 of this series, we walk through the construction of an analytical formula to count the number of variations of a given DNA string that can be generated, given the constraints of the problem.

In part 2 of this series, we cover several techniques to generate variations on a DNA string, and present pseudocode for the recursive backtracking method that we use here.

In part 3 of this series, we will cover our implementation of the recursive backtracking method in the Go programming language.

Recursive Backtracking Pseudocode

To review from the prior post, our pseudocode for recursive backtracking to explore variations or combinations looks like the following:

explore method:
    base case:
        visit this solution
    recursive case:
        for each available choice:
            make a choice
            explore outcomes
            unmake the choice
            move on to the next choice

The key elements there are the base and recursive cases, and the mechanism of iterating over each possible choice and making/exploring/unmaking the choice.

Recursive Backtracking: Go Implementation

In total, we have three different methods to accomplish this task:

VisitHammingNeighbors(input,d): this is the public method that the user calls to generate a string array of all strings that are a Hamming distance of up to d from the input string input. This public method performs parameter and error checking, initializes space for data, and collects results.
visitHammingNeighbors_recursive(base_kmer, depth, choices, results_map): this method is the private recursive method available only to the package. This method performs the actual recursive work.

NOTE: the function name starts with a lower case letter, so it is not exported by the package - i.e., it is not available to the user when they import this package.

The base case of the visitHammingNeighbors_recursive() function will pass the final set of choices to the final step:

assemble_variations(base_kmer, choices, results_map): this method (private to the package) is a recursive method that uses the chosen indices and

Visit Hamming Neighbors Function

The function call to visit all Hamming neighbors and add them to the results set is split into two parts: a non-recursive public function, which provides a public wrapper that is user-friendly and performs error-checking on the parameters provided, and a recursive private function that is used internally but not intended to be called by users directly.

Public, Non-Recursive Function

Here is the entry point function that the user calls when they wish to generate all variations on a given string of DNA, and have the variations returned as a string slice.

// Given an input string of DNA, generate variations
// of said string that are a Hamming distance of
// less than or equal to d.
func VisitHammingNeighbors(input string, d int) (map[string]bool, error) {

    // a.k.a. visit_kmer_neighbors

    // number of codons
    n_codons := 4

    // Use combinatorics to calculate the total
    // number of variation.
    buffsize, _ := CountHammingNeighbors(len(input), d, n_codons)

The call to CountHammingNeighbors() uses the counting formula from Part 1 to predict the number of variations. If the user has selected an astronomical problem size, the program warns the user.

    // This blows up quickly, so warn the user
    // if their problem is too big
    MAX := int(1e6)
    if buffsize > MAX {
        msg := fmt.Sprintf("Error: you are generating over MAX = %d permutations, you probably don't want to do this.", d)
        return nil, errors.New(msg)
    }

Now the actual recursive backtracking algorithm begins. The code loops over every possible value of Hamming distance \(d\) and calls the recursive method at each value of \(d\).

    // Store the final results in a set (string->bool map)
    results := make(map[string]bool)

    // Begin backtracking algorithm
    for dd := 0; dd <= d; dd++ {

        // The choices array will change with each recursive call.
        // Go passes all arguments by copy, which is good for us.
        choices := []int{}

        // Populate list of neighbors
        visitHammingNeighbors_recursive(input, dd, choices, results)

    }

We don't assign any results from the call to visitHammingNeighbors_recursive() because we pass in a data structure (actually a pointer to a data structure), results, that is modified in-place.

Thus, when we complete a call to visitHammingNeighbors_recursive(), results will contain all variations already.

    // Check if we have the right number of results
    if len(results) != buffsize {
        fmt.Printf("WARNING: number of results (%d) did not match expected value (%d)\n", len(results), buffsize)
    }

    return results
}

Private, Recursive Function

In the above function, the call to the recursive function to visit all Hamming neighbors happens here:

        // Populate list of neighbors
        visitHammingNeighbors_recursive(input, dd, choices, results)

The user passes the original kmer input, along with the Hamming distance parameter dd, the list of choices of indices that have already been selected choices, and the data structure storing all resulting strings results.

As with the pseudocode, we have a base case and a recursive case. The recursive function is being called repeatedly until it reaches a depth of 0, with the depth parameter being decremented each call.

// Recursive function: given an input string of DNA,
// generate Hamming neighbors that are a Hamming
// distance of exactly d. Populate the neighbors
// array with the resulting neighbors.
func visitHammingNeighbors_recursive(base_kmer string, depth int, choices []int, results map[string]bool) error {

    if depth == 0 {

        // Base case

    } else {

        // Recursive case

    }
}

The base case occurs when we reach a depth of 0 and have no further choices to make. We reach this base case for each binary number with \(d\) digits set to 1; once the base case is reached, we call the assemble_variations() function to substitute all possible codons at the selected indices.

func visitHammingNeighbors_recursive(base_kmer string, depth int, choices []int, results map[string]bool) error {

    if depth == 0 {

        // Base case
        assemble_variations(base_kmer, choices, results)
        return nil

The recursive case is slightly more complicated, but it follows the same backtracking pseudocode covered previously: from a set of possible choices, try each choice, recursively call this function, then unmake the choice and move on to the next choice.

Here, the choice is which index c in the kmer to modify. Each kmer can only be modified once, so we have a for loop to check if the index c is in the list of choices already made.

    } else {

        // Recursive case
        for c := 0; c <= len(base_kmer); c++ {

            var indexAlreadyTaken bool
            for _, choice := range choices {
                if c == choice {
                    indexAlreadyTaken = true
                }
            }

As before, the recursive call to this function will not return any values that need to be stored, since results points to a data structure (map) that is modified in-place.

            if !indexAlreadyTaken {

                // This will make a new copy of choices
                // for each recursive function call
                choices2 := append(choices, c)
                err := visitHammingNeighbors_recursive(base_kmer, depth-1, choices2, results)
                if err != nil {
                    return err
                }

            }
        }

    }

    return nil
}

Assemble Visit Variation Function

Once we've generated each list of indices to modify, we call a second recursive function to substitute each codon into each index.

In the recursive method above, each recursive function call added a new choice to choices; in this recursive function, each recursive funcction call pops a choice from choices. Thus, the base case is when choices is empty.

Here are the base and recursive cases:

// Given a base kmer and a choice of indices where
// the kmer should be changed, generate all possible
// variations on this base_kmer.
func assemble_variations(base_kmer string, choices []int, results map[string]bool) {

    if len(choices) > 0 {

        // Recursive case
        ...

    } else {

        // Base case
        ...

    }
}

The recursive case pops a choice from choices, finds which nucleotide (AGCT) is at that location, and assembles the list of possible choices (the other 3 nucleotide values). It then performs the recursive backtracking algorithm, choosing from each of the three possible nucleotide values, exploring the choice by making a recursive call, then un-making the choice.

func assemble_variations(base_kmer string, choices []int, results map[string]bool) {

    if len(choices) > 0 {

        // Recursive case

        all_codons := []string{"A", "T", "G", "C"}

        // Pop the next choice
        // https://github.com/golang/go/wiki/SliceTricks
        ch_ix, choices := choices[0], choices[1:]

        // Get the value of the codon at that location
        if ch_ix < len(base_kmer) {
            // slice of string is bytes,
            // so convert back to string
            this_codon := string(base_kmer[ch_ix])
            for _, codon := range all_codons {

                if codon != this_codon {
                    // Swap out the old codon with the new codon
                    new_kmer := base_kmer[0:ch_ix] + codon + base_kmer[ch_ix+1:]
                    assemble_variations(new_kmer, choices, results)
                }
            }
        }

    } else {

        // Base case
        results[base_kmer] = true

    }
}

Tests

The last step after some debugging was to write tests for the function to generate all variations of a DNA string, to ensure the recursive backtracking functions work correctly.

The pattern we use is to create a struct containing test parameters, then create a test matrix by initializing instances of the parameter struct with the parameters we want to test.

Here is how we set up the tests:

func TestMatrixVisitHammingNeighbors(t *testing.T) {
    var tests = []struct {
        input string
        d     int
        gold  []string
    }{
        {"AAA", 1,
            []string{"AAC", "AAT", "AAG", "AAA", "CAA", "GAA", "TAA", "ATA", "ACA", "AGA"},
        },
    }
    for _, test := range tests {

        ...

    }
}

Each test case should generate all Hamming neighbors, and compare to the list of Hamming neighbors provided. This requires two tricks:

sort before comparing, to ensure a proper comparison
use a custom EqualStringSlices() function that will iterate through two string slices element-wise to check if they are equal.

The EqualStringSlices() function is required because Go does not have built-in equality checks for slices.

Here is what the tests look like:

    for _, test := range tests {

        // Money shot
        result, err := VisitHammingNeighbors(test.input, test.d)

        // Check if there was error
        if err != nil {
            msg := fmt.Sprintf("Error: %v", err)
            t.Error(msg)
        }

        // Sort before comparing
        sort.Strings(test.gold)
        sort.Strings(result)

        if !EqualStringSlices(result, test.gold) {
            msg := fmt.Sprintf("Error testing VisitHammingNeighbors():\ncomputed = %v\ngold     = %v",
                result, test.gold)
            t.Error(msg)
        }
    }

Final Code

The final version of the recursive function to visit all Hamming neighbors and return them in a string array can be found in the go-rosalind library on Github.

Specifically, in the file rosalind_ba1.go, there is a VisitHammingNeighbors() function that is the public function that calls the private recursive function visitHammingNeighbors_recursive(), and the recursive function to swap out codons is in the visit() funciont.

Go Forth and Be Fruitful

Now that you have the basic tools to imlement a recursive backtracking algorithm in Go to generate string variations, you have one of the key ingredients to solve Rosalind.info problem BA1i, "Find Most Frequent Words with Mismatches by String".

This problem is tricky principally because it requires generating every DNA string variation, so now you should have the key ingredient to solve BA1i (and several problems that follow).

You can use the final version of the methods we covered by importing the go-rosalind library in your Go code (link to go-rosalind documentation on godoc.org) or you can implement your own version of these algorithms. The Go code we covered in this post is also on Github in the charlesreid1/go-rosalind repository.

Tags: go golang rosalind computational biology bioinformatics recursion backtracking strings combinatorics

Recursive Backtracking in Go for Bioinformatics Applications: 2. Generating Variations

Posted Friday 12/28/2018 in Computational Biology

permalink

This is the second in a series of three blog posts describing our solution to a bioinformatics problem from Rosalind.info, Problem BA1(i) (Find most frequent words with mismatches in a string). To solve this problem and generate variations of a DNA string as required, we implemented a recursive backtracking method in the Go programming language.

Part 1: Counting Variations
Part 2: Generating Variations (you are here)
Part 3: Go Implementation of Recursive Backtracking

Problem Description
Permutations vs Combinations vs Variations
Recursion
- Recursive Backtracking Pseudocode
Appying to DNA Variations
- Generating Visits with Binary Numbers
- Assembling the Variation

Problem Description

The task at hand is to take a given input strand of DNA, and generate variations from it that have up to \(d\) differences (a Hamming distance of \(d\)) in the codons (base pairs).

In part 2 of this series, we cover several techniques to generate variations on a DNA string, and present pseudocode for the recursive backtracking method that we use here.

In part 3 of this series, we will cover our implementation of the recursive backtracking method in the Go programming language.

Permutations vs Combinations vs Variations

Before covering generation of variations of a DNA string, we should cover some terminology for clarification.

If we were to use the term permutations, as in, we are counting (or generating) permutations of the input DNA string, that would imply that we were doing some kind of rearrangement of the elements of the input DNA string (for example, swapping two codons). This is not the problem that we are solving, and requires different formulas. (See Permutations entry on Wolfram MathWorld.)

The variations that we are referring to are not exactly combinations, either, though. If we were to use the term combinations, it would imply that we were choosing a set of \(k\) integers from a set of \(d\) integers \({1, 2, \dots, d}\).

The variations that we are counting are similar to combinations, but with the additional act of swapping out each codon at the position (integer) selected with three other possible codons, so there are more variations than combinations (and many more permutations than variations).

Transforming the Problem Space

A surprisingly large variety of problems in combinatorics can be transformed into an equivalent problem involving binary numbers, which are usually easier to think about.

To generate variations, we can break up the process of producing a variation into two steps, or choices, and then convert these choices (and the process of making them) into an equivalent problem in terms of binary numbers.

We can decompose the cration of a DNA string variation into the first step of choosing which codons (indices) to edit, and the second step of cycling through every possible codon (ATGC) at the selected indices.

To translate this into an equivalent binary number problem, consider the input string of DNA "AAAAA" and let the Hamming distance that we are considering be \(d = 1\). Then we can code each index with a 0 (not chosen) or a 1 (chosen) and turn the problem into cycling throgh all binary numbers with 1 bit:

The second step is to cycle through each alternate codon at the given position, so that 00001 would generate the variations:

AAAAC
AAAAG
AAAAT

and so on.

We saw this two-part technique already when counting the total number of variations that could be created in Part 1: Counting Variations. It resulted in a counting formula with two terms, a binomial term for step 1 and an exponential term for step 2.

We can think of the problem as forming a tree with several decision nodes that need to be explored; this type of problem structure is ideal for a recursive backtracking algorithm.

We will cover the use of recursive backtracking to actually explore the entire tree of possible outcomes (not just count it), starting with some review and background on recursive backtracking and how it works.

Recursion

Recursion is a common pattern to use for problems that require exploring a large problem space that requires us to make several selections.

A recursive backtracking algorithm is analogous to exploring a maze but laying out a rope as you go, so tht you can revisit each possible route. In this case, we are using backtracking to make the choice of which indices of the input DNA string to modify. We want to explore all possible choices to generate all possible variations of the input DNA string, and backtracking gives us the framework to do that.

For example, if we wanted to recursively generate codon choices for the case of an input DNA string like "AAAAA" and \(d = 2\), we would call a recursive method twice; the first time through, we would choose one of the five indices, and mark it as picked; then we would call the method again, and choose a second index (different from the first) and mark it as picked.

When unrolled, this is equivalent to a nested for loop,

for i in range( 0 .. len(dna_string) ):
    for j in range( 0 .. len(dna_string) ):
        if (i != j):
            Start with the binary number 00000
            Set the digit at index i to 1
            Set the digit at index j to 1

Recursive Backtracking Pseudocode

Basic pseudocode for a backtracking method:

explore method:
    base case:
        visit this solution
    recursive case:
        for each available choice:
            make a choice
            explore outcomes
            unmake the choice
            move on to the next choice

Applying to DNA Variations

There are actually two places where we need to apply backtracking to our problem.

Generating Visits with Binary Numbers

The first application of recursive backtracking is to carry out step 1, choosing indices in the original DNA string to modify or cycle through altnerate codons. We showed above how generating variations on a kmer of length \(k\) at a distance \(d\) from the original kmer was equivalent to generating binary numbers with \(d\) bits set to 1.

We can use recursive backtracking to generate these numbers. By creating a method that recursively selects an index to switch to 1, and passing that (and all prior choices) on to further recursive calls, the function can recurse to a given depth \(d\) and visit all possible binary numbers with \(d\) bits set to 1.

The base case of this recursive method would be reached when all \(d\) choices had been made and \(d\) bits were set to 1. Then the choice of indices to swap out with alternate codons would be passed on to a recursive method that would carry out Step 2 (see below).

For example, to generate variations of the 5-mer AAAAA, we would start by selecting a Hamming distance \(d\), then generate a binary number with \(d\) bits set to 1 to select indices to modify. Suppose \(d = 2\); then the first few binary numbers are:

AAAAA
11000
10100
10010
10001
...

To expand on the pseudocode a bit more, to generate a binary number with \(d\) bits flipped to 1 we will want to call a recursive method with a depth of \(d\), making a choice at each recursive call of which index to set to 1 next.

The \(n^{th}\) recursive call picks the \(n^{th}\) index for 1. Each index can only be chosen once in the stack of recursive calls, and the indices that have been chosen by prior recursive function calls are passed along.

Thus we need a minimum of two parameters: an integer indicating the depth level of this recursive function call, and an integer array of index choices.

function generate_binary_numbers( depth, choices[], ... ):

    if depth is 0,
        base case
        no more choices left to make
        choices[] is full
        pass along choices[] to assemble the variations

    else,
        recursive case
        for each possible index,
            if this index is not already in choices,
                add this index to choices
                generate_binary_numbers( depth+1, choices[] )
                remove this index from choices

Assembling the Variation

Each binary number is then turned into variations by substituting every combination of 3 codons in every position with a 1 possible, so the first binary number for \(d=2\) would generate the variations:

AAAAA
11000
-----
CCAAA
GCAAA
TCAAA
CGAAA
GGAAA
TGAAA
CTAAA
GTAAA
TTAAA

This would be repeated for all Hamming distances up to the maximum specified Hamming distance.

Like the generation of binary numbers, the substitution of all possible combinations of codons at these positions is a task conducive to a recursive backtracking algorithm.

Like the prior task's recursive method, this task's recursive method will have one parameter for depth (number of choices left to make) and a range of choices to try (codons).

function assemble_variations( depth, choices[], ... ):

    if depth is 0,
        base case
        no more choices left to make
        choices[] is full
        pass along choices[] to assemble the variations

    else,
        recursive case
        for each possible index,
            if this index is not already in choices,
                add this index to choices
                generate_binary_numbers( depth+1, choices[] )
                remove this index from choices

In the final part, Part 3, of this blog post, we will cover the actual Go implementation of these functions.

Tags: go golang rosalind bioinformatics recursion backtracking strings combinatorics

← Newer Older →

charlesreid1.com blog

Basic Idea: Wrapping Snakemake API Calls

2018-snakemake-cli

2019-snakemake-cli

Turning Executables into Packages

End Result: Using bananas

Quick Start: Installing

Quick Start: Running Tests

Quick Start: Running Examples

Adding Travis CI Tests

Final Repository

Next Steps

Problem Description

Recursive Backtracking Pseudocode

Recursive Backtracking: Go Implementation

Visit Hamming Neighbors Function

Public, Non-Recursive Function

Private, Recursive Function

Assemble Visit Variation Function

Tests

Final Code

Go Forth and Be Fruitful

Table of Contents

Problem Description

Permutations vs Combinations vs Variations

Transforming the Problem Space

Recursion

Recursive Backtracking Pseudocode

Applying to DNA Variations

Generating Visits with Binary Numbers

Assembling the Variation

March 2022

February 2022

January 2022

December 2021

August 2020

July 2020

April 2020

March 2020

February 2020

December 2019

November 2019

October 2019

September 2019

May 2019

April 2019

March 2019

February 2019

January 2019

December 2018

October 2018

May 2018

March 2018

February 2018

January 2018

September 2017

August 2017

July 2017

June 2017

May 2017

April 2017

March 2017