charlesreid1.com blog

Basic Data Structures in Go: Maps

Posted in Rosalind

permalink

Basic Data Structures in Go: Maps

Continuing with our series of blog posts on what we've been learning about Go in the process of solving problems on Rosalind.info, this post will cover how some basic data structures work in Go, and how we used each to solve problems from the Chapter 1 Rosalind problems.

Maps

The simplest way to describe a map is to say it is a structure useful for storing key-value pairs.

Before we walk through what maps look like in Go, let's talk about what a map is (in the data structure sense). And to do that, it's useful to talk a bit about mathematical functions and maps in the mathematical sense.

What is a map

The term "map" is taken from mathematics. A map is just a relationship - a general term, but still useful. Most mathematics courses deal with functions, which are formally defined as maps from one set of numbers onto another, such that one input corresponds to one output.

A (data structure) map, similarly, is a relationship between two sets - a key set, and a value set. Each key corresponds to only one value.

Maps are typically stored under the hood as either a hash map (which does not sort keys and has very fast O(1), or constant time, operations) or a tree map (which sorts keys using a binary tree and has slower O(log N) operations).

Map notation in Go

In Go, map types are denoted map[keyType]valueType.

For example, to map strings to integers, we would use a map of type map[string]int.

We create a map variable by declaring its type:

var m map[string]int

However, this does not allocate any space for the map, and trying to add keys to the map at this point would result in an error.

We need to actually allocate space for the map. In Go, you can allocate space for a map two ways: first, using Go's built-in make() function; and second, by creating and populating the map in one line.

Using make() with maps

To allocate space for a map, you can use the make() function and pass it the map type. This will actually create space in memory for the map, and allow you to add items or look up keys in the map.

var m map[string]int
m = make(map[string]int)

Creating and populating

If you want to create and populate the map in one line, you can specify the type, then have trailing brackets containing the items you want to add:

// Create an empty map
m := map[string]int{}
// Create and populate a map
m := map[string]int{"A":10, "T":15, "G":20, "C":25}

Zero values

One feature of maps that makes them really easy to work with is, if you try and look up a key, and the key does not exist in the map, the map will not raise an exception (which Python does), it will return the zero value of the value type.

For example, the zero value of the int type is 0, so if we create a map like m := map[string]int{"A":10} and we then look up a key that isn't in the map, like m["G"], Go will return 0.

Similarly, the zero value for booleans is false, so you can utilize the zero value behavior to create a set data structure using maps. By creating a map[keyType]bool map, you can use the boolean value to indicate membership of a key in the given set. Then, if you look up keys that do not exist in the map, Go will return the zero value of booleans by default, which will be false.

Easy iterating over maps

It is easy to iterate over maps using the range keyword in Go, which will return the keys (optionally, both keys and values) of the map in a loop:

var m map[string]int{
        "ABC":10, 
        "DEF":20, 
        "GHI":30
}

for k,v := range m {
    fmt.Println("Key:",k," --> Value:",v)
}

Example: Assembling kmer histogram

Here is an example using maps: this function assembles a kmer histogram from a strand of DNA base pairs. To do this, it loops over every codon in the DNA strand and increments a counter in a map. Finally, this histogram map is returned. (This function is useful for determining the most frequent kmer.)

Here's the whole function that helps solve Rosalind problem BA1B. We'll look at it piece by piece below:

// Return the histogram of kmers of length k 
// found in the given input
func KmerHistogram(input string, k int) (map[string]int,error) {

    result := map[string]int{}

    if len(input)<1 {
        err := fmt.Sprintf("Error: input string was not DNA. Only characters ATCG are allowed, you had %s",input)
        return result, errors.New(err)
    }

    // Number of substring overlaps
    overlap := len(input) - k + 1

    // If overlap < 1, we are looking
    // for kmers longer than our input
    if overlap<1 {
        return result,nil
    }

    // Iterate over each position,
    // extract the string,
    // increment the count.
    for i:=0; i<overlap; i++ {
        // Get the kmer of interest
        substr := input[i:i+k]

        // If it doesn't exist, the value is 0
        result[substr] += 1
    }

    return result,nil
}

Function Walkthrough

The first thing you'll notice is the comment style: there is a comment right before each function, which is common practice in Go, because comments that come before a function are picked up by godoc (the Go documentation tool) and turned into documentation.

// Return the histogram of kmers of length k 
// found in the given input
func KmerHistogram(input string, k int) (map[string]int,error) {

Go has internalized the idea that comments are a part of the documentation, so comments don't need to be formatted in any special way (like /// this or /** this */ business) to end up being picked up by godoc.

Next, we create an empty map (kmer strings to integer frequency counters) and stride over the entire input string with a window the size of the kmers we are interested in, adding or incrementing each corresponding counter in the map as we go.

The overlap variable is the number of possible kmers of length k in the entire input string, which requires a bit of algebra to gt the indices right:

    // Number of substring overlaps
    overlap := len(input) - k + 1

The for loop utilizes Go's slice notation to take a slice of the string (which does not require creating or duplicating any string data), and uses the extracted kmer as a key to add or increment the counter:

    // Iterate over each position,
    // extract the string,
    // increment the count.
    for i:=0; i<overlap; i++ {
        // Get the kmer of interest
        substr := input[i:i+k]

        // If it doesn't exist, the value is 0
        result[substr] += 1
    }

This is where the behavior of maps for non-existent keys comes in handy - in Go, if you ask for a key that does not exist in the map, the map will return the zero value of the specified type.

This statement:

        result[substr] += 1

can also be written as:

        result[substr] = result[substr] + 1

So, the first time a kmer (stored in substr) is encountered, the kmer will not exist as a key in the map, and the key lookup on the right hand side will return 0, incrementing the counter to 1 the first time the kmer is encountered.

Subsequent times the kmer is encountered, the value will be found and substituted on the right side so it will be incremented by 1.

Finally, when we return from the function, we can use a Python-like syntax of returning multiple values separated by commas, yet another great feature of Go:

    return result,nil
}

By convention, the return types of functions will include an error type at the very end, so if you had a function named ExampleFunction that returned three integers, the function definition would look like this:

func ExampleFunction() (int, int, int, error) {
    ...    
}

Additionally, we also see from the function's return statement above that we can use the reserved keyword nil to set a variable to a null value, and that the convention is to return nil in place of the error, so that in the calling function we can set up a structure like this:

a, b, c, err := ExampleFunction()

if err != nil {
    err := "Please read this important message"
    return -1,-1,-1,errors.New(err)
}

Summary

Maps are my favorite data structure, so I'm glad that they're easy to use in Go. Some important points about using maps in Go:

  • Declaring a variable as a map type does not allocate any space for the map; saying var m map[keyType]valueType and then trying to access keys will cause an exception.

  • To allcoate space for a map, use make(map[keyType]valueType) or instantiate with the {} notation, like m := map[keyType]valueType{"asdf":1}

  • To ask for a key, use the square bracket notation. To set a value for a key, use the square bracket notation on the left and assign a value on the right: m[my_key] = my_value

  • Asking for missing keys will return the zero value of whatever type the map values are.

  • Iterating over key-value pairs in a map using a for loop is easy using the built-in range keyword.

Addendum: Check if a Key is in a Map

Because of the default behavior of maps, where they return a zero value for keys that do not exist in the map, it is not immediately obvious how to differentiate between the case where a key is in a map already and has a zero value, versus the case where the key does not yet exist in the map and the zero value is only being returned because the key can't be found.

To resolve this, we can use two Go features: error-checking, and the underscore - a variable that serves as a one-way sink for information, and serves a similar purpose to the underscore in Python.

First, we mentioned above that various operations that can return errors will return an error type as the last return type, along with any other return values. This includes the operation of accessing a key. To assign the value and the error to variables at once:

v, err := m[my_key]

Now, to check if the key exists in the map, we are only concerned with the variable err and we don't really need the variable v. Instead of assigning v to a variable that we never use, and then having the Go compiler complain about it, we can use the underscore as a placeholder:

_, err := m[my_key]

Now, we just add a check for whether err is nil, and voila, we have our check of whether a key is in the map or not:

_, err := m[my_key]

if err != nil {
    // This key is missing from the map

} else {
    // This key already exists in the map

}

Tags:    go    golang    rosalind    bioinformatics    maps   

Learning Bioinformatics with Go and Rosalind

Posted in Rosalind

permalink

Learning Go with Rosalind

What is Rosalind?

Rosalind.info is a website with programming challenges, similar in spirit to Project Euler, but with a focus on bioinformatics.

Problems in the bioinformatics track are presented grouped by chapter, with several problems per chapter. The problems are designed like a coding competition, with problems providing structured input files and expecting structured output from each calculation. Each time you solve a problem, a unique input is generated, and you have a time limit in which to run your code to solve the problem.

What is Go?

Go is a programming language that is static, compiled, and concurrent. It is essentially intended as a modern replacement to C and C++, and is designed for more modern hardware, networks, and scale of code projects.

Go is a language that was invented in Google. It provides some very powerful features, and its design for concurrency is the primary feature that motivated me to learn Go. See this Go blog post on "Concurrency is not parallelism" for a more in-depth discussion of Go's concurrency design, and how that is different from (and more general and more powerful than) parallelism.

Initial Impression of Go

My initial impression of Go has been positive thus far.

Syntax

The first thing about Go that I will remark on is the syntax - it is unusual because it reverses the order that most programming languages use for variable and type declarations.

For example, in Java, you declare an integer array like this:

public int[5] arr;

while in Go, you would declare an analogous data structure like this:

var arr [5]int

While this reversal is a bit confusing at first, it is different enough from other languages that it easily sticks in your head - which is actually a welcome change from the stew of slightly-different-but-largely-overlapping syntax occurring across common programming languages.

Go also shares many of the features that make Python such an easy language to use, including lack of semicolons (yay!) and a useful selection of built-in methods.

Godoc

One of the handiest features of Go is godoc, a command-line utility that runs a local web server at http:8080 that serves up Go documentation. This includes documentation for the standard Go library, as well as documentation for any libraries found in the Go PATH.

Environment Variables

Speaking of the Go PATH, one confusing thing about getting started with Go is all of the environment variables that must be set for Go to find various things.

Basically, you need to have two directories for Go: the location of your Go tree, and the directory where you store Go packages and executables.

  • $GOROOT refers to the location of your Go tree. For example, this might be the directory ~/go.

  • $GOPATH refers to the location where your Go packages and executables are stored. For example, this might be ~/work.

Both of these variables should be set in your ~/.profile.

See Go/Installing on the charlesreid1 wiki for coverage of getting set up with Go.

Arrays and Slices

The first concept in Go that threw me off was the concept of arrays versus slices. This was principally due to the poor explanation and choice of examples given on the Go blog post on how slices work.

Arrays are chunks of memory that are set aside to store contiguous slots for data. To create an array, its size must be known in advance, making them less useful in practice. To resize an array, a new array must be created, and the old array copied into the new array.

Slices, on the other hand, can be thought of as tiny data structures containing pointers to array data. Slices can easily be resized by changing the data structure, without changing the underlying data.

There are also built-in make() and copy() functions, to allocate slices that have a specified capacity (or none, in which case the slice has a variable size) and copy data from arrays to slices or from slice to slice.

The confusion around such a fundamental data type makes Go more difficult to learn and to intuit about what's going on. While Go is clearly superior to C and Java in many ways, it also has some unfortunate stumbling blocks in its most basic data structure that many early learners are sure to have trouble with.

Learning a New Language

One of the things that stood out to me while learning Go was how different it was from learning a non-language systems technology (library) like Docker or docker-compose, or Ansible.

When you are learning a second or third programming language, you generally have a mental roadmap of concepts, and how they fit together and build upon each other. Furthermore, you already know, before ever starting to learn the language, what that roadmap of concepts generally looks like, from your prior learnings.

(If you pick up and compare any dozen books on various programming languages, you'll find certain core concepts that are repeated from book to book.)

Compare this with a technology like Ansible, an extremely powerful Python library that is used for automation of IT and compute infrastructure. While extremely powerful, Ansible is also extremely complex, and feels bewildering to learn because it requires all users (including seasoned experts in Unix shell scripting), to re-learn an entirely new way of doing things using Ansible's system of modules and YAML syntax.

Ansible has no conceptual roadmap. The Ansible documentation moves somewhat haphazardly through the many topics that must be covered but that don't fit together. Books that cover Ansible often follow the documentation's organization, or have a similarly confused organization of concepts. There is no denying it's a great piece of software, but it is very difficult to reason about and teach compared to a computer language.

Core Concepts

The first few Rosalind problems from Chapter 1 were an excellent warm-up to get to know Go, since the tasks were relatively easy and the main bit to work out was how to use the data structure correctly, rather than the algorithm.

Three data structures that we ended up utilizing to solve the Chapter 1 challenges using Go were:

  • Hash maps
  • Strings
  • Bit vectors

More coverage of how these data structures work in Go, plus details of how we used them to solve Rosalind probelms, to follow in the next blog post.

Tags:    go    golang    rosalind    bioinformatics   

First Post of the Fall, Part 2: Flaskadillo

Posted in Python

permalink

Flask + ILLO = Flaskadillo

On October 15, 2018, I had the opportunity to offer an in-lab learning opportunity (ILLO) at the Lab for Data Intensive Biology. The ILLO focused on Flask, a useful Python library for creating and running web servers. This library is useful because it has a very low learning curve, but also has the complexity to handle complicated, real-world projects.

As a part of this in-lab learning opportunity, I created repository with five simple Flask examples to highlight five useful capabilities of Flask.

The repository is called flaskadillo and it is available on git.charlesreid1.com or on github.com.

The five capabilities covered by the examples in flaskadillo are listed below:

  1. hello - hello world flask server

  2. api - a simple API server

  3. jinja - a simple Flask server that makes use of Jinja templates

  4. package - a simple demonstration of how to package flask apps

  5. tests - a simple demonstration of how to write Flask tests

Example 1: Hello World

We'll just cover example 1 here, but similar materials are available for all five examples.

Example 1 consists of a simple flask app, simple.py:

from flask import Flask
app = Flask(__name__)

@app.route("/")
def hello():
    return "Hello World!"

The hello directory of the flaskadillo repo covers how to install the necessary packages and run the Flask application.

There is also a unit test, test_simple.py, which demonstrates how to write tests for Flask applications. To run the unit test, run:

pytest

More Information

For instructions on each of the 5 examples, visit each of the 5 directories in the flaskadillo repository.

Why flaskadillo?

Because armadillo.

Why armadillo?

The word armadillo means "little armoured one" in Spanish.

Armadillos are related to anteaters and sloths (all are in the Xenartha superorder).

The Aztecs called them turtle-rabbits.

Tags:    Github    Software    Python    Flask   

May 2018

Current Projects