charlesreid1.com blog

Deconvoluting Convolutional Neural Networks

Posted in Machine Learning


Introduction: A Simple CNN Example

As part of our weekly Deep Learning for Genomics reading group here in the Lab for Data Intensive Biology (DIB Lab), we are applying convolutional neural networks (deep learning) to various problems in genomics and biology.

For the most recent meeting, we prepared some notes on how convolutional neural networks work. The notes are in the form of a Jupyter notebook. This blog post summarizes some of the important conclusions from the notebook and links to relevant sections in the notebook.

In the notebook covered in this blog post, we set up a simple convolutional neural network from an example on the keras blog. This example is used to classify input images as being either a cat or a dog.

All materials covered in this blog post are in the charlesreid1/deconvoluting-convolutions repository on Github.

Exploring the Data

TL;DR: When developing a deep learning model for a problem, it is important to start by exploring the data and understanding it thoroughly.

Link to "Image Data" section of notebook

Create CNN

TL;DR: Our convolutional neural network consists of the following architecture:

  • Convolutional Stage #1
    • Convolution (3 x 3 kernel, 32 filters)
    • Activation (ReLU)
    • Max Pooling (2x2)
  • Convolutional Stage #2
    • Convolution (3 x 3 kernel, 32 filters)
    • Activation (ReLU)
    • Max Pooling (2x2)
  • Convolutional Stage #3
    • Convolution (3 x 3 kernel, 64 filters)
    • Activation (ReLU)
    • Max Pooling (2x2)
  • Flatten
  • Dense (64 nodes)
  • Activation (ReLU)
  • Dropout (0.5)
  • Dense (1 node)
  • Activation (Sigmoid)
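
Assembled in Keras, the stack above looks something like the following (a sketch modeled on the keras blog example this post follows; the `tensorflow.keras` import path and the 150 x 150 input size are assumptions):

```python
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import (Activation, Conv2D, Dense,
                                     Dropout, Flatten, MaxPooling2D)

model = Sequential()

# Convolutional Stage #1: 32 filters, 3 x 3 kernel
model.add(Conv2D(32, (3, 3), input_shape=(150, 150, 3)))
model.add(Activation('relu'))
model.add(MaxPooling2D(pool_size=(2, 2)))

# Convolutional Stage #2: 32 filters, 3 x 3 kernel
model.add(Conv2D(32, (3, 3)))
model.add(Activation('relu'))
model.add(MaxPooling2D(pool_size=(2, 2)))

# Convolutional Stage #3: 64 filters, 3 x 3 kernel
model.add(Conv2D(64, (3, 3)))
model.add(Activation('relu'))
model.add(MaxPooling2D(pool_size=(2, 2)))

# Flatten, then dense layers ending in a single binary output
model.add(Flatten())
model.add(Dense(64))
model.add(Activation('relu'))
model.add(Dropout(0.5))
model.add(Dense(1))
model.add(Activation('sigmoid'))
```

Calling `model.summary()` prints the per-layer output shapes discussed in the sections below.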

Link to "Create Convolutional Neural Network" section of notebook

Analyzing Network Architecture and Tensor Shapes

TL;DR: Each step of the neural network transforms an input tensor of a given shape into an output tensor of a (potentially different) shape.

In this section of the notebook, we step through each of the neural network's layers to explain how the size of each layer's inputs and outputs are determined.

Link to "Network Architecture/Shapes" section of notebook

Input Image Layer

TL;DR: The size of the cat and dog images is 150 x 150 pixels. Each image is a color image, so it consists of 3 channels. Therefore, the input to the very first layer has a shape of

$$ (\mbox{None}, w_0, h_0, c_0) = (\mbox{None}, 150, 150, 3) $$

(where "None" indicates a variable-size dimension that is equal to the number of total input images, or alternatively, the number of images per batch, if we are using batch learning).

Link to "Input Image Layer" section of notebook

First Convolution Layer

TL;DR: A convolutional layer with a kernel size of \(k_1 \times k_1\) and a number of filters \(c_1\) will transform the shape of the input image to:

$$ (\mbox{None}, w_1, h_1, c_1) = (\mbox{None}, 148, 148, 32) $$

where

$$ w_1 = w_0 - k_1 + 1 \\ h_1 = h_0 - k_1 + 1 $$

Importantly, the contributions from each of the three input channels are summed when computing each output filter - the number of input channels does not affect the number of output channels.

The total number of output channels is equal to the number of filters in the convolution layer.
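
As a quick sanity check, the output-size formula can be expressed as a one-line helper (a hypothetical utility, not from the notebook):

```python
def conv_output_size(w, k):
    """Output width/height of a 'valid' convolution: w - k + 1."""
    return w - k + 1

# First convolutional layer: 150 x 150 input, 3 x 3 kernel
assert conv_output_size(150, 3) == 148
```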

Link to "First Convolutional Layer" section of notebook

First Activation Layer

TL;DR: The activation layer is a straightforward one-to-one mapping - each individual value from the output of the convolution layer is fed through the rectified linear unit (ReLU) function and the resulting output value becomes the input to the next layer. The ReLU function is given by:

$$ \mbox{ReLU}(x) = \max(0,x) $$

The activation layer does not change the shape of the input tensor.

Link to "First Activation Layer" section of notebook

First MaxPooling Layer

TL;DR: The max pooling layer is a way of making the final convolutional filters (the "feature-detectors" of the convolutional neural network) less sensitive to the exact placement of features. The pooling layer only affects the size of the filter, not the number of channels.

If we use a max pooling window of \(p_1 \times p_1\), we will reduce the image size to \(\mbox{ceil}(w_1/p_1) \times \mbox{ceil}(h_1/p_1)\). This reduces the input tensor shape to:

$$ (\mbox{None}, \mbox{ceil}(w_1/p_1), \mbox{ceil}(h_1/p_1), c_1) = (\mbox{None}, 74, 74, 32) $$
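
The same sanity check works for pooling (again a hypothetical helper; note the ceiling convention used in this post):

```python
import math

def pool_output_size(w, p):
    """Output width/height after a p x p max pooling window."""
    return math.ceil(w / p)

# First max pooling layer: 148 x 148 input, 2 x 2 window
assert pool_output_size(148, 2) == 74
```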

Link to "First Max Pooling Layer" section of notebook

Second Convolution Layer

TL;DR: The second convolutional layer has a kernel size of \(k_2 \times k_2\) and a number of filters \(c_2\), which will transform the shape of the input image in the same way as described for the first convolutional layer.

Note that, just as the number of channels (3) in each input to the first convolutional layer did not affect the number of output channels from that layer (which was fixed by the number of filters specified), the number of input channels to the second convolutional layer does not affect the number of output channels from the second convolutional layer.

The final shape coming out of the second convolutional layer is:

$$ (\mbox{None}, w_2, h_2, c_2) = (\mbox{None}, 72, 72, 32) $$

where

$$ w_2 = w_1 - k_2 + 1 \\ h_2 = h_1 - k_2 + 1 \\ $$

Link to "Second Convolutional Layer" section of notebook

Second Activation Layer

TL;DR: The activation layer again uses a function to map input values to output values in a one-to-one mapping, so the activation layer does not change the shape of the input tensor.

Link to "Second Activation Layer" section of notebook

Second MaxPooling Layer

TL;DR: The second max pooling layer uses a pooling window of size \(p_2 \times p_2\). This will reduce the input size to \(\mbox{ceil}(w_2/p_2) \times \mbox{ceil}(h_2/p_2)\). This reduces the input tensor shape to:

$$ (\mbox{None}, \mbox{ceil}(w_2/p_2), \mbox{ceil}(h_2/p_2), c_2) = (\mbox{None}, 36, 36, 32) $$

Link to "Second Max Pooling Layer" section of notebook

Third Convolution Layer

TL;DR: The third convolution layer with a kernel size of \(k_3 \times k_3\) and \(c_3\) output filters will transform the input tensor shape in the following way (note that the third convolutional layer has 64 filters, not 32):

$$ (\mbox{None}, w_3, h_3, c_3) = (\mbox{None}, 34, 34, 64) $$

where

$$ w_3 = w_2 - k_3 + 1 \\ h_3 = h_2 - k_3 + 1 $$

Link to "Third Convolutional Layer" section of notebook

Third Activation Layer

TL;DR: The activation layer again uses a function to map input values to output values in a one-to-one mapping, so the activation layer does not change the shape of the input tensor.

Link to "Third Activation Layer" section of notebook

Third MaxPooling Layer

TL;DR: The third max pooling layer uses a pooling window of size \(p_3 \times p_3\). This will reduce the input size to \(\mbox{ceil}(w_3/p_3) \times \mbox{ceil}(h_3/p_3)\). This reduces the input tensor shape to:

$$ (\mbox{None}, \mbox{ceil}(w_3/p_3), \mbox{ceil}(h_3/p_3), c_3) = (\mbox{None}, 17, 17, 64) $$

Link to "Third Max Pooling Layer" section of notebook

Flatten and Dense Layers

TL;DR: The flatten layer converts a tensor of dimension \((\mbox{None}, 17, 17, 64)\) into a 1D vector of \(17 \times 17 \times 64 = 18,496\) neural network nodes. This does not change any of the values; it simply reshapes the input tensor.
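
Chaining the convolution and pooling size formulas through all three stages reproduces the flattened node count (a sketch; the kernel and window sizes are those listed in the architecture above):

```python
import math

w = 150                      # input width (and height)
for k, p in [(3, 2), (3, 2), (3, 2)]:
    w = w - k + 1            # convolution: w - k + 1
    w = math.ceil(w / p)     # max pooling: ceil(w / p)

# final feature maps are 17 x 17 with 64 channels
assert w == 17
assert w * w * 64 == 18496
```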

The first dense layer reduces the flattened \(18,496\) nodes to \(64\) nodes, using a fully connected layer of nodes. These values are then passed through an activation function (as with the above activation layers, this is a one-to-one mapping and does not change the shape of the input tensor). The dense layer is followed by a dropout layer to help prevent overfitting; this pattern is common in convolutional neural networks.

The second dense layer further reduces the \(64\) nodes to a single node, whose output will determine whether the input image is a cat or a dog.

Link to "Flatten Layer" section of notebook

Link to "Dense (64) Layers" section of notebook

Link to "Dense (1) Layers" section of notebook

Categorical Output

TL;DR: Normally when classifying cats and dogs, we would have two output neurons: one to output a binary yes/no answering "is this a cat?" and another to output a binary yes/no answering "is this a dog?". However, in this example, we assume that all inputs contain either only cats or only dogs, so the single-output binary classifier is determining whether an image is a dog (0) or a cat (1).

Image Transformer

TL;DR: The ImageDataGenerator class provided by keras loads image data from a directory and (optionally) applies various transformations to the images in order to generate additional training data from a set of images. For example, the following code block from the notebook creates an ImageDataGenerator that will load images from a folder on disk and apply various transformations (shearing, zooming, and horizontal flipping) to each image during the training process.

train_datagen = ImageDataGenerator(
    rescale=1. / 255,
    shear_range=0.2,
    zoom_range=0.2,
    horizontal_flip=True)

This can then be used to generate training image data:

train_generator = train_datagen.flow_from_directory(
    'train',
    target_size=(img_width, img_height),
    batch_size=batch_size,
    class_mode='binary')

This will look for images in class subdirectories of the relative path train/ (flow_from_directory expects one subdirectory per class, and infers the class labels from the subdirectory names). Note that this image data generator allows us to use images that do not have size \(150 \times 150\), as they will be re-sized to target_size.

Link to "Image Transformer" section of notebook

Next Steps

Now that we have walked through a sample convolutional neural network and covered how each layer transforms the size of the input tensor, we are ready to start applying convolutional neural networks to real problems.

Our next blog post will cover the materials in the charlesreid1/deep-learning-genomics repository on Github, which applies the convolutional neural network concept in a 1D context (applying convolutions to 1D sequences, instead of 2D images) to learn about (and predict) DNA transcription factor binding sites.

Graphs for Bioinformatics, Part 2: Finding Eulerian Paths

Posted in Computational Biology


The Context: de Bruijn Graphs

In Part 1 of this post we discussed a data structure called a de Bruijn graph and covered its application to genome assembly. To summarize, a de Bruijn graph is a type of graph that represents a set of k-mers as a set of directed edges on a graph, connecting the k-mer's (k-1)-mer prefix (the source vertex) to the k-mer's (k-1)-mer suffix (the destination vertex).

As an example, if \(k = 5\), we can represent the k-mer "AAGCT" as an edge connecting the vertex AAGC to the vertex AGCT.

The de Bruijn graph is used to solve a set of problems on Rosalind.info, a website with bioinformatics programming challenges, as part of working through the textbook Bioinformatics Algorithms: An Active Learning Approach and its associated website (Rosalind.info).

Assembling the de Bruijn Graph

The problems from Rosalind.info that require the use of a de Bruijn graph come from Chapter 3. These problems generally give the user either a list of k-mers (to assemble into a de Bruijn graph, as in problem BA3E) or a long sequence of DNA (which can be turned into a list of k-mers and assembled into a de Bruijn graph, as in problem BA3D).

If we are starting with a long string of DNA, we can run through the entire string and extract k-mers using a sliding window. For a string of DNA of length \(d\), this procedure will create \(d - k + 1\) k-mers.
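
A sliding window like this is a one-liner in Python (a minimal sketch; the function name is our own):

```python
def kmers(seq, k):
    """Extract all k-mers from seq using a sliding window."""
    return [seq[i:i+k] for i in range(len(seq) - k + 1)]

# a string of length d yields d - k + 1 k-mers
assert kmers("AAGATTCTCTAC", 4)[:3] == ["AAGA", "AGAT", "GATT"]
assert len(kmers("AAGATTCTCTAC", 4)) == 9
```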

Directed Graph Representation: Adjacency List

The de Bruijn graph is a directed graph. To represent this graph in memory, we utilize an adjacency list data structure. An adjacency list is a key-value lookup table (implemented using a hash map) wherein each source vertex in the graph is a key in the lookup table, and the corresponding value is a list of all destination vertices (all vertices that have a directed edge starting from the source vertex and ending at that vertex).

A Python dictionary can be used to implement the adjacency list hash table. The dictionary keys are the source vertices (or rather, their string labels), and the dictionary values are a list of destination vertices (a list of their string labels).

Thus, the graph AA -> BB -> CC -> DD would be represented with the hash table:

adjacency_list['AA'] = ['BB']
adjacency_list['BB'] = ['CC']
adjacency_list['CC'] = ['DD']

(Notice from this example that the keys of the adjacency list give the source vertices only; to get all vertices, we also need to look at the values of the adjacency list.)

A Quick Example

As a simple example, consider the de Bruijn graph formed from the DNA string AAGATTCTCTAC and \(k = 4\).

This is first turned into a bag of \(d - k + 1 = 9\) 4-mers (our edges):

Sequence:   AAGATTCTCTAC
4-mers:     AAGA
             AGAT
              GATT
               ATTC
                TTCT
                 TCTC
                  CTCT
                   TCTA
                    CTAC

Next, we also create a bag of \(d - (k-1) + 1 = 10\) 3-mers (vertices):

Sequence:   AAGATTCTCTAC
3-mers:     AAG
             AGA
              GAT
               ATT
                TTC
                 TCT
                  CTC
                   TCT
                    CTA
                     TAC

Now we can iterate over every 4-mer edge, find its prefix 3-mer and suffix 3-mer, and create a corresponding entry in the adjacency list hash table.

The list of edges looks like this:

AAG -> AGA
AGA -> GAT
ATT -> TTC
CTA -> TAC
CTC -> TCT
GAT -> ATT
TCT -> CTA,CTC
TTC -> TCT

The corresponding dictionary should look like this:

adjacency_list['AAG'] = ['AGA']
adjacency_list['AGA'] = ['GAT']
adjacency_list['ATT'] = ['TTC']
adjacency_list['CTA'] = ['TAC']
adjacency_list['CTC'] = ['TCT']
adjacency_list['GAT'] = ['ATT']
adjacency_list['TCT'] = ['CTA','CTC']
adjacency_list['TTC'] = ['TCT']
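
The whole worked example can be reproduced with a few lines of Python (a sketch; `debruijn` is a hypothetical helper name):

```python
from collections import defaultdict

def debruijn(seq, k):
    """Build a de Bruijn adjacency list from the k-mers of seq."""
    adjacency_list = defaultdict(list)
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i+k]
        # edge from the (k-1)-mer prefix to the (k-1)-mer suffix
        adjacency_list[kmer[:-1]].append(kmer[1:])
    return {src: sorted(dst) for src, dst in adjacency_list.items()}

adjacency_list = debruijn("AAGATTCTCTAC", 4)
assert adjacency_list["TCT"] == ["CTA", "CTC"]
assert adjacency_list["AAG"] == ["AGA"]
```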

Python vs Go

Now that we're ready to implement a directed graph object and populate it using the data given in the problem, we have to make the difficult choice of what language we want to use to implement the directed graph.

We have covered our use of the Go programming language for Rosalind.info problems before: we previously covered recursion for Chapter 2 problems in Part 1, Part 2, and Part 3 of another post, and we also wrote this post on our impressions of Go and its usefulness in bioinformatics.

We are also implementing all of the solutions to the Rosalind.info problems in our Go library, go-rosalind (see corresponding documentation on godoc.org).

However, we have learned the hard way that Go requires a lot of boilerplate code (boilerplate code that is necessary, mind you, because all of that boilerplate will eventually morph into something problem-specific).

This all means that Go is a very cumbersome language to use to get an algorithm prototype up and running.

Python, on the other hand, is a very easy language for prototyping and has plenty of handy built-in functions and modules that make prototyping an algorithm far easier and faster than doing it in Go.

Our strategy, therefore, is to prototype our algorithm and corresponding graph object in Python, get the algorithm working and tested, then convert the code to Go when we are finished.

Directed Graph Class: Python Implementation

Note that while we could simply use the dictionary object itself as the graph data structure, this is somewhat inelegant, and we would like instead to define a class to bundle related behavior and data together.

We implement the directed graph by defining an AdjacencyGraph class. This is just a glorified wrapper around the adjacency list dictionary, with some extra methods.

We start by defining the class (it inherits from object, Python's base type):

class AdjacencyGraph(object):
    """Directed graph stored using an adjacency list"""
    def __init__(self):
        """Constructor"""
        self.adj = {}
        self.dfs_started = False

The constructor just initializes an empty adjacency list dictionary.

We also define two built-in methods for convenience: __str__ for the string representation of the graph (so we can pass the graph object to print()), and __len__ for getting the number of (source) vertices on the graph.

    def __str__(self):
        """String representation"""
        s = []
        for source in self.adj.keys():
            sinks = self.adj[source]
            for sink in sinks:
                m = "%s -> %s\n"%(source,sink)
                s.append(m)
        return "".join(s)

    def __len__(self):
        """Number of vertices on graph"""
        s = set()
        for source in self.adj.keys():
            s.add(source)
            for sink in self.adj[source]:
                s.add(sink)
        return len(s)

Next, we define some basic functionality useful for all graphs:

  • Getting the in-degree and out-degree of a vertex

    def in_degree(self,u):
        n = 0
        for v in self.adj.keys():
            sinks = self.adj[v]
            if u in sinks:
                n += 1
        return n

    def out_degree(self,u):
        if u in self.adj.keys():
            return len(self.adj[u])
        else:
            return 0

We also define a generator for iterating over the vertices of the graph, along with methods for counting vertices and edges and for getting the neighbors of a vertex:

    def vertices(self):
        vertices = set()
        for k in self.adj.keys():
            vertices.add(k)
            for m in self.adj[k]:
                vertices.add(m)
        for v in vertices:
            yield v

    def n_vertices(self):
        return len(self)

    def n_edges(self):
        n = 0
        for source in self.adj.keys():
            # guard against a None value
            sinks = self.adj[source] or []
            n += len(sinks)
        return n

    def get_neighbors(self,u):
        """Get all neighbors of node u"""
        # Note: neighbors are stored in
        # sorted order
        if u in self.adj.keys():
            return self.adj[u]
        else:
            return []

Finally, we add a method add_edge() that allows us to create an edge from vertex u to vertex v (and add the vertices to the graph if either do not yet exist on the graph).

For convenience, we maintain the adjacency list values (the list of destination vertices) in lexicographic order.

    def add_edge(self, u, v):
        """Add an edge from u to v"""
        # For each source vertex:
        if u in self.adj.keys():

            # Get existing sink list
            t = self.adj[u]

            # Append to it
            t.append(v)

            # Keep list of sinks sorted
            # (lexicographic string sorting)
            t.sort()

            # Store the updated
            # sink list
            self.adj[u] = t

        else:
            # Initialize the list of sinks (v)
            # for the given source (u)
            self.adj[u] = [v]

Now, to assemble the de Bruijn graph, we can iterate over every k-mer edge, form the prefix and suffix vertices, and call the add_edge() function on the graph.
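
In code, that assembly loop is short (sketched here against a minimal stand-in for the class, so the snippet is self-contained):

```python
class AdjacencyGraph(object):
    """Minimal stand-in for the AdjacencyGraph class defined above."""
    def __init__(self):
        self.adj = {}
    def add_edge(self, u, v):
        self.adj.setdefault(u, []).append(v)
        self.adj[u].sort()

def assemble(kmers):
    """Assemble a de Bruijn graph from a list of k-mers."""
    g = AdjacencyGraph()
    for kmer in kmers:
        # (k-1)-mer prefix -> (k-1)-mer suffix
        g.add_edge(kmer[:-1], kmer[1:])
    return g

g = assemble(["AAGA", "AGAT", "GATT"])
assert g.adj["AAG"] == ["AGA"]
```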

Checking for Eulerian Paths and Cycles

To recap Eulerian paths versus Eulerian cycles (discussed in Part 1 of this post):

  • An Eulerian path is a path that visits every edge of a given graph exactly once.
  • An Eulerian cycle is an Eulerian path that begins and ends at the same vertex.

According to Steven Skiena's The Algorithm Design Manual, there are two conditions that must be met for an Eulerian path or cycle to exist. These conditions are different for undirected graphs versus directed graphs.

Undirected graphs:

  • An undirected graph contains an Euler cycle iff (1) it is connected, and (2) each vertex is of even degree.

  • An undirected graph contains an Euler path iff (1) it is connected, and (2) exactly two vertices are of odd degree. These two odd-degree vertices will be the start and end vertices for the Eulerian path.

Directed graphs:

  • A directed graph contains an Euler cycle iff (1) it is strongly connected, and (2) each vertex has the same in-degree as out-degree.

  • A directed graph contains an Euler path iff (1) it is connected (when viewed as an undirected graph), and (2) all vertices except two, x and y, have the same in-degree as out-degree, where x has out-degree one greater than its in-degree (the start vertex) and y has in-degree one greater than its out-degree (the end vertex).
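
The directed-graph degree conditions translate directly into code (a sketch; connectivity must still be checked separately, e.g. with a DFS):

```python
def euler_degree_check(in_deg, out_deg):
    """Return 'cycle', 'path', or None based on in/out degrees.
    in_deg and out_deg map each vertex to its degree."""
    starts, ends = 0, 0
    for v in set(in_deg) | set(out_deg):
        delta = out_deg.get(v, 0) - in_deg.get(v, 0)
        if delta == 1:
            starts += 1          # candidate start vertex
        elif delta == -1:
            ends += 1            # candidate end vertex
        elif delta != 0:
            return None          # vertex is too unbalanced
    if starts == 0 and ends == 0:
        return 'cycle'
    if starts == 1 and ends == 1:
        return 'path'
    return None

# a -> b -> c has one start (a) and one end (c)
assert euler_degree_check({'b': 1, 'c': 1}, {'a': 1, 'b': 1}) == 'path'
```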

Algorithm Building Blocks

The algorithm to find Eulerian paths and cycles consists of several steps, built up from several simpler algorithms.

Undirected graphs are the simpler case; directed graphs are more complicated.

To perform a DFS on a directed graph, we implement two functions:

  1. Write a DFS function that takes a graph as an input argument and that visits each node of the graph in a depth-first search.

  2. Write a visitation function that takes a node as an input argument and that performs some action on the node. This visitation function is called by the DFS function on each node that it visits.
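
A minimal sketch of that pair of functions, using the adjacency-list dictionary representation from above (the visitation function here just records the visit order):

```python
def dfs(adjacency_list, start, visit):
    """Depth-first search; calls visit(node) once per reachable node."""
    visited = set()
    stack = [start]
    while stack:
        node = stack.pop()
        if node in visited:
            continue
        visited.add(node)
        visit(node)
        # push neighbors in reverse so they pop in sorted order
        for neighbor in reversed(adjacency_list.get(node, [])):
            stack.append(neighbor)

order = []
dfs({'AAG': ['AGA'], 'AGA': ['GAT']}, 'AAG', order.append)
assert order == ['AAG', 'AGA', 'GAT']
```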

Kosaraju's Algorithm: Connected Components

On an undirected graph, we can use Fleury's Algorithm to follow edges (classify edges as bridges or non-bridges, then leave bridges for last).

On a directed graph, we have twice the amount of work: we are not just checking that all vertices are reachable from a given vertex, we are also checking that all vertices can reach that vertex. This is precisely what Kosaraju's algorithm does when finding strongly connected components: it runs one DFS on the graph, then a second DFS on the graph with all of its edges reversed.

To Be Continued...

In the next part of this post, we will start with the slightly simpler case of finding an Euler cycle (which has no start or end vertices). Then we will show how finding the Euler path is actually a special case of finding the Euler cycle.

First, we will use Hierholzer's Algorithm to find Euler cycles (the simpler case, since a cycle has no distinguished start or end vertex, so order does not matter).

Next, we will modify the above algorithm to find Euler paths. This requires keeping track of the candidate start and end nodes (verifying that there is exactly one of each), completing a cycle by adding a temporary edge from the end node to the start node, finding the Euler cycle, removing the temporary edge, and finally rotating the cycle so that it begins and ends at the correct nodes.
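
To make the plan concrete, here is a compact sketch of Hierholzer's algorithm for the Euler-cycle case (assuming the adjacency-list dictionary representation used above; edges are consumed as they are followed):

```python
def euler_cycle(adjacency_list, start):
    """Hierholzer's algorithm: walk unused edges until stuck, then
    backtrack, splicing in sub-cycles as they close."""
    # copy the sink lists, since edges are consumed as we go
    adj = {u: list(vs) for u, vs in adjacency_list.items()}
    stack, cycle = [start], []
    while stack:
        u = stack[-1]
        if adj.get(u):
            stack.append(adj[u].pop())   # follow an unused edge
        else:
            cycle.append(stack.pop())    # dead end: emit vertex
    return cycle[::-1]

# triangle a -> b -> c -> a
assert euler_cycle({'a': ['b'], 'b': ['c'], 'c': ['a']}, 'a') == ['a', 'b', 'c', 'a']
```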

Stay tuned for Part 3...

Tags:    go    golang    rosalind    computational biology    bioinformatics    euler    recursion    backtracking    graphs    algorithms    hamiltonian    eulerian   

Graphs for Bioinformatics, Part 1: de Bruijn Graphs, Hamiltonian Paths, and Eulerian Paths

Posted in Computational Biology


The Context: Rosalind.info

To provide a bit of context for a discussion of Euler paths and Euler cycles: starting around December, a group of us in the Lab for Data Intensive Biology (DIB Lab) started working through the textbook Bioinformatics Algorithms: An Active Learning Approach and the associated website, Rosalind.info.

Rosalind.info is a site that is similar in style to Project Euler, a familiar topic on this blog. Project Euler poses computationally challenging problems in the domain of mathematics.

Like Project Euler, the visitor is given one small example input and the corresponding correct output, and one large example input and corresponding output. Also like Project Euler, the problems vary in how much computer science versus domain expertise is needed, but they are largely focused on writing algorithms rather than on the science behind the computations.

Unlike Project Euler, however, Rosalind.info does give plenty of hints (via the textbook, if you have a copy), and sometimes even gives pseudocode for the algorithm. The book is required to get enough context to answer some of the Rosalind.info problems.

Graphs for Bioinformatics

The textbook focuses on different problems in each chapter. For example, Chapter 1 uses the example of a string of DNA that marks where replication begins to introduce some basic bioinformatics concepts and algorithms. Chapter 2 uses the concept of molecular clocks to introduce motifs and motif-finding, the focus of most of the problems in Chapter 2.

Chapter 3 focuses on the problem of genome assembly - how we assemble an entire genome from short segments alone. In particular, the chapter focuses on de Bruijn graphs, which represent a sequence of symbols drawn from an alphabet: each k-mer (a chunk of the sequence of length k) becomes a directed edge, connecting the vertex for its (k-1)-mer prefix to the vertex for its (k-1)-mer suffix. We will cover more of the details of these graphs shortly.

Building a K-mer Graph (The Wrong Graph)

The Bioinformatics Algorithm book starts with a general discussion of how to represent a sequence of DNA nucleotides using a graph. The idea they discuss initially (which is an obvious, but not necessarily good, one) is splitting the sequence into k-mer chunks, like so:

      Sequence:   TAATGCCATGGGATGTT
      Pieces:     TAA 
                   AAT
                    ATG 
                     TGC
                      GCC 
                       CCA
                        CAT 
                         ATG
                          TGG 
                           GGG
                            GGA 
                             GAT
                              ATG 
                               TGT
                                GTT

and letting one k-mer be represented by one vertex. Then the sequence above could be turned into the graph:

TAA -> AAT -> ATG -> TGC -> GCC -> CCA -> CAT -> ATG -> TGG -> GGG -> GGA -> GAT -> ATG -> TGT -> GTT

On this graph, every edge has the property that the first (k-1) nucleotides of the destination match the last (k-1) nucleotides of the source.

If we did not know this sequence in advance, we could draw every edge with that property - every time the last (k-1) characters of a k-mer match the first (k-1) characters of another k-mer, an edge is drawn between those two vertices.

That graph would result in many more edges than the graph shown above.

Furthermore, in theory, if each read sequence came from a single genome and we had the entire genome covered by read sequences, a path through the graph that visits every vertex (every k-mer) would yield the full genome.

A path through a graph that visits every vertex once is called a Hamiltonian path.

Why is this hard? Because the problem of proving a Hamiltonian path exists, let alone finding it, becomes very difficult for large graphs.

Building a De Bruijn Graph (The Right Graph)

Nicolaas de Bruijn introduced (in 1946, in a paper entitled simply "A combinatorial problem") a new way of representing a sequence with a graph. He split a given sequence into k-mers, as before, but instead of representing each k-mer as a vertex on the graph, he represented each k-mer as an edge on the graph.

This type of graph is called a de Bruijn graph.

Specifically, for a DNA sequence, each k-mer from the sequence is represented by an edge, where the source vertex is that k-mer's (k-1)-nucleotide prefix and the destination vertex is that k-mer's (k-1)-nucleotide suffix.

      Sequence:   TAATGCCATGGGATGTT
      Pieces:     TA  
                   AA
                    AT 
                     TG
                      GC 
                       CC
                        CA 
                         AT
                          TG 
                           GG
                            GG 
                             GA 
                              AT  
                               TG 
                                GT 
                                 TT

Now this sequence is written as the graph:

TA -> AA -> AT -> TG -> GC -> CC -> CA -> AT -> TG -> GG -> GG -> GA -> AT -> TG -> GT -> TT

so that the original breakup of the sequence into k-mers is still represented, but now as edges rather than as vertices. That is, the k-mer TAA is represented by the edge TA -> AA.

Transform the Problem: Hamiltonian Paths to Eulerian Paths

The change in the problem representation (k-mers as vertices to k-mers as edges) changes the problem of finding the Hamiltonian path (a path through the graph that visits every vertex exactly once) into the problem of finding the Eulerian path (a path through the graph that visits every edge exactly once).

An Example

Let's look at a slightly simpler example - the one de Bruijn was originally considering - so we can see de Bruijn graphs in action in a slightly simpler case.

In his 1946 paper "A combinatorial problem", de Bruijn describes the problem thus:

Some years ago Ir. K. Posthumus stated an interesting conjecture concerning certain cycles of digits 0 or 1, which we shall call \(P_n\) cycles. For \(n = 1, 2, 3, \dots\), let a \(P_n\) cycle be an ordered cycle of \(2^n\) digits 0 or 1, such that the \(2^n\) possible ordered sets of \(n\) consecutive digits of that cycle are all different. As a consequence, any ordered set of \(n\) digits 0 or 1 occurs exactly once in that cycle.

For example, a \(P_3\) cycle is \(00010111\), respectively showing the triples 000, 001, 010, 101, 011, 111, 110, 100, which are all the possible triples indeed.

In this case, de Bruijn is discussing complete de Bruijn graphs - he constructs a de Bruijn graph of all possible 3-mers (our k-mers, \(k = 3\)), and constructs a path through the graph that visits every edge of the graph. Here is the sequence broken down as above:

      Sequence:   00010111
      Pieces:     00  
                   00
                    01
                     10
                      01
                       11
                        11

The alphabet here is binary: 0 and 1.

This (seemingly simple) example is a bit confusing, but here's what's going on: we have four vertices on the de Bruijn graph, consisting of the 2-mers:

00
01
10
11

Now, if we draw an edge for every possible 3-mer, we would start with the 3-mer 000, which is actually represented by a self-edge from vertex 00 to vertex 00, because the prefix matches the suffix.

Similarly, the 3-mer 111 is represented by a self-edge from vertex 11 to vertex 11.

The other 3-mers are represented by their corresponding edges: 001 is represented by the edge 00 -> 01, 010 by the edge 01 -> 10, etc.

By drawing every possible edge (to represent every possible 3-mer), we assemble the complete de Bruijn graph (that is, the de Bruijn graph containing vertices for all possible 2-mers connected by edges representing every possible 3-mer in the given alphabet).
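
Building the complete graph is mechanical: enumerate every possible k-mer over the alphabet and add its prefix-to-suffix edge (a sketch with hypothetical names):

```python
from itertools import product

def complete_debruijn_edges(alphabet, k):
    """All edges of the complete de Bruijn graph over the alphabet."""
    edges = []
    for chars in product(alphabet, repeat=k):
        kmer = "".join(chars)
        # (k-1)-mer prefix -> (k-1)-mer suffix
        edges.append((kmer[:-1], kmer[1:]))
    return edges

edges = complete_debruijn_edges("01", 3)
assert len(edges) == 8                # 2^3 possible 3-mers
assert ("00", "00") in edges          # self-edge for 000
assert ("11", "11") in edges          # self-edge for 111
```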

The sequence de Bruijn gives in his paper is an Euler cycle through the complete (de Bruijn) graph - a path that visits every edge exactly once, provided we read the sequence cyclically (the last two triples, 110 and 100, wrap around from the end of the sequence back to its start):

Sequence: 00010111

00 -> 00 -> 01 -> 10 -> 01 -> 11 -> 11

Back to DNA

Now the utility of the de Bruijn methodology is more clear: if we can come up with fast, efficient algorithms to find Euler paths on large graphs, we can transform the assembly problem (given fragments of a long sequence, reconstruct the sequence) into the problem of finding an Eulerian path, which is tractable even for large graphs.

Compare this with string matching algorithms utilizing dynamic programming, which can cost \(O(N^2)\) and make genome assembly computationally infeasible.

Part 1 (this post) has covered the basic idea behind assembling DNA sequences using de Bruijn graphs. In Part 2 of this post, we will move on to a discussion of the "real world" problem, warts and all, and how we can relax some of the strict assumptions (like assuming perfect coverage of the original sequence and assuming that all reads are perfect).

Tags:    go    golang    rosalind    computational biology    bioinformatics    euler    recursion    backtracking    graphs    algorithms    hamiltonian    eulerian   
