Deconvoluting Convolutional Neural Networks

Posted in Machine Learning

Introduction: A Simple CNN Example

As part of our weekly Deep Learning for Genomics reading group here in the Lab for Data Intensive Biology (DIB Lab), we are applying convolutional neural networks (deep learning) to various problems in genomics and biology.

For the most recent meeting, we prepared some notes on how convolutional neural networks work. The notes are in the form of a Jupyter notebook. This blog post summarizes some of the important conclusions from the notebook and links to relevant sections in the notebook.

In the notebook covered in this blog post, we set up a simple convolutional neural network from an example on the keras blog. This example is used to classify input images as being either a cat or a dog.

All materials covered in this blog post are in the charlesreid1/deconvoluting-convolutions repository on Github.

Exploring the Data

TL;DR: When developing a deep learning model for a problem, it is important to start by exploring the data and understanding it thoroughly.

Link to "Image Data" section of notebook

Create CNN

TL;DR: Our convolutional neural network consists of the following architecture:

• Convolutional Stage #1
• Convolution (3 x 3 kernel, 32 filters)
• Activation (ReLU)
• Max Pooling (2x2)
• Convolutional Stage #2
• Convolution (3 x 3 kernel, 32 filters)
• Activation (ReLU)
• Max Pooling (2x2)
• Convolutional Stage #3
• Convolution (3 x 3 kernel, 64 filters)
• Activation (ReLU)
• Max Pooling (2x2)
• Flatten
• Dense (64 nodes)
• Activation (ReLU)
• Dropout (0.5)
• Dense (1 node)
• Activation (ReLU)

Link to "Create Convolutional Neural Network" section of notebook

Analyzing Network Architecture and Tensor Shapes

TL;DR: Each step of the neural network transforms an input tensor of a given shape into an output tensor of a (potentially different) shape.

In this section of the notebook, we step through each of the neural network's layers to explain how the size of each layer's inputs and outputs are determined.

Link to "Network Architecture/Shapes" section of notebook

Input Image Layer

TL;DR: The size of the cat and dog images is 150 x 150 pixels. Each image is a color image, so it consists of 3 channels. Therefore, the input to the very first layer has a shape of

$$(\mbox{None}, w_0, h_0, c_0) = (\mbox{None}, 150, 150, 3)$$

(where "None" indicates a variable-size dimension that is equal to the number of total input images, or alternatively, the number of images per batch, if we are using batch learning).

Link to "Input Image Layer" section of notebook

First Convolution Layer

TL;DR: A convolutional layer with a kernel size of $$k_1 \times k_1$$ and a number of filters $$c_1$$ will transform the shape of the input image to:

$$(\mbox{None}, w_1, h_1, c_1) = (\mbox{None}, 148, 148, 32)$$

where

$$w_1 = w_0 - k_1 + 1 \\ h_1 = h_0 - k_1 + 1$$

Importantly, each of the three input channels are added together to determine their contribution to the final convolution filters - the number of input channels does not affect the number of output channels.

The total number of output channels is equal to the number of filters in the convolution layer.

Link to "First Convolutional Layer" section of notebook

First Activation Layer

TL;DR: The activation layer is a straightforward one-to-one mapping - each individual value from the output of the convolution layer is fed through the rectified linear unit (ReLU) function and the resulting output value becomes the input to the next layer. The ReLU function is given by:

$$\mbox{ReLU}(x) = \max(0,x)$$

The activation layer does not change the shape of the input tensor.

Link to "First Activation Layer" section of notebook

First MaxPooling Layer

TL;DR: The max pooling layer is a way of making the final convolutional filters (the "feature-detectors" of the convolutional neural network) less sensitive to the exact placement of features. The pooling layer only affects the size of the filter, not the number of channels.

If we use a max pooling window of $$p_1 \times p_1$$, we will reduce the image size by $$\mbox{ceil}(w_1/p_1)$$ and $$\mbox{ceil}(h_1/p_1)$$. This reduces the input tensor shape to:

$$(\mbox{None}, \mbox{ceil}(w_1/p_1), \mbox{ceil}(h_1/p_1), c_1) = (\mbox{None}, 74, 74, 32)$$

Link to "First Max Pooling Layer" section of notebook

Second Convolution Layer

TL;DR: The second convolutional layer has a kernel size of $$k_2 \times k_2$$ and a number of filters $$c_2$$, which will transform the shape of the input image in the same way as described for the first convolutional layer.

Note that just as the number of channels (3) in each input to the first convolutional layer did not affect the final number of channels in the output of the convolutional layer (number of channels was fixed by specifying number of output filters for the convolutional layer), so the number of input channels to the second convolutional layer does not affect the number of output channels from the second convolutional layer.

The final shape coming out of the second convolutional layer is:

$$(\mbox{None}, w_2, h_2, c_2) = (\mbox{None}, 72, 72, 32)$$

where

$$w_2 = w_1 - k_2 + 1 \\ h_2 = h_1 - k_2 + 1 \\$$

Link to "Second Convolutional Layer" section of notebook

Second Activation Layer

TL;DR: The activation layer again uses a function to map input values to output values in a one-to-one mapping, so the activation layer does not change the shape of the input tensor.

Link to "Second Activation Layer" section of notebook

Second MaxPooling Layer

TL;DR: The second max pooling layer uses a pooling window of size $$p_2 \times p_2$$. This will reduce the input size to $$\mbox{ceil}(w_2/p_2) \times \mbox{ceil}(h_2/p_2)$$. This reduces the input tensor shape to:

$$(\mbox{None}, \mbox{ceil}(w_2/p), \mbox{ceil}(h_2/p), c_2) = (\mbox{None}, 36, 36, 32)$$

Link to "Second Max Pooling Layer" section of notebook

Third Convolution Layer

TL;DR: The third convolution layer with a kernel size of $$k_3 \times k_3$$ and $$c_3$$ output filters will transform the input tensor shape in the following way (note that the third convolutional layer has 64 filters, not 32):

$$(\mbox{None}, w_3, h_3, c_3) = (\mbox{None}, 34, 34, 64)$$

where

$$w_3 = w_2 - k_3 + 1 \\ h_3 = h_2 - k_3 + 1$$

Link to "Third Convolutional Layer" section of notebook

Third Activation Layer

TL;DR: The activation layer again uses a function to map input values to output values in a one-to-one mapping, so the activation layer does not change the shape of the input tensor.

Link to "Third Activation Layer" section of notebook

Third MaxPooling Layer

TL;DR: The thid max pooling layer uses a pooling window of size $$p_3 \times p_3$$. This will reduce the input size to $$\mbox{ceil}(w_3/p_3) \times \mbox{ceil}(h_3/p_3)$$. This reduces the input tensor shape to:

$$(\mbox{None}, \mbox{ceil}(w_3/p_3), \mbox{ceil}(h_3/p_3), c_2) = (\mbox{None}, 17, 17, 64)$$

Link to "Third Max Pooling Layer" section of notebook

Flatten and Dense Layers

TL;DR: The flatten layer converts a tensor of dimension $$(\mbox{None}, 17, 17, 64)$$ into a 1D vector of $$17 \times 17 \times 64 = 18,496$$ neural network nodes. This does not change any of the values, it simply reshapes the input tensor.

The first dense layer reduces the flattened $$18,496$$ nodes to $$64$$ nodes, using a fully connected layer of nodes. These values are then passed through an activation function (as with the above activation layers, this is a one-to-one mapping and does not change the shape of the input tensor). The dense layer is followed by a dropout layer to help prevent overfitting; this pattern is common in convolutional neural networks.

The second dense layer further reduces the $$64$$ nodes to a single node, whose output will determine whether the input image is a cat or a dog.

Link to "Flatten Layer" section of notebook

Link to "Dense (64) Layers" section of notebook

Link to "Dense (1) Layers" section of notebook

Categorical Output

TL;DR: Normally when classifying cats and dogs, we would have two output neurons, one to output a binary yes/no to answer "is this a cat?" and another output a binary yes/no to answer "is this a dog?". However, in this example, we assume that all inputs contain either only cats or only dogs, so the single-output binary classifier is determining whether an image is a dog (0) or a cat (1).

Image Transformer

TL;DR: The ImageDataGenerator class is a class provided by keras for loading image data from a directory and (optionally) applying various transformations to the images in order to generate additional training data from a set of images. For example, the following code block from the notebook creates an ImageDataGenerator class that will load images from a folder on disk, and applies various transformations (shearing, zooming, and horizontally flipping) to each image during the training process.

train_datagen = ImageDataGenerator(
rescale=1. / 255,
shear_range=0.2,
zoom_range=0.2,
horizontal_flip=True)


This can then be used to generate test image data:

train_generator = train_datagen.flow_from_directory(
'train',
target_size=(img_width, img_height),
batch_size=batch_size,
class_mode='binary')


This will look for images in the relative path train/data/ (note the implicit data/ directory tacked on the end). Note that this image data generator allows us to use images that do not have size $$150 \times 150$$, as they will be re-sized to target_size.

Link to "Image Transformer" section of notebook

Next Steps

Now that we have walked through a sample convolutional neural network and covered how each layer transforms the size of the input tensor, we are ready to start applying convolutional neural networks to real problems.

Our next blog post will cover the materials in the charlesreid1/deep-learning-genomics repository on Github, which applies the convolutional neural network concept in a 1D context (applying convolutions to 1D sequences, instead of 2D images) to learn about (and predict) DNA transcription factor binding sites.

Graphs for Bioinformatics, Part 1: de Bruijn Graphs, Hamiltonian Paths, and Eulerian Paths

Posted in Computational Biology

The Context: Rosalind.info

To provide a bit of context for a discussion of Euler paths and Euler cycles: starting around December, a group of us in the Lab for Data Intensive Biology (DIB Lab) started working through the textbook Bioinformatics Algorithms: An Active Learning Approach and the associated website, Rosalind.info.

Rosalind.info is a site that is similar in style to Project Euler, a familiar topic on this blog. Project Euler poses computationally challenging problems in the domain of mathematics.

Like Project Euler, the visitor is given one small example input and the corresponding correct output, and one large example input and corresponding output. Also like Project Euler, the problems vary in how much computer science versus domain expertise is needed, but they are largely focused on writing algorithms rather than on the science behind the computations.

Unlike Project Euler, however, Rosalind.info does give plenty of hints (via the textbook, if you have a copy), and sometimes even gives pseudocode for the algorithm. The book is required to get enough context to answer some of the Rosalind.info problems.

Graphs for Bioinformatics

The textbook focuses on different problems in each chapter. For example, Chapter 1 uses the example of a string of DNA that marks where replication begins to introduce some basic bioinformatics concepts and algorithms. Chapter 2 uses the concept of molecular clocks to introduce motifs and motif-finding, the focus of most of the problems in Chapter 2.

Chapter 3 focuses on the problem of genome assembly - how we assemble an entire genome from short segments alone. In particular, the chapter focuses on de Bruijn graphs, which are graphs that, given a sequence of symbols drawn from an alphabet, are composed of edges (one for each k-mer, that is, a chunk of the sequence of length k), and vertices (one for each k-mer prefix and k-mer suffix, connected by a directed edge of the k-mer). We will cover more of the details of these graphs shortly.

Building a K-mer Graph (The Wrong Graph)

The Bioinformatics Algorithm book starts with a general discussion of how to represent a sequence of DNA nucleotides using a graph. The idea they discuss initially (which is an obvious, but not necessarily good, one) is splitting the sequence into k-mer chunks, like so:

      Sequence:   TAATGCCATGGGATGTT
Pieces:     TAA
AAT
ATG
TGC
GCC
CCA
CAT
ATG
TGG
GGG
GGA
GAT
ATG
TGT
GTT


and letting one k-mer be represented by one vertex. Then the sequence above could be turned into the graph:

TAA -> AAT -> ATG -> TGC -> GCC -> CCA -> CAT -> ATG -> TGG -> GGG -> GGA -> GAT -> ATG -> TGT -> GTT


On this graph, every edge has the property that the first (k-1) nucleotides of the destination match the last (k-1) nucleotides of the source.

If we did not know this sequence in advance, we could draw every edge with that property - every time the last (k-1) characters of a k-mer match the first (k-1) characters of another k-mer, an edge is drawn between those two vertices.

That graph would result in many more edges than the graph shown above.

Furthermore, in theory, if each read sequence came from a single genome and we had the entire genome covered by read sequences, a path through the graph that visits every vertex (every k-mer) would yield the full genome.

A path through a graph that visits every vertex once is called a Hamiltonian path.

Why is this hard? Because the problem of proving a Hamiltonian path exists, let alone finding it, becomes very difficult for large graphs.

Building a De Bruijn Graph (The Right Graph)

Nicolaas de Bruijn introduced (in 1946, in a paper entitled simply "A combinatorial problem") a new way of representing a sequence with a graph. He split a given sequence into k-mers, as before, but instead of representing each k-mer as a vertex on the graph, he represented each k-mer as an edge on the graph.

This type of graph is called a de Bruijn graph.

Specifically, for a DNA sequence, each k-mer from the sequence is represented by an edge, where the source vertex is that k-mer's (k-1)-nucleotide suffix and the destination vertex is that k-mer's (k-1)-nucleotide prefix.

      Sequence:   TAATGCCATGGGATGTT
Pieces:     TA
AA
AT
TG
GC
CC
CA
AT
TG
GG
GG
GA
AT
TG
GT
TT


Now this sequence is written as the graph:

TA -> AA -> AT -> TG -> GC -> CC -> CA -> AT -> TG -> GG -> GG -> GA -> AT -> TG -> GT -> TT


so that the original breakup of the sequence into k-mers is still represented, but now as edges rather than as vertices. That is, the k-mer TAA is represented by the edge TA -> AA.

Transform the Problem: Hamiltonian Paths to Eulerian Paths

The change in the problem representation (k-mers as vertices to k-mers as edges) changes the problem of finding the Hamiltonian path (a path through the graph that visits every vertex exactly once) into the problem of finding the Eulerian path (a path through the graph that visits every edge exactly once).

An Example

Let's look at a slightly simpler example - the one de Bruijn was originally considering - so we can see de Bruijn graphs in action in a slightly simpler case.

In his 1946 paper "A combinatorial problem", de Bruijn describes the problem thus:

Some years ago Ir. K. Posthumus stated an interesting conjecture concerning certain cycles of digits 0 or 1, which we shall call $$P_n$$ cycles. For $$n = 1, 2, 3, \dots$$, a $$P_n$$ cycle be an ordered cycle of $$2^n$$ digits 0 or 1, such that the $$2^n$$ possible ordered sets of $$n$$ consecutive digits of that cycle are all different. As a consequence, any ordered set of $$n$$ digits 0 or 1 occurs exactly once in that cycle.

For example, a $$P_3$$ cycle is $$00010111$$, respectively showing the triples 000, 001, 010, 011, 111, 100, 100, which are all the possible triples indeed.

In this case, de Bruijn is discussing complete de Bruijn graphs - he constructs a de Bruijn graph of all possible 3-mers (our k-mers, $$k = 3$$), and constructs a path through the graph that visits every edge of the graph. Here is the sequence broken down as above:

      Sequence:   00010111
Pieces:     00
00
01
10
01
11
11


The alphabet here is binary: 0 and 1.

This (seemingly simple) example is a bit confusing, but here's what's going on: we have four vertices on the de Bruijn graph, consisting of the 2-mers:

00
01
10
11


Now, if we draw an edge for every possible 3-mer, we would start with the 3-mer 000, which is actually represented by a self-edge from vertex 00 to vertex 00, because the prefix matches the suffix.

Similarly, the 3-mer 111 is represented by a self-edge from vertex 11 to vertex 11.

The other 3-mers are represented by their corresponding edges: 001 is represented by the edge 00 -> 01, 010 by the edge 01 -> 10, etc.

By drawing every possible edge (to represent every possible 3-mer), we assemble the complete de Bruijn graph (that is, the de Bruijn graph containing vertices for all possible 2-mers connected by edges representing every possible 3-mer in the given alphabet).

The sequence de Bruijn gives in his paper is an Euler path through the complete (de Bruijn) graph (that is, a path through the de Bruijn graph that visits every edge exactly once):

Sequence: 00010111

00 -> 00 -> 01 -> 10 -> 01 -> 11 -> 11


Back to DNA

Now the utility of the de Bruijn methodology is more clear: if we can come up with fast, efficient algorithms to find Euler paths on large graphs, we can transform the assembly problem (given fragments of a long sequence, reconstruct the sequence) into the problem of finding an Eulerian path, which is tractable even for large graphs.

Compare this with string matching algorithms utilizing dynamic programming, which can cost $$O(N^2)$$ and make genome assembly computationally infeasible.

Part 1 (this post) has covered the basic idea behind assembling DNA sequences using de Bruijn graphs. In Part 2 of this post, we will move on to a discussion of the "real world" problem, warts and all, and how we can relax some of the strict assumptions (like assuming perfect coverage of the original sequence and assuming that all reads are perfect).

The Git-Commit-Ectomy

Posted in Git

TLDR: Visit the git-commit-ectomy guide: http://pages.charlesreid1.com/git-commit-ectomy

Consider the following completely hypothetical scenario.

Suppose you've been working for a while on your latest invention, a brand-new whiz-bang command line tool that's fast and solves an important problem and you're chugging your way to the finish line.

As part of preparing to release your software tool, you add some tests, because that's what you do.

Those tests require some data, so you add a few test data sets, a few hundred kilobytes each, nothing fancy.

Then one day, the intern (who is just trying to be helpful by adding a new test) slips in a 70 MB test data set, and slips it in with a string of commits that somehow get incorporated into the master branch.

(Side note: you turned on branch protection to prevent this whole mess, didn't you? Didn't you?? 'Course you did. This is all just a hypothetical scenario.)

Now, the situation is complicated: there are several open pull requests and active branches, and a non-trivial amount of history that's been added since the time the large test data set was accidentally added.

The intern apologizes profusely and promises to bring in donuts every day next week. But the damage is done.

The intern, a git novice, pulls out a laptop and runs a git rm on the files, pushing to the remote and happily, ignorantly believing the problem has been solved.

But the intern does not understand how git works. It has a perfect memory, and remembers every file in every commit. Since the problematic first commit that added the large files, git has remembered and will always remember that large file. It's in git's blood. It's what git was designed to do.

Once the intern has been, ahem, moved along, and branch protection has been turned on, it's time to find a git surgeon to perform a git-commit-ectomy to remove the problematic large files from the repository entirely.

Dr. Reid's Patented Git-Commit-Ectomy

If it's a git-commit-ectomy you need, try Dr. Reid's Patented Git-Commit-Ectomy to ease your git commit pains.

Whether you want to keep thing simple and remove a git commit from a single branch, or if you've got multiple branches, Dr. Reid's Patented Git-Commit-Ectomy will get you back on your feet.

Dr. Reid's Patented Git-Commit-Ectomy can handle even the most messy, confused, and tangled git commit history - with a bit of work and a gifted surgeon the git-commit-ectomy can smooth things out and get you feeling right as rain.

Visit the git-commit-ectomy guide: http://pages.charlesreid1.com/git-commit-ectomy