Building Snakemake Command Line Wrappers for Kubernetes Workflows

Posted in Snakemake

NOTE: These ideas are implemented in the repository charlesreid1/2019-snakemake-byok8s.

Recap: Workflows as Executables

In our previous blog post, Building Snakemake Command Line Wrappers we covered some approaches to making Snakemake workflows into executables that can be run as command line utilities.

In this post, we extend those ideas to Snakemake workflows that run on Kubernetes clusters.

2018-snakemake-cli

To recap, back in March 2018 Titus Brown wrote a blog post titled Pydoit, snakemake, and workflows-as-applications in which he implemented a proof-of-concept command line utility wrapping the Snakemake API to create an executable Snakemake workflow.

The end result was a command line utility that could be run like so:

./run <workflow-config> <workflow-params>

Relevant code is in ctb/2018-snakemake-cli.

2019-snakemake-cli

In our previous blog post, Building Snakemake Command Line Wrappers, we extended this idea to create a bundled executable command line utility that could be installed with setup.py and run from a working directory. We also demonstrated a method of writing tests for the Snakemake workflow and running those tests with Travis CI.

We packaged the Snakefile with the command line utility, but the approach is flexible and can be modified to use a user-provided Snakemake workflow or Snakefile.

The end result was a command line utility called bananas that could be installed and run like the run wrapper above:

bananas <workflow-config> <workflow-params>

Relevant code is in charlesreid1/2019-snakemake-cli.

2019-snakemake-byok8s

The next logical step in bundling workflows was to take advantage of Snakemake's ability to run workflows across distributed systems.

Specifically, we wanted to modify the command line utility above to run the workflow on a user-provided Kubernetes cluster, instead of running the workflow locally.

The result is 2019-snakemake-byok8s, a command line utility that can be installed with a setup.py and that launches a Snakemake workflow on a user-provided Kubernetes cluster. Furthermore, we demonstrate how to use minikube to run a local Kubernetes cluster to test Snakemake workflows on Kubernetes clusters.

Here's what it looks like in practice:

# Get byok8s
git clone https://github.com/charlesreid1/2019-snakemake-byok8s.git
cd ~/2019-snakemake-byok8s

# Create a virtual environment
virtualenv vp
source vp/bin/activate

# Install byok8s
pip install -r requirements.txt
python setup.py build install

# Create virtual k8s cluster
minikube start

# Run the workflow on the k8s cluster
cd /path/to/workflow/
byok8s my-workflowfile my-paramsfile --s3-bucket=my-bucket

# Clean up the virtual k8s cluster
minikube stop

We cover the details below.

Overview of 2019-snakemake-byok8s

Cloud + Scale = Kubernetes (k8s)

First, why kubernetes (k8s)?

To scale Snakemake workflows to multiple compute nodes, it is not enough to just give Snakemake a pile of compute nodes and a way to remotely connect to each. Snakemake requires the compute nodes to have a controller and a job submission system.

When using cloud computing platforms like GCP (Google Cloud Platform) or AWS (Amazon Web Services), k8s is a simple and popular way to orchestrate multiple compute nodes (support for Docker images is also baked directly into k8s).

Snakemake k8s Support

Snakemake has built-in support for k8s, making the combination a logical choice for running Snakemake workflows at scale in the cloud.

The minikube tool, which we will cover later in this blog post, makes it easy to run a local virtual k8s cluster for testing purposes, and even makes it possible to run k8s tests using Travis CI.

Snakemake only requires the --kubernetes flag, and an optional namespace, to connect to the k8s cluster. (Under the hood, Snakemake uses the Kubernetes Python API to connect to the cluster and launch jobs.)

If you can run kubectl from a computer to control the Kubernetes cluster, you can run a Snakemake workflow on that cluster.
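
To illustrate that point: the kubernetes Python package that Snakemake uses reads the same kubeconfig that kubectl does, so a quick way to check that Snakemake will be able to reach your cluster is something like the following (a standalone check, not part of byok8s):

from kubernetes import client, config

# Load the same ~/.kube/config that kubectl uses.
config.load_kube_config()

# List pods in the default namespace; if this works,
# Snakemake's kubernetes executor can reach the cluster too.
v1 = client.CoreV1Api()
for pod in v1.list_namespaced_pod(namespace="default").items:
    print(pod.metadata.name, pod.status.phase)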

Let's get into the changes required in the Python code.

Modifying the CLI

In our prior post covering charlesreid1/2019-snakemake-cli, we showed how to create a command line utility by putting the command line interface package in a cli/ directory and registering it as a console-script entry point in setup.py:

cli/
├── Snakefile
├── __init__.py
└── command.py

and the relevant bit from setup.py:

setup(name='bananas',
        ...
        entry_points="""
[console_scripts]
{program} = cli.command:main
      """.format(program = _program),

We want our new command line utility, byok8s, to work the same way, so we can essentially do a s/bananas/byok8s/g across the package.

The only change required happens in the file command.py, where the Snakemake API call happens.

Namespaces

Checking the Snakemake API documentation, we can see that the API has a kubernetes option:

kubernetes (str) – submit jobs to kubernetes, using the given namespace.

so command.py should modify the Snakemake API call accordingly, passing a Kubernetes namespace. This is a parameter most users won't need to set, but we added a -k argument to the argument parser so the user can specify the Kubernetes namespace name if they want; if not specified, the namespace used is default.
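
Here is a rough sketch of what the modified API call can look like. The variable names and the S3 wiring via default_remote_provider / default_remote_prefix are illustrative; the real command.py in byok8s may differ in its details:

import sys
import snakemake

# Illustrative values; in byok8s these come from the parsed
# command line arguments and the workflow/params JSON files.
snakefile = "Snakefile"
k8s_namespace = "default"
s3_bucket = "my-bucket"

# snakemake() returns True on success, False on failure.
status = snakemake.snakemake(
    snakefile,
    configfiles=["workflow-alpha.json", "params-blue.json"],
    kubernetes=k8s_namespace,            # submit jobs to this k8s namespace
    default_remote_provider="S3",        # worker pods do their I/O through S3
    default_remote_prefix=s3_bucket,
)

sys.exit(0 if status else 1)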

Adding flags

We add and modify some flags to make the workflow more flexible (a sketch of the corresponding argparse code follows this list):

  • The user now provides the Snakefile, which is called Snakefile in the current working directory by default but can be specified with the --snakefile or -s flag

  • The user provides the k8s namespace using the --k8s-namespace or -k flag

  • The user provides the name of an S3 bucket for Snakemake worker nodes to use for I/O using the --s3-bucket flag
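
The argparse additions might look something like this (the required/default choices and the exact wiring are illustrative, not copied from byok8s):

import argparse

parser = argparse.ArgumentParser(prog="byok8s")

# Positional arguments: the workflow config and params files.
parser.add_argument("workflowfile")
parser.add_argument("paramsfile")

# The Snakefile to run, defaulting to ./Snakefile in the working directory.
parser.add_argument("-s", "--snakefile", default="Snakefile")

# The Kubernetes namespace to submit jobs to.
parser.add_argument("-k", "--k8s-namespace", default="default")

# The S3 bucket used by worker pods for input/output.
parser.add_argument("--s3-bucket", required=True)

args = parser.parse_args()
# argparse converts --k8s-namespace to args.k8s_namespace, and so on.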

Finally, the user is also required to provide their AWS credentials to access the S3 bucket, via two environment variables that Snakemake passes through to the Kubernetes worker nodes:

AWS_ACCESS_KEY_ID
AWS_SECRET_ACCESS_KEY
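
Since a missing credential only shows up later as an S3 failure inside a worker pod, it can be worth checking for these variables up front. A possible guard in command.py (illustrative; not necessarily what byok8s actually does):

import os
import sys

# Fail early with a clear message if the AWS credentials are not set;
# otherwise the failure surfaces later as an S3 error inside a k8s pod.
for var in ("AWS_ACCESS_KEY_ID", "AWS_SECRET_ACCESS_KEY"):
    if var not in os.environ:
        sys.stderr.write("Error: %s must be set in the environment.\n" % var)
        sys.exit(1)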

For Travis CI testing, these environment variables can be set in the repository settings on the Travis website once Travis CI has been enabled.

See https://charlesreid1.github.io/2019-snakemake-byok8s/travis_tests/ for details.

Local Kubernetes Clusters with Minikube

What is minikube?

Minikube is a Go program that runs a single-node Kubernetes cluster locally, typically inside a virtual machine. This is useful for local testing of Kubernetes workflows, as it does not require setting up or tearing down cloud infrastructure, or long waits for remote resources to become ready.

We cover two ways to use it:

  1. Installing and running a minikube virtual kubernetes cluster on AWS (for development and testing of Snakemake + kubernetes workflows)

  2. Running a minikube cluster on a Travis CI worker node to enable us to test Snakemake + kubernetes workflows.

AWS

Using Minikube from an AWS EC2 compute node comes with two hangups.

The first is that AWS nodes are themselves virtual machines, and you can't run virtual machines inside virtual machines, so it is not possible to use minikube's normal VirtualBox mode, which creates the Kubernetes cluster inside a virtual machine.

Instead, we must use minikube's "none" driver (--vm-driver=none), which runs the Kubernetes components directly on the host using Docker. This is tricky for a couple of reasons:

  • we can't bind-mount a local directory into the kubernetes cluster
  • the minikube cluster must be run with sudo privileges, which means permissions can be a problem

The second hangup with minikube on AWS nodes is that the DNS settings of AWS nodes are copied into the Kubernetes containers, including the kubernetes system's DNS service container. Unfortunately, the AWS node's DNS settings are not valid in the kubernetes cluster, so the DNS container crashes, and no container in the kubernetes cluster can reach the outside world. This must be fixed with a custom config file (provided with byok8s; details below).

Installing Python Prerequisites

To use byok8s from a fresh Ubuntu AWS node (tested with Ubuntu 16.04 (xenial) and 18.04 (bionic)), you will want to install a version of conda; we recommend using pyenv and miniconda:

curl https://pyenv.run | bash

Restart your shell and install miniconda:

pyenv update
pyenv install miniconda3-4.3.30
pyenv global miniconda3-4.3.30

You will also need the virtualenv package to set up a virtual environment:

pip install virtualenv

Installing byok8s

Start by cloning the repo and installing byok8s:

cd ~
git clone https://github.com/charlesreid1/2019-snakemake-byok8s.git
cd ~/2019-snakemake-byok8s

Next, you'll create a virtual environment:

virtualenv vp
source vp/bin/activate

pip install -r requirements.txt
python setup.py build install

Now you should be ready to rock:

which byok8s

Starting a k8s cluster with minikube

Install minikube:

curl -LO https://storage.googleapis.com/minikube/releases/latest/minikube-linux-amd64 \
  && sudo install minikube-linux-amd64 /usr/local/bin/minikube

Now you're ready to start a minikube k8s cluster on your AWS node! Start a k8s cluster as root with:

sudo minikube start

NOTE: The minikube start command will print some commands for you to run to fix permissions - it is important that you run them!

Tear down the cluster with:

sudo minikube stop

While the k8s cluster is running, you can control it and interact with it like a normal k8s cluster using kubectl.

However, as-is, the cluster's DNS settings are broken! We need to fix them before running.

Fixing DNS issues with AWS

We mentioned that the second hangup with minikube on AWS nodes was the DNS settings.

The problem is with /etc/resolv.conf on the AWS host node. It is set up for AWS's internal cloud network routing, but this is copied into the CoreDNS container, which is the kube-system container that manages DNS requests from all k8s containers. The settings from the AWS host confuse the DNS container, and it cannot route any DNS requests.

The Problem

If you're having the problem, you will see something like this with kubectl, where the coredns containers are in a CrashLoopBackOff:

$ kubectl get pods --namespace=kube-system

NAME                               READY   STATUS             RESTARTS   AGE
coredns-86c58d9df4-lvq8b           0/1     CrashLoopBackOff   5          5m17s
coredns-86c58d9df4-pr52t           0/1     CrashLoopBackOff   5          5m17s
etcd-minikube                      1/1     Running            15         4h43m
kube-addon-manager-minikube        1/1     Running            16         4h43m
kube-apiserver-minikube            1/1     Running            15         4h43m
kube-controller-manager-minikube   1/1     Running            15         4h43m
kube-proxy-sq77h                   1/1     Running            3          4h44m
kube-scheduler-minikube            1/1     Running            15         4h43m
storage-provisioner                1/1     Running            6          4h44m

This will cause all Snakemake jobs to fail with a name resolution error when they try to write their output files to the AWS S3 bucket:

$ kubectl logs snakejob-c71fba38-f64b-5803-915d-933ae273d7a4

Building DAG of jobs...
Using shell: /bin/bash
Provided cores: 4
Rules claiming more threads will be scaled down.
Job counts:
    count   jobs
    1   target1
    1

[Thu Jan 24 00:06:03 2019]
rule target1:
    output: cmr-smk-0123/alpha.txt
    jobid: 0

echo alpha blue > cmr-smk-0123/alpha.txt
Traceback (most recent call last):
  File "/opt/conda/lib/python3.7/site-packages/urllib3/connection.py", line 171, in _new_conn
    (self._dns_host, self.port), self.timeout, **extra_kw)
  File "/opt/conda/lib/python3.7/site-packages/urllib3/util/connection.py", line 56, in create_connection
    for res in socket.getaddrinfo(host, port, family, socket.SOCK_STREAM):
  File "/opt/conda/lib/python3.7/socket.py", line 748, in getaddrinfo
    for res in _socket.getaddrinfo(host, port, family, type, proto, flags):
socket.gaierror: [Errno -3] Temporary failure in name resolution

and the Kubernetes log for the CoreDNS container shows the loop being detected:

$ kubectl logs --namespace=kube-system coredns-86c58d9df4-lvq8b

.:53
2019/01/25 14:54:48 [INFO] CoreDNS-1.2.2
2019/01/25 14:54:48 [INFO] linux/amd64, go1.11, fc62f9c
CoreDNS-1.2.2
linux/amd64, go1.11, eb51e8b
2019/01/25 14:54:48 [INFO] plugin/reload: Running configuration MD5 = 486384b491cef6cb69c1f57a02087373
2019/01/25 14:54:48 [FATAL] plugin/loop: Seen "HINFO IN 9273194449250285441.798654804648663468." more than twice, loop detected

Basically, the AWS node's DNS name server settings cause an infinite DNS loop to be set up.

The Fix

Fixing this problem requires manually setting the DNS name servers inside the CoreDNS container to Google's public DNS servers, 8.8.8.8 and 8.8.4.4.

To apply this fix, we use a YAML configuration file to patch the CoreDNS ConfigMap (the Corefile that CoreDNS reads its configuration from).

Hat tip to this long Github issue in the minikube Github repo, specifically this comment by Github user jgoclawski and this comment by Github user bw2. (Note that neither quite solves the problem - jgoclawski's solution is for kube-dns, not CoreDNS, and bw2's YAML is not valid - but both got me most of the way to a solution.)

Here is the YAML file (also in the 2019-snakemake-byok8s repo here: https://github.com/charlesreid1/2019-snakemake-byok8s/blob/master/test/fixcoredns.yml):

fixcoredns.yml:

kind: ConfigMap
apiVersion: v1
data:
  Corefile: |
    .:53 {
        errors
        health
        kubernetes cluster.local in-addr.arpa ip6.arpa {
           upstream 8.8.8.8 8.8.4.4
           pods insecure
           fallthrough in-addr.arpa ip6.arpa
        }
        proxy .  8.8.8.8 8.8.4.4
        cache 30
        reload
    }
metadata:
  creationTimestamp: 2019-01-25T22:55:15Z
  name: coredns
  namespace: kube-system

(NOTE: There is also a fixkubedns.yml if you are using an older Kubernetes version that uses kube-dns instead of CoreDNS.)

To tell the k8s cluster to use this configuration when it recreates the CoreDNS pods, run this kubectl command while the cluster is running:

kubectl apply -f fixcoredns.yml

Last but not least, delete all kube-system pods and let Kubernetes regenerate them:

kubectl delete --all pods --namespace kube-system

The pods will regenerate quickly, and you can check to confirm that the CoreDNS pods are no longer in the CrashLoopBackOff state and are Running nicely:

kubectl get pods --namespace=kube-system

This is all documented in this comment in the same Github issue in the minikube repo that was linked to above, kubernetes/minikube issue #2027: dnsmasq pod CrashLoopBackOff.

AWS + byok8s Workflow

Now that the k8s cluster is running successfully, run the example byok8s workflow in the test/ directory of the byok8s repository (assuming you cloned the repo to ~/2019-snakemake-byok8s and created the virtual environment as before):

# Return to the repo and activate the virtual environment
cd ~/2019-snakemake-byok8s
source vp/bin/activate
cd test/

# Verify k8s is running
minikube status

# Export AWS keys for Snakemake
export AWS_ACCESS_KEY_ID="XXXXX"
export AWS_SECRET_ACCESS_KEY="XXXXX"

# Run byok8s
byok8s workflow-alpha params-blue --s3-bucket=mah-bukkit 

The bucket you specify must be created in advance and be writable by the account whose credentials you are passing in via environment variables.
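
If you still need to create that bucket, one way to do it from Python is with boto3 (which Snakemake's S3 remote support uses under the hood); the bucket name and region here are placeholders:

import boto3

# Placeholder bucket name and region; use your own values.
bucket_name = "mah-bukkit"
region = "us-west-2"

s3 = boto3.client("s3", region_name=region)

# Outside us-east-1, S3 requires an explicit LocationConstraint.
s3.create_bucket(
    Bucket=bucket_name,
    CreateBucketConfiguration={"LocationConstraint": region},
)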

When you do all of this, you should see the job running, then exiting successfully:

$ byok8s --s3-bucket=cmr-0123 -f workflow-alpha params-blue
--------
details!
    snakefile: /home/ubuntu/2019-snakemake-byok8s/test/Snakefile
    config: /home/ubuntu/2019-snakemake-byok8s/test/workflow-alpha.json
    params: /home/ubuntu/2019-snakemake-byok8s/test/params-blue.json
    target: target1
    k8s namespace: default
--------
Building DAG of jobs...
Using shell: /bin/bash
Provided cores: 1
Rules claiming more threads will be scaled down.
Job counts:
    count   jobs
    1   target1
    1
Resources before job selection: {'_cores': 1, '_nodes': 9223372036854775807}
Ready jobs (1):
    target1
Selected jobs (1):
    target1
Resources after job selection: {'_cores': 0, '_nodes': 9223372036854775806}

[Mon Jan 28 18:06:08 2019]
rule target1:
    output: cmr-0123/alpha.txt
    jobid: 0

echo alpha blue > cmr-0123/alpha.txt
Get status with:
kubectl describe pod snakejob-e585b53f-f9d5-5142-ac50-af5a0d532e85
kubectl logs snakejob-e585b53f-f9d5-5142-ac50-af5a0d532e85
Checking status for pod snakejob-e585b53f-f9d5-5142-ac50-af5a0d532e85
[Mon Jan 28 18:06:18 2019]
Finished job 0.
1 of 1 steps (100%) done
Complete log: /home/ubuntu/2019-snakemake-byok8s/test/.snakemake/log/2019-01-28T180607.988313.snakemake.log
unlocking
removing lock
removing lock
removed all locks

Woo hoo! You've successfully run a Snakemake workflow on a virtual Kubernetes cluster!

Travis

Like running minikube on an AWS node, running minikube on Travis workers also suffers from DNS issues. Fortunately, Github user LiliC worked out how to run minikube on Travis, and importantly, did so for multiple versions of minikube and kubernetes.

The relevant .travis.yml file is available in the LiliC/travis-minikube repo on Github.

We ended up using the minikube-30-kube-1.12 branch of LiliC/travis-minikube, which used the most up-to-date version of minikube and kubernetes available in that repo. The .travis.yml file provided by LiliC on that branch is here.

The example script by LiliC provided 90% of the legwork (thanks!!!), and we only needed to modify a few lines of LiliC's Travis file (which launches a redis container using kubectl) to use Snakemake (launched via byok8s) instead.

.travis.yml

Here is the final .travis.yml file, which has explanatory comments.

.travis.yml:

# Modified from original:
# https://raw.githubusercontent.com/LiliC/travis-minikube/minikube-30-kube-1.12/.travis.yml

# byok8s and Snakemake both require Python,
# so we make this Travis CI test Python-based.
language: python
python:
- "3.6"

# Running minikube via travis requires sudo
sudo: required

# We need systemd for kubeadm, and it is the default init system from Ubuntu 16.04+
dist: xenial

# This moves Kubernetes specific config files.
env:
- CHANGE_MINIKUBE_NONE_USER=true

install:
# Install byok8s requirements (snakemake, python-kubernetes)
- pip install -r requirements.txt
# Install byok8s cli tool
- python setup.py build install

before_script:
# Do everything from test/
- cd test
# Make root mounted as rshared to fix kube-dns issues.
- sudo mount --make-rshared /
# Download kubectl, which is a requirement for using minikube.
- curl -Lo kubectl https://storage.googleapis.com/kubernetes-release/release/v1.12.0/bin/linux/amd64/kubectl && chmod +x kubectl && sudo mv kubectl /usr/local/bin/
# Download minikube.
- curl -Lo minikube https://storage.googleapis.com/minikube/releases/v0.30.0/minikube-linux-amd64 && chmod +x minikube && sudo mv minikube /usr/local/bin/
- sudo minikube start --vm-driver=none --bootstrapper=kubeadm --kubernetes-version=v1.12.0
# Fix the kubectl context, as it's often stale.
- minikube update-context
# Wait for Kubernetes to be up and ready.
- JSONPATH='{range .items[*]}{@.metadata.name}:{range @.status.conditions[*]}{@.type}={@.status};{end}{end}'; until kubectl get nodes -o jsonpath="$JSONPATH" 2>&1 | grep -q "Ready=True"; do sleep 1; done

################
## easy test
script:
- kubectl cluster-info
# Verify kube-addon-manager.
# kube-addon-manager is responsible for managing other kubernetes components, such as kube-dns, the dashboard, and the storage-provisioner.
- JSONPATH='{range .items[*]}{@.metadata.name}:{range @.status.conditions[*]}{@.type}={@.status};{end}{end}'; until kubectl -n kube-system get pods -lcomponent=kube-addon-manager -o jsonpath="$JSONPATH" 2>&1 | grep -q "Ready=True"; do sleep 1;echo "waiting for kube-addon-manager to be available"; kubectl get pods --all-namespaces; done
# Wait for kube-dns to be ready.
- JSONPATH='{range .items[*]}{@.metadata.name}:{range @.status.conditions[*]}{@.type}={@.status};{end}{end}'; until kubectl -n kube-system get pods -lk8s-app=kube-dns -o jsonpath="$JSONPATH" 2>&1 | grep -q "Ready=True"; do sleep 1;echo "waiting for kube-dns to be available"; kubectl get pods --all-namespaces; done

################ 
## hard test
# run byok8s workflow on the k8s cluster
- byok8s --s3-bucket=cmr-0123 -f workflow-alpha params-blue

End Product: byok8s

The final byok8s package can be found in the charlesreid1/2019-snakemake-byok8s repository on Github.

To return to our quick start, here is what running byok8s end-to-end on a minikube kubernetes cluster on an AWS node looks like (slightly modified from the intro of our post):

# Install minikube
curl -LO https://storage.googleapis.com/minikube/releases/latest/minikube-linux-amd64 \
  && sudo install minikube-linux-amd64 /usr/local/bin/minikube

# Get byok8s
git clone https://github.com/charlesreid1/2019-snakemake-byok8s.git
cd ~/2019-snakemake-byok8s

# Create a virtual environment
virtualenv vp
source vp/bin/activate

# Install byok8s
pip install -r requirements.txt
python setup.py build install

# Create virtual k8s cluster
sudo minikube start

# Fix CoreDNS
kubectl apply -f test/fixcoredns.yml
kubectl delete --all pods --namespace kube-system

# Wait for kube-system to respawn
kubectl get pods --namespace=kube-system

# Run the workflow on the k8s cluster
cd test/
byok8s workflow-alpha params-blue --s3-bucket=mah-bukkit 

# Clean up the virtual k8s cluster
sudo minikube stop

Documentation

You can find documentation for 2019-snakemake-byok8s here: https://charlesreid1.github.io/2019-snakemake-byok8s/

The documentation covers a quick start on AWS nodes, similar to what is covered above, as well as more information about running byok8s on other types of Kubernetes clusters (e.g., AWS, Google Cloud, and Digital Ocean).

Next Steps

Last year we were working on implementing metagenomic pipelines for shotgun sequencing data as part of the dahak-metagenomics project. We implemented several Snakemake workflows in the dahak repo, and began (but never completed) work on a command line utility to run these workflows called dahak-taco.

Our next major goal is to reboot dahak-taco and redesign it to run metagenomic workflows from dahak on Kubernetes clusters, similar to the way byok8s works.

Stay tuned for more!

Tags:    python    bioinformatics    workflows    pipelines    snakemake    travis    kubernetes    minikube