charlesreid1.com blog

A Few of My Favorite PEPs

Posted in Python

What's your favorite PEP?

PEPs, or Python Enhancement Proposals, are documents in which features, changes, or general ideas are proposed for addition to the core Python language.

As Python users, we believe it's important to ask questions like this.

Picking a "favorite PEP" is not just about having a ready and clever answer to a question you might expect in a technical interview; the PEP documents really are important, and really do shape where Python is today and where it will be in the future.

So let's look at a few of our favorite PEPs.

PEP 0: The PEP Index

PEP 0 - the easiest answer to the question, "what's your favorite PEP?"

PEP 0 - Index of Python Enhancement Proposals (PEPs) lists all PEPs, including PEPs about PEPs, accepted PEPs, open PEPs, finished PEPs, informational PEPs, and abandoned PEPs.

This is also a good place to search for a keyword or browse PEPs.

This PEP is the favorite of people who love enumerations, library card catalogs, biblical genealogies, and litanies.

PEP 8: The Python Style Guide

PEP 8 covers the recommended Python style. It is a surprisingly quick read.

This PEP dishes "official" opinions about controversial topics such as:

  • tabs or spaces (spoiler: spaces)
  • line width
  • whitespace
  • naming conventions for variables, classes, and modules

This PEP is the chosen favorite of those programmers who keep their crayons organized in the correct color order.
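
For a flavor of what these recommendations look like in practice, here is a small, purely illustrative snippet that follows them: four-space indentation, a CapWords class name, snake_case method names, and single spaces around operators.

class TemperatureLog:
    """Store temperature readings and report a simple statistic."""

    def __init__(self):
        self.readings = []

    def add_reading(self, value):
        self.readings.append(value)

    def mean_reading(self):
        return sum(self.readings) / len(self.readings)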

PEP 20: The Zen of Python

PEP 20 contains 20 aphorisms that compose the Zen of Python - only 19 of which are contained in the PEP...

Also available from Python via:

>>> import this

Many of the aphorisms in PEP 20 come in pairs.

The first seven alone compose an excellent philosophy of programming. Six symmetric rules:

Beautiful is better than ugly.

Explicit is better than implicit.

Simple is better than complex.

Complex is better than complicated.

Flat is better than nested.

Sparse is better than dense.

The seventh, one of the principal ideas behind Python:

Readability counts.

The next pair of aphorisms is important to our own style of programming:

Special cases aren't special enough to break the rules.

Although practicality beats purity.

The latter aphorism is an acknowledgement that, ultimately, programming is a means to an end, and Python (or whatever programming language you use) should not get in the way of reaching that end - especially not for the sake of some abstract principle or theory.

PEP 20 weighs in on errors:

Errors should never pass silently.

Unless explicitly silenced.
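
In code, the difference between the two looks something like this (an illustrative sketch):

import os
from contextlib import suppress

# An error passing silently (discouraged): the reader has no idea
# which exceptions we expect here, or why we are ignoring them.
try:
    os.remove('scratch.txt')
except Exception:
    pass

# Explicitly silenced (better): the silenced error is named,
# so the intent is clear.
with suppress(FileNotFoundError):
    os.remove('scratch.txt')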

Slightly perplexing:

In the face of ambiguity, refuse the temptation to guess.

More pairs:

There should be one-- and preferably only one -- obvious way to do it.

Although that way may not be obvious at first unless you're Dutch.

From the Wikipedia page on Guido van Rossum:

Guido van Rossum is a Dutch programmer...

Now is better than never.

Although never is often better than *right* now.

That last one sounds like an excuse for procrastination.

If the implementation is hard to explain, it's a bad idea.

If the implementation is easy to explain, it may be a good idea.

Finally, the last aphorism covers the reason you never see from module import *:

Namespaces are one honking great idea - let's do more of those!

Namespaces, in this case, refers to importing a Python package under its own name - like import itertools or import numpy as np - rather than dumping its contents into your own namespace.

It turns out that, yes, in fact, namespaces are a great idea!
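
A quick illustration of why: with explicit namespaces, two functions that share a name can coexist, and the code says which one is meant.

>>> import math
>>> import numpy as np
>>> # both modules define a sqrt; the namespace says which one we mean
>>> print(math.sqrt(2))
1.4142135623730951
>>> print(np.sqrt(2))
1.4142135623730951

With from math import * followed by from numpy import *, the second import would silently shadow the first.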

PEP 3099: Things That Will Not Change in Python 3000

We can't really decide what we enjoy most about PEP 3099. Maybe it's the fact that it does the opposite of what most proclamations of a new major version do, which is, to say what new features it will not have. Maybe it's the way the language's creators demonstrate how well they have learned from the mistakes of others who adopt the "Burn it to the ground and rewrite from scratch" philosophy. Or maybe it's the delightful nostalgia of "Python 3000".

In any case, PEP 3099 is an instructive read, because any feature that will explicitly be kept during a major version bump is clearly either (a) useful, (b) important, or (c) both. Additionally, it gives some insight into the design decisions made when Python was implemented ("Why does Python do X this way, instead of some other easier way?").

Not only that, you also get to walk through a graveyard of abandoned (but still interesting) ideas, and the links given in the PEP to the Python mailing list can provide additional useful information.

Addendum: PEPs Affecting 2 to 3 Changes

In contrast to PEP 3099, which contains a list of all the things that did not change in Python 3, there were a large number of PEPs that did cause Python 3 to behave differently from Python 2.

PEP 202: List Comprehensions

Of course, picking your favorite PEP can also be an opportunity to make a statement about your favorite language feature of Python, since many of Python's most useful language features got their start as PEPs.

For us, list comprehensions (covered in PEP 202) are a clear winner in any competition for most useful language feature. List comprehensions are a way of shortening for loop syntax, making it much easier to perform map and filter operations. Some examples from PEP 202:

>>> print([i for i in range(20) if i%2 == 0])
[0, 2, 4, 6, 8, 10, 12, 14, 16, 18]

>>> nums = [1, 2, 3, 4]

>>> fruit = ["Apples", "Peaches", "Pears", "Bananas"]

>>> print([(i, f) for i in nums for f in fruit])
[(1, 'Apples'), (1, 'Peaches'), (1, 'Pears'), (1, 'Bananas'),
 (2, 'Apples'), (2, 'Peaches'), (2, 'Pears'), (2, 'Bananas'),
 (3, 'Apples'), (3, 'Peaches'), (3, 'Pears'), (3, 'Bananas'),
 (4, 'Apples'), (4, 'Peaches'), (4, 'Pears'), (4, 'Bananas')]

>>> print([(i, f) for i in nums for f in fruit if f[0] == "P"])
[(1, 'Peaches'), (1, 'Pears'),
 (2, 'Peaches'), (2, 'Pears'),
 (3, 'Peaches'), (3, 'Pears'),
 (4, 'Peaches'), (4, 'Pears')]

>>> print([(i, f) for i in nums for f in fruit if f[0] == "P" if i%2 == 1])
[(1, 'Peaches'), (1, 'Pears'), (3, 'Peaches'), (3, 'Pears')]

>>> print([i for i in zip(nums, fruit) if i[0]%2==0])
[(2, 'Peaches'), (4, 'Bananas')]
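
For comparison, here is the first example above written as an explicit for loop:

>>> evens = []
>>> for i in range(20):
...     if i % 2 == 0:
...         evens.append(i)
...
>>> print(evens)
[0, 2, 4, 6, 8, 10, 12, 14, 16, 18]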

List comprehensions enable code to be short but expressive, brief but elegant. Brevity is the soul of wit, after all.

All the PEPs on Github

All the PEPs are available on Github.

Tags:    python    pep    computer science    programming   

Context Managers in Python

Posted in Python

A Predicament

Recently we spent some time contributing to dib-lab/eelpond (renamed to elvers), an executable Snakemake workflow for running the eelpond mRNAseq workflow.

In the process of tracking down a confusing bug in the Snakemake workflow, we used Snakemake's ability to print a directed acyclic graph (hereafter referred to as a dag) representing its task graph. Snakemake prints the dot notation to stdout.

(The graph representation ended up identifying the problem: two task graphs that were independent when they were not supposed to be.)

When creating the Graphviz dot notation, Snakemake is kind enough to direct all of its output messages to stderr, and direct the dot graph output to stdout, which makes it easy to redirect stdout to a .dot file and process it with Graphviz.

Github user @bluegenes (the principal author of elvers) added a --dag flag to the run_eelpond script, which asks Snakemake to print the dag when it calls the Snakemake API:

./run_eelpond --dag ... > eelpond_dag.dot

This .dot file can then be rendered into a .png file with another command from the command line,

dot eelpond_dag.dot -Tpng -o eelpond_dag.png

Simple, right?

But here's the problem: while this is a simple and easy way to generate the dag, it introduces extra steps for the user, and it prevents us from printing anything to stdout before or after the dag is generated, since anything the program prints to stdout will be redirected into the final dot file along with the Graphviz dot output.

So how to avoid the extra steps on the command line, while also improving the flexibility in printing to stdout (i.e., only capturing Snakemake's output to a file)?

Can we add two flags like --dagfile and --dagpng that would, respectively, save the task graph directly into a .dot file, or render the dot output from Snakemake directly into a png using dot?

We implemented precisely this functionality in dib-lab/eelpond PR #73. To do this, we utilized a context manager to capture output from Snakemake. In this post we'll cover how this context manager works, and mention a few other possibilities with context managers.

What is a context manager?

If you have done even a little Python programming, you have probably seen and used context managers before - they are the blocks of code opened with the with keyword. For example, the classic Pythonic way to write to a file uses a context manager:

with open('file.txt', 'w') as f:
    f.write("\n".join(str(i) for i in range(10)))

The context manager defines a runtime context for all the code in the block - and that can be a different context than the rest of the program. When a context is opened (when the with block is encountered), a context manager object is created and its __enter__() method is run. This method modifies the runtime context in whatever way it needs, and then the code in the block is run. When the context is done (where the block ends), the context manager's __exit__() method is run, restoring the runtime context to its normal state for the rest of the program.
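
For example, the file-writing block above behaves roughly like the following (simplified - the real protocol also passes exception details to __exit__()):

# Roughly what "with open('file.txt', 'w') as f:" does under the hood
manager = open('file.txt', 'w')
f = manager.__enter__()
try:
    f.write("\n".join(str(i) for i in range(10)))
finally:
    manager.__exit__(None, None, None)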

It's a general concept with a lot of different applications. We cover how to use it to capture output to sys.stdout below.

What is Graphviz dot?

We mentioned that Snakemake can output visualizations of workflows in Graphviz dot format. For the purposes of clarity we explain what that format is here.

Without getting too sidetracked, Graphviz dot defines a notation for drawing graphs, and provides software for laying out the graphs in rendered images.

The user specifies the nodes and labels and edges, as well as formatting and layout details, and dot takes care of laying out the graph.

Here's an example of a simple graph in dot notation:

cities.dot

digraph G {
    Boston
    "New York"
    Houston
    "Los Angeles"
    Seattle

    Boston -> "New York"
    Boston -> Houston
    Houston -> Boston
    Houston -> "Los Angeles"
    "New York" -> Seattle
    Seattle -> "New York"
    "New York" -> "Los Angeles"
}

To render this as a .png image,

dot cities.dot -Tpng -o cities.png

which becomes:

dot graph of cities

This tool makes visualizing workflows a breeze, as the flow of tasks is much easier to understand and troubleshoot than the convoluted logic of Snakefile rules. Here is an example from elvers:

elvers dag

Capturing stdout

In elvers, the run_eelpond command line wrapper that kicks off the workflow is a Python script that calls the Snakemake API (we covered this approach in a prior blog post, Building Snakemake Command Line Wrappers for Workflows).

This Python script has a call to the Snakemake API; here is the relevant snippet:

        # ...set up...

        if not building_dag:
            print('--------')
            print('details!')
            print('\tsnakefile: {}'.format(snakefile))
            print('\tconfig: {}'.format(configfile))
            print('\tparams: {}'.format(paramsfile))
            print('\ttargets: {}'.format(repr(targs)))
            print('\treport: {}'.format(repr(reportfile)))
            print('--------')

        # Begin snakemake API call

        status = snakemake.snakemake(snakefile, configfile=paramsfile, use_conda=True, 
                                 targets=['eelpond'], printshellcmds=True, 
                                 cores=args.threads, cleanup_conda= args.cleanup_conda,
                                 dryrun=args.dry_run, lock=not args.nolock,
                                 unlock=args.unlock,
                                 verbose=args.verbose, debug_dag=args.debug, 
                                 conda_prefix=args.conda_prefix, 
                                 create_envs_only=args.create_envs_only,
                                 restart_times=args.restart_times,
                                 printdag=building_dag, keepgoing=args.keep_going,
                                 forcetargets=args.forcetargets,forceall=args.forceall)

        # End snakemake API call

        # ...clean up...

Most of the code that comes before this API call is processing the flags provided by the user. We want to have the flexibility to print to stdout while processing flags, before we get to the Snakemake API call; and we want those messages to be kept separate from the dag output.

In other words, we only want to capture output to stdout between "Begin snakemake API call" and "End snakemake API call". Everywhere else, stdout can go to stdout like normal.

We can do this by recognizing that any Python program printing to stdout uses sys.stdout under the hood - so if we can somehow swap out sys.stdout for a string buffer with the same methods (write, flush, etc.), run Snakemake, then put the real stdout back, we can isolate and capture all of the stdout from the Snakemake API call.
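
Here is that idea in isolation, independent of Snakemake (a minimal sketch):

import sys
from io import StringIO

buffer = StringIO()
saved_stdout = sys.stdout      # keep a reference to the real stdout
sys.stdout = buffer            # anything printed now goes to the buffer

print("this line is captured, not shown")

sys.stdout = saved_stdout      # put the real stdout back
print(buffer.getvalue())       # prints the captured text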

Replacing stdout

The strategy for our context manager and the entry and exit methods, then, is clear:

  • If the user has specified the --dag flag, the __enter__() method should replace stdout with a StringIO buffer within our new runtime context; otherwise, leave stdout alone.

  • If the user has specified the --dag flag, the __exit__() method should clean up by restoring sys.stdout; otherwise, do nothing.

Now we are ready to make our context manager object.

But wait! What kind of object are we using? Do we need some kind of special context manager class? Nope! This is one of the features of context managers that makes them magical:

Any object can be a context manager.

All we need to do is add __enter__() and __exit__() methods to an object, and it can become a context manager.

Creating a context manager

In our case, we are capturing stdout from Snakemake so that we can potentially process it, and then dump it to a file. We don't know how many lines Snakemake will output, so we will replace sys.stdout with a string buffer. But once the context closes, we want all those strings in something more convenient, like a list.

So, we can define a new class that derives from the list class, and just adds __enter__() and __exit__() methods, to enable this list to be a context manager:

class CaptureStdout(list):
    """
    A utility object that uses a context manager
    to capture stdout from Snakemake. Useful when
    creating the directed acyclic graph.
    """
    def __init__(self,*args,**kwargs):
        pass

    def __enter__(self,*args,**kwargs):
        pass

    def __exit__(self,*args,**kwargs):
        pass

(Note that we include the constructor, since we need the context manager to have a state so that we can restore the original runtime context to the way it was when we're done.)

Constructor

The constructor is where we process any input arguments passed in when the context is created.

Given that we want our context manager to handle the case of a directed acyclic graph by capturing stdout, and do nothing otherwise, we should have a flag in the constructor indicating whether we want to pass stdout through, or whether we want to capture it.

Additionally, we don't need to call the parent (super) class constructor, i.e., the list constructor, because we always start with an empty list. No need to call super().__init__().

Here is the constructor:

class CaptureStdout(list):
    """
    A utility object that uses a context manager
    to capture stdout from Snakemake. Useful when
    creating the directed acyclic graph.
    """
    def __init__(self,passthru=False):
        # Boolean: should we pass everything through to stdout?
        # (this object is only functional if passthru is False)
        self.passthru = passthru

Enter method

When we open the context, we want to swap out sys.stdout with a string buffer. But we also want to save the original sys.stdout object reference, so that we can restore the original runtime context and let the program continue printing to stdout after Snakemake is done.

from io import StringIO
import sys

class CaptureStdout(list):

    ...

    def __enter__(self):
        """
        Open a new context with this CaptureStdout
        object. This happens when we say
        "with CaptureStdout() as output:"
        """
        # If we are just passing input on to output, pass thru
        if self.passthru:
            return self

        # Otherwise, we want to swap out sys.stdout with
        # a StringIO object that will save stdout.
        # 
        # Save the existing stdout object so we can
        # restore it when we're done
        self._stdout = sys.stdout
        # Now swap out stdout 
        sys.stdout = self._stringio = StringIO()
        return self

Exit method

To clean up, we will need to restore sys.stdout using the pointer we saved in __enter__, then process the string buffer.

We can also use the del operator to clean up the space used by the buffer object once we've transferred its contents.

from io import StringIO
import sys

class CaptureStdout(list):

    ...

    def __exit__(self, *args):
        """
        Close the context and clean up.
        The *args are needed in case there is an
        exception (we don't deal with those here).
        """
        # If we are just passing input on to output, pass thru.
        # (Return nothing here: returning a truthy value from __exit__
        # would suppress exceptions raised inside the with block.)
        if self.passthru:
            return

        # This entire class extends the list class,
        # so we call self.extend() to add a list to 
        # the end of self (in this case, all the new
        # lines from our StringIO object).
        self.extend(self._stringio.getvalue().splitlines())

        # Clean up (if this is missing, the garbage collector
        # will eventually take care of this...)
        del self._stringio

        # Clean up by setting sys.stdout back to what
        # it was before we opened up this context.
        sys.stdout = self._stdout

In action

To see the context manager in action, let's go back to the snippet of code where we call the Snakemake API:

        # Set up a context manager to capture stdout if we're building
        # a directed acyclic graph (which prints the graph in dot format
        # to stdout instead of to a file).
        # If we are not building a dag, pass all output straight to stdout
        # without capturing any of it.
        passthru = not building_dag
        with CaptureStdout(passthru=passthru) as output:
            # run!!
            # params file becomes snakemake configfile
            status = snakemake.snakemake(snakefile, configfile=paramsfile, use_conda=True, 
                                     targets=['eelpond'], printshellcmds=True, 
                                     cores=args.threads, cleanup_conda= args.cleanup_conda,
                                     dryrun=args.dry_run, lock=not args.nolock,
                                     unlock=args.unlock,
                                     verbose=args.verbose, debug_dag=args.debug, 
                                     conda_prefix=args.conda_prefix, 
                                     create_envs_only=args.create_envs_only,
                                     restart_times=args.restart_times,
                                     printdag=building_dag, keepgoing=args.keep_going,
                                     forcetargets=args.forcetargets,forceall=args.forceall)

Once we have closed the runtime context, our variable output is a list with all the output from Snakemake (assuming we're creating a dag; if not, everything is passed through to stdout like normal).

The last bit here is to handle the three different dag flags: --dag, --dagfile, and --dagpng.

  • --dag prints the dot graph straight to stdout, like Snakemake's default dag behavior;

  • --dagfile=<dotfile> dumps the dot graph to a dot file

  • --dagpng=<pngfile> uses dot (installed in the elvers conda environment) to render the dot output from Snakemake directly into a png image

We handle these three cases like so:

        if building_dag:

            # These three --dag args are mutually exclusive,
            # and are checked in order of precedence (hi to low):
            # --dag         to stdout
            # --dagfile     to .dot
            # --dagpng      to .png

            if args.dag:
                # straight to stdout
                print("\n".join(output))

            elif args.dagfile:
                with open(args.dagfile,'w') as f:
                    f.write("\n".join(output))
                print(f"\tPrinted workflow dag to dot file {args.dagfile}\n\n ")

            elif args.dagpng:
                # dump dot output to temporary dot file
                with open('.temp.dot','w') as f:
                    f.write("\n".join(output))
                subprocess.call(['dot','-Tpng','.temp.dot','-o',args.dagpng])
                subprocess.call(['rm','-f','.temp.dot'])
                print(f"\tPrinted workflow dag to png file {args.dagpng}\n\n ")

Note that before the Snakemake API call, we also check whether dot exists:

        # if user specified --dagpng,
        # graphviz dot must be present
        if args.dagpng:
            if shutil.which('dot') is None:
                sys.stderr.write(f"\n\tError: Cannot find 'dot' utility, but --dagpng flag was specified. Fix this by installing graphviz dot.\n\n")
                sys.exit(-1)

The final code is implemented in the cmr_better_dag_handling branch of eelpond/elvers and pull request #73 in eelpond/elvers.

Using the new dag flags

$ git clone https://github.com/dib-lab/eelpond.git
$ cd eelpond
$ conda env create --file environment.yml -n eelpond
$ conda activate eelpond

Now we're ready to run the workflow. We can use the -w flag to list all workflows, then use the -n flag to do a dry run.

We'll use the kmer_trim workflow target:

$ ./run_eelpond examples/nema.yaml kmer_trim -n

...lots of output...

Job counts:
    count   jobs
    1   eelpond
    10  http_get_fq1
    10  http_get_fq2
    10  khmer_pe_diginorm
    10  khmer_split_paired
    10  trimmomatic_pe
    51
This was a dry-run (flag -n). The order of jobs does not
reflect the order of execution.

Now we can create a dag for this workflow target:

$ ./run_eelpond examples/nema.yaml kmer_trim --dagfile=dag_kmertrimming.dot
    Added default parameters from rule-specific params files.
    Writing full params to examples/.ep_nema.yaml
Building DAG of jobs...
    Printed workflow dag to dot file dag_kmertrimming.dot

Finally, we can use the --dagpng flag for instant gratification:

$ ./run_eelpond examples/nema.yaml kmer_trim --dagpng=dag_kmertrimming.png
    Added default parameters from rule-specific params files.
    Writing full params to examples/.ep_nema.yaml
Building DAG of jobs...
    Printed workflow dag to png file dag_kmertrimming.png

Note that you can add a line rankdir=LR; to your dot file to change the orientation of the graph (left-to-right order makes highly-parallel workflows vertically stretched, so they are easier to view).

digraph mydigraph {

    rankdir=LR;

    ...
}

and here is the result:

elvers dag

Other context manager applications

Actions requiring temporary contexts, which are a bit like self-contained workspaces, are good candidates for context managers. Following are a few examples and references.


SSH connections: the context manager's __enter__ function creates/loads connection details, creates a connection object, and opens the connection. The __exit__ function cleans up by closing the connection. This way, you can say something like

with SSHConnectionManager(device) as conn:
    try:
        conn.send_command("echo hello world")
    except Exception as e:
        print("Enountered an error running remote command", e)

Blog post: Using Python Context Managers for SSH connections

(Note: this blog post uses a context manager that is a generator decorated with a context manager utility function; this is a different approach from our class-based approach, but is still valid.)
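
For illustration, here is a rough class-based sketch of what such a connection manager might look like using paramiko (our own illustration with made-up parameter names, not the post's code):

import paramiko

class SSHConnectionManager:
    """Sketch: open an SSH connection on entry, close it on exit."""

    def __init__(self, hostname, username=None, key_filename=None):
        self.hostname = hostname
        self.username = username
        self.key_filename = key_filename
        self.client = None

    def __enter__(self):
        # Open the connection when the context is entered
        self.client = paramiko.SSHClient()
        self.client.set_missing_host_key_policy(paramiko.AutoAddPolicy())
        self.client.connect(self.hostname,
                            username=self.username,
                            key_filename=self.key_filename)
        return self

    def send_command(self, command):
        # Run a remote command and return its stdout as a string
        stdin, stdout, stderr = self.client.exec_command(command)
        return stdout.read().decode()

    def __exit__(self, *args):
        # Close the connection when the context exits
        if self.client is not None:
            self.client.close()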


IPython notebook and matplotlib figure management: Camille Scott has a blog post covering a way of managing large Jupyter notebooks with lots of figures using context managers. In this case, the context being set up and torn down is a matplotlib figure: the context manager creates each plot, saves it to a file, then closes it. This makes the notebook a lot faster than rendering every single plot inline, and a lot cleaner than littering the code with manual figure and axis management.

Here is the example usage Camille gives:

with FigManager('genes_per_sample', figsize=tall_size) as (fig, ax):
    genes_support_df.sum().plot(kind='barh', fontsize=14, 
                                color=labels_df.color, 
                                figure=fig, ax=ax)
    ax.set_title('Represented Genes per Sample')

FileLink('genes_per_sample.svg')

(The FileLink function displays a link to the resulting image file in the notebook.)

Nice!

Blog post: Context Managers and IPython Notebook
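
As with the SSH example, here is a rough sketch of what such a figure-managing context manager might look like (our own illustration, not Camille's actual implementation):

import matplotlib.pyplot as plt

class FigManager:
    """Sketch: create a figure on entry, save and close it on exit."""

    def __init__(self, name, figsize=(8, 6), fmt='svg'):
        self.name = name
        self.figsize = figsize
        self.fmt = fmt

    def __enter__(self):
        # Create the figure and axes for the block to draw on
        self.fig, self.ax = plt.subplots(figsize=self.figsize)
        return self.fig, self.ax

    def __exit__(self, *args):
        # Save the figure to disk, then close it so the notebook
        # does not keep (or render) every figure
        self.fig.savefig('{}.{}'.format(self.name, self.fmt))
        plt.close(self.fig)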


Database Connection: this example comes from Django's test suite. In it, a context manager is defined that creates new MySQL database connections when the context is opened, and closes them when the context is done.

Here is the context manager, which again uses the decorator + generator approach rather than the class approach:

django/tests/backends/mysql/tests.py:

@contextmanager
def get_connection():
    new_connection = connection.copy()
    yield new_connection
    new_connection.close()

(Link to file on Github)

The lines up to and including the yield are equivalent to an __enter__ method (with the yielded value playing the role of the return value), while the line after the yield is equivalent to an __exit__ method.

This context manager is then used like so:

    with get_connection() as new_connection:
        new_connection.settings_dict['OPTIONS']['isolation_level'] = self.other_isolation_level
        self.assertEqual(
            self.get_isolation_level(new_connection),
            self.isolation_values[self.other_isolation_level]
        )

The context manager is a convenient way of creating a copy of an existing MySQL connection, then closing it when the requesting method is finished using it.
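
For comparison, a class-based version of the same context manager might look roughly like this (our sketch, not Django's actual code):

from django.db import connection

class GetConnection:
    """Sketch: class-based equivalent of the generator version above."""

    def __enter__(self):
        # Everything up to the yield: copy the existing connection
        self.new_connection = connection.copy()
        return self.new_connection

    def __exit__(self, *args):
        # Everything after the yield: close the copied connection
        self.new_connection.close()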


References

  1. elvers (dib-lab/eelpond on Github),

  2. eelpond mRNAseq workflow.

    • this mRNAseq data processing protocol served as the original inspiration for elvers
  3. Context managers (Python documentation)

  4. PEP 343 - the "with" statement

  5. Snakemake

  6. Graphviz dot

  7. Building Snakemake Command Line Wrappers for Workflows (charlesreid1 blog)

Tags:    context managers    testing    python    programming   

Building Snakemake Command Line Wrappers for Kubernetes Workflows

Posted in Snakemake

NOTE: These ideas are implemented in the repository charlesreid1/2019-snakemake-byok8s.

Recap: Workflows as Executables

In our previous blog post, Building Snakemake Command Line Wrappers, we covered some approaches to making Snakemake workflows into executables that can be run as command line utilities.

In this post, we extend those ideas to Snakemake workflows that run on Kubernetes clusters.

2018-snakemake-cli

To recap, back in March 2018 Titus Brown wrote a blog post titled Pydoit, snakemake, and workflows-as-applications in which he implemented a proof-of-concept command line utility wrapping the Snakemake API to create an executable Snakemake workflow.

The end result was a command line utility that could be run like so:

./run <workflow-config> <workflow-params>

Relevant code is in ctb/2018-snakemake-cli.

2019-snakemake-cli

In our previous blog post, Building Snakemake Command Line Wrappers, we extended this idea to create a bundled executable command line utility that could be installed with setup.py and run from a working directory. We also demonstrated a method of writing tests for the Snakemake workflow and running those tests with Travis CI.

We packaged the Snakefile with the command line utility, but the approach is flexible and can be modified to use a user-provided Snakemake workflow or Snakefile.

The end result was a command line utility called bananas that could be installed and run like the run wrapper above:

bananas <workflow-config> <workflow-params>

Relevant code is in charlesreid1/2019-snakemake-cli.

2019-snakemake-byok8s

The next logical step in bundling workflows was to take advantage of Snakemake's ability to run workflows across distributed systems.

Specifically, we wanted to modify the command line utility above to run the workflow on a user-provided Kubernetes cluster, instead of running the workflow locally.

The result is 2019-snakemake-byok8s, a command line utility that can be installed with a setup.py and that launches a Snakemake workflow on a user-provided Kubernetes cluster. Furthermore, we demonstrate how to use minikube to run a local Kubernetes cluster to test Snakemake workflows on Kubernetes clusters.

Here's what it looks like in practice:

# Get byok8s
git clone https://github.com/charlesreid1/2019-snakemake-byok8s.git
cd ~/2019-snakemake-byok8s

# Create a virtual environment
virtualenv vp
source vp/bin/activate

# Install byok8s
pip install -r requirements.txt
python setup.py build install

# Create virtual k8s cluster
minikube start

# Run the workflow on the k8s cluster
cd /path/to/workflow/
byok8s my-workflowfile my-paramsfile --s3-bucket=my-bucket

# Clean up the virtual k8s cluster
minikube stop

We cover the details below.

Overview of 2019-snakemake-byok8s

Cloud + Scale = Kubernetes (k8s)

First, why kubernetes (k8s)?

To scale Snakemake workflows to multiple compute nodes, it is not enough to just give Snakemake a pile of compute nodes and a way to remotely connect to each. Snakemake requires the compute nodes to have a controller and a job submission system.

When using cloud computing platforms like GCP (Google Cloud Platform) or AWS (Amazon Web Services), k8s is a simple and popular way to orchestrate multiple compute nodes (support for Docker images is also baked directly into k8s).

Snakemake k8s Support

Snakemake has built-in support for k8s, making the combination a logical choice for running Snakemake workflows at scale in the cloud.

The minikube tool, which we will cover later in this blog post, makes it easy to run a local virtual k8s cluster for testing purposes, and even makes it possible to run k8s tests using Travis CI.

Snakemake only requires the --kubernetes flag, and an optional namespace, to connect to the k8s cluster. (Under the hood, Snakemake uses the Kubernetes Python API to connect to the cluster and launch jobs.)

If you can run kubectl from a computer to control the Kubernetes cluster, you can run a Snakemake workflow on that cluster.

Let's get into the changes required in the Python code.

Modifying the CLI

In our prior post covering charlesreid1/2019-snakemake-cli, we showed how to create a command line utility by putting the command line interface package in a cli/ directory and specifying it as a console entry point in setup.py:

cli/
├── Snakefile
├── __init__.py
└── command.py

and the relevant bit from setup.py:

setup(name='bananas',
        ...
        entry_points="""
[console_scripts]
{program} = cli.command:main
      """.format(program = _program),

We want our new command line utility, byok8s, to work the same way, so we can do a s/bananas/byok8s/g across the package.

The only change required happens in the file command.py, where the Snakemake API call happens.

Namespaces

Checking the Snakemake API documentation, we can see that the API has a kubernetes option:

kubernetes (str) – submit jobs to kubernetes, using the given namespace.

so command.py should modify the Snakemake API call accordingly, passing along a Kubernetes namespace. This is a parameter most users won't need to provide, but we added a -k argument to the argument parser so the user can specify the Kubernetes namespace if needed. By default, the namespace used is default.
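
A sketch of the relevant pieces of command.py might look like this (illustrative; the real script takes many more arguments, and everything except the namespace handling is omitted):

import argparse
import snakemake

parser = argparse.ArgumentParser(prog='byok8s')
parser.add_argument('-k', '--k8s-namespace', default='default',
                    help='Kubernetes namespace to submit jobs to')
args = parser.parse_args()

# The "kubernetes" argument tells Snakemake to submit jobs to the
# cluster in the current kubectl context, using the given namespace.
# (All other Snakemake API arguments omitted for brevity.)
status = snakemake.snakemake(
    'Snakefile',
    kubernetes=args.k8s_namespace,
)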

Adding flags

We add and modify some flags to make the workflow more flexible:

  • The user now provides the Snakefile, which is called Snakefile in the current working directory by default but can be specified with the --snakefile or -s flag

  • The user provides the k8s namespace using the --k8s-namespace or -k flag

  • The user provides the name of an S3 bucket for Snakemake worker nodes to use for I/O using the --s3-bucket flag

Finally, the user is also required to provide their AWS credentials to access the S3 bucket, via two environment variables that Snakemake passes through to the Kubernetes worker nodes:

AWS_ACCESS_KEY_ID
AWS_SECRET_ACCESS_KEY

For Travis CI testing, these environment variables can be set in the repository settings on the Travis website once Travis CI has been enabled.

See https://charlesreid1.github.io/2019-snakemake-byok8s/travis_tests/ for details.

Local Kubernetes Clusters with Minikube

What is minikube?

Minikube is a Go program that allows users to simulate a single-node kubernetes cluster using a virtual machine. This is useful for local testing of Kubernetes workflows, as it does not require setting up or tearing down cloud infrastructure, or long waits for remote resources to become ready.

We cover two ways to use it:

  1. Installing and running a minikube virtual kubernetes cluster on AWS (for development and testing of Snakemake + kubernetes workflows)

  2. Running a minikube cluster on a Travis CI worker node to enable us to test Snakemake + kubernetes workflows.

AWS

Using Minikube from an AWS EC2 compute node comes with two hangups.

The first is that AWS nodes are themselves virtual machines, and you can't run virtual machines within virtual machines, so it is not possible to use minikube's normal VirtualBox mode, which runs the kubernetes cluster inside a virtual machine.

Instead, we must use minikube's native driver, meaning minikube uses docker directly. This is tricky for several reasons:

  • we can't bind-mount a local directory into the kubernetes cluster
  • the minikube cluster must be run with sudo privileges, which means permissions can be a problem

The second hangup with minikube on AWS nodes is that the DNS settings of AWS nodes are copied into the Kubernetes containers, including the kubernetes system's DNS service container. Unfortunately, the AWS node's DNS settings are not valid in the kubernetes cluster, so the DNS container crashes, and no container in the kubernetes cluster can reach the outside world. This must be fixed with a custom config file (provided with byok8s; details below).

Installing Python Prerequisites

To use byok8s from a fresh Ubuntu AWS node (tested with Ubuntu 16.04 (xenial) and 18.04 (bionic)), you will want to install a version of conda; we recommend using pyenv and miniconda:

curl https://pyenv.run | bash

Restart your shell and install miniconda:

pyenv update
pyenv install miniconda3-4.3.30
pyenv global miniconda3-4.3.30

You will also need the virtualenv package to set up a virtual environment:

pip install virtualenv

Installing byok8s

Start by cloning the repo and installing byok8s:

cd 
git clone https://github.com/charlesreid1/2019-snakemake-byok8s.git
cd ~/2019-snakemake-byok8s

Next, you'll create a virtual environment:

virtualenv vp
source vp/bin/activate

pip install -r requirements.txt
python setup.py build install

Now you should be ready to rock:

which byok8s

Starting a k8s cluster with minikube

Install minikube:

curl -LO https://storage.googleapis.com/minikube/releases/latest/minikube-linux-amd64 \
  && sudo install minikube-linux-amd64 /usr/local/bin/minikube

Now you're ready to start a minikube k8s cluster on your AWS node! Start a k8s cluster as root with:

sudo minikube start

NOTE: The minikube start command will print some commands for you to run to fix permissions - it is important that you run them!

Tear down the cluster with:

sudo minikube stop

While the k8s cluster is running, you can control it and interact with it like a normal k8s cluster using kubectl.

However, as-is, the cluster's DNS settings are broken! We need to fix them before running.

Fixing DNS issues with AWS

We mentioned that the second hangup with minikube on AWS was the DNS settings.

The problem is with /etc/resolv.conf on the AWS host node. It is set up for AWS's internal cloud network routing, but this is copied into the CoreDNS container, which is the kube-system container that manages DNS requests from all k8s containers. The settings from the AWS host confuse the DNS container, and it cannot route any DNS requests.

The Problem

If you're having the problem, you will see something like this with kubectl, where the coredns containers are in a CrashLoopBackOff:

$ kubectl get pods --namespace=kube-system

NAME                               READY   STATUS             RESTARTS   AGE
coredns-86c58d9df4-lvq8b           0/1     CrashLoopBackOff   5          5m17s
coredns-86c58d9df4-pr52t           0/1     CrashLoopBackOff   5          5m17s
etcd-minikube                      1/1     Running            15         4h43m
kube-addon-manager-minikube        1/1     Running            16         4h43m
kube-apiserver-minikube            1/1     Running            15         4h43m
kube-controller-manager-minikube   1/1     Running            15         4h43m
kube-proxy-sq77h                   1/1     Running            3          4h44m
kube-scheduler-minikube            1/1     Running            15         4h43m
storage-provisioner                1/1     Running            6          4h44m

This will cause all Snakemake jobs to fail with a name resolution error when they try to write their output files to the AWS S3 bucket:

$ kubectl logs snakejob-c71fba38-f64b-5803-915d-933ae273d7a4

Building DAG of jobs...
Using shell: /bin/bash
Provided cores: 4
Rules claiming more threads will be scaled down.
Job counts:
    count   jobs
    1   target1
    1

[Thu Jan 24 00:06:03 2019]
rule target1:
    output: cmr-smk-0123/alpha.txt
    jobid: 0

echo alpha blue > cmr-smk-0123/alpha.txt
Traceback (most recent call last):
  File "/opt/conda/lib/python3.7/site-packages/urllib3/connection.py", line 171, in _new_conn
    (self._dns_host, self.port), self.timeout, **extra_kw)
  File "/opt/conda/lib/python3.7/site-packages/urllib3/util/connection.py", line 56, in create_connection
    for res in socket.getaddrinfo(host, port, family, socket.SOCK_STREAM):
  File "/opt/conda/lib/python3.7/socket.py", line 748, in getaddrinfo
    for res in _socket.getaddrinfo(host, port, family, type, proto, flags):
socket.gaierror: [Errno -3] Temporary failure in name resolution

and the kubernetes log for the CoreDNS container shows the DNS loop being detected:

$ kubectl logs --namespace=kube-system coredns-86c58d9df4-lvq8b

.:53
2019/01/25 14:54:48 [INFO] CoreDNS-1.2.2
2019/01/25 14:54:48 [INFO] linux/amd64, go1.11, fc62f9c
CoreDNS-1.2.2
linux/amd64, go1.11, eb51e8b
2019/01/25 14:54:48 [INFO] plugin/reload: Running configuration MD5 = 486384b491cef6cb69c1f57a02087373
2019/01/25 14:54:48 [FATAL] plugin/loop: Seen "HINFO IN 9273194449250285441.798654804648663468." more than twice, loop detected

Basically, the AWS node's DNS name server settings cause an infinite DNS loop to be set up.

The Fix

Fixing this problem requires manually setting the DNS name servers inside the CoreDNS container to Google's public DNS servers, 8.8.8.8 and 8.8.4.4.

To apply this fix, we use a YAML configuration file to patch the CoreDNS configuration (a ConfigMap).

Hat tip to this long Github issue in the minikube Github repo, and specifically this comment by Github user jgoclawski, and also this comment by Github user bw2. (Note that neither of these quite solves the problem - jgoclawski's solution is for kube-dns, not CoreDNS, and bw2's YAML is not valid - but both got us most of the way to a solution.)

Here is the YAML file (also in the 2019-snakemake-byok8s repo here: https://github.com/charlesreid1/2019-snakemake-byok8s/blob/master/test/fixcoredns.yml):

fixcoredns.yml:

kind: ConfigMap
apiVersion: v1
data:
  Corefile: |
    .:53 {
        errors
        health
        kubernetes cluster.local in-addr.arpa ip6.arpa {
           upstream 8.8.8.8 8.8.4.4
           pods insecure
           fallthrough in-addr.arpa ip6.arpa
        }
        proxy .  8.8.8.8 8.8.4.4
        cache 30
        reload
    }
metadata:
  creationTimestamp: 2019-01-25T22:55:15Z
  name: coredns
  namespace: kube-system

(NOTE: There is also a fixkubedns.yml if you are using an older Kubernetes version that uses kube-dns instead of CoreDNS.)

To tell the k8s cluster to use this configuration when it creates a CoreDNS container, run this kubectl command while the cluster is running:

kubectl apply -f fixcoredns.yml

Last but not least, delete all kube-system containers and let Kubernetes regenerate them:

kubectl delete --all pods --namespace kube-system

The pods will regenerate quickly, and you can check to confirm that the CoreDNS container is no longer in the CrashLoopBackOff state and is Running nicely:

kubectl get pods --namespace=kube-system

This is all documented in this comment in the same Github issue in the minikube repo that was linked to above, kubernetes/minikube issue #2027: dnsmasq pod CrashLoopBackOff.

AWS + byok8s Workflow

Now that the k8s cluster is running successfully, run the example byok8s workflow in the test/ directory of the byok8s repository (assuming you cloned the repo to ~/2019-snakemake-byok8s, and are using the same virtual environment as before):

# Return to our virtual environment
cd ~/2019-snakemake-byok8s
source vp/bin/activate
cd test/

# Verify k8s is running
minikube status

# Export AWS keys for Snakemake
export AWS_ACCESS_KEY_ID="XXXXX"
export AWS_SECRET_ACCESS_KEY="XXXXX"

# Run byok8s
byok8s workflow-alpha params-blue --s3-bucket=mah-bukkit 

The bucket you specify must be created in advance and be writable by the account whose credentials you are passing in via environment variables.

When you do all of this, you should see the job running, then exiting successfully:

$ byok8s --s3-bucket=cmr-0123 -f workflow-alpha params-blue
--------
details!
    snakefile: /home/ubuntu/2019-snakemake-byok8s/test/Snakefile
    config: /home/ubuntu/2019-snakemake-byok8s/test/workflow-alpha.json
    params: /home/ubuntu/2019-snakemake-byok8s/test/params-blue.json
    target: target1
    k8s namespace: default
--------
Building DAG of jobs...
Using shell: /bin/bash
Provided cores: 1
Rules claiming more threads will be scaled down.
Job counts:
    count   jobs
    1   target1
    1
Resources before job selection: {'_cores': 1, '_nodes': 9223372036854775807}
Ready jobs (1):
    target1
Selected jobs (1):
    target1
Resources after job selection: {'_cores': 0, '_nodes': 9223372036854775806}

[Mon Jan 28 18:06:08 2019]
rule target1:
    output: cmr-0123/alpha.txt
    jobid: 0

echo alpha blue > cmr-0123/alpha.txt
Get status with:
kubectl describe pod snakejob-e585b53f-f9d5-5142-ac50-af5a0d532e85
kubectl logs snakejob-e585b53f-f9d5-5142-ac50-af5a0d532e85
Checking status for pod snakejob-e585b53f-f9d5-5142-ac50-af5a0d532e85
[Mon Jan 28 18:06:18 2019]
Finished job 0.
1 of 1 steps (100%) done
Complete log: /home/ubuntu/2019-snakemake-byok8s/test/.snakemake/log/2019-01-28T180607.988313.snakemake.log
unlocking
removing lock
removing lock
removed all locks

Woo hoo! You've successfully run a Snakemake workflow on a virtual Kubernetes cluster!

Travis

Like running minikube on an AWS node, running minikube on Travis workers also suffers from DNS issues. Fortunately, Github user LiliC worked out how to run minikube on Travis, and importantly, did so for multiple versions of minikube and kubernetes.

The relevant .travis.yml file is available in the LiliC/travis-minikube repo on Github.

We ended up using the minikube-30-kube-1.12 branch of LiliC/travis-minikube, which used the most up-to-date version of minikube and kubernetes available in that repo. The .travis.yml file provided by LiliC on that branch is here.

The example script by LiliC provided 90% of the legwork (thanks!!!), and we only needed to modify a few lines of LiliC's Travis file (which launches a redis container using kubectl) to use Snakemake (launched via byok8s) instead.

.travis.yml

Here is the final .travis.yml file, which has explanatory comments.

.travis.yml:

# Modified from original:
# https://raw.githubusercontent.com/LiliC/travis-minikube/minikube-30-kube-1.12/.travis.yml

# byok8s and Snakemake both require Python,
# so we make this Travis CI test Python-based.
language: python
python:
- "3.6"

# Running minikube via travis requires sudo
sudo: required

# We need systemd for kubeadm, and it's the default from 16.04+
dist: xenial

# This moves Kubernetes specific config files.
env:
- CHANGE_MINIKUBE_NONE_USER=true

install:
# Install byok8s requirements (snakemake, python-kubernetes)
- pip install -r requirements.txt
# Install byok8s cli tool
- python setup.py build install

before_script:
# Do everything from test/
- cd test
# Make root mounted as rshared to fix kube-dns issues.
- sudo mount --make-rshared /
# Download kubectl, which is a requirement for using minikube.
- curl -Lo kubectl https://storage.googleapis.com/kubernetes-release/release/v1.12.0/bin/linux/amd64/kubectl && chmod +x kubectl && sudo mv kubectl /usr/local/bin/
# Download minikube.
- curl -Lo minikube https://storage.googleapis.com/minikube/releases/v0.30.0/minikube-linux-amd64 && chmod +x minikube && sudo mv minikube /usr/local/bin/
- sudo minikube start --vm-driver=none --bootstrapper=kubeadm --kubernetes-version=v1.12.0
# Fix the kubectl context, as it's often stale.
- minikube update-context
# Wait for Kubernetes to be up and ready.
- JSONPATH='{range .items[*]}{@.metadata.name}:{range @.status.conditions[*]}{@.type}={@.status};{end}{end}'; until kubectl get nodes -o jsonpath="$JSONPATH" 2>&1 | grep -q "Ready=True"; do sleep 1; done

################
## easy test
script:
- kubectl cluster-info
# Verify kube-addon-manager.
# kube-addon-manager is responsible for managing other kubernetes components, such as kube-dns, dashboard, storage-provisioner..
- JSONPATH='{range .items[*]}{@.metadata.name}:{range @.status.conditions[*]}{@.type}={@.status};{end}{end}'; until kubectl -n kube-system get pods -lcomponent=kube-addon-manager -o jsonpath="$JSONPATH" 2>&1 | grep -q "Ready=True"; do sleep 1;echo "waiting for kube-addon-manager to be available"; kubectl get pods --all-namespaces; done
# Wait for kube-dns to be ready.
- JSONPATH='{range .items[*]}{@.metadata.name}:{range @.status.conditions[*]}{@.type}={@.status};{end}{end}'; until kubectl -n kube-system get pods -lk8s-app=kube-dns -o jsonpath="$JSONPATH" 2>&1 | grep -q "Ready=True"; do sleep 1;echo "waiting for kube-dns to be available"; kubectl get pods --all-namespaces; done

################ 
## hard test
# run byok8s workflow on the k8s cluster
- byok8s --s3-bucket=cmr-0123 -f workflow-alpha params-blue

End Product: byok8s

The final byok8s package can be found in the charlesreid1/2019-snakemake-byok8s repository on Github.

You can find documentation for 2019-snakemake-byok8s here: https://charlesreid1.github.io/2019-snakemake-byok8s/

To return to our quick start, here is what running byok8s end-to-end on a minikube kubernetes cluster on an AWS node looks like (slightly modified from the intro of our post):

# Install minikube
curl -LO https://storage.googleapis.com/minikube/releases/latest/minikube-linux-amd64 \
  && sudo install minikube-linux-amd64 /usr/local/bin/minikube

# Get byok8s
git clone https://github.com/charlesreid1/2019-snakemake-byok8s.git
cd ~/2019-snakemake-byok8s

# Create a virtual environment
virtualenv vp
source vp/bin/activate

# Install byok8s
pip install -r requirements.txt
python setup.py build install

# Create virtual k8s cluster
sudo minikube start

# Fix CoreDNS
kubectl apply -f test/fixcoredns.yml
kubectl delete --all pods --namespace kube-system

# Wait for kube-system to respawn
kubectl get pods --namespace=kube-system

# Run the workflow on the k8s cluster
cd test/
byok8s workflow-alpha params-blue --s3-bucket=mah-bukkit 

# Clean up the virtual k8s cluster
sudo minikube stop

Documentation

You can find documentation for 2019-snakemake-byok8s here: https://charlesreid1.github.io/2019-snakemake-byok8s/

The documentation covers a quick start on AWS nodes, similar to what is covered above, as well as more information about running byok8s on other types of Kubernetes clusters (e.g., AWS, Google Cloud, and Digital Ocean).

Next Steps

Last year we were working on implementing metagenomic pipelines for shotgun sequencing data as part of the dahak-metagenomics project. We implemented several Snakemake workflows in the dahak repo, and began (but never completed) work on a command line utility to run these workflows called dahak-taco.

Our next major goal is to reboot dahak-taco and redesign it to run metagenomic workflows from dahak on Kubernetes clusters, similar to the way byok8s works.

Stay tuned for more!

Tags:    python    bioinformatics    workflows    pipelines    snakemake    travis    kubernetes    minikube   
