charlesreid1.com blog

centillion: a document search engine

Posted in Python


We're excited to announce the public release of centillion, a document search engine.

centillion is a search tool that can be used by any individual or organization to index Github repositories (including the content of markdown files), Google Drive folders (including the content of .docx files), and Disqus comment threads.

centillion is tested using Travis CI.

centillion was originally written for the NIH Data Commons effort (which recently concluded). centillion was built to facilitate information-finding in a project with hundreds of people at dozens of institutions generating a sea of email threads, Google Drive folders, markdown files, websites, and Github repositories.

centillion provided a single comprehensive way of searching across All The Things and earned the author many thanks from members across the Data Commons. It is the author's hope that centillion can prove equally useful for other organizations.

Under the hood centillion uses Flask (a web server microframework) and Whoosh (a Python-based search engine tool).

You can get a copy of the latest centillion release here: https://github.com/dcppc/centillion

You can find the latest centillion documentation here: http://nih-data-commons.us/centillion/

Tags:    python    centillion    search    search engine    google drive    github    flask   

Any Color You Like, As Long As It's 00ADD8

Posted in Go


A short post with some thoughts on how writing Go code has helped me learn to stop worrying and love auto-formatting tools.

Go code is terse. Not Python-terse, but terse. And unlike Java, you don't find yourself constantly resorting to the security blanket of objects - something that Python (mercifully) can go either way on.

I used Java when I taught computer science at South Seattle College, and remember telling students once that one day, students taking CSC 142/143 would be using Go instead of Java. These days, I'm not as certain of that, but given that Go's strengths are asynchronous programming (critical for taking advantage of multicore hardware) and tasks suited for the web, it isn't hard to imagine a "Go 2.0" that becomes a de-facto standard in school curricula.

Something else I like about Go is the way there is a toolchain that adheres to the Unix tooling philosophy: do one thing and do it well. Take gofmt as an example - this is a tool that autoformats Go code to conform to the Go standard spec. gofmt is a simple tool that does just one thing. This tool can be connected to various text editors with hooks, a la vimgo.

gofmt has taught me the value, and convenience, of embracing the norms and standards set by a language's community. Go recommends using tabs, for example, which early on I found a bit repulsive. Before I had vimgo set up, I was stubbornly using spaces instead of tabs in my Go code.

But then I set up vimgo so that, every time I saved a buffer containing Go code, it would run gofmt on the code. gofmt takes over all of the nitpicky details (like how many spaces go between parentheses and variables, or whether == should be surrounded by spaces) and just makes an executive decision.

Sure, it uses tabs instead of spaces, but once you start to work on code and save it and you see all of these details just handled, you quickly learn not to worry about it.

And the surprising thing, to me, was just how much overhead I was spending on those things. It adds up.

The gofmt executive decision strategy is similar to black, "The uncompromising Python code formatter," whose slogan is "Any color you like, as long as it's black."

While I really like black and would love to let it handle all of my Python code the way gofmt handles all of my Go code, the unfortunate reality is that Python, unlike Go, does not have an official standard format, and if you automatically apply black formatting to all Python code, you can quickly wreak havoc on version-controlled code. You have to tread more lightly with black. I apply black more selectively, running it only on .py files that are in specific project subdirectories.
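A minimal sketch of that selective approach, using only the standard library. The helper names (select_python_files, build_black_command) and the subdirectory names are hypothetical, made up for illustration; the only thing assumed about black is that its command line accepts a list of file paths.

```python
from pathlib import Path


def select_python_files(repo_root, allowed_subdirs):
    """Return the sorted .py files found under the allowed subdirectories only."""
    root = Path(repo_root)
    selected = []
    for subdir in allowed_subdirs:
        base = root / subdir
        # Skip subdirectories that do not exist in this checkout.
        if base.is_dir():
            selected.extend(base.rglob("*.py"))
    return sorted(selected)


def build_black_command(files):
    """Build an argument list suitable for subprocess.run()."""
    return ["black", *[str(f) for f in files]]
```

Everything outside the listed subdirectories (vendored code, generated files, old modules with a long history) is left untouched, which keeps formatting-only diffs out of the rest of the repository.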

A slogan for gofmt could be, "Any color you like, as long as it's #00ADD8."

Wait, what? Where did #00ADD8 come from?

It's in the Go Brand Book. Prior to discovering this (the link was dropped in an unrelated discussion on the gonuts mailing list), I had no idea what a brand book was. Turns out, this is very much a thing in marketing. Companies, projects, and organizations all have brand books that lay out the details of their marketing designs, branding, looks, everything down to the fonts and colors.

The Go brand book is short, but it does specify an official color for Golang: #00ADD8. It also covers critical details about how to depict the Go gopher, including the physics of gopher belly folds:

Extremely important details

There are some other brand books out there - the Coca-Cola brand book is simultaneously fascinating and terrible, in a late stage capitalism kind of way.

At any rate, at least the Go brand book is about something useful, and contains silly things like gophers.

Gopher specs

Tags:    go    golang    rosalind    bioinformatics    black    python    gofmt   

A Few of My Favorite PEPs

Posted in Python



What's your favorite PEP?

PEPs, or Python Enhancement Proposals, are documents in which features, additions, or general ideas are proposed as additions to the core Python language.

As Python users, we believe it's important to ask questions like this.

Picking a "favorite PEP" is not just about having a ready and clever answer to a question you might expect in a technical interview; the PEP documents really are important, and really do shape where Python is today and where it will be in the future.

So let's look at a few of our favorite PEPs.

PEP 0: The PEP Index

PEP 0 - the easiest answer to the question, "what's your favorite PEP?"

PEP 0 - Index of Python Enhancement Proposals (PEPs) lists all PEPs, including PEPs about PEPs, accepted PEPs, open PEPs, finished PEPs, informational PEPs, and abandoned PEPs.

This is also a good place to search for a keyword or browse PEPs.

This PEP is the favorite of people who love enumerations, library card catalogs, biblical genealogies, and litanies.

PEP 8: The Python Style Guide

PEP 8 covers the recommended Python style. It is a surprisingly quick read.

This PEP dishes "official" opinions about controversial topics such as:

  • tabs or spaces (spoiler: spaces)
  • line width
  • whitespace
  • naming conventions for variables, classes, and modules
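A small sketch of what a few of these conventions look like in practice. The names here (MAX_RETRIES, RetryPolicy, next_delay) are hypothetical and exist only for illustration; PEP 8 itself remains the authority on the details.

```python
# Constants: UPPER_CASE_WITH_UNDERSCORES
MAX_RETRIES = 3


# Classes: CapWords
class RetryPolicy:
    """Hypothetical example class."""

    def __init__(self, base_delay):
        # Indentation: 4 spaces per level, no tabs
        self.base_delay = base_delay

    # Functions, methods, and variables: lowercase_with_underscores
    def next_delay(self, attempt):
        # Whitespace around binary operators, none just inside parentheses
        return self.base_delay * (2 ** attempt)
```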

This PEP is the chosen favorite of those programmers who keep their crayons organized in the correct color order.

PEP 20: The Zen of Python

PEP 20 contains 20 aphorisms that compose the Zen of Python - only 19 of which are contained in the PEP...

Also available from Python via:

>>> import this
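The text can also be read programmatically, which makes it easy to check the "only 19" claim. This is a sketch that relies on an implementation detail of CPython's this module: the text is stored ROT13-encoded in the attribute this.s, and importing the module prints the Zen as a side effect.

```python
import codecs
import contextlib
import io

# Importing `this` prints the Zen; capture that output so it stays quiet.
with contextlib.redirect_stdout(io.StringIO()):
    import this

# this.s holds the ROT13-encoded text (a CPython implementation detail).
zen_text = codecs.decode(this.s, "rot13")

# Drop blank lines and the title line, leaving only the aphorisms.
aphorisms = [
    line
    for line in zen_text.splitlines()
    if line.strip() and not line.startswith("The Zen of Python")
]

print(len(aphorisms))
```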

Many of the aphorisms in PEP 20 come in pairs.

The first seven alone compose an excellent philosophy of programming. Six symmetric rules:

Beautiful is better than ugly.

Explicit is better than implicit.

Simple is better than complex.

Complex is better than complicated.

Flat is better than nested.

Sparse is better than dense.

The seventh, one of the principal ideas behind Python:

Readability counts.

The next pair of aphorisms is important to our own style of programming:

Special cases aren't special enough to break the rules.

Although practicality beats purity.

The latter aphorism is an acknowledgement that, ultimately, programming is a means to an end, and Python (or whatever programming language you use) should not get in the way of reaching that end - especially not for the sake of some abstract principle or theory.

PEP 20 weighs in on errors:

Errors should never pass silently.

Unless explicitly silenced.
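One way to read this pair, sketched with the standard library. The function names (remove_if_present, parse_count) are hypothetical: the first silences exactly one expected error on purpose, and the second lets bad input raise instead of quietly returning a default.

```python
import contextlib
import os


def remove_if_present(path):
    """Delete a file; a missing file is the one error explicitly silenced."""
    with contextlib.suppress(FileNotFoundError):
        os.remove(path)


def parse_count(text):
    """Convert text to int, letting ValueError propagate to the caller."""
    return int(text)
```

The silencing is visible in the code and limited to a single exception type, which is quite different from a bare except: pass that hides everything.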

Slightly perplexing:

In the face of ambiguity, refuse the temptation to guess.

More pairs:

There should be one-- and preferably only one -- obvious way to do it.

Although that way may not be obvious at first unless you're Dutch.

From the Wikipedia page on Guido van Rossum:

Guido van Rossum is a Dutch programmer...

Now is better than never.

Although never is often better than *right* now.

That last one sounds like an excuse for procrastination.

If the implementation is hard to explain, it's a bad idea.

If the implementation is easy to explain, it may be a good idea.

Finally, the last aphorism covers the reason you never see from module import *:

Namespaces are one honking great idea - let's do more of those!

Namespaces, in this case, come from importing everything in a Python package into a particular variable name - like import itertools or import numpy as np.

It turns out that, yes, in fact, namespaces are a great idea!
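A quick sketch of what goes wrong without them. Both math and cmath define sqrt, so two star imports into the same namespace leave only one of them reachable. The star imports are run with exec() into a scratch dictionary (a choice made here purely to keep the experiment contained).

```python
import cmath
import math

# Star imports dump every public name into one namespace, so a later
# import silently replaces names defined by an earlier one.
scratch = {}
exec("from math import *\nfrom cmath import *", scratch)

# The second star import won: sqrt now refers to cmath's version.
shadowed = scratch["sqrt"] is cmath.sqrt

# With namespaces, both functions stay available and unambiguous.
real_root = math.sqrt(4)
complex_root = cmath.sqrt(-1)
```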

PEP 3099: Things That Will Not Change in Python 3000

We can't really decide what we enjoy most about PEP 3099. Maybe it's the fact that it does the opposite of what most proclamations of a new major version do, which is, to say what new features it will not have. Maybe it's the way the language's creators demonstrate how well they have learned from the mistakes of others who adopt the "Burn it to the ground and rewrite from scratch" philosophy. Or maybe it's the delightful nostalgia of "Python 3000".

In any case, PEP 3099 is an instructive read, because any feature that will explicitly be kept during a major version bump is clearly either (a) useful, (b) important, or (c) both. Additionally, it gives some insight into the design decisions made when Python was implemented ("Why does Python do X this way, instead of some other easier way?").

Not only that, you also get to walk through a graveyard of abandoned (but still interesting) ideas, and the links given in the PEP to the Python mailing list can provide additional useful information.

Addendum: PEPs Affecting 2 to 3 Changes

In contrast to PEP 3099, which contains a list of all the things that did not change in Python 3, there were a large number of PEPs that did cause Python 3 to behave differently from Python 2.
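Two well-known examples, offered as an illustration rather than a complete list: PEP 3105 turned print into a function, and PEP 238 changed the / operator to perform true division. A short sketch of both behaviors under Python 3:

```python
import io

# PEP 3105: print is an ordinary function, so it accepts keyword
# arguments such as sep and file.
buffer = io.StringIO()
print("spam", "eggs", sep=", ", file=buffer)
printed = buffer.getvalue()

# PEP 238: / is true division, // is floor division.
true_quotient = 7 / 2
floor_quotient = 7 // 2
```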

PEP 202: List Comprehensions

Of course, picking your favorite PEP can also be an opportunity to make a statement about your favorite language feature of Python, since many of Python's most useful language features got their start as PEPs.

For us, list comprehensions (covered in PEP 202) are a clear winner in any competition of most useful language features. List comprehensions are a way of shortening for-loop syntax, making it much easier to perform map and filter operations. Some examples from PEP 202:

>>> print([i for i in range(20) if i%2 == 0])
[0, 2, 4, 6, 8, 10, 12, 14, 16, 18]

>>> nums = [1, 2, 3, 4]

>>> fruit = ["Apples", "Peaches", "Pears", "Bananas"]

>>> print([(i, f) for i in nums for f in fruit])
[(1, 'Apples'), (1, 'Peaches'), (1, 'Pears'), (1, 'Bananas'),
 (2, 'Apples'), (2, 'Peaches'), (2, 'Pears'), (2, 'Bananas'),
 (3, 'Apples'), (3, 'Peaches'), (3, 'Pears'), (3, 'Bananas'),
 (4, 'Apples'), (4, 'Peaches'), (4, 'Pears'), (4, 'Bananas')]

>>> print([(i, f) for i in nums for f in fruit if f[0] == "P"])
[(1, 'Peaches'), (1, 'Pears'),
 (2, 'Peaches'), (2, 'Pears'),
 (3, 'Peaches'), (3, 'Pears'),
 (4, 'Peaches'), (4, 'Pears')]

>>> print([(i, f) for i in nums for f in fruit if f[0] == "P" if i%2 == 1])
[(1, 'Peaches'), (1, 'Pears'), (3, 'Peaches'), (3, 'Pears')]

>>> print([i for i in zip(nums, fruit) if i[0]%2==0])
[(2, 'Peaches'), (4, 'Bananas')]
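For comparison, here is a sketch of the first example above written three ways, plus a map-style transformation. The variable names are illustrative only.

```python
# The comprehension from the first example above.
evens = [i for i in range(20) if i % 2 == 0]

# The same result with an explicit for loop.
evens_loop = []
for i in range(20):
    if i % 2 == 0:
        evens_loop.append(i)

# The same result with filter(); list() is needed in Python 3
# because filter() returns a lazy iterator.
evens_filter = list(filter(lambda i: i % 2 == 0, range(20)))

# A map-style comprehension transforms every element.
squares = [i * i for i in range(5)]
squares_map = list(map(lambda i: i * i, range(5)))
```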

List comprehensions enable code to be short but expressive, brief but elegant. Brevity is the soul of wit, after all.

All the PEPs on Github

All the PEPs are available on Github.

Tags:    python    pep    computer science    programming   

May 2018

Current Projects