Table of Contents
- Starting example
- What needs to be changed
- Formatting the data
- Creating the color map
- How the calendar code works
- Final result
Starting example
Let's begin with a D3 example. Mike Bostock provided a Calendar View block illustrating how to draw a very interesting visualization of large amounts of data over time.
You might recognize this type of graph from GitHub, whose activity graph shows the same visualization.
The data shown in this example consists of several years of stock market data. It is a simple but very large data set, with each data point consisting of one date and one number (the percentage gain or loss).
The example also shows how to perform a simple calculation from multiple fields of the data to plot a derived quantity. In this case, each row of data includes an open, high, low, and close, and the quantity being plotted is the percent change from open to close.
What needs to be changed
To adapt this calendar visualization to our own data, we need to change two things:
- The data set being visualized
- The color map being used
We can leave the rest alone, or make small modifications as needed. Fortunately, these changes are straightforward to make for the calendar visualization.
Formatting the data
To modify the calendar graph for our own data, we'll output data as a time series: one column of date/time stamps, and another column of data to plot.
Let's take a look at the original data:
Date,Open,High,Low,Close,Volume,Adj Close
2010-10-01,10789.72,10907.41,10759.14,10829.68,4298910000,10829.68
2010-09-30,10835.96,10960.99,10732.27,10788.05,4284160000,10788.05
2010-09-29,10857.98,10901.96,10759.75,10835.28,3990280000,10835.28
2010-09-28,10809.85,10905.44,10714.03,10858.14,4025840000,10858.14
2010-09-27,10860.03,10902.52,10776.44,10812.04,3587860000,10812.04
In the code, we can see where this data is actually being parsed:
var data = d3.nest()
    .key(function(d) { return d.Date; })
    .rollup(function(d) { return (d[0].Close - d[0].Open) / d[0].Open; })
    .object(csv);
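For readers more comfortable outside D3, the same key-and-rollup step can be sketched in Python. This is only an illustration of what the nest produces (a mapping from each date string to one number). The names `SAMPLE_CSV` and `percent_change_by_date` are made up for this sketch, and the rows are the sample rows shown above.

```python
import csv
import io

# The five sample rows shown above, inlined so the sketch is self-contained.
SAMPLE_CSV = """Date,Open,High,Low,Close,Volume,Adj Close
2010-10-01,10789.72,10907.41,10759.14,10829.68,4298910000,10829.68
2010-09-30,10835.96,10960.99,10732.27,10788.05,4284160000,10788.05
2010-09-29,10857.98,10901.96,10759.75,10835.28,3990280000,10835.28
2010-09-28,10809.85,10905.44,10714.03,10858.14,4025840000,10858.14
2010-09-27,10860.03,10902.52,10776.44,10812.04,3587860000,10812.04
"""


def percent_change_by_date(csv_text):
    """Map each date to (Close - Open) / Open, mirroring the nest/rollup above."""
    result = {}
    for row in csv.DictReader(io.StringIO(csv_text)):
        open_price = float(row["Open"])
        close_price = float(row["Close"])
        result[row["Date"]] = (close_price - open_price) / open_price
    return result


data = percent_change_by_date(SAMPLE_CSV)
print(data["2010-10-01"])
```

Days that closed below their open come out negative, which is what lets a diverging color scale distinguish gains from losses.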
So, to modify this to suit our own custom data set, we can output our data as:
date,series1,series2
2010-10-01,1,150
2010-10-02,2,250
2010-10-03,3,350
and change the data parsing code to:
var data = d3.nest()
    .key(function(d) { return d.date; })
    .rollup(function(d) {
        // Change this depending on what you want to plot
        return d[0].series1;
    })
    .object(csv);
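One way to produce a file in this shape is Python's built-in `csv` module. The sketch below is a minimal example, not the script used for this post: `write_series_csv` is a hypothetical helper, and the rows are the three sample rows from above.

```python
import csv
import io
from datetime import date


def write_series_csv(rows, fileobj):
    """Write (date, series1, series2) rows under the header used above."""
    writer = csv.writer(fileobj, lineterminator="\n")
    writer.writerow(["date", "series1", "series2"])
    for day, series1, series2 in rows:
        # isoformat() yields YYYY-MM-DD, the date layout used in the sample data
        writer.writerow([day.isoformat(), series1, series2])


rows = [
    (date(2010, 10, 1), 1, 150),
    (date(2010, 10, 2), 2, 250),
    (date(2010, 10, 3), 3, 350),
]

# An in-memory buffer stands in for a real file here
buffer = io.StringIO()
write_series_csv(rows, buffer)
print(buffer.getvalue())
```

To write a real file, pass an object opened with `open("data.csv", "w", newline="")` instead of the buffer.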
Next, we discuss a few interesting applications of this visualization technique and how to generate the data sets.
MediaWiki Edits
One of the applications of interest was scraping a MediaWiki wiki (charlesreid1.com/wiki to be precise) to determine the number of edits made to the wiki on a given date.
Fortunately, MediaWiki provides a rich API for interacting with wikis programmatically, and one of the best packages for doing it is pywikibot.
The way we compiled the data set for visualization was to scrape page histories for every page on the wiki, creating one observation for each edit on each page, and agglomerate the edits for each day into a final count.
The schema used was:
- _id: sha1 of text
- title: title of article
- timestamp: timestamp of edit
- count: number of characters in edit
The pseudocode used to extract the page edits was:
get pages generator
for page in pages:
    get page revisions generator
    for revision in page revisions:
        drop old doc from database
        insert new doc into database
        update record
Finally, a bit of pywikibot code:
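# Get the site
site = get_site()
# Get the iterator returning pages to process
# (N, the maximum number of pages, is assumed to be defined elsewhere)
page_generator = get_page_generator(site, N)
# Run the algorithm:
for page in page_generator:
    page_title = page.title()
    print("Now parsing page: %s"%(page_title))
    # count_chars is assumed to be a boolean defined elsewhere; presumably it
    # needs to be True so that the text of each revision is retrieved
    rev_generator = page.revisions(content=count_chars)
    for rev in rev_generator:
        # Assemble the NoSQL document
        doc = {}
        doc['_id'] = rev.sha1
        doc['title'] = page_title
        doc['timestamp'] = rev.timestamp
        doc['count'] = len(rev.text)
        # Insert the new NoSQL document
        # (page_history_collection is the database collection, set up elsewhere)
        page_history_collection.insert_one(doc)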
The helper functions that obtain the site and the page generator are:
def get_site():
    """Get the Site object representing charlesreid1.com
    """
    return pywikibot.Site()

def get_page_generator(s, max_items=0):
    """Get the generator that returns the Page objects
    that we're interested in, from Site s.
    """
    page_generator = s.allpages()
    if max_items > 0:
        page_generator.set_maximum_items(max_items)
    return page_generator
Note that pywikibot will already have the site configured once you run the login script. For more information about how to log in to a wiki with Pywikibot, and general information about Pywikibot tasks, see the Pywikibot page on the charlesreid1.com wiki.
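The last step described above, agglomerating the edits for each day into a final count, could look roughly like the following. This is a sketch under assumptions: each document's `timestamp` is assumed to behave like a Python `datetime`, the sample documents are hypothetical, and the `timestamp,edits` header is chosen to match the column names that the calendar's loading code reads.

```python
from collections import Counter
from datetime import datetime


def count_edits_per_day(docs):
    """Return a sorted list of (YYYY-MM-DD, number_of_edits) pairs."""
    counts = Counter(doc["timestamp"].strftime("%Y-%m-%d") for doc in docs)
    return sorted(counts.items())


def to_csv_text(daily_counts):
    """Render the daily counts as CSV text with a timestamp,edits header."""
    lines = ["timestamp,edits"]
    lines += ["%s,%d" % (day, n) for day, n in daily_counts]
    return "\n".join(lines) + "\n"


# Hypothetical documents following the schema above (one per edit)
docs = [
    {"_id": "a", "title": "Page A", "timestamp": datetime(2014, 1, 17, 9, 30), "count": 120},
    {"_id": "b", "title": "Page A", "timestamp": datetime(2014, 1, 17, 18, 5), "count": 340},
    {"_id": "c", "title": "Page B", "timestamp": datetime(2014, 3, 26, 12, 0), "count": 75},
]

daily = count_edits_per_day(docs)
print(to_csv_text(daily))
```

Summing `doc["count"]` per day instead of counting documents would plot characters edited rather than number of edits.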
Git Commits
Another application of these types of calendars comes directly from GitHub's visualization of the number of commits made by a user on each day.
To extract this information, you will need a folder full of git repositories, which will allow you to use git log to extract commit information from the log of each repository and assemble it all into a time series for a calendar.
While there is a Python package for interfacing with the git API, git itself is extremely powerful and is capable of doing this just fine. We use Python's subprocess library to make a call to git log, and parse the results into a data structure for exporting to CSV.
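A minimal sketch of calling git from Python with `subprocess` follows. It runs `git log`, the subcommand that lists commits, and splits each output line into fields. The log format string and the helper names `git_log_lines` and `parse_log_line` are assumptions made for this sketch, not the exact commands used to build the data set.

```python
import subprocess

# Assumed output layout: "<short hash> <date> <time> <timezone> <subject>"
LOG_FORMAT = "--pretty=format:%h %ai %s"


def git_log_lines(repo_path):
    """Run git log inside repo_path and return one output line per commit."""
    result = subprocess.run(
        ["git", "-C", repo_path, "log", LOG_FORMAT],
        capture_output=True,
        encoding="utf-8",
        errors="replace",
        check=True,
    )
    return result.stdout.splitlines()


def parse_log_line(line):
    """Split one log line into the fields needed for a commit time series."""
    tokens = line.rstrip("\n").split(" ")
    return {
        "commit_id": tokens[0],
        "date": tokens[1],
        "time": tokens[2],
        # tokens[3] is the timezone offset, which is not needed here
        "msg": " ".join(tokens[4:]),
    }


# A made-up log line, parsed without needing a repository on disk
example = parse_log_line("a1b2c3d 2014-01-17 10:22:31 -0800 Fix typo in README")
print(example["date"], example["msg"])
```

Calling `git_log_lines` on each repository directory and feeding every line through `parse_log_line` gives one record per commit, ready to be grouped by date.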
Here is the code that was used to walk through each directory and extract information from the saved git log output:
import subprocess
from glob import glob
import os, re
import pandas as pd
import datetime
[clipped]
rows = []
orgs = glob(root_dir+"/repositories/*")
for org in orgs:
    base_org = os.path.basename(org)
    repos = glob(org+"/*")
    for repo in repos:
        # Print out the org and repo name
        # (strip only a trailing ".git" suffix from the directory name)
        base_repo = re.sub(r'\.git$', '', os.path.basename(repo))
        log_file = base_org + "." + base_repo + ".log"
        print("%s : %s"%(base_org,base_repo))
        # Get each commit
        with open(status_dir + "/" + log_file, 'r', encoding="ISO-8859-1") as f:
            lines = f.readlines()
        for line in lines:
            # tokens: commit id, date, time, one skipped token, message words
            tokens = line.rstrip("\n").split(" ")
            rows.append(
                dict(
                    commit_id = tokens[0],
                    date = tokens[1],
                    time = tokens[2],
                    commits = 1,
                    msg = " ".join(tokens[4:])
                )
            )
# Build the DataFrame once; DataFrame.append was removed in pandas 2.0
df = pd.DataFrame(rows)
ag = df.groupby(['date']).agg({'commits':'sum'})
ag['commits'] = ag['commits'].apply(int)
ag.to_csv('commit_counts.csv')
The last bit of code groups each commit by date, applying the sum function to the number of commits (1 for each commit), to yield the total number of commits for each date:
date,commits
2014-01-17,2
2014-03-26,11
2014-03-28,3
2014-04-01,4
2014-04-02,10
2014-04-03,4
2014-04-04,3
Creating the color map
The best part of the process is picking a color map for the calendar. The ColorBrewer site has some good color palettes inspired by cartographic color needs. Python also provides some useful libraries and functionality for generating colormaps.
There are a number of options:
Here, we'll cover an alternative approach: defining a colormap that linearly interpolates between colors at particular locations on the interval 0 to 1.
We will also use the webcolors module in Python to convert between colors in various formats, and a function make_cmap() available from Chris Slocum:
make_cmap.py:
def make_cmap(colors, position=None, bit=False):
    '''
    make_cmap takes a list of tuples which contain RGB values. The RGB
    values may either be in 8-bit [0 to 255] (in which bit must be set to
    True when called) or arithmetic [0 to 1] (default). make_cmap returns
    a cmap with equally spaced colors.
    Arrange your tuples so that the first color is the lowest value for the
    colorbar and the last is the highest.
    position contains values from 0 to 1 to dictate the location of each color.
    '''
    import sys
    import matplotlib as mpl
    import numpy as np
    bit_rgb = np.linspace(0,1,256)
    if position is None:
        position = np.linspace(0,1,len(colors))
    else:
        if len(position) != len(colors):
            sys.exit("position length must be the same as colors")
        elif position[0] != 0 or position[-1] != 1:
            sys.exit("position must start with 0 and end with 1")
    if bit:
        # Convert 8-bit values (0 to 255) to fractions (0 to 1)
        for i in range(len(colors)):
            colors[i] = (bit_rgb[colors[i][0]],
                         bit_rgb[colors[i][1]],
                         bit_rgb[colors[i][2]])
    cdict = {'red':[], 'green':[], 'blue':[]}
    for pos, color in zip(position, colors):
        cdict['red'].append((pos, color[0], color[0]))
        cdict['green'].append((pos, color[1], color[1]))
        cdict['blue'].append((pos, color[2], color[2]))
    cmap = mpl.colors.LinearSegmentedColormap('my_colormap',cdict,256)
    return cmap
Now, an example of how we can call this function: this code creates a colormap ranging from purple to orange.
import webcolors

def purple_to_orange():
    """Dark purple to bright orange."""
    start_hex = "#4d2b4b"
    mid1_hex = "#8c6bb1"
    mid2_hex = "#fdae6b"
    end_hex = "#f16913"
    # Convert each hex color to RGB fractions between 0 and 1
    start_color = [j/255 for j in webcolors.hex_to_rgb(start_hex)]
    mid1_color = [j/255 for j in webcolors.hex_to_rgb(mid1_hex)]
    mid2_color = [j/255 for j in webcolors.hex_to_rgb(mid2_hex)]
    end_color = [j/255 for j in webcolors.hex_to_rgb(end_hex)]
    colors = [start_color, mid1_color, mid2_color, end_color]
    position = [0, 0.5, 0.6, 1]
    cm = make_cmap(colors, position=position)
    return cm
Now, the following code evaluates the colormap at 10 evenly spaced points from 0 to 1 to create 10 separate hex colors:
# Now just call cm(0.0) thru cm(1.0)
cm = purple_to_orange()
N = 9
hex_colorz = []
for i in range(N+1):
    x = i/N
    rgbd_color = cm(x)[0:3]
    rgb_color = [int(c*255) for c in rgbd_color]
    hex_color = webcolors.rgb_to_hex(rgb_color)
    hex_colorz.append(hex_color)
print(hex_colorz)
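To make the idea of interpolating between colors at particular positions concrete, here is a dependency-free sketch of the same calculation. It is not how matplotlib implements it: matplotlib samples the colormap into a lookup table, so its hex values can differ slightly from exact interpolation. The helper names are made up for this sketch, and the stops reuse the four colors and positions from the purple-to-orange example.

```python
def hex_to_rgb(hex_color):
    """Convert '#rrggbb' to an (r, g, b) tuple of ints in 0..255."""
    h = hex_color.lstrip("#")
    return tuple(int(h[i:i + 2], 16) for i in (0, 2, 4))


def rgb_to_hex(rgb):
    """Convert an (r, g, b) tuple (ints or floats in 0..255) to '#rrggbb'."""
    return "#%02x%02x%02x" % tuple(int(round(c)) for c in rgb)


def interpolate(stops, x):
    """Linearly interpolate between (position, rgb) stops at x in [0, 1]."""
    for (p0, c0), (p1, c1) in zip(stops, stops[1:]):
        if p0 <= x <= p1:
            t = (x - p0) / (p1 - p0)  # fraction of the way from c0 to c1
            return tuple(a + t * (b - a) for a, b in zip(c0, c1))
    raise ValueError("x must lie between the first and last stop positions")


stops = [
    (0.0, hex_to_rgb("#4d2b4b")),
    (0.5, hex_to_rgb("#8c6bb1")),
    (0.6, hex_to_rgb("#fdae6b")),
    (1.0, hex_to_rgb("#f16913")),
]

print(rgb_to_hex(interpolate(stops, 0.0)))
print(rgb_to_hex(interpolate(stops, 1.0)))
```

Placing two stops close together (0.5 and 0.6 here) is what produces the quick jump from purple to orange in the middle of the range.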
How the calendar code works
Start with the HTML document:
<!DOCTYPE html>
<body style="background: #272b30;">
<div id="calendar"></div>
<script src="https://d3js.org/d3.v4.min.js"></script>
<script>
/* D3 code goes here */
</script>
</body>
Next, the JavaScript code that creates the calendar visualization. We'll walk through each part.
var width = 700,
    height = 90,
    cellSize = 12;
// big integers
var formatStuff = d3.format(",");
/*
TEH COLORRRZZZZ
*/
var realBackgroundColor = "#272b30";
var tileBackgroundColor = realBackgroundColor;//"#3a3a3a";
var tileStrokeColor = "#3a3a3a";
var monthStrokeColor = "#4a4a4a";
var color = d3.scaleQuantize()
    .domain([0, 60])
    .range(["#4d2b4b","#5a3961","#684777","#77558f","#8463a5","#cc9189","#fba25c","#f78e43","#f47b2b","#f16913"]);
// purple orange
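To see what the quantize scale does with a value, the following Python sketch mimics its behavior as I understand it: the domain is split into as many equal bins as there are colors, and values outside the domain fall into the first or last bin. `make_quantize_scale` is a hypothetical helper written for this illustration, not part of D3.

```python
from bisect import bisect_right

COLORS = ["#4d2b4b", "#5a3961", "#684777", "#77558f", "#8463a5",
          "#cc9189", "#fba25c", "#f78e43", "#f47b2b", "#f16913"]


def make_quantize_scale(domain, colors):
    """Return a function mapping a number to one of the colors."""
    lo, hi = domain
    n = len(colors)
    # n - 1 evenly spaced thresholds split the domain into n equal bins
    thresholds = [lo + (hi - lo) * (i + 1) / n for i in range(n - 1)]

    def scale(value):
        # bisect_right counts how many thresholds are <= value
        return colors[bisect_right(thresholds, value)]

    return scale


color = make_quantize_scale((0, 60), COLORS)
print(color(3), color(30), color(75))
```

With a domain of 0 to 60 and ten colors, each color covers a band of six units, so any day with 54 or more edits gets the brightest orange.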
The canvas goes on the div tag with id calendar:
/*
Make the canvas
*/
var svg = d3.select("div#calendar")
    .selectAll("svg")
    .data(d3.range(2010, 2019).reverse())
    .enter().append("svg")
    .attr("width", width)
    .attr("height", height)
    .append("g")
    .attr("transform", "translate(" + ((width - cellSize * 53) / 2) + "," + (height - cellSize * 7 - 1) + ")");
/*
Write the years
*/
svg.append("text")
    .attr("transform", "translate(-6," + cellSize * 3.5 + ")rotate(-90)")
    .attr("font-family", "sans-serif")
    .attr("font-size", 10)
    .attr("fill", "#bbb")
    .attr("text-anchor", "middle")
    .text(function(d) { return d; });
The next two portions are the meat of the calendar visualization, drawing the tiles and outlines:
/*
Draw the tiles representing days of the year
(also draw tile outlines)
*/
var rect = svg.append("g")
    .attr("fill", tileBackgroundColor)
    .attr("stroke", tileStrokeColor)
    .selectAll("rect")
    .data(function(d) { return d3.timeDays(new Date(d, 0, 1), new Date(d + 1, 0, 1)); })
    .enter().append("rect")
    .attr("width", cellSize)
    .attr("height", cellSize)
    .attr("x", function(d) { return d3.timeWeek.count(d3.timeYear(d), d) * cellSize; })
    .attr("y", function(d) { return d.getDay() * cellSize; })
    .datum(d3.timeFormat("%Y-%m-%d"));
/*
Draw outlines of groups representing months
*/
svg.append("g")
    .attr("fill", "none")
    .attr("stroke", monthStrokeColor)
    .selectAll("path")
    .data(function(d) { return d3.timeMonths(new Date(d, 0, 1), new Date(d + 1, 0, 1)); })
    .enter().append("path")
    .attr("d", pathMonth);
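The tile placement above can be restated outside D3: each day's column is its Sunday-based week of the year and its row is its day of the week, both multiplied by the cell size. The Python sketch below reproduces that arithmetic as I read it from the code. The function names are made up for this illustration, and the default `cell_size` of 12 matches the `cellSize` variable.

```python
from datetime import date


def sunday_weekday(day):
    """Day of week with Sunday = 0 ... Saturday = 6, like JavaScript's getDay()."""
    return (day.weekday() + 1) % 7


def week_of_year(day):
    """Number of Sunday-started weeks between January 1 and the given day."""
    jan1 = date(day.year, 1, 1)
    days_since_jan1 = (day - jan1).days
    return (days_since_jan1 + sunday_weekday(jan1)) // 7


def tile_position(day, cell_size=12):
    """Return the (x, y) offset of the tile for the given day."""
    return (week_of_year(day) * cell_size, sunday_weekday(day) * cell_size)


print(tile_position(date(2014, 1, 1)))
print(tile_position(date(2014, 1, 5)))
```

January 1, 2014 was a Wednesday, so its tile sits in the first column, three rows down; the following Sunday starts the second column at the top.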
Now, the code that loads the data, filters it, performs any calculations, and draws colored rectangles on top of the baseline square grid:
/*
Load up the csv file
*/
d3.csv("page_edits.csv", function(error, csv) {
    if (error) throw error;
    /*
    This is where you decide what values to plot
    */
    var data = d3.nest()
        .key(function(d) { return d.timestamp ; })
        .rollup(function(d) {
            return d[0].edits;
        })
        .object(csv);
    rect.filter(function(d) { return d in data; })
        .attr("fill", function(d) { return color(data[d]); })
        .append("title")
        .text(function(d) { return d + ": " + formatStuff(data[d]); });
});
Finally, the most mysterious bit of magic in this code is the code that draws the squares around the months. This has to use the coordinates of the beginning and end of the months to draw a complicated square path.
It's magic, it works, we're happy.
function pathMonth(t0) {
    var t1 = new Date(t0.getFullYear(), t0.getMonth() + 1, 0),
        d0 = t0.getDay(), w0 = d3.timeWeek.count(d3.timeYear(t0), t0),
        d1 = t1.getDay(), w1 = d3.timeWeek.count(d3.timeYear(t1), t1);
    return "M" + (w0 + 1) * cellSize + "," + d0 * cellSize
        + "H" + w0 * cellSize + "V" + 7 * cellSize
        + "H" + w1 * cellSize + "V" + (d1 + 1) * cellSize
        + "H" + (w1 + 1) * cellSize + "V" + 0
        + "H" + (w0 + 1) * cellSize + "Z";
}
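The path is less mysterious once the numbers are printed. The Python port below follows the same steps as pathMonth, as far as I can tell: find the week column and weekday row of the first and last day of the month, then trace the outline with horizontal (H) and vertical (V) moves. The function names and the default `cell_size` of 12 are assumptions for this sketch.

```python
import calendar
from datetime import date


def sunday_weekday(day):
    """Day of week with Sunday = 0 ... Saturday = 6, like JavaScript's getDay()."""
    return (day.weekday() + 1) % 7


def week_of_year(day):
    """Number of Sunday-started weeks between January 1 and the given day."""
    jan1 = date(day.year, 1, 1)
    return ((day - jan1).days + sunday_weekday(jan1)) // 7


def path_month(first_day, cell_size=12):
    """Return the SVG path string outlining the month that starts on first_day."""
    days_in_month = calendar.monthrange(first_day.year, first_day.month)[1]
    last_day = date(first_day.year, first_day.month, days_in_month)
    d0, w0 = sunday_weekday(first_day), week_of_year(first_day)
    d1, w1 = sunday_weekday(last_day), week_of_year(last_day)
    return (
        "M%d,%d" % ((w0 + 1) * cell_size, d0 * cell_size)    # top of first day's right edge
        + "H%dV%d" % (w0 * cell_size, 7 * cell_size)         # left, then down to the bottom row
        + "H%dV%d" % (w1 * cell_size, (d1 + 1) * cell_size)  # across, then up to below the last day
        + "H%dV%d" % ((w1 + 1) * cell_size, 0)               # right, then up to the top row
        + "H%dZ" % ((w0 + 1) * cell_size)                    # back toward the start and close
    )


print(path_month(date(2014, 1, 1)))
```

The notches at the start and end of the outline come from months that begin or end mid-week, which is why the first and last columns are only partly enclosed.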
Final result
The finished product, visualizing edits to charlesreid1.com/wiki/ and commits to git.charlesreid1.com, can be seen at the following links:
- Wiki edits calendar: https://charlesreid1.com/wiki_calendar
- Git commits calendar: https://git.charlesreid1.com/git_calendar