charlesreid1.com blog

The Z-Machine: A Simple Turing Machine

Posted in Computer Science

permalink

Table of Contents

Background

Recently I discovered the wonderful blog of John Graham-Cumming. One of his posts, from 2013, describes a question he had to answer for the Oxford University Department of Computer Science's "interviews" (which, I believe, are a kind of final examination "interview" to graduate, not an interview for admittance to the program).

The Z-Machine: Setup

Here is the problem setup:

Suppose you have a computer with a very simple memory layout. The memory consists of a series of numbered locations, each of which can store a single number. These numbers are integers: positive, negative, or zero. Here is an illustration of an example of this memory layout:

[Figure: Z-Machine memory layout]

The Z-Machine: Instructions

[Figure: Z-Machine instruction set]

The machine can only perform three instructions: Zero (Z), Increment (I), and Jump (J).

The Z operator zeros out a location in memory. The operation specifies which index should be zeroed out. For example, Z4 will zero out index 4 (which is the 5th item in memory, since indexing starts at 0).

The I operator increments the value at a location in memory by 1. The operation specifies which index should be incremented. For example, I6 will increment index 6 (the 7th item in memory) by 1.

The J operator compares two locations in memory. If the values are different, the jump operator will branch (that is, jump to a different location in the code). The two locations are specified when calling the operator, and an arrow (or operation number) indicates where the operator should branch TO if the values are not the same. If the values are the same, the code continues.

The program stops when it reaches the end of the instruction list.

Simple Example: Loop

Here is an example of a loop program. This program sets memory index 4 to zero, then increments it until it is equal to the value in memory index 20:

001   Z4
002   I4
003   J4,20 --> 002

The instruction J4,20 --> 002 indicates that the values in cell 4 and cell 20 should be compared, and if they are not equal, the machine should jump to instruction 002.
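To make these programs easy to check, here is a minimal sketch of a Z-Machine interpreter in Python. The encoding is my own convention, not part of the original problem: each instruction is a tuple, and jump targets are 0-based indices into the program list.

def zm_run(program, memory, max_steps=100_000):
    """Run a Z-Machine program against a memory list, in place.

    ("Z", i) zeroes cell i; ("I", i) increments cell i;
    ("J", a, b, t) jumps to instruction t if cells a and b differ.
    """
    pc, steps = 0, 0
    while pc < len(program):
        steps += 1
        if steps > max_steps:
            raise RuntimeError("program appears to loop forever")
        op = program[pc]
        if op[0] == "Z":
            memory[op[1]] = 0
        elif op[0] == "I":
            memory[op[1]] += 1
        elif op[0] == "J" and memory[op[1]] != memory[op[2]]:
            pc = op[3]
            continue
        pc += 1
    return memory

# The loop example above: 001 Z4 / 002 I4 / 003 J4,20 --> 002
mem = [0] * 21
mem[20] = 5
zm_run([("Z", 4), ("I", 4), ("J", 4, 20, 1)], mem)
print(mem[4])   # 5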

Implementing an Addition Operation on the Z-Machine

Graham-Cumming includes the following programming challenge in his blog post:

Suppose a machine has two numbers in the first two locations in memory. Utilize these three operations to add the two numbers together and put the result into the third location in memory.

Under what circumstances does the program fail?

The Solution Approach (The Maths)

To approach the solution, start with the maths. What we're doing is trying to define a "complex" arithmetical operation (addition) from simpler "unit" operations (increment by one), so it will be helpful to wipe our mental slate clean and start at the very beginning of the problem.

When I teach a math class, whether it be a developmental math class, an algebra class, or a calculus class, I always spend the first "full" lecture by guiding the students through this very procedure. Here's how I set the tone: "Imagine that you are stranded on a desert island, with no calculators, no math books, nothing but your fingers and toes. Now suppose you are tasked with reinventing all of mathematics, entirely from scratch. How would you do it?"

This is a challenging task - and part of the challenge is just knowing where to begin (just how clean should you wipe the mental slate?). The Z-Machine problem formulation resolves that problem by explicitly enumerating valid operations. But let's continue with the desert island analogy for a bit.

If we begin at what is truly the beginning, we can start with a single unit, the number 1. (If we want to fast forward through thousands of years of human history, we can instead start with the number 0 in addition to the number 1.) Having only a single number is boring, because we can't count anything. We need a way to generate more numbers. So, we begin by defining an increment operation. We begin with the unit, 1. We declare that we can combine 1 with any other number. When we combine 1 with another 1, we get a new, larger number - which we arbitrarily call two, and represent using this funny squiggle: 2.

Now that we have defined the increment operation, adding a unit, we can begin to generate new numbers. We start with 1+1, which gives 2. The next number can be found by adding 1 to 2, which gives us a new number that we arbitrarily call three, and represent with a funny squiggle: 3.

We continue in this manner, until we reach 9, and run out of squiggles to write. The next number we will get is a special number, because it is equal to the total number of fingers. When we get to 9, and add one more, we get "two hands", which we arbitrarily call ten. If we want to keep counting beyond ten, we're stuck, because we've run out of fingers. But we can take a shortcut - we can let one toe represent "two hands". So, we hold up one toe, to represent ten. To write ten, we can let the first digit represent how many toes we are holding up, and the second digit represent how many fingers we are holding up. That means we can write our "two hands" quantity as 10 - one toe, no fingers.

We can keep on incrementing by 1, and using this system we can count all the way up to 99, at which point we will need another pair of hands or feet to keep generating new numbers, or we can suppose that after counting to 99, we are able to hold numbers in our head.

But once again, we're generating numbers slowly. We want a way to generate more numbers, faster, so we can count higher. So, we define a new addition operation. Rather than adding 1, we define the general operation of addition recursively. To add two numbers like a and b, we can define this addition in terms of a unit increment:

$$ a + b = a + \underbrace{1 + 1 + 1 + \dots + 1}_{b~\text{times}} $$

We increment the quantity a by 1, b times. This gives us a way to add arbitrary numbers together, so now we can reach much larger numbers by taking the largest number that we can count to, and adding that number to itself.

Extending this approach can lead us from an increment operation (performed b times) to an addition operation (+b).

It can also lead from an addition operation (performed b times) to a multiplication operation (*b).

Extending the idea further, we can apply the multiplication operation (performed b times) and obtain an exponentiation operation (^b).

This recursive definition of new operations can continue as long as we'd like: applying the exponentiation operation b times yields tetration (^b^b^b^b^b...^b).
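Here is a quick Python sketch of this tower of definitions (the function names are mine), where each operation is just the previous one applied b times:

def increment(a):
    return a + 1

def add(a, b):
    for _ in range(b):            # increment a, b times
        a = increment(a)
    return a

def multiply(a, b):
    total = 0
    for _ in range(b):            # add a to a running total, b times
        total = add(total, a)
    return total

def power(a, b):
    total = 1
    for _ in range(b):            # multiply by a, b times
        total = multiply(total, a)
    return total

print(add(2, 3), multiply(2, 3), power(2, 3))   # 5 6 8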

But let's get back to addition.

Solution 1: Positive Integers Only

Adding two positive integers is the simplest case. Essentially, we just perform two loops: the first loop increments the result and a first temporary variable until the temporary variable equals the number at index 0. The second loop increments the result and a second temporary variable until that temporary variable equals the number at index 1.

001   Z2                    // clear space for the result
002   Z3                    // clear space for temp variable 1
003   I2                    // increment result
004   I3                    // increment temp variable 1
005   J3,0 --> 003          // loop until temp variable 1 equals index 0
006   Z4                    // clear space for temp variable 2
007   I2                    // increment result
008   I4                    // increment temp variable 2
009   J4,1 --> 007          // loop until temp variable 2 equals index 1
010   Z3                    // clean up

(Because we only have an increment operation at our disposal, there is no way for us to deal with negative numbers. Ditto for non-integer real numbers.)

This method will fail when either of the two numbers we are adding is zero.
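Feeding this program to the zm_run sketch from earlier (with 0-based jump targets) confirms both the result and the failure mode:

add_program = [
    ("Z", 2), ("Z", 3),                  # 001-002: clear result, temp 1
    ("I", 2), ("I", 3), ("J", 3, 0, 2),  # 003-005: first loop
    ("Z", 4),                            # 006: clear temp 2
    ("I", 2), ("I", 4), ("J", 4, 1, 6),  # 007-009: second loop
    ("Z", 3),                            # 010: clean up
]

mem = [3, 4] + [0] * 8
zm_run(add_program, mem)
print(mem[2])   # 7

# With a zero in cell 0, the comparison at 005 never matches:
# zm_run(add_program, [0, 4] + [0] * 8)   # loops forever (RuntimeError)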

Solution 2: Dealing With Zeros

A second, more challenging case is dealing with possible zeros in the first or second position. The algorithm above increments the result and the temporary variable at least once (similar to a do-while loop structure), so the comparison operations J3,0 and J4,1 will never find a match if either cell 0 or cell 1 holds a zero.

Here is code that deals more gracefully with a zero in either the first or second position. It uses some extra space in memory: the result goes in index 3, indices 4 and 5 are counters, indices 6 and 7 are flags that track whether index 0 or index 1 holds a zero, and indices 8 and 9 hold the constants zero and one.

// initialize
001     Z3 // result
002     Z4 // temp 0
003     Z5 // temp 1
004     Z6 // is index 0 a zero?
005     Z7 // is index 1 a zero?
006     Z8 // zero
007     Z9 // one
008     I9

// increment by amount in index 0
009     J0,8 --> 014 // if index 0 is nonzero, go to the loop test at 014
010     I6           // index 0 is zero: set the flag
011     J6,8 --> 014 // if the flag is set, skip the loop body
012     I4           // increment temp 0
013     I3           // increment result
014     J0,4 --> 011 // loop until temp 0 equals index 0

// increment by amount in index 1
015     J1,8 --> 020 // if index 1 is nonzero, go to the loop test at 020
016     I7           // index 1 is zero: set the flag
017     J7,8 --> 020 // if the flag is set, skip the loop body
018     I5           // increment temp 1
019     I3           // increment result
020     J1,5 --> 017 // loop until temp 1 equals index 1

// clean up
021     Z4
022     Z5
023     Z6
024     Z7

The central idea behind this algorithm: we keep incrementing the target cell while a condition is false, and the condition we check is based on a separate, independent counter. That allows us to start and stop incrementing correctly based on the two numbers in index 0 and index 1. (We don't want the final result cell to be involved in the condition check.)

This pattern can also be extended to add an arbitrary number of numbers: add one more counter and one more "is zero" flag for each new number being added to the total, plus another block of six instructions like the ones above. Each block checks whether the number it is adding is zero, and if not, increments the result that many times.
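Running this program through the zm_run sketch from earlier (0-based jump targets) shows that zeros are handled correctly:

robust_add = [
    ("Z", 3), ("Z", 4), ("Z", 5), ("Z", 6), ("Z", 7),  # 001-005
    ("Z", 8), ("Z", 9), ("I", 9),                      # 006-008
    ("J", 0, 8, 13), ("I", 6), ("J", 6, 8, 13),        # 009-011
    ("I", 4), ("I", 3), ("J", 0, 4, 10),               # 012-014
    ("J", 1, 8, 19), ("I", 7), ("J", 7, 8, 19),        # 015-017
    ("I", 5), ("I", 3), ("J", 1, 5, 16),               # 018-020
    ("Z", 4), ("Z", 5), ("Z", 6), ("Z", 7),            # 021-024
]

for a, b in [(3, 4), (0, 4), (3, 0), (0, 0)]:
    mem = [a, b] + [0] * 8
    zm_run(robust_add, mem)
    print(a, b, mem[3])   # prints a, b, and a + b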

Implementing a Decrement Operator on the Z-Machine

Suppose an operator places a number into cell 0 of the Z-Machine's memory. We require that the Z-Machine subtract 1 from that number and place the result in cell 1.

The pseudocode approach here is to increment two cells in a staggered order: cell 2, a counter, is incremented, and the program then checks whether cell 2 is equal to cell 0. If it is not, the program increments cell 1 and repeats. When cell 2 becomes equal to cell 0, the program stops before cell 1 is incremented, leaving cell 1 one less than the original value in cell 0.

001   Z1            // decrement result
002   Z2            // counter (starts at zero)
003   Z3            // constant one
004   I3
005   J2,3 --> 007  // 0 != 1: always jumps, skipping 006 on the first pass
006   I1            // increment result
007   I2            // increment counter
008   J2,0 --> 006  // loop until counter equals cell 0

This pseudocode uses hard-coded constants (zero and one) to build a jump whose two operands are never equal, so the jump is always taken. This allows the machine to skip a line of code (instruction 006, on the first pass) and perform the increment operations in a staggered manner, as described above.
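This staggered behavior can be checked with the zm_run sketch from earlier:

decrement = [
    ("Z", 1), ("Z", 2), ("Z", 3), ("I", 3),  # 001-004
    ("J", 2, 3, 6),                          # 005: 0 != 1, always jumps
    ("I", 1),                                # 006: increment result
    ("I", 2),                                # 007: increment counter
    ("J", 2, 0, 5),                          # 008: loop until counter == cell 0
]

mem = [5, 0, 0, 0]
zm_run(decrement, mem)
print(mem[1])   # 4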

Implementing a Less Than Operator on the Z-Machine

Another challenging operation to implement with the Z-Machine is a comparison operator. Suppose that an operator places two numbers into the first two memory positions of the Z-Machine. That is, index 0 contains a number A, and index 1 contains a number B. Supposing these numbers are both natural numbers (either 0 or positive integers), a comparison operator will compare the two numbers, select the smaller of the two numbers, and place it into the third position in memory (index 2).

The pseudocode approach to implementing the comparison operator is to create a counter that starts at zero and check whether it is equal to A or B (the numbers at index 0 and index 1). If we keep incrementing the counter, checking it against A and B, and stop as soon as it reaches either one, we are guaranteed to stop when the counter reaches the smaller of the two numbers.

To leave the smaller of the two numbers in the memory cell at index 2, we can use the following Z-Machine code. It continually checks whether the number at index 2 is equal to the number at index 0 or the number at index 1, increments it if neither matches, and stops when it reaches the smaller of the two.

001     Z2 // smaller of the two numbers
002     Z3 // constant zero
003     Z4 // constant one
004     I4

005     J0,2 --> 007 // if index 0 != result, keep checking
006     J3,4 --> 011 // result reached index 0: done (0 != 1, always jumps)
007     J1,2 --> 009 // if index 1 != result, keep going
008     J3,4 --> 011 // result reached index 1: done
009     I2           // increment result
010     J3,4 --> 005 // always jump back to the top of the loop
011     Z4           // clean up

Note that this code successfully handles the case where either number (or both) is 0 or any positive integer.
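The zm_run sketch from earlier verifies this, including the zero cases:

minimum = [
    ("Z", 2), ("Z", 3), ("Z", 4), ("I", 4),  # 001-004
    ("J", 0, 2, 6), ("J", 3, 4, 10),         # 005-006
    ("J", 1, 2, 8), ("J", 3, 4, 10),         # 007-008
    ("I", 2), ("J", 3, 4, 4),                # 009-010
    ("Z", 4),                                # 011
]

for a, b in [(3, 5), (5, 3), (0, 7), (2, 2)]:
    mem = [a, b, 0, 0, 0]
    zm_run(minimum, mem)
    print(a, b, mem[2])   # prints a, b, and min(a, b)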

Who Cares? (Or, How To Build A Computer)

This whole exercise may appear, at first glance, to be an exercise in trivial pursuit. Why bother reinventing the wheel? Isn't this nothing more than an entertaining puzzle?

To the contrary - the process of assembling a sequence of simple operations into a cascade of more complex operations is precisely how computational devices are assembled from circuit components. For example, a flip flop circuit utilizes a pair of NOR (negation of OR) gates to store bits. The Apollo Guidance Computer was composed entirely of NOR gates.

In fact, the Apollo Guidance Computer is a fantastic example of a computational device constructed from a set of such simple instructions as the ones available in the Z-Machine. A few example operations from the Wikipedia article on the Apollo Guidance Computer:

AD (add)
    Add the contents of memory to register A and store the result in A.

INDEX
    Add the data retrieved at the address specified by the instruction 
    to the next instruction. INDEX can be used to add or subtract 
    an index value to the base address specified by the operand 
    of the instruction that follows INDEX. This method is used 
    to implement arrays and table look-ups.

It is not unusual for a hardware platform to have a small set of basic commands or instructions that it can carry out, and for that set of instructions to differ from hardware platform to hardware platform. Designing a new computational device requires the system designer to adapt to the hardware's capabilities - not the other way around. For that reason, it is important to keep those engineering and puzzle-solving skills sharp. You never know when you'll be designing a new computing device.

Sources

  1. "The Two Problems I Had To Solve In My Oxford Interview." John Graham-Cumming. Published 2 May 2013. Accessed 24 April 2017. <http://blog.jgc.org/2013/05/the-two-problems-i-had-to-solve-in-my.html>

  2. "Flip Flop (electronics)." Wikipedia. The Wikimedia Foundation. Edited 13 April 2017. Accessed 24 April 2017. <https://en.wikipedia.org/wiki/Flip-flop_%28electronics%29>

  3. "Apollo Guidance Computer." Wikipedia. The Wikimedia Foundation. Edited 5 April 2017. Accessed 24 April 2017. <https://en.wikipedia.org/wiki/Apollo_Guidance_Computer>

Tags:    turing machine    computer science    computer engineering    apollo    assembly   

AWSome Day Seattle Notes: Part 2: Networking, Security, and Miscellany

Posted in Charlesreid1

permalink

These notes are also available on git.charlesreid1.com at https://git.charlesreid1.com/charlesreid1/aws.

AWSome Day Notes: Part 2: Networking, Security, and Miscellany

Following are some notes from Amazon's AWSome Day (Tuesday, February 27, 2018).

Nomenclature

Elastic: You'll see the word "elastic" on a lot of Amazon's services. The "elastic" concept refers to a service that can handle huge increases in traffic (think Pokemon Go, whose user base grew orders of magnitude faster and larger than what its designers planned for).

Virtual Private Cloud (VPC): The AWS equivalent of a virtual private network (VPN). A VPC is a virtual network that allows a given set of nodes in the same region to communicate privately.

Services

This document gives a brief summary of some popular cloud services. The model for most of these technologies: the Apache Software Foundation releases an open-source big data project, which anyone can install and run. Since most people are using cloud providers anyway, the cloud providers offer their own ready-to-go implementations of these services, running stable versions of the Apache software wrapped by the cloud provider's API. This is a win-win: you don't have to fiddle with wrangling servers, and the provider can use its resources more wisely.

As an example, Apache Kafka is software for handling message streams. (Like a giant digital mail room - some services broadcast/publish messages, some services receive/subscribe to messages.) You can install Kafka locally or on a cluster, or rent cloud nodes and install it yourself. Or you can use Kinesis on AWS, or PubSub on Google Cloud Platform (GCP), both of which are elastic (completely transparent to you) and charge for data throughput instead of compute time. Code written against Kafka's publish/subscribe model can generally be ported to these services with little modification.

A few other important services, listing the Apache, AWS, and GCP equivalents:

Apache    AWS                GCP          Description
Hadoop    Kinesis Analytics  Dataproc     Running data-intensive parallel jobs
Spark     Kinesis Analytics  Dataproc     Running data-intensive parallel jobs
HDFS      S3                 GC Storage   Object-based file storage
Beam      Kinesis Streams    Dataflow     Stream/batch data processing pipelines
Impala    Redshift/Athena    BigQuery     SQL queries on massive (>1 PB) data sets
Kafka     Kinesis            PubSub       Message streaming

There are many other cloud services, some without a corresponding Apache project (e.g., Google's Machine Learning APIs or Amazon's text-to-speech API), but the services above are common in big data ecosystems.

Networking

Cloud networking is like the condiment bar of cloud providers - customers don't pay for it, but they can help themselves.

Why set up a VPC?

  • Scaling - having the ability to connect nodes via network means you can scale up client-server services (e.g., databases/web servers)
  • Security - VPC traffic is isolated and not visible to outsiders. Services can be set up to listen only for traffic from the VPC. You can also connect from an outside box (e.g., your laptop) to the VPC using a VPN client.
  • Learning - you have to deal with some nitty gritty details, but learning how to set up virtual networks gives you a real education in network security and in how the internet works.

You can still accomplish a lot even with simple networking patterns.

Making a VPC: Plan

To create a VPC, you first define the entire VPC network, then define a subnet on the network. The subnet must have an internet gateway, a routing table, and DHCP/DNS added to it so that nodes on the subnet can access the outside internet and find each other.

Here is a visual depiction of the architecture:

         +--------------------------------------------------------------------------+
         | Whole Internet                                                           |
         |                                                                          |
         |                                                                          |
         |  +--------------------------------------------------------------------+  |
         |  |  Amazon                                                            |  |
         |  |                                                                    |  |
         |  |                                                                    |  |
         |  |                                                                    |  |
         |  |   +---------------------------------------------------+            |  |
         |  |   |   Virtual Private Cloud: WAN                      |            |  |
         |  |   |                                             +-----+-----+      |  |
         |  |   |   Network IP Block: 10.117.0.0/16           | Internet  |      |  |
         |  |   |                     10.117.*.*              | Gateway   |      |  |
         |  |   |                                             +-----+-----+      |  |
         |  |   |    +----------------------------------+           |            |  |
         |  |   |    |  VPC Subnet: LAN                 |           |            |  |
         |  |   |    |                                  |     +-----+-----+      |  |
         |  |   |    |  Subnet IP Block: 10.117.0.0/24  |     |  Routing  |      |  |
         |  |   |    |                   10.117.0.*     |     |  Table    |      |  |
         |  |   |    |                                  |     +-----+-----+      |  |
         |  |   |    |                                  |           |            |  |
         |  |   |    +----------------------------------+           |            |  |
         |  |   |                                             +-----+-----+      |  |
         |  |   |                                             |   DHCP    |      |  |
         |  |   |                                             +-----+-----+      |  |
         |  |   |                                                   |            |  |
         |  |   +---------------------------------------------------+            |  |
         |  |                                                                    |  |
         |  +--------------------------------------------------------------------+  |
         |                                                                          |
         +--------------------------------------------------------------------------+

Making a VPC: Network and Subnet

To specify the network and subnet IP ranges, you use CIDR notation: an IP address with zeroes in the assignable blocks, plus a suffix indicating how many leading bits are fixed. (For reference, A.B.C.D/8 leaves the last 3 blocks assignable (B.C.D); A.B.C.D/16 leaves the last 2 blocks (C.D); and A.B.C.D/24 leaves only the last block (D).)

The network IP address can be specified as:

10.X.0.0/16

where X is any number between 0 and 255 (256 values in total, since IP addresses are just 32-bit strings: 4 blocks of 8 bits each, and 2^8 = 256). The subnet can be specified with the CIDR IP range:

10.X.Y.0/24

where X and Y are integers between 0 and 255. This means any node joining this subnet will have an IP address of the form 10.X.Y.*. Two IP addresses with the same X value are on the same VPC network; two IP addresses with the same X and Y values are on the same subnet.

In the example above, X = 117, and we have the VPC defined by:

10.117.0.0/16

and the subnet defined by:

10.117.0.0/24

Making a VPC: The Essentials

Once you've added the VPC and the subnet, you'll also need to add three essential services to the VPC network:

  • Internet gateway - this is something you create from the VPC section of the AWS console. It's like your home wifi router that's connected to the internet: it is the VPC network's router, and it is connected to the internet.
  • Routing table - this tells computers on the network how to find each other and how to find the gateway.
  • DHCP - Dynamic Host Configuration Protocol, the service that hands out IP addresses
  • (Bundled with DHCP) DNS - Domain Name System, used to turn domain names into IP addresses

Now you should be good to go. (Scratch that - make a security group first.)
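Here is a minimal boto3 sketch of that whole sequence, using the 10.117.0.0/16 network from the diagram. The region is an assumption, and AWS attaches a default DHCP option set (with DNS) to new VPCs automatically:

import boto3

ec2 = boto3.client("ec2", region_name="us-west-2")   # assumed region

# VPC and subnet
vpc_id = ec2.create_vpc(CidrBlock="10.117.0.0/16")["Vpc"]["VpcId"]
subnet_id = ec2.create_subnet(VpcId=vpc_id,
                              CidrBlock="10.117.0.0/24")["Subnet"]["SubnetId"]

# Internet gateway, attached to the VPC
igw = ec2.create_internet_gateway()
igw_id = igw["InternetGateway"]["InternetGatewayId"]
ec2.attach_internet_gateway(InternetGatewayId=igw_id, VpcId=vpc_id)

# Routing table: send non-local traffic to the gateway
rt_id = ec2.create_route_table(VpcId=vpc_id)["RouteTable"]["RouteTableId"]
ec2.create_route(RouteTableId=rt_id, DestinationCidrBlock="0.0.0.0/0",
                 GatewayId=igw_id)
ec2.associate_route_table(RouteTableId=rt_id, SubnetId=subnet_id)

# DNS hostnames for instances on the subnet
ec2.modify_vpc_attribute(VpcId=vpc_id, EnableDnsHostnames={"Value": True})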

Making a VPC: Security Group

One last step, after you've constructed the network, is to create a security group to control outside access to the VPC. This is an AWS-level firewall that will only allow traffic on ports you specify. Here we specify a single security group for every node on the network, and the security policy opens the following ports from and to computers in the same VPC:

  • 19999 - netdata
  • 9090 - prometheus
  • 3000 - grafana
  • 27017 - mongodb
  • 8080 - mongoexpress
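A sketch of the same thing in boto3 (the security group name and VPC id are hypothetical placeholders):

import boto3

ec2 = boto3.client("ec2", region_name="us-west-2")

sg_id = ec2.create_security_group(
    GroupName="vpc-monitoring",                  # assumed name
    Description="Monitoring ports, VPC traffic only",
    VpcId="vpc-0123456789abcdef0",               # hypothetical VPC id
)["GroupId"]

# Open each port, but only to addresses inside the VPC
for port in (19999, 9090, 3000, 27017, 8080):
    ec2.authorize_security_group_ingress(
        GroupId=sg_id, IpProtocol="tcp",
        FromPort=port, ToPort=port,
        CidrIp="10.117.0.0/16",
    )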

Adding Nodes to the VPC

Test out your network by creating a t2.micro node and specifying the subnet you created as the network the node should connect to. Check to ensure you can access the internet.

A good schema to use is:

10.X.0.1      gateway
10.X.0.100    node0
10.X.0.101    node1
10.X.0.102    node2
...

Scaling Your Process

If you're SSHing into a machine, your automation is broken. - Alex the AWS trainer

There are some tricks to getting your process to scale, but the essential part is figuring out how to remove SSH from the process. Cloud OS images (e.g. Ubuntu) have cloud-init, which runs an init script on boot. More info in this Stack Overflow post. This must be a bash script and size is limited to 16 KB. If it takes > 10 minutes, AWS will treat it as hanging and kill the node, so keep it (relatively) simple.

Example of what you can do:

  • Check out a git repo with initialization scripts
  • Install a hard-coded copy of your SSH public (not private) key, so you can get passwordless access to the node later
  • Spawn a server or process in the background
  • Start a server or process in a screen so that later you can SSH into the machine and attach to the screen to monitor the progress of the job or the service

And so on.
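As a sketch, here is how a user data script can be attached to a new node with boto3. The AMI id, subnet id, repo URL, and key are hypothetical placeholders, and boto3 base64-encodes UserData for you:

import boto3

ec2 = boto3.client("ec2", region_name="us-west-2")

user_data = """#!/bin/bash
# cloud-init runs this once, on first boot (must be under 16 KB)
git clone https://example.com/me/init-scripts.git /opt/init
cat /opt/init/id_rsa.pub >> /home/ubuntu/.ssh/authorized_keys
screen -dmS worker /opt/init/start_job.sh
"""

ec2.run_instances(
    ImageId="ami-0123456789abcdef0",        # hypothetical Ubuntu AMI
    InstanceType="t2.micro",
    MinCount=1, MaxCount=1,
    SubnetId="subnet-0123456789abcdef0",    # hypothetical subnet
    UserData=user_data,
)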

SSH and Identity Management

Identity access management (IAM) and access controls are the most confusing part of any cloud platform - don't feel frustrated if you don't pick it up easily.

IAM is a way for you to provide limited, specified access to resources that you own/have created, without sharing your credentials.

IAM is useful for you to use yourself (e.g., if you are using Amazon Kinesis and you want it to access an Amazon Redshift database, you can create an IAM role for your Amazon Kinesis script that gives it permission to access data in a particular Amazon Redshift database, and run your Kinesis script under that identity).

IAM is also useful for sharing resources with other AWS users. You can create a VPC and create an IAM group that allows people to manage the network, and add any network administrators to the group. This gives them full control over the network. You can also create an IAM group that gives members the ability to see, but not modify, the network (e.g., billing manager).

More info on IAM

Tricks with Disks

AMIs (Amazon Machine Images) provide operating system images that you can use to initialize a new node. But you can also create your own AMIs. Use the EC2 console to create an AMI (image) from any of your running nodes; wait for the snapshot to complete (it may take a few minutes); and now you can spawn a new node from the exact state of the existing node. (A boto3 sketch of this workflow follows the list below.)

This is useful for a couple of tasks:

  • If you're doing a parameter study of an analysis technique and need to run a process in parallel on the same data set, you can download the data set and set up your code on a single node, create an image once you're finished, then spawn new nodes using that snapshot
  • If you already have a node with a huge data set that took hours to download and process, and you realize your node needs another 32 GB of RAM, or a few extra CPUs, you won't be able to resize it on the fly. But you can create an AMI from the running node, shut it down, and create a new node from the image.
  • If you created a node in availability zone us-west-1a and another node in us-east-1c (a different region) and you want to connect them together on a VPC, you won't be able to, since a VPC cannot span regions. But you can create an AMI from one of the running nodes, shut it down, and spawn a new node in the correct region.
  • If you have a particularly large data set, a complicated workflow setup, or a custom operating system, you can create a public AMI and share it with others.
  • You can also privately share AMIs
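A boto3 sketch of the image-then-respawn workflow (the instance id and instance types are hypothetical placeholders):

import boto3

ec2 = boto3.client("ec2", region_name="us-west-2")

# Snapshot a running node into an AMI
image_id = ec2.create_image(
    InstanceId="i-0123456789abcdef0",    # hypothetical instance id
    Name="worker-with-dataset",
)["ImageId"]

# Wait for the snapshot to complete (may take a few minutes)
ec2.get_waiter("image_available").wait(ImageIds=[image_id])

# Spawn a new, bigger node from the exact state of the original
ec2.run_instances(ImageId=image_id, InstanceType="r4.xlarge",
                  MinCount=1, MaxCount=1)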

AWS CLI and Boto3

AWS from the Command Line

If you've clicked the AWS web interface to death and are looking for a better way of interacting with AWS, you can use aws-cli, a command line interface to AWS. This is extremely convenient for scripting common tasks.

aws-cli

This is a really powerful way of interacting with S3 buckets, and makes copying files to/from S3 buckets as easy as copying files in local directories.

Boto3

The CLI is much more powerful than the web interface, but is still missing some of the options available through the web. For the complicated stuff, use boto3, the Python API for AWS services.

boto3 on github

boto3 documentation

The basic formula for boto scripts is, you create Python objects representing different consoles or resources (e.g., a VPC object or an EC2 object), and you call functions to perform actions on those resources or in those consoles. These functions can take complicated nested dictionaries as parameters and allow you to specify every option available in AWS.
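A small example of that formula, using the S3 resource object (the bucket name is a hypothetical placeholder):

import boto3

# Objects representing consoles/resources...
s3 = boto3.resource("s3")

# ...and functions on them, often taking nested dictionaries as parameters
bucket = s3.create_bucket(
    Bucket="my-example-bucket",          # hypothetical bucket name
    CreateBucketConfiguration={"LocationConstraint": "us-west-2"},
)
bucket.upload_file("results.csv", "runs/2018-02-27/results.csv")

for obj in bucket.objects.filter(Prefix="runs/"):
    print(obj.key, obj.size)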

To Infinity... And Beyond!

Other tools like terraform build on the cloud providers' APIs and simplify the routine tasks of juggling information and configuration details, providing a kubernetes-like, declarative way to run and manage an AWS cluster.

Instance Metadata

One last topic - how to get access to information about your EC2 node, from your EC2 node.

There are some special IP addresses - any IP address starting with 10 is reserved for private networks, as are networks starting with 192.168, and 127.0.0.1 is the IP address that points to yourself.

The 169.254.*.* block is another reserved IP range, normally used for link-local addressing (e.g., direct ethernet-to-ethernet crossover connections). Given its uselessness in the cloud, AWS repurposed the address 169.254.169.254 to serve instance metadata.

This IP address exposes a RESTful API that you can call with curl. Information is organized hierarchically. For example:

$ curl http://169.254.169.254/latest/meta-data/

ami-id
ami-launch-index
ami-manifest-path
block-device-mapping/
hostname
iam/
instance-action
instance-id
instance-type
local-hostname
local-ipv4
mac
metrics/
network/
placement/
profile
public-hostname
public-ipv4
public-keys/
reservation-id
security-groups
services/

User data is also available:

http://169.254.169.254/latest/user-data

as are the IP addresses of network interfaces. (A specific network interface can be specified if there is more than one.) To get the private VPC IP, request the local-ipv4 address:

http://169.254.169.254/latest/meta-data/local-ipv4

and to get the public IP address, use public-ipv4:

http://169.254.169.254/latest/meta-data/public-ipv4

By assigning these values to environment variables, you can write a single script, run it across multiple machines, and control the execution behavior on each machine.
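For example, a short Python sketch of that pattern:

from urllib.request import urlopen

BASE = "http://169.254.169.254/latest/meta-data/"

def metadata(path):
    # Only resolves from inside an EC2 instance
    with urlopen(BASE + path, timeout=2) as response:
        return response.read().decode()

private_ip = metadata("local-ipv4")
public_ip = metadata("public-ipv4")
print(private_ip, public_ip)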

Link to more info on instance data

Tags:    aws    cloud    vpc    containers    data engineering   

AWSome Day Seattle Notes: Part 1: The Basics

Posted in Charlesreid1

permalink

These notes are also available on git.charlesreid1.com at https://git.charlesreid1.com/charlesreid1/aws.

AWSome Day Notes: Part 1: The Basics

Following are some notes from Amazon's AWSome Day (Tuesday, February 27, 2018).

EC2 Costs and Scheduling

Cost of a node:

  • Important to understand Amazon's price model: users pay for access, not for hardware
  • The cost of an AWS node is the cost of on-demand access

Scheduling:

  • If you can anticipate your usage, you can schedule instances in advance, and get a discount
  • Discount of 50% for one-year reservation (if you keep it busy for 6 months, you've made your money back)
  • Spot instances also available - need to be robust to sudden starts/stops (good for embarrassingly parallel jobs)
  • Cheaper to anticipate your usage and plan ahead

EC2 Transfer Costs

EC2 Instances:

EC2-Internet:

  • Traffic going from the internet into a node is always free
  • Traffic going from the node out to the internet incurs costs after 10 TB
  • Outbound traffic costs ~$90/TB

AWS Regions:

  • Traffic within a region does not incur costs (well... it's complicated)
  • Traffic between regions will incur costs

EC2-S3:

  • Transfer into an EC2 node from S3 bucket in same AWS region does not incur costs
  • Transfer out of an EC2 node into S3 bucket in same AWS region does not incur costs
  • (If they did charge you, they would be double-dipping...)

Note: the list of prices is like a legal document, so use the AWS Monthly Calculator to estimate monthly costs with more detail.

S3 Transfer Costs

  • See S3 Pricing - Data Transfer
  • Price model for storage is similar to price model for AWS nodes: you pay for access, not for hardware
  • To give a sense of why, think about logistics of a large "disk farm": all the intensive operations are done by the head nodes, disks are just passive
  • Busier disk farm needs sophisticated hardware for parallel read/write, high-bandwidth network lines, fast encryption

S3 storage pricing:

  • Rule of thumb: ~$20/TB to store the data

S3-Internet:

  • Transfer into an S3 bucket from the internet is always free (getting stuff into the bucket is the easy part - that's how they get ya)
  • Transfer out of an S3 bucket to the internet costs ~$90/TB

S3-EC2:

  • Transfer out of an S3 bucket to most other Amazon regions costs ~$20/TB
  • Transfer out of an S3 bucket into an EC2 node in the same AWS region does not incur costs
  • Transfer into an S3 bucket from an EC2 node in the same AWS region does not incur costs

As mentioned above, this means you won't be double-charged for transferring data from an S3 bucket to an EC2 node, then from the EC2 node out to the internet.

S3 Storage Hierarchies

Continuing with the theme of planning ahead...

Storage hierarchies:

  • Biggest cost of storage is not disk space, it's transfer
  • Paying for speed, paying for timeliness, paying for on the spot access to your data
  • Your data will be cheaper if you're willing to wait a few minutes or deal with a slow connection

Storage hierarchies:

  • Standard (~$20/TB)
  • Infrequent access (~$13/TB) - less frequent access, but at same transfer speed
  • Glacier (~$4/TB) - delay of up to 12 hours (smaller files = faster), deleting data newer than 3 months incurs costs

Glacier Pricing

Lifecycle rules:

  • Can create rules to move old data from S3 buckets into Glacier
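For example, a minimal boto3 sketch of such a rule (the bucket name and prefix are hypothetical placeholders):

import boto3

s3 = boto3.client("s3")

# Move objects under archive/ to Glacier 90 days after creation
s3.put_bucket_lifecycle_configuration(
    Bucket="my-example-bucket",          # hypothetical bucket name
    LifecycleConfiguration={
        "Rules": [{
            "ID": "archive-to-glacier",
            "Filter": {"Prefix": "archive/"},
            "Status": "Enabled",
            "Transitions": [{"Days": 90, "StorageClass": "GLACIER"}],
        }]
    },
)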

EFS vs EBS vs S3

When do you use EFS, EBS, or S3?

Elastic Block Storage (EBS):

  • This is probably what you want
  • EBS is block storage for one EC2 node - designed for general purpose applications
  • Cost: ~$120/TB/mo

Elastic File System (EFS):

  • EFS is block storage for multiple EC2 nodes - designed for fast read-write operations, many incremental changes to files
  • "Elastic" part of EFS - can dynamically grow as hard drive grows (PB+ scale)
  • Hard drive on steroids - like plugging in a hard drive over a network, but big/fast/smart enough to be accessible to thousands+ of machines
  • Expensive: ~$300/TB/mo

S3:

  • S3 is object storage - it stores blobs of raw data, creates snapshots in time
  • If you change a single character of a large file, the bucket has to create a new snapshot
  • Booting from S3 as a hard disk would take you about a thousand years... don't do that
  • Cheapest: ~$20/TB

Cool but $$$:

  • You may see "appliances" mentioned in Amazon documentation - Amazon will ship you a physical data transfer appliance that encrypts and copies data on site (Snowball)
  • Can also purchase special network connections that bypass the public internet - like ISP putting alligator clips between your network lines and Amazon's network lines

Tags:    aws    cloud    vpc    containers    data engineering   
