Friday, February 21, 2014

AWS RDS Operations Design Patterns (Part 1)

Having recently come across this otherwise excellent collection of AWS usage tips, I was a bit disappointed in the RDS guidance. Too often I see the RDS service adopted as a way to avoid hiring a full-time DBA or to outsource the operations overhead of EC2. While the latter is an excellent choice, expectations for which infrastructure-level actions are available in a managed service are frequently overlooked, and the former is a complete misconception.

What RDS is

RDS is, first and foremost, managed infrastructure and common automation. It is the EC2 you know and love, geographically distributed, and managed through an API. You get your choice of instance class within a certain subset of instance types, a stripe of EBS volumes to match your capacity needs, and a bunch of traditional metrics and alarms.

What RDS is not

Many adopters assume that with the above, they also have access to, or eliminate the need for, a DBA with expertise in their particular application's schema and performance profile. That's not the case, and the assumption leads to early adoption failure or, worse yet, a production deployment dependent on an unsupported (yet unrestricted) engine feature.

The Ops toolbox

In light of the previous comments, RDS might sound like the wrong tool for the job. On the contrary, it's the right tool for the right jobs. The key is to understand what those jobs are and when the limitations of a managed service have been reached.
Over the next few posts, I'll cover RDS features for high availability and disaster recovery, their impact, and key indicators for their use.

Snapshot and Point-in-Time Restores
Multi-AZ
Read Replication
ETL

Snapshot vs Point-in-Time Restores

The first and most basic tool for any operations team is backup and restore. RDS provides two native facilities for this: Manual Snapshots and Automatic Backups. Aside from the obvious difference in the name of each restore process and its implied use, your recovery time objective (RTO) depends on how far past the last snapshot your point-in-time restore (PITR) target is.
Consider this: restoring from a full backup is faster and more deterministic than a piecemeal restore, which simply has too many variables. Was there a large DML statement? Schema changes? How many transactions are included? What is the write-ahead logging throughput for transactions of that size? All of these affect restore time, which is why a snapshot restore requires less work than a point-in-time restore in RDS.

Restoring 23 hours and 59 minutes of transactions could take longer and have a higher production impact in terms of availability when compared to losing data by rolling back to the last daily snapshot. ETL can close the data gap between newly ingested data and restored data.
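To make that trade-off concrete, here is a rough sketch of how much transaction history a point-in-time restore has to replay. It assumes replay starts from the most recent snapshot taken at or before the target time; the function name and the sample timestamps are hypothetical, and this is plain arithmetic rather than an RDS API.

```javascript
// Hypothetical helper: minutes of transactions a point-in-time restore must
// replay, assuming replay starts at the latest snapshot at or before the target.
function replayWindowMinutes(snapshotTimes, targetTime) {
  var target = new Date(targetTime).getTime();
  var latest = null;
  snapshotTimes.forEach(function (t) {
    var ms = new Date(t).getTime();
    if (ms <= target && (latest === null || ms > latest)) {
      latest = ms;
    }
  });
  if (latest === null) {
    return null; // no snapshot precedes the target time
  }
  return Math.round((target - latest) / 60000);
}

// Made-up daily snapshots at 08:00 UTC.
var snapshots = ['2014-02-19T08:00:00Z', '2014-02-20T08:00:00Z'];
console.log(replayWindowMinutes(snapshots, '2014-02-21T07:59:00Z')); // 1439
```

A target one minute before the next daily snapshot means replaying 1439 minutes of changes, which is the worst case described above.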

General guidelines thus include regular restore testing, manual snapshot creation prior to large loads or DDL, and aligning the daily backup window with the lowest period of WriteIOPS. (More on Snapshot impact next time.)
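As a sketch of the window-alignment guideline, suppose you already have 24 hourly WriteIOPS averages (pulled from your monitoring however you like), indexed by UTC hour. The helper below is hypothetical; it picks the quietest hour and formats a 30-minute window as hh:mm-hh:mm in UTC, which is the shape RDS uses for its preferred backup window setting.

```javascript
// Hypothetical helper: given 24 hourly WriteIOPS averages (index = UTC hour),
// return a 30-minute window starting at the quietest hour, as "hh:mm-hh:mm".
function quietestBackupWindow(hourlyWriteIops) {
  if (hourlyWriteIops.length !== 24) {
    throw new Error('expected 24 hourly values');
  }
  var quietest = 0;
  for (var hour = 1; hour < 24; hour++) {
    if (hourlyWriteIops[hour] < hourlyWriteIops[quietest]) {
      quietest = hour;
    }
  }
  var hh = (quietest < 10 ? '0' : '') + quietest;
  return hh + ':00-' + hh + ':30';
}

// Made-up sample data: writes dip at 09:00 UTC.
var sample = [];
for (var h = 0; h < 24; h++) {
  sample.push(100 + 40 * Math.abs(h - 9));
}
console.log(quietestBackupWindow(sample)); // 09:00-09:30
```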
If there's one thing I must stress, it's that you should enable Automatic Backups. Now. Just stop and go enable it. (Preferably during your lowest daily WriteIOPS!) Seriously, no one's going to blame you if you opt in to a feature which barely costs anything.
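If you would rather script it than click through the console, here is a minimal sketch using the AWS SDK for Node.js. modifyDBInstance and createDBSnapshot are operations on the SDK's RDS client; the function names, instance identifier, retention period, and window are placeholders of my own. The client is passed in as an argument so the functions can be exercised without touching AWS.

```javascript
// Turn on Automatic Backups by setting a non-zero retention period.
// `rds` is expected to be an AWS.RDS client; all values passed in are placeholders.
function enableAutomaticBackups(rds, instanceId, retentionDays, backupWindow, callback) {
  rds.modifyDBInstance({
    DBInstanceIdentifier: instanceId,
    BackupRetentionPeriod: retentionDays, // 0 disables automatic backups
    PreferredBackupWindow: backupWindow,  // "hh24:mi-hh24:mi", in UTC
    ApplyImmediately: true                // otherwise waits for the maintenance window
  }, callback);
}

// Take a manual snapshot, e.g. right before a large load or DDL change.
function snapshotBefore(rds, instanceId, label, callback) {
  rds.createDBSnapshot({
    DBInstanceIdentifier: instanceId,
    DBSnapshotIdentifier: instanceId + '-pre-' + label
  }, callback);
}

// Usage (assumes aws-sdk is installed and credentials are available):
// var AWS = require('aws-sdk');
// var rds = new AWS.RDS({region: 'us-west-2'});
// enableAutomaticBackups(rds, 'mydb', 7, '09:00-09:30', console.log);
// snapshotBefore(rds, 'mydb', 'bulk-load', console.log);
```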

In fact, this is so important, here's a link to help get you started: AWS Console: RDS

If you're the DBA, Developer, DevOps, or SysAdmin, data is your company's lifeblood. The only person to blame if you don't take backups is you. So go set it up now.

Next Up: Multi-AZ

There's a time and a place for synchronous replication, and that's a completely different article.

Friday, January 31, 2014

Introduction to DynamoDB with Node.js

A great way to get started with DynamoDB and the AWS SDK for Node.js is to leverage IAM Roles on a simple t1.micro instance.

First, create an IAM Role with a policy which permits all DynamoDB APIs for now:

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Action": [
        "dynamodb:*"
      ],
      "Effect": "Allow",
      "Resource": "*"
    }
  ]
}

Once an instance is launched in that role, install Node and load the AWS SDK module:

$ npm install aws-sdk
$ node
>

Because we launched the EC2 instance in a Role, the SDK automatically picks up the IAM Role's temporary credentials, so there are no keys to manage. Let's load the SDK and point to the desired region:

> var AWS = require('aws-sdk');
undefined  
> AWS.config.update({region: 'us-west-2'});
undefined

Having already created the sample tables from the public DynamoDB documentation, it's fairly easy to use the quick sample to list the tables in the region:

> var db = new AWS.DynamoDB();
undefined  
> db.listTables(function(err, data) {
...    if (err) return console.log(err);
...    console.log(data.TableNames);
... });
{ service: ... }
> [ 'Forum', 'ProductCatalog', 'Reply', 'Thread' ]

And there we go: a simple start to using Node.js with IAM Roles on an EC2 instance to list the existing tables in a specific region.
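One caveat once you have more than a handful of tables: listTables returns at most 100 names per call and hands back a LastEvaluatedTableName when there are more. Here is a minimal sketch of following those pages, with error handling; listAllTables is a hypothetical helper of mine, and the client is passed in so it can be tried against a stub.

```javascript
// Collect every table name, following LastEvaluatedTableName across pages.
// `db` is expected to be an AWS.DynamoDB client (or a stub with the same listTables shape).
function listAllTables(db, callback) {
  var names = [];
  function nextPage(startName) {
    var params = startName ? {ExclusiveStartTableName: startName} : {};
    db.listTables(params, function (err, data) {
      if (err) {
        return callback(err);
      }
      names = names.concat(data.TableNames);
      if (data.LastEvaluatedTableName) {
        nextPage(data.LastEvaluatedTableName);
      } else {
        callback(null, names);
      }
    });
  }
  nextPage(null);
}

// Usage (assumes aws-sdk is installed and the region is configured):
// var AWS = require('aws-sdk');
// listAllTables(new AWS.DynamoDB(), function (err, names) {
//   console.log(err || names);
// });
```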

Wednesday, January 29, 2014

Zero to Build Failed in 20 minutes

In the process of learning Node.js, I elected to build some basic projects in the language, but also to leverage source control (GitHub, of course) and Continuous Integration. The obvious choice for this project is Travis CI, which provides free build testing for open source repositories.

First, you'll need to set up a .travis.yml. I shamelessly reviewed the configurations of a few favorite Node projects, along with the Travis CI help, before settling on the simple, concise:

language: node_js  
node_js:  
  - "0.11"
  - "0.10"
  - "0.8"
  - "0.6"

That's more than enough testing, as I'll probably only use the 0.10 branch and 0.11 once it's released.
So what does Travis CI do for Node builds? By default, it executes npm test. What on earth does that do? There's really nothing helpful in the documentation, but npm test runs whatever command the test entry under scripts in package.json names. In the few favorite projects I perused, that is typically a script in the root of the project named test.js or all the scripts in a folder named test.

Alright, so then just save an empty file named test, right? Not so much. The easiest way to get started with some test-driven development is to assert what you expect your functions to return for certain known values.
Consider a simple Fibonacci series:

module.exports = function fib(n) {  
   if ((n === 1) || (n === 0)) {
      return n;
   } else {
      return fib(n-1) + fib(n-2);
   }
}

And then require the module and assert in test.js:

// Import libraries
var fib = require('./fib.js');  
var assert = require('assert');

// Exercise fib
assert.equal(fib(4),3);  
assert.equal(fib(5),5);  
assert.equal(fib(6),8);  
assert.equal(fib(7),13);

At this point, a git push to the repository, provided Travis CI is enabled for it, will result in the above code being executed in Travis CI's build environments and any uncaught exceptions will fail the build.
But consider this: if I purposely throw an exception and then catch it, does my build fail even though the application isn't actually working?

module.exports = function buildHyjinx() {
   try {
      if (true) throw "empty";
   } catch (err) {
      console.log(err);
      return -1;
   }
}

Wait a minute! A build passes because we handled an exception appropriately? Yes. So build real-life tests, and include tests for functionality you don't yet have. Then fail your build on purpose, or you'll never know what you haven't done yet.

The Evening Read