What RDS is
RDS is, first and foremost, managed infrastructure and common automation. It is the EC2 you know and love, geographically distributed, and managed through an API. You get your choice of instance class within a certain subset of instance types, a stripe of EBS volumes to match your capacity needs, and a bunch of traditional metrics and alarms.
What RDS is not
Many adopters assume that with the above, they also gain access to, or eliminate the need for, a DBA with expertise in their particular application's schema and performance profile. That's not the case, and the assumption leads to early adoption failure or, worse yet, a production deployment dependent on an unsupported (yet unrestricted) engine feature.
The Ops tools box
In light of the previous comments, RDS may sound like the wrong tool for the job. On the contrary, it's the right tool for the right jobs. The key is to understand what those jobs are and when the limitations of a managed service have been reached.
Over the next few posts, I'll cover RDS features for high availability and disaster recovery, their impact, and key indicators for their use:
- Snapshot and Point-in-Time Restores
- Multi-AZ
- Read Replication
- ETL
Snapshot vs Point-in-Time Restores
The first and most basic tool for any operations team is backup and restore. RDS provides two native facilities for this: Manual Snapshots and Automatic Backups. Aside from the obvious difference in the name of the restore process and its implied use, the two differ in RTO: how long a point-in-time restore (PITR) takes depends on the point in time you restore to, because every transaction since the last backup has to be replayed.
Consider this: restoring from a full backup is faster and more deterministic than a piecemeal restore. There are just too many variables involved. Was there a large DML? Schema changes? How many transactions are included? What is the transaction throughput of write-ahead logging based on transaction size? All of these affect restore time, which is why a snapshot restore in RDS requires less work than a point-in-time restore.
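To make the two restore paths concrete, here is a minimal sketch using the boto3 RDS client calls `restore_db_instance_from_db_snapshot` and `restore_db_instance_to_point_in_time`. The wrapper name `restore_instance` and its parameters are my own, for illustration only, and the client is passed in so the function can be exercised without touching AWS.

```python
def restore_instance(rds_client, target_id, snapshot_id=None,
                     source_id=None, restore_time=None):
    """Restore into a new instance, from a snapshot or to a point in time.

    rds_client is expected to behave like boto3.client("rds").
    restore_instance itself is a hypothetical helper, not part of boto3.
    """
    if snapshot_id is not None:
        # Snapshot restore: the new instance is built from one full backup.
        return rds_client.restore_db_instance_from_db_snapshot(
            DBInstanceIdentifier=target_id,
            DBSnapshotIdentifier=snapshot_id,
        )
    if source_id is None:
        raise ValueError("a point-in-time restore needs source_id")
    params = {
        "SourceDBInstanceIdentifier": source_id,
        "TargetDBInstanceIdentifier": target_id,
    }
    if restore_time is None:
        # No time given: restore to the most recent restorable point.
        params["UseLatestRestorableTime"] = True
    else:
        # restore_time should be a timezone-aware datetime.
        params["RestoreTime"] = restore_time
    # PITR: a backup is restored and logged changes are replayed on top,
    # so the work grows with the amount of change being replayed.
    return rds_client.restore_db_instance_to_point_in_time(**params)
```

In real use you would pass `boto3.client("rds")` as `rds_client`. Both calls create a new instance rather than overwriting the source, so restore testing can run alongside production.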
Restoring 23 hours and 59 minutes of transactions could take longer and have a higher production impact in terms of availability than losing data by rolling back to the last daily snapshot. ETL can close the data gap between newly ingested data and restored data.
General guidelines thus include regular restore testing, manual snapshot creation prior to large loads or DDL, and aligning the daily backup window with the lowest period of WriteIOPS. (More on Snapshot impact next time.)
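As a rough illustration of the last guideline, the sketch below picks the hour with the lowest average WriteIOPS and formats it as an `hh:mm-hh:mm` UTC window, the format RDS uses for its preferred backup window. The function name, the input shape (hour of day mapped to samples), and the sample numbers are all assumptions of mine; in practice you would feed it WriteIOPS datapoints pulled from CloudWatch.

```python
from statistics import mean


def quietest_backup_window(hourly_write_iops, window_minutes=30):
    """Return an 'HH:MM-HH:MM' UTC window starting at the quietest hour.

    hourly_write_iops maps an hour of day (0-23, UTC) to a list of
    WriteIOPS samples seen during that hour. Hypothetical helper.
    """
    # RDS requires the backup window to be at least 30 minutes long.
    if window_minutes < 30:
        raise ValueError("window_minutes must be at least 30")
    # Average each hour, ignoring hours that have no samples.
    averages = {
        hour: mean(samples)
        for hour, samples in hourly_write_iops.items()
        if samples
    }
    if not averages:
        raise ValueError("need at least one hour with samples")
    start_hour = min(averages, key=averages.get)
    start = start_hour * 60
    # Wrap around midnight if the window crosses it.
    end = (start + window_minutes) % (24 * 60)
    return f"{start // 60:02d}:{start % 60:02d}-{end // 60:02d}:{end % 60:02d}"


# Made-up samples: 03:00 UTC has the lowest average write load.
samples = {2: [120, 80], 3: [40, 60], 14: [900, 1100]}
print(quietest_backup_window(samples))  # prints 03:00-03:30
```

A single day of samples can mislead, so averaging over a week or more before settling on a window is probably the safer choice.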
If there's one thing I must stress, it's that you should enable Automatic Backups. Now. Just stop and go enable it. (Preferably during your lowest daily WriteIOPS!) Seriously, no one's going to blame you if you opt in to a feature which barely costs anything.
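If you'd rather script it than click through the console, automated backups are switched on by giving the instance a non-zero backup retention period. The sketch below wraps the boto3 `modify_db_instance` call; the wrapper name, its defaults (7 days, change deferred to the next maintenance window), and the instance name in the usage comment are my own assumptions, not recommendations from AWS.

```python
def enable_automatic_backups(rds_client, instance_id, retention_days=7,
                             backup_window=None, apply_immediately=False):
    """Enable automated backups by setting a non-zero retention period.

    rds_client is expected to behave like boto3.client("rds").
    enable_automatic_backups itself is a hypothetical helper.
    """
    # RDS accepts 1 to 35 days; 0 would disable automated backups.
    if not 1 <= retention_days <= 35:
        raise ValueError("retention_days must be between 1 and 35")
    params = {
        "DBInstanceIdentifier": instance_id,
        "BackupRetentionPeriod": retention_days,
        # False defers the change to the next maintenance window.
        "ApplyImmediately": apply_immediately,
    }
    if backup_window:
        # 'HH:MM-HH:MM' in UTC, at least 30 minutes long.
        params["PreferredBackupWindow"] = backup_window
    return rds_client.modify_db_instance(**params)


# Usage, assuming boto3 is installed and credentials are configured
# ("my-db-instance" is a placeholder name):
#   import boto3
#   enable_automatic_backups(boto3.client("rds"), "my-db-instance",
#                            backup_window="03:00-03:30")
```

Taking the client as an argument keeps the function easy to dry-run against a stub before pointing it at a real instance.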
In fact, this is so important, here's a link to help get you started: AWS Console: RDS
If you're the DBA, Developer, DevOps, or SysAdmin, data is your company's lifeblood. The only person to blame if you don't take backups is you. So go set it up now.