Continuity – Preventing data loss and production downtime

Backups

  1. This is an area where technology can be very useful. Duplicating digital data is far easier than manually duplicating hard copy.
  2. Determine your Recovery Point Objective (RPO) and Recovery Time Objective (RTO). RPO is the amount of data you can stand to lose if a failure occurs. For example, you might decide that a failure should cost you no more than the last 24 hours of modified data. If your backups run every night and finish at 11:00 PM, and you have a failure at 10:59 PM the following evening, just before the next backup completes, you could lose almost 24 hours of data, which would then have to be recreated or re-entered. RTO is how long it takes, once a failure occurs, until production is back up and running with your data restored up to the RPO. An example of an RTO might be 4 hours; the figure you can achieve often depends on the amount of data you have to recover.
  3. Backing up is only half of backing up! Your backups should be verified on a regular basis: restore several production files from the backup, open them, and confirm that they are the same size and contain the same data as the originals currently in production. An unverified backup is an unreliable backup.
  4. Plan a three-phase backup for all critical production data.
  • Leverage local Redundant Array of Independent Disks (RAID) technology to protect yourself against hard drive failure.
  • Decide what software and hardware your backup solution requires to meet your RPOs and RTOs. 
  • Implement offsite backups of production data for disaster recovery. Think of this third step as fire and theft insurance for your data: not something you will use often for recovery, but critical to have when you do need it. Many solutions meet this need, ranging from rotated tapes or external USB enclosures, to fully managed over-the-internet backups, to remote, geographically separate data centers.
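
Two points in the list above lend themselves to a short sketch: the RPO arithmetic from item 2 and the file verification from item 3. The Python below is only an illustration under stated assumptions, not a recommended tool. The function names (data_lost, meets_rpo, sha256_of, verify_restored_file) are hypothetical, and the SHA-256 checksum is an assumed way of confirming that two files "contain the same data"; the text does not prescribe a method.

```python
import hashlib
from datetime import datetime, timedelta
from pathlib import Path


def data_lost(last_backup_finished: datetime, failure_time: datetime) -> timedelta:
    """Window of modified data that exists only in production when the failure hits."""
    return failure_time - last_backup_finished


def meets_rpo(last_backup_finished: datetime, failure_time: datetime,
              rpo: timedelta) -> bool:
    """True if the data lost in this failure stays within the RPO."""
    return data_lost(last_backup_finished, failure_time) <= rpo


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Checksum a file in chunks so large files do not have to fit in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_restored_file(original: Path, restored: Path) -> bool:
    """Compare size first (cheap), then content via checksum (assumed method)."""
    if original.stat().st_size != restored.stat().st_size:
        return False
    return sha256_of(original) == sha256_of(restored)
```

With the nightly schedule from item 2, a backup that finished at 11:00 PM and a failure at 10:59 PM the next evening gives a loss of 23 hours 59 minutes, which meets a 24-hour RPO but not a 4-hour one. One caveat on verification: a production file that was legitimately modified after the backup ran will not match its restored copy, so a mismatch is a prompt to investigate rather than proof of a bad backup.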

Maintenance or Break/Fix

Decide whether it makes more sense to approach preventing downtime from a proactive perspective or a reactive one. This is typically decided based on the numbers determined in the Cost, Value and ROI section of this library, which should reflect your hourly cost of production downtime. If your downtime cost is high, it makes sense to perform preventative maintenance and proactive monitoring to prevent failure, instead of waiting for failure and then fixing the breakage as fast as possible. An often overlooked but important component of this is the stress factor. When you are unable to meet client deadlines or act on new opportunities because your production technology is down, how does it affect stakeholder morale?
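
The cost side of that decision can be roughed out in a few lines. The Python below is a minimal sketch, assuming a simple model in which proactive maintenance costs a fixed annual amount and reduces the expected hours of downtime per year. The function names, parameters, and every figure are hypothetical and are not taken from the Cost, Value and ROI section.

```python
def proactive_is_cheaper(hourly_downtime_cost: float,
                         reactive_hours_down: float,
                         proactive_hours_down: float,
                         annual_maintenance_cost: float) -> bool:
    """Compare expected annual cost of break/fix against maintenance plus residual downtime."""
    reactive_total = hourly_downtime_cost * reactive_hours_down
    proactive_total = (hourly_downtime_cost * proactive_hours_down
                       + annual_maintenance_cost)
    return proactive_total < reactive_total


def break_even_hourly_cost(reactive_hours_down: float,
                           proactive_hours_down: float,
                           annual_maintenance_cost: float) -> float:
    """Hourly downtime cost above which proactive maintenance pays for itself."""
    hours_saved = reactive_hours_down - proactive_hours_down
    if hours_saved <= 0:
        raise ValueError("maintenance must reduce expected downtime to break even")
    return annual_maintenance_cost / hours_saved
```

For instance, if maintenance is assumed to cost 6,000 per year and to cut expected downtime from 20 hours to 5, the break-even point is 400 per hour of downtime: at 500 per hour the proactive approach is cheaper, and at 300 per hour it is not. This model deliberately leaves out the stress factor and lost opportunities, which are harder to price and usually push the decision further toward the proactive side.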