11 Disaster Recovery
N2WS’s DR (Disaster Recovery) solution allows you to recover your data and servers in case of a disaster, regardless of why your system was taken out of service. N2WS lets you copy your backup snapshots to multiple AWS regions as well as to various AWS accounts, combining cross-account and cross-region options.
What does that mean in a cloud environment like EC2? Every EC2 region is divided into AZs which use separate infrastructure (power, networking, etc.). Because N2WS uses EBS snapshots you will be able to recover your EC2 servers to other AZs. N2WS’s DR is based on AWS’s ability to copy EBS snapshots between regions and allows you the extended ability to recover instances and EBS volumes in other regions. You may need this ability if there is a full-scale outage in a whole region. But it can also be used to migrate instances and data between regions and is not limited to DR. If you use N2WS to take RDS snapshots, those snapshots will also be copied and will be available in other regions.
- DynamoDB Tables - DR for DynamoDB tables is currently not supported by AWS.
- Redshift Clusters - Currently N2WS does not support DR of Redshift clusters. If you enable DR on a policy containing Redshift clusters, they will be ignored at the DR stage. You can have Redshift snapshots copied between regions automatically by enabling cross-region snapshots in the Amazon Redshift console.
In the DR Options screen, configure the following, and then select Save.
- Enable DR – Select to display additional fields.
- DR Frequency (backups) – Frequency of performing DR in terms of backups. On each backup, the default is to copy snapshots of all supported backups to other regions. To reduce costs, you may want to reduce the frequency. See section 11.4 for considerations in planning DR.
- DR Timeout (hours) – How long N2WS waits for the DR process on the policy to complete. DR copies data between regions over a WAN (Wide Area Network) which can take a long time. N2WS will wait on the copy processes to make sure they are completed successfully. If the entire DR process is not completed in a certain time frame, N2WS assumes the process is hanging and will declare it as failed. Twenty-four hours is the default and should be enough time for a few 1 TiB EBS volumes to copy. Depending on the snapshot, however, you may want to increase or decrease the time.
- Target Regions – The region or regions to which you want to copy the policy’s snapshots.
Things to know about the DR process:
- N2WS’s DR process runs in the background.
- It starts when the backup process is finished. N2WS then determines whether DR should run and, if so, kicks off the process. In the Backup Monitor, you will see the ‘In Progress’ status.
- N2WS will wait until all copy operations are completed successfully before declaring the DR status as Completed as the actual copying of snapshots can take time.
- As opposed to the backup process that allows only one backup of a policy to run at one time, DR processes are completely independent. This means that if you have an hourly backup and it runs DR each time, if DR takes more than an hour to complete, the DR of the next backup will begin before the first one has completed.
- N2WS will keep all information of the original snapshots and the copied snapshots and will know how to recover instances and volumes in all relevant regions.
- The automatic retention process that deletes old snapshots will also clean up the old snapshots in other regions. When a regular backup is outside the retention window and its snapshots are deleted, so are the DR snapshots that were copied to other regions.
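Because DR processes run independently of each other, several can be in flight at once. The following is a rough sketch of that arithmetic, not something N2WS exposes: the function name is hypothetical, and it assumes DR runs after every backup and that every DR run takes the same amount of time.

```python
import math


def max_concurrent_dr(backup_interval_hours, dr_duration_hours):
    """Estimate how many DR processes of one policy overlap at peak.

    Assumption: a DR process starts after every backup and each one
    takes dr_duration_hours to finish.
    """
    return max(1, math.ceil(dr_duration_hours / backup_interval_hours))


# Hourly backups whose DR copy takes 2.5 hours each.
print(max_concurrent_dr(1, 2.5))

# Hourly backups whose DR copy finishes within the hour never overlap.
print(max_concurrent_dr(1, 0.5))
```

If the estimate is high, reducing the DR frequency of the policy is one way to bring it down.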
N2WS supports backup objects from multiple regions in one policy. In most cases this is probably not the best practice, but sometimes it is useful. When you choose a target region for DR, DR copies to that region all the backup objects in the policy that are not already there. For example, if you back up an instance in Virginia and an instance in N. California, and you choose N. California as a target region, only the snapshots of the Virginia instance will be copied to N. California. So, you can potentially implement a mutual DR policy: choose Virginia and N. California as target regions, and the Virginia instance will be copied to N. California and vice versa. This can come in handy if there is a problem or an outage in one of these regions, because you can always recover the instance in the other region.
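The copy rule described above can be sketched in a few lines. This is only an illustration of the logic, not N2WS code: the function and object names are hypothetical, and the regions are written as the AWS codes for N. Virginia (us-east-1) and N. California (us-west-1).

```python
def plan_dr_copies(objects, target_regions):
    """Map each target region to the backup objects that would be copied there.

    objects: dict of object name -> source region (hypothetical names).
    An object is copied to every target region other than its own.
    """
    plan = {}
    for target in target_regions:
        plan[target] = sorted(
            name for name, source in objects.items() if source != target
        )
    return plan


policy_objects = {
    "virginia-instance": "us-east-1",    # N. Virginia
    "california-instance": "us-west-1",  # N. California
}

# Single target region: only the Virginia instance needs copying.
print(plan_dr_copies(policy_objects, ["us-west-1"]))

# Mutual DR: each instance is copied to the other region.
print(plan_dr_copies(policy_objects, ["us-east-1", "us-west-1"]))
```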
- If the source and target regions are the same for DR, no action is required as the target region will default to the source.
- If the target region is different from the source, the target region must have a backup vault with the same name as the source vault, and the vault must be specified using a tag before the DR begins:
  - For the "Default" vault, if this is the first time you are copying a snapshot to the DR region, go to the AWS Backup console and activate the vault by selecting Backup vaults.
  - For a non-default custom vault, a vault with the same name needs to be created in the DR region. For example, if the source region’s vault name is "Test", the DR region must also include a vault with the name "Test".
To set a custom vault name for cross-region EFS DR:
Before the DR, add a tag to the resource with the key ‘cpm_dr_backup_vault’ and the value of the custom backup vault ARN:
Key='cpm_dr_backup_vault:REGION', Value='BACKUP_VAULT_ARN'
Add a tag for each target region that is different from the source.
To set a custom vault name for cross-account EFS DR:
Before the DR, add a tag to the resource with the key ‘cpm_dr_backup_vault’ and the value of the custom backup vault ARN:
Key='cpm_dr_backup_vault:REGION:ACCOUNT_NUMBER', Value='BACKUP_VAULT_ARN'
Add a tag for each target region that is different from the source.
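As an illustration of the two tag formats above, here is a minimal sketch that builds the tag for a cross-region or a cross-account target. The helper name is hypothetical, and the vault ARN and account number are placeholders, not real resources.

```python
def dr_vault_tag(vault_arn, region, account_number=None):
    """Build the EFS DR vault tag for one target region.

    With account_number set, the cross-account key form is produced.
    """
    key = f"cpm_dr_backup_vault:{region}"
    if account_number is not None:
        key += f":{account_number}"
    return {"Key": key, "Value": vault_arn}


# Placeholder ARN of a vault named "Test" in the target region.
target_vault_arn = "arn:aws:backup:us-west-1:123456789012:backup-vault:Test"

# Cross-region DR.
print(dr_vault_tag(target_vault_arn, "us-west-1"))

# Cross-account DR to the same target region.
print(dr_vault_tag(target_vault_arn, "us-west-1", "123456789012"))
```

The resulting key/value pairs can then be added to the EFS resource with whichever tagging tool you normally use.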
This section describes the main concepts in planning your DR.
There are some fundamental differences between local backup and DR to other regions. It is important to understand the differences and their implications when planning your DR solution. The differences between storing EBS snapshots locally and copying them to other regions are:
- Copying between regions means transferring data over a WAN, which is much slower than moving data locally. A data transfer from the U.S. to Australia or Japan will take considerably more time than a local copy.
- AWS will charge you for the data transfer between regions. This can affect your AWS costs, and the prices differ depending on the source region of the transfer. For example, in March 2013, transferring data out of U.S. regions cost 0.02 USD/GiB, while transferring data out of the South America region cost as much as 0.16 USD/GiB.
As an extreme example: You have an instance with 4 x 1 TiB EBS volumes attached to it. The volumes are 75% full. There is an average of 3% daily change in data for all the volumes. This brings the total size of the daily snapshots to around 100 GiB. Locally you take 4 backups a day. In terms of cost and time, it will not make much of a difference whether you take one backup a day or four, which is also true for copying snapshots, since that operation is incremental as well. Now you want a DR solution for this instance. Copying it every time will copy around 100 GiB a day. You need to calculate the price of transferring 100 GiB a day and of storing that data in the remote region on top of the local region.
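The arithmetic behind this example can be worked through as follows. This is a back-of-the-envelope sketch that assumes four 1 TiB volumes and uses the March 2013 transfer prices quoted earlier; the function names are hypothetical, and remote storage costs are left out because no storage price is given here.

```python
GIB_PER_TIB = 1024


def daily_changed_gib(volume_count, volume_size_tib, fill_ratio, daily_change_ratio):
    """Approximate amount of data that changes, and is therefore copied, per day."""
    used_gib = volume_count * volume_size_tib * GIB_PER_TIB * fill_ratio
    return used_gib * daily_change_ratio


def daily_transfer_cost_usd(transfer_gib, price_per_gib):
    """Inter-region transfer cost only; storage in the remote region is extra."""
    return transfer_gib * price_per_gib


changed = daily_changed_gib(4, 1, 0.75, 0.03)
print(f"Daily change: {changed:.2f} GiB")

# March 2013 prices from the text: U.S. regions and South America.
for label, price in [("U.S. region", 0.02), ("South America", 0.16)]:
    cost = daily_transfer_cost_usd(changed, price)
    print(f"{label}: {cost:.2f} USD/day")
```

The daily change comes to roughly 92 GiB, which the text rounds to around 100 GiB.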
You want to define your recovery objectives both in local backup and DR according to your business needs. However, you do have to take costs and feasibility into consideration. In many cases, it is ok to say: For local recovery, I want frequent backups, four times a day, but for DR recovery it is enough for me to have a daily copy of my data. Or, maybe it is enough to have DR every two days. There are two ways to define such a policy using N2WS:
- In the definition of your policy, select the frequency in DR Frequency (backups). If the policy runs four times a day, configure DR to run once every four backups. The DR status of all the rest will be Skipped.
- Or, define a special policy for the DR process. If you have a sqlserver1 policy, define another one and name it something like sqlserver1_dr. Define all targets and options the same as the first policy, but choose a schedule relevant for DR. Then define DR for the second policy. Locally it will not add any significant cost since it is all incremental, but you will get DR only once a day.
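To picture the first option, the sketch below labels each backup in a day according to a DR Frequency (backups) setting. It is illustrative only: the function name is hypothetical, and which backup within each cycle actually triggers DR is an assumption made here (the first one), not something this guide specifies.

```python
def dr_schedule(backups_per_day, dr_frequency):
    """Return a label for each backup in one day: "DR" or "Skipped"."""
    statuses = []
    for index in range(backups_per_day):
        # Assumption: the first backup of each cycle is the one that runs DR.
        runs_dr = index % dr_frequency == 0
        statuses.append("DR" if runs_dr else "Skipped")
    return statuses


# Four backups a day, DR once every four backups: one DR copy per day.
print(dr_schedule(4, 4))

# Four backups a day, DR once every two backups: two DR copies per day.
print(dr_schedule(4, 2))
```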
To perform DR recovery, you will need your N2WS server up and running. If the original server is alive, then you can perform recovery on it across regions. You want to prepare for the case where the N2WS server itself is down. You may want to copy your N2WS database across regions as well. Generally, it is not a bad idea to place your N2WS server in a different region than your other production data. N2WS has no problem working across regions and even if you want to perform recovery because of a malfunction in only one of the AZs in your region, if the N2WS server happens to be in that zone, it will not be available.
To make it easy and safe to back up the N2WS server database, there is a special policy named cpmdata. Although N2WS supports managing multiple AWS accounts, the only account that can back up the N2WS server is the one that owns it, i.e., the account used to create it. Define a new policy and name it cpmdata (case insensitive), and N2WS will automatically create a policy that backs up the CPM data volume.
Not all options are available with the cpmdata policy, but you can control Scheduling, Number of generations, and DR settings.
When setting these options, remember that at the time of recovery you will need the most recent copy of this database, since older copies may point to snapshots that no longer exist and will not know about newer ones. Even if you want to recover an instance from a week ago, you should always use the latest backup of the cpmdata policy.
DR recovery is similar to regular recovery with a few differences:
- When you select Recover for a backup that includes DR (DR is in Completed state), you get the same Recovery Panel screen with the addition of a drop-down list.
- The DR Region default is Origin, which will recover all the objects from the original backup. It will perform the same recovery as a policy with no DR.
- When choosing one of the target regions, it will display the objects and will recover them in the selected region.
Volume recovery is the same in any region. For instance recovery, there are a few things that need consideration. An EC2 instance is typically related to other EC2 objects:
- Image ID (AMI)
- Key Pair
- Security Groups
- Kernel ID
- RAMDisk ID
These objects exist in the region of the original instance, but they do not mean anything in the target region. To launch the instance successfully, you will need to replace these original objects with ones from the target region:
- Image ID (AMI) - If you intend to recover the instance from a root device snapshot, you will not need a new image ID. If not (as in all cases with Windows and instance store-based instances), you will need to type a new image ID. If you use AMIs you prepared, you should also prepare them in your target regions and keep their IDs handy for when you need to recover. If needed, AMI Assistant can help you find a matching image. See section 10.3.4.
- Key Pair - You should have a key pair created with AWS Management Console ready so you will not need to create it when you perform a recovery.
- Security Groups - In a regular recovery, N2WS will remember the security groups of the original instance and use them as default. In DR recovery, N2WS cannot choose for you. You need to choose at least one, or the instance recovery screen will display an error. Security groups are objects you own, and you can easily create them in AWS Management Console. You should have them ready so you will not need to create them when you perform recovery. See section 16.2.4.
- Kernel ID - Linux instances need a kernel ID. If you are launching the instance from an image, you can leave this field empty, and N2WS will use the kernel ID specified in the AMI. If you are recovering the instance from a root device snapshot, you need to find a matching kernel ID in the target region. If you do not, a default kernel will be used, and although the recovery operation will succeed and the instance will show as running in AWS Management Console, it will most likely not work. AMI Assistant can help you find a matching image in the target region. See section 10.3.4. When you find such an AMI, copy and paste its kernel ID from the AMI Assistant window.
- RAMDisk ID - Many instances do not need a RAM disk at all and this field can be left empty. If you need it, you can use AMI Assistant the same way you do for Kernel ID. If you’re not sure, use the AMI Assistant or start a local recovery and see if there is a value in the RAMDisk ID field.
N2WS can add a tag with an AMI ID to a resource during backup. The tag will hold the AMI ID that is expected to be present on the AWS account in case of recovery to a different AWS account.
Example of tag format that will be used only on the region/account combination specified:
Key = 'cpm_dr_recover_ami:REGION:ACCOUNT'; Value = 'ami-XXXXX'
In this case, the region and account are optional.
Example of tag format for a tag that will be used on any region/account combination:
Key = 'cpm_dr_recover_ami'; Value = 'ami-XXXXX'
When this tag is found and there is no other suitable option for instance recovery, N2WS uses this AMI if the recovery region and account match.
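The matching described above can be pictured with a small lookup. This is a hedged sketch rather than N2WS’s actual logic: the function name is hypothetical, the AMI IDs and account number are placeholders, and the precedence (region/account-specific tag first, then the generic tag) is an assumption.

```python
TAG_PREFIX = "cpm_dr_recover_ami"


def pick_recovery_ami(tags, region, account):
    """Return the AMI ID to use for a recovery target, or None.

    tags: dict of tag key -> tag value found on the backed-up resource.
    Assumption: a tag scoped to this region/account wins over the generic tag.
    """
    specific = tags.get(f"{TAG_PREFIX}:{region}:{account}")
    if specific:
        return specific
    return tags.get(TAG_PREFIX)


resource_tags = {
    "cpm_dr_recover_ami:us-west-1:123456789012": "ami-11111111",  # placeholder
    "cpm_dr_recover_ami": "ami-22222222",                         # placeholder
}

# Recovery to the region/account named in the specific tag.
print(pick_recovery_ami(resource_tags, "us-west-1", "123456789012"))

# Recovery anywhere else falls back to the generic tag.
print(pick_recovery_ami(resource_tags, "eu-west-1", "123456789012"))
```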
N2WS supports DR of encrypted EBS volumes. If you are using AWS KMS keys for encryption:
- N2WS will seek a KMS key in the target region, which has the same alias.
- The AWS ID of the DR account should be added to the ‘Other AWS accounts’ section on a Backup account.
To configure your cross-region DR:
Create a matching-alias key in the source and in the remote region for N2WS to use automatically in the DR copy process:
- If a matching key is not found in the target region, the DR process will fail.
- If a volume is encrypted with the default encryption key, it will be copied to the other region using that region’s default encryption key as well.
- N2WS supports copy of AMIs with encrypted volumes with the same logic it uses for volumes.
- N2WS supports cross-region DR of encrypted RDS databases, except for the Asia Pacific (Hong Kong) region.
To add the AWS ID of the DR account to the ‘Other AWS accounts’ section of KMS on a Backup account:
1. Log on to your Backup AWS account and navigate to the KMS console.
2. Select your Customer managed keys.
3. Go to the Other AWS accounts section.
4. Select Add other AWS accounts.
5. In the box, enter the AWS account ID of the DR account.
To support the usage of a custom encryption key for DR, do the following in AWS:
1. In the account where the custom key resides:
   1. Go to KMS and browse to the key you wish to share.
   2. Go to Other AWS accounts at the bottom of the page and select Add other AWS accounts.
   3. Add the ID of the DR account you wish to share the key with.
2. Go to the volume you wish to copy to the DR account and/or region and add the following tag:
   1. The tag’s “key” = cpm_dr_encryption_key
   2. The tag’s “value” = the full ARN of the encryption key you shared in step 1.
   3. If you perform cross-region DR, you will need a key in each region, as AWS does not allow sharing encryption keys across regions. The tag’s “key” should include the region where the key is. For example, a tag for an Ohio key will have key = cpm_dr_encryption_key:us-east-2 and, as its value, the full ARN of the key in that region.
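For cross-region DR, the region-specific tag key can be derived from the key’s ARN, since a KMS key ARN carries its region in the fourth colon-separated field. The following is a minimal sketch under that assumption; the helper name is hypothetical and the ARN is a placeholder, not a real key.

```python
def encryption_key_tag(key_arn):
    """Build the region-specific DR encryption tag from a KMS key ARN."""
    parts = key_arn.split(":")
    # Expected shape: arn:aws:kms:REGION:ACCOUNT:key/KEY_ID
    if len(parts) < 6 or parts[2] != "kms":
        raise ValueError(f"Not a KMS key ARN: {key_arn}")
    region = parts[3]
    return {"Key": f"cpm_dr_encryption_key:{region}", "Value": key_arn}


# Placeholder ARN of a key in the Ohio (us-east-2) region.
ohio_key_arn = (
    "arn:aws:kms:us-east-2:123456789012:key/1234abcd-12ab-34cd-56ef-1234567890ab"
)
print(encryption_key_tag(ohio_key_arn))
```

Building the tag from the ARN avoids a mismatch between the region in the tag key and the region the key actually lives in.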
Let’s assume a real disaster recovery scenario: The region of your operation is completely down. It means that you do not have your instances or EBS volumes, and you do not have your N2WS Server, as it is down with all the rest of your instances. Here is Disaster Recovery step by step:
1. In the AWS Management Console:
   1. Find the latest snapshot of your cpmdata policy by filtering snapshots with the string cpmdata. N2WS always adds the policy name to any snapshot’s description.
   2. Sort by Started in descending order and it will be the first one on the list.
   3. Create a volume from this snapshot by right-clicking it and choosing Create Volume from Snapshot. You can give the new volume a name so it will be easy to find later.
3. As with the regular configuration of an N2WS server:
   1. Connect to the newly created instance using HTTPS.
   2. Approve the SSL certificate exception. Assuming the original instance still exists, N2WS will come up in recovery mode, which means that the new server will perform recovery and not backup.
   3. If you are running the BYOL edition and need an activation key, most likely you do not have a valid key at the time, and you do not want to wait until you can acquire one from N2W Software. You can quickly register at N2WS Free Edition. In step 2 of the registration, use your own username, and type a strong password (see section 16.2.3). In step 3, choose the volume you just created for the CPM data volume. Afterward, complete the configuration.
4. With a working N2WS server, you can perform any recovery you need at the target (current) region:
   1. Select the backup you want to recover.
   3. Choose the target region from the drop-down list.
   4. You can recover all the backed-up objects that are available in the region.
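The filter-and-sort in step 1 can also be expressed in code. The sketch below works on a plain list of snapshot records shaped like the entries of an EC2 DescribeSnapshots response (SnapshotId, Description, StartTime); the function name, the sample descriptions, and the snapshot IDs are made up for illustration, and matching the policy name case-insensitively is an assumption.

```python
from datetime import datetime, timezone


def latest_policy_snapshot(snapshots, policy_name="cpmdata"):
    """Return the most recently started snapshot whose description names the policy."""
    needle = policy_name.lower()
    matching = [
        snap for snap in snapshots
        if needle in snap.get("Description", "").lower()
    ]
    if not matching:
        return None
    # Equivalent to sorting by Started in descending order and taking the first.
    return max(matching, key=lambda snap: snap["StartTime"])


# Sample records; descriptions and IDs are placeholders.
sample_snapshots = [
    {"SnapshotId": "snap-aaa", "Description": "backup of policy cpmdata",
     "StartTime": datetime(2024, 1, 1, 6, 0, tzinfo=timezone.utc)},
    {"SnapshotId": "snap-bbb", "Description": "backup of policy cpmdata",
     "StartTime": datetime(2024, 1, 2, 6, 0, tzinfo=timezone.utc)},
    {"SnapshotId": "snap-ccc", "Description": "backup of policy sqlserver1",
     "StartTime": datetime(2024, 1, 3, 6, 0, tzinfo=timezone.utc)},
]

print(latest_policy_snapshot(sample_snapshots)["SnapshotId"])
```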
DR is a straightforward process. If DR fails, it probably means that either a copy operation failed, which is not common, or that the process timed out. You can track DR’s progress in the Recovery Monitor Log screen, where every stage and operation during DR is recorded.
You can also view DR snapshot IDs and statuses in the View Snapshots screen of the Backup Monitor.
Every DR snapshot is displayed with region information and the IDs of both the original and the copied snapshots. In the Snapshots list, you can choose to Delete All AWS Snapshots in This Backup.
If DR fails, you will not be able to use DR recovery. However, some of the snapshots may exist and be recoverable. You can see them in the snapshots screen and, if needed, you can recover from them manually.
If DR keeps failing because of timeouts, you may need to increase the timeout value for the relevant policy. The default of 24 hours should be enough, but a very large amount of data may take longer to copy.