AutoRecovery in the Public Cloud

16 Jul 2020

“Everything fails, all the time.” – Werner Vogels

While VNS3 is extremely stable, it is not immune to the underlying hardware and network issues that public cloud vendors experience. VNS3 provides a variety of methods to achieve High Availability and instance replacement. However, all of that takes place above the customer responsibility line. What can you do for your cloud deployment to protect yourself from the inevitable failures that take place below the line?

On top of the solutions offered by public cloud providers, Cohesive Networks offers a variety of methods for achieving instance and network recovery: BGP distance weighting, Cisco-style Preferred Peer lists, and our Management Server (VNS3:ms), which will programmatically replace a running instance or facilitate Active & Passive running of VNS3 instances. Keep an eye on this blog space for further discussions in these key areas.

AutoRecovery in AWS via CloudWatch

Amazon Web Services has perhaps the most comprehensive function for protecting yourself from underlying failures: what they call a CloudWatch alarm action. The alarm is tied to your instance ID. Should AWS status checks fail, your instance will be brought up on new hardware while retaining its instance ID, private IP, any Elastic IPs and all associated metadata. You set the period of each check and the number of failed checks that will kick off the migration, so if you need assurance that your instance will be moved to good hardware after as little as two minutes, you can configure it that way. From a VNS3 perspective, this ensures that any IPsec tunnels will be reestablished, any overlay clients will reconnect and any route table rules pointing to the instance will remain valid once the instance has recovered. On top of all of this, you can configure the alarm to publish any state changes to an SNS topic so that you are notified should this occur. Cohesive Networks highly recommends that you set this up for all VNS3 controllers and Management Servers.
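As a rough illustration of the setup described above, here is a minimal Python sketch using boto3. It assumes the alarm watches the StatusCheckFailed_System metric and triggers the EC2 recover action, with a 60-second period and two evaluation periods to match the two-minute example. The function names, the alarm name, the instance ID and the SNS topic ARN are hypothetical placeholders, not values from VNS3 or AWS documentation.

```python
def build_recovery_alarm(instance_id, region, sns_topic_arn=None,
                         period_seconds=60, evaluation_periods=2):
    """Build the parameters for a CloudWatch alarm that recovers an instance.

    The alarm name below is a hypothetical naming scheme chosen for this sketch.
    """
    # The recover action moves the instance to new hardware and keeps its identity.
    actions = [f"arn:aws:automate:{region}:ec2:recover"]
    if sns_topic_arn:
        # Optionally notify an SNS topic whenever the alarm fires.
        actions.append(sns_topic_arn)
    return {
        "AlarmName": f"vns3-autorecover-{instance_id}",
        "Namespace": "AWS/EC2",
        "MetricName": "StatusCheckFailed_System",
        "Dimensions": [{"Name": "InstanceId", "Value": instance_id}],
        "Statistic": "Maximum",
        "Period": period_seconds,                  # length of each check, in seconds
        "EvaluationPeriods": evaluation_periods,   # failed checks before acting
        "Threshold": 1.0,
        "ComparisonOperator": "GreaterThanOrEqualToThreshold",
        "AlarmActions": actions,
    }


def create_recovery_alarm(instance_id, region, sns_topic_arn=None):
    """Create the alarm in AWS (requires boto3 and valid AWS credentials)."""
    import boto3  # imported here so the builder above works without boto3 installed

    cloudwatch = boto3.client("cloudwatch", region_name=region)
    cloudwatch.put_metric_alarm(
        **build_recovery_alarm(instance_id, region, sns_topic_arn)
    )
```

Calling create_recovery_alarm("i-0123456789abcdef0", "us-east-1", "arn:aws:sns:us-east-1:123456789012:vns3-alerts") with your own instance ID and topic ARN would create one alarm per VNS3 controller. Check the current AWS documentation for which instance types support the recover action before relying on it.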

You can find out more about configuring AWS CloudWatch alarm actions here:

https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/UsingAlarmActions.html

Service Healing in Azure Cloud

The Microsoft Azure cloud has the concept of “Service Healing.” While it is not user configurable, it is similar to the AWS approach in that Azure monitors the underlying health of the virtual machines and hypervisors in its data centers and will auto-recover virtual machines should they or their hypervisors fail. This process is managed by Azure's Fabric Controllers, which themselves have built-in fault tolerance. As of now, Azure provides neither user controls over this process nor notifications. The process can take up to 15 minutes to complete, since the first action is to reboot the physical server that the virtual machines run on; only if that fails does Azure migrate the VMs to other hardware. Azure does state that they employ some level of deterministic methodologies for proactive auto-recovery.

Live Migration in Google Cloud

The Google Cloud Platform has taken a fairly different approach. In Google Cloud, all instances are set to “Live Migrate” by default. Should there be a hardware degradation rather than a total failure, your VM will be migrated to new hardware with some loss of performance during the process. If there is a total failure, your VM will be rebooted onto new hardware. This also applies to any planned maintenance that might affect the underlying hardware your VM is running on. As with AWS and Azure, all of your instance identity, such as IPs, volume data and metadata, will transfer with the VM. Should you want to forgo “Live Migration,” you can configure your instances to just reboot onto new hardware. All failed hardware events in GCP are logged at the host level and can be alerted on.
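To make the choice between live migration and a plain reboot concrete, here is a minimal Python sketch that builds the scheduling settings for a Compute Engine instance. It assumes the Compute Engine v1 API's instances.setScheduling method and the google-api-python-client library. The function names and the project, zone and instance arguments are hypothetical placeholders.

```python
def build_scheduling(live_migrate=True):
    """Build a Compute Engine scheduling body.

    MIGRATE keeps the default live-migration behaviour; TERMINATE makes the VM
    stop and restart on new hardware instead of migrating live.
    """
    return {
        "onHostMaintenance": "MIGRATE" if live_migrate else "TERMINATE",
        # Restart the VM automatically if the host fails outright.
        "automaticRestart": True,
    }


def apply_scheduling(project, zone, instance, live_migrate=True):
    """Apply the policy to one instance (requires google-api-python-client
    and application default credentials)."""
    from googleapiclient import discovery  # imported lazily; only needed for the API call

    compute = discovery.build("compute", "v1")
    request = compute.instances().setScheduling(
        project=project,
        zone=zone,
        instance=instance,
        body=build_scheduling(live_migrate),
    )
    return request.execute()
```

Calling apply_scheduling("my-project", "us-central1-a", "vns3-controller", live_migrate=False) would opt that one instance out of live migration; the argument values here are illustrative only.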