Today’s question-and-answer-style post comes after I had the chance to sit down with Bryan Heymann, Head of Cloud Architecture at Finastra, to discuss his experience with Azure Site Recovery. Finastra builds and deploys technology on its open software architecture. Our conversation focused on the organization’s journey to replace several disparate disaster recovery (DR) technologies with Azure Site Recovery.
To learn more about achieving resilience in Azure, refer to this whitepaper.
You have been on Azure for a few years now – before we get too deep in DR, can you start with some context on the cloud transformation that you are going through at Finastra?
We think of our cloud journey across three horizons. Currently, we're at “Horizon 0” – consolidating and migrating our core data centers to the cloud with a focus on embracing the latest technologies and reducing total cost of ownership (TCO). The workloads are a combination of production sites and internal employee sites.
Initially, we went through a 6-month review with a third party to identify our datacenter strategy, and decided to select a public cloud. Ultimately, we realized that Microsoft would be a solid partner to help us on our journey. We moved some of our solutions to the cloud and our footprint has organically grown from there.
All of this is what enables us to move on to future “horizons” and continuously keep pace with the latest technology. Most importantly, we’re looking to move up the value chain – so instead of us worrying about standing a server up, patching a server, auditing a server, identity on a server… we’re now ingesting and deploying the right policies for the service (not the server) and taking advantage of the availability, security, and disaster recovery options.
Exciting to hear about the journey you've taken so far. I believe DR is a requirement across all of those horizons, right?
Disaster recovery is front and center for us. We work closely with our clients to regularly test. At this point we have executed more than 50 test failovers. Disaster recovery and backups are non-negotiable standards in our shared environment.
What were you most skeptical about when it came to DR in Azure? What was it that helped you become convinced that Azure Site Recovery was the right choice?
We used just about every tool in our data centers and always had mixed results. We thought that Azure Site Recovery might be the same, but I was glad that we were wrong. We have a strong success rate and even wrote special dashboards to track our recovery point objective (RPO) for a holistic view of our posture! We were skeptical that Site Recovery would be point-and-click capable, and that it would be able to keep up with the amount of change we have when failing over from the East Coast to the West Coast. Our first DR test in Azure, over two years ago now, was actually wildly successful. We did not expect the low RPO that we saw and were delighted. I think this speaks volumes about Azure’s network backbone and how you handle replication, to be that performant.
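For readers wondering what an RPO check behind such a dashboard might compute, here is a minimal sketch in Python. It is not Finastra's dashboard and does not call any Azure Site Recovery API. The `ReplicatedVm` record, the function names, and the sample fleet are all hypothetical, and the code assumes you already have each VM's latest recovery point timestamp from some source.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone


@dataclass
class ReplicatedVm:
    # Hypothetical record; field names are illustrative, not an Azure schema.
    name: str
    latest_recovery_point: datetime


def rpo_lag(vm: ReplicatedVm, now: datetime) -> timedelta:
    """Age of the newest recovery point: the window of data at risk if we failed over now."""
    return now - vm.latest_recovery_point


def rpo_breaches(vms, target: timedelta, now: datetime):
    """Return (name, lag) pairs for VMs whose lag exceeds the target, worst first."""
    lags = [(vm.name, rpo_lag(vm, now)) for vm in vms]
    breaches = [(name, lag) for name, lag in lags if lag > target]
    return sorted(breaches, key=lambda pair: pair[1], reverse=True)


if __name__ == "__main__":
    # Made-up sample data for illustration only.
    now = datetime(2020, 1, 1, 12, 0, tzinfo=timezone.utc)
    fleet = [
        ReplicatedVm("lending-app-01", now - timedelta(minutes=3)),
        ReplicatedVm("payments-db-01", now - timedelta(minutes=42)),
    ]
    for name, lag in rpo_breaches(fleet, timedelta(minutes=15), now):
        print(f"{name} is behind its RPO target by {lag - timedelta(minutes=15)}")
```

Sorting the breaches worst first is a design choice for a dashboard view: the VMs with the most data at risk surface at the top.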
We hear that from a lot of customers. Great to get further validation from you! Could you ‘double click’ on your onboarding experience with Azure Site Recovery, up to that first DR drill?
There wasn't any heavy onboarding, which is a good thing as it really wasn’t needed. It was so intuitive and easy for our team to use. The documentation was very accurate. The point and click capabilities of Site Recovery and the documentation enabled us to onboard and go. It has all been in line with what we needed, without surprises.
What kind of workloads are you protecting with Azure Site Recovery?
All of our virtual machines (VMs) across North America are using Site Recovery, everything from our lending space, to our payment space, to our core space. These systems support thousands of our customers, and each of those has its own end customers, who number in the millions – Site Recovery is our default disaster recovery mechanism across the whole fleet.
Wow, that’s a lot of customers and some sensitive financial spaces so no wonder disaster recovery is such a high priority for your teams. We regularly hear prospective customers asking whether Azure Site Recovery supports Linux – I'd love to understand if you have Linux-based applications using Site Recovery, and what your experience has been with those?
Actually, the very first application for which we deployed Azure Site Recovery was all Linux. Linux support for Site Recovery has been fantastic. We fail over every six months, without any issues. Both the ease of use and the number of times we have tested have increased significantly. We pressed on our normal RPOs to get them down to very, very aggressive levels. Some of our Linux-based applications are complex, but Site Recovery has worked without any issues.
You touched upon DR drills – I'd love to understand what your drill experience has been like?
The experience has been seamless and simple. The application itself may have some configurations that need to be considered during DR drills, such as post-scripts, but those are hammered out quickly. We try to run drills every six months, and at least once a year.
Which features of Azure Site Recovery do you like the most?
I love that I can fail across regions. I also love the granular recovery point reporting. It allows us to see where we may or may not be seeing problems. I'm not sure we ever got that from any other tool. It’s very powerful and it’s graphical user interface based – any Joe could do it; it's not hard to select a VM and replicate it to another region. I especially like the fact that we are only charged for storage in the far-side region so, financially, there's no impact from having warm standbys, and we are still able to hit our RPO.
If you were to go through this journey all over again with Azure Site Recovery, is there anything that you would have done differently?
I would have liked to get our knowledge base and plans in place for a month longer before implementing it. It's just so easy that we were able to blow through most of it, but we did miss a couple of simple things early on, which were easily fixed later in our journey. For example, we found out quickly that we didn't want standard hard drives; we wanted premium.
Looking forward, how do you plan to leverage Azure Site Recovery?
We recently used Azure Site Recovery to move a customer in our payment space from on premises to Azure. We will now get those machines on Site Recovery across Azure regions; we're not going to rebuild the entire platform. It's obviously the de facto choice to get us out, and it is the standard for regional disaster recovery for VMs there. There is no other product used.
People ask me what keeps me up at night. There are really two things: “Are we secure?” and “Can we recover?” – I call it the butterfly effect. When you come in each morning, are you confident that if you cratered a datacenter, you could come up in a different one? I can confidently answer that with yes. We could fail out to another region, with all our data. That's a pretty nice spot to be in, especially when you're sitting in a hyperscale cloud. I know that I have storage replication. I know that I own the network links. To allow somebody to run this stuff on our behalf was a mindset change, but it has really been a positive experience for us.