At Microsoft, we understand the trust customers place in us by running their most critical workloads on Microsoft Azure. Whether they are retailers with online stores, healthcare providers running vital services, financial institutions processing essential transactions, or technology partners offering solutions to other enterprise customers, any downtime or impact could lead to business loss, interruptions to social services, and events that damage their reputation and erode end-user confidence. In this blog post, we discuss some of the design principles and characteristics we see among the customer leaders we work with closely to enhance the availability of their critical workloads according to their specific business needs.
A commitment to reliability with Azure
As we continue making investments that drive platform reliability and quality, there remains a need for customers to evaluate their technical and business requirements against the options Azure provides to meet availability goals through architecture and configuration. These processes, along with support from Microsoft technical teams, ensure you are prepared and ready in the event of an incident. As part of the shared responsibility model, Azure offers customers various options to enhance reliability. These options involve choices and tradeoffs, such as possible higher operational and consumption costs. You can use the flexibility of cloud services to enable or disable some of these features if your needs change. In addition to technical configuration, it is essential to regularly check your team’s technical and process readiness.
“We serve customers of all sizes in an effort to maximize their return on investment, while offering support on their migration and innovation journey. After a major incident, we participated in executive discussions with customers to provide clear contextual explanations as to the cause and reassurances on actions to prevent similar issues. As product quality, stability, and support experience are important focus areas, a common outcome of these conversations is an enhancement of cooperation between customer and cloud provider for the possibility of future incidents. I’ve asked Director of Executive Customer Engagement, Bryan Tang, from the Customer Support and Service team to share more about the types of support you should seek from your technical Microsoft team & partners.”—Mark Russinovich, CTO, Azure.
Design principles
Building a reliable workload begins with establishing an agreed availability target with your business stakeholders, because that target influences your design and configuration choices. As you continue to measure uptime against that baseline, it is critical to be ready to adopt new services or features that can benefit your workload’s availability, given the pace of cloud innovation. Finally, adopt a continuous validation approach to confirm that your system behaves as designed when incidents occur and to identify weak points early, and make sure your team is ready to partner with Microsoft during major incidents to minimize business disruption. We go into more detail on these design principles:
- Know and measure against your targets
- Continuously assess options and be ready to optimize
- Test, simulate, and be ready
Know and measure against your targets
Some Azure customers have outdated availability targets, or workloads with no targets defined with business stakeholders. For a more extensive treatment of these targets, refer to the guide on using business metrics to design resilient Azure applications. Application owners should revisit their availability targets with their business stakeholders to confirm them, then assess whether their current Azure architecture is designed to support those metrics, including SLA, Recovery Time Objective (RTO), and Recovery Point Objective (RPO). Different Azure services, along with different configurations or SKU levels, carry different SLAs. You need to ensure that your design reflects, at a minimum:
- Defined SLA versus Composite SLA: Your workload architecture is a collection of Azure services. You can run your entire workload on infrastructure as a service (IaaS) virtual machines (VMs) with Storage and Networking across all tiers and microservices, or you can mix in PaaS services such as Azure App Service and Azure Database for PostgreSQL. Each service provides a different SLA depending on the SKU and configuration you select. When we asked customers about the SLA for their workload architecture, we found that some had no SLA, some had an outdated SLA, and some had an unrealistic SLA. The key is to get a confirmed SLA from your business owners and calculate the Composite SLA based on your workload resources. Comparing the two shows how well you meet your business availability objectives.
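As a rough illustration of the Composite SLA calculation, the sketch below multiplies the availability of components that must all be up, and combines independent redundant paths. It assumes component failures are independent, ignores the SLA of whatever routes traffic between redundant paths, and uses made-up availability figures; the function names are hypothetical, and you should look up the published SLA for the SKUs you actually run.

```python
from math import prod

MINUTES_PER_30_DAY_MONTH = 30 * 24 * 60  # 43,200 minutes


def composite_serial(slas):
    """Composite availability when every component must be up (serial dependency)."""
    return prod(slas)


def composite_redundant(slas):
    """Composite availability when any one of several independent paths is enough."""
    return 1 - prod(1 - s for s in slas)


def allowed_downtime_minutes(sla, period_minutes=MINUTES_PER_30_DAY_MONTH):
    """Downtime budget implied by an availability figure over a period."""
    return (1 - sla) * period_minutes


# Illustrative figures only (assumptions, not published SLAs).
web_tier = 0.9995
database = 0.9999

single_region = composite_serial([web_tier, database])
two_regions = composite_redundant([single_region, single_region])

print(f"single region:  {single_region:.4%}")
print(f"two regions:    {two_regions:.6%}")
print(f"monthly budget: {allowed_downtime_minutes(single_region):.1f} min")
```

Comparing the computed figure with the target confirmed by your business owners shows whether the current design can meet it on paper, before any testing.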
Continuously assess options and be ready to optimize
One of the most significant drivers for cloud migration is financial benefit, such as shifting from capital expenditure to operating expenditure and taking advantage of the economies of scale that cloud providers achieve. However, one often-overlooked benefit is our continued investment and innovation in the newest hardware, services, and features.
Many customers have moved their workloads to Azure quickly and simply by replicating their on-premises architecture, without using the extra options and features Azure offers to improve availability and performance. We also see customers treating their cloud resources as pets rather than cattle, instead of seeing them as resources that work together and can be replaced with better options as those become available. We fully understand customer preference, habit, and perhaps concern about a black-box service compared with managing your own VMs, where you perform maintenance and security scans yourself. However, our ongoing innovation and commitment to platform as a service (PaaS) and software as a service (SaaS) give you the opportunity to focus your limited resources and effort on the functions that make your business stand out.
- Architecture reliability recommendations and adoption:
- We make every effort to ensure you have the most specific and up-to-date recommendations through various channels. Our flagship channel is Azure Advisor, which now also supports the Reliability Workbook. We also partner closely with engineering so that any additional recommendations that take time to reach the workbook and Azure Advisor are available for your consideration through the Azure Proactive Resiliency Library (APRL). Together, these provide a comprehensive list of documented recommendations for the Azure services you use.
- Security and data resilience:
- While the previous point focuses on configurations and options for the Azure components that make up your application architecture, it is just as critical to ensure that your most critical asset, your data, is protected and replicated. Architecture gives you a solid foundation to withstand failures at the cloud service level, but you also need data and resource protection against accidental or malicious deletion. Azure offers options such as resource locks and soft delete on your storage accounts. Your architecture is only as solid as the security and identity and access management applied to protect it.
- Assess your options and adopt:
- While many recommendations can be made, implementation ultimately remains your decision. We understand that changing your architecture might not be just a matter of modifying your deployment template: you want to ensure your test cases are comprehensive, and the change may involve time, effort, and cost to run your workloads. Our field teams are prepared to help you explore options and tradeoffs, but the decision to enhance availability to meet your stakeholders’ business requirements is yours. This openness to change applies not only to reliability but also to other aspects of the Well-Architected Framework, such as Cost Optimization.
Test, simulate, and be ready
Testing is a continuous process at both the technical and the process level, and automation is a key part of it. Beyond the paper-based exercise of selecting the right SKUs and configurations of cloud resources to reach the right Composite SLA, applying Chaos Engineering to your testing helps find weaknesses and verify readiness that a paper exercise cannot. It is also critical to monitor your application so you can detect disruptions and react quickly to recover. Finally, knowing how to engage Microsoft support effectively, when needed, helps set proper expectations with your stakeholders and end users in the event of an incident.
- Continuous validation with Chaos Engineering: When you operate a distributed application, with microservices and dependencies between centralized services and workloads, a chaos mindset helps inspire confidence in your resilient architecture design by proactively finding weak points and validating your mitigation strategy. For customers striving for DevOps success through automation, continuous validation (CV) has become a critical component of reliability alongside continuous integration (CI) and continuous delivery (CD). Simulating failure also helps you understand how your application behaves under partial failure, how your design responds to infrastructure issues, and the overall level of impact on end users. Azure Chaos Studio is now generally available to assist you with this ongoing validation.
- Detect and react: Ensure your workload is monitored at the application and component level for a comprehensive health view. For instance, Azure Monitor helps you collect, analyze, and respond to monitoring data from your cloud and on-premises environments. Azure also offers a suite of experiences to keep you informed about the health of your cloud resources: Azure Status informs you of Azure service outages, Service Health provides service-impacting communications such as planned maintenance, and Resource Health reports on individual resources such as a VM.
- Incident response plan: Partner closely with our technical support teams to jointly develop an incident response plan. The plan is essential to developing shared accountability between you and Microsoft as we work toward resolving your incident. It should cover the basics of who does what, and when, so that you and we can partner toward a quick resolution. Our teams are also ready to run test drills with you to validate this response plan for our joint success.
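To make the idea of simulating failure concrete, here is a minimal local sketch of a fault-injection experiment. It is not Azure Chaos Studio and calls no Azure API. It wraps a stand-in dependency so that a chosen fraction of calls fail, then checks how often a retry-then-fallback strategy still returns a full response. All names (`make_flaky`, `call_with_retry_and_fallback`, `run_experiment`) and the failure rates are hypothetical.

```python
import random


class DependencyDown(Exception):
    """Raised when the simulated dependency fails."""


def make_flaky(func, failure_rate, rng):
    """Wrap func so that roughly failure_rate of calls raise an injected fault."""
    def wrapper(*args, **kwargs):
        if rng.random() < failure_rate:
            raise DependencyDown("injected fault")
        return func(*args, **kwargs)
    return wrapper


def call_with_retry_and_fallback(primary, fallback, attempts=3):
    """Try the primary a few times, then return a degraded fallback response."""
    for _ in range(attempts):
        try:
            return primary()
        except DependencyDown:
            continue
    return fallback()


def run_experiment(failure_rate, requests=10_000, seed=42):
    """Return the fraction of requests served by the primary under injected faults."""
    rng = random.Random(seed)  # seeded so the experiment is repeatable
    primary = make_flaky(lambda: "fresh", failure_rate, rng)

    def fallback():
        return "cached"

    results = [
        call_with_retry_and_fallback(primary, fallback) for _ in range(requests)
    ]
    return results.count("fresh") / requests


if __name__ == "__main__":
    for rate in (0.0, 0.1, 0.3):
        print(f"failure rate {rate:.0%}: full responses {run_experiment(rate):.2%}")
```

Running an experiment like this in a pipeline, with a pass threshold taken from your availability target, is one way to treat continuous validation as a gate alongside CI and CD. A real experiment would inject faults into deployed resources rather than an in-process function.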
Ultimately, your desired reliability is an outcome you can achieve only if you take all these approaches into account and stay willing to update and optimize. Building application resilience is not a single feature or phase, but a muscle that your teams will build, learn, and strengthen over time. For more details, check out our Well-Architected Framework guidance, and consult with your Microsoft team, whose only objective is for you to realize full business value on Azure.