This post was co-authored by Anubhav Mehendru, Group Engineering Manager, Kaizala.
Mobile-only workers depend on Microsoft Kaizala—a simple and secure work management and mobile messaging app—to get the work done. Since COVID-19 has forced many of us to work from home across the world, Kaizala usage has surged close to 3x from pre-COVID-19. While this is a good opportunity for the product to grow, it has increased pressure on the engineering team to ensure that the service scales along with the increased usage while maintaining the customer promised SLA of 99.99 percent.
Today, we’re sharing some of the learnings about managing and scaling an enterprise grade secure productivity app and the backend service behind it.
Foundation of Kaizala
Kaizala is a productivity tool primarily targeted for mobile-only users and is based on Microservice architecture with Microsoft Azure as the core cloud platform. Our workload runs on Azure Cloud Services, with Azure SQL DB and Azure Blob Storage used for primary storage. We use Azure Cache for Redis to handle caching, and Azure Service Bus and Azure Notification Hub manages async processing of events. Azure Active Directory (Azure AD) is used for our user authentication. We use Azure Data Explorer and Azure Monitoring for data analytics. Azure Pipelines is used for automated safe deployments where we can deploy updates rapidly multiple times in a week with high confidence.
We follow a safe deployment process, ensuring minimal customer impact, and stage wise release of new feature and optimizations with full control on exposure and rollback ability.
In addition, we use a centralized configuration management system where all our config can be controlled, such as exposure of a new feature to a set of users/groups or tenants. We fine grained control on msg processing rate, receipt processing, user classification, priority processing, slow down non-core functionalities etc. This allows us to rapidly prototype new feature and optimization over a user set.
Key resiliency strategies
We employ the following key resilience strategies:
API rate limit
To protect our existing service from misuse, we need to control the incoming calls coming from multiple clients within a safe limit. We incorporated a rate limiter entirely based on in-memory caching that does the work with negligible latency impact on customer operations.
Optimized caching
To provide optimal user experience, we created a generic in-memory caching infra where multiple compute nodes are able the quickly sync back the state changes using Azure Redis PubSub. Using this a significant number of external API calls were avoided which effectively reduced our SQL load.
Prioritize critical operations
In case of overload of service due to heavy customer traffic, we prioritize the critical customer operations such as messaging over non-core operations such as receipts.
Isolation of core components
Core system components that support messaging are now totally isolated from other non-core parts so that any overload does not impact the core messaging operations. The isolation is done at every resource level such as separate compute nodes, separate service bus for event processing and totally separate storage for non-core operations.
Reduction in intra node communication
We made multiple enhancements in our message processing system where we significantly reduced scenarios of intra node communication that caused a heavy intra node dependency and slows down the entire message processing.
Controlled service rollout
We made several changes in our rollout process to ensure controlled rollout of new features and optimizations to minimize and negative customer impact. The deployments moved to early morning slots where the customer load is minimal to prevent any downtime.
Monitoring and telemetry
We setup specific monitoring dashboards to give a quick overview of service health that monitor important parameters, such as CPU consumption, thread count, garbage collection (GC) rate, rate of incoming messages, unprocessed messages, lock contention rate, and connected clients.
GC rate
We have finetuned the options to control the rate of Gen2 GC happening in a cloud service as per the needs of the web and worker instances to ensure minimal latency impact of GC during customer operations.
Node partitioning
Users need to be partitioned across multiple nodes to distribute the ownership responsibility using a consistent hashing mechanism. This master ownership helps in ensuring that only required user's information is stored in the in-memory cache on a particular node.
Active passive user
In large group messaging operations, there are always users who are actively using the app while a lot of users are not active. Our idea is to prioritize message delivery for active users so that the last bucket active user received the message fast.
Serialization optimization
Default JSON serialization is costly when the input output operations are very frequent and burn precious CPU cycles. ProtoBuf offers a fast binary serialization protocol that was leveraged to optimize the operations for large data structures.
Scaling compute
We re-evaluated our compute usage in our internal multiple test and scale environments and judiciously reduced the compute node SKU to optimize as per the needs and optimize cost of goods sold (COGS). While most of our traffic in an Azure region is during the day time, there is minimal load at the night where we leverage to do heavy tasks, such as redundant data cleanup, cache cleanups, GC, database re-indexing, and compliance jobs.
Scaling storage
With increasing scale, the load of receipts became huge on the backend service and was consuming a lot of storage. While critical operations required highly consistent data, the requirement is less for non-critical operations. We moved the receipt to highly available No-SQL storage, which costs a tenth of the SQL storage.
Queries for background operations were spread out lazily to reduce the overall peak load on SQL storage. Certain non-critical Operations were moved from being strongly consistent to eventual consistency model to flatten the peak storage load, thus creating more capacity for additional users.
Our future plans
As the COVID-19 situation continues to be grave, we are expecting an accelerated pace of Kaizala adoption from multiple customers. To keep up with the increase in messaging load and high customer usage, we are working on new enhancements and optimizations to ensure that we remain ahead of the curve including:
- Developing alternative messaging flows where users actively using the app can directly pull group messages even if the backend system is overloaded. Message delivery is prioritized for active users over passive users.
- Aggressively working on distributed in-memory caching of data entities to enable fast user response and alternative designs to keep cache in sync while minimizing stale data.
- Moving to container-based deployment model from the current virtual machine (VM)-based model to bring more agility and reduce operational cost.
- Exploring alternative storage mechanism which scale well with massive write operations for large consumer groups supporting batched data flush in a single connection.
- Actively exploring ideas around active-active service configuration to minimize downtime due to data center outages and minimize Recovery Time Objective (RTO) and Recovery Point Objective (RPO).
- Exploring ideas around moving some of the non-core functionalities to passive scale units to utilize the standby compute/storage resources there.
- Evaluating the dynamic scaling abilities of Azure Cloud services where we can automatically reduce the number of compute nodes during nighttime hours where our user load is less than fifth of the peak load.