Time and again, we have seen how AI helps companies accelerate what’s possible by streamlining operations, personalizing customer interactions, and bringing new products and experiences to market. The shifts of the last year around generative AI and foundation models are accelerating the adoption of AI within organizations as companies see what technologies like Azure OpenAI Service can do. These shifts have also highlighted the need for new tools and processes, as well as a fundamental change in how technical and non-technical teams collaborate to manage their AI practices at scale.
This shift is often referred to as LLMOps (large language model operations). Even before the term came into use, Azure AI already offered many tools to support healthy LLMOps, building on its foundations as an MLOps (machine learning operations) platform. At our Build event last spring, we introduced a new capability in Azure AI called prompt flow, which sets a new bar for what LLMOps can look like, and last month we released the public preview of prompt flow’s code-first experience in the Azure AI Software Development Kit (SDK), Command Line Interface (CLI), and VS Code extension.
Today, we want to go into a little more detail about LLMOps generally, and LLMOps in Azure AI specifically. To share our learnings with the industry, we are launching this new blog series dedicated to LLMOps for foundation models, diving deeper into what it means for organizations around the globe. The series will examine what makes generative AI so unique and how it can meet current business challenges, as well as how it drives new forms of collaboration between teams working to build the next generation of apps and services. The series will also ground organizations in responsible AI approaches and best practices, as well as data governance considerations, as companies innovate now and into the future.
From MLOps to LLMOps
While the latest foundation model often dominates the headlines, building systems that use LLMs involves many intricacies: selecting the right models, designing architecture, orchestrating prompts, embedding them into applications, checking them for groundedness, and monitoring them with responsible AI toolchains. Customers who have already started their MLOps journey will find that the techniques used in MLOps pave the way for LLMOps.
Unlike traditional ML models, which often have more predictable output, LLMs can be non-deterministic, which forces us to adopt a different way of working with them. A data scientist today might be used to controlling the training and testing data, setting weights, using tools like the responsible AI dashboard in Azure Machine Learning to identify biases, and monitoring the model in production.
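To make that non-determinism concrete, here is a minimal sketch using the `openai` Python package against an Azure OpenAI resource. The deployment name `gpt-4` and the API version are assumptions; the point is simply that repeated calls with a non-zero sampling temperature can return different completions, which changes how you have to test and evaluate.

```python
import os
from openai import AzureOpenAI  # pip install openai

# Assumes an Azure OpenAI resource with a chat model deployment named "gpt-4".
client = AzureOpenAI(
    azure_endpoint=os.environ["AZURE_OPENAI_ENDPOINT"],
    api_key=os.environ["AZURE_OPENAI_API_KEY"],
    api_version="2024-02-01",
)

prompt = "Summarize the benefits of LLMOps in one sentence."

# With a non-zero temperature, repeated calls may return different text.
for i in range(3):
    response = client.chat.completions.create(
        model="gpt-4",  # deployment name (assumption)
        messages=[{"role": "user", "content": prompt}],
        temperature=0.8,
    )
    print(f"run {i}: {response.choices[0].message.content}")

# Lowering the temperature (for example to 0) makes output more repeatable,
# but it is still not guaranteed to be identical across calls.
```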
Most of these techniques still apply to modern LLM-based systems, but new ones join them: prompt engineering, data grounding, vector search configuration, chunking, embedding, safety systems, and testing and evaluation become cornerstones of the best practices.
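As an illustration of two of those cornerstones, the sketch below splits a document into overlapping chunks and embeds each one so it can later be indexed for vector search. The chunk size, overlap, source file name, and the embedding deployment name `text-embedding-ada-002` are all assumptions for the example, not prescriptions.

```python
import os
from openai import AzureOpenAI  # pip install openai

client = AzureOpenAI(
    azure_endpoint=os.environ["AZURE_OPENAI_ENDPOINT"],
    api_key=os.environ["AZURE_OPENAI_API_KEY"],
    api_version="2024-02-01",
)

def chunk_text(text: str, chunk_size: int = 1000, overlap: int = 200) -> list[str]:
    """Split text into fixed-size character chunks with overlap so context isn't lost at boundaries."""
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        start += chunk_size - overlap
    return chunks

def embed_chunks(chunks: list[str]) -> list[list[float]]:
    """Embed each chunk with an Azure OpenAI embedding deployment (deployment name is an assumption)."""
    result = client.embeddings.create(model="text-embedding-ada-002", input=chunks)
    return [item.embedding for item in result.data]

document = open("handbook.txt", encoding="utf-8").read()  # hypothetical source document
chunks = chunk_text(document)
vectors = embed_chunks(chunks)
print(f"{len(chunks)} chunks, {len(vectors[0])} dimensions each")
```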
Like MLOps, LLMOps is more than technology or product adoption. It’s a confluence of the people engaged in the problem space, the process you use, and the products that implement them. Companies deploying LLMs to production typically involve multidisciplinary teams across data science, user experience design, and engineering, often with engagement from compliance or legal teams and subject matter experts. As the system grows, the team needs to be ready to think through often complex questions, such as how to deal with the variance you might see in model output, or how best to tackle a safety issue.
Overcoming LLM-powered application development challenges
Creating an application built around an LLM has three phases:
- Startup or initialization—During this phase, you select your business use case and often work to get a proof of concept up and running quickly. Selecting the user experience you want, the data you want to pull into the experience (for example, through retrieval augmented generation), and answering the business questions about the impact you expect are all part of this phase. In Azure AI, you might create an Azure AI Search index over your data and use the user interface to add your data to a model like GPT-4, creating an endpoint to get started (a minimal retrieval sketch follows this list).
- Evaluation and refinement—Once the proof of concept exists, the work turns to refinement: experimenting with different meta prompts, different ways to index the data, and different models. Using prompt flow, you can create these flows and experiments, run a flow against sample data, evaluate the prompt’s performance, and iterate on the flow if necessary. You then assess the flow’s performance by running it against a larger dataset, refine the prompt as needed, and proceed to the next stage once the results meet the desired criteria (a simplified evaluation loop is sketched after this list).
- Production—Once the system behaves as you expect in evaluation, you deploy it using your standard DevOps practices and use Azure AI to monitor its performance in a production environment and gather usage data and feedback. This information then feeds back into the flow and into earlier stages for further iteration.
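For the startup phase, a minimal "add your data" pattern might look like the following sketch, which retrieves passages from an Azure AI Search index and passes them to a chat model as grounding context. The index name `product-docs`, the `content` field, and the `gpt-4` deployment name are assumptions for illustration, not a prescribed setup.

```python
import os
from azure.core.credentials import AzureKeyCredential
from azure.search.documents import SearchClient  # pip install azure-search-documents
from openai import AzureOpenAI  # pip install openai

search_client = SearchClient(
    endpoint=os.environ["AZURE_SEARCH_ENDPOINT"],
    index_name="product-docs",  # hypothetical index
    credential=AzureKeyCredential(os.environ["AZURE_SEARCH_KEY"]),
)
openai_client = AzureOpenAI(
    azure_endpoint=os.environ["AZURE_OPENAI_ENDPOINT"],
    api_key=os.environ["AZURE_OPENAI_API_KEY"],
    api_version="2024-02-01",
)

def answer(question: str) -> str:
    # Retrieve the top matching passages for the question.
    results = search_client.search(search_text=question, top=3)
    context = "\n\n".join(doc["content"] for doc in results)  # "content" field is an assumption

    # Ground the model's answer in the retrieved passages.
    response = openai_client.chat.completions.create(
        model="gpt-4",  # deployment name (assumption)
        messages=[
            {"role": "system", "content": "Answer using only the provided context."},
            {"role": "user", "content": f"Context:\n{context}\n\nQuestion: {question}"},
        ],
        temperature=0,
    )
    return response.choices[0].message.content

print(answer("How do I reset my device?"))
```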
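For the evaluation and refinement phase, prompt flow manages this end to end, but the underlying idea can be sketched in a few lines: run each prompt variant against a sample dataset and score the outputs against expected answers. The scoring below is a naive substring match purely for illustration; in practice you would use richer metrics such as groundedness, relevance, or similarity, and the `generate` helper is a hypothetical stand-in for a call to your deployed flow or model.

```python
# A simplified evaluation loop: run each prompt variant over a small labeled
# dataset and score the outputs. Prompt flow automates this pattern at scale.

dataset = [
    {"question": "What is the return window?", "expected": "30 days"},
    {"question": "Which regions are supported?", "expected": "US and EU"},
]

prompt_variants = {
    "terse": "Answer briefly: {question}",
    "grounded": "Using only the product documentation, answer: {question}",
}

def generate(prompt: str) -> str:
    """Placeholder for a call to your deployed flow or model."""
    return "Returns are accepted within 30 days."  # canned output so the sketch runs

def score(output: str, expected: str) -> float:
    # Substring match stands in for real metrics such as groundedness or relevance.
    return 1.0 if expected.lower() in output.lower() else 0.0

for name, template in prompt_variants.items():
    scores = [
        score(generate(template.format(question=row["question"])), row["expected"])
        for row in dataset
    ]
    print(f"{name}: accuracy = {sum(scores) / len(scores):.2f}")
```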
Microsoft is committed to continuously improving the reliability, privacy, security, inclusiveness, and accuracy of Azure. Our focus on identifying, quantifying, and mitigating potential generative AI harms is unwavering. With sophisticated natural language processing (NLP) content and code generation capabilities through large language models (LLMs) like Llama 2 and GPT-4, we have designed custom mitigations to ensure responsible solutions. By mitigating potential issues before an application reaches production, we streamline LLMOps and help refine operational readiness plans.
As part of your responsible AI practices, it’s essential to monitor results for bias and misleading or false information, and to address data groundedness concerns throughout the process. The tools in Azure AI are designed to help, including prompt flow and Azure AI Content Safety, but much of the responsibility sits with the application developer and data science team.
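As one concrete example, Azure AI Content Safety exposes a client library for screening text before it is shown to users. The sketch below is a simplified illustration; the severity threshold and the fallback behavior are assumptions you would tune to your own policies.

```python
import os
from azure.core.credentials import AzureKeyCredential
from azure.ai.contentsafety import ContentSafetyClient  # pip install azure-ai-contentsafety
from azure.ai.contentsafety.models import AnalyzeTextOptions

client = ContentSafetyClient(
    endpoint=os.environ["CONTENT_SAFETY_ENDPOINT"],
    credential=AzureKeyCredential(os.environ["CONTENT_SAFETY_KEY"]),
)

def is_safe(text: str, max_severity: int = 2) -> bool:
    """Return False if any harm category exceeds the chosen severity threshold (threshold is an assumption)."""
    result = client.analyze_text(AnalyzeTextOptions(text=text))
    return all((item.severity or 0) <= max_severity for item in result.categories_analysis)

model_output = "..."  # a response produced by your LLM
if not is_safe(model_output):
    model_output = "I'm sorry, I can't share that."  # fall back rather than surface unsafe content
```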
By adopting a design-test-revise approach during production, you can strengthen your application and achieve better outcomes.
How Azure helps companies accelerate innovation
Over the last decade, Microsoft has invested heavily in understanding how people across organizations interact with developer and data scientist toolchains to build and create applications and models at scale. More recently, our work with customers, and the work we have done to create our own Copilots, has taught us much: we have gained a better understanding of the model lifecycle and created tools in the Azure AI portfolio to help streamline the process for LLMOps.
Pivotal to LLMOps is an orchestration layer that bridges user inputs with underlying models, ensuring precise, context-aware responses.
A standout capability of LLMOps on Azure is prompt flow. It facilitates scalable orchestration of LLMs, adeptly managing multiple prompt patterns with precision, and it ensures robust version control, seamless continuous integration and continuous delivery (CI/CD), and continuous monitoring of LLM assets. These attributes significantly enhance the reproducibility of LLM pipelines and foster collaboration among machine learning engineers, app developers, and prompt engineers, helping developers achieve consistent experiment results and performance.
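For teams working code-first, the prompt flow SDK exposes this orchestration programmatically. Below is a minimal sketch assuming the `promptflow` Python package, a local flow folder `./my_chat_flow`, and a JSONL file of sample inputs; these names are placeholders, and exact method signatures may differ across preview releases.

```python
from promptflow import PFClient  # pip install promptflow

pf = PFClient()

# Test a single set of inputs against a local flow folder (path is an assumption).
result = pf.test(flow="./my_chat_flow", inputs={"question": "What is LLMOps?"})
print(result)

# Run the flow against a batch of sample inputs, then inspect per-row outputs.
run = pf.run(flow="./my_chat_flow", data="./samples.jsonl")
details = pf.get_details(run)
print(details.head())
```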
In addition, data processing forms a crucial facet of LLMOps. Azure AI is engineered to integrate seamlessly with any data source and is optimized to work with Azure data sources, from vector indices such as Azure AI Search to data platforms and storage such as Microsoft Fabric, Azure Data Lake Storage Gen2, and Azure Blob Storage. This integration gives developers easy access to data, which can be leveraged to augment the LLMs or fine-tune them to align with specific requirements.
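To illustrate one integration point, the sketch below uploads pre-embedded chunks into an Azure AI Search index. It assumes an index named `product-docs` already exists with `id`, `content`, and `contentVector` fields, and that the placeholder vectors would come from an embedding step like the one shown earlier; your schema and dimensions will differ.

```python
import os
from azure.core.credentials import AzureKeyCredential
from azure.search.documents import SearchClient  # pip install azure-search-documents

search_client = SearchClient(
    endpoint=os.environ["AZURE_SEARCH_ENDPOINT"],
    index_name="product-docs",  # hypothetical index with id/content/contentVector fields
    credential=AzureKeyCredential(os.environ["AZURE_SEARCH_KEY"]),
)

# `chunks` and `vectors` would come from a chunking and embedding step.
chunks = ["Our return window is 30 days.", "Support is available in the US and EU."]
vectors = [[0.0] * 1536, [0.0] * 1536]  # placeholder embeddings (1536 dims for ada-002 is an assumption)

documents = [
    {"id": str(i), "content": chunk, "contentVector": vector}
    for i, (chunk, vector) in enumerate(zip(chunks, vectors))
]
result = search_client.upload_documents(documents=documents)
print(f"Uploaded {sum(1 for r in result if r.succeeded)} of {len(documents)} documents")
```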
And while we talk a lot about the OpenAI frontier models like GPT-4 and DALL-E that run as Azure AI services, Azure AI also includes a robust model catalog of foundation models including Meta’s Llama 2, Falcon, and Stable Diffusion. By using pre-trained models through the model catalog, customers can reduce development time and computation costs to get started quickly and easily with minimal friction. The broad selection of models lets developers customize, evaluate, and deploy commercial applications confidently with Azure’s end-to-end built-in security and unequaled scalability.
LLMOps now and future
Microsoft offers a wealth of resources to support your success with Azure, including certification courses, tutorials, and training material. Our courses on application development, cloud migration, generative AI, and LLMOps are constantly expanding to meet the latest innovations in prompt engineering, fine-tuning, and LLM app development.
But the innovation doesn’t stop there. Recently, Microsoft unveiled Vision Models in our Azure AI model catalog. With this, Azure’s already expansive catalog now includes a diverse array of curated models available to the community. The vision collection includes image classification, object segmentation, and object detection models, thoroughly evaluated across varying architectures and packaged with default hyperparameters that ensure solid performance right out of the box.
As we approach our annual Microsoft Ignite Conference next month, we will continue to post updates to our product line. Join us this November for more groundbreaking announcements and demonstrations and stay tuned for our next blog in this series.