Microsoft makes it easier to build popular language representation model BERT at large scale

This post is co-authored by Rangan Majumder, Group Program Manager, Bing and Maxim Lukiyanov, Principal Program Manager, Azure Machine Learning.

Today we are announcing the open sourcing of our recipe to pre-train BERT (Bidirectional Encoder Representations from Transformers) built by the Bing team, including code that works on Azure Machine Learning, so that customers can unlock the power of training custom versions of BERT-large models using their own data. This will enable developers and data scientists to build their own general-purpose language representation beyond BERT.

The area of natural language processing has seen an incredible amount of innovation over the past few years with one of the most recent being BERT. BERT, a language representation created by Google AI language research, made significant advancements in the ability to capture the intricacies of language and improved the state of the art for many natural language applications, such as text classification, extraction, and question answering. The creation of this new language representation enables developers and data scientists to use BERT as a stepping-stone to solve specialized language tasks and get much better results than when building natural language processing systems from scratch.

The broad applicability of BERT means that most developers and data scientists are able to use a pre-trained variant of BERT rather than building a new version from the ground up with new data. While this is a reasonable solution if the domain’s data is similar to the original model’s data, it will not deliver best-in-class accuracy when crossing over to a new problem space. For example, training a model for the analysis of medical notes requires a deep understanding of the medical domain, providing career recommendations depend on insights from a large corpus of text about jobs and candidates, and legal document processing requires training on legal domain data. In these cases, to maximize the accuracy of the Natural Language Processing (NLP) algorithms one needs to go beyond fine-tuning to pre-training the BERT model.

Additionally, to advance language representation beyond BERT’s accuracy, users will need to change the model architecture, training data, cost function, tasks, and optimization routines. All these changes need to be explored at large parameter and training data sizes. In the case of BERT-large, this can be quite substantial as it has 340 million parameters and trained over 2.5 billion Wikipedia and 800 million BookCorpus words. To support this with Graphical Processing Units (GPUs), the most common hardware used to train deep learning-based NLP models, machine learning engineers will need distributed training support to train these large models. However, due to the complexity and fragility of configuring these distributed environments, even expert tweaking can end up with inferior results from the trained models.

To address these issues, Microsoft is open sourcing a first of a kind, end-to-end recipe for training custom versions of BERT-large models on Azure. Overall this is a stable, predictable recipe that converges to a good optimum for developers and data scientists to try explorations on their own.

“Fine-tuning BERT was really helpful to improve the quality of various tasks important for Bing search relevance,” says Rangan Majumder, Group Program Manager at Bing, who led the open sourcing of this work. “But there were some tasks where the underlying data was different from the original corpus BERT was pre-trained on, and we wanted to experiment with modifying the tasks and model architecture. In order to enable these explorations, our team of scientists and researchers worked hard to solve how to pre-train BERT on GPUs. We could then build improved representations leading to significantly better accuracy on our internal tasks over BERT. We are excited to open source the work we did at Bing to empower the community to replicate our experiences and extend it in new directions that meet their needs.”

“To get the training to converge to the same quality as the original BERT release on GPUs was non-trivial,” says Saurabh Tiwary, Applied Science Manager at Bing. “To pre-train BERT we need massive computation and memory, which means we had to distribute the computation across multiple GPUs. However, doing that in a cost effective and efficient way with predictable behaviors in terms of convergence and quality of the final resulting model was quite challenging. We’re releasing the work that we did to simplify the distributed training process so others can benefit from our efforts.”

Results

To test the code, we trained BERT-large model on a standard dataset and reproduced the results of the original paper on a set of GLUE tasks, as shown in Table 1. To give you estimate of the compute required, in our case we ran training on Azure ML cluster of 8xND40_v2 nodes (64 NVidia V100 GPUs total) for 6 days to reach listed accuracy in the table. The actual numbers you will see will vary based on your dataset and your choice of BERT model checkpoint to use for the upstream tasks.

Table1. GLUE development set results. Google BERT results are evaluated by using published BERT models on development set. The “average” column is simple average over the table results. F1 scores are reported for QQP and MRPC, Spearman correlations are reported for STS-B, and accuracy scores are reported for the other tasks. The results for tasks with smaller dataset sizes have significant variation and may require multiple fine-tuning runs to reproduce the results.

The code is available in open source on the Azure Machine Learning BERT GitHub repo. Included in the repo is:

A PyTorch implementation of the BERT model from Hugging Face repo.
Raw and pre-processed English Wikipedia dataset.
Data preparation scripts.
Implementation of optimization techniques such as gradient accumulation and mixed precision.
An Azure Machine Learning service Jupyter notebook to launch pre-training of the model.
A set of pre-trained models that can be used in fine-tuning experiments.
Example code with a notebook to perform fine-tuning experiments.

With a simple “Run All” command, developers and data scientists can train their own BERT model using the provided Jupyter notebook in Azure Machine Learning service. The code, data, scripts, and tooling can also run in any other training environment.

Summary

We could not have achieved these results without leveraging the amazing work of the researchers before us, and we hope that the community can take our work and go even further. If you have any questions or feedback, please head over to our GitHub repo and let us know how we can make it better.

Learn how Azure Machine Learning can help you streamline the building, training, and deployment of machine learning models. Start free today.

Microsoft makes it easier to build popular language representation model BERT at large scale

Results

Summary

The Microsoft Azure Team

AT&T and Microsoft scale trillion-token workloads with Microsoft Foundry and AMD

Azure Databricks delivers proven business value

GPT-5.6 now available in Microsoft Foundry

Explore Microsoft Foundry

Microsoft makes it easier to build popular language representation model BERT at large scale

Results

Summary

The Microsoft Azure Team

Related posts

AT&T and Microsoft scale trillion-token workloads with Microsoft Foundry and AMD

Azure Databricks delivers proven business value

GPT-5.6 now available in Microsoft Foundry

Explore Microsoft Foundry