Custom entity extraction with text analytics
Named entity recognition is a key area of research in machine learning and natural language processing. Entity extraction pulls searchable named entities from unstructured text. In practice, it’s used to answer many real-world questions, such as whether a tweet contains a person’s name and location, whether a company is named in a news article, or whether a specific product is mentioned in a review. Entity extraction is particularly useful when applied to areas with intensive use of domain-specific terminology, such as healthcare, legal and regulatory documentation, or the sciences.
Unstructured text, such as that found in documents, tweets, or product reviews, usually requires preprocessing before it can be analyzed. Preprocessing typically includes removing duplicate or special characters, replacing inflected words with their base form (lemmatization), and converting case. Once the text has been cleaned, it’s processed to extract the most important or frequent terms. This step is important for reducing the complexity of the solution: each word is a feature that the model must handle, and this step lowers the number of features from billions to thousands. Next, a word dictionary is created. The dictionary pairs each word (unigram) used to train the model with the number of times that unigram appears in the text corpus. Metrics are then calculated to measure the frequency and importance of each unigram.
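The preprocessing and dictionary steps described above can be sketched in a few lines of Python. This is a minimal illustration, not the method of any particular text analytics service: the function names (preprocess, build_dictionary, tf_idf) and the max_terms cutoff are hypothetical, lemmatization is omitted for brevity, and TF-IDF is assumed here as one common choice for the "frequency and importance" metric.

```python
import math
import re
from collections import Counter


def preprocess(text):
    """Lowercase the text, replace special characters with spaces, and tokenize."""
    text = text.lower()
    text = re.sub(r"[^a-z0-9\s]", " ", text)
    return text.split()


def build_dictionary(docs, max_terms=1000):
    """Map each unigram to its corpus count, keeping only the most frequent terms."""
    counts = Counter()
    for doc in docs:
        counts.update(preprocess(doc))
    return dict(counts.most_common(max_terms))


def tf_idf(docs, dictionary):
    """Return one {unigram: TF-IDF score} mapping per document."""
    tokenized = [preprocess(doc) for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: in how many documents does each unigram occur?
    doc_freq = Counter()
    for tokens in tokenized:
        doc_freq.update(t for t in set(tokens) if t in dictionary)
    scores = []
    for tokens in tokenized:
        tf = Counter(t for t in tokens if t in dictionary)
        total = sum(tf.values()) or 1
        scores.append(
            {t: (c / total) * math.log(n_docs / doc_freq[t]) for t, c in tf.items()}
        )
    return scores


docs = ["The pump failed. The PUMP leaked!", "The valve held."]
dictionary = build_dictionary(docs)
scores = tf_idf(docs, dictionary)
```

In this toy corpus, a unigram that occurs in every document (such as "the") receives a score of zero, while a term concentrated in one document (such as "pump") scores higher, which is the kind of signal used to decide which features to keep.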
Build and train
Once a word dictionary has been created, the data is split into training and validation sets, and model building begins. Commonly used algorithms in entity extraction range from Two-Class Logistic Regression to Multiclass Decision Forests. More recently, entity extraction solutions have been using powerful deep-learning techniques such as Long Short-Term Memory (LSTM) recurrent neural networks combined with word embeddings learned by unsupervised algorithms from an unlabeled training corpus. Models are scored using precision, recall, and F1. Ultimately, a well-performing entity extraction model, trained on a small set of examples, should be able to generalize to arbitrary text passed to the system.
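To make the scoring metrics concrete, the sketch below computes precision, recall, and F1 by comparing predicted entities against a labeled validation set. It assumes exact matching on (text, label) pairs, which is only one of several possible matching schemes; the function name score_entities and the sample entities are hypothetical.

```python
def score_entities(predicted, gold):
    """Return (precision, recall, F1) using exact matches of (text, label) pairs."""
    predicted, gold = set(predicted), set(gold)
    true_positives = len(predicted & gold)
    # Precision: what fraction of the predicted entities are correct?
    precision = true_positives / len(predicted) if predicted else 0.0
    # Recall: what fraction of the labeled entities were found?
    recall = true_positives / len(gold) if gold else 0.0
    # F1: harmonic mean of precision and recall.
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical validation labels and model output.
gold = {("Contoso", "ORG"), ("Seattle", "LOC"), ("Ana", "PER")}
predicted = {
    ("Contoso", "ORG"),
    ("Seattle", "LOC"),
    ("Monday", "LOC"),
    ("Surface", "ORG"),
}

precision, recall, f1 = score_entities(predicted, gold)
```

Here two of the four predictions are correct (precision 1/2) and two of the three labeled entities are found (recall 2/3), giving an F1 of 4/7. Comparing these three numbers across candidate models is how the most effective variant would be selected.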
When the most effective variant of a model has been identified, that model is deployed so that an application can consume it. Often this means the model is deployed as a web service with a REST endpoint. The model can then be called by line-of-business applications or analytics software. Depending on business needs, web services are consumed in one of two modes: request-response or batch. Request-response (real-time) processing is a good fit for text that arrives as tweets or comments on a product website and must be scored as it arrives for use in an application. Batch mode processes larger amounts of data efficiently, for example, to examine a regulatory text corpus or volumes of historical data.
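The difference between the request-response and batch modes can be illustrated with a small client-side sketch. Everything specific here is an assumption made for illustration: the endpoint URL, the {"documents": [...]} payload shape, and the helper names build_request and chunked are hypothetical, and a real scoring service defines its own request schema and authentication. The sketch only constructs the requests and does not send them.

```python
import json
from urllib import request

# Hypothetical endpoint; replace with the URL of the deployed web service.
SCORING_URL = "https://example.invalid/score"


def build_request(texts, url=SCORING_URL):
    """Build a JSON POST request that carries one or more texts to score."""
    body = json.dumps({"documents": list(texts)}).encode("utf-8")
    return request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def chunked(items, size):
    """Yield successive slices of at most `size` items."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


# Request-response: one request per incoming text, sent as soon as it arrives.
realtime_request = build_request(["Great phone, shipped to Seattle in two days."])

# Batch: group a large corpus into a small number of larger requests.
corpus = ["first document", "second document", "third document"]
batch_requests = [build_request(chunk) for chunk in chunked(corpus, 2)]
```

Each request could then be sent with request.urlopen(req) from the standard library. The batch size is a tuning choice: larger batches reduce per-request overhead, while the request-response path keeps latency low for individual texts.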