Custom entity extraction with text analytics
Named entity recognition is an important area of research in machine learning and natural language processing (NLP). Entity extraction pulls searchable knowledge out of unstructured text. In practice, it can be used to answer many real-world questions, such as determining whether a tweet contains a specific person's name and location, determining whether companies are mentioned in a news article, or understanding whether specific products have been mentioned in reviews. Entity extraction can be particularly useful when applied to areas with intensive use of domain-specific terminology, such as healthcare, legal and regulatory documentation, or the sciences.
Unstructured text such as documents, tweets, or product reviews usually requires preprocessing before it can be analyzed. Preprocessing can include steps such as removing duplicate or special characters, replacing inflected words with their base form, and case conversion. Once the text has been cleaned, it is processed to pull out the most important or frequent terms. This is an important step in reducing the complexity of the solution: each word is a 'feature' that the model has to handle, and this step can reduce the features from billions to thousands. Next, a word dictionary is created. The dictionary maps each word (unigram) that will be used to train the model to a count of the times that unigram appears in the text corpus. A number of metrics are then calculated to measure the frequency and importance of each unigram.
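The cleaning and dictionary-building steps above can be sketched in a few lines of Python. This is a minimal illustration, not a production pipeline: the regex-based cleanup, the `min_count` cutoff, and the sample corpus are all illustrative choices.

```python
import re
from collections import Counter

def preprocess(text):
    """Lowercase, strip special characters, and collapse whitespace."""
    text = text.lower()
    text = re.sub(r"[^a-z0-9\s]", " ", text)  # remove special characters
    return re.sub(r"\s+", " ", text).strip()

def build_dictionary(corpus, min_count=1):
    """Map each unigram to the number of times it appears in the corpus."""
    counts = Counter()
    for doc in corpus:
        counts.update(preprocess(doc).split())
    # Keeping only terms seen at least min_count times shrinks the feature space
    return {term: n for term, n in counts.items() if n >= min_count}

corpus = [
    "Contoso Ltd. announced a new product!",
    "Reviews mention the new Contoso product.",
]
dictionary = build_dictionary(corpus, min_count=2)
# Only unigrams shared across documents survive the cutoff:
# {'new': 2, 'contoso': 2, 'product': 2}
```

In a real solution the frequency counts would feed further metrics (such as TF-IDF) to weigh each unigram's importance, and lemmatization would replace inflected words with their base form before counting.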
Build & train
Once a word dictionary has been created, the data is split into training and validation sets and model building can begin. Commonly used algorithms in entity extraction range from Two-Class or Multi-Class Logistic Regression to Multiclass Decision Forests. More recently, entity extraction solutions have been using powerful deep learning techniques such as Long Short-Term Memory (LSTM) recurrent neural networks in concert with unsupervised word embedding algorithms that learn representations from an unlabeled training corpus. Precision, recall, and F-scores are used to score models. Ultimately, a well-performing entity extraction model, trained on a small set of examples, will be able to generalize to arbitrary new text passed to the system.
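The scoring metrics mentioned above can be computed directly from token-level predictions. The sketch below assumes a simple labeling scheme in which 'O' marks non-entity tokens; the label names and example sequences are illustrative.

```python
def prf_scores(gold, pred):
    """Token-level precision, recall, and F1 for entity labels.

    gold, pred: parallel lists of labels; 'O' marks non-entity tokens.
    """
    tp = sum(1 for g, p in zip(gold, pred) if p != "O" and p == g)  # correct entity labels
    fp = sum(1 for g, p in zip(gold, pred) if p != "O" and p != g)  # spurious or wrong labels
    fn = sum(1 for g, p in zip(gold, pred) if g != "O" and p != g)  # missed entities
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

gold = ["PER", "O", "O", "LOC", "O"]
pred = ["PER", "O", "LOC", "LOC", "O"]
p, r, f = prf_scores(gold, pred)
# p ≈ 0.67 (one spurious LOC), r = 1.0 (no missed entities), f = 0.8
```

Precision penalizes spurious entity predictions, recall penalizes missed entities, and F1 balances the two, which is why all three are typically reported when comparing candidate models on the validation set.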
When the most effective variant of a model has been identified, that model will be deployed in such a manner that it can be consumed by an application. Often this means the model will be deployed as a web service with a REST endpoint. The model can then be called by line-of-business applications or analytics software. Depending on the business needs, web services can be consumed in two modes: RRS (request-response service) and BES (batch execution service). RRS allows you to process text in real time, a great option for scoring text that might come in as tweets or comments on a product's web site. BES can be used to process larger amounts of data in an efficient manner, and might, for example, lend itself to examining a large regulatory text corpus.
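A request-response (RRS) call is typically just an authenticated HTTP POST with a JSON body. The sketch below assumes a hypothetical endpoint URL, API key, and payload schema; the actual contract depends on how the service was deployed.

```python
import json
import urllib.request

SERVICE_URL = "https://example.com/entity-extraction/score"  # hypothetical endpoint
API_KEY = "<your-api-key>"  # placeholder credential

def build_request(texts):
    """Build an authenticated JSON POST for a request-response (RRS) call.

    The payload schema here is illustrative; match it to your deployed service.
    """
    payload = json.dumps({"Inputs": {"input1": [{"text": t} for t in texts]}}).encode()
    headers = {
        "Content-Type": "application/json",
        "Authorization": f"Bearer {API_KEY}",
    }
    return urllib.request.Request(SERVICE_URL, data=payload, headers=headers)

# To score text in real time, send the request and parse the response:
# with urllib.request.urlopen(build_request(["Contoso opened an office in Oslo"])) as resp:
#     entities = json.load(resp)
```

A BES call follows the same pattern but points the service at a stored batch of documents and polls for job completion rather than waiting on a single response.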