Natural language processing (NLP) is the field to turn to when language-related tasks need to run on computing systems. Because audio and text are the core formats in which information is recorded, NLP offers a number of algorithms and concepts for dealing with them. One such concept is information extraction (IE). This area of NLP deals with automatically extracting structured information from semi-structured or unstructured machine-readable data.
Unstructured data is usually large and complex, and it has to be made readable quickly and efficiently. This is where IE comes into the picture. IE techniques such as named-entity recognition (NER) organise textual information efficiently. In this article, we look at what NER is and at how research studies have developed NER algorithms using Wikipedia as the knowledge base.
What Is NER?
NER is a process in which systems or algorithms identify and classify entities in text. In the broader sense used in this article, they also link those entities to entries in knowledge bases. This entity-linking step is the focus here. It means that cluttered, unstructured text is linked to structured knowledge bases to make it comprehensible.
For example, given a sentence like “Christina is working on a new project”, an NER algorithm designed to identify nouns recognises and classifies “Christina” and “project” as nouns. Large sets of nouns, which act as the entities of the knowledge base, have already been collected for the algorithm. All the algorithm then has to do is link the nouns in the sentence to the matching entries in the knowledge base.
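The lookup described above can be sketched in a few lines of Python. This is only a minimal illustration: the knowledge base, the `kb:` identifiers and the entity types are hypothetical, and a real system would use a proper tokeniser and a far larger knowledge base.

```python
import re

# Hypothetical toy knowledge base: lower-cased surface form -> (entity id, type).
KNOWLEDGE_BASE = {
    "christina": ("kb:Christina", "PERSON"),
    "project": ("kb:Project", "CONCEPT"),
}


def link_entities(sentence, kb=KNOWLEDGE_BASE):
    """Return (token, entity_id, entity_type) for every token found in kb."""
    # Very rough tokenisation: runs of ASCII letters only.
    tokens = re.findall(r"[A-Za-z]+", sentence)
    links = []
    for token in tokens:
        entry = kb.get(token.lower())
        if entry is not None:
            entity_id, entity_type = entry
            links.append((token, entity_id, entity_type))
    return links


links = link_entities("Christina is working on a new project")
# links == [("Christina", "kb:Christina", "PERSON"),
#           ("project", "kb:Project", "CONCEPT")]
```

Words that are not in the knowledge base, such as “working” or “new”, are simply skipped.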
What we have just seen is an example of an ‘entity-linking algorithm’. Such algorithms have piqued the interest of NLP researchers. A number of analytical studies have been carried out and tested on knowledge bases and sources such as Wikipedia, Twitter and other websites.
Wikipedia As The Knowledge Base In Studies And Frameworks
As mentioned earlier, many NER-related studies have used Wikipedia (and its API too!) to develop efficient entity-linking algorithms. In one study, Milan Dojchinovski and team from the University of Economics, Prague, developed NER systems based on Wikipedia’s Search API as well as the Apache Lucene search API (entityclassifier.eu is one such framework).
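To give a feel for search-based candidate retrieval, here is a hedged sketch that queries the public MediaWiki search endpoint of English Wikipedia for pages matching a mention. It is not the pipeline used in the study; the helper names and the User-Agent string are assumptions made for illustration.

```python
import json
from urllib.parse import urlencode
from urllib.request import Request, urlopen

API_URL = "https://en.wikipedia.org/w/api.php"


def build_search_url(mention, limit=5):
    """Build a MediaWiki full-text search URL for the given mention."""
    params = {
        "action": "query",
        "list": "search",
        "srsearch": mention,
        "srlimit": limit,
        "format": "json",
    }
    return API_URL + "?" + urlencode(params)


def candidate_titles(response):
    """Extract page titles from a parsed search response."""
    hits = response.get("query", {}).get("search", [])
    return [hit["title"] for hit in hits]


def search_candidates(mention, limit=5):
    """Fetch candidate page titles for a mention (performs a network call)."""
    # The User-Agent value is a placeholder chosen for this example.
    request = Request(
        build_search_url(mention, limit),
        headers={"User-Agent": "ner-article-example/0.1"},
    )
    with urlopen(request) as resp:
        return candidate_titles(json.load(resp))
```

Calling `search_candidates("Christina")` would return a list of page titles that an entity linker could then treat as candidate entities for the mention.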
They also worked with other entity-linking variations, such as the most-frequent-sense method, co-occurrence-based linking and explicit semantic analysis-based linking, to analyse how each performs on Wikipedia Search and Lucene Search. Their study was presented at the Text Analysis Conference Knowledge Base Population (TAC KBP) 2013.
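Of these variations, the most-frequent-sense method is the simplest to illustrate: among all candidate entities for a mention, pick the one the mention most often refers to. The sketch below assumes that frequency is estimated from how often a surface form links to each page; the counts are invented for illustration and are not taken from the study.

```python
# Hypothetical statistics: mention -> {candidate page: number of links}.
ANCHOR_COUNTS = {
    "jaguar": {
        "Jaguar": 120,
        "Jaguar Cars": 310,
        "Jacksonville Jaguars": 95,
    },
}


def most_frequent_sense(mention, counts=ANCHOR_COUNTS):
    """Return the candidate most often linked from this mention, or None."""
    candidates = counts.get(mention.lower())
    if not candidates:
        return None
    # Pick the candidate with the highest link count.
    return max(candidates, key=candidates.get)


best = most_frequent_sense("Jaguar")
# best == "Jaguar Cars" with the invented counts above
```

The method ignores the surrounding context entirely, which is why it is usually treated as a baseline for the context-aware approaches.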
The study mentioned above worked with variations and combinations of NER concepts. Earlier work, however, used graph structures for entity linking. A study by X Han and team is one example. They designed a graph-based method “which can model and exploit the global interdependence between different entity linking decisions” with Wikipedia as the knowledge base.
In other words, mentions are not linked in isolation: the decision for one mention takes into account the entities that the other mentions refer to, so entities are effectively linked to other entities. Using an example from Wikipedia, the authors establish the semantic relationships between entities that are needed to chart out the reference graph. The method was compared with conventional linking methods such as Wikify!, among others, and fared better in terms of precision and recall.
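The idea of making linking decisions jointly rather than one mention at a time can be shown with a toy sketch. This is not the authors' algorithm: it simply scores every combination of candidates by brute force, and the mentions, prior scores and relatedness values are all invented for illustration.

```python
from itertools import combinations, product

# Hypothetical local scores: mention -> {candidate entity: prior score}.
CANDIDATES = {
    "Jordan": {"Jordan (country)": 0.6, "Michael Jordan": 0.4},
    "Bulls": {"Chicago Bulls": 0.5, "Bull": 0.5},
}

# Hypothetical semantic relatedness between pairs of entities.
RELATEDNESS = {
    frozenset(("Michael Jordan", "Chicago Bulls")): 0.9,
}


def relatedness(a, b):
    """Symmetric relatedness lookup; unrelated pairs score 0."""
    return RELATEDNESS.get(frozenset((a, b)), 0.0)


def link_collectively(candidates):
    """Pick one entity per mention, maximising local score plus coherence."""
    mentions = list(candidates)
    best, best_score = None, float("-inf")
    # Enumerate every joint assignment (feasible only for tiny inputs).
    for choice in product(*(candidates[m] for m in mentions)):
        local = sum(candidates[m][e] for m, e in zip(mentions, choice))
        coherence = sum(relatedness(a, b) for a, b in combinations(choice, 2))
        if local + coherence > best_score:
            best, best_score = dict(zip(mentions, choice)), local + coherence
    return best


assignment = link_collectively(CANDIDATES)
# assignment == {"Jordan": "Michael Jordan", "Bulls": "Chicago Bulls"}
```

Taken alone, “Jordan” would be linked to the country because of its higher prior. Once the relatedness to “Chicago Bulls” is counted, the joint assignment switches to the basketball player, which is the kind of interdependence a graph-based method is meant to capture.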
Beyond individual studies, several frameworks and resources build on Wikipedia for entity linking:
- Dexter: An open source entity-linking framework developed by researchers at ISTI-CNR, Italy, Dexter identifies text fragments in a document that refer to entities present in Wikipedia. The linking process is divided into three steps (text fragment identification, disambiguation and ranking), which form the core module of the software. Because Dexter is open source, it makes it easy to measure and compare existing entity-linking algorithms.
- Babelfy: A multilingual open source framework, Babelfy has a web interface and a RESTful API for performing entity linking as well as word sense disambiguation (WSD), addressing various problems in computational linguistics. One noteworthy advantage is its integration with Java, a widely used programming language. Developed by researchers at Sapienza University of Rome, Italy, Babelfy is intended for everyone and provides a simple, user-friendly interface.
- DBpedia: More of a knowledge base than a framework, DBpedia is an open source community project developed and supported by numerous researchers from various fields. Wikipedia is its major information source, and DBpedia covers more than 100 language versions. It is also used for entity linking and other NLP applications through browser-based interactions. SPARQL is the means of querying DBpedia. With web queries pervading the online space, DBpedia is certainly helpful for complex text-analysis tasks.
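As a rough illustration of querying a knowledge base like this, DBpedia exposes a public SPARQL endpoint. The sketch below only builds a query and the corresponding request URL for an entity's abstract; the helper names are hypothetical, and the `format` parameter value is an assumption about what the endpoint accepts.

```python
from urllib.parse import urlencode

ENDPOINT = "https://dbpedia.org/sparql"


def build_abstract_query(resource, lang="en"):
    """Build a SPARQL query for the abstract of a DBpedia resource."""
    return (
        "PREFIX dbo: <http://dbpedia.org/ontology/>\n"
        "SELECT ?abstract WHERE {\n"
        f"  <http://dbpedia.org/resource/{resource}> dbo:abstract ?abstract .\n"
        f'  FILTER (lang(?abstract) = "{lang}")\n'
        "}"
    )


def build_request_url(query):
    """Encode the query as a GET request URL asking for JSON results."""
    params = {
        "query": query,
        "format": "application/sparql-results+json",  # assumed parameter value
    }
    return ENDPOINT + "?" + urlencode(params)


query = build_abstract_query("Named-entity_recognition")
url = build_request_url(query)
```

Fetching `url` with any HTTP client should return the English abstract for the resource, provided the resource exists in DBpedia.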
NER has been in existence for the past two decades. With NLP advancements on the rise, text analysis is set to improve considerably, be it in terms of languages, jargon, contexts and so on. Text-based applications no longer need to be computation-intensive or hard to develop, as NLP is simplifying these processes.