Last updated December 8, 2020

Exploring Named-Entity Recognition With Wikipedia

Published on August 24, 2018
by Abhishek Sharma

Natural language processing (NLP) is the field to turn to when language-related tasks need to run on computing systems. With audio and text being the core formats of information, NLP offers a number of algorithms and concepts to deal with them. One such area is information extraction (IE), which deals with automatically extracting structured information from machine-readable sources, including unstructured data.

Unstructured data is usually large and complex, and it needs to be made readable quickly and efficiently. This is where IE comes into the picture. IE techniques such as named-entity recognition (NER) organise textual information efficiently. In this article, we look at what NER is and at how research studies have developed NER algorithms with Wikipedia as the knowledge base.

What Is NER?

In this article, NER refers to a process in which systems or algorithms identify and classify entities in text and link them to entities in knowledge bases. The entity-linking part is treated here as its core task. It basically means that cluttered, unstructured text is linked to structured knowledge bases to make it comprehensible.

For example, given a sentence like “Christina is working on a new project”, an NER algorithm designed to identify nouns recognises and classifies “Christina” and “project” as nouns. Large sets of nouns, which act as the entities of the knowledge base, have already been collected for the algorithm. All the algorithm now does is link the nouns in the sentence to the matching entries in the knowledge base.
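
The lookup described above can be sketched in a few lines of Python. This is only a toy illustration: the knowledge base, its `kb:` identifiers and the entity types are made up for the example, and a real system would use a far larger collection and a proper tokenizer.

```python
import re

# Hypothetical miniature knowledge base: surface form -> (entity id, type).
# The ids and types are invented for this illustration.
KNOWLEDGE_BASE = {
    "christina": ("kb:Christina", "PERSON"),
    "project": ("kb:Project", "CONCEPT"),
}

def link_entities(sentence, kb=KNOWLEDGE_BASE):
    """Return (token, entity_id, entity_type) for every token found in kb."""
    links = []
    for token in re.findall(r"[A-Za-z]+", sentence):
        entry = kb.get(token.lower())  # case-insensitive dictionary lookup
        if entry is not None:
            links.append((token, entry[0], entry[1]))
    return links

print(link_entities("Christina is working on a new project"))
```

Only “Christina” and “project” are linked, because they are the only tokens with an entry in the toy knowledge base.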

What we have seen here is known as entity linking. Entity-linking algorithms have piqued the interest of NLP researchers, and a number of studies have developed and tested them on sources such as Wikipedia, Twitter and other websites.

Wikipedia As The Knowledge Base In Studies And Frameworks

As mentioned earlier, many NER-related studies have used Wikipedia (and its API too!) to develop efficient entity-linking algorithms. In one study, Milan Dojchinovski and team from the University of Economics, Prague, developed NER systems based on Wikipedia’s Search API as well as the Apache Lucene search API (entityclassifier.eu is one such framework).
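
To give a feel for what candidate lookup through Wikipedia search can look like, here is a minimal sketch that uses the public MediaWiki search endpoint (`action=query`, `list=search`). It is not the implementation from the study; the function names and the `User-Agent` string are assumptions made for this example.

```python
import json
from urllib.parse import urlencode
from urllib.request import Request, urlopen

API_ENDPOINT = "https://en.wikipedia.org/w/api.php"

def build_search_url(mention, limit=5):
    """Build a MediaWiki full-text search URL for an entity mention."""
    params = {
        "action": "query",
        "list": "search",
        "srsearch": mention,
        "srlimit": limit,
        "format": "json",
    }
    return API_ENDPOINT + "?" + urlencode(params)

def candidate_titles(response):
    """Extract page titles from a parsed search response."""
    return [hit["title"] for hit in response.get("query", {}).get("search", [])]

def search_candidates(mention, limit=5):
    """Query Wikipedia (network call) and return candidate page titles."""
    request = Request(
        build_search_url(mention, limit),
        headers={"User-Agent": "ner-article-example/0.1"},  # hypothetical client name
    )
    with urlopen(request, timeout=10) as reply:
        return candidate_titles(json.load(reply))
```

Calling `search_candidates("Prague")` would return the titles of the top search hits, which an entity linker can then treat as candidate entities for the mention.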

They also worked with other entity-linking variations, such as the most-frequent-sense method, co-occurrence-based linking and explicit semantic analysis-based linking, to analyse efficiency on Wikipedia Search and Lucene Search. Their study was presented at the Text Analysis Conference Knowledge Base Population (TAC KBP) 2013.
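
Of these variations, the most-frequent-sense method is the simplest to illustrate: a mention is linked to whichever entity it most often refers to. The sketch below assumes we already have counts of how often a surface form points to each Wikipedia page; the numbers and the helper name are invented for the example and do not come from the study.

```python
from collections import Counter

# Hypothetical statistics: how often each surface form links to each page.
# The counts are made up for illustration.
ANCHOR_COUNTS = {
    "jaguar": Counter({"Jaguar_Cars": 120, "Jaguar": 80, "Jacksonville_Jaguars": 15}),
    "prague": Counter({"Prague": 300}),
}

def most_frequent_sense(mention, counts=ANCHOR_COUNTS):
    """Link a mention to the entity it refers to most often, or None if unseen."""
    senses = counts.get(mention.lower())
    if not senses:
        return None
    return senses.most_common(1)[0][0]

print(most_frequent_sense("Jaguar"))
```

With these toy counts, “Jaguar” resolves to `Jaguar_Cars` regardless of context, which is exactly the weakness that context-aware methods try to address.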

The study mentioned above worked with variations and combinations of concepts in NER. Earlier work, however, used graphs to model entity linking. A study by X Han and team is one example. They designed a graph-based method “which can model and exploit the global interdependence between different entity linking decisions” with Wikipedia as the knowledge base.

In other words, entities are viewed from a broader perspective in which they link to other entities, rather than each mention being resolved in isolation. Using an example from Wikipedia results, the authors establish the semantic relationships between entities that are needed to chart out the reference graph. The method was compared with conventional linking methods such as Wikify!, among others, and fared better in terms of precision and recall.
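
The idea of interdependent linking decisions can be illustrated with a deliberately simplified sketch. It is not the authors’ algorithm: instead of their graph-based inference, it brute-forces every combination of candidates and keeps the one with the highest combined local score and pairwise relatedness. All candidates, scores and relatedness values are hypothetical.

```python
from itertools import product

# Hypothetical candidates with a local (mention-to-entity) score.
CANDIDATES = {
    "Jordan": {"Michael_Jordan": 0.5, "Jordan_(country)": 0.5},
    "Bulls": {"Chicago_Bulls": 0.6, "Bull": 0.4},
}

# Hypothetical symmetric relatedness between entities (the graph's edges).
RELATEDNESS = {
    frozenset(["Michael_Jordan", "Chicago_Bulls"]): 0.9,
    frozenset(["Jordan_(country)", "Bull"]): 0.1,
}

def score(assignment):
    """Sum of local scores plus relatedness between every pair of chosen entities."""
    entities = list(assignment.values())
    local = sum(CANDIDATES[m][e] for m, e in assignment.items())
    pairwise = sum(
        RELATEDNESS.get(frozenset([a, b]), 0.0)
        for i, a in enumerate(entities)
        for b in entities[i + 1:]
    )
    return local + pairwise

def link_collectively():
    """Pick the joint assignment of entities to mentions with the best score."""
    mentions = list(CANDIDATES)
    best = max(
        product(*(CANDIDATES[m] for m in mentions)),
        key=lambda combo: score(dict(zip(mentions, combo))),
    )
    return dict(zip(mentions, best))

print(link_collectively())
```

Taken alone, “Jordan” is a tie between the two candidates; it is the strong relatedness to `Chicago_Bulls` that tips the decision towards `Michael_Jordan`. Brute force is only workable for a handful of mentions, which is one reason practical systems rely on graph algorithms instead.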

Beyond individual studies, there are various frameworks for entity linking with Wikipedia, with Dexter, Babelfy and DBpedia being popular ones.

Dexter: An open source entity-linking framework developed by researchers at ISTI-CNR, Italy, Dexter identifies text fragments in a document that refer to entities present in Wikipedia. The linking process is divided into three steps (text fragment identification, disambiguation and ranking), which form the core module of the software. Because Dexter is open source, it is easier to measure and analyse various existing entity-linking algorithms with it.

*Dexter’s architecture (Image courtesy: ISTI CNR)*
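
The three steps named above (text fragment identification, disambiguation and ranking) can be pictured as a small pipeline. The sketch below is not Dexter’s code or API; the spot dictionary, the priors and the function names are assumptions chosen only to show how the steps feed into one another.

```python
# Hypothetical spot dictionary: surface form -> {candidate entity: prior}.
SPOTS = {
    "rome": {"Rome": 0.9, "Rome,_Georgia": 0.1},
    "sapienza": {"Sapienza_University_of_Rome": 1.0},
}

def spot(text):
    """Step 1: find text fragments that may refer to an entity."""
    tokens = [t.strip(".,;:!?").lower() for t in text.split()]
    return [t for t in tokens if t in SPOTS]

def disambiguate(fragment):
    """Step 2: pick one entity per fragment (here simply the highest prior)."""
    entity, prior = max(SPOTS[fragment].items(), key=lambda item: item[1])
    return fragment, entity, prior

def rank(annotations):
    """Step 3: order the annotations by confidence, best first."""
    return sorted(annotations, key=lambda a: a[2], reverse=True)

def annotate(text):
    """Run the three steps in sequence."""
    return rank([disambiguate(f) for f in spot(text)])

print(annotate("Sapienza is in Rome."))
```

Keeping the steps as separate functions mirrors the modular design described above: each one can be swapped for a different algorithm and the results compared.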

Babelfy: A multilingual open source framework, Babelfy has a web interface and a RESTful API to perform entity linking as well as word sense disambiguation (WSD) to address various problems in computational linguistics. One noteworthy advantage is its integration with Java, a widely used programming language. Developed by researchers at Sapienza University of Rome, Italy, Babelfy is intended for everyone and provides a simple, user-friendly interface.

DBpedia: More of a knowledge base, DBpedia is an open source community project developed and supported by numerous researchers from various fields. Wikipedia forms its major information base, and DBpedia supports more than 100 language versions. It is also used for entity linking and other NLP applications through browser-based interactions. SPARQL is the query language used to interact with DBpedia. With web queries pervading the online space, DBpedia is certainly helpful in complex text analysis tasks.
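
As a rough illustration of querying DBpedia, the sketch below builds a SPARQL query for the abstract of a resource and sends it to the public endpoint at `https://dbpedia.org/sparql`. The function names are made up for this example, and the `format` request parameter is an assumption about how the endpoint selects JSON output.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

SPARQL_ENDPOINT = "https://dbpedia.org/sparql"

def build_abstract_query(resource, lang="en"):
    """Build a SPARQL query for the abstract of a DBpedia resource."""
    return (
        "PREFIX dbo: <http://dbpedia.org/ontology/>\n"
        "SELECT ?abstract WHERE {\n"
        f"  <http://dbpedia.org/resource/{resource}> dbo:abstract ?abstract .\n"
        f'  FILTER (lang(?abstract) = "{lang}")\n'
        "}"
    )

def fetch_abstract(resource, lang="en"):
    """Run the query (network call) and return the abstract text, or None."""
    params = urlencode({
        "query": build_abstract_query(resource, lang),
        "format": "application/sparql-results+json",  # assumed output selector
    })
    with urlopen(SPARQL_ENDPOINT + "?" + params, timeout=10) as reply:
        bindings = json.load(reply)["results"]["bindings"]
    return bindings[0]["abstract"]["value"] if bindings else None
```

Once a mention has been linked to a resource name such as `Rome`, a call like `fetch_abstract("Rome")` would retrieve structured information about it from the knowledge base.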

Conclusion

NER has been in existence for the past two decades. With NLP advancements on the rise, text analysis is likely to improve considerably, be it in terms of languages, jargon, contexts and so on. Text-based applications are becoming less computation-intensive and easier to develop as NLP simplifies these processes.


