IIT Guwahati Unveils Wikipedia Name Error Detection Tool at AI Summit

Researchers at the Indian Institute of Technology Guwahati have developed a multilingual, scalable method to identify and correct Surface Name Errors (SNEs) in Wikipedia, helping improve information reliability for both human users and artificial intelligence systems.

Wikipedia is a free, multilingual online encyclopaedia created and maintained by a global community of volunteers through open collaboration. A surface name is the text used in a Wikipedia article to mention or link to another entity. A Surface Name Error (SNE) occurs when this text is incorrect, for example when the misspelled word “Parise” is used to link to the page for Paris.

A study conducted by the IIT Guwahati research team found that about 3% to 6% of all entity mentions in Wikipedia contain Surface Name Errors. While these errors may appear minor, they have significant implications.

For human users, an incorrect surface name can reduce the perceived credibility and reliability of the information provided. Similarly, many machine learning and deep learning models use Wikipedia as a core dataset. Such errors in surface names can negatively impact AI tasks and model performance.

To address this challenge, Prof. Amit Awekar, Associate Professor, Department of Computer Science and Engineering, along with his then M.Tech student Mr. Anuj Khare (batch of 2022), built a method that uses mathematical frequency patterns, making it adaptable across languages.

The developed method follows a three-step approach to classify SNEs.

The first step included scanning Wikipedia and converting every link into a quadruplet containing information on:

· The page where the link appears

· The page it points to

· The surface name used in the link

· The surrounding textual context
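The quadruplet can be pictured with a short Python sketch. The article does not describe how the researchers parsed links, so everything here is an illustrative assumption: the `LinkRecord` type, the `extract_links` function, the 40-character context window, and the use of a regular expression over raw wikitext (where a link is written as `[[Target]]` or `[[Target|surface name]]`).

```python
import re
from typing import NamedTuple


class LinkRecord(NamedTuple):
    """One link, stored as the four fields listed above (hypothetical type)."""
    source_page: str   # page where the link appears
    target_page: str   # page the link points to
    surface_name: str  # text shown for the link
    context: str       # text surrounding the link


# Matches [[Target]] and [[Target|surface name]] wikitext links.
WIKILINK = re.compile(r"\[\[([^\[\]|]+)(?:\|([^\[\]|]+))?\]\]")


def extract_links(source_page: str, wikitext: str, window: int = 40) -> list[LinkRecord]:
    """Turn every link in one page's wikitext into a LinkRecord."""
    records = []
    for match in WIKILINK.finditer(wikitext):
        target = match.group(1).strip()
        # Without a pipe, the surface name is the target title itself.
        surface = (match.group(2) or target).strip()
        start = max(0, match.start() - window)
        end = min(len(wikitext), match.end() + window)
        records.append(LinkRecord(source_page, target, surface, wikitext[start:end]))
    return records


records = extract_links("France", "The capital is [[Paris|Parise]], near [[Versailles]].")
```

Here the first record links the surface name “Parise” to the page Paris, which is the kind of entry the later steps would examine.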

In the next step, the developed method reviewed the surface name and considered it correct only if:

· It appeared at least 10 times

· It accounted for at least 5% of all links pointing to a specific page

Surface names that did not meet these criteria were flagged as potential errors.
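The two thresholds can be expressed as a short Python sketch. The values 10 and 5% come from the description above; the function name `flag_potential_errors`, the input format of (target page, surface name) pairs, and the sample counts are assumptions made for illustration only.

```python
from collections import Counter

MIN_COUNT = 10    # a correct surface name appears at least 10 times
MIN_SHARE = 0.05  # and covers at least 5% of links to the same page


def flag_potential_errors(links):
    """Return the (target_page, surface_name) pairs that fail either threshold.

    `links` is a collection of (target_page, surface_name) pairs, one per link.
    """
    pair_counts = Counter(links)
    # Total number of links pointing to each target page.
    target_totals = Counter()
    for (target, _surface), count in pair_counts.items():
        target_totals[target] += count

    flagged = set()
    for (target, surface), count in pair_counts.items():
        share = count / target_totals[target]
        if not (count >= MIN_COUNT and share >= MIN_SHARE):
            flagged.add((target, surface))
    return flagged


# Made-up counts: 180 + 18 + 2 = 200 links pointing to "Paris".
sample = (
    [("Paris", "Paris")] * 180
    + [("Paris", "French capital")] * 18
    + [("Paris", "Parise")] * 2
)
flagged = flag_potential_errors(sample)
```

In this made-up sample, “French capital” passes both tests (18 occurrences, 9% of links), while “Parise” appears only twice and is flagged.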

In the final step, it categorised the detected errors into “typing mistakes”, such as “Gawahati” instead of “Guwahati”, or “entity span errors”, where extra or incorrect words are mistakenly included in the link.
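The article does not say how the method separates the two categories, so the following Python sketch is only one plausible heuristic and not the researchers' procedure. It assumes a flagged name is an entity span error when it contains an accepted surface name plus extra words, and a typing mistake when it is a close string match to an accepted name. The function `categorise`, the 0.8 similarity threshold, and the “unclassified” fallback are all hypothetical.

```python
from difflib import SequenceMatcher


def categorise(flagged: str, accepted: list[str], typo_threshold: float = 0.8) -> str:
    """Assign a flagged surface name to a category (illustrative heuristic only)."""
    flagged_l = flagged.lower()

    # Entity span error: an accepted name appears as whole words
    # inside a longer flagged string.
    for name in accepted:
        name_l = name.lower()
        if name_l != flagged_l and f" {name_l} " in f" {flagged_l} ":
            return "entity span error"

    # Typing mistake: the flagged string closely resembles an accepted name.
    for name in accepted:
        if SequenceMatcher(None, flagged_l, name.lower()).ratio() >= typo_threshold:
            return "typing mistake"

    return "unclassified"


typo = categorise("Gawahati", ["Guwahati"])
span = categorise("the city of Guwahati", ["Guwahati"])
```

With these assumptions, “Gawahati” is labelled a typing mistake and “the city of Guwahati” an entity span error. Span errors that involve incorrect rather than extra words would need a different test.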

The researchers tested the developed method on eight languages (English, Sanskrit, German, Italian, Urdu, Hindi, Marathi, and Gujarati) and found that it produced accurate results.

Speaking about the real-world application of the developed method, Prof. Amit Awekar said, “This work shows us that we should not be trusting the data from the web blindly, both for human use and training AI models. Good data is the beginning of any good AI model and downstream application.”

To validate the developed method, the research team compared snapshots of English Wikipedia from 2018 and 2022 and found that about 30% of the errors predicted by the method had been corrected on Wikipedia over four years, confirming its accuracy.
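The snapshot comparison amounts to checking what share of the predicted errors no longer appears in the later snapshot. A minimal Python sketch follows, assuming each link is identified by a (source page, target page, surface name) triple. The function `corrected_fraction` and the sample data are hypothetical, and the sketch treats any disappearance of a link as a correction, which is a simplification.

```python
def corrected_fraction(predicted_errors, later_snapshot_links):
    """Share of predicted errors that no longer appear in the later snapshot.

    Both arguments are collections of (source_page, target_page, surface_name)
    triples. A link that disappeared for any reason counts as corrected.
    """
    predicted = set(predicted_errors)
    if not predicted:
        return 0.0
    still_present = predicted & set(later_snapshot_links)
    return (len(predicted) - len(still_present)) / len(predicted)


# Made-up data: 10 errors predicted on the older snapshot,
# 7 of which are still present in the newer one.
predicted = [(f"Page {i}", "Paris", "Parise") for i in range(10)]
later = predicted[:7] + [("Page 7", "Paris", "Paris")]
fraction = corrected_fraction(predicted, later)
```

For this made-up data, 3 of the 10 predicted errors are gone from the later snapshot, giving a fraction of 0.3.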

Wikipedia is maintained by volunteers worldwide, and the developed method can help editors identify hidden typos and linking errors that might otherwise remain unnoticed for years.

As further validation of the method's accuracy, the Wikipedia community has accepted more than 99% of the manual corrections suggested by the researchers.

By combining scalable data processing with practical validation through the Wikipedia community, the IIT Guwahati team has demonstrated an effective approach to strengthening digital knowledge systems.
