New research highlights how error-ridden data used to train AI is

Adi Gaskell, 26 Jul 2021


Originally posted on The Horizons Tracker.

The world is awash with data, and it’s tempting to think that all of this data is used to train the AI systems that are increasingly prevalent around the world. New research¹ from MIT shows that AI is often trained on relatively small samples of curated data, and that this data often contains errors that undermine the training of machine learning algorithms.

Indeed, across 10 of the most-cited datasets used by scientists to train machine learning systems, the researchers found that, on average, around 3% of the data was mislabeled or inaccurate.

Misleading data

It has long been suspected that the data used to train AI systems is not what it could be, but until now no one has been able to quantify just how poor it is. The researchers assessed 10 datasets that collectively have been cited over 100,000 times. These include both Amazon’s reviews dataset and the hugely popular ImageNet.

The researchers also developed a demo tool that lets users browse the datasets and examine the different types of errors they contain. These errors include:

Mislabeled images, like one breed of dog being confused for another or a baby being confused for a nipple.
Mislabeled text sentiment, like Amazon product reviews described as negative when they were actually positive.
Mislabeled audio from YouTube videos, like an Ariana Grande high note being classified as a whistle.

Interestingly, models that are traditionally viewed as weaker, such as ResNet-18, appeared to have lower error rates than their more complex peers, such as ResNet-50. The authors argue that machine learning scientists should consider using simpler models if their real-world dataset has an error rate of around 10%.

The study builds upon previous work done at MIT on the reliability and accuracy of datasets, which aims to create confidence in the learning undertaken by AI. In this study, the researchers use confident learning, a sub-field of machine learning concerned with finding and quantifying label noise within datasets, to algorithmically identify the label errors before humans verify the data.
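To make the idea concrete, here is a minimal sketch of the thresholding step at the heart of confident learning. It is a simplification for illustration and not the cleanlab implementation: the function name `find_label_issue_candidates` is hypothetical, and `pred_probs` is assumed to hold predicted class probabilities from a model that did not see each example during training. An example is flagged when the model is confident about a class other than the label it was given.

```python
def find_label_issue_candidates(labels, pred_probs):
    """Return indices of examples whose given label disagrees with a
    confident model prediction (a simplified confident-learning sketch)."""
    num_classes = len(pred_probs[0])

    # Per-class threshold: the mean predicted probability of class j,
    # taken over the examples that are labeled j.
    thresholds = []
    for j in range(num_classes):
        probs_j = [p[j] for p, y in zip(pred_probs, labels) if y == j]
        # A class with no labeled examples can never be confidently predicted.
        thresholds.append(sum(probs_j) / len(probs_j) if probs_j else float("inf"))

    candidates = []
    for i, (p, y) in enumerate(zip(pred_probs, labels)):
        # Classes whose probability clears that class's own threshold.
        confident = [j for j in range(num_classes) if p[j] >= thresholds[j]]
        if not confident:
            continue  # the model is not confident about any class
        guess = max(confident, key=lambda j: p[j])
        if guess != y:
            candidates.append(i)
    return candidates


# Toy data (made up for illustration): six examples, two classes.
labels = [0, 0, 0, 1, 1, 1]
pred_probs = [
    [0.90, 0.10],
    [0.80, 0.20],
    [0.10, 0.90],  # labeled 0, but the model strongly predicts class 1
    [0.25, 0.75],
    [0.10, 0.90],
    [0.30, 0.70],
]

print(find_label_issue_candidates(labels, pred_probs))
```

On this toy data only the third example is flagged. As in the study, flagged examples are candidates for a human to review rather than confirmed errors.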

The team has also made it easy for other researchers to replicate their results and find label errors in their own datasets using cleanlab, an open-source Python package.

Article source: New Research Highlights How Error-Ridden Data Used To Train AI Is.

Header image source: Label Errors in ML Test Sets.

Reference:

1. Northcutt, C. G., Athalye, A., & Mueller, J. (2021). Pervasive label errors in test sets destabilize machine learning benchmarks. arXiv preprint arXiv:2103.14749.
