New Research Highlights How Error-Ridden The Data Used To Train AI Is

The world is awash with data, and it's tempting to assume that this abundance is what feeds the AI systems now so prevalent around us. New research from MIT highlights that not only is AI often trained on relatively small samples of curated data, but that this data frequently contains errors that undermine the training of machine learning algorithms.

Indeed, across 10 of the most-cited datasets used by scientists to train machine learning systems, the researchers found that 3% of the data was mislabeled or inaccurate.

Misleading data

It has long been suspected that the data used to train AI systems is not what it could be, but until now no one had been able to quantify just how poor it is. The researchers assessed 10 datasets that have collectively been cited over 100,000 times, including Amazon's reviews dataset and the hugely popular ImageNet.

The researchers also developed a demo tool that allows users to browse the datasets and inspect the kinds of errors they contain. These errors include:

  • Mislabeled images, like one breed of dog being confused for another or a baby being confused for a nipple.
  • Mislabeled text sentiment, like Amazon product reviews described as negative when they were actually positive.
  • Mislabeled audio of YouTube videos, like an Ariana Grande high note being classified as a whistle.

Interestingly, models that are traditionally viewed as weaker, such as ResNet-18, appeared to be less affected by these label errors than their more complex peers, such as ResNet-50. The authors argue that machine learning practitioners should consider simpler models whenever their real-world dataset has an error rate of around 10%.
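
One way to get a feel for this effect is to corrupt a fraction of a test set's labels and compare how models of different capacity score against the clean versus noisy labels. The sketch below is purely illustrative and not the study's setup: the dataset, the 10% noise rate, and the use of scikit-learn classifiers as stand-ins for the simple and complex models are all assumptions.

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier

X, y = load_digits(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

# Simulate a noisy benchmark by corrupting 10% of the test labels
# (an assumed noise rate, chosen to echo the article's figure).
rng = np.random.default_rng(0)
noisy = y_te.copy()
flip = rng.choice(len(noisy), size=int(0.10 * len(noisy)), replace=False)
noisy[flip] = rng.integers(0, 10, size=len(flip))

# Stand-ins for a "simple" and a "complex" model.
simple = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
complex_model = RandomForestClassifier(n_estimators=200, random_state=0).fit(X_tr, y_tr)

for name, model in [("simple", simple), ("complex", complex_model)]:
    clean_acc = model.score(X_te, y_te)
    noisy_acc = (model.predict(X_te) == noisy).mean()
    print(f"{name}: clean acc = {clean_acc:.3f}, noisy acc = {noisy_acc:.3f}")
```

Running a probe like this on your own benchmark shows how much a model's apparent ranking depends on which labels, clean or corrupted, you happen to score against.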

The study builds upon previous MIT work into the reliability and accuracy of datasets. In this study they use confident learning, a sub-field of machine learning concerned with finding and quantifying label noise within datasets, to algorithmically identify candidate label errors before humans verify the data.
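
At its core, confident learning compares a model's out-of-sample predicted probabilities against per-class confidence thresholds to flag examples whose given label looks implausible. The following is a heavily simplified sketch of that idea (the full method also estimates the joint distribution of noisy and true labels, which this omits, and it assumes every class appears at least once in the labels):

```python
import numpy as np

def find_label_issues_cl(labels, pred_probs):
    """Flag likely label errors with a confident-learning-style rule.

    labels: (n,) array of given (possibly noisy) integer class labels.
    pred_probs: (n, k) array of out-of-sample predicted probabilities.
    Returns indices of examples whose given label looks suspect.
    """
    n, k = pred_probs.shape
    # Per-class confidence threshold: the average predicted probability
    # of class j among the examples that are labeled j.
    thresholds = np.array([pred_probs[labels == j, j].mean() for j in range(k)])

    issues = []
    for i in range(n):
        # Classes the model is "confident" about for this example.
        confident = np.where(pred_probs[i] >= thresholds)[0]
        if confident.size > 0:
            # If the model confidently predicts a class other than the
            # given label, flag the example as a likely label error.
            best = confident[np.argmax(pred_probs[i, confident])]
            if best != labels[i]:
                issues.append(i)
    return np.array(issues)
```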

The team has also made it easy for other researchers to replicate their results and find label errors in their own datasets using cleanlab, an open-source Python package.
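
For instance, assuming cleanlab 2.x (the API has shifted across releases), flagging likely label errors takes only a few lines once you have out-of-sample predicted probabilities from any classifier; the dataset and model here are arbitrary choices for illustration:

```python
from cleanlab.filter import find_label_issues
from sklearn.datasets import load_digits
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict

# cleanlab is model-agnostic: it only needs out-of-sample predicted
# probabilities, obtained here via cross-validation.
X, labels = load_digits(return_X_y=True)
pred_probs = cross_val_predict(
    LogisticRegression(max_iter=1000), X, labels,
    cv=5, method="predict_proba",
)

# Indices of examples whose given label conflicts with the model's
# confident predictions, ranked by severity.
issue_idx = find_label_issues(
    labels=labels,
    pred_probs=pred_probs,
    return_indices_ranked_by="self_confidence",
)
print(f"Flagged {len(issue_idx)} potential label errors")
```

The flagged indices can then be reviewed by hand, mirroring the algorithm-first, human-verification-second workflow the researchers describe.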
