While training a CNN model using CIFAR-10 via "keras.datasets.cifar10.load_data()", I noticed that some images seem to have incorrect ground-truth labels. This leads to misleading training results and incorrect evaluation.

import numpy as np import matplotlib.pyplot as plt import keras

(X_train, Y_train), (X_test, Y_test) = keras.datasets.cifar10.load_data() X_train = X_train / 255.0 X_test = X_test / 255.0

class_labels = ['airplane','automobile','bird','cat','deer', 'dog','frog','horse','ship','truck']

plt.figure(figsize=(12,8)) for j, i in enumerate(np.random.randint(0, 10000, 6)): plt.subplot(2,3,j+1) plt.imshow(X_test[i]) plt.axis('off') plt.title(f"Actual={class_labels[Y_test[i][0]]} ({Y_test[i][0]})") plt.show()

Some images appear to be mislabeled. For example:

 An airplane image labeled as frog

 A horse image labeled as cat

A ship image labeled as truck

Image

Comment From: dhantule

Hi @RohanKumar101, thanks for reporting this.

The keras.datasets.cifar10.load_data() function provides a direct wrapper around the original CIFAR-10 dataset, the dataset is loaded as-is from the original source.

CIFAR-10 is known to have a small percentage of labels that may be noisy or ambiguous. If you need better label quality, consider using enhanced or cleaned versions of the dataset.

Comment From: RohanKumar101

Hi @dhantule , thanks for the clarification.

I understand that "keras.datasets.cifar10.load_data()" is just a wrapper around the original CIFAR-10 dataset and that some amount of label noise is expected.

It might still be helpful if the documentation for "keras.datasets.cifar10" included a short note mentioning that CIFAR-10 is known to have some mislabeled samples, with a reference to the original dataset paper or known discussions on label noise. This would make it clearer for new users who encounter these issues while training models.

Thanks again for the quick response!

Comment From: amitsrivastava78

Contributions are welcome for this!