The evaluate() method from the Model API returns accuracy and loss values that are inconsistent with the ones computed during training.
Issue happening with:
- tensorflow==2.18.0, keras==3.8.0, numpy==2.0.2
- tensorflow==2.17.1, keras==3.5.0, numpy==1.26.4
Issue not happening with:
- tensorflow==2.12.0, keras==2.12.0, numpy==1.23.5
This has been tested both with and without a GPU, on two personal computers and on a Google Colab VM.
Example of this behavior on the validation values:
Training
1500/1500 ━━━━━━━━━━━━━━━━━━━━ 2s 1ms/step - accuracy: 0.9496 - loss: 0.0078 - val_accuracy: 0.9581 - val_loss: 0.0065
evaluate()
313/313 ━━━━━━━━━━━━━━━━━━━━ 1s 2ms/step - accuracy: 0.9519 - loss: 0.0075
Code to produce the issue:
import numpy as np
import tensorflow as tf
import keras
seed = 123
np.random.seed(seed)
tf.random.set_seed(seed)
keras.utils.set_random_seed(seed)
(x_train, y_train), (x_test, y_test) = tf.keras.datasets.mnist.load_data()
x_train = np.expand_dims(x_train / 255.0, -1)
x_test = np.expand_dims(x_test / 255.0, -1)
y_train_one_hot = keras.utils.to_categorical(y_train, 10)
y_test_one_hot = keras.utils.to_categorical(y_test, 10)
model = keras.models.Sequential([
    keras.layers.Input((28, 28, 1)),
    keras.layers.Flatten(),
    keras.layers.Dense(64, activation="relu"),
    keras.layers.Dense(10, activation="softmax")
])
model.compile(optimizer="Adam", loss="mean_squared_error", metrics=["accuracy"])
model.summary()
model.fit(
    x=x_train,
    y=y_train_one_hot,
    batch_size=40,
    epochs=2,
    validation_data=(x_test, y_test_one_hot)
)
mse = keras.losses.MeanSquaredError()
# Incorrect values. Changing the batch_size to 40 or 1 does not remove the issue:
print("\nResults from evaluate():")
model.evaluate(x_test, y_test_one_hot, batch_size=None)
print("\nBypassing evaluate():") # those values match the ones computed during training!
print("accuracy =", np.mean(np.argmax(model.predict(x_test), axis=1) == y_test))
print("mse = %.4f" % float(mse(model.predict(x_test), y_test_one_hot)))
Comment From: Carath
Additional information: the problem doesn't seem to arise when passing batch_size=None and steps=1 to evaluate().
Maybe there is an issue with one of the internal callbacks? In any case, the correct loss and accuracy values should be returned no matter which batch_size is used (as long as it divides the number of samples).
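A minimal sketch of that observation, assuming the model, x_test and y_test_one_hot from the reproduction script above:
# With steps=1 and no explicit batch_size, the whole test set is processed
# in a single evaluation step, and the printed values match the returned ones.
model.evaluate(x_test, y_test_one_hot, batch_size=None, steps=1)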
Comment From: sonali-kumari1
Hi @Carath -
I have tried to replicate the issue. The problem doesn't seem to arise when passing batch_size=None and steps=1 to model.evaluate() because the entire dataset is then processed in a single batch. If you want to keep a standard batch_size, you can use verbose=2 in model.evaluate() to ensure consistent results and more detailed metric aggregation during evaluation. Attaching a gist for your reference. Thanks!
Comment From: Carath
Thank you for your answer.
I fail to see how this really fixes the issue: many users will just call the evaluate() method and be confused by the mismatching values. Moreover, it seems quite weird that changing the verbosity level of the method causes completely different values to be printed.
I also don't think this is a batch size issue, as the values returned when passing return_dict=True are actually correct (up to 1e-4), despite the printed values being wrong:
print(model.evaluate(x_test, y_test_one_hot, batch_size=32, return_dict=True))
Output:
313/313 ━━━━━━━━━━━━━━━━━━━━ 0s 1ms/step - accuracy: 0.9519 - loss: 0.0075
{'accuracy': 0.9581000208854675, 'loss': 0.006478920113295317}
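As a cross-check that the returned dictionary holds the correct values, here is a small sketch (assuming the model, x_test and y_test_one_hot from the reproduction script above) that recomputes both metrics outside of evaluate(); the results should match the returned dict rather than the printed progress-bar values:
import keras

preds = model.predict(x_test, verbose=0)

# Same metrics as the compiled model, computed over the full test set at once.
acc_metric = keras.metrics.CategoricalAccuracy()
acc_metric.update_state(y_test_one_hot, preds)
mse_metric = keras.metrics.MeanSquaredError()
mse_metric.update_state(y_test_one_hot, preds)

print("accuracy =", float(acc_metric.result()))  # ~0.9581
print("mse      =", float(mse_metric.result()))  # ~0.0065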
I propose setting verbose=2 as the default argument of evaluate(), so as not to cause user confusion until the underlying issue is fixed.
Comment From: sonali-kumari1
Hi @Carath -
Changing the verbosity level of the method yields slightly different values because verbose=1 displays a progress bar after each batch, while verbose=2 shows the summary metrics of training and validation at the end of each epoch. Avoid using batch_size in model.evaluate(), as batch generation is handled internally. You can refer to this documentation for more details.
Comment From: Carath
Created a pull request to prevent the incorrect behavior from affecting most users.
On a side note, replies to this thread seem to be (at least partially) coming from a chat bot; I don't believe that to be really appropriate if that is the case.
Comment From: sonali-kumari1
Hi @Carath -
As mentioned by @fchollet in the PR, it might be the progress bar callback being delayed by one batch or something. You are welcome to contribute a fix for this issue!
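A rough diagnostic sketch for that hypothesis, assuming the model and test data from the reproduction script above (the BatchLogger callback is a hypothetical helper written for this check, not part of Keras): it records the per-batch logs that the progress bar is fed and compares the last entry against what evaluate() actually returns.
import keras

class BatchLogger(keras.callbacks.Callback):
    # Collects the logs passed to the callback after each test batch.
    def __init__(self):
        super().__init__()
        self.batch_logs = []

    def on_test_batch_end(self, batch, logs=None):
        self.batch_logs.append((batch, dict(logs or {})))

logger = BatchLogger()
returned = model.evaluate(x_test, y_test_one_hot, return_dict=True, callbacks=[logger])
print("logs after the last batch:", logger.batch_logs[-1])
print("metrics actually returned:", returned)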
Comment From: tripathi-genius
Is the issue open for contribution? I want to contribute.
Comment From: sonali-kumari1
Hi @tripathi-genius - Feel free to open a PR to fix this issue and ensure the issue is linked in the PR. Thanks!