The evaluate() method from the Model API returns accuracy and loss values that are inconsistent with the ones computed during training.
Issue happening with:
- tensorflow==2.18.0, keras==3.8.0, numpy==2.0.2
- tensorflow==2.17.1, keras==3.5.0, numpy==1.26.4
Issue not happening with:
- tensorflow==2.12.0, keras==2.12.0, numpy==1.23.5
This has been tested both with and without a GPU, on two personal computers and on a Google Colab VM.
Example of this behavior on the validation values:
Training
1500/1500 ━━━━━━━━━━━━━━━━━━━━ 2s 1ms/step - accuracy: 0.9496 - loss: 0.0078 - val_accuracy: 0.9581 - val_loss: 0.0065
evaluate()
313/313 ━━━━━━━━━━━━━━━━━━━━ 1s 2ms/step - accuracy: 0.9519 - loss: 0.0075
Code to produce the issue:
import numpy as np
import tensorflow as tf
import keras
seed = 123
np.random.seed(seed)
tf.random.set_seed(seed)
keras.utils.set_random_seed(seed)
(x_train, y_train), (x_test, y_test) = tf.keras.datasets.mnist.load_data()
x_train = np.expand_dims(x_train / 255.0, -1)
x_test = np.expand_dims(x_test / 255.0, -1)
y_train_one_hot = keras.utils.to_categorical(y_train, 10)
y_test_one_hot = keras.utils.to_categorical(y_test, 10)
model = keras.models.Sequential([
    keras.layers.Input((28, 28, 1)),
    keras.layers.Flatten(),
    keras.layers.Dense(64, activation="relu"),
    keras.layers.Dense(10, activation="softmax")
])
model.compile(optimizer="Adam", loss="mean_squared_error", metrics=["accuracy"])
model.summary()
model.fit(
    x=x_train,
    y=y_train_one_hot,
    batch_size=40,
    epochs=2,
    validation_data=(x_test, y_test_one_hot)
)
mse = keras.losses.MeanSquaredError()
# Incorrect values. Changing the batch_size to 40 or 1 does not remove the issue:
print("\nResults from evaluate():")
model.evaluate(x_test, y_test_one_hot, batch_size=None)
print("\nBypassing evaluate():") # those values match the ones computed during training!
print("accuracy =", np.mean(np.argmax(model.predict(x_test), axis=1) == y_test))
print("mse = %.4f" % float(mse(model.predict(x_test), y_test_one_hot)))
Comment From: Carath
Additional information: the problem doesn't seem to arise when passing batch_size=None and steps=1 to evaluate().
Maybe there is an issue with one of the internal callbacks? In any case, the correct loss and accuracy values should be returned no matter which batch_size is used (as long as it divides the number of samples).
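A minimal sketch of that observation, assuming the model, x_test and y_test_one_hot from the reproduction script above:
# With steps=1 and no explicit batch_size, the whole test set is processed
# in a single evaluation step, and the printed values match the returned ones.
model.evaluate(x_test, y_test_one_hot, batch_size=None, steps=1)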
Comment From: sonali-kumari1
Hi @Carath -
I have tried to replicate the issue. The problem doesn't seem to arise when passing batch_size=None and steps=1 to model.evaluate() because the entire dataset is then processed in a single batch. If you want to keep a standard batch_size, you can use verbose=2 in model.evaluate() to ensure consistent results and more detailed metric aggregation during evaluation. Attaching a gist for your reference. Thanks!
Comment From: Carath
Thank you for your answer.
I fail to see how this really fixes the issue: many users will just call the evaluate() method and be confused by the mismatching values. Moreover, it seems quite weird that changing the verbosity level of the method causes completely different values to be printed.
I also don't think this is a batch size issue, as the values returned when passing return_dict=True are actually correct (up to 1e-4), despite the printed values being wrong:
print(model.evaluate(x_test, y_test_one_hot, batch_size=32, return_dict=True))
Output:
313/313 ━━━━━━━━━━━━━━━━━━━━ 0s 1ms/step - accuracy: 0.9519 - loss: 0.0075
{'accuracy': 0.9581000208854675, 'loss': 0.006478920113295317}
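As a cross-check that the returned dictionary holds the correct values, here is a small sketch (assuming the model, x_test and y_test_one_hot from the reproduction script above) that recomputes both metrics outside of evaluate(); the results should match the returned dict rather than the printed progress-bar values:
import keras

preds = model.predict(x_test, verbose=0)

# Same metrics as the compiled model, computed over the full test set at once.
acc_metric = keras.metrics.CategoricalAccuracy()
acc_metric.update_state(y_test_one_hot, preds)
mse_metric = keras.metrics.MeanSquaredError()
mse_metric.update_state(y_test_one_hot, preds)

print("accuracy =", float(acc_metric.result()))  # ~0.9581
print("mse      =", float(mse_metric.result()))  # ~0.0065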
I propose setting verbose=2 as the default argument of evaluate(), so as not to cause user confusion until the underlying issue is fixed.
Comment From: sonali-kumari1
Hi @Carath -
Changing the verbosity level of the method yields slightly different values because verbose=1 displays a progress bar after each batch, while verbose=2 shows the summary metrics of training and validation at the end of each epoch. Avoid using batch_size in model.evaluate(), as batch generation is handled internally. You can refer to this documentation for more details.
Comment From: Carath
Created a pull request to prevent the incorrect behavior from affecting most users.
On a side note, replies to this thread seem to be (at least partially) coming from a chat bot; I don't believe that to be really appropriate if that is the case.
Comment From: sonali-kumari1
Hi @Carath -
As mentioned by @fchollet in the PR, it might be the progress bar callback being delayed by one batch or something. You are welcome to contribute a fix for this issue!
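A rough diagnostic sketch for that hypothesis, assuming the model and test data from the reproduction script above (the BatchLogger callback is a hypothetical helper written for this check, not part of Keras): it records the per-batch logs that the progress bar is fed and compares the last entry against what evaluate() actually returns.
import keras

class BatchLogger(keras.callbacks.Callback):
    # Collects the logs passed to the callback after each test batch.
    def __init__(self):
        super().__init__()
        self.batch_logs = []

    def on_test_batch_end(self, batch, logs=None):
        self.batch_logs.append((batch, dict(logs or {})))

logger = BatchLogger()
returned = model.evaluate(x_test, y_test_one_hot, return_dict=True, callbacks=[logger])
print("logs after the last batch:", logger.batch_logs[-1])
print("metrics actually returned:", returned)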
Comment From: tripathi-genius
Is the issue open for contribution? I want to contribute.
Comment From: sonali-kumari1
Hi @tripathi-genius - Feel free to open a PR to fix this issue and ensure the issue is linked in the PR. Thanks!