When using model.evaluate(), the metric values displayed in the progress bar differ from the values returned by the method. Double averaging appears to be happening: the per-batch values are already running averages, yet the progress bar averages these averages once more.
Code to reproduce:
import tensorflow as tf
import numpy as np
# Model that simply passes its input through to the output
model = tf.keras.Sequential([
    tf.keras.layers.Lambda(lambda x: x)  # identity Lambda layer
])
# Compile the model with MAE as the metric
model.compile(optimizer='adam', loss='mse', metrics=['mae'])
# Dummy data for evaluation
x = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 10], dtype=float)
y = np.zeros_like(x) # Dummy target values
results = model.evaluate(x, y, verbose=1, batch_size=1)
print("Evaluation results:", results)
Output:
10/10 ━━━━━━━━━━━━━━━━━━━━ 3s 4ms/step - loss: 17.8182 - mae: 3.4545
Evaluation results: [38.5, 5.5]
Expected behavior: The metric values shown in the progress bar should match the final returned results (or at least be clearly documented if this difference is intentional).
Issue:
Progress bar shows loss: 17.8182, MAE: 3.4545
Returned values show loss: 38.5, MAE: 5.5
The correct values are the returned ones (38.5 and 5.5, respectively), as they match a manual calculation (see the sketch after this list)
The progress bar appears to average the already-averaged per-batch values
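For reference, here is a quick manual check (a sketch added for illustration, using plain NumPy). The returned values are the plain means of the per-batch losses/metrics. Averaging the per-batch running means once more, with the final running mean counted a second time at finalization (11 values in total), happens to reproduce the progress-bar numbers exactly; this is only one averaging scheme consistent with the reported output, not necessarily the exact mechanism inside the progress bar.
import numpy as np

x = np.arange(1.0, 11.0)
per_batch_mse = x ** 2  # with batch_size=1 and y=0, the MSE of batch k is k**2
per_batch_mae = x       # and the MAE of batch k is k

print(per_batch_mse.mean())  # 38.5 -> matches the returned loss
print(per_batch_mae.mean())  # 5.5  -> matches the returned MAE

# Running means after each batch (what a stateful metric reports step by step)
running_mse = np.cumsum(per_batch_mse) / np.arange(1, 11)
running_mae = np.cumsum(per_batch_mae) / np.arange(1, 11)

# Averaging these already-averaged values again, with the final running mean
# counted once more, reproduces the progress-bar numbers
print((running_mse.sum() + running_mse[-1]) / 11)  # 17.818181..., shown as 17.8182
print((running_mae.sum() + running_mae[-1]) / 11)  # 3.454545..., shown as 3.4545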
Environment: TensorFlow 2.19
Additional notes: This discrepancy can be confusing for users who rely on the progress bar metrics during evaluation.
Comment From: sonali-kumari1
Hi @ielenik -
I have tested your code with the latest versions of keras (3.9.2) and tensorflow (2.19.0) in this gist and I was able to reproduce the mismatch between the metric values shown in the progress bar and the values returned by the evaluate() method. However, when I tested with keras (2.15.0) and tensorflow (2.15.0) using this gist, the results were consistent between the progress bar and the final evaluation output. We will look into this and update you. Thanks!
Comment From: abheesht17
This was fixed in https://github.com/keras-team/keras/pull/21331.
Here is a notebook that shows it's working fine: https://colab.research.google.com/gist/abheesht17/fffe166a406c0d9976310f51fd25edb2/keras-progbar-evaluate-output.ipynb.
Comment From: ahasselbring
I don't think that #21331 solves the real problem. The Progbar shouldn't do any averaging, because - if I see it correctly - all stateless metrics are already "converted" to stateful metrics here: https://github.com/keras-team/keras/blob/503bcf56180d49641db2afcd3d57c4c20e3be3bf/keras/src/trainers/compile_utils.py#L89. Then the stateful_metrics argument/attribute can be removed.
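As an illustration of why the values the progress bar receives are already running averages, here is a minimal sketch (my addition, not the compile_utils code itself): a metric requested as 'mae' ends up as a stateful metric object that accumulates state across batches, so the value it reports after batch k is already the mean over batches 1..k.
import numpy as np
import keras

# Rough stand-in for what metrics=['mae'] becomes after compilation: a stateful
# metric that keeps a running mean across update_state() calls
mae = keras.metrics.MeanAbsoluteError()
for value in np.arange(1.0, 11.0):      # per-sample absolute errors 1, 2, ..., 10
    mae.update_state([0.0], [value])    # y_true = 0, y_pred = value
    print(float(mae.result()))          # prints 1.0, 1.5, 2.0, ..., 5.5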
This would also match previous tf.keras behavior, if I read this line correctly: https://github.com/keras-team/tf-keras/blob/c79cc0ef7df3d4066791f7cde38199cfa23e26f0/tf_keras/callbacks.py#L1129 (all of the model's metrics were marked as stateful by the ProgbarLogger).
On the other hand, if Progbar should continue to support stateless and stateful values, it isn't correct to replace the final value. If the metric is really stateless, it should simply not be added when finalize is true.
Unfortunately, this will break models that use custom train_steps/test_steps and return stateless metrics. These would previously be averaged (but since evaluate returns the last result of test_step, stateless metrics don't really work anyway - at least I can't imagine a metric that would be useful when only computed on the final batch).
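For illustration, here is a minimal sketch (my addition, driving keras.utils.Progbar directly rather than going through evaluate()) of the difference the stateful_metrics argument makes when the values fed to the bar are already running means:
import numpy as np
from keras.utils import Progbar

# Per-batch running means, as a stateful metric reports them after each of the
# 10 batches in the reproduction above: 1.0, 1.5, ..., 5.5
running_mae = [float(np.mean(np.arange(1.0, k + 1))) for k in range(1, 11)]

# Declared stateful: the bar displays the latest reported value, ending at 5.5
bar = Progbar(target=10, stateful_metrics=["mae"])
for step, value in enumerate(running_mae, start=1):
    bar.update(step, values=[("mae", value)])

# Default (treated as stateless): the bar averages the already-averaged values
# again before displaying them - the double averaging discussed above
bar = Progbar(target=10)
for step, value in enumerate(running_mae, start=1):
    bar.update(step, values=[("mae", value)])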