Hi! I have stumbled upon a pretty annoying issue when training a model using tf.distribute.MirroredStrategy.

Essentially, after a certain number of steps in each epoch, all reported metrics go NaN, but the model is actually training fine under the hood.

After digging through and debugging the Keras code that implements metrics, I think I found the cause.

keras.metrics.Mean has "total" and "count" mirrored variables that are reduced through SUM: "total" accumulates the state of the metric, and "count" is what "total" is divided by to give back the correct average. When running on multiple GPUs, it seems that "total" can grow too quickly and overflow, resulting in the NaN metrics.
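To make this concrete, here is a minimal single-device sketch of the state involved (just an illustration using the variable names I saw while debugging, not a reproduction of the bug):

import keras

# keras.metrics.Mean keeps two state variables, "total" and "count",
# and returns total / count as its result. Under MirroredStrategy these
# are the variables that get SUM-reduced across replicas.
m = keras.metrics.Mean()
m.update_state([2.0, 4.0])
m.update_state([6.0])
print([v.name for v in m.variables])  # expect the "total" and "count" variables
print(float(m.result()))              # (2.0 + 4.0 + 6.0) / 3 = 4.0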

keras.metrics.Sum might have the same issue.

If you guys at Keras have any idea of which path to follow to fix this, I would be happy to contribute if needed.

P.S.: this could be related to some TensorFlow issues describing the same behaviour, e.g. https://github.com/tensorflow/tensorflow/issues/90686

Comment From: dhantule

Hi @gianlucasama, thanks for reporting this.

Could you please provide some code to reproduce the issue?

Comment From: gianlucasama

I created a simple test script to reproduce the issue.

I am running: tensorflow==2.17.0, keras==3.9.0, numpy==1.26.4

Hardware: a Lambda Cloud machine with 8 V100 GPUs, running inside the latest GPU-enabled TensorFlow Docker container.

nvcc output:
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2023 NVIDIA Corporation
Built on Wed_Nov_22_10:17:15_PST_2023
Cuda compilation tools, release 12.3, V12.3.107
Build cuda_12.3.r12.3/compiler.33567101_0

nvidia-smi output:
NVIDIA-SMI 570.148.08
Driver Version: 570.148.08
CUDA Version: 12.8
(with all 8 GPUs listed)

Test script:

import tensorflow as tf
import numpy as np
import keras

strategy = tf.distribute.MirroredStrategy()

num_samples = 1000000
x_train = np.random.random((num_samples, 32)).astype(np.float32)
y_train = np.random.random((num_samples, 10)).astype(np.float32)

with strategy.scope():
    model = keras.Sequential(
        [
            keras.layers.Dense(64, activation="relu"),
            keras.layers.Dense(10, activation=None),
        ]
    )
    model.compile(
        optimizer=keras.optimizers.Adam(),
        loss=keras.losses.MeanSquaredError(),
    )

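# With 8 replicas and a global batch size of 8, each GPU processes 1 sample
# per step, so every epoch runs 1,000,000 / 8 = 125,000 metric updates.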
model.fit(x_train, y_train, epochs=1000, batch_size=8, verbose=1)

Metrics go NaN after a certain number of steps, but again, the model seems to be training fine under the hood.

Comment From: gianlucasama

After upgrading to keras==3.10 I am no longer seeing the issue, so it may have been fixed in the latest release. Nevertheless, it's a weird and annoying bug to watch out for: it makes the training logs practically useless, and it seems to have affected other people for years.
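For anyone else hitting this, a quick way to double-check which versions are actually being picked up at runtime:

import keras
import tensorflow as tf

print(keras.__version__)  # should print 3.10.x after the upgrade
print(tf.__version__)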

Comment From: dhantule

Hi @gianlucasama, thanks for your response. Can we close this issue since it's working with Keras 3.10.0?

Comment From: github-actions[bot]

This issue is stale because it has been open for 14 days with no activity. It will be closed if no further activity occurs. Thank you.