Bug Issue
I found a memory-related performance bug with keras.layers.LSTMCell. The doc of LSTMCell shows its description here:
https://github.com/keras-team/keras/blob/ce0d2788b76119bc778f3d094816b0a9fc2b9748/keras/src/layers/rnn/lstm.py#L27-L29
See the repro below, with TensorFlow 2.19.0.
The memory used increases from 164608 to 165120 after changing recurrent_activation from 'sigmoid' to 'hard_sigmoid':
Repro 1
import tensorflow as tf
gpus = tf.config.experimental.list_physical_devices('GPU')
if gpus:
    for gpu in gpus:
        tf.config.experimental.set_memory_growth(gpu, True)
# Main Code -->
batch_size, timesteps, input_dim = 32, 10, 5
input_data = tf.random.normal([batch_size, timesteps, input_dim])
lstm_cell = tf.keras.layers.LSTMCell(units=64, activation='relu', dropout=0.2, recurrent_activation='sigmoid') ## Choice 1: sigmoid
# lstm_cell = tf.keras.layers.LSTMCell(units=64, activation='relu', dropout=0.2, recurrent_activation='hard_sigmoid') ## Choice 2: hard_sigmoid
rnn_layer = tf.keras.layers.RNN(lstm_cell, return_sequences=True)
output = rnn_layer(input_data)
# Main Code <--
memory = 0
for i in range(len(gpus)):
    memory += tf.config.experimental.get_memory_usage('GPU:%d' % i)
print("Memory Used:", memory)
Output 1
## For Choice 1: sigmoid
Memory Used: 164608
## For Choice 2: hard_sigmoid
Memory Used: 165120
To tell the truth, I'm not sure whether the difference in Output 1 is expected.
But in my opinion, the memory used in Repro 1 should not change, because both activation functions have identical memory characteristics (element-wise operations with the same input/output tensor dimensions):
For the sigmoid function, the code is shown here:
https://github.com/keras-team/keras/blob/ce0d2788b76119bc778f3d094816b0a9fc2b9748/keras/src/activations/activations.py#L482-L506
For the hard_sigmoid function, the code is shown here:
https://github.com/keras-team/keras/blob/ce0d2788b76119bc778f3d094816b0a9fc2b9748/keras/src/activations/activations.py#L519-L539
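Both are simple element-wise maps. As a minimal sketch of my own (using plain TF ops rather than the Keras internals linked above), hard_sigmoid is the piecewise-linear approximation relu6(x + 3) / 6, and both functions produce an output with exactly the input's shape and dtype:
import tensorflow as tf

x = tf.random.normal([4])
sig = tf.math.sigmoid(x)            # 1 / (1 + exp(-x))
hard = tf.nn.relu6(x + 3.0) / 6.0   # 0 for x < -3, 1 for x > 3, x/6 + 0.5 in between
print(sig.shape, sig.dtype)         # (4,) <dtype: 'float32'>
print(hard.shape, hard.dtype)       # (4,) <dtype: 'float32'>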
To verify my opinion above, I tried one more repro below, which shows that the output shape and dtype of the two functions match each other:
Repro 2
import keras
import tensorflow as tf
batch_size, timesteps, input_dim = 32, 10, 5
input_data = tf.random.normal([batch_size, timesteps, input_dim])
y1 = keras.src.ops.hard_sigmoid(input_data)
y2 = keras.src.ops.sigmoid(input_data)
print("y1:", y1.shape, '\n', y1)
print("y2:", y2.shape, '\n', y2)
Output 2
y1: (32, 10, 5)
tf.Tensor(
[[[0.5575269 0.54280454 0.661077 0.7442135 0.54959637]
[0.590985 0.15000606 0.4363846 0.6136034 0.47918186]
[0.38167724 0.3758458 0.19742984 0.44487846 0.22357215]
...
[0.53210396 0.3301783 0.49853078 0.5353122 0.43356502]
[0.58335984 0.48384008 0.470204 0.2632173 0.2183072 ]
[0.63430065 0.3395311 0.62688303 0.47066614 0.22561677]]], shape=(32, 10, 5), dtype=float32)
y2: (32, 10, 5)
tf.Tensor(
[[[0.5854437 0.5638562 0.72441375 0.81233907 0.5738503 ]
[0.63318616 0.10910036 0.4057187 0.6641003 0.4688133 ]
[0.32961282 0.32192805 0.1399842 0.41806313 0.15995441]
...
[0.5480076 0.26523578 0.4977962 0.552771 0.40164632]
[0.6224967 0.47577906 0.45542464 0.19455245 0.15575518]
[0.6912147 0.27631527 0.6816356 0.45611244 0.1616097 ]]], shape=(32, 10, 5), dtype=float32)
Thanks for your attention!
Comment From: sonali-kumari1
Hi @ILCSFNO -
Thanks for pointing this out. I tested your code with the latest version of Keras (3.11.3) and confirmed that memory usage for hard_sigmoid is slightly higher than that of sigmoid. To understand this difference, I printed the computational graph of both activations (see the sketch below, followed by the resulting op lists):
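A minimal sketch of how such op lists can be obtained, assuming the RNN call is traced with tf.function (the exact ops can vary across TF/Keras versions):
import tensorflow as tf

batch_size, timesteps, input_dim = 32, 10, 5
x_spec = tf.TensorSpec([batch_size, timesteps, input_dim], tf.float32)

def graph_op_types(recurrent_activation):
    cell = tf.keras.layers.LSTMCell(
        units=64, activation='relu', dropout=0.2,
        recurrent_activation=recurrent_activation)
    rnn = tf.keras.layers.RNN(cell, return_sequences=True)
    # Build the weights eagerly first, then trace the call into a graph.
    rnn(tf.zeros([batch_size, timesteps, input_dim]))
    concrete = tf.function(lambda x: rnn(x)).get_concrete_function(x_spec)
    return [op.type for op in concrete.graph.get_operations()]

print(graph_op_types('sigmoid'))
print(graph_op_types('hard_sigmoid'))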
sigmoid function:
['Placeholder', 'Const', 'Const', 'Const', 'Fill', 'Const', 'Const', 'Fill', 'Const', 'Const', 'Const', 'StridedSlice',
'Const', 'Transpose', 'Const', 'Const', 'TensorListReserve', 'Const', 'TensorListFromTensor', 'Const', 'Const',
'Const', 'StridedSlice', 'Placeholder', 'ReadVariableOp', 'MatMul', 'Placeholder', 'ReadVariableOp', 'MatMul',
'AddV2', 'Placeholder', 'ReadVariableOp', 'AddV2', 'Const', 'Split', 'Sigmoid', 'Sigmoid', 'Mul', 'Tanh', 'Mul',
'AddV2', 'Sigmoid', 'Tanh', 'Mul', 'Const', 'Const', 'TensorListReserve', 'Const', 'Const', 'Const', 'Const',
'Const', 'Range', 'Const', 'Max', 'Const', 'While', 'Const', 'TensorListStack', 'Const', 'Const', 'Const',
'StridedSlice', 'Const', 'Transpose', 'Identity', 'NoOp']
hard_sigmoid function:
['Placeholder', 'Const', 'Const', 'Const', 'Fill', 'Const', 'Const', 'Fill', 'Const', 'Const', 'Const', 'StridedSlice',
'Const', 'Transpose', 'Const', 'Const', 'TensorListReserve', 'Const', 'TensorListFromTensor', 'Const', 'Const',
'Const', 'StridedSlice', 'Placeholder', 'ReadVariableOp', 'MatMul', 'Placeholder', 'ReadVariableOp', 'MatMul',
'AddV2', 'Placeholder', 'ReadVariableOp', 'AddV2', 'Const', 'Split', 'Const', 'AddV2', 'Relu6', 'Const', 'RealDiv',
'Const', 'AddV2', 'Relu6', 'Const', 'RealDiv', 'Mul', 'Tanh', 'Mul', 'AddV2', 'Const', 'AddV2', 'Relu6', 'Const',
'RealDiv', 'Tanh', 'Mul', 'Const', 'Const', 'TensorListReserve', 'Const', 'Const', 'Const', 'Const', 'Const', 'Range',
'Const', 'Max', 'Const', 'While', 'Const', 'TensorListStack', 'Const', 'Const', 'Const', 'StridedSlice', 'Const',
'Transpose', 'Identity', 'NoOp']
The difference here is that sigmoid uses a single Sigmoid op per gate, while hard_sigmoid uses several ops (AddV2, Relu6, RealDiv) to compute its piecewise-linear approximation of sigmoid. This expands the computational graph for the hard_sigmoid variant, leading to more intermediate tensors and slightly higher memory usage.
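For reference, tracing just the two activations in isolation shows the same one-op versus multi-op decomposition (a minimal sketch of mine; the exact op lists depend on the Keras backend and version):
import tensorflow as tf
import keras

x_spec = tf.TensorSpec([32, 10, 5], tf.float32)

def activation_op_types(fn):
    # Trace the element-wise activation alone and list its graph op types.
    concrete = tf.function(lambda x: fn(x)).get_concrete_function(x_spec)
    return [op.type for op in concrete.graph.get_operations()]

print(activation_op_types(keras.activations.sigmoid))       # roughly: Placeholder, Sigmoid, Identity
print(activation_op_types(keras.activations.hard_sigmoid))  # roughly: Placeholder, Const, AddV2, Relu6, Const, RealDiv, Identity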