Bug Issue
I found a memory-related performance bug with keras.layers.LSTMCell. The doc of LSTMCell shows its description here:
https://github.com/keras-team/keras/blob/ce0d2788b76119bc778f3d094816b0a9fc2b9748/keras/src/layers/rnn/lstm.py#L27-L29
See the repro below, with TensorFlow 2.19.0.
The memory used increases from 164608 to 165120 after changing recurrent_activation from 'sigmoid' to 'hard_sigmoid':
Repro 1
import tensorflow as tf
gpus = tf.config.experimental.list_physical_devices('GPU')
if gpus:
    for gpu in gpus:
        tf.config.experimental.set_memory_growth(gpu, True)
# Main Code -->
batch_size, timesteps, input_dim = 32, 10, 5
input_data = tf.random.normal([batch_size, timesteps, input_dim])
lstm_cell = tf.keras.layers.LSTMCell(units=64, activation='relu', dropout=0.2, recurrent_activation='sigmoid') ## Choice 1: sigmoid
# lstm_cell = tf.keras.layers.LSTMCell(units=64, activation='relu', dropout=0.2, recurrent_activation='hard_sigmoid') ## Choice 2: hard_sigmoid
rnn_layer = tf.keras.layers.RNN(lstm_cell, return_sequences=True)
output = rnn_layer(input_data)
# Main Code <--
memory = 0
for i in range(len(gpus)):
    memory += tf.config.experimental.get_memory_usage('GPU:%d' % i)
print("Memory Used:", memory)
Output 1
## For Choice 1: sigmoid
Memory Used: 164608
## For Choice 2: hard_sigmoid
Memory Used: 165120
To tell the truth, I'm not sure whether the difference in Output 1 is expected.
But in my opinion, the memory used in Repro 1 should not change, because both activation functions have identical memory characteristics (element-wise operations with the same input/output tensor dimensions):
For the sigmoid function, the code is shown here:
https://github.com/keras-team/keras/blob/ce0d2788b76119bc778f3d094816b0a9fc2b9748/keras/src/activations/activations.py#L482-L506
For the hard_sigmoid function, the code is shown here:
https://github.com/keras-team/keras/blob/ce0d2788b76119bc778f3d094816b0a9fc2b9748/keras/src/activations/activations.py#L519-L539
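Both are simple element-wise maps. As a minimal sketch of my own (using plain TF ops rather than the Keras internals linked above), hard_sigmoid is the piecewise-linear approximation relu6(x + 3) / 6, and both functions produce an output with exactly the input's shape and dtype:
import tensorflow as tf

x = tf.random.normal([4])
sig = tf.math.sigmoid(x)            # 1 / (1 + exp(-x))
hard = tf.nn.relu6(x + 3.0) / 6.0   # 0 for x < -3, 1 for x > 3, x/6 + 0.5 in between
print(sig.shape, sig.dtype)         # (4,) <dtype: 'float32'>
print(hard.shape, hard.dtype)       # (4,) <dtype: 'float32'>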
To verify my opinion above, I tried one more repro below, which shows that the output shape and dtype of the two functions match each other:
Repro 2
import keras
import tensorflow as tf
batch_size, timesteps, input_dim = 32, 10, 5
input_data = tf.random.normal([batch_size, timesteps, input_dim])
y1 = keras.src.ops.hard_sigmoid(input_data)
y2 = keras.src.ops.sigmoid(input_data)
print("y1:", y1.shape, '\n', y1)
print("y2:", y2.shape, '\n', y2)
Output 2
y1: (32, 10, 5)
tf.Tensor(
[[[0.5575269 0.54280454 0.661077 0.7442135 0.54959637]
[0.590985 0.15000606 0.4363846 0.6136034 0.47918186]
[0.38167724 0.3758458 0.19742984 0.44487846 0.22357215]
...
[0.53210396 0.3301783 0.49853078 0.5353122 0.43356502]
[0.58335984 0.48384008 0.470204 0.2632173 0.2183072 ]
[0.63430065 0.3395311 0.62688303 0.47066614 0.22561677]]], shape=(32, 10, 5), dtype=float32)
y2: (32, 10, 5)
tf.Tensor(
[[[0.5854437 0.5638562 0.72441375 0.81233907 0.5738503 ]
[0.63318616 0.10910036 0.4057187 0.6641003 0.4688133 ]
[0.32961282 0.32192805 0.1399842 0.41806313 0.15995441]
...
[0.5480076 0.26523578 0.4977962 0.552771 0.40164632]
[0.6224967 0.47577906 0.45542464 0.19455245 0.15575518]
[0.6912147 0.27631527 0.6816356 0.45611244 0.1616097 ]]], shape=(32, 10, 5), dtype=float32)
Thanks for your attention!
Comment From: sonali-kumari1
Hi @ILCSFNO -
Thanks for pointing this out. I tested your code with the latest version of Keras (3.11.3) and confirmed that memory usage for hard_sigmoid is slightly higher than that of sigmoid. To understand this difference, I printed the computational graph of both activations (see the sketch below, followed by the resulting op lists):
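A minimal sketch of how such op lists can be obtained, assuming the RNN call is traced with tf.function (the exact ops can vary across TF/Keras versions):
import tensorflow as tf

batch_size, timesteps, input_dim = 32, 10, 5
x_spec = tf.TensorSpec([batch_size, timesteps, input_dim], tf.float32)

def graph_op_types(recurrent_activation):
    cell = tf.keras.layers.LSTMCell(
        units=64, activation='relu', dropout=0.2,
        recurrent_activation=recurrent_activation)
    rnn = tf.keras.layers.RNN(cell, return_sequences=True)
    # Build the weights eagerly first, then trace the call into a graph.
    rnn(tf.zeros([batch_size, timesteps, input_dim]))
    concrete = tf.function(lambda x: rnn(x)).get_concrete_function(x_spec)
    return [op.type for op in concrete.graph.get_operations()]

print(graph_op_types('sigmoid'))
print(graph_op_types('hard_sigmoid'))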
sigmoid function:
['Placeholder', 'Const', 'Const', 'Const', 'Fill', 'Const', 'Const', 'Fill', 'Const', 'Const', 'Const', 'StridedSlice',
'Const', 'Transpose', 'Const', 'Const', 'TensorListReserve', 'Const', 'TensorListFromTensor', 'Const', 'Const',
'Const', 'StridedSlice', 'Placeholder', 'ReadVariableOp', 'MatMul', 'Placeholder', 'ReadVariableOp', 'MatMul',
'AddV2', 'Placeholder', 'ReadVariableOp', 'AddV2', 'Const', 'Split', 'Sigmoid', 'Sigmoid', 'Mul', 'Tanh', 'Mul',
'AddV2', 'Sigmoid', 'Tanh', 'Mul', 'Const', 'Const', 'TensorListReserve', 'Const', 'Const', 'Const', 'Const',
'Const', 'Range', 'Const', 'Max', 'Const', 'While', 'Const', 'TensorListStack', 'Const', 'Const', 'Const',
'StridedSlice', 'Const', 'Transpose', 'Identity', 'NoOp']
hard_sigmoid function:
['Placeholder', 'Const', 'Const', 'Const', 'Fill', 'Const', 'Const', 'Fill', 'Const', 'Const', 'Const', 'StridedSlice',
'Const', 'Transpose', 'Const', 'Const', 'TensorListReserve', 'Const', 'TensorListFromTensor', 'Const', 'Const',
'Const', 'StridedSlice', 'Placeholder', 'ReadVariableOp', 'MatMul', 'Placeholder', 'ReadVariableOp', 'MatMul',
'AddV2', 'Placeholder', 'ReadVariableOp', 'AddV2', 'Const', 'Split', 'Const', 'AddV2', 'Relu6', 'Const', 'RealDiv',
'Const', 'AddV2', 'Relu6', 'Const', 'RealDiv', 'Mul', 'Tanh', 'Mul', 'AddV2', 'Const', 'AddV2', 'Relu6', 'Const',
'RealDiv', 'Tanh', 'Mul', 'Const', 'Const', 'TensorListReserve', 'Const', 'Const', 'Const', 'Const', 'Const', 'Range',
'Const', 'Max', 'Const', 'While', 'Const', 'TensorListStack', 'Const', 'Const', 'Const', 'StridedSlice', 'Const',
'Transpose', 'Identity', 'NoOp']
The difference here is that sigmoid uses a single Sigmoid op per gate, while hard_sigmoid uses several ops (AddV2, Relu6, RealDiv) to compute its piecewise-linear approximation of sigmoid. This expands the computational graph for the hard_sigmoid variant, leading to more intermediate tensors and slightly higher memory usage.
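For reference, tracing just the two activations in isolation shows the same one-op versus multi-op decomposition (a minimal sketch of mine; the exact op lists depend on the Keras backend and version):
import tensorflow as tf
import keras

x_spec = tf.TensorSpec([32, 10, 5], tf.float32)

def activation_op_types(fn):
    # Trace the element-wise activation alone and list its graph op types.
    concrete = tf.function(lambda x: fn(x)).get_concrete_function(x_spec)
    return [op.type for op in concrete.graph.get_operations()]

print(activation_op_types(keras.activations.sigmoid))       # roughly: Placeholder, Sigmoid, Identity
print(activation_op_types(keras.activations.hard_sigmoid))  # roughly: Placeholder, Const, AddV2, Relu6, Const, RealDiv, Identity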