Summary
When using the keras.layers.Attention layer with mask support and KERAS_BACKEND=torch, a shape mismatch occurs between scores and scores_mask, leading to a broadcasting error during training.
Code to Reproduce
import os
os.environ["KERAS_BACKEND"] = "torch"
import numpy as np
from keras.layers import Attention, Add, GlobalAveragePooling1D, Dropout, Lambda, Embedding, Input, Dense
from keras.models import Model
from keras.utils import plot_model
from keras import ops
WORDS = 1000
MAXLEN = 50
#simulates a corpus
x_train = np.random.randint(low=1, high=WORDS+1, size=(20000,MAXLEN))
x_test = np.random.randint(low=1, high=WORDS+1, size=(20000,MAXLEN))
y_train = np.random.randint(low=0, high=2, size=(20000, 1))
y_test = np.random.randint(low=0, high=2, size=(20000, 1))
#simulates padding
x_train[:, np.random.randint(low=30, high=MAXLEN, size=(20000,))] = 0
i = Input((MAXLEN,))
e = Embedding(WORDS + 1, 60, mask_zero=True, name='base_emb')(i)
len_s = Lambda(lambda x: ops.expand_dims(ops.arange(start=0, stop=MAXLEN, step=1), axis=0), output_shape=(MAXLEN,))(i)
pos_e = Embedding(MAXLEN, 60, mask_zero=True, name='base_pos')(len_s)
sum = Add()([e, pos_e])
dq = Dense(60)(sum)
dk = Dense(60)(sum)
mask = Lambda(lambda x: x != 0, name='mask')(i)
att = Attention(name='att')([dq, dk], mask=[mask, mask])
attd = Dropout(0.1)(att)
d = GlobalAveragePooling1D()(attd)
d = Dropout(0.1)(d)
d = Dense(100)(d)
d = Dropout(0.1)(d)
d = Dense(1, activation='sigmoid')(d)
model = Model(i, d)
model.summary()
model.compile(loss='binary_crossentropy', optimizer='nadam', metrics=['accuracy'])
model.fit(x_train, y_train, epochs=3, validation_data=(x_test, y_test))
Error Output
RuntimeError: Unable to automatically build the model...
Exception encountered when calling Attention.call().
The size of tensor a (50) must match the size of tensor b (32) at non-singleton dimension 1
Arguments received by Attention.call():
• inputs=['torch.Tensor(shape=torch.Size([32, 50, 60])', 'torch.Tensor(shape=torch.Size([32, 50, 60])']
• mask=['torch.Tensor(shape=torch.Size([32, 50])', 'torch.Tensor(shape=torch.Size([32, 50])']
• training=False
• return_attention_scores=False
• use_causal_mask=False
Root Cause
In keras/src/layers/attention/attention.py, lines 174–179:
if scores_mask is not None:
    padding_mask = ops.logical_not(scores_mask)
    max_value = 65504.0 if scores.dtype == "float16" else 1.0e9
    scores -= max_value * ops.cast(padding_mask, dtype=scores.dtype)
Here, scores_mask has shape [batch, seq_len] (e.g., [32, 50]), while scores, the query–key dot products, has shape [batch, Tq, Tv] (e.g., [32, 50, 50]). When the mask is subtracted from the scores, broadcasting aligns the trailing dimensions, so the mask's batch dimension (32) is matched against the scores' Tq dimension (50), which raises the error above.
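For illustration, the mismatch can be reproduced with keras.ops alone; the shapes below are taken from the error output and the analysis above, not read out of the Keras internals:
import os
os.environ["KERAS_BACKEND"] = "torch"
from keras import ops

scores = ops.zeros((32, 50, 50))                 # [batch, Tq, Tv] attention scores
scores_mask = ops.ones((32, 50), dtype="bool")   # [batch, Tv] padding mask

padding_mask = ops.logical_not(scores_mask)
# scores -= 1.0e9 * ops.cast(padding_mask, "float32")
# -> RuntimeError: The size of tensor a (50) must match the size of
#    tensor b (32) at non-singleton dimension 1

# With the extra dimension the mask broadcasts against the scores:
padding_mask = ops.expand_dims(padding_mask, axis=-2)   # [32, 1, 50]
scores -= 1.0e9 * ops.cast(padding_mask, "float32")
print(scores.shape)   # torch.Size([32, 50, 50])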
Suggested Fix / Workaround
Adding a dimension to the mask, if necessary, resolves the issue:
if scores_mask is not None:
    padding_mask = ops.logical_not(scores_mask)
    max_value = 65504.0 if scores.dtype == "float16" else 1.0e9
    if len(padding_mask.shape) == 2:
        padding_mask = ops.expand_dims(padding_mask, axis=-2)
    scores -= max_value * ops.cast(padding_mask, dtype=scores.dtype)
This change ensures the mask is compatible with the shape of the scores tensor. It should not interfere with causal masking behavior (use_causal_mask=True), which should already produce properly shaped masks.
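Until a fix lands upstream, a user-side workaround is a small Attention subclass that applies the same guard before delegating to the built-in logic. This is only a sketch: it overrides the private _apply_scores method, so it depends on Keras 3.10 internals and may break in other versions.
from keras import layers, ops

class PatchedAttention(layers.Attention):
    # Workaround sketch: expand a 2-D scores_mask to [batch, 1, Tv] before
    # the built-in masking logic subtracts it from the attention scores.
    # Relies on the private _apply_scores API of Keras 3.10.
    def _apply_scores(self, scores, value, scores_mask=None, training=False):
        if scores_mask is not None and len(scores_mask.shape) == 2:
            scores_mask = ops.expand_dims(scores_mask, axis=-2)
        return super()._apply_scores(
            scores, value, scores_mask=scores_mask, training=training
        )
In the reproduction script above, the layer would be swapped in as PatchedAttention(name='att')([dq, dk], mask=[mask, mask]).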
Environment
Keras Version: 3.10.0
Torch Version: 2.7.1+cu128
Python Version: 3.11.13
OS: Windows 11 24H2
Additional Notes
The issue does not occur when use_causal_mask=True.
I initially suspected this only affected the Torch backend due to differing broadcasting semantics or internal tensor shape expectations, but the issue appears to be backend-independent: I reproduced it with the TensorFlow backend in Google Colab as well.
Please let me know if you'd like me to submit a pull request with the proposed fix.