Summary

When calling keras.layers.Attention with a mask under KERAS_BACKEND=torch, a shape mismatch between scores and scores_mask leads to a broadcasting error during training.

Code to Reproduce

import os
os.environ["KERAS_BACKEND"] = "torch"


import numpy as np
from keras.layers import Attention, Add, GlobalAveragePooling1D, Dropout, Lambda, Embedding, Input, Dense
from keras.models import Model
from keras.utils import plot_model
from keras import ops

WORDS = 1000
MAXLEN = 50

# simulates a corpus
x_train = np.random.randint(low=1, high=WORDS+1, size=(20000,MAXLEN))
x_test = np.random.randint(low=1, high=WORDS+1, size=(20000,MAXLEN))
y_train = np.random.randint(low=0, high=2, size=(20000, 1))
y_test = np.random.randint(low=0, high=2, size=(20000, 1))

# simulates padding
x_train[:, np.random.randint(low=30, high=MAXLEN, size=(20000,))] = 0

i = Input((MAXLEN,))
e = Embedding(WORDS + 1, 60, mask_zero=True, name='base_emb')(i)

len_s = Lambda(lambda x: ops.expand_dims(ops.arange(start=0, stop=MAXLEN, step=1), axis=0), output_shape=(MAXLEN,))(i)
pos_e = Embedding(MAXLEN, 60, mask_zero=True, name='base_pos')(len_s)

added = Add()([e, pos_e])
dq = Dense(60)(added)
dk = Dense(60)(added)

mask = Lambda(lambda x: x != 0, name='mask')(i)

att = Attention(name='att')([dq, dk], mask=[mask, mask])
attd = Dropout(0.1)(att)

d = GlobalAveragePooling1D()(attd)
d = Dropout(0.1)(d)
d = Dense(100)(d)
d = Dropout(0.1)(d)
d = Dense(1, activation='sigmoid')(d)


model = Model(i, d)
model.summary()
model.compile(loss='binary_crossentropy', optimizer='nadam', metrics=['accuracy'])

model.fit(x_train, y_train, epochs=3, validation_data=(x_test, y_test))

Error Output

RuntimeError: Unable to automatically build the model...
Exception encountered when calling Attention.call().

The size of tensor a (50) must match the size of tensor b (32) at non-singleton dimension 1

Arguments received by Attention.call():
• inputs=['torch.Tensor(shape=torch.Size([32, 50, 60])', 'torch.Tensor(shape=torch.Size([32, 50, 60])']
• mask=['torch.Tensor(shape=torch.Size([32, 50])', 'torch.Tensor(shape=torch.Size([32, 50])']
• training=False
• return_attention_scores=False
• use_causal_mask=False

Root Cause

In keras/src/layers/attention/attention.py, lines 174–179:

if scores_mask is not None:
    padding_mask = ops.logical_not(scores_mask)
    max_value = 65504.0 if scores.dtype == "float16" else 1.0e9
    scores -= max_value * ops.cast(padding_mask, dtype=scores.dtype)

Here, scores_mask has shape [batch, Tq] (e.g., [32, 50]), while scores has shape [batch, Tq, Tv] (e.g., [32, 50, 50]). Because broadcasting aligns trailing dimensions, the mask's batch dimension (32) ends up matched against Tq (50), which produces the error above.
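
The mismatch can be reproduced outside the layer. The following is a minimal sketch using only keras.ops, with the tensor shapes taken from the error output above (it is not the actual layer code):

import os
os.environ["KERAS_BACKEND"] = "torch"

from keras import ops

# Hypothetical shapes matching the error output: scores is [batch, Tq, Tv],
# scores_mask is [batch, Tv].
scores = ops.zeros((32, 50, 50))
scores_mask = ops.ones((32, 50), dtype="bool")

padding_mask = ops.logical_not(scores_mask)
# Raises "The size of tensor a (50) must match the size of tensor b (32) at
# non-singleton dimension 1": broadcasting aligns the trailing dimensions, so
# the mask's batch axis (32) is matched against Tq (50).
scores = scores - 1.0e9 * ops.cast(padding_mask, dtype=scores.dtype)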

Suggested Fix / Workaround

Adding a dimension to the mask, if necessary, resolves the issue:

if scores_mask is not None:
    padding_mask = ops.logical_not(scores_mask)
    max_value = 65504.0 if scores.dtype == "float16" else 1.0e9
    if len(padding_mask.shape) == 2:
        padding_mask = ops.expand_dims(padding_mask, axis=-2)
    scores -= max_value * ops.cast(padding_mask, dtype=scores.dtype)

This change ensures compatibility with the scores tensor shape. It should not interfere with causal masking behavior (use_causal_mask=True), which should already produce properly shaped masks.
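
As a sanity check of the proposed change, here is the same standalone sketch as above (hypothetical shapes, not the layer code itself): expanding a rank-2 mask to [batch, 1, Tv] lets the subtraction broadcast as intended.

import os
os.environ["KERAS_BACKEND"] = "torch"

from keras import ops

scores = ops.zeros((32, 50, 50))                # [batch, Tq, Tv]
scores_mask = ops.ones((32, 50), dtype="bool")  # [batch, Tv]

padding_mask = ops.logical_not(scores_mask)
if len(padding_mask.shape) == 2:
    padding_mask = ops.expand_dims(padding_mask, axis=-2)  # -> [32, 1, 50]

# Broadcasts cleanly against [32, 50, 50]: every query position shares the
# same key/value padding mask.
scores = scores - 1.0e9 * ops.cast(padding_mask, dtype=scores.dtype)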

Environment

Keras Version: 3.10.0
Torch Version: 2.7.1+cu128
Python Version: 3.11.13
OS: Windows 11 24H2

Additional Notes

The issue does not occur when use_causal_mask=True.
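
For illustration, changing only the attention call in the reproduction script above to the line below (use_causal_mask is a call argument, as the error output shows) makes the model train without the broadcasting error:

att = Attention(name='att')([dq, dk], mask=[mask, mask], use_causal_mask=True)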

I initially suspected this was specific to the Torch backend (differing broadcasting semantics or internal shape expectations), but the issue appears to be backend-independent: I reproduced it with the TensorFlow backend in Google Colab as well.

Please let me know if you'd like me to submit a pull request with the proposed fix.