Hi all,

I'm developing a TF/Keras model on Vertex AI. I can train the model successfully (albeit slowly) on my Apple-silicon laptop, but when I package the code into a container and run it on an a2-highgpu-1g instance in GCP, I get the following error:

INFO    2025-05-14 16:59:22 -0700       workerpool0-0   DEBUGGING: Checking batch dtypes...
INFO    2025-05-14 16:59:22 -0700       workerpool0-0   Batch dtypes:
INFO    2025-05-14 16:59:22 -0700       workerpool0-0   field_value_string_id: <dtype: 'int32'>
INFO    2025-05-14 16:59:22 -0700       workerpool0-0   event_timestamp: <dtype: 'int64'>
INFO    2025-05-14 16:59:22 -0700       workerpool0-0   is_numeric: <dtype: 'int32'>
INFO    2025-05-14 16:59:22 -0700       workerpool0-0   event_type_id: <dtype: 'int32'>
INFO    2025-05-14 16:59:22 -0700       workerpool0-0   field_name_id: <dtype: 'int32'>
INFO    2025-05-14 16:59:22 -0700       workerpool0-0   event_idx: <dtype: 'int32'>
INFO    2025-05-14 16:59:22 -0700       workerpool0-0   field_value_numeric: <dtype: 'float32'>
INFO    2025-05-14 16:59:22 -0700       workerpool0-0   token_idx: <dtype: 'int32'>
INFO    2025-05-14 16:59:22 -0700       workerpool0-0   DEBUGGING: Finished checking batch dtypes.
INFO    2025-05-14 16:59:22 -0700       workerpool0-0   Building model...
INFO    2025-05-14 16:59:22 -0700       workerpool0-0   event_type_lookup: {'<PAD>': 0, '<UNKNOWN>': 1, 'charge_event': 2, 'static': 3, 'subscription_event': 4, 'box_feedback': 5}
INFO    2025-05-14 16:59:22 -0700       workerpool0-0   type(event_type_lookup): <class 'dict'>
INFO    2025-05-14 16:59:22 -0700       workerpool0-0   len(event_type_lookup): 6 <class 'int'>
INFO    2025-05-14 16:59:22 -0700       workerpool0-0   inputs['event_type_id'] dtype: <dtype: 'int32'>
ERROR   2025-05-14 16:59:22 -0700       workerpool0-0       main()
ERROR   2025-05-14 16:59:22 -0700       workerpool0-0     File "/root/scripts/train.py", line 286, in main
ERROR   2025-05-14 16:59:22 -0700       workerpool0-0       model = build_transformer_model(
ERROR   2025-05-14 16:59:22 -0700       workerpool0-0               ^^^^^^^^^^^^^^^^^^^^^^^^
ERROR   2025-05-14 16:59:22 -0700       workerpool0-0     File "/root/model/build.py", line 74, in build_transformer_model
ERROR   2025-05-14 16:59:22 -0700       workerpool0-0       event_type_embed = tf.keras.layers.Embedding(input_dim=int(len(event_type_lookup)), output_dim=embedding_dim, mask_zero=True, name='event_type_embed')(inputs['event_type_id'])
ERROR   2025-05-14 16:59:22 -0700       workerpool0-0                          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR   2025-05-14 16:59:22 -0700       workerpool0-0     File "/usr/local/lib/python3.12/dist-packages/tf_keras/src/utils/traceback_utils.py", line 70, in error_handler
ERROR   2025-05-14 16:59:22 -0700       workerpool0-0       raise e.with_traceback(filtered_tb) from None
ERROR   2025-05-14 16:59:22 -0700       workerpool0-0     File "/usr/lib/python3.12/random.py", line 336, in randint
ERROR   2025-05-14 16:59:22 -0700       workerpool0-0       return self.randrange(a, b+1)
ERROR   2025-05-14 16:59:22 -0700       workerpool0-0              ^^^^^^^^^^^^^^^^^^^^^^
ERROR   2025-05-14 16:59:22 -0700       workerpool0-0     File "/usr/lib/python3.12/random.py", line 312, in randrange
ERROR   2025-05-14 16:59:22 -0700       workerpool0-0       istop = _index(stop)
ERROR   2025-05-14 16:59:22 -0700       workerpool0-0               ^^^^^^^^^^^^
ERROR   2025-05-14 16:59:22 -0700       workerpool0-0   TypeError: 'float' object cannot be interpreted as an integer
INFO    2025-05-14 17:00:43 -0700       service Finished tearing down training program.
INFO    2025-05-14 17:00:43 -0700       service Job failed.

The offending code snippet is this:

def build_transformer_model(input_dims, event_type_lookup, field_name_lookup, string_value_lookup,
                            num_heads=12, num_blocks=12, ff_dim=64, dropout_rate=0.1, embedding_dim=64,
                            position_dim=32, classifier_head_dim=64, class_0_count=None, class_1_count=None):

    print("event_type_lookup:", event_type_lookup)
    print("type(event_type_lookup):", type(event_type_lookup))
    print("len(event_type_lookup):", len(event_type_lookup), type(len(event_type_lookup)))

    # --- Input layers ---
    inputs = {}
    for feature_name, (shape, dtype) in input_dims.items():
        inputs[feature_name] = tf.keras.layers.Input(shape=shape, name=feature_name, dtype=dtype)

    print("inputs['event_type_id'] dtype:", inputs['event_type_id'].dtype)

    # --- Embedding layers ---
    event_type_embed = tf.keras.layers.Embedding(input_dim=int(len(event_type_lookup)), output_dim=embedding_dim, mask_zero=True, name='event_type_embed')(inputs['event_type_id'])
    field_name_embed = tf.keras.layers.Embedding(input_dim=int(len(field_name_lookup)), output_dim=embedding_dim, mask_zero=True, name='field_name_embed')(inputs['field_name_id'])
    string_value_embed = tf.keras.layers.Embedding(input_dim=int(len(string_value_lookup)), output_dim=embedding_dim, mask_zero=True, name='string_value_embed')(inputs['field_value_string_id'])
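While debugging, a fail-fast check like the following could pinpoint which argument arrives as a float (require_int is a hypothetical helper, not part of my code):

```python
def require_int(name, value):
    # Fail fast with a clear message if a sizing argument is not a true int
    if not isinstance(value, int):
        raise TypeError(f"{name} must be int, got {type(value).__name__}: {value!r}")
    return value

# Example usage before building the embedding layers:
# embedding_dim = require_int("embedding_dim", embedding_dim)
# input_dim = require_int("event_type_input_dim", len(event_type_lookup))
```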

I am thoroughly puzzled as to how this is happening. My only theory is some kind of version difference between my local Python environment and the packages installed in the container.
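To test that theory, a minimal diagnostic I could drop at the top of train.py (nothing here is specific to my code) would be:

```python
import sys
from importlib import metadata

# Print interpreter and package versions so the local run and the
# container run can be diffed line by line
print("python:", sys.version.split()[0])
for pkg in ("tensorflow", "keras", "tf-keras"):
    try:
        print(pkg + ":", metadata.version(pkg))
    except metadata.PackageNotFoundError:
        print(pkg + ":", "not installed")
```

Notably, the traceback above goes through tf_keras (the legacy Keras 2 shim) rather than Keras 3, even though requirements.txt pins keras==3.9.2, so comparing which Keras implementation each environment actually resolves to seems worth checking.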

Dockerfile:

FROM nvcr.io/nvidia/tensorflow:25.02-tf2-py3 AS base

# Install gcsfuse
RUN apt-get update && apt-get install -y \
    curl \
    gnupg \
    lsb-release \
    && echo "deb https://packages.cloud.google.com/apt gcsfuse-$(lsb_release -c -s) main" | tee /etc/apt/sources.list.d/gcsfuse.list \
    && curl https://packages.cloud.google.com/apt/doc/apt-key.gpg | apt-key add - \
    && apt-get update \
    && apt-get install -y gcsfuse \
    && apt-get clean \
    && rm -rf /var/lib/apt/lists/*

# Create mount point directory
RUN mkdir -p /gcs

WORKDIR /root

COPY requirements.txt /root/

RUN pip install --upgrade pip
RUN pip install -r requirements.txt

FROM base

COPY . /root/

ENV PYTHONPATH=/root

ENTRYPOINT [ "python", "scripts/train.py" ]

requirements.txt:

google-cloud-bigquery==3.31.0
google-cloud-bigquery-storage==2.31.0
google-cloud-aiplatform[autologging]==1.92.0
tensorflow[and-cuda]==2.19.0
keras==3.9.2
tqdm==4.67.1

How could a float end up in that randrange call?

Comment From: austin-threadbeast

(I've also verified that embedding_dim, the only other parameter passed to the Embedding layer, is an int at runtime.)

Comment From: sonali-kumari1

Hi @austin-threadbeast -

Thanks for reporting this issue and providing detailed logs. The TypeError: 'float' object cannot be interpreted as an integer is raised when a float is passed to a function that expects an integer, most commonly randrange(), random.randint(), or range(). Your traceback shows the error originating from Python's random.py, inside a randrange() call.

Since you have confirmed that embedding_dim is an integer and you are explicitly casting the input_dim of all three embedding layers to int, the error doesn't appear to originate in the code you have shared so far. Could you please share the full build_transformer_model function so we can trace where the float value is being introduced?

Comment From: github-actions[bot]

This issue is stale because it has been open for 14 days with no activity. It will be closed if no further activity occurs. Thank you.