From the docstring of evaluate's return in the base trainer class:

    Returns:
        Scalar test loss (if the model has a single output and no metrics)
        or list of scalars (if the model has multiple outputs
        and/or metrics). The attribute `model.metrics_names` will give you
        the display labels for the scalar outputs.

However, the implementation of model.metrics_names is inconsistent with that docstring:

    @property
    def metrics_names(self):
        return [m.name for m in self.metrics]

The above implementation DOES NOT expand the sub-metrics of compiled metrics. This means that if you compile with multiple metrics, the list returned by model.evaluate and model.metrics_names will not have the same length, which contradicts the docstring quoted above.
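
For concreteness, here is a minimal sketch of the mismatch I mean (the toy model, data, and metric choices are placeholders I made up; the exact names printed may differ across versions and backends):

    import numpy as np
    import keras

    x = np.random.rand(32, 4).astype("float32")
    y = np.random.rand(32, 1).astype("float32")

    # Toy regression model compiled with two metrics.
    model = keras.Sequential([keras.Input(shape=(4,)), keras.layers.Dense(1)])
    model.compile(optimizer="adam", loss="mse", metrics=["mae", "mse"])
    model.fit(x, y, epochs=1, verbose=0)

    results = model.evaluate(x, y, verbose=0)
    print(len(results))              # 3 scalars: loss, mae, mse
    print(len(model.metrics_names))  # fewer names, since the compiled metrics are not expanded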

Looking at the base Trainer class, the function _flatten_metrics_in_order does more of what I would expect. And looking at evaluate in the TensorFlow Trainer, this is exactly the function that gets applied to the output of evaluate (which is why that output does not match the docs):

    def _flatten_metrics_in_order(self, logs):
        """Turns `logs` dict into a list as per key order of `metrics_names`."""
        metric_names = []
        for metric in self.metrics:
            if isinstance(metric, CompileMetrics):
                metric_names += [
                    sub_metric.name for sub_metric in metric.metrics
                ]
            else:
                metric_names.append(metric.name)
        # the rest of the implementation here is irrelevant

Hence, to interact correctly with evaluate, metrics_names should be implemented like this:

    @property
    def metrics_names(self):
        metric_names = []
        for metric in self.metrics:
            if isinstance(metric, CompileMetrics):
                metric_names += [
                    sub_metric.name for sub_metric in metric.metrics
                ]
            else:
                metric_names.append(metric.name)
        return metric_names
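
With that change, I would expect something like the following to hold (reusing the toy model and data from the sketch above; just an illustration, not a claim about current behavior):

    results = model.evaluate(x, y, verbose=0)
    assert len(results) == len(model.metrics_names)
    print(dict(zip(model.metrics_names, results)))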

However, since metrics_names is part of the exposed API, changing it could easily break other people's code if they depend on the current behavior. So maybe this should be exposed through a different route?

Regardless, I'll write my own utility function for my needs for now, but it seems like this should eventually be addressed in Keras directly. Maybe even just with a docs update?
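
For reference, this is roughly the utility I have in mind (my own sketch, not part of the Keras API). It duck-types on a nested metrics attribute, which is how the CompileMetrics container exposes its sub-metrics per the snippet above, so it avoids importing the private CompileMetrics class:

    def flat_metrics_names(model):
        """Return metric names in the same order as the scalars from model.evaluate()."""
        names = []
        for metric in model.metrics:
            sub_metrics = getattr(metric, "metrics", None)
            if sub_metrics:
                # A compiled-metrics container: expand it into its sub-metrics,
                # mirroring what _flatten_metrics_in_order does internally.
                names.extend(sub.name for sub in sub_metrics)
            else:
                names.append(metric.name)
        return names

    # Usage: pair names with evaluate's scalar outputs.
    # results = model.evaluate(x, y, verbose=0)
    # labelled = dict(zip(flat_metrics_names(model), results))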