SIP: Semantic Layer Support in Apache Superset
Abstract
This proposal introduces changes to Apache Superset to better support semantic layers and external data modeling systems. The changes include (1) the definition of an “Explorable” interface (a Python protocol), and (2) the introduction of a new class of connections for semantic layers and similar systems.
Motivation
Semantic layers are a powerful way to expose well-curated metrics and related dimensions to users, allowing for an improved user experience when presenting data: instead of focusing on datasets, semantic layers generally operate at a higher level of abstraction, exposing curated metrics as first-class citizens. Once users have selected one or more metrics they're interested in, most modern semantic layers allow them to slice and dice those metrics by presenting associated dimensions, automatically performing joins between the underlying data sources. This workflow, where metrics are curated and their dimensions are freely available, lets users focus on the metrics that matter to them, while providing confidence that the underlying data is correct and well defined.
Because Superset is fundamentally dataset-centric, integrations with semantic layers have been tentative so far. When they exist, they usually represent the semantic layer as a pseudo-database, exposing one or more pseudo-datasets that represent models in the semantic layer. Cube, for example, uses the Postgres wire protocol to expose cubes as pseudo-tables that can be queried in Superset. Minerva, an in-house semantic layer at Airbnb, uses a different approach, exposing all metrics and dimensions as a single dataset, with custom overrides to indicate to users which metrics and dimensions are compatible. Other experimental integrations (with MetricFlow and DataJunction) used a similar approach, though they haven't been contributed to OSS yet.
There are a couple of limitations in Superset's architecture that create friction when integrating with semantic layers and modern data platforms:
Datasets are the Superset Semantic Layer
In order to explore data in Superset we need some kind of semantic layer that informs us which columns are available, which ones can be filtered, which ones are groupable, and so on. In Superset, that semantic layer is the dataset: a thin abstraction that allows users to define metrics, declare derived columns, and add metadata that informs the UI and prevents expensive queries from running (group-bys on high-cardinality columns, for example).
Because the dataset is the native semantic layer, adding a model from an external semantic layer as a Superset dataset is unlikely to work as expected, since we would be stacking a second semantic layer on top of the first. For example, the semantic layer might not allow adhoc metrics or computed columns, making it incompatible with the Superset dataset editor and making the experience in Explore unintuitive and error-prone.
For these “semantic datasets” (or “semantic models”) we likely want to disable the Superset semantic layer, since the metadata is already defined externally:
- Users shouldn't be able to add metrics or calculated columns when editing them, unless the semantic layer supports adhoc expressions.
- Similar to the existing "Sync columns from source" button, there should be a "Sync metrics from source" button to fetch new or updated metrics from the semantic layer.
- When explored, users should not be allowed to add adhoc metrics, derived columns, or use "Custom SQL", unless the semantic layer supports adhoc expressions.
- When explored, not all metrics and dimensions might be compatible, requiring a mechanism for disabling metrics/columns as the user selects metrics/columns, similar to the in-house approach used by Minerva.
This suggests that these semantic models should not be represented as standard datasets in Superset, since making them work requires stripping away all the value that datasets provide: the semantics.
Query generation
The current flow for chart creation is tightly coupled not only with SQL but also with SQLAlchemy. The frontend sends a "semantic request" to the backend via the QueryObject, indicating which columns, metrics, filters, and other parameters the user has selected. The backend then generates a SQLAlchemy query from this payload. Query generation is highly dependent on the specific database engine, so the get_sqla_query method inspects several attributes of the DB engine spec. Once this query is generated, it is transpiled to the target dialect using SQLAlchemy and passed to the DB engine spec for execution.
This approach works reasonably well for traditional databases, but it creates friction when integrating with semantic layers that do not use SQL or SQLAlchemy. For example, if a semantic layer uses GraphQL or REST APIs, the current flow requires parsing the generated SQL and building a custom request. This is the case for the experimental MetricFlow DB engine spec, which has this flow:
Explore controls
↓
QueryObject
↓
SQLAlchemy query
↓
pseudo-SQL
↓
Shillelagh handler
↓
GraphQL request
For Snowflake, even though it exposes a SQL interface to semantic views, the flow looks like this:
Explore controls
↓
QueryObject
↓
SQLAlchemy query
↓
pseudo-SQL
↓
sqlglot AST
↓
actual SQL
↓
execute
The SQL parsing step is necessary for Snowflake because Explore builds the query on top of either a table or a subquery, while for Snowflake semantic views the query must instead target a UDTF (user-defined table function):
```sql
-- SQL generated by Explore
SELECT "Item.Brand", "Store.State", "StoreSales.TotalSalesQuantity"
FROM pseudo_table
GROUP BY "Item.Brand", "Store.State";

-- final SQL after parsing and manipulating the AST
SELECT * FROM SEMANTIC_VIEW(
    TPCDS_SEMANTIC_VIEW_SM
    DIMENSIONS Item.Brand, Store.State
    METRICS StoreSales.TotalSalesQuantity
);
```
This is not only inefficient and brittle but also limits the flexibility of how queries can be executed against different data sources.
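To give a sense of what this parsing step involves, here is a rough sketch of the kind of AST rewrite the Snowflake flow requires, assuming sqlglot; the to_semantic_view() helper is hypothetical and ignores filters, ordering, and limits:

```python
# Rough sketch of the AST rewrite required for Snowflake semantic views,
# assuming sqlglot; the to_semantic_view() helper is hypothetical and ignores
# filters, ordering, and limits.
import sqlglot

PSEUDO_SQL = """
SELECT "Item.Brand", "Store.State", "StoreSales.TotalSalesQuantity"
FROM pseudo_table
GROUP BY "Item.Brand", "Store.State"
"""


def to_semantic_view(sql: str, semantic_view: str) -> str:
    tree = sqlglot.parse_one(sql)

    # Columns listed in GROUP BY are dimensions; everything else in the
    # projection is treated as a metric.
    group = tree.args.get("group")
    dimensions = [col.name for col in group.expressions] if group else []
    metrics = [col.name for col in tree.expressions if col.name not in dimensions]

    return (
        "SELECT * FROM SEMANTIC_VIEW(\n"
        f"    {semantic_view}\n"
        f"    DIMENSIONS {', '.join(dimensions)}\n"
        f"    METRICS {', '.join(metrics)}\n"
        ")"
    )


print(to_semantic_view(PSEUDO_SQL, "TPCDS_SEMANTIC_VIEW_SM"))
```

Even in this simplified form, the rewrite depends entirely on the shape of the SQL that Explore happens to generate, which is exactly what makes the approach brittle.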
Proposed Change
In order to properly support semantic layers we need to move away from the current solutions based on pseudo-databases and custom DB engine specs. Instead, we should implement first-class support for semantic layers, bypassing the need for a dataset when exploring data. This will allow users to choose between the semantics provided by Superset datasets and those provided by an external system.
The first change will be the introduction of an Explorable interface, defined as a Python protocol. This interface is in some ways similar to the existing ExploreMixin, which was added when we decided to support Query objects in Explore. The problem with the current ExploreMixin is that it's too tightly coupled with SQLAlchemy and datasets, and has no clear separation between the functionality needed for data exploration and that needed for query generation. The new Explorable, on the other hand, is concerned only with chart building:
```python
from typing import Any, Protocol, runtime_checkable

from pandas import DataFrame

from superset.common.query_object import QueryObject

# ColumnInfo, MetricInfo, and CompatibilityResult are new types introduced by
# this proposal; a sketch of their possible shape follows the protocol.


@runtime_checkable
class Explorable(Protocol):
    @property
    def name(self) -> str:
        """Display name for this explorable."""
        ...

    @property
    def uid(self) -> str:
        """Unique identifier for this explorable."""
        ...

    def get_columns(self) -> set[ColumnInfo]:
        """
        Get available columns/dimensions for exploration.

        Returns:
            Set of column metadata objects that can be used as dimensions.
        """
        ...

    def get_metrics(self) -> set[MetricInfo]:
        """
        Get available metrics for exploration.

        Returns:
            Set of metric metadata objects that can be aggregated.
        """
        ...

    def get_values(
        self,
        dimension: str,
        top_n: int | None,
        filters: set[dict[str, Any]] | None,
    ) -> list[Any]:
        """
        Get distinct values for a dimension/column.

        Useful for populating filter dropdowns and understanding data
        cardinality.

        Args:
            dimension: Name of the dimension/column to get values for
            top_n: Optional limit on the number of values to return
                (e.g., top 100 by frequency)
            filters: Optional filters to apply before getting values

        Returns:
            List of distinct values for the dimension, optionally limited
            to top_n
        """
        ...

    def get_dataframe(self, query_object: QueryObject) -> DataFrame:
        """
        Execute a query and return results as a pandas DataFrame.

        This method abstracts away SQL generation, allowing semantic layers
        and other non-SQL data sources to process queries directly.

        Args:
            query_object: The query specification from the Explore interface

        Returns:
            DataFrame containing the query results

        Raises:
            QueryObjectValidationError: If the query object is invalid
            DataSourceError: If there's an error fetching data
        """
        ...

    def check_compatibility(
        self,
        selected_metrics: set[str],
        selected_dimensions: set[str],
    ) -> CompatibilityResult:
        """
        Check compatibility of metrics and dimensions for semantic layers.

        For traditional SQL databases, this can return all metrics/dimensions
        as compatible. Semantic layers can implement logic to determine which
        combinations are valid based on their metadata.

        Args:
            selected_metrics: Set of metric names currently selected
            selected_dimensions: Set of dimension names currently selected

        Returns:
            CompatibilityResult indicating which other metrics/dimensions
            are compatible with the current selection, and which are not
            (with optional reasons).
        """
        ...

    @property
    def cache_timeout(self) -> int | None:
        """
        Cache timeout in seconds for query results.

        Returns:
            Number of seconds to cache results, or None for default behavior.
        """
        ...

    @property
    def default_time_column(self) -> str | None:
        """
        Default time column for time series analysis.

        Returns:
            Name of the default time column, or None if not applicable.
        """
        ...

    def can_explore(self) -> bool:
        """
        Check if this explorable can currently be used for exploration.

        This can perform permission checks, validate configuration, or check
        connectivity for semantic layers.

        Returns:
            True if exploration is currently possible, False otherwise.
        """
        ...
```
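The protocol references a few supporting types (ColumnInfo, MetricInfo, CompatibilityResult) that would be introduced alongside it. Their exact shape is still open; a minimal sketch of what they could look like:

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class ColumnInfo:
    """Metadata for a dimension/column exposed by an explorable."""

    name: str
    type: str | None = None  # e.g. "STRING", "TIMESTAMP"
    verbose_name: str | None = None
    is_temporal: bool = False
    description: str | None = None


@dataclass(frozen=True)
class MetricInfo:
    """Metadata for a metric exposed by an explorable."""

    name: str
    verbose_name: str | None = None
    description: str | None = None
    d3_format: str | None = None


@dataclass
class CompatibilityResult:
    """Which metrics/dimensions remain valid given the current selection."""

    compatible_metrics: set[str] = field(default_factory=set)
    compatible_dimensions: set[str] = field(default_factory=set)
    # Optional human-readable reasons, keyed by metric/dimension name.
    reasons: dict[str, str] = field(default_factory=dict)
```

ColumnInfo and MetricInfo are frozen so they are hashable and can live in the set[...] return types above. Since the protocol is decorated with @runtime_checkable, any object exposing these methods can be checked with isinstance(obj, Explorable), though the check only verifies that the methods exist, not their signatures.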
Note the get_dataframe method, which returns a Pandas DataFrame from a QueryObject. This allows exploring data from sources that are not SQL based, and it decouples SQL generation from SQLAlchemy. This gives a much simpler flow:
Explore controls
↓
QueryObject
↓
Explorable
↓
DataFrame
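As an illustration of what this enables, a semantic layer implementation could satisfy get_dataframe by calling the layer's HTTP API directly, with no SQL or SQLAlchemy involved. The MetricFlowExplorable class, endpoint, and payload shape below are invented for the example:

```python
# Hypothetical semantic layer implementation of Explorable.get_dataframe();
# the API endpoint and payload shape are invented, and adhoc metrics/columns
# and filters are ignored for brevity.
import requests
from pandas import DataFrame

from superset.common.query_object import QueryObject


class MetricFlowExplorable:
    """Would implement the Explorable protocol; only get_dataframe is shown."""

    def __init__(self, api_url: str, api_key: str, model: str) -> None:
        self.api_url = api_url
        self.api_key = api_key
        self.model = model

    def get_dataframe(self, query_object: QueryObject) -> DataFrame:
        # Translate the QueryObject into an API request instead of SQL.
        payload = {
            "model": self.model,
            "metrics": [m for m in query_object.metrics or [] if isinstance(m, str)],
            "dimensions": [c for c in query_object.columns or [] if isinstance(c, str)],
            "limit": query_object.row_limit,
        }
        response = requests.post(
            f"{self.api_url}/query",  # invented endpoint
            json=payload,
            headers={"Authorization": f"Bearer {self.api_key}"},
            timeout=60,
        )
        response.raise_for_status()
        # Assume the API returns a JSON payload with a list of row objects.
        return DataFrame(response.json()["rows"])
```

A traditional dataset would implement the same method by generating and running SQL, so Explore does not need to know the difference.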
And even though we don't strictly need to, we should move the current SQL generation logic (superset.models.helpers.get_sqla_query) into the base DB engine spec; it would make sense, since the method is tightly coupled with DB engine spec attributes anyway. This would also allow DB engine specs to implement their own SQL generation logic, for example using sqlglot instead.
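What that could look like, purely as a sketch; the get_query method name, its signature, and the simplified handling below are illustrative and not part of this proposal:

```python
# Hypothetical engine spec that builds SQL with sqlglot instead of SQLAlchemy.
# Only plain string columns/metrics are handled; adhoc expressions, filters,
# ordering, and limits are omitted for brevity.
import sqlglot

from superset.common.query_object import QueryObject
from superset.db_engine_specs.base import BaseEngineSpec


class SqlglotEngineSpec(BaseEngineSpec):
    @classmethod
    def get_query(cls, query_object: QueryObject, table: str) -> str:
        dimensions = [c for c in query_object.columns or [] if isinstance(c, str)]
        metrics = [m for m in query_object.metrics or [] if isinstance(m, str)]
        select_expressions = (dimensions + metrics) or ["*"]
        query = sqlglot.select(*select_expressions).from_(table)
        if dimensions:
            query = query.group_by(*dimensions)
        return query.sql(dialect="snowflake")  # dialect chosen for the example
```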
In addition to the Explorable we would also add models for a new class of connections for semantic layers. This would be similar to existing database connections, but with a few key differences:
- There will be no concept of a SQLAlchemy URI. The connection info should have a well-defined schema, which could vary wildly between different semantic layers. For example, Malloy could point to a GitHub repository and a database; Snowflake would require parameters to build a SQLAlchemy engine, just like today; and MetricFlow would require an API key and an optional customer URL. This is similar to how some DB engine specs today use the BasicParametersMixin for a better experience when defining connection parameters.
- The association between a given instance of a semantic layer and its implementation will be explicit. Today, there's only an implicit mapping between a Database instance and the engine spec that should be used with it; we infer the DB engine spec based on the scheme of the SQLAlchemy URI, which has caused problems in the past, especially because early implementations used only the backend name and not the driver.
An important note is that this change would not be very different from the initial versions of Superset, where we had different connectors for Druid (via REST and JSON, before its SQL API was introduced) and SQLAlchemy. Having different connectors presented some challenges, mostly when defining adhoc metrics (since for Druid the user had to provide JSON). I hope we can avoid these problems by ensuring a consistent interface in Explore that works across all types of Explorable, while still allowing for flexibility in how queries are executed.
Finally, we would also need a lightweight model for explorables, containing metadata about them: UUID, name, parent connection, timeout, default metric when exploring.
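A sketch of what these two models could look like, using the same declarative style as existing Superset models; the table names, columns, and the implementation key are illustrative, not final:

```python
# Illustrative only: actual table names, columns, and relationships would be
# decided during implementation.
import uuid

from flask_appbuilder import Model
from sqlalchemy import Column, ForeignKey, Integer, String, Text
from sqlalchemy.orm import relationship


class SemanticLayerConnection(Model):
    """A connection to an external semantic layer; note there is no SQLAlchemy URI."""

    __tablename__ = "semantic_layer_connections"

    id = Column(Integer, primary_key=True)
    uuid = Column(String(36), default=lambda: str(uuid.uuid4()), unique=True)
    name = Column(String(250), nullable=False)
    # Explicit mapping to the implementation, instead of inferring it from a URI scheme.
    implementation = Column(String(100), nullable=False)  # e.g. "metricflow", "snowflake"
    # Implementation-specific parameters (API key, repo URL, account, ...),
    # validated against a schema provided by the implementation; encrypted in practice.
    configuration = Column(Text, nullable=False)


class SemanticModel(Model):
    """Lightweight metadata for an explorable exposed by a connection."""

    __tablename__ = "semantic_models"

    id = Column(Integer, primary_key=True)
    uuid = Column(String(36), default=lambda: str(uuid.uuid4()), unique=True)
    name = Column(String(250), nullable=False)
    connection_id = Column(Integer, ForeignKey("semantic_layer_connections.id"))
    connection = relationship(SemanticLayerConnection, backref="models")
    cache_timeout = Column(Integer, nullable=True)
    default_metric = Column(String(250), nullable=True)
```

The important design points are the explicit implementation column, which replaces URI-scheme inference, and the schema-validated configuration blob, which replaces the SQLAlchemy URI.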
The introduction of semantic layers should open up interesting new workflows. A Superset deployment could have a traditional connection to Snowflake, allowing power users to run SQL directly and define semantic views in SQL Lab. These semantic views could then be exposed to other users via a semantic layer connection, providing a curated collection of metrics and dimensions that non-power users would have access to. Both connections to Snowflake, via DB engine spec and via semantic layer, would have their purposes and target audiences.
Implementation Plan
Phase 1: implement the Explorable
- Define the Explorable protocol.
- Rename ExploreMixin to SupersetExplorable and implement the protocol.
- Hook up Explore to the generic Explorable.
Phase 2: introduce semantic layers
- Add a feature flag for enabling semantic layer support (see the sketch after this list).
- Add new models.
- CRUDIE (create, read, update, delete, import, export) for semantic layers.
- DAR for semantic layers and semantic models.
- Allow users to explore semantic models.
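Enabling the flag would use the standard Superset FEATURE_FLAGS mechanism; the flag name below is hypothetical:

```python
# superset_config.py
FEATURE_FLAGS = {
    "SEMANTIC_LAYERS": True,  # hypothetical flag name, to be decided
}
```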
Phase 3: UI/UX Updates
- Update Explore to support semantic models:
  - No adhoc metrics, derived columns, or custom SQL when not supported.
  - Implement reactive UI for metric/dimension compatibility matrix.
All of these phases require considerable work, and should be done with constant feedback from the community when it comes to terminology, UI, and UX.
Alternative Approaches Considered
We considered using a plugin architecture for semantic layers. While a full plugin system could be valuable, the proposed approach provides the necessary flexibility without that additional complexity.