[SIP-115] Proposal for Pre SQL block for virtual datasets

Motivation

Currently a huge chunk of SQL functionality can't be used in virtual datasets, because the dataset query is placed into the FROM clause following the pattern from ("SQL query") as virtual_table. This prevents plenty of SQL dialects from using non-SELECT statements in a dataset query, including, for example, temp tables, variables, or WITH statements. For example, PostgreSQL supports WITH statements inside a SELECT statement, but MS SQL does not. It would also be very useful to be able to create temporary tables in a dataset query, but as far as I know there is no SQL dialect that supports CREATE statements inside SELECT ones.

P.S. The whole idea is to make it so that any query that can be completed in SQL Lab is a valid query for a virtual dataset.

Proposed Change

The proposed change is to parse the SQL statement under the hood, distinguishing the actual final SELECT statement from all the preceding auxiliary statements, and to put the auxiliary part at the beginning of every query later built by viz plugins.

So a user should be able to use the following query as a virtual dataset query:

DROP TABLE IF EXISTS #tmp;
CREATE TABLE #tmp (Lvl INT, N INT);

DECLARE @t1 TABLE (itmID INT, itmIDComp INT);
INSERT @t1 VALUES (1,10), (2,10);

DECLARE @t2 TABLE (itmID INT, itmIDComp INT);
INSERT @t2 VALUES (3,10), (4,10);

WITH vw AS
(
    SELECT itmIDComp, itmID
    FROM @t1

    UNION ALL

    SELECT itmIDComp, itmID
    FROM @t2
)
, r AS
(
    SELECT t.itmID AS itmIDComp
           , NULL AS itmID
           , CAST(0 AS BIGINT) AS N
           , 1 AS Lvl
    FROM (SELECT 1 UNION ALL SELECT 2 UNION ALL SELECT 3 UNION ALL SELECT 4) AS t (itmID)

    UNION ALL

    SELECT t.itmIDComp
           , t.itmID
           , ROW_NUMBER() OVER(PARTITION BY t.itmIDComp ORDER BY t.itmIDComp, t.itmID) AS N
           , Lvl + 1
    FROM r
        JOIN vw AS t ON t.itmID = r.itmIDComp
)

INSERT INTO #tmp
SELECT Lvl, N FROM r;

----------------SPLIT POINT------------------

SELECT 
   Lvl,
   MAX(N) as max_n
FROM #tmp
GROUP BY
   Lvl;

Everything above the split point should, at a minimum, be automatically prepended to every query generated by viz plugins.
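For illustration, a minimal sketch of what that under-the-hood split could look like, using sqlparse (the split_pre_sql helper is hypothetical, not existing Superset code):

import sqlparse

def split_pre_sql(sql: str) -> tuple[str, str]:
    # Split the dataset SQL into statements; everything before the final
    # SELECT becomes the "pre SQL" block to prepend to viz queries.
    statements = [s.strip() for s in sqlparse.split(sql) if s.strip()]
    if not statements:
        raise ValueError("Empty dataset query")
    *pre, final = statements
    # get_type() resolves a WITH ... SELECT statement to "SELECT" as well.
    if sqlparse.parse(final)[0].get_type() != "SELECT":
        raise ValueError("A virtual dataset must end with a SELECT statement")
    return "\n".join(pre), final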

New or Changed Public Interfaces

None should be needed.

New dependencies

To be filled.

Migration Plan and Compatibility

To be filled.

Rejected Alternatives

The alternative was to include a new "Pre SQL" field that would contain SQL to be put before all SELECT statements for every query on that dataset.

Rejected due to overcomplicating the UI while also making it less intuitive and user-friendly.

Comment From: john-bodley

@TechAuditBI ideally from a UI/UX perspective it would be great if there was only one SQL field which the backend would parse—be that via sqlparse or sqlglot—to extract the pre-SQL.
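For example, with sqlglot the extraction could look roughly like this (a sketch, not existing Superset code; the table names are made up):

import sqlglot
from sqlglot import exp

# sqlglot parses a multi-statement script into a list of ASTs, so the
# pre-SQL is simply every statement before the trailing SELECT.
statements = sqlglot.parse(
    "CREATE TABLE tmp AS SELECT 1 AS x; SELECT x FROM tmp", read="postgres"
)
pre_sql = [s.sql(dialect="postgres") for s in statements[:-1]]
assert isinstance(statements[-1], exp.Select)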

cc @betodealmeida

Comment From: betodealmeida

Yeah, I agree that automatically splitting the CTE from the query is a better UX.

In general, I think it would be good if Superset had an understanding of the virtual dataset when exploring it. For example, if I have a virtual dataset like this:

SELECT country, COUNT(*)
FROM t
GROUP BY 1

It would be nice if, when creating a chart, Superset parsed the query and understood that there are metrics and dimensions in it already. Maybe there's a mode that would automatically populate the UI so that COUNT(*) shows as a metric, and country shows as a selected dimension. Today, Superset would try to build another aggregation on top of this aggregation, which maybe is what the user wants, but in my experience going from SQL to viz you really want to explore the unaggregated query, and be able to modify it in the UI.
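A rough sqlglot-based illustration of that idea (a sketch, not existing Superset behavior): treat a projection containing an aggregate function as a metric and anything else as a dimension.

import sqlglot
from sqlglot import exp

tree = sqlglot.parse_one("SELECT country, COUNT(*) FROM t GROUP BY 1")
# Projections containing an aggregate function become metrics,
# the remaining projections become dimensions.
metrics = [e.sql() for e in tree.expressions if e.find(exp.AggFunc)]
dimensions = [e.sql() for e in tree.expressions if not e.find(exp.AggFunc)]
print(metrics, dimensions)  # ['COUNT(*)'] ['country']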

Using a parser to move the CTE around when exploring the virtual dataset sounds like a good first step in that direction.

To me, the bigger problem of requiring a pre_sql is that it would make it cumbersome to go directly from SQL Lab to visualization, requiring the user to manually split their query.

Comment From: TechAuditBI

It would be nice if, when creating a chart, Superset parsed the query and understood that there are metrics and dimensions in it already. Maybe there's a mode that would automatically populate the UI so that COUNT(*) shows as a metric, and country shows as a selected dimension. Today, Superset would try to build another aggregation on top of this aggregation, which maybe is what the user wants, but in my experience going from SQL to viz you really want to explore the unaggregated query, and be able to modify it in the UI.

@betodealmeida Speaking of parsing a query into metrics and dimensions: I think that such a level of parsing would actually reduce the resulting functionality. If I want to calculate COUNT(*) grouped by country, I could simply use the table "t" as a dataset. But if my goal was to actually aggregate my data before creating a viz, I would not be able to do that anymore.

Comment From: TechAuditBI

@rusackas @john-bodley I've updated the description. Please check whether I've skipped any important parts.

Comment From: betodealmeida

@betodealmeida Speaking of parsing a query into metrics and dimensions: I think that such a level of parsing would actually reduce the resulting functionality. If I want to calculate COUNT(*) grouped by country, I could simply use the table "t" as a dataset. But if my goal was to actually aggregate my data before creating a viz, I would not be able to do that anymore.

Right, but sometimes you aggregate your data too much before going to viz, or you realize only then that a dimension that was not included is also important, so it would be nice to be able to click a button and change that. I'm not sure how the UI/UX would work, or even if it's something users really want, but it's something we'd be able to do.

Comment From: ETselikov

We've just realised that this parsing functionality would actually allow users to create temp tables in their virtual datasets, and those temp tables could be reused by charts running queries later. So this would be a kind of "dataset caching".

P.S. I'm the author of this PR; this is just my personal account.

Comment From: TechAuditBI

@supersetbot orglabel

Comment From: rusackas

This has now been brought up for official discussion via the dev@ list.

Comment From: mistercrunch

Any solution that requires parsing SQL is bound to fail in intricate ways at times. Given that, another idea would be to have the user be explicit about what goes above the SELECT statement. You could imagine having two TEXTAREAs in the dataset editor, say one labeled pre-query and one labeled sub-query (we'd probably need a better explanation for this, but that's one option).

Another option would be to use hints and have the parser look for them:

--__PRE_QUERY_BLOCK__
DECLARE stuff
WITH crazy_sub_query AS (
  SELECT * from bananas
)
--__END_BLOCK__
SELECT * FROM crazy_sub_query

Then even today you could have config.SQL_QUERY_MUTATOR do something like this (GPT-generated, I didn't take time to validate):

import re

def extract_and_move_block(sql):
    # Define the block markers
    start_marker = "--__PRE_QUERY_BLOCK__"
    end_marker = "--__END_BLOCK__"

    # Find the start and end indices of the block
    start_index = sql.find(start_marker)
    end_index = sql.find(end_marker)

    # If both markers are found, in the right order
    if start_index != -1 and end_index > start_index:
        # Extract the block content
        block_content = sql[start_index + len(start_marker):end_index].strip()

        # Remove the block from the original SQL
        sql_without_block = sql[:start_index] + sql[end_index + len(end_marker):]

        # Move the block to the top of the SQL
        new_sql = start_marker + "\n" + block_content + "\n" + end_marker + "\n" + sql_without_block

        return new_sql
    else:
        return sql

# Example usage
sql_input = """
--__PRE_QUERY_BLOCK__
DECLARE stuff
WITH crazy_sub_query AS (
  SELECT * from bananas
)
--__END_BLOCK__
SELECT * FROM crazy_sub_query
"""

new_sql = extract_and_move_block(sql_input)
print(new_sql)
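Hooking that into Superset could then be as small as the following sketch in superset_config.py (SQL_QUERY_MUTATOR is a real config hook, though its exact keyword arguments have varied across versions, hence the defensive **kwargs):

def SQL_QUERY_MUTATOR(sql, **kwargs):
    # Move the marked pre-query block to the top before Superset runs the query.
    return extract_and_move_block(sql)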

Comment From: rusackas

Does this need to be put up for a VOTE? Let me know if you think it's ready, and I'll be happy to do so.

Comment From: mistercrunch

From my perspective the SIP is lacking important details. Allowing people to run whatever they want in a pre-query block might not be something that administrators want to allow.

CREATE TABLE IF NOT EXISTS super_massive_unefficient_analyst_query_that_scans_a_petabyte_table AS 
SELECT * FROM massive_petabyte_table;

For me to vote positive, there should be a database configuration flag (off by default) called something like ENABLE_PRE_QUERY_BLOCK_FOR_VIRTUAL_DATASETS: false (could be in the extra metadata JSON blob).
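A sketch of how such a per-database kill switch could be checked (the flag name comes from the comment above and is hypothetical; Database.extra is the existing JSON blob on the model):

import json

def pre_query_block_enabled(database) -> bool:
    # Off by default; administrators opt in per database.
    extra = json.loads(database.extra or "{}")
    return extra.get("ENABLE_PRE_QUERY_BLOCK_FOR_VIRTUAL_DATASETS", False)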

Or maybe it fits in here: [screenshot omitted]

Oh, that and a better separator, maybe --__QUERY_START__

Comment From: mcdogg17

We have implemented this SIP in our PR: https://github.com/apache/superset/pull/30518

Comment From: rusackas

On either the SIP or the PR, could you add the details requested by others? The PR is quite helpful in moving the discussion forward, but we need a fuller PR description (examples, edge cases, etc.), and on the SIP we need more discussion addressing @mistercrunch's comments and considering alternatives (e.g., running a Saved Query as a pre-flight step for a virtual dataset).

Hopefully with the above taken care of, we can consider this ready for a Vote and get the PR merged.

Comment From: rusackas

Not sure if there's any remaining interest in carrying this forward, but if not, it'll eventually be closed due to inactivity.

Comment From: ETselikov

Not sure if there's any remaining interest in carrying this forward, but if not, it'll eventually be closed due to inactivity.

There is! We are currently working on implementing all the necessary features and so on.

Comment From: TechAuditBI

One of the core improvements this feature achieves is giving users the ability to create temp tables in virtual dataset queries. Temp tables can drastically improve query performance.

As of now the feature lacks some administrator control, if I've understood @mistercrunch correctly. @mcdogg17 will provide a more detailed description of how the feature works, and then we'll be able to move on.

Comment From: TechAuditBI

@mistercrunch Well, basically we just split a virtual dataset query into blocks by ";" and run everything except the last block before the actual viz query. After that we use the last query block just as Superset regularly would. So the expressions available in any part of the query are still regulated by the checkboxes in the DB parameters. E.g., if DML expressions are not allowed for a DB, the user won't be able to use them in a "pre-query".
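Condensed into code, the described mechanism amounts to something like this sketch (names hypothetical; note that a bare split on ";" breaks on semicolons inside string literals, which is exactly the parsing fragility @mistercrunch warned about):

def run_pre_query_blocks(sql: str, cursor) -> str:
    # Split on ";", execute every block except the last, then hand the
    # final SELECT back to Superset to wrap as the virtual table subquery.
    blocks = [b.strip() for b in sql.split(";") if b.strip()]
    for block in blocks[:-1]:
        cursor.execute(block)
    return blocks[-1]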

All in all, our main goal was mostly to allow users to create temp tables and use data-returning procedures in their queries to lower overall data-retrieval times for dashboards. Also, running a big nested query sometimes requires the DB server to allocate huge amounts of memory; temp tables are used to overcome such a limitation.

I think we've managed to achieve our goal without adding unnecessary UI elements.

Comment From: TechAuditBI

As a direction for improvement, we envision precomputing temp table(s) for a whole dashboard and then having charts query them. That is a much trickier task to solve, but it would drastically improve loading times in a bunch of scenarios.

Comment From: TechAuditBI

@mistercrunch @betodealmeida Planning to move this SIP to vote soon, waiting for your feedback.

Comment From: mistercrunch

This is an interesting problem, but to do it right, we'd need a deeper refactor to unify the way we interact with analytics databases across SQL Lab and Explore. Right now, SQL Lab has a bunch of custom and database-specific logic for executing blocks of SQL, while Explore doesn’t have this capability at all. Both parts of the app have their own approaches to things like SQL parsing, enforcing data access rules, caching, and handling different database backends, leading to code duplication and divergence.

If we’re adding block execution logic to Explore, it seems like the right move would be to centralize this logic instead of re-implementing it there. Ideally, we push most of this down into superset/db_engine_specs/, making it the core layer for SQL parsing and execution. Both SQL Lab and Explore could then interact with it through a shared, higher-level API.

I realize this is a bigger effort than the immediate SIP, but it feels like the right path forward to avoid accumulating more tech debt. This might also fit into a broader push to formalize a proper Data Access Layer in Superset—a standalone Python app that manages all this logic cleanly. That’s probably its own SIP, but this work could be a step in that direction.

Some pointers to the relevant diverging/duplicated code:

  • handling of multiple blocks in SQL Lab - needs a refactor towards db_engine_specs/base.py
  • Database.get_df - this runs pretty deep in the call stack, and also uses db_engine_spec down the call stack, but has a lot of Explore-specific logic
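To make that direction concrete, a shared entry point might look something like this (purely illustrative; no such method exists today):

import sqlparse

class BaseEngineSpec:
    @classmethod
    def run_script(cls, cursor, sql: str) -> None:
        # Hypothetical shared entry point for multi-statement execution;
        # per-database quirks would be handled by overriding this in the
        # engine-specific subclasses.
        for statement in sqlparse.split(sql):
            if statement.strip():
                cursor.execute(statement)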

Comment From: betodealmeida

And for context, we've done a lot of work unifying how we create a SQLAlchemy engine and get a connection or inspector; this was needed for SSH tunneling, which needs to set up the tunnel for the connection and tear it down after the engine has been used. But one thing we haven't unified yet is running queries; as I commented in your draft PR, not all databases support multiple sequential statements, for example, and there are other nuances because we support 50+ different databases.

Comment From: TechAuditBI

Thanks a lot for the feedback! We'll need some time to process it and return with further steps proposals.

Comment From: mistercrunch

Wanted to note that what we recommend is pretty intricate surgery, but I'm happy to provide guidance if you are willing to take it on. One of the first steps would be to refactor all of the logic handling analytics database interactions, SSH tunneling, SQL parsing and validation, and [maybe caching too?] into db_engine_specs/base.py.

Comment From: mcdogg17

Thanks for the detailed feedback! We've decided to take on the implementation of the functionality you suggested. However, to proceed effectively, we would need more detailed recommendations. It would be great to discuss the plan of action in more depth—could you provide more guidance on the key areas that should be refactored first?

Comment From: mistercrunch

Open to chat; not sure what the right time/place is. I can DM you my email, as I hate to put it here on GitHub where a bot/scraper can pick it up for spamming. Maybe coordinate on Superset Slack, or come to one of the community meetings on the community calendar - I try to attend most of them.

Comment From: mcdogg17

I was messaging on Slack, waiting for a reply.

Comment From: mcdogg17

Hey! I'm just waiting for your response. @mistercrunch

Comment From: rusackas

Not sure if you and @mistercrunch had a chance to hash this out, but it seems it's not quite ready for a VOTE yet, is that correct?

Comment From: TechAuditBI

Not sure if you and @mistercrunch had a chance to hash this out, but it seems it's not quite ready for a VOTE yet, is that correct?

Unfortunately, no, it is not ready for a VOTE yet. We are having some trouble understanding exactly what functionality we are expected to develop. Max is apparently quite busy at the moment, so our discussion is hardly progressing at all.

Comment From: u35253

Example use case

I'll give an example that would be useful here, if multi-statement "virtual (custom SQL) datasets" were to be supported in Superset.

  • Problem statement: On my database provider, the way my Superset is set up, CTEs that query the exact same immediately upstream CTE will re-run the whole logic of their source CTE, recomputing the same exact data each time after the first. Database caching exists only at the whole-query level if it was already run, and does not help here. E.g., adding 2 CTEs that both process the "main" CTE (which, at times, can be very convenient) will then TRIPLE the runtime, whether it was a "first" or "cached" run overall. What was a 5 second query becomes 15 seconds; a 10 second query becomes 30 seconds; a 30 second query becomes 1.5 minutes; a 1 minute query becomes a 3 minute query. What was once "tolerable" either for development or for actual use becomes "less so".

  • Proposed solution: Allow multi-statement virtual datasets. That way, the First Statement could run a CACHE TABLE statement one time in my database's dialect. This writes the "main CTE" result to disk, creating a disk cache. Then, when the next two CTEs consume from that object (in the Second Statement), they read the tiny, pre-computed dataset. That would allow the "full query" to run about as long as it takes to run just the "main" CTE (i.e., no multiplier is applied to the runtime).

Commentary: Overall, in this example, without multi-statement dataset support, the current approach tends toward "make the whole query run fast no matter what, so that tripling it is not perceptible", among other possibilities. I accept that. But there could be an opportunity for certain use cases where multi-statement datasets would be very, very convenient.

This example does not get into the implementation needs in the Superset code, much less the support of 50+ datasource types, or overall requirements. I'm commenting here so that this pattern can be considered when this issue gets reviewed again.