Superset enables users to run SQL queries directly against large-scale data warehouses like Google BigQuery, Snowflake, and Amazon Redshift. A significant risk in this environment is that a user, particularly one less familiar with SQL optimization, could inadvertently write a query that scans terabytes of data, incurring substantial and unexpected financial costs. Superset currently lacks a built-in "guardrail" to prevent this.
Proposed Solution: This feature would integrate Superset with the "dry run" or query cost estimation APIs provided by most major cloud data warehouses. Before executing a query from SQL Lab or a dashboard, Superset would first send the query to the warehouse's estimation endpoint. If the estimated data scanned or the computed cost exceeds a configurable threshold, Superset would present a clear warning, such as: "This query will process an estimated 5 TB of data. Are you sure you wish to continue?"
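To illustrate the underlying mechanism, here is a minimal sketch of a BigQuery dry run using the google-cloud-bigquery client. The threshold value, sample query, and warning handling are illustrative assumptions for this proposal, not existing Superset behavior:

```python
from google.cloud import bigquery

# Illustrative threshold: ~1 TB scanned (an assumption for this sketch).
WARN_THRESHOLD_BYTES = 10**12

client = bigquery.Client()  # assumes application default credentials
job_config = bigquery.QueryJobConfig(dry_run=True, use_query_cache=False)

# A dry-run job is never executed; BigQuery only plans the query and
# reports how many bytes it would scan.
job = client.query(
    "SELECT * FROM `bigquery-public-data.samples.wikipedia`",
    job_config=job_config,
)

if job.total_bytes_processed > WARN_THRESHOLD_BYTES:
    tb = job.total_bytes_processed / 10**12
    print(f"This query will process an estimated {tb:.1f} TB of data. "
          "Are you sure you wish to continue?")
```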
Impact: This contribution would add a critical layer of financial governance and safety to Superset, making it a more robust and trustworthy tool for enterprise environments. It would prevent costly mistakes, encourage more efficient query writing, and provide administrators with greater control over resource usage.
Comment From: betodealmeida
I like the idea!
We do have a query cost estimation feature, supported only in BigQuery these days; it requires the ESTIMATE_QUERY_COST feature flag to be enabled. We could extend it with another flag that sets a threshold, adding the behavior you suggested.
Comment From: mistercrunch
I ran an analysis in Claude Code, here's what it had to say after parsing through the code:
Analysis
Current Implementation
The ESTIMATE_QUERY_COST feature already exists in Superset but requires two conditions:
1. Feature flag: ESTIMATE_QUERY_COST must be enabled in superset/config.py
2. Per-database setting: cost_estimate_enabled: true must be added to the database's extra attributes (see the config sketch below)
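For reference, a minimal sketch of what enabling these two settings looks like; the per-database "Extra" field is JSON entered in the database connection settings UI, shown here as a comment:

```python
# superset/config.py (or a superset_config.py override)
FEATURE_FLAGS = {
    "ESTIMATE_QUERY_COST": True,
}

# Per-database "Extra" field (JSON), edited in the database connection settings:
# {
#     "cost_estimate_enabled": true
# }
```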
Currently Supported Databases
- BigQuery - Full support with dry run API (returns data processed in B/KB/MB/GB)
- PostgreSQL - Uses the EXPLAIN command (returns startup and total cost; parsing sketch below)
- Presto/Trino - Uses EXPLAIN (TYPE IO, FORMAT JSON) (returns detailed metrics)
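To illustrate the PostgreSQL path: EXPLAIN prints startup and total cost estimates in the plan header, which can be extracted with a regex. This is a simplified sketch of the idea, not Superset's exact parsing code:

```python
import re

# First line of a typical PostgreSQL EXPLAIN plan:
plan_row = "Seq Scan on wikipedia  (cost=0.00..458.00 rows=10000 width=244)"

match = re.search(r"cost=([\d.]+)\.\.([\d.]+)", plan_row)
if match:
    startup_cost, total_cost = (float(g) for g in match.groups())
    print(f"Startup cost: {startup_cost}, total cost: {total_cost}")
```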
How It Works
- User clicks "Estimate cost" button in SQL Lab
- Frontend calls the /api/v1/sqllab/estimate/ endpoint (example request below)
- The database engine spec's estimate_query_cost() method executes the appropriate estimation command
- Results are formatted and displayed in a modal
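Calling that endpoint directly looks roughly like the following. The host, database id, and exact payload shape are illustrative assumptions and may differ between Superset versions:

```python
import requests

SUPERSET = "https://superset.example.com"  # illustrative host
ACCESS_TOKEN = "..."  # e.g. obtained from /api/v1/security/login

payload = {
    "database_id": 1,  # illustrative database id
    "sql": "SELECT * FROM some_large_table",
}
resp = requests.post(
    f"{SUPERSET}/api/v1/sqllab/estimate/",
    json=payload,
    headers={"Authorization": f"Bearer {ACCESS_TOKEN}"},
)
print(resp.json())  # the cost metrics that the modal renders
```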
Key Limitation
The feature currently only displays costs - it doesn't implement the threshold warning system suggested in this issue. This would be a valuable enhancement.
Implementation Path for Threshold Warnings
To implement the proposed warning system:
1. Add a new config parameter like QUERY_COST_WARNING_THRESHOLD with sub-settings per metric type
2. Modify QueryEstimationCommand to check thresholds after estimation (see the sketch after this list)
3. Update the frontend to display warnings before query execution
4. Consider making thresholds configurable per database or per user role
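A rough sketch of what step 2 could look like. QUERY_COST_WARNING_THRESHOLD and the helper below are hypothetical names proposed in this issue, not existing Superset code, and raw numeric metric values are assumed for simplicity:

```python
# Hypothetical config entry (name suggested in this issue, not in Superset today):
QUERY_COST_WARNING_THRESHOLD = {
    "Total bytes processed": 10**12,  # warn above ~1 TB scanned
    "Total cost": 1_000_000.0,        # warn above this planner cost
}

def check_cost_thresholds(estimates, thresholds):
    """Return a warning message for every estimated metric above its threshold.

    `estimates` mirrors the list of per-statement metric dicts that the
    estimation command already produces.
    """
    warnings = []
    for estimate in estimates:
        for metric, value in estimate.items():
            limit = thresholds.get(metric)
            if limit is not None and value > limit:
                warnings.append(
                    f"Estimated {metric} ({value}) exceeds the configured "
                    f"threshold ({limit}). Are you sure you wish to continue?"
                )
    return warnings
```

The command would attach these warnings to the estimation response so the frontend can render a confirmation dialog (step 3).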
Extending to Other Databases
The architecture is well-designed for extension. To add support for databases like Snowflake or Redshift:
1. Override get_allow_cost_estimate() to return True
2. Implement estimate_statement_cost() to execute the database's cost estimation command
3. Parse and format the results appropriately (illustrative sketch below)
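As an illustration, a Redshift-style spec might look something like this. The class below is a sketch, not Superset's actual Redshift spec; method signatures vary across Superset versions, and the EXPLAIN parsing is a simplified assumption:

```python
import re
from typing import Any

from superset.db_engine_specs.base import BaseEngineSpec


class RedshiftCostEstimateSpec(BaseEngineSpec):
    """Illustrative sketch only, not Superset's actual Redshift spec."""

    @classmethod
    def get_allow_cost_estimate(cls, extra: dict[str, Any]) -> bool:
        # Opt this engine into the "Estimate cost" button.
        return True

    @classmethod
    def estimate_statement_cost(cls, statement: str, cursor: Any) -> dict[str, Any]:
        # Redshift's EXPLAIN prints "cost=startup..total" in its plan rows.
        cursor.execute(f"EXPLAIN {statement}")
        first_row = cursor.fetchone()[0]
        match = re.search(r"cost=([\d.]+)\.\.([\d.]+)", first_row)
        if not match:
            raise ValueError("Could not parse EXPLAIN output")
        startup, total = match.groups()
        return {"Startup cost": float(startup), "Total cost": float(total)}
```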
Many major databases offer EXPLAIN capabilities and could be supported the same way: Snowflake, Redshift, MySQL, Oracle, SQL Server, Databricks, and ClickHouse.