Superset enables users to run SQL queries directly against large-scale data warehouses like Google BigQuery, Snowflake, and Amazon Redshift. A significant risk in this environment is that a user, particularly one less familiar with SQL optimization, could inadvertently write a query that scans terabytes of data, incurring substantial and unexpected financial costs. Superset currently lacks a built-in "guardrail" to prevent this.
Proposed Solution: This feature would integrate Superset with the "dry run" or query cost estimation APIs provided by most major cloud data warehouses. Before executing a query from SQL Lab or a dashboard, Superset would first send the query to the warehouse's estimation endpoint. If the estimated data scanned or the computed cost exceeds a configurable threshold, Superset would present a clear warning, such as: "This query will process an estimated 5 TB of data. Are you sure you wish to continue?"
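To illustrate the underlying mechanism, here is a minimal sketch of a BigQuery dry run using the google-cloud-bigquery client. The threshold value, sample query, and warning handling are illustrative assumptions for this proposal, not existing Superset behavior:

```python
from google.cloud import bigquery

# Illustrative threshold: ~1 TB scanned (an assumption for this sketch).
WARN_THRESHOLD_BYTES = 10**12

client = bigquery.Client()  # assumes application default credentials
job_config = bigquery.QueryJobConfig(dry_run=True, use_query_cache=False)

# A dry-run job is never executed; BigQuery only plans the query and
# reports how many bytes it would scan.
job = client.query(
    "SELECT * FROM `bigquery-public-data.samples.wikipedia`",
    job_config=job_config,
)

if job.total_bytes_processed > WARN_THRESHOLD_BYTES:
    tb = job.total_bytes_processed / 10**12
    print(f"This query will process an estimated {tb:.1f} TB of data. "
          "Are you sure you wish to continue?")
```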
Impact: This contribution would add a critical layer of financial governance and safety to Superset, making it a more robust and trustworthy tool for enterprise environments. It would prevent costly mistakes, encourage more efficient query writing, and provide administrators with greater control over resource usage.
Comment From: betodealmeida
I like the idea!
We do have a query cost estimation feature, supported only in BigQuery these days; it requires the ESTIMATE_QUERY_COST feature flag to be enabled. We could extend it with another flag that sets a threshold, adding the behavior you suggested.
Comment From: mistercrunch
I ran an analysis in Claude Code, here's what it had to say after parsing through the code:
Analysis
Current Implementation
The ESTIMATE_QUERY_COST feature already exists in Superset but requires two conditions:
1. Feature flag: ESTIMATE_QUERY_COST must be enabled in superset/config.py
2. Per-database setting: cost_estimate_enabled: true must be added to the database's extra attributes (see the config sketch below)
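For reference, a minimal sketch of what enabling these two settings looks like; the per-database "Extra" field is JSON entered in the database connection settings UI, shown here as a comment:

```python
# superset/config.py (or a superset_config.py override)
FEATURE_FLAGS = {
    "ESTIMATE_QUERY_COST": True,
}

# Per-database "Extra" field (JSON), edited in the database connection settings:
# {
#     "cost_estimate_enabled": true
# }
```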
Currently Supported Databases
- BigQuery - Full support with dry run API (returns data processed in B/KB/MB/GB)
- PostgreSQL - Uses the EXPLAIN command (returns startup and total cost; parsing sketch below)
- Presto/Trino - Uses EXPLAIN (TYPE IO, FORMAT JSON) (returns detailed metrics)
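To illustrate the PostgreSQL path: EXPLAIN prints startup and total cost estimates in the plan header, which can be extracted with a regex. This is a simplified sketch of the idea, not Superset's exact parsing code:

```python
import re

# First line of a typical PostgreSQL EXPLAIN plan:
plan_row = "Seq Scan on wikipedia  (cost=0.00..458.00 rows=10000 width=244)"

match = re.search(r"cost=([\d.]+)\.\.([\d.]+)", plan_row)
if match:
    startup_cost, total_cost = (float(g) for g in match.groups())
    print(f"Startup cost: {startup_cost}, total cost: {total_cost}")
```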
How It Works
- User clicks "Estimate cost" button in SQL Lab
- Frontend calls the /api/v1/sqllab/estimate/ endpoint (example request below)
- The database engine spec's estimate_query_cost() method executes the appropriate estimation command
- Results are formatted and displayed in a modal
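Calling that endpoint directly looks roughly like the following. The host, database id, and exact payload shape are illustrative assumptions and may differ between Superset versions:

```python
import requests

SUPERSET = "https://superset.example.com"  # illustrative host
ACCESS_TOKEN = "..."  # e.g. obtained from /api/v1/security/login

payload = {
    "database_id": 1,  # illustrative database id
    "sql": "SELECT * FROM some_large_table",
}
resp = requests.post(
    f"{SUPERSET}/api/v1/sqllab/estimate/",
    json=payload,
    headers={"Authorization": f"Bearer {ACCESS_TOKEN}"},
)
print(resp.json())  # the cost metrics that the modal renders
```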
Key Limitation
The feature currently only displays costs - it doesn't implement the threshold warning system suggested in this issue. This would be a valuable enhancement.
Implementation Path for Threshold Warnings
To implement the proposed warning system:
1. Add a new config parameter like QUERY_COST_WARNING_THRESHOLD with sub-settings per metric type
2. Modify QueryEstimationCommand to check thresholds after estimation (see the sketch after this list)
3. Update the frontend to display warnings before query execution
4. Consider making thresholds configurable per database or per user role
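A rough sketch of what step 2 could look like. QUERY_COST_WARNING_THRESHOLD and the helper below are hypothetical names proposed in this issue, not existing Superset code, and raw numeric metric values are assumed for simplicity:

```python
# Hypothetical config entry (name suggested in this issue, not in Superset today):
QUERY_COST_WARNING_THRESHOLD = {
    "Total bytes processed": 10**12,  # warn above ~1 TB scanned
    "Total cost": 1_000_000.0,        # warn above this planner cost
}

def check_cost_thresholds(estimates, thresholds):
    """Return a warning message for every estimated metric above its threshold.

    `estimates` mirrors the list of per-statement metric dicts that the
    estimation command already produces.
    """
    warnings = []
    for estimate in estimates:
        for metric, value in estimate.items():
            limit = thresholds.get(metric)
            if limit is not None and value > limit:
                warnings.append(
                    f"Estimated {metric} ({value}) exceeds the configured "
                    f"threshold ({limit}). Are you sure you wish to continue?"
                )
    return warnings
```

The command would attach these warnings to the estimation response so the frontend can render a confirmation dialog (step 3).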
Extending to Other Databases
The architecture is well-designed for extension. To add support for databases like Snowflake or Redshift:
1. Override get_allow_cost_estimate() to return True
2. Implement estimate_statement_cost() to execute the database's cost estimation command
3. Parse and format the results appropriately (illustrative sketch below)
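As an illustration, a Redshift-style spec might look something like this. The class below is a sketch, not Superset's actual Redshift spec; method signatures vary across Superset versions, and the EXPLAIN parsing is a simplified assumption:

```python
import re
from typing import Any

from superset.db_engine_specs.base import BaseEngineSpec


class RedshiftCostEstimateSpec(BaseEngineSpec):
    """Illustrative sketch only, not Superset's actual Redshift spec."""

    @classmethod
    def get_allow_cost_estimate(cls, extra: dict[str, Any]) -> bool:
        # Opt this engine into the "Estimate cost" button.
        return True

    @classmethod
    def estimate_statement_cost(cls, statement: str, cursor: Any) -> dict[str, Any]:
        # Redshift's EXPLAIN prints "cost=startup..total" in its plan rows.
        cursor.execute(f"EXPLAIN {statement}")
        first_row = cursor.fetchone()[0]
        match = re.search(r"cost=([\d.]+)\.\.([\d.]+)", first_row)
        if not match:
            raise ValueError("Could not parse EXPLAIN output")
        startup, total = match.groups()
        return {"Startup cost": float(startup), "Total cost": float(total)}
```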
Many major databases offer EXPLAIN capabilities and could be supported the same way: Snowflake, Redshift, MySQL, Oracle, SQL Server, Databricks, and ClickHouse.