[SIP] Proposal to save SQL Lab query results and create a sharable link
Motivation
Currently, Superset can save a query and create a sharable link, but the query results are not associated with that link: whoever opens the link must rerun the query to get results. This wastes compute power and time for our customers. We propose saving the query results when the user saves the query, and associating the results with the sharable link so that anyone opening the link can see them.
Proposed Change
- Modify the `/api/v1/saved_query/` POST endpoint to accept a `results_id` in the request parameters.
- Modify the `/api/v1/saved_query/{savedQueryId}` GET endpoint to return the `results_id` in the response.
- Modify the saved-query metadata database to include a `results_id` field.
- Modify the SQL Lab UI to fetch results when a `results_id` is returned from the get-saved-query API.
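As a rough sketch of the request shape implied by the changes above (the endpoint path exists today; the `results_id` field and the helper function are hypothetical illustrations of the proposal, not existing Superset code):

```python
def build_saved_query_payload(label, sql, db_id, results_id=None):
    """Build a POST body for /api/v1/saved_query/.

    `results_id` is the proposed new field; it is only included when the
    query's results have already been persisted by the results backend.
    """
    payload = {"label": label, "sql": sql, "db_id": db_id}
    if results_id is not None:
        payload["results_id"] = results_id  # proposed addition
    return payload


# Example: saving a query whose results were stored under key "abc123"
body = build_saved_query_payload("daily revenue", "SELECT 1", 1, "abc123")
print(body["results_id"])  # → abc123
```

A client that saved a query without persisted results would simply omit the field, keeping the request backward compatible.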
New or Changed Public Interfaces
- Modify the `/api/v1/saved_query/` POST endpoint to accept a `results_id` in the request parameters.
- Modify the `/api/v1/saved_query/{savedQueryId}` GET endpoint to return the `results_id` in the response.
New dependencies
None.
Migration Plan and Compatibility
The metadata database will gain a new `results_id` field; existing records need to be migrated with an empty string for this field.
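A hedged sketch of what such a migration might look like (revision identifiers are omitted; using `server_default=""` is one way to backfill existing rows with an empty string, assuming the field is stored as a string key):

```python
from alembic import op
import sqlalchemy as sa


def upgrade():
    # Add results_id with an empty-string server default so that
    # existing saved_query rows are backfilled with "" on upgrade.
    op.add_column(
        "saved_query",
        sa.Column("results_id", sa.String(256), nullable=True, server_default=""),
    )
    # Drop the default so new rows set results_id explicitly (or leave it NULL).
    op.alter_column("saved_query", "results_id", server_default=None)


def downgrade():
    op.drop_column("saved_query", "results_id")
```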
Rejected Alternatives
Another approach, not rejected but listed here: create a new API to save query results and a new API to create a sharable link for those results, so that the existing APIs remain unchanged.
Comment From: dosubot[bot]
Hey @xiaoshan1213! 👋 I'm here to help you with bug fixes, questions, and becoming a contributor. Let's squash some bugs together!
To implement the proposed changes in the SIP, you can follow these steps:
1. Modify the `/api/v1/saved_query/` POST Function

Update the request parameters to include the `results_id` field. Here is an example of how to modify the `post_data` dictionary:

```python
post_data = {
    "schema": "schema1",
    "label": "label1",
    "description": "some description",
    "sql": "SELECT col1, col2 from table1",
    "db_id": example_db.id,
    "results_id": some_results_id,  # Add this line
}
```
2. Modify the `/api/v1/saved_query/{savedQueryId}` GET Function

Update the response parameters to include the `results_id` field. Here is an example of how to modify the expected result:

```python
expected_result = {
    "id": saved_query.id,
    "catalog": None,
    "database": {"id": saved_query.database.id, "database_name": "examples"},
    "description": "cool description",
    "changed_by": None,
    "changed_on_delta_humanized": "now",
    "created_by": {
        "first_name": saved_query.created_by.first_name,
        "id": saved_query.created_by.id,
        "last_name": saved_query.created_by.last_name,
    },
    "sql": "SELECT col1, col2 from table1",
    "sql_tables": [{"catalog": None, "schema": None, "table": "table1"}],
    "schema": "schema1",
    "label": "label1",
    "template_parameters": None,
    "results_id": saved_query.results_id,  # Add this line
}
data = json.loads(rv.data.decode("utf-8"))
for key, value in data["result"].items():
    if key != "changed_on":
        assert value == expected_result[key]
```
3. Modify the Saved Query Metadata Database

Add a new `results_id` field to the `saved_query` table. Here is an Alembic migration script to achieve this:

```python
from alembic import op
import sqlalchemy as sa


# Add a new column 'results_id' to the 'saved_query' table
def upgrade():
    op.add_column('saved_query', sa.Column('results_id', sa.Integer, nullable=True))


# Remove the 'results_id' column from the 'saved_query' table
def downgrade():
    op.drop_column('saved_query', 'results_id')
```
References

- Modifying the `/api/v1/saved_query/` POST function to include `results_id` [1].
- Modifying the `/api/v1/saved_query/{savedQueryId}` GET function to include `results_id` [1].
- Adding a new `results_id` field to the `saved_query` table [2][3].
These changes will ensure that the query results are saved and associated with the sharable link, allowing users to view the results without rerunning the query.
Comment From: rusackas
@xiaoshan1213 would you want to move this forward through the SIP / ASF consensus process? Let us know on slack if you need help with that. Thanks!
Comment From: rusackas
There are a few open questions here, such as:
- Where are the results being stored/cached to avoid re-running the query?
- How do we deal with things like the data changing since the query was run (i.e. cache invalidation)?
- How do we address security implications, e.g. row level security, PII leakage, etc.?
Please open a Discuss thread on the Dev list if you want to move forward with this proposal. Thanks!
Comment From: xiaoshan1213
Hi @rusackas, thanks for the response. I just joined the Slack; which channel should I use to open the discuss thread?
Comment From: rusackas
@xiaoshan1213 Unfortunately, Slack doesn't count for official ASF consensus. There needs to be a DISCUSS thread on the Superset Dev Mailing List.
If you haven't already, you can subscribe to the Apache Superset Dev Mailing Listserv (dev@superset.apache.org, with public visibility here) by sending an email to dev-subscribe@superset.apache.org.
Here's an example of one such thread on the public archives: https://lists.apache.org/thread/ytv9vx7wrk07xxjg4m4kx5rx0zl5whx1
Comment From: michael-s-molina
Thanks @xiaoshan1213 for the SIP. I share the same concerns as @rusackas related to stale data and security implications.
> It is a waste of compute power and time for our customers.
I believe this should be resolved by the caching layer instead of a SQL Lab specific feature given that you have the same type of requirements when sharing links to charts, dashboards and SQL Lab queries.
Comment From: xiaoshan1213
Hi, thanks for the response. I opened an email thread to discuss, but haven't gotten any response there yet. To answer the questions above:
- Where are the results being stored/cached to avoid re-running the query? Superset has a results backend feature; in our case we configure it to store results in S3.
- How do we deal with things like the data changing since the query was run (i.e. cache invalidation)? We can show a notification in the UI stating the date and time the result was retrieved, and customers can choose to rerun the query.
- How do we address security implications, e.g. row level security, PII leakage, etc.? For now, we are not considering this, since customers within the query_lab role should already be able to access the query results or shared results.
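For reference, the S3 results backend mentioned above is configured in `superset_config.py` along these lines (a sketch only: the bucket name and key prefix below are made-up, and the `s3cache` package is one backend option among several cache implementations Superset accepts):

```python
# superset_config.py fragment (sketch, not a definitive setup)
from s3cache.s3cache import S3Cache

S3_CACHE_BUCKET = "my-superset-results"   # hypothetical bucket name
S3_CACHE_KEY_PREFIX = "sql_lab_results/"  # hypothetical key prefix
RESULTS_BACKEND = S3Cache(S3_CACHE_BUCKET, S3_CACHE_KEY_PREFIX)
```

The `results_id` proposed in this SIP would then be the key under which a result set is stored in this backend.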
Comment From: xiaoshan1213
Hi, any update on this thread? What happens after it moves to the discuss thread? Should I start an email thread?
Comment From: rusackas
I think the discussion has gone quiet since not many people are seeking this functionality, and because the security implications have not been solved.

For now, we are not planning to move this forward; the position that "once customers are within the query_lab role, they should be able to access the query results or shared results" leaves those security questions open. Maybe this can be revisited after the new security model is in place, but it's hard to foresee.
Of course, it's always possible to do this on a fork, but we're not ready to accept the responsibility in Superset at the moment.
Comment From: rusackas
This might also intersect with the GLOBAL_ASYNC_QUERIES SIP, which will store query results in Redis... maybe we can figure out a way once that's in place.
Comment From: xiaoshan1213
Thanks for the response. My understanding is that there should be no security issue, because this change only allows users already in the query_lab role to access the query results and share them with other users in that role group. Users not in that group will get an access-denied error by default.
Comment From: rusackas
Hi! Are you still interested in contributing this? It's been quiet here for a while... we can put it up for a Vote if you want to see it through.
Comment From: rusackas
I just prodded the Discussion thread... let's see what happens :D Thanks for your patience here.
Comment From: rusackas
> How do we address security implications, e.g. row level security, PII leakage, etc.?

> For now, we are not considering this, since customers within the query_lab role should be able to access the query results or shared results.

I'll be voting -1 unless we can consider how to handle the security details here.
Comment From: rusackas
If you want to share results without worrying about security, you can export CSV for someone else to import.