[SIP-150] Proposal for using Athena Presto functionality for large downloads of CSVs
Motivation
Superset's built-in download functionality is slow for larger files: attempting to download a 100,000-row CSV from an AWS Athena (Presto) database took just shy of 9 minutes from start to completion.
However, this can be vastly sped up with little extra overhead for Superset (for Presto/Athena users). Athena can be configured to automatically persist query results to S3; it takes only seconds to run the query and save the CSV to an S3 bucket.
By starting the query through the Athena API and then polling it for the CSV's output location, that URL can be returned to the user immediately, without shipping the raw results through Superset. In the real-world case we hit, this reduced the 8 minute 45 second download to 11 seconds.
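As a rough sketch of that flow with boto3 (this is not the PR's actual code; the region, workgroup, database, and query are placeholders):

```python
import time

import boto3

# Hedged sketch of the start-then-poll flow described above;
# all names are placeholders.
athena = boto3.client("athena", region_name="eu-west-1")

# Start the query; Athena persists the result CSV to the workgroup's
# configured S3 output location on its own.
execution = athena.start_query_execution(
    QueryString="SELECT * FROM my_table",
    QueryExecutionContext={"Database": "my_superset_db"},
    WorkGroup="superset-etl",
)
query_id = execution["QueryExecutionId"]

# Poll until the query reaches a terminal state.
while True:
    result = athena.get_query_execution(QueryExecutionId=query_id)
    state = result["QueryExecution"]["Status"]["State"]
    if state in ("SUCCEEDED", "FAILED", "CANCELLED"):
        break
    time.sleep(1)

# e.g. "s3://my-bucket/prefix/<query_id>.csv" -- this is what gets
# handed back to the user instead of the raw result rows.
output_location = result["QueryExecution"]["ResultConfiguration"]["OutputLocation"]
```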
Proposed Change
The proposed change adds functionality to return only the output_location of the CSV file that Athena has already written, when the user requests a download of a chart backed by an Athena (Presto) database.
The change will be protected by feature flags so it is only enabled for Athena users, who will also need to set environment variables for the AWS region, Athena workgroup, and Athena database name. Example from an .env file:
```
SUPERSET_REGION=eu-west-1
SUPERSET_WORKGROUP=superset-etl
SUPERSET_ATHENA_DB=my_superset_db
```
Two feature flags are involved: one to enable the S3 download functionality and one to hide the existing default CSV/XLSX options (the S3 download is faster than the default download). From featureFlags.ts:
DownloadCSVFromS3 = 'DOWNLOAD_CSV_FROM_S3', ~~ShowDefaultCSVOptions = 'SHOW_DEFAULT_CSV_OPTIONS',~~
~~Option will appear in right click context menu~~
The UI is now unchanged for the user; the new functionality triggers on the normal CSV/Full CSV export when the feature flag is set and database.backend is awsathena.
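A minimal sketch of that gating (the flag name comes from this proposal; the helper and the FEATURE_FLAGS lookup are illustrative, not the PR's code):

```python
import os

from flask import current_app


def use_s3_download(database) -> bool:
    """Illustrative check: only divert to the S3 flow when the feature
    flag is on, the chart's database is Athena, and the AWS settings
    from the .env example above are present."""
    flags = current_app.config.get("FEATURE_FLAGS", {})
    return (
        flags.get("DOWNLOAD_CSV_FROM_S3", False)
        and database.backend == "awsathena"
        and os.environ.get("SUPERSET_REGION") is not None
    )
```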
New or Changed Public Interfaces
We reuse the existing data endpoint used by the default and full CSV and XLSX downloads.
A new parameter, 'result_location', is passed in to specify whether the exported file is to be built within Superset (the current export) or fetched from S3 (the new flow for Athena).
Changes to the model add output_location, the returned presigned URL used to locate the file inside the S3 bucket (a sketch of the presigning follows below).
All other changes are in the PR.
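For the presigned URL in output_location, a minimal sketch with boto3 (the helper name and the bucket/key parsing are illustrative, not the PR's code):

```python
from urllib.parse import urlparse

import boto3


def presign_output_location(output_location: str, region: str, expires: int = 300) -> str:
    """Illustrative: turn an Athena OutputLocation such as
    's3://my-bucket/prefix/<query_id>.csv' into a short-lived presigned
    URL that the browser can download directly."""
    parsed = urlparse(output_location)  # netloc is the bucket, path is the key
    s3 = boto3.client("s3", region_name=region)
    return s3.generate_presigned_url(
        "get_object",
        Params={"Bucket": parsed.netloc, "Key": parsed.path.lstrip("/")},
        ExpiresIn=expires,
    )
```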
New dependencies
No changes here
Migration Plan and Compatibility
No DB changes
Rejected Alternatives
Using a Lambda to fetch the file from S3 and return it through a proprietary application. Rejected because users are already using Superset, and it makes sense to let them download large files through the Superset UI.
Comment From: rusackas
This is similar to a reverted PR here: https://github.com/apache/superset/pull/29164
In a perfect world, installing the Athena driver/dbapi would magically enable this Save To S3 option when configured. That probably requires a plugin architecture in which the Athena plugin would offer a Save to S3 export plugin as a dependency... but this is a long way off.
Comment From: william-fundrecs
Agree, but setting an output location for query runs is not default functionality; it has to be configured at the DB level/workgroup setting in AWS. I think this would be outside the remit of Superset.
The plugin would need to be able to query the Athena API to check that this is set, or we would need a checkbox/field value in the UI at the DB connection level when adding the connection.
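For illustration, such a check against the Athena API could look roughly like this (hypothetical helper; error handling omitted):

```python
import boto3


def workgroup_output_location(workgroup: str, region: str):
    """Hypothetical check that a workgroup has an S3 output location
    configured, per the AWS documentation quoted below."""
    athena = boto3.client("athena", region_name=region)
    config = athena.get_work_group(WorkGroup=workgroup)["WorkGroup"]["Configuration"]
    return config.get("ResultConfiguration", {}).get("OutputLocation")
```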
> OutputLocation: The location in Amazon S3 where your query and calculation results are stored, such as s3://path/to/query/bucket/. To run the query, you must specify the query results location in one of two ways: for individual queries, using this setting (client-side), or in the workgroup, using WorkGroupConfiguration. If neither is set, Athena issues an error that no output location is provided. If workgroup settings override client-side settings, then the query uses the settings specified for the workgroup. See WorkGroupConfiguration:EnforceWorkGroupConfiguration.
As it is, once an Athena output location has been set, queries against Athena tables will automatically persist the results in the desired format (CSV, Avro, Parquet, ORC, JSON, delimited text): https://docs.aws.amazon.com/athena/latest/ug/creating-databases-prerequisites.html
https://aws.amazon.com/about-aws/whats-new/2021/08/athena-can-write-query-results-parquet-avro-orc-json-formats/
It makes sense to use what is already available and download the file that has already been created in S3, rather than fetch the raw results, process them into a dataframe, write them to a file format, and return the file through the API.
Whether this PR should be expanded to also download the other available file types, when a different format has been set, is a good question.
Comment From: rusackas
Please feel free to open a DISCUSS thread on the Apache dev mailing list to continue moving this through the SIP process, widen the discussion, and get more input/opinions. If it isn't ushered through, it will eventually be closed as inactive.
Comment From: william-fundrecs
Hi @rusackas, I've sent an email to the dev@superset.apache.org address with the correct subject formatting for [SIP-150], but I don't see it reflected at https://lists.apache.org/list.html?dev@superset.apache.org
Do I need to log in or sign up somehow?
Comment From: betodealmeida
A nit, but I wouldn't change the copy to say "Download CSV from S3", since that's an implementation detail. All users care about is that the file is downloaded.
What about making the CSV endpoint always return a 302 HTTP status code, redirecting either to a Superset URL that would trigger the download or to a DB-specific URL (S3 for Athena, e.g.)? Then the config could look like this:
```python
DB_ENGINE_SPECS_CSV_DOWNLOAD = {
    "athena": dict(
        SUPERSET_REGION="eu-west-1",
        SUPERSET_WORKGROUP="superset-etl",
        SUPERSET_ATHENA_DB="my_superset_db",
    ),
}
```
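A hedged sketch of that always-redirect idea in Flask (the route and the resolver are hypothetical, not an existing Superset endpoint):

```python
from flask import Flask, redirect

app = Flask(__name__)


def resolve_download_url(chart_id: int) -> str:
    """Hypothetical resolver: return a presigned S3 URL for
    Athena-backed charts, or a Superset-served URL otherwise."""
    raise NotImplementedError


@app.route("/api/v1/chart/<int:chart_id>/csv")
def download_csv(chart_id: int):
    # The endpoint always answers with a 302, so the client behaves the
    # same whether the file lives in S3 or is built by Superset.
    return redirect(resolve_download_url(chart_id), code=302)
```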
Comment From: betodealmeida
I also left some comments on https://github.com/apache/superset/pull/31485/. We want to abstract this as much as possible, moving DB-specific logic into the DB engine spec; that way Superset doesn't need custom logic for different databases and just interacts with them via the corresponding engine spec.
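As a sketch of what that abstraction might look like (entirely hypothetical; Superset's engine specs do not currently expose such a method):

```python
class BaseEngineSpec:
    @classmethod
    def get_csv_download_url(cls, cursor):
        # Default: no external result location; Superset builds the
        # export file itself, as it does today.
        return None


class AthenaEngineSpec(BaseEngineSpec):
    @classmethod
    def get_csv_download_url(cls, cursor):
        # Athena has already persisted the results to S3; hand back its
        # output location (presigning omitted for brevity). The attribute
        # name mirrors PyAthena's cursor but is illustrative here.
        return getattr(cursor, "output_location", None)
```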
Comment From: rusackas
I do indeed see the thread on the list. You're all good, we're officially in the [DISCUSS] phase now :)
Comment From: william-fundrecs
> A nit, but I wouldn't change the copy to say "Download CSV from S3", since that's an implementation detail. All users care about is that the file is downloaded.
>
> What about making the CSV endpoint always return a 302 HTTP status code, redirecting either to a Superset URL that would trigger the download or to a DB-specific URL (S3 for Athena, e.g.)? Then the config could look like this:
>
> ```python
> DB_ENGINE_SPECS_CSV_DOWNLOAD = {"athena": dict(SUPERSET_REGION="eu-west-1", SUPERSET_WORKGROUP="superset-etl", SUPERSET_ATHENA_DB="my_superset_db")}
> ```
For this item: it seems the default CSV/XLSX export functionality does not use the csv endpoint. It uses the same data endpoint used to load the dashboard visualization, and only the behaviour changes based on the resultFormat. This is why I added the extra resultLocation parameter, to differentiate between a visualization load and an export in the exportChart frontend method that calls '/api/v1/chart/data'.
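For illustration, the export request against that existing data endpoint might then carry the extra field like this (the payload is heavily simplified, and the result_location values are assumptions based on this proposal, not the actual API):

```python
import requests

# Simplified sketch of the export call to the existing chart-data
# endpoint; the real payload also carries queries, datasource, etc.
payload = {
    "result_format": "csv",
    "result_type": "full",
    "result_location": "s3",  # proposed parameter: S3 flow vs. default in-Superset build
}
response = requests.post(
    "https://superset.example.com/api/v1/chart/data",
    json=payload,
    timeout=30,
)
```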
For the other comments, I will look at moving this into the engine spec this week; I have some other work priorities before I can revisit it.
Thanks for the feedback!
Comment From: rusackas
Looks like the PR for this is stuck awaiting a rebase. We'd love to get that into a mergeable state, and then either shove this SIP through the VOTE phase to merge it, or see if folks think it doesn't need a SIP at all.
Comment From: william-fundrecs
@rusackas Apologies, I'm swamped at work and will not get a chance to do this in the next few weeks. When I last left this, I had failing integration tests, and my reintroduction of the boto3 dependency in aws.py was throwing some errors.
I see hive.py uses boto3 as well and somehow patches the dependency in hive_tests.py, but I'm unfamiliar with the convention.
Also, I'm running on a Windows machine and have problems running the quality suites locally; I already have to fudge things by installing prebuilt wheels (python-ldap etc.) when running locally.
I would love to get this enhancement merged but do not have the bandwidth at present.
Comment From: rusackas
@william-fundrecs how's your bandwidth these days? Do you still have interest in pursuing this? I can put it up for a VOTE if you're interested in getting back to the engineering side of it. Thanks!
Comment From: gaurav7261
Can we also have it in SQL Lab?