Bug description
It would be ideal that the BigQuery support allowed for a separation between GCP projects. Often the datasets live in one project which is notionally owned by a different team. Allowing for this provides:
- better isolation of concerns and therefore a better understanding of security roles and permissions
- a clear separation of usage where the visualisation cost of superset (namely BQ slots) can be clearly recognised and budgeted for
- simplified incident resolution in case of 'slowness' where BQ backend usage can be correlated to activity on a given superset instance
IMPLEMENTATION
So this would effectively need to formalise the separation between
- project_id - the project in which the data is held
- job_project_id - the project in which queries are run, where the slot allocation is used
From an implementation perspective this would require:
- all BQ queries to be fully qualified like
select * from {project}.{dataset}.{table}
- all non-query usage to be able to differentiate when you are making queries (default project job_project_id) and when you are querying metadata (default project project_id)
From experimentation it seems that [1] is relatively easy to solve. If you create a dataset with a schema of project.dataset
instead of just dataset
then any chart that uses it will work correctly. However, this does not work when building a dataset through the UI, but a little use of SQLA_TABLE_MUTATOR
seems to work (a shortened version minus error handling of what I've used is below):
def SQLA_TABLE_MUTATOR(table):
if table.database.driver == 'bigquery':
table.schema = f'{table.catalog}.{table.schema}'
return table
However point [2] is a harder nut to crack. I've been experimenting with DB_CONNECTION_MUTATOR
to achieve this but not sure this will be reliable enough. The other alternative would be a small monkey patch within superset/db_engine_specs/bigquery.py
It would be great if this logic was available out of the box.
Superset version
master / latest-dev
Python version
3.9
Node version
16
Checklist
- [x] I have searched Superset docs and Slack and didn't find a solution to my problem.
- [x] I have searched the GitHub issue tracker and didn't find a similar bug report.
- [x] I have checked Superset's logs for errors and if I found a relevant Python stacktrace, I included it here as text in the "additional context" section.
Comment From: withnale
The native code in superset hasn't made provision for separation of project_id usage and primarily used the standard create_engine
SQLA calls. The DB_CONNECTION_MUTATOR
is available however which seems to run whenever a DB connection is created.
Rather than just passing in parameters, it seemed to be possible to use the supplying-your-own-bigquery-client logic present in python-bigquery-sqlalchemy and make a BQ client default project decision based on context...
# shortened
def DB_CONNECTION_MUTATOR(uri, params, _username, _security_manager, source):
credentials_info = params.get('credentials_info', None)
credentials = service_account.Credentials.from_service_account_info(credentials_info)
project = 'some magic occurs here'
client = bigquery.Client(credentials=credentials, project=project)
params['connect_args'] = {'client': client}
return uri.update_query_dict({"user_supplied_client":"True"}), params
Obviously, the key part here is the 'magic' since you need to be able to make a decision about the correct project based on any context that the DB_CONNECTION_MUTATOR has available to it. The only real context is the source
field, and I've done some experimenting in setting the correct project based on that...
if source is None:
project = DATASET_PROJECT # from bigquery:// URL
elif source.name in ['CHART', 'SQL_LAB']:
project = JOB_PROJECT # from service account or other means
else:
logger.error(f"DB_CONNECTION_MUTATOR: Unknown source: {source}")
This seems too brittle and for many use cases source=None
when a decision needs to be made. Also, fundamentally I don't think this approach is robust regarding connection reuse and pooling.
I think it's probably better to try to fix the "destination decision making" part of this issue upstream in the python-sqlalchemy-bigquery repository, since at present the logic there doesn't specify a project explicitly on their bigquery calls (such as below). Modifying the various bigquery client calls to be explicit seems by far the cleanest solution making the logic available to any product built on sqlalchemy not just superset.
# python-sqlalchemy-bigquery/sqlalchemy_bigquery/core.py:1335
- datasets = connection.connection._client.list_datasets()
+ datasets = connection.connection._client.list_datasets(self.project_id)
Comment From: withnale
It looks way easier to fix in python-bigquery-sqlalchemy:1180
Comment From: withnale
I've got a PR merged https://github.com/googleapis/python-bigquery-sqlalchemy/pull/1180 now so this fixes the need without any changes to superset.
Comment From: rusackas
@withnale awesome! Do we need to bump any packages on the Superset end to get your change in?
Comment From: withnale
Not specifically. If people want to use bigquery they normally need to add it via bootstrap or as part of a custom image.
The only reference is within requirements/development.in|txt which is used by the tests, but never built into the docker image. Without a specific test, it's not going to make a difference, but probably worth bumping to 1.14.0 just in terms of regression.
Comment From: betodealmeida
(1) is fixed in https://github.com/apache/superset/pull/34360.
Let me think about (2).
Comment From: betodealmeida
Ok, if https://github.com/googleapis/python-bigquery-sqlalchemy/pull/1180 fixes this, let me bump the dep to 1.14.0.