Bug description
get_or_create
Dataset API endpoint breaks with 500 - Internal Server
error if there is any existing dataset with the same table_name
, regardless of the dataset schema
.
This is not the behaviour we observe on other endpoints, such as a POST /dataset
from where users are able to create datasets with the same table_name, as long as they belong to either a different database_id
or a different schema
.
Furthermore, if there is already an existing dataset for a given table_name
and a user is trying to create a new dataset with the same table_name
but different schema, the API will return 200, which is a false positive.
How to reproduce the bug
Case 1 - False Positive
- Go to the datasets page
- Pick any existing dataset name and prepare a payload as follows (example using the
users
datasets)
{
'table_name': 'users',
'schema': 'other',
'database_id': 1,
}
- Submit this payload via a POST request to /api/v1/dataset/get_or_create
- Note how the API will return with a
200
pointing to the existingpublic.users
dataset ID. The new dataset is not created.
Case 2 - Internal Server Error
- Create 2 or more datasets with the same
table_name
and differentschemas
(either via UI or create dataset API) - Try to create a new dataset. Again, with same
table_name
but a different schema. Payload example:
{
'table_name': 'users',
'schema': 'any_new_schema_name',
'database_id': 1,
}
- Submit this payload via a POST request to /api/v1/dataset/get_or_create
- Note how the API will return with a
500 - Internal Server Error
. The new dataset is not created.
Screenshots/recordings
No response
Superset version
master / latest-dev
Python version
3.9
Node version
16
Browser
Chrome
Additional context
No response
Checklist
- [X] I have searched Superset docs and Slack and didn't find a solution to my problem.
- [X] I have searched the GitHub issue tracker and didn't find a similar bug report.
- [X] I have checked Superset's logs for errors and if I found a relevant Python stacktrace, I included it here as text in the "additional context" section.
Comment From: dosubot[bot]
Hey @luizcapu! 👋 I'm here to help with bugs, answer questions, and assist you while you wait for a human maintainer. Let's squash those bugs together!
The get_or_create
Dataset API endpoint returns a 500 - Internal Server Error
when trying to create a dataset with the same table_name
but a different schema
because the current implementation only checks for the existence of a table by its table_name
and database_id
. It does not account for differences in the schema
. If a table with the same table_name
already exists, it returns the existing table's ID without considering the schema, leading to potential conflicts or errors [1].
The relevant code snippet is:
@expose("/get_or_create/", methods=("POST",))
@protect()
@safe
@statsd_metrics
@event_logger.log_this_with_context(
action=lambda self, *args, **kwargs: f"{self.__class__.__name__}"
f".get_or_create_dataset",
log_to_statsd=False,
)
def get_or_create_dataset(self) -> Response:
"""Retrieve a dataset by name, or create it if it does not exist.
---
post:
summary: Retrieve a table by name, or create it if it does not exist
requestBody:
required: true
content:
application/json:
schema:
$ref: '#/components/schemas/GetOrCreateDatasetSchema'
responses:
200:
description: The ID of the table
content:
application/json:
schema:
type: object
properties:
result:
type: object
properties:
table_id:
type: integer
400:
$ref: '#/components/responses/400'
401:
$ref: '#/components/responses/401'
422:
$ref: '#/components/responses/422'
500:
$ref: '#/components/responses/500'
"""
try:
body = GetOrCreateDatasetSchema().load(request.json)
except ValidationError as ex:
return self.response(400, message=ex.messages)
table_name = body["table_name"]
database_id = body["database_id"]
if table := DatasetDAO.get_table_by_name(database_id, table_name):
return self.response(200, result={"table_id": table.id})
The code only checks for the existence of a table by table_name
and database_id
:
if table := DatasetDAO.get_table_by_name(database_id, table_name):
return self.response(200, result={"table_id": table.id})
This logic does not consider the schema
, which can lead to a 500 - Internal Server Error
if there is a conflict or inconsistency when trying to create a dataset with the same table_name
but a different schema
[1].
Additionally, the test case test_create_dataset_same_name_different_schema
in the integration tests demonstrates that creating datasets with the same table_name
but different schema
is a valid scenario, which the current implementation does not handle correctly [2].
To continue talking to Dosu, mention @dosu.
Comment From: rusackas
Sorry this (and the PR) have been stuck for >200 days now! Must just be an edge case that isn't bugging most folks. Hopefully we can get the PR brought up to a mergable state and get it in, if you're still facing this issue in current versions (4.1.2 or newer)
Comment From: rusackas
Help welcomed in contributing to or adopting the linked PR. This might be closed as "stale" and/or "not planned" if nobody is interested in carrying it through.