Bug description

get_or_create Dataset API endpoint breaks with 500 - Internal Server error if there is any existing dataset with the same table_name, regardless of the dataset schema.

This is not the behaviour we observe on other endpoints, such as a POST /dataset from where users are able to create datasets with the same table_name, as long as they belong to either a different database_id or a different schema.

Furthermore, if there is already an existing dataset for a given table_name and a user is trying to create a new dataset with the same table_name but different schema, the API will return 200, which is a false positive.

How to reproduce the bug

Case 1 - False Positive

  1. Go to the datasets page
  2. Pick any existing dataset name and prepare a payload as follows (example using the users datasets)
{
  'table_name': 'users',
  'schema': 'other',
  'database_id': 1,
}
  1. Submit this payload via a POST request to /api/v1/dataset/get_or_create
  2. Note how the API will return with a 200 pointing to the existing public.users dataset ID. The new dataset is not created.

Case 2 - Internal Server Error

  1. Create 2 or more datasets with the same table_name and different schemas (either via UI or create dataset API)
  2. Try to create a new dataset. Again, with same table_name but a different schema. Payload example:
{
  'table_name': 'users',
  'schema': 'any_new_schema_name',
  'database_id': 1,
}
  1. Submit this payload via a POST request to /api/v1/dataset/get_or_create
  2. Note how the API will return with a 500 - Internal Server Error. The new dataset is not created.

Screenshots/recordings

No response

Superset version

master / latest-dev

Python version

3.9

Node version

16

Browser

Chrome

Additional context

No response

Checklist

  • [X] I have searched Superset docs and Slack and didn't find a solution to my problem.
  • [X] I have searched the GitHub issue tracker and didn't find a similar bug report.
  • [X] I have checked Superset's logs for errors and if I found a relevant Python stacktrace, I included it here as text in the "additional context" section.

Comment From: dosubot[bot]

Hey @luizcapu! 👋 I'm here to help with bugs, answer questions, and assist you while you wait for a human maintainer. Let's squash those bugs together!

The get_or_create Dataset API endpoint returns a 500 - Internal Server Error when trying to create a dataset with the same table_name but a different schema because the current implementation only checks for the existence of a table by its table_name and database_id. It does not account for differences in the schema. If a table with the same table_name already exists, it returns the existing table's ID without considering the schema, leading to potential conflicts or errors [1].

The relevant code snippet is:

@expose("/get_or_create/", methods=("POST",))
@protect()
@safe
@statsd_metrics
@event_logger.log_this_with_context(
    action=lambda self, *args, **kwargs: f"{self.__class__.__name__}"
    f".get_or_create_dataset",
    log_to_statsd=False,
)
def get_or_create_dataset(self) -> Response:
    """Retrieve a dataset by name, or create it if it does not exist.
    ---
    post:
      summary: Retrieve a table by name, or create it if it does not exist
      requestBody:
        required: true
        content:
          application/json:
            schema:
              $ref: '#/components/schemas/GetOrCreateDatasetSchema'
      responses:
        200:
          description: The ID of the table
          content:
            application/json:
              schema:
                type: object
                properties:
                  result:
                    type: object
                    properties:
                      table_id:
                        type: integer
        400:
          $ref: '#/components/responses/400'
        401:
          $ref: '#/components/responses/401'
        422:
          $ref: '#/components/responses/422'
        500:
          $ref: '#/components/responses/500'
    """
    try:
        body = GetOrCreateDatasetSchema().load(request.json)
    except ValidationError as ex:
        return self.response(400, message=ex.messages)
    table_name = body["table_name"]
    database_id = body["database_id"]
    if table := DatasetDAO.get_table_by_name(database_id, table_name):
        return self.response(200, result={"table_id": table.id})

The code only checks for the existence of a table by table_name and database_id:

if table := DatasetDAO.get_table_by_name(database_id, table_name):
    return self.response(200, result={"table_id": table.id})

This logic does not consider the schema, which can lead to a 500 - Internal Server Error if there is a conflict or inconsistency when trying to create a dataset with the same table_name but a different schema [1].

Additionally, the test case test_create_dataset_same_name_different_schema in the integration tests demonstrates that creating datasets with the same table_name but different schema is a valid scenario, which the current implementation does not handle correctly [2].

To continue talking to Dosu, mention @dosu.

Comment From: rusackas

Sorry this (and the PR) have been stuck for >200 days now! Must just be an edge case that isn't bugging most folks. Hopefully we can get the PR brought up to a mergable state and get it in, if you're still facing this issue in current versions (4.1.2 or newer)

Comment From: rusackas

Help welcomed in contributing to or adopting the linked PR. This might be closed as "stale" and/or "not planned" if nobody is interested in carrying it through.