When importing a dataset via api/v1/dataset/import with overwrite=true, the data in the referenced csv file (e.g. data: http://example/file.csv) is not overwritten the second time you import the same dataset, even though the csv file has been updated in the meantime.
Looking at the code in datasets/commands/importers/v1/utils.py -> import_dataset:

```python
if data_uri and (not table_exists or force_data):
    load_data(data_uri, dataset, dataset.database, session)
```

On the second import REST call, data_uri is set and "not table_exists" is false because the table already exists, so I believe force_data is not set to true when overwrite=true is passed in the import REST call.
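For illustration, here is a minimal, self-contained sketch of that guard (not the actual Superset source; the function name `should_load_data` is made up for this example). It shows why an existing table short-circuits the download when force_data keeps its False default:

```python
from typing import Optional

# Hypothetical standalone reduction of the guard quoted above; in Superset
# the check lives inside import_dataset, where force_data defaults to False
# and is not derived from the API's overwrite flag.
def should_load_data(data_uri: Optional[str], table_exists: bool,
                     force_data: bool = False) -> bool:
    return bool(data_uri) and (not table_exists or force_data)

# First import: the table does not exist yet, so the csv is downloaded.
assert should_load_data("http://example/file.csv", table_exists=False)

# Second import with overwrite=true: the table exists and force_data is
# still False, so the csv is never re-downloaded -- the bug reported here.
assert not should_load_data("http://example/file.csv", table_exists=True)
```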
How to reproduce the bug
- Upload a file.csv to a configured database (in my case MySQL)
- Export the dataset that was created
- Add some extra lines to the file.csv
- Modify the exported dataset zip file and add a line "data: http://example/file.csv" referencing the updated file.csv
- Import the zip file with the Swagger UI and set overwrite=true
Or do steps 1 and 2 above, then:
3. Modify the dataset zip with a new "uuid", "data", and "table_name"
4. Import the zip file with the Swagger UI and set overwrite=false to create a new dataset
5. Add some extra lines to the file.csv
6. Import the zip file again with the Swagger UI and set overwrite=true (see the request sketch below)
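For completeness, a minimal sketch of the final import call issued outside the Swagger UI, using python requests. The multipart field name `formData` and the `overwrite` form field are assumptions based on the dataset import endpoint; verify them (and any CSRF requirements) against your instance's Swagger UI:

```python
import requests

BASE = "http://localhost:8088"  # assumption: local Superset instance
TOKEN = "..."  # access token from /api/v1/security/login

# Re-import the modified export with overwrite=true (step 6 above).
with open("dataset_export.zip", "rb") as f:
    resp = requests.post(
        f"{BASE}/api/v1/dataset/import/",
        headers={"Authorization": f"Bearer {TOKEN}"},
        files={"formData": ("dataset_export.zip", f, "application/zip")},
        data={"overwrite": "true"},
    )
print(resp.status_code, resp.text)
```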
Expected results
I would expect the content of the updated file.csv in the physical table, not the content of the original file.csv. I can also see that Superset does not log logger.info("Downloading data from %s", data_uri) the second time.
Actual results
I see in the logs that the dataset is updated with the same information (columns, metrics), but the content of the new updated file is not downloaded and inserted into the physical table.
Environment
repository: apache/superset
tag: latest-dev
```
root@superset-865d68b7f6-b7hkl:/app/superset# superset --version
Loaded your LOCAL configuration at [/app/pythonpath/superset_config.py]
Python 3.9.18
Flask 2.2.5
Werkzeug 2.3.3
```
```python
FEATURE_FLAGS = {
    "EMBEDDED_SUPERSET": True,
    "DASHBOARD_RBAC": True,
    "THUMBNAILS": True,
    "HORIZONTAL_FILTER_BAR": True,
}
```
Checklist
Make sure to follow these steps before submitting your issue - thank you!
- [x] I have checked the Superset logs for python stacktraces and included them here as text if there are any.
- [x] I have reproduced the issue with at least the latest released version of Superset.
- [x] I have checked the issue tracker for the same issue and I haven't found one similar.
Comment From: rusackas
Pinging @dpgaspar on this since you're looking at this API currently and might have some insight as to how it should or shouldn't work.
Comment From: DominikG00d
Temporary workaround until there is a way to override the force_data: bool = False default value via the API request (sketched below):
1. Fetch your current table_id pk from the Superset endpoint /api/v1/dataset/get_or_create/
2. Afterwards, perform a DELETE request on /api/v1/dataset/{pk}
3. Importing again now forces this statement to reload the data, since the table does not exist anymore:

```python
if data_uri and (not table_exists or force_data):
    load_data(data_uri, dataset, dataset.database, session)
```
However, be careful that your table id exists only once, even across other schemas. We found that if any other schema has the same id, the "table_exists" flag won't be marked as false.
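A rough sketch of the three workaround steps with python requests (the get_or_create payload fields, the response shape, and the import form fields are assumptions; check them against your version's Swagger UI before use):

```python
import requests

BASE = "http://localhost:8088"  # assumption: local Superset instance
HEADERS = {"Authorization": "Bearer ..."}  # token from /api/v1/security/login

# 1. Resolve the dataset pk for the physical table (hypothetical values).
resp = requests.post(
    f"{BASE}/api/v1/dataset/get_or_create/",
    headers=HEADERS,
    json={"table_name": "file", "database_id": 1},
)
pk = resp.json()["result"]["table_id"]

# 2. Delete the dataset so table_exists is false on the next import.
requests.delete(f"{BASE}/api/v1/dataset/{pk}", headers=HEADERS)

# 3. Re-import the zip; load_data now runs because the table is gone.
with open("dataset_export.zip", "rb") as f:
    requests.post(
        f"{BASE}/api/v1/dataset/import/",
        headers=HEADERS,
        files={"formData": ("dataset_export.zip", f, "application/zip")},
        data={"overwrite": "true"},
    )
```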
Comment From: rusackas
Is anyone facing this in 4.1.1? I'm not sure if changes have been made here or not.
Comment From: rusackas
Assuming the workaround here did the trick... closing this one out, but happy to revisit/reopen if anyone's facing this in current versions.