I'm using Dremio database and it throws a VALIDATION ERROR on any query that tries to group by a time field that has a granularity expression. This might be valid SQL for another database system, but not Dremio. This makes superset useless for Dremio tables since grouping by time is a very common and used feature

How to reproduce the bug

  1. Create a new table chart and pick a timestamp field like 2022-06-10 12:34:56 and select time grain day or month
  2. Select as dimension the same timestamp field
  3. Add any metric such as COUNT(*)
  4. Superset will generate this query:
SELECT DATE_TRUNC('day', "OperationDate") AS "OperationDate",
       COUNT(*) AS "count"
FROM "no-partitions-queries"."messages_operations_v"
GROUP BY DATE_TRUNC('day', "OperationDate")
ORDER BY "count" DESC
LIMIT 10000;

It is not using the name of the column OperationDate in the group by and is rewriting the same expression.

Expected results

to have this clause:

GROUP BY "OperationDate"

Actual results

Query execution error. Details: VALIDATION ERROR: Expression \'messages_operations_v.OperationDate\' is not being grouped

Unexpected error Error: ('HY000', '[HY000] [Dremio][Connector] (1040) Dremio failed to execute the query: SELECT DATE_TRUNC(\'day\', "OperationDate") AS "OperationDate",\n COUNT(*) AS "count"\nFROM "no-partitions-queries"."messages_operations_v"\nGROUP BY DATE_TRUNC(\'day\', "OperationDate")\nORDER BY "count" DESC\nLIMIT 10000\n[30038]Query execution error. Details:[ \nVALIDATION ERROR: Expression \'messages_operations_v.OperationDate\' is not being grouped\n\nSQL Query SELECT DATE_...[see log] (1040) (SQLExecDirectW)')

Screenshots

Screen Shot 2022-06-10 at 10 50 01

Environment

  • browser type and version: Chrome v102.0.5005.61
  • superset version: It show Superset 0.0.0dev -- I'm using the latest docker image
  • python version: 3.8.12
  • node.js version: none
  • any feature flags active: no

Checklist

Make sure to follow these steps before submitting your issue - thank you!

  • [x] I have checked the superset logs for python stacktraces and included it here as text if there are any.
  • [x] I have reproduced the issue with at least the latest released version of superset.
  • [x] I have checked the issue tracker for the same issue and I haven't found one similar.

Additional context

Add any other context about the problem here.

Comment From: rusackas

Wondering if @eschutho or @zhaoyongjie have any insight on this issue.

Comment From: eschutho

@jbvsmo I found this post which looks related to this issue: https://community.dremio.com/t/unable-to-group-by-date-trunc/7953

@rusackas I think it makes sense to investigate this further and whether it's indeed a fix that's needed on our side or Dremio, but I think as as workaround you should be able to create a calculated column on your dataset for DATE_TRUNC('day', "OperationDate") and give it a new name and then use that new column for your dimension/group by (you can use the original column in your time dimension). When trying that change, my query now looks like this one, which matches what looks to be the fix for Dremio:

_DEV__Explore_-_Untitled_Query_1_06_14_2022_14_17_27

also cc @betodealmeida for any additional insight.

Comment From: jbvsmo

@eschutho Following up on this, seems like Dremio still doesn't support that syntax and they recommend using the alias on the groupby clause. I asked on that community post you showed and they responded that.

There are a few issues with calculated columns in the dataset, one being annoying to do it for several tables and another that sometimes I want to aggregate by day, month, year, quarter, week... I would have to create many columns when a simply fix of using the correct alias would be enough

Comment From: eschutho

@jbvsmo you can also create an alias/calculated column without the time coercion if you want it to be more flexible. Here's an example of renaming ds to date: _DEV__Superset Then just use date for your time grain and ds for your dimension.

I hear what you're saying about it not being ideal with regard to having to do this on many tables, but it is a temporary workaround for now until Dremio can fix the bug on their end. The code solution on the Superset side around this would need to be something where we alias all time dimensions for Dremio which could get messy, but we're definitely open to any PRs if someone wants to contribute an idea.

Comment From: meyergin

Same issue at Apache drill, maybe need to check database driver to choose native name/ alias name.

Comment From: mbrannstrom

I'm using Apache Drill and having the same problem (as @meyergin also mentioned).

Apache Drill supports GROUP BY both with and without alias. E.g. both these works:

SELECT LEFT(name,2) as name2 FROM ... GROUP BY LEFT(name,2) - OK SELECT LEFT(name,2) as name2 FROM ... GROUP BY name2 - OK

However, when the alias is the same as the column name, we get a different behaviour: SELECT LEFT(name,2) as name FROM ... GROUP BY LEFT(name,2) - FAIL - Method used by Superset SELECT LEFT(name,2) as name FROM ... GROUP BY name - OK

@eschutho's workaround by adding a new calculated column ts, will make superset not use the same alias as the original column. However, this is only reasonable for the time column, and not custom SQL adhoc metrics.

I'm not sure what is "proper SQL", but many engines seem to handle GROUP BY very differently.

In superset/superset/models/helpers.py make_orderby_compatible:

    If needed, make sure aliases for selected columns are not used in
    `ORDER BY`.

    In some databases (e.g. Presto), `ORDER BY` clause is not able to
    automatically pick the source column if a `SELECT` clause alias is named
    the same as a source column. In this case, we update the SELECT alias to
    another name to avoid the conflict.`

So, some adjustment is done for ORDER BY, and it seems like the same is needed for GROUP BY.

Comment From: rusackas

I'm not sure who might be best to take a look at the current state of affairs here, as I don't know of anyone Drill or Dremio volunteers to reproduce or assess.

I might consider filing the Drill issue separately, lest this issue get conflated and stuck.

CC @naren-dremio in case he knows of anyone fitting to take a look at this from the Dremio side.

Comment From: rusackas

Anyone still facing this, or looking to work on it? It's at risk of being closed as stale.

Comment From: mbrannstrom

@rusackas: See issue #28443. The PR #28444 solves my issue with Drill. Just waiting for it to be merged ... and released.

Comment From: rusackas

Adding more reviewers to that PR :)

Comment From: rusackas

This has been silent for quite a while. Anyone still facing this in 4.1.1 or newer?

Comment From: fhyy

I'm using Drill and I'm facing the same problem in Superset version 4.1.1

Resulting query:

SELECT NEARESTDATE(time_field, 'SECOND') AS time_field, COUNT(*) AS `count` 
FROM Data GROUP BY NEARESTDATE(time_field, 'SECOND') ORDER BY `count` DESC
 LIMIT 100;

Error:

VALIDATION ERROR: From line 1, column 20 to line 1, column 35: Expression 'Data.time_field' is not being grouped

Comment From: fhyy

Renaming the alias works for me using Drill, and would probably work with Dremio as well.

I believe the issue is the ambiguity of the statement GROUP BY NEARESTDATE(time_field, 'SECOND'). Is time_field referring to the source column or the alias with the same name?

Drill seem to use the alias inside of the NEARESTDATE function, and thus the column was not properly grouped.

Expression 'Data.time_field' is not being grouped

This updated query works fine:

SELECT NEARESTDATE(time_field, 'SECOND') AS time_field_a226ab, COUNT(*) AS `count` 
FROM Data GROUP BY NEARESTDATE(time_field, 'SECOND') ORDER BY `count` DESC
 LIMIT 100;

Comment From: fhyy

See PR #32857 for the changes that worked for me with Drill. A similar fix could be done for this issue, but I cannot test nor verify.

Comment From: rusackas

Just merged https://github.com/apache/superset/pull/32857 for Drill. Would anyone be willing to copy the work over to Dremio and/or work toward the more "generic" solution discussed on that thread? CC @villebro @mistercrunch who were involved there.

Comment From: mistercrunch

Flying blind but copied the drill fix in dremio here -> https://github.com/apache/superset/pull/33939 . Please validate the branch/fix if you have access to Dremio