The input column name in pandas.DataFrame.query() contains special characters. I saw the change in 0.25, but characters such as `.`, `\`, and `/` still cause problems. The main problem is that I can't restore these characters after converting them to `_`, which is a very serious problem.
I have an idea:

input: `pandas.dataframe` (a column name containing `.`)

Intermediate processing: replace the special character with its hex code (`.` -> `x2E`), giving `pandasx2Edataframe`; after processing the data, add a backslash marker, giving `pandas\x2Edataframe`

output: `pandas\x2Edataframe`
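The round-trip idea above can be sketched as a pair of helpers. This is an illustrative sketch, not pandas code; `encode_special` and `decode_special` are hypothetical names, and note the ambiguity the author worries about: a name that literally contains `x2E` would be corrupted on decoding, which is why some extra marker would be needed in practice.

```python
import re

def encode_special(name):
    # Replace each non-word character with an 'xNN' hex code, e.g. '.' -> 'x2E'
    return re.sub(r"\W", lambda m: "x%02X" % ord(m.group()), name)

def decode_special(name):
    # Reverse the encoding by turning each 'xNN' back into its character.
    # Caveat: this also rewrites any literal 'xNN' that was in the original
    # name, which is exactly the restoration problem described above.
    return re.sub(r"x([0-9A-F]{2})", lambda m: chr(int(m.group(1), 16)), name)

print(encode_special("pandas.dataframe"))  # pandasx2Edataframe
print(decode_special("pandasx2Edataframe"))  # pandas.dataframe
```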
Comment From: WillAyd
Do you have a minimally reproducible example for this?
https://pandas.pydata.org/pandas-docs/stable/development/contributing.html#bug-reports-and-enhancement-requests
Comment From: solivehong
Here is what I mean, @WillAyd @hwalinga:
```python
import pandas as pd
import re

a = pd.DataFrame({'foo.bar': [11111, 2222, 333333]})

def in_columns_data(col):
    # Find the first non-word character in the first column name
    space = re.findall(r'(\W)', list(col)[0])[0]
    if space == '.':
        new = list(col)[0].replace(space, 'x2E')
        control = 1
        print(new)
        return new, control
    elif space == ' ':
        new = list(col)[0].replace(space, 'x20')
        control = 2
        print(new)
        return new, control

name, control = in_columns_data(a.columns)
a.columns = [name]
print(a)
a = a.query("{name} > 2222".format(**locals()))

def out_columns_data(col, control):
    if control == 1:
        new = list(col)[0].replace('x2E', '.')
        print(new)
        return [new]
    elif control == 2:
        new = list(col)[0].replace('x20', ' ')
        print(new)
        return [new]

a.columns = out_columns_data(a.columns, control)
print(a)
```
the result is:

```
foox2Ebar
   foox2Ebar
0      11111
1       2222
2     333333
foo.bar
   foo.bar
0    11111
2   333333
```
Comment From: WillAyd
Can you try and make the example minimal? This link might help:
http://matthewrocklin.com/blog/work/2018/02/28/minimal-bug-reports
Comment From: hwalinga
@WillAyd
This is what he means:
```python
import pandas as pd

a = pd.DataFrame({'foo.bar': [11111, 2222, 333333]})
a.query("foo.bar > 2222")
# fails
```
(NB, Chinese just works fine in Python3, and since new releases don't support Python2 anyway, that issue can be dropped.)
He is coming from https://github.com/pandas-dev/pandas/issues/6508 which I solved for spaces in https://github.com/pandas-dev/pandas/pull/24955
The question that is now on the table: Do we want to support other forbidden characters (next to space) as well?
The problem is that we need to work around the tokenizer, but since the tokenizer recognizes Python operators, that can be dealt with. (So `.` `/` `-` `+` `*`.) However, `\` is an escape character, which is probably a lot harder; I don't know if it's even possible.
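To see why the tokenizer is the obstacle, here is a small demonstration using Python's standard `tokenize` module (which behaves like the tokenizer the query parser has to work around, though pandas' internal machinery is more involved): a dotted column name is never seen as a single name.

```python
import io
import tokenize

# Tokenize the expression as a parser would see it:
tokens = [
    (tokenize.tok_name[tok.type], tok.string)
    for tok in tokenize.generate_tokens(io.StringIO("foo.bar > 2222").readline)
    if tok.type not in (tokenize.NEWLINE, tokenize.ENDMARKER)
]
print(tokens)
# The '.' comes out as an OP token between two NAME tokens,
# so 'foo.bar' is parsed as attribute access, not one identifier.
```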
So, shall I make a pull request for this, or isn't this necessary enough to pollute the code base further with? (I estimate I need to add one new function and alter two existing ones, from what I remember from the last PR I did on this.)
@jreback has reviewed the previous PR and may want to have a word on this before I start.
Comment From: jreback
I am probably -1 on this; we by definition only support python tokens
you can always not use query which is a convenience method
Comment From: solivehong
I think passing a string to query and eval is a convenient way to provide input, and it can improve the speed of processing data. I have been stuck on this problem because I found that the strings can use regular expressions, format(), and other methods, which is very convenient. @jreback
Comment From: jreback
> I think passing a string to query and eval is a convenient way to provide input, and it can improve the speed of processing data. I have been stuck on this problem because I found that the strings can use regular expressions, format(), and other methods, which is very convenient. @jreback
this does not improve processing at all unless you are doing very simple operations
changing to non python semantics is cause for confusion
Comment From: hwalinga
@danielhrisca @dgua
These people might also have a word on this, since they requested the same in the issue about the spaces.
Comment From: hwalinga
I thought you would be skeptical @jreback, but let's view it this way: we already have a way to distinguish between valid Python names and invalid Python names, namely the backtick quoting I implemented to allow spaces. So there isn't much of a barrier left to extend `this kind of names` to also allow `this.kind.of.names` or `this-kind-of-names`.
Not everybody in the world is used to snake_case, and the dots in the names would probably cater to the people coming from R. (In which having dots in your identifiers is basically the equivalent of underscores in python.) And who knows, maybe there is a webdev very happy to write his/her names as CSS class names.
If you are worried about how horrible the code will look, it is not that bad. I implemented it already. Take a look: https://github.com/hwalinga/pandas/tree/allow-special-characters-query
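For reference, this is what the backtick quoting from the earlier spaces PR already allows, and what the linked branch extends to other characters (the second call requires a pandas version with the extended support, i.e. pandas >= 1.0 in released versions):

```python
import pandas as pd

df = pd.DataFrame({"foo bar": [1, 5, 10], "foo.bar": [2, 4, 6]})

# Backtick quoting (pandas >= 0.25) lets query() reference a column
# whose name contains a space:
out = df.query("`foo bar` > 1")
print(out)

# Extended backtick support (pandas >= 1.0) covers other special
# characters such as dots:
out2 = df.query("`foo.bar` >= 4")
print(out2)
```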
Comment From: solivehong
excellent job @hwalinga
Eval and query are the two methods I use the most.
Looking forward to this landing in 0.25; I will download it and test it first.
Comment From: hwalinga
@jreback Do you want me to open a PR, or will you close this as a 'wontfix'?
Comment From: JivanRoquet
@hwalinga I second that. One more use case besides `this.kind.of.name` is when a column name starts with a digit (which is not a valid Python variable name), like `60d` or `30d`.
Currently (as of 0.25.1) even using backticks doesn't work with columns starting with a digit.
```python
df.eval("`column 60d` + `column 30d`")  # works
df.eval("`60d` + `30d`")  # fails
```
It seems that under the hood, pandas replaces spaces with underscores and removes the backticks, like this:

```
`hello there` + `foo bar`
# becomes
hello_there_BACKTICK_QUOTED_STRING + foo_bar_BACKTICK_QUOTED_STRING
# which is a valid Python expression
```

and

```
`30d` + `60d`
# becomes
30d_BACKTICK_QUOTED_STRING + 60d_BACKTICK_QUOTED_STRING
# which is NOT a valid Python expression
```

As a solution, one might suggest always prepending a `_` to a backticked column (instead of only appending `_BACKTICK_QUOTED_STRING` like now), like the following:

```
`30d` + `60d`
# would become
_30d_BACKTICK_QUOTED_STRING + _60d_BACKTICK_QUOTED_STRING
# which is a valid Python expression
```
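The suggested mangling can be sketched as follows. This is an illustrative helper, not pandas code; the `_BACKTICK_QUOTED_STRING` suffix mimics the internal scheme described above, and `mangle` is a hypothetical name.

```python
def mangle(name):
    # Mimic the described scheme: spaces -> underscores, then the suffix
    candidate = name.replace(" ", "_") + "_BACKTICK_QUOTED_STRING"
    # Prepending '_' rescues names that start with a digit,
    # since a Python identifier may not begin with one:
    return candidate if candidate.isidentifier() else "_" + candidate

print("30d".isidentifier())   # False: digits can't start an identifier
print(mangle("hello there"))  # hello_there_BACKTICK_QUOTED_STRING
print(mangle("30d"))          # _30d_BACKTICK_QUOTED_STRING
```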
Comment From: JivanRoquet
@jreback
changing to non python semantics is cause for confusion
I don't think this is right. Datasets' column names (and the people who come up with them) couldn't care less about Python semantics, and it seems reasonable enough to think that pandas users would expect eval and query to work out-of-the-box with any kind of dataset column names.
Another way to put it is that the analytics tool (here, Pandas) has to adapt to real-life datasets, instead of the other way around. The "other way around" will simply never happen, and if it doesn't happen either on Pandas' side, then it's up to the Pandas analyst to deal with the nitty-gritty hoops and bumps of dealing with exotic column names. Which is far from ideal.
Comment From: hwalinga
Thanks for the feedback.
I can understand jreback's perspective. Allowing all these kinds of edge cases brings quite some ugly code into the codebase; unless you have your own language parser built in, you aren't really solving it very elegantly.
Even if we allow these edge cases with the aforementioned hacks, there will be all kinds of ways to break it.
A different approach that has also been discussed is to use a UUID instead. That would probably be the most elegant, but would make exceptions harder to read.
Since there are now multiple people having trouble with (or expecting too much from) the backticks, maybe the backtick approach is worth reimplementing, even though the exceptions become harder to read?
@zhaohongqiangsoliva maybe this new activity makes it worth reopening again?
Comment From: TomAugspurger
@JivanRoquet Are non-Python identifiers even feasible with how query / eval works? I'm not too familiar with it, but my understanding is that we use the `ast` module to build up an expression that's passed to numexpr.
Comment From: solivehong
@hwalinga I can't reopen a closed issue; I searched using Google, but I didn't find a way.
Comment From: dgua
> I don't think this is right. Datasets' column names (and the people who come up with them) couldn't care less about Python semantics, and it seems reasonable enough to think that pandas users would expect eval and query to work out-of-the-box with any kind of dataset column names.
100% agree with @JivanRoquet. This needs to work for all of us who use real life data files. This issue needs to be resolved.
Comment From: TomAugspurger
@dgua do you have a design proposal?
Comment From: hwalinga
@TomAugspurger To give you little background. Using non-python identifiers was solved in https://github.com/pandas-dev/pandas/pull/24955 . However, it was only solved for spaces. This issue talks about extending that functionality. I actually already implemented it here: https://github.com/hwalinga/pandas/tree/allow-special-characters-query
Shall I make a PR? (I will then also make the change to allow numbers in the beginning.)
Comment From: jreback
@hwalinga your implementation looks ok; even though I am basically -0 on using query / eval in general (as it's very opaque to readers), I would be ok with this change
Comment From: shankari
This has been resolved 5 years ago, but in case this issue ever comes up for discussion again (maybe in an effort to simplify/clean up code), I just want to highlight that calling `pandas.json_normalize` on nested data generates column names with dots.
```python
>>> import pandas as pd
>>> t = [{"a": {"b": 2, "c": 3}}, {"a": {"b": 4, "c": 5}}]
>>> pd.json_normalize(t)
   a.b  a.c
0    2    3
1    4    5
```
With regard to:

> even though i am basically -0 on generally using query / eval (as its very opaque to readers); would be ok with this change
querying these values can be a bit cumbersome, e.g. the basic attribute-style filter doesn't work:

```python
>>> td = pd.json_normalize(t)
>>> td[td.a.b == 4]
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Users/kshankar/miniconda-25.1.1/envs/emission/lib/python3.9/site-packages/pandas/core/generic.py", line 5902, in __getattr__
    return object.__getattribute__(self, name)
AttributeError: 'DataFrame' object has no attribute 'a'
```
Using an explicit column name works:

```python
>>> td[td['a.b'] == 4]
   a.b  a.c
1    4    5
```
but not for more complex queries:

```python
>>> td[td['a.b'] == 4 || td['a.c'] == 3]
  File "<stdin>", line 1
    td[td['a.b'] == 4 || td['a.c'] == 3]
                       ^
SyntaxError: invalid syntax
>>> td[(td['a.b'] == 4) or (td['a.c'] == 3)]
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Users/kshankar/miniconda-25.1.1/envs/emission/lib/python3.9/site-packages/pandas/core/generic.py", line 1527, in __nonzero__
    raise ValueError(
ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().
```
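For completeness, the complex filters above do work with the element-wise boolean operators (`|` and `&` with parentheses), and with `query()` using the backtick quoting this issue produced (pandas >= 1.0):

```python
import pandas as pd

t = [{"a": {"b": 2, "c": 3}}, {"a": {"b": 4, "c": 5}}]
td = pd.json_normalize(t)

# Element-wise | works where Python's 'or' (or C-style '||') cannot;
# each comparison must be parenthesized:
mask = (td["a.b"] == 4) | (td["a.c"] == 3)
print(td[mask])

# query() with backticks expresses the same filter on dotted names:
print(td.query("`a.b` == 4 or `a.c` == 3"))
```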