Research
-
[X] I have searched the [pandas] tag on StackOverflow for similar questions.
-
[X] I have asked my usage related question on StackOverflow.
Link to question on StackOverflow
https://stackoverflow.com/questions/79304226/should-i-manually-patch-the-pandas-dataframe-query-vulnerability-or-wait-for-a
(To clarify, this question was written by another user.)
Question about pandas
Hi, I saw this question on StackOverflow, which is about a public CVE, CVE-2024-9880.
The basic premise of the CVE is that if an attacker controls the expr
argument to DataFrame.query(), then arbitrary code execution can be achieved.
The example given in the CVE is
import pandas as pd
df = pd.DataFrame({'a': [1, 2, 3], 'b': ['error_details', 'confidential_info', 'normal']})
query = '@pd.core.frame.com.builtins.__import__("os").system("""ping google.com #""")'
try:
engine = "python"
result = df.query(query,local_dict={},engine="python",).index
except Exception as e:
print(f'Error: {e}')
However, this is not minimal, and a more minimal construction would be
import pandas as pd
df = pd.DataFrame()
expr = '@pd.compat.os.system("""echo foo""")'
result = df.query(expr, engine='python')
(The report also says that engine='python'
is required, but both engine='python'
and engine='numexpr'
worked in my testing.)
My question is about Pandas's security model. What security guarantees does Pandas make about DataFrame.query() with an attacker-controlled expr
?
My intuition about this is "none, don't do that," but I'm wondering what the Pandas project thinks.
Comment From: skj-skj
any updates? Snyk is flagging this as High Severity Vulnerability and is blocking my deployment.
its unfortunate it is happening on Holidays.
Comment From: rhshadrach
My intuition about this is "none, don't do that," but I'm wondering what the Pandas project thinks.
This is indeed my take, both query and eval should be used with string literals and not with strings provided by or derived from untrusted user input.
cc @pandas-dev/pandas-core
Comment From: Dr-Irv
This is indeed my take, both query and eval should be used with string literals and not with strings provided by or derived from untrusted user input.
This could be tricky, because something like this should be OK:
mystr = "something from a user"
df.query(f"a_column == {mystr}")
In other words, it's OK to construct a query string based on user input.
Comment From: nickodell
This could be tricky, because something like this should be OK:
Isn't this technique still vulnerable to injection? Example:
import pandas as pd
df = pd.DataFrame({'a_column': []})
mystr = '@pd.compat.os.system("""echo foo""")'
df.query(f"a_column == {mystr}")
This still prints 'foo' for me.
Comment From: Dr-Irv
Isn't this technique still injectable?
Yes, so it seems that the caller that creates the parameter to query()
needs to put a safeguard around that computation.
Comment From: rhshadrach
@Dr-Irv - I do not think it is a reasonable expectation that users are able to implement a safeguard. Preventing query-injection can be very difficult. This is why SQL libraries provide such functionality. It seems to me either pandas needs to provide it or make the recommendation that users do not use query/eval in this manner.
Comment From: WillAyd
The eval documentation has a warning against this (see https://github.com/pandas-dev/pandas/pull/59108). I think can be added to query as well
Comment From: WillAyd
As an added note, please do not report security vulnerabilities to our github page. If you suspect or know of one, we have a process in place through our partnership agreement with Tidelift.
See https://tidelift.com/docs/security for more details on how to report going forward. Locking this conversation for now - thanks!
Comment From: TomAugspurger
Agreed with Will, that repeating that warning in the docs for query
is worth doing. pandas doesn't attempt to sanitize the input in any way, so there's not a security boundary here. (we could have a discussion about whether we wanted to attempt to sanitize things, but IMO it's not worth it).
A bit more context on this specific report. It was reported to the pandas-security team, and we replied that it wasn't considered a vulnerability. I'm not sure why they obtained / published a CVE :/
Comment From: TomAugspurger
@ethansilvas, @aftersnow, @huntr-helper I see that you're active on https://huntr.com/bounties/a49baae1-4652-4d6c-a179-313c21c41a8d. If possible, could you attempt to retract that CVE? Thanks!
Comment From: WillAyd
On a side note we probably need to better document the security reporting process to Tidelift. I didn't see it in the website - am I overlooking?
Comment From: TomAugspurger
Not sure whether we have a link from the site.
We have https://github.com/pandas-dev/pandas/security, which is the standard place for GitHub. Also linking from the site, if we aren't already, would be nice.
edit: https://pandas.pydata.org/docs/development/policies.html#security-policy links to https://github.com/pandas-dev/pandas/security, so you'll get there eventually.
Comment From: skj-skj
is it possible to add parameterized query like there is for SQL in DataFrame().query()
or is it already available?
Comment From: WillAyd
@skj-skj if you want to open a dedicate issue to discuss that I think it would be better. My initial thought is that we wouldn't want to build that ourselves and create that security boundary. It has the potential to be a lot of work to maintain, not in an area we specialize in, and we are already resource constrained. If there's a dedicated library that could be used for that purpose maybe it can be leveraged. But yea I think worth opening an issue dedicated to discuss that if its something you are passionate about
Comment From: bashtage
I do not think it is a reasonable expectation that users are able to implement a safeguard. Preventing query-injection can be very difficult. This is why SQL libraries provide such functionality. It seems to me either pandas needs to provide it or make the recommendation that users do not use query/eval in this manner.
I think that someone who is using .query
with arbitrary input from untrusted users should have the knowledge to use it without sanitization. Is there anything that query
can do that can't be done with other methods that would not have the same security risk?
Comment From: bashtage
Reading this thread, this feels like a great idea for a project that extends query
but with a promise of security, a la parameterized SQL.
Comment From: CTPassion
From what I've read in this thread, am I right that the development team doesn't consider this a vulnerability and will not be making changes (other than perhaps a logging warning)?
In that case there are many people like myself who will need to inform SNYK as it is considering this a high risk vulnerability that is blocking pipelines. SNYK Pandas
Comment From: bashtage
That is correct. Pandas doesn't make any security promises with query
, and so users who pass untrusted data to query
must provide any required security.
@TomAugspurger 's comment above has the most precis answer.
Comment From: mcclanahanmp
To note, I did submit an inquiry on the 26th to security@tidelift.com, but have yet to receive a response. However, thank you all for your input on this matter and @CTPassion and @bashtage for clarifying. -SrvrAdmin
Comment From: CTPassion
Note, SNYK have removed this as a vulnerability https://security.snyk.io/vuln/SNYK-PYTHON-PANDAS-8549481 :)
Comment From: beefkit
I saw that this was officially moved on 3/20 into the NVD and is 'awaiting analysis', so seems like we will be getting an answer soon so this can be removed off our scans.
https://nvd.nist.gov/vuln/detail/CVE-2024-9880
Our security team has let this progress into needing an acceptance of risk from our client. I explained that we do not even use this method nor is a validated CVE.. Seems that Tenable assigns scores and purports them as valid when they have not even been validated by NIST or accepted into the NVD. They also do not even test for this vulnerability but only rely off the reported version number.
Hoping NIST can wrap this up fast so this can be closed.
https://www.tenable.com/cve/CVE-2024-9880
Comment From: jvparidon
This ended up in the NVD because of a report filed at https://huntr.com/bounties/a49baae1-4652-4d6c-a179-313c21c41a8d. I posted to huntr to see if they'd be willing to follow SNYK's lead and mark this as not a vulnerability, which would presumably be the easiest way to resolve this problem.
Comment From: jvparidon
Now marked as rejected in the NVD. (https://nvd.nist.gov/vuln/detail/CVE-2024-9880)
Comment From: beefkit
Just a fun little write up on this.. Tenable is still actively refusing to remove their plugin on this finding. Not sure what gives.. I am reaching out to Huntr support to see if they have any information/documentation on the rejection, but this is quite ridiculous.
Comment From: cappetta
@beefkit - Tenable is still actively refusing to remove their plugin on this finding. Not sure what gives..
It seems folks don't understand cyber exposures. the pandas team and multiple people in this thread clearly state there is a code injection exposure, it exists & is known to exist. Pandas team does not want to own the fix. The exposure is valid. CISOs need to have developers analyze their codebase to determine if the app is passing in any untrusted input, if so - red-team exercise to either prove they are exploitable, introduce a fix via PR to the project, or accept the risk this software requires you to inherit.
Comment From: beefkit
@beefkit -
Tenable is still actively refusing to remove their plugin on this finding. Not sure what gives..
It seems folks don't understand cyber exposures. the pandas team and multiple people in this thread clearly state there is a code injection exposure, it exists & is known to exist. Pandas team does not want to own the fix. The exposure is valid. CISOs need to have developers analyze their codebase to determine if the app is passing in any untrusted input, if so - red-team exercise to either prove they are exploitable, introduce a fix via PR to the project, or accept the risk this software requires you to inherit.
Hey Cappetta, I definitely agree with you that code injection is possible if, as a developer, you don't sanitize your inputs and allow for code injection to happen from a user. However, this is not specific to this class in Pandas. Even base Python has this functionality, as do many different languages. Python's eval() and exec() functions work exactly the same way, and you'll find similar dynamic code execution functions in JavaScript, PHP, Ruby.
The issue here is that these functions are designed to execute code - that's their intended purpose, not a security vulnerability.
Tenable does not scan for actual exploitable code - they simply report based on version numbers. Since this 'vulnerability' is flagged for certain pandas versions, it gets reported regardless of whether your application actually uses DataFrame.query() or passes user input to it. If Tenable actually tested for exploitable code, I wouldn't be writing this comment. Security teams I have worked with would not be able to tell exploitable code from non-exploitable code and simply rely on whatever vendor they're using, in this case Tenable/Nessus, to tell them what is and what is not exploitable.
As far as what Pandas can take responsibility for, I think a warning would suffice. However, I'm not sure how this would be 'fixed' since it's not a 'true' security vulnerability - it only becomes one when you code it to be exploitable. What makes this situation particularly frustrating is that this CVE was withdrawn by Huntr at the request of MITRE and Baidu for being outside of scope, and it was rejected from the NVD. Yet Tenable is still actively claiming it's a high-severity vulnerability with a CVSS score of 8.4.
What specific fixes would you suggest to resolve this issue, and what would you like to see happen for Pandas to own the fix?
Comment From: cappetta
@beefkit - your an anon w/ no activity, you've mentioned tenable 9-times while cycling through the this is an intended exposure called a feature
& contradicting the basic concepts of cyber exposure.
First - Tenable does create plugins which perform exploit direct checks. This particular plugin requires 'thorough checks'. The plugin description is crystal clear, the see also links pinpoint the line in the v2.2.3 code calling the eval(). The main branch does not appear to fix this, there is some warnings/comments & layering.
The pandas dataframe query documentation indicates the intention & scope of the function: Query the columns of a DataFrame with a boolean expression.
That query allows variables; those variables can be manipulated to perform code/command execution.
The problem is that Pandas looks up the object named after the @
. It doesn’t restrict what that object can be. This means you’re not limited to “safe” variables like integers — you can reference whole modules (pd), classes, or functions available in the scope.
So an attacker can craft an expression like: @pd.core.frame.com.builtins.__import__("os").system("ping google.com")
Step by step payload processing :
- @pd → the global pandas module.
- .core.frame.com.builtins → gets to the builtins mapping.
- .import("os") → dynamically imports the os module.
- .system("ping google.com #") → runs a shell command.
So instead a simple expectation that the query is filtering rows, a malicious attacker can taint the scope and exploit eval as a backdoor and run system commands.
Comment From: TomAugspurger
@cappetta please keep it civil.
As mentioned in https://github.com/pandas-dev/pandas/issues/60602#issuecomment-2563146635, we're happy to accept a PR to copy the warning from the eval
docstring to the query
docstring.
The pandas dataframe query documentation indicates the intention & scope of the function: Query the columns of a DataFrame with a boolean expression.
That's what pandas does to the DataFrame with the result of the query.
a malicious attacker can taint the scope and exploit eval as a backdoor and run system commands.
Indeed. And like Python's eval
pandas doesn't make any attempt to prevent that.
Comment From: rhshadrach
As mentioned in #60602 (comment), we're happy to accept a PR to copy the warning from the
eval
docstring to thequery
docstring.
I believe this has already been done, but not yet released.