Pandas QST: Does the project consider DataFrame.query() arbitrary code execution to be a security vulnerability?

Research

[X] I have searched the [pandas] tag on StackOverflow for similar questions.
[X] I have asked my usage related question on StackOverflow.

Link to question on StackOverflow

https://stackoverflow.com/questions/79304226/should-i-manually-patch-the-pandas-dataframe-query-vulnerability-or-wait-for-a

(To clarify, this question was written by another user.)

Question about pandas

Hi, I saw this question on StackOverflow, which is about a public CVE, CVE-2024-9880.

The basic premise of the CVE is that if an attacker controls the expr argument to DataFrame.query(), then arbitrary code execution can be achieved.

The example given in the CVE is

import pandas as pd


df = pd.DataFrame({'a': [1, 2, 3], 'b': ['error_details', 'confidential_info', 'normal']})


query = '@pd.core.frame.com.builtins.__import__("os").system("""ping google.com #""")'
try:
    engine = "python"
    result = df.query(query,local_dict={},engine="python",).index
except Exception as e:
    print(f'Error: {e}')

However, this is not minimal, and a more minimal construction would be

import pandas as pd

df = pd.DataFrame()

expr = '@pd.compat.os.system("""echo foo""")'
result = df.query(expr, engine='python')

(The report also says that engine='python' is required, but both engine='python' and engine='numexpr' worked in my testing.)

My question is about Pandas's security model. What security guarantees does Pandas make about DataFrame.query() with an attacker-controlled expr?

My intuition about this is "none, don't do that," but I'm wondering what the Pandas project thinks.

Comment From: skj-skj

any updates? Snyk is flagging this as High Severity Vulnerability and is blocking my deployment.

its unfortunate it is happening on Holidays.

Comment From: rhshadrach

My intuition about this is "none, don't do that," but I'm wondering what the Pandas project thinks.

This is indeed my take, both query and eval should be used with string literals and not with strings provided by or derived from untrusted user input.

cc @pandas-dev/pandas-core

Comment From: Dr-Irv

This is indeed my take, both query and eval should be used with string literals and not with strings provided by or derived from untrusted user input.

This could be tricky, because something like this should be OK:

mystr = "something from a user"
df.query(f"a_column == {mystr}")

In other words, it's OK to construct a query string based on user input.

Comment From: nickodell

This could be tricky, because something like this should be OK:

Isn't this technique still vulnerable to injection? Example:

import pandas as pd

df = pd.DataFrame({'a_column': []})

mystr = '@pd.compat.os.system("""echo foo""")'
df.query(f"a_column == {mystr}")

This still prints 'foo' for me.

Comment From: Dr-Irv

Isn't this technique still injectable?

Yes, so it seems that the caller that creates the parameter to query() needs to put a safeguard around that computation.

Comment From: rhshadrach

@Dr-Irv - I do not think it is a reasonable expectation that users are able to implement a safeguard. Preventing query-injection can be very difficult. This is why SQL libraries provide such functionality. It seems to me either pandas needs to provide it or make the recommendation that users do not use query/eval in this manner.

Comment From: WillAyd

The eval documentation has a warning against this (see https://github.com/pandas-dev/pandas/pull/59108). I think can be added to query as well

Comment From: WillAyd

As an added note, please do not report security vulnerabilities to our github page. If you suspect or know of one, we have a process in place through our partnership agreement with Tidelift.

See https://tidelift.com/docs/security for more details on how to report going forward. Locking this conversation for now - thanks!

Comment From: TomAugspurger

Agreed with Will, that repeating that warning in the docs for query is worth doing. pandas doesn't attempt to sanitize the input in any way, so there's not a security boundary here. (we could have a discussion about whether we wanted to attempt to sanitize things, but IMO it's not worth it).

A bit more context on this specific report. It was reported to the pandas-security team, and we replied that it wasn't considered a vulnerability. I'm not sure why they obtained / published a CVE :/

Comment From: TomAugspurger

@ethansilvas, @aftersnow, @huntr-helper I see that you're active on https://huntr.com/bounties/a49baae1-4652-4d6c-a179-313c21c41a8d. If possible, could you attempt to retract that CVE? Thanks!

Comment From: WillAyd

On a side note we probably need to better document the security reporting process to Tidelift. I didn't see it in the website - am I overlooking?

Comment From: TomAugspurger

Not sure whether we have a link from the site.

We have https://github.com/pandas-dev/pandas/security, which is the standard place for GitHub. Also linking from the site, if we aren't already, would be nice.

edit: https://pandas.pydata.org/docs/development/policies.html#security-policy links to https://github.com/pandas-dev/pandas/security, so you'll get there eventually.

Comment From: skj-skj

is it possible to add parameterized query like there is for SQL in DataFrame().query() or is it already available?

Comment From: WillAyd

@skj-skj if you want to open a dedicate issue to discuss that I think it would be better. My initial thought is that we wouldn't want to build that ourselves and create that security boundary. It has the potential to be a lot of work to maintain, not in an area we specialize in, and we are already resource constrained. If there's a dedicated library that could be used for that purpose maybe it can be leveraged. But yea I think worth opening an issue dedicated to discuss that if its something you are passionate about

Comment From: bashtage

I do not think it is a reasonable expectation that users are able to implement a safeguard. Preventing query-injection can be very difficult. This is why SQL libraries provide such functionality. It seems to me either pandas needs to provide it or make the recommendation that users do not use query/eval in this manner.

I think that someone who is using .query with arbitrary input from untrusted users should have the knowledge to use it without sanitization. Is there anything that query can do that can't be done with other methods that would not have the same security risk?

Comment From: bashtage

Reading this thread, this feels like a great idea for a project that extends query but with a promise of security, a la parameterized SQL.

Comment From: CTPassion

From what I've read in this thread, am I right that the development team doesn't consider this a vulnerability and will not be making changes (other than perhaps a logging warning)?

In that case there are many people like myself who will need to inform SNYK as it is considering this a high risk vulnerability that is blocking pipelines. SNYK Pandas

Comment From: bashtage

That is correct. Pandas doesn't make any security promises with query, and so users who pass untrusted data to query must provide any required security.

@TomAugspurger 's comment above has the most precis answer.

Comment From: mcclanahanmp

To note, I did submit an inquiry on the 26th to security@tidelift.com, but have yet to receive a response. However, thank you all for your input on this matter and @CTPassion and @bashtage for clarifying. -SrvrAdmin

Comment From: CTPassion

Note, SNYK have removed this as a vulnerability https://security.snyk.io/vuln/SNYK-PYTHON-PANDAS-8549481 :)

Comment From: beefkit

I saw that this was officially moved on 3/20 into the NVD and is 'awaiting analysis', so seems like we will be getting an answer soon so this can be removed off our scans.

https://nvd.nist.gov/vuln/detail/CVE-2024-9880

Our security team has let this progress into needing an acceptance of risk from our client. I explained that we do not even use this method nor is a validated CVE.. Seems that Tenable assigns scores and purports them as valid when they have not even been validated by NIST or accepted into the NVD. They also do not even test for this vulnerability but only rely off the reported version number.

Hoping NIST can wrap this up fast so this can be closed.

https://www.tenable.com/cve/CVE-2024-9880

Comment From: jvparidon

This ended up in the NVD because of a report filed at https://huntr.com/bounties/a49baae1-4652-4d6c-a179-313c21c41a8d. I posted to huntr to see if they'd be willing to follow SNYK's lead and mark this as not a vulnerability, which would presumably be the easiest way to resolve this problem.

Comment From: jvparidon

Now marked as rejected in the NVD. (https://nvd.nist.gov/vuln/detail/CVE-2024-9880)

Comment From: beefkit

Just a fun little write up on this.. Tenable is still actively refusing to remove their plugin on this finding. Not sure what gives.. I am reaching out to Huntr support to see if they have any information/documentation on the rejection, but this is quite ridiculous.