Pandas DataFrame.copy(), at least, should be threadsafe

dataframe.copy() should happen atomically/be threadsafe, meaning that it should produce a consistent dataframe even if the call to .copy() is made while another thread is deleting entries from the dataframe, or if another thread calls a deletion method while .copy() is still running (in other words, I guess .copy() should acquire a lock that prevents mutation during the copy). That is, the following code, which crashes in 0.7.3 with the traceback shown after it, should succeed:


import pandas
import threading

df = pandas.DataFrame()

def mutateDf(df):
    while True:
        df[0] = pandas.Series([1,2,3])
        del df[0]

def readDf(df):
    while True:
        dfCopy = df.copy()
        if 0 in dfCopy and 1 in dfCopy[0]:
            a = dfCopy[0][1]

t1 = threading.Thread(target=mutateDf, args=(df,))
t2 = threading.Thread(target=readDf, args=(df,))

t1.start()
t2.start()

Exception in thread Thread-3:
Traceback (most recent call last):
  File "/usr/lib/python2.7/threading.py", line 551, in __bootstrap_inner
    self.run()
  File "/usr/lib/python2.7/threading.py", line 504, in run
    self.__target(*self.__args, **self.__kwargs)
  File "<ipython-input-5-8aef72c7f1b4>", line 4, in readDf
    if 0 in dfCopy and 1 in dfCopy[0]:
  File "/usr/local/lib/python2.7/dist-packages/pandas-0.7.3-py2.7-linux-x86_64.egg/pandas/core/frame.py", line 1458, in __getitem__
    return self._get_item_cache(key)
  File "/usr/local/lib/python2.7/dist-packages/pandas-0.7.3-py2.7-linux-x86_64.egg/pandas/core/generic.py", line 294, in _get_item_cache
    values = self._data.get(item)
  File "/usr/local/lib/python2.7/dist-packages/pandas-0.7.3-py2.7-linux-x86_64.egg/pandas/core/internals.py", line 625, in get
    _, block = self._find_block(item)
TypeError: 'NoneType' object is not iterable

Comment From: ghost

Right now, pandas is explicitly not thread-safe. Taking any step down this path will inevitably generate lots of pain and changes all over. Python threads see more limited use than threads in other languages, so the upside is correspondingly limited.

You can always implement per-object or a global pandas lock in your own code, if threads are what you want.
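A minimal sketch of that caller-side approach, reusing the reproduction from the top of this issue. The lock `_df_lock` and the helper names are hypothetical; pandas does not provide them, and this only works if every thread that touches the frame goes through the same lock.

```python
import threading

import pandas as pd

# Hypothetical application-level lock; pandas itself has no such object.
_df_lock = threading.Lock()


def add_and_drop_column(df, rounds):
    """Writer: add and delete column 0, taking the lock for each mutation."""
    for _ in range(rounds):
        with _df_lock:
            df[0] = pd.Series([1, 2, 3])
        with _df_lock:
            del df[0]


def snapshot(df):
    """Reader: copy the frame under the lock so no mutation can interleave."""
    with _df_lock:
        return df.copy()


def read_many(df, rounds, errors):
    """Reader loop: record any exception instead of killing the thread."""
    for _ in range(rounds):
        try:
            df_copy = snapshot(df)
            # The copy is private to this thread, so no lock is needed here.
            if 0 in df_copy and 1 in df_copy[0]:
                _ = df_copy[0][1]
        except Exception as exc:
            errors.append(exc)


df = pd.DataFrame()
errors = []
writer = threading.Thread(target=add_and_drop_column, args=(df, 200))
reader = threading.Thread(target=read_many, args=(df, 200, errors))
writer.start()
reader.start()
writer.join()
reader.join()
```

The cost is that readers and writers are fully serialized around the shared frame; only work on the private copy runs outside the lock.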

Pushing back to 0.12, at least.

Comment From: wesm

That's not quite true-- for example most things are threadsafe and we've ensured that e.g. IO functions can be run in separate threads. Perhaps we should just acquire a lock inside the copy functions for now

Comment From: jreback

copy might be thread safe with a single dtype (but probably not); with multiple dtypes it is currently not thread safe (as @wes points out, a lock will fix all this). I would be in favor of providing this as an option, defaulting to False though.

Comment From: ghost

I was thinking of #2440. Perhaps parts of pandas are thread-safe, but afaik there's no list of what's safe and what isn't, and users who tried have already hit the unsafe parts.

Comment From: kokes

I cannot replicate the error posted, see this gist. Either there have been new developments in atomicity of pandas or perhaps threading has a different scheduler, or...?

(Tried under both Python 2.7 and 3.5)

Comment From: jreback

this almost certainly had to do with the unsafe-threadness (is that a word?) of numexpr. numexpr>=2.5 (and even >=2.4.6) no longer mucks with the global thread state. @kokes what version do you have?

Comment From: kokes

Good! I've got 2.4.6 (under conda), upgraded to 2.5 and got the same.

Comment From: allComputableThings

copy might be thread safe with a single dtype

Doubtful: See: https://github.com/pandas-dev/pandas/issues/25870

It seems you currently can't use a pandas Series for a 'read-only' hash lookup in a threaded environment.

It fails on the second call to Series.reindex(..., copy=True) - I was extremely surprised by it, thinking the operation to be non-mutating. I would have expected any hidden object state, such as built indexes, to be finalized at the end of the first call, and subsequent calls to be safe.

Comment From: allComputableThings

I'm deeply confused about this issue. The original discussion was that .copy is not thread safe. My assumption was that it would not be, because someone may be writing to the dataframe. However, is .copy also unsafe in other situations (where no-one is performing impure functional operations, such as modifying the columns/index/labels/cells of a dataframe or series)?

I ask because my expectation is that Series.reindex(..., copy=True) is a pure function (except for memoization of things like the internal index). Yet it seems not to be thread safe even while no other kind of operation is happening. A copy is happening, but no-one is writing. So what?

s = pd.Series(...)
f(s)  # Success!

# Thread 1:
   while True: f(s)  

# Thread 2:
   while True: f(s)  # BANG! Exception !

... where f(s): s.reindex(..., copy=True). Can the thread-unsafeness of .copy really be the cause?
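A runnable version of the pattern above might look like the following sketch. The Series contents, the requested labels, and the helper `run_concurrently` are illustrative assumptions, since the report elides the actual data and arguments. Whether any thread raises depends on the pandas version, so no particular outcome is claimed here.

```python
import threading

import pandas as pd


def run_concurrently(fn, n_threads=4, n_iters=200):
    """Call fn() n_iters times in each thread; return the exceptions raised."""
    errors = []
    start = threading.Barrier(n_threads)

    def worker():
        start.wait()  # release all threads together to maximise overlap
        for _ in range(n_iters):
            try:
                fn()
            except Exception as exc:
                errors.append(exc)
                return  # stop this thread after its first failure

    threads = [threading.Thread(target=worker) for _ in range(n_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return errors


# Illustrative data only: a string-indexed Series used as a lookup table.
s = pd.Series(range(1000), index=[f"k{i}" for i in range(1000)])
wanted = ["k1", "k500", "missing"]

s.reindex(wanted, copy=True)  # single-threaded warm-up call, as in f(s) above
errors = run_concurrently(lambda: s.reindex(wanted, copy=True))
print(f"{len(errors)} thread(s) raised")
```

Collecting exceptions instead of letting threads die keeps the failure mode visible, which matters because an unhandled exception in a worker thread does not stop the main program.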

Comment From: buhtz

Why is threading useful with Pandas? Threading helps when you have too much IO.

But with Pandas you do a lot of CPU stuff. In that case multiprocessing would be much better - if you have enough RAM.

Comment From: allComputableThings

There are lots of reasons to want threading that are unrelated to IO. In fact, in most languages except Python threading is the first choice for parallelism. In my use case, I was hoping to use Pandas to hold a large static datatable (~8Gb) to answer optimized web requests (where a database would have been excessively slow). Python's forking/spawning of separate processes can carry excessive overhead from shifting data between the processes, and, try as you might, copy-on-write ends up consuming a lot of memory if you have a lot of processes. If your data access is shared and read-only, being able to access it from threads is optimal. Threading is frowned upon in Python circles only because Python hasn't been able to rid itself of the unresolved GIL design bug. However, the GIL is a non-issue for me because most of the heavy lifting can be done by C code that doesn't involve the interpreter.

As for Pandas, it's not threadsafe in any sense that you can rely on. Not even for read-only use cases, because pandas is not read-only, even for read-type operations. https://stackoverflow.com/questions/13592618/python-pandas-dataframe-thread-safe/55382886#55382886

Stuart


Comment From: buhtz

Thanks for your thoughts, which help me dive deeper into pandas-thinking. ;)

I am aware of Python's "GIL problem". But in some cases it can be used as an advantage. E.g. in the context of non-thread-safe Pandas, with multiprocessing I have to duplicate the data between the processes, but then I do not have to think about race conditions anymore.

But am I right to say that threads always run on the same CPU core, no matter which language (C, Python) they come from?

There are lots of reasons to want threading that are unrelated to IO. In fact, in most languages except Python threading is the first choice for parallelism.

It is not "parallel" when running on the same Core - IMHO.

In my use case, I was hoping to use Pandas to hold a large static datatable (~8Gb) to answer optimized web requests

That is a nice IO use case. Web requests are IO because the thread has to wait a lot of time for the data.

Comment From: allComputableThings

It is not "parallel" when running on the same Core - IMHO.

That is true for the Python interpreter only. Generally, threads of a single process can use multiple cores, and vectorized code (called from Python) can make use of multiple cores.
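As a small illustration of that point, here is a sketch that uses only the standard library rather than pandas. It relies on the documented behaviour that CPython's `hashlib` releases the GIL while hashing large buffers, so the worker threads below can occupy several cores at once. The buffer sizes and worker count are arbitrary choices.

```python
import hashlib
import os
from concurrent.futures import ThreadPoolExecutor

# Four independent 8 MiB buffers of random bytes (sizes are arbitrary).
chunks = [os.urandom(8 * 1024 * 1024) for _ in range(4)]


def digest(buf):
    # The hashing runs in C with the GIL released for large inputs,
    # so several of these calls can execute on different cores at once.
    return hashlib.sha256(buf).hexdigest()


# Baseline: hash the buffers one after another in the main thread.
sequential = [digest(c) for c in chunks]

# Same work spread over four threads of a single process.
with ThreadPoolExecutor(max_workers=4) as pool:
    threaded = list(pool.map(digest, chunks))
```

Both runs produce the same digests; only the wall-clock time differs, and by how much depends on the machine. Pure-Python loops would not benefit in the same way, because they hold the GIL.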

In my use case, I was hoping to use Pandas to hold a large static datatable (~8Gb) to answer optimized web requests

That is a nice IO use case. Web requests are IO because the thread has to wait a lot of time for the data.

In my case, Python was the database. I had static data and the need to aggregate and process some hundreds of thousands of records for each request. SQL doesn't provide an efficient query language for large matrix operations (our queries took under a second in memory, but some minutes to run in SQL, even with careful indexing). This case is not unusual: using pandas or numpy to do what is too slow or cumbersome in SQL.

So, CPU bound, not IO bound, since there was no external database to wait on.

To resolve this problem, we switched to numpy for these queries, since pandas didn't support running multiple queries concurrently in a safe way.
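The setup described here could be sketched roughly as follows. This is not the actual code from that project: the table shape, the `query` function, and the thread pool size are all hypothetical. It assumes that concurrent reads of a NumPy array that nobody writes to are safe, and marks the array read-only so an accidental write fails loudly.

```python
import numpy as np
from concurrent.futures import ThreadPoolExecutor

rng = np.random.default_rng(0)
table = rng.random((100_000, 8))  # hypothetical static table
table.setflags(write=False)       # in-place writes now raise ValueError


def query(request):
    """Hypothetical request: column means over rows where one column exceeds a threshold."""
    col, threshold = request
    mask = table[:, col] > threshold
    # Boolean indexing builds a new array, so each thread works on its own result.
    return table[mask].mean(axis=0)


requests = [(col, 0.5) for col in range(8)]

# Serve all requests from a small pool of threads sharing the one table.
with ThreadPoolExecutor(max_workers=4) as pool:
    results = list(pool.map(query, requests))
```

Because the table is never copied per worker, memory use stays at one copy of the data regardless of how many threads serve requests.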

Comment From: MarcoGorelli

I would be in favor of providing this as an option, default to False though

a decade's passed and nobody's implemented this - let's close for now then

Comment From: shoyer

I think it would make sense to consider reopening this issue. Multi-threaded pandas is increasingly common (e.g., inside Dask) and will only become more common in the future with the removal of Python's GIL.

Comment From: jbrockmendel

Discussed briefly on this week's dev call. My main question is whether copy is in some way special, or if we're inevitably going to be asked to put locks around all public APIs.

Comment From: alippai

@jbrockmendel My expectation is that copy() shouldn't mutate the underlying data structure of the original DF/Series/Index. This means no locking would be needed for a read-only workload. Having pure, side-effect-free functions sounds like a smaller task than supporting threading everywhere.