Pandas DOC: section on caveats of storing lists inside DataFrame/Series

xref to a lot of issues, for example #16864

I think we could use a doc section stating that storing nested lists/arrays inside a pandas object is best avoided, showing the downsides (performance, memory use) and a worked-out example of an alternative. This seems to be hard-earned knowledge that many people have, but I'm not sure we do a good job of stating it clearly.
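To make the downside concrete, here is a minimal sketch with made-up data (the frame and column names are hypothetical, not taken from any of the linked issues). It only illustrates that a list-valued column falls back to object dtype and needs Python-level loops, while the long layout keeps a native dtype and works with ordinary pandas operations.

```python
import pandas as pd

# Hypothetical data: each row holds a Python list in one cell.
nested = pd.DataFrame({"key": ["a", "b"], "vals": [[1, 2, 3], [4, 5]]})

# The list column is stored as object dtype: every cell is a separate Python list.
nested_dtype = nested["vals"].dtype

# Long form: one scalar per row, stored in a single integer array.
long = pd.DataFrame({"key": ["a", "a", "a", "b", "b"], "value": [1, 2, 3, 4, 5]})
long_dtype = long["value"].dtype

# Per-key sums need a Python-level loop over the cells in the nested layout ...
nested_sums = [sum(cell) for cell in nested["vals"]]

# ... but are an ordinary groupby in the long layout.
long_sums = long.groupby("key")["value"].sum()
```

Both layouts give the same sums here; the difference is that only the long one lets pandas do the work on a typed array.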

Closely related: the docs might also benefit from a short section encouraging the use of Python core data structures when appropriate.

probably goes here - http://pandas.pydata.org/pandas-docs/stable/gotchas.html

Comment From: pdpark

I'd be happy to take this, but I'm not sure what "a worked out example of an alternative" would look like. I've found a few discussions around storing lists in DataFrame cells and none of them discouraged it. This discussion on Stack Overflow is the only one I've found with alternatives: https://stackoverflow.com/questions/39661198/optimal-way-to-add-small-lists-to-pandas-dataframe. Which is the best option? Or is there another, better option? Thanks.

Comment From: jreback

https://stackoverflow.com/questions/45587778/python-explode-rows-from-panda-dataframe
https://stackoverflow.com/questions/44361160/explode-a-csv-in-python
https://stackoverflow.com/questions/38428796/how-to-do-lateral-view-explode-in-pandas

FYI, the timings are suspect, of course; these examples don't use a large enough frame to actually matter.

https://github.com/pandas-dev/pandas/issues/16538

We should make a small section on this. We should also probably just write .explode :< (note that for strings we already have this: it's the expand=True option in .str.split()).
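For reference, the string case looks roughly like this. This is a small sketch on made-up data; the Series contents are hypothetical. With expand=True, .str.split() returns a DataFrame with one column per piece instead of a Series of lists, padding shorter rows with missing values.

```python
import pandas as pd

# Hypothetical delimited strings of different lengths.
s = pd.Series(["a,b,c", "d,e"])

# expand=True spreads the pieces across columns rather than
# leaving a Python list in each cell.
wide = s.str.split(",", expand=True)
```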

Comment From: jreback

more refs

https://github.com/pandas-dev/pandas/issues/8517

http://www.markhneedham.com/blog/2015/03/23/python-equivalent-to-flatmap-for-flattening-an-array-of-arrays/
https://stackoverflow.com/questions/31080258/pysparks-flatmap-in-pandas
https://stackoverflow.com/questions/32468402/how-to-explode-a-list-inside-a-dataframe-cell-into-separate-rows

Comment From: jreback

This is pretty idiomatic / efficient.

# Assumes df is indexed by ['name', 'opponent'] and has a list-valued
# 'nearest_neighbors' column.
(pd.melt(df.nearest_neighbors.apply(pd.Series).reset_index(),
         id_vars=['name', 'opponent'],
         value_name='nearest_neighbors')
   .set_index(['name', 'opponent'])
   .drop('variable', axis=1)
   .dropna()
   .sort_index())

Comment From: pdpark

I read through the examples in the links, very informative, thanks. I'll put something together and submit a PR.

Comment From: pdpark

Just want to clarify something: this issue was opened with the intent, as I understand it, to document the fact that storing lists in DataFrames is not ideal. However, the examples above are all about how to explode lists stored in DataFrames. Is the recommended approach to create a temporary DataFrame with lists in order to create the preferred DataFrame without lists?

Comment From: jreback

No, a long-form DataFrame is ideal from both a performance and an idiomatic perspective. Those examples illustrate what to do if you already have lists.

The point is that you shouldn't have them in the first place; if you do, then you invariably need to convert them anyway.
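One way to avoid lists in the frame altogether is to flatten the raw data in plain Python before constructing the DataFrame. The sketch below assumes the input arrives as a dict keyed by (name, opponent); that input shape and the column names are hypothetical and only serve the illustration.

```python
import pandas as pd

# Hypothetical raw input: a plain dict mapping (name, opponent) to a list of neighbors.
raw = {
    ("A.J. Price", "76ers"): ["Zach LaVine", "Jeremy Lin"],
    ("A.J. Price", "blazers"): ["Nate Robinson"],
}

# Flatten in plain Python first: one record per (name, opponent, neighbor).
records = [
    (name, opponent, neighbor)
    for (name, opponent), neighbors in raw.items()
    for neighbor in neighbors
]

# The frame is long-form from the start; no cell ever holds a list.
long_df = pd.DataFrame(records, columns=["name", "opponent", "nearest_neighbor"])
```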

Comment From: pdpark

This example, also from here: https://stackoverflow.com/a/46161733, seems simpler/easier to understand?

(df.nearest_neighbors.apply(pd.Series)
   .stack()
   .reset_index(level=2, drop=True)
   .to_frame('nearest_neighbors'))

Any reason not to prefer it as the canonical example?
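To try the stack-based version end to end, here is a self-contained sketch on a small made-up frame (the data and the ragged list lengths are assumptions for illustration). It adds an explicit .dropna(), which is not in the snippet above, because whether .stack() drops the padding values on its own depends on the pandas version.

```python
import pandas as pd

# Hypothetical frame indexed by (name, opponent) with a list in each cell.
df = pd.DataFrame(
    {
        "name": ["A.J. Price", "A.J. Price"],
        "opponent": ["76ers", "blazers"],
        "nearest_neighbors": [["Zach LaVine", "Jeremy Lin"], ["Nate Robinson"]],
    }
).set_index(["name", "opponent"])

exploded = (
    df.nearest_neighbors.apply(pd.Series)  # one column per list position
    .stack()                               # move list positions into the index
    .dropna()                              # drop padding from shorter lists (added here)
    .reset_index(level=2, drop=True)       # discard the list-position level
    .to_frame("nearest_neighbors")
)
```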

Comment From: jreback

yep that prob would be a nice example

Comment From: pdpark

Cool, thanks.

Comment From: pdpark

I want to include an example of doing an "explosion" without creating an intermediary df with lists in cells. Here's my example - what do you think?

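from collections import OrderedDict

import pandas as pd

df = pd.DataFrame(OrderedDict([
    ('name', ['A.J. Price'] * 3),
    ('opponent', ['76ers', 'blazers', 'bobcats']),
    ('attribute x', ['A', 'B', 'C']),
]))

nn = [['Zach LaVine', 'Jeremy Lin', 'Nate Robinson', 'Isaia']] * 3

# Put each neighbor in its own column next to the key columns.
df2 = pd.concat([df[['name', 'opponent']], pd.DataFrame(nn)], axis=1)

# Stack the neighbor columns into one long column keyed by (name, opponent).
df3 = (df2.set_index(['name', 'opponent'])
       .stack()
       .reset_index(level=2, drop=True)
       .to_frame('nearest_neighbors'))
df3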

Comment From: pdpark

Added this change to existing pull request.

Comment From: gumus-g

Hi! I’d like to help with this doc issue by adding a section to the gotchas guide. It would explain why storing lists in DataFrame or Series cells is discouraged, and show better approaches like using explode() or apply(pd.Series). Let me know if there are examples or notes you’d like included!
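As a possible starting point for that section, this is a minimal sketch of DataFrame.explode() (available since pandas 0.25) on a small made-up frame; the data is hypothetical. Each list element gets its own row and the other columns are repeated alongside it.

```python
import pandas as pd

# Hypothetical frame with a list in each 'nearest_neighbors' cell.
df = pd.DataFrame(
    {
        "name": ["A.J. Price", "A.J. Price"],
        "opponent": ["76ers", "blazers"],
        "nearest_neighbors": [["Zach LaVine", "Jeremy Lin"], ["Nate Robinson"]],
    }
)

# explode() repeats the original index for each element, so reset it
# to get a clean 0..n-1 index on the long-form result.
long_df = df.explode("nearest_neighbors").reset_index(drop=True)
```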