Feature Type

  • [x] Adding new functionality to pandas

  • [x] Changing existing functionality in pandas

  • [x] Removing existing functionality in pandas

Problem Description

Currently, Pandas does not provide a native way to store and retrieve complex data types like NumPy arrays (e.g., embeddings) in formats such as CSV without converting them to strings. This results in a loss of structure and requires additional parsing during data retrieval.

Many machine learning practitioners store intermediate results, including embeddings and model-generated representations, in Pandas DataFrames. However, when saving such data using CSV, the complex data types are converted to string format, making them difficult to work with when reloading the DataFrame.
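For concreteness, a minimal reproduction of the round-trip loss described above (the file name is arbitrary):

import numpy as np
import pandas as pd

df = pd.DataFrame({"id": [1], "embedding": [np.array([0.1, 0.2, 0.3])]})
df.to_csv("data.csv", index=False)

df_loaded = pd.read_csv("data.csv")
print(type(df_loaded["embedding"][0]))  # <class 'str'> -- the array came back as text
print(df_loaded["embedding"][0])        # [0.1 0.2 0.3]  (a repr string, awkward to parse)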

Current Workarounds:

  • Pickle: While Pickle retains structure, it is not cross-platform friendly and has security concerns when loading untrusted files.
  • Parquet: Parquet supports complex structures better, but may not be the default choice for the many users who rely on CSV.
  • Manual Parsing: Users often need to reprocess stringified NumPy arrays back into their original format, which is inefficient and error-prone.

Feature Request:

Introduce an option in Pandas to serialize and deserialize complex data types when saving to and loading from CSV, possibly by:

  • Allowing automatic conversion of NumPy arrays to lists during CSV storage.
  • Providing a built-in method for reconstructing complex data types when reading CSV files.
  • Supporting a more intuitive way to store and load multi-dimensional data efficiently, without requiring workarounds.

Feature Description

  1. Modify to_csv() to Handle Complex Data Types

Introduce a new parameter (e.g., preserve_complex=True) in to_csv() that automatically converts NumPy arrays to lists before saving.

Pseudocode:

import json

import numpy as np
import pandas as pd

class EnhancedDataFrame(pd.DataFrame):
    def to_csv(self, filename, preserve_complex=False, **kwargs):
        if preserve_complex:
            df_copy = self.copy()
            for col in df_copy.columns:
                # Check for complex types (NumPy arrays or lists)
                if isinstance(df_copy[col].iloc[0], (np.ndarray, list)):
                    # Serialize arrays/lists as JSON strings
                    df_copy[col] = df_copy[col].apply(lambda x: json.dumps(np.asarray(x).tolist()))
            return pd.DataFrame.to_csv(df_copy, filename, **kwargs)
        return super().to_csv(filename, **kwargs)

If preserve_complex=True, NumPy arrays and lists are serialized into JSON format before saving. The standard to_csv() functionality remains unchanged for other users.

  2. Modify read_csv() to Restore Complex Data Types

Introduce a restore_complex=True parameter in read_csv() that automatically detects JSON-encoded lists and converts them back to NumPy arrays.

Pseudocode:

class EnhancedDataFrame(pd.DataFrame):
    @staticmethod
    def from_csv(filename, restore_complex=False, **kwargs):
        df = pd.read_csv(filename, **kwargs)
        if restore_complex:
            for col in df.columns:
                # Check whether every value looks like a JSON-encoded list
                if df[col].apply(lambda x: isinstance(x, str) and x.startswith("[")).all():
                    # Convert JSON strings back to NumPy arrays
                    df[col] = df[col].apply(lambda x: np.array(json.loads(x)))
        return df

If restore_complex=True, JSON-encoded lists are automatically converted back to NumPy arrays when reading a CSV file.

Example Usage:

df = pd.DataFrame({'id': [1, 2], 'embedding': [np.array([0.1, 0.2, 0.3]), np.array([0.4, 0.5, 0.6])]})

# Save with complex type handling
df.to_csv("data.csv", preserve_complex=True)

# Load and restore complex types
df_loaded = pd.read_csv("data.csv", restore_complex=True)
print(df_loaded["embedding"][0])  # Output: array([0.1, 0.2, 0.3])

Expected Benefits

✅ Users will be able to save and retrieve NumPy arrays, embeddings, or complex objects easily using CSV.
✅ Reduces the need for workarounds like Pickle or manual parsing.
✅ Keeps Pandas CSV handling more intuitive for machine learning workflows.

Alternative Solutions

Enhance documentation to recommend best practices for handling complex data types with Pandas and suggest an official approach for this use case.

Additional Context:

This is a common issue in ML workflows that involve embeddings, image vectors, and multi-dimensional numerical data. Other libraries like PyArrow and Dask handle complex data better, but many users prefer Pandas for its ease of use.

Alternative Solutions in Detail

  1. Using JSON Format Instead of CSV

Instead of saving complex data to a CSV file, users can save the DataFrame as a JSON file, which supports nested data structures.

Example:

df.to_json("data.json", orient="records")
df_loaded = pd.read_json("data.json")

✅ Pros: Natively supports lists and dictionaries without conversion. Readable and widely supported format.
❌ Cons: JSON files are not as efficient as CSV for large datasets. The JSON format is not always easy to work with in spreadsheet software.

  2. Using Pickle for Serialization

Pandas provides built-in support for Pickle, which can store and retrieve complex objects.

Example:

df.to_pickle("data.pkl")
df_loaded = pd.read_pickle("data.pkl")

✅ Pros: Preserves complex data types natively. Fast read/write operations.
❌ Cons: Pickle files are not human-readable. They are Python-specific, making them less portable for cross-platform use.

  3. Using Parquet for Efficient Storage

Parquet is a columnar storage format optimized for performance that supports complex data types.

Example:

df.to_parquet("data.parquet")
df_loaded = pd.read_parquet("data.parquet")

✅ Pros: Efficient storage with better compression. Supports multi-dimensional data and preserves data types.
❌ Cons: Requires the pyarrow or fastparquet dependency. Not as universally used as CSV.

  4. Manual Preprocessing for CSV Storage

Users often manually convert complex data to JSON strings before saving them to CSV.

Example:

import json

df["embedding"] = df["embedding"].apply(lambda x: json.dumps(x.tolist()))
df.to_csv("data.csv", index=False)

df_loaded = pd.read_csv("data.csv")
df_loaded["embedding"] = df_loaded["embedding"].apply(lambda x: np.array(json.loads(x)))

✅ Pros: Works with existing Pandas functionality. CSV remains human-readable.
❌ Cons: Requires manual preprocessing each time. Error-prone and inefficient for large datasets.

Why the Proposed Feature is Needed

While these alternatives exist, they require additional dependencies or manual preprocessing, or compromise on format usability. Adding native support for preserving and restoring complex data types in Pandas CSV operations would:

  • Eliminate the need for workarounds such as JSON, Pickle, or Parquet.
  • Improve usability for machine learning and data science workflows.
  • Keep CSV files human-readable while ensuring data integrity.

Comment From: rhshadrach

Thanks for the request. Why can't the dtype argument be used here?

pd.DataFrame({"a": [1 + 3j, 1.5 + 4.5j, 1/3]}).to_csv("test.csv", index=False)
print(pd.read_csv("test.csv", dtype={"a": np.complex128}, engine="python"))
                    a
0  1.000000+3.000000j
1  1.500000+4.500000j
2  0.333333+0.000000j

Note you need to use engine="python" rather than engine="c", as the latter does not (yet) support complex. I think adding support for complex to the C engine would be welcome.

Comment From: ashishjaimongeorge

Thank you for the clarification! I am interested in working on adding support for complex numbers to the C engine. Could you please guide me on how to get started or point me to any relevant developer resources? I look forward to contributing.

Comment From: rhshadrach

Sure thing!

https://pandas.pydata.org/pandas-docs/dev/development/index.html

Comment From: Jaspvr

Hi, here is an implementation of the suggested solution described above; would it be possible to have it reviewed? There are a couple of things that should be addressed (for example, I am unable to run pytest for some reason), but it is passing the tests that I have written.

https://github.com/pandas-dev/pandas/pull/61157

Comment From: ashishjaimongeorge

Could someone confirm if this issue has been resolved? I’d appreciate any updates or details on its current status.

Comment From: rhshadrach

@ashishjaimongeorge - why have you closed this issue?

Comment From: ashishjaimongeorge

My apologies for the accidental input earlier. Could you also clarify whether the above PR addresses the issue, and whether we're exploring any other approaches?

Comment From: rhshadrach

As commented on the PR, I am negative on adding a keyword to read_csv when it seems to me that dtype already supports this. I am not aware of any other work on this issue.

Comment From: ashishjaimongeorge

While dtype works, it’s not the most user-friendly solution for everyone. For users who aren’t deeply familiar with read_csv’s options—or the fact that they need to switch to the Python engine for complex types—it can feel a bit complex and unintuitive. They have to:

  • Know that dtype supports complex types like np.complex128.
  • Understand that the C engine doesn't support complex numbers yet, so they must specify engine="python".
  • Manually set up the dtype dictionary for each relevant column, as in the sketch below.
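For illustration, a minimal sketch of what this currently requires (the file name and column names here are hypothetical):

import numpy as np
import pandas as pd

# Hypothetical file with two complex-valued columns "a" and "b".
df = pd.read_csv(
    "data.csv",
    dtype={"a": np.complex128, "b": np.complex128},  # one entry per column
    engine="python",  # required: the C engine rejects complex dtypes
)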

This process, while manageable for advanced users, adds friction for beginners or those working with complex data like embeddings in machine learning workflows. A dedicated feature like restore_complex would simplify this by automatically handling the conversion of serialized complex types (e.g., JSON-encoded NumPy arrays) back into their original form, without requiring users to juggle dtype and engine settings.

You’ve suggested that dtype already supports this use case, and I agree it’s functional—but I’d argue it’s somewhat complex to implement effectively in practice. For example:

1. Users need prior knowledge of NumPy data types (e.g., np.complex128) and how they map to their data.
2. If a DataFrame has multiple columns with complex types, setting up the dtype dictionary can get cumbersome.
3. The engine limitation means the faster C engine isn’t an option, and users might not realize why their code fails if they don’t specify engine="python".

In contrast, adding a keyword like restore_complex would abstract these details away, making it more accessible and reducing the chance of errors. It’s not about replacing dtype—it’s about offering a simpler alternative for a common use case.

A Balanced Approach - I hear your hesitation about adding another keyword to read_csv, and I don’t want to overcomplicate the API. If you’re firmly against this, I’m happy to stick with your suggestion and focus on improving how we work with dtype.

For instance, we could:

  • Enhance the documentation to clearly explain how to use dtype for complex types, including the engine caveat.
  • Add a note to the error message when the C engine fails with complex types, nudging users toward engine="python" (a rough sketch follows at the end of this comment).

That said, I still think keeping this feature as an option would be a win for usability. It doesn't take away from dtype; it just gives users a more intuitive path, especially for machine learning folks dealing with embeddings or multi-dimensional data.
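As a rough, user-side sketch of that second idea (the real change would live in the parser's error path, not in user code), the current failure could be wrapped with a hint:

import numpy as np
import pandas as pd

try:
    # Defaults to the C engine, which currently rejects complex dtypes.
    df = pd.read_csv("data.csv", dtype={"a": np.complex128})
except TypeError as err:
    raise TypeError(
        f'{err}. Hint: complex dtypes are currently only supported by the '
        'Python parser; pass engine="python".'
    ) from err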

Comment From: simonjayhawkins

Why the Proposed Feature is Needed

While these alternatives exist, they require additional dependencies or manual preprocessing, or compromise on format usability. Adding native support for preserving and restoring complex data types in Pandas CSV operations would:

  • Eliminate the need for workarounds such as JSON, Pickle, or Parquet.
  • Improve usability for machine learning and data science workflows.
  • Keep CSV files human-readable while ensuring data integrity.

It feels like the proposal is edging toward creating a new data interchange protocol rather than just a convenience tweak for CSV operations.

CSV files are, by definition, a simple, flat structure where values are separated by commas. The proposal essentially forces CSV to act as if it were a hybrid of CSV and JSON. This muddles the original purpose of CSV—keeping data easily readable and interoperable with tools like Excel or Google Sheets—by introducing nested structures that those tools don’t naturally support.

Surely, there would also need to be consideration of how all the other CSV options apply to the values of the nested data in the JSON array? The interpretation of these nested values on reading, and the formatting of them on writing, using the other available options is probably a non-trivial task. I suspect that ensuring every possible CSV parameter applies correctly to the nested values adds layers of complexity and potential edge cases.
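To make the concern concrete, here is a small illustrative sketch of two such interactions: the JSON-encoded array contains the separator, so the writer must quote the whole field, and writer options such as float_format apply only to top-level values, never to the numbers inside the JSON string:

import io
import json

import pandas as pd

df = pd.DataFrame({"embedding": [json.dumps([0.1, 0.2, 0.3])], "x": [0.123456]})
buf = io.StringIO()
df.to_csv(buf, index=False, float_format="%.2f")
print(buf.getvalue())
# embedding,x
# "[0.1, 0.2, 0.3]",0.12   <- field quoted; nested floats untouched by float_format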

Comment From: Priya09153

Hi, I would like to work on it.

Comment From: simonjayhawkins

Thanks @Priya09153 for the interest but there is no agreed way forward here.

Both myself and @rhshadrach have effectively rejected the enhancement request so far.

I would not be averse to closing the issue as "won't fix"; however, I am also happy to keep the issue open to allow further discussion towards a potentially acceptable solution.

Comment From: rhshadrach

I would like this issue to remain open to support complex with dtype using the C-engine. Alternatively, I wouldn't be averse to opening a new issue for this and closing this one.

If we did have support in the C-engine, it seems to me we could consider automatically inferring complex data. I have no sense as to whether this would be a good idea or not. But I am very much opposed to adding yet another option to read_csv.

Comment From: simonjayhawkins

I would like this issue to remain open to support complex with dtype using the C-engine.

Thanks @rhshadrach for clarifying.

To expand on your previous comment https://github.com/pandas-dev/pandas/issues/60895#issuecomment-2646345348

import pandas as pd
import io

# Create a CSV file in memory
csv_data = io.StringIO()
csv_data.write("a\n1.000000+3.000000j\n1.500000+4.500000j\n0.333333+0.000000j\n")
csv_data.seek(0)  # Reset the file pointer to the beginning

# Read the CSV into Pandas with complex number dtype
df = pd.read_csv(csv_data, dtype={"a": complex}, engine="python")

print(df.a)
# 0    1.000000+3.000000j
# 1    1.500000+4.500000j
# 2    0.333333+0.000000j
# Name: a, dtype: complex128

csv_data.seek(0)  # Reset the file pointer to the beginning

# Read the same CSV without specifying a dtype
df = pd.read_csv(csv_data, engine="python")

print(df.a)
# 0    1.000000+3.000000j
# 1    1.500000+4.500000j
# 2    0.333333+0.000000j
# Name: a, dtype: object

csv_data.seek(0)  # Reset the file pointer to the beginning

# Read the CSV with complex number dtype using the C engine
df = pd.read_csv(csv_data, dtype={"a": complex}, engine="c")

# TypeError: the dtype complex128 is not supported for parsing

And to reiterate your suggestion above, the current situation with respect to complex numbers is that the dtype needs to be specified as it is not inferred and it only works with the Python engine.

So this does seem a reasonable enhancement request IMO.

I think your enhancement suggestion is obscured in this issue by the use of the term "complex" for embeddings, which we in the pandas project tend to refer to as nested data.

The PR that was opened and which we rejected was related to nested data.

Alternatively, I wouldn't be averse to opening a new issue for this and closing this one.

SGTM