Pandas version checks
- [x] I have checked that this issue has not already been reported.
- [x] I have confirmed this bug exists on the latest version of pandas.
- [x] I have confirmed this bug exists on the main branch of pandas.
Reproducible Example
import numpy as np
import pandas as pd

rng = np.random.default_rng()
n_rows = 3000000
timestamps = rng.random(n_rows) * 1000
theta = rng.random(n_rows) * 666
random_df = pd.DataFrame({"timestamp": timestamps, "theta": theta})
# Note: sort_values returns a new DataFrame; the result is discarded here,
# so random_df is written to disk unsorted.
random_df.sort_values("timestamp")
random_df.to_csv("D:\\random_df.csv")

test_df = pd.read_csv("D:\\random_df.csv")
current_time = 0
for trial in np.arange(0, 50):
    end_time = current_time + 10.0
    # Select theta values whose timestamp falls in (current_time, end_time).
    selected_data = test_df.loc[(test_df["timestamp"] > current_time) & (test_df["timestamp"] < end_time), "theta"]
    print(f"trial {trial}, {selected_data.shape[0]} rows found")
    if selected_data.shape[0] == 0:
        # Retry the exact same selection when nothing was returned.
        selected_data = test_df.loc[(test_df["timestamp"] > current_time) & (test_df["timestamp"] < end_time), "theta"]
        print(f"tried again, {selected_data.shape[0]} rows found")
    current_time = end_time + 1.0
Issue Description
Hi all, I'm trying to select data from a large DataFrame (3 million rows, about 0.5 GB) that I created previously, saved as a CSV, and then read back into a DataFrame. At random, and without raising any error, selecting data with a boolean condition returns an empty Series even though matching rows exist. If I run the same code multiple times, the selection fails for different subsets of the data. If, within the same run, I check whether an empty result has been returned and then try to select the exact same data again, the data is often (but not always) found. If this is a memory issue, it seems like an error should be raised. Thanks!!
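To help narrow down where the rows go missing, here is a minimal sketch of a cross-check that compares each `.loc` selection against a count computed directly on the underlying NumPy array. The helper name `count_window_mismatches`, its parameters, and the smaller in-memory frame are assumptions made for illustration; they are not part of the original reproduction, and the sketch skips the CSV round trip on purpose so the two cases can be compared.

```python
import numpy as np
import pandas as pd


def count_window_mismatches(df, n_trials=50, width=10.0, gap=1.0):
    """Compare pandas .loc selection against a NumPy reference count per window.

    Returns a list of (trial, pandas_count, numpy_count) tuples for every
    window where the two counts disagree. Hypothetical helper for diagnosis.
    """
    ts = df["timestamp"].to_numpy()
    mismatches = []
    current_time = 0.0
    for trial in range(n_trials):
        end_time = current_time + width
        mask = (df["timestamp"] > current_time) & (df["timestamp"] < end_time)
        selected = df.loc[mask, "theta"]
        # Reference count computed on the raw NumPy array, bypassing .loc.
        expected = int(np.count_nonzero((ts > current_time) & (ts < end_time)))
        if selected.shape[0] != expected:
            mismatches.append((trial, selected.shape[0], expected))
        current_time = end_time + gap
    return mismatches


# Smaller in-memory frame (assumed size) so the check runs quickly.
rng = np.random.default_rng(0)
n_rows = 100_000
demo_df = pd.DataFrame({
    "timestamp": rng.random(n_rows) * 1000,
    "theta": rng.random(n_rows) * 666,
})
mismatches = count_window_mismatches(demo_df)
print(f"{len(mismatches)} windows disagree with the NumPy reference")
```

Running the same helper on `test_df` after the CSV round trip would show whether any disagreement appears only at the larger size or only after reading from disk.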
Expected Behavior
The data is selected on the first try, or an error is raised if it's a memory issue.