Is your feature request related to a problem?
I wish I could use pandas to drop NaN values using a threshold that is a fraction of the column or row length, not an absolute count. See the detailed example below.
Describe the solution you'd like
thresh : int or float, optional. If int, require that many non-NA values; if float, require that fraction of non-NA values.
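The float behaviour described above can be approximated today with a small wrapper. This is a minimal sketch, not pandas API: `dropna_frac` is a hypothetical helper name, and rounding the fractional count up with `math.ceil` is an assumption about how a float `thresh` would be interpreted.

```python
import math

import numpy as np
import pandas as pd


def dropna_frac(df, frac, axis=0):
    """Hypothetical helper: keep rows/columns with at least `frac` non-NA values."""
    if not 0.0 <= frac <= 1.0:
        raise ValueError("frac must be between 0 and 1")
    # axis=0 drops rows, so each row is judged over the number of columns;
    # axis=1 drops columns, so each column is judged over the number of rows.
    n_along = df.shape[1] if axis == 0 else df.shape[0]
    # Assumption: round up, so 0.4 of 4 values means at least 2 values.
    min_count = math.ceil(frac * n_along)
    return df.dropna(axis=axis, thresh=min_count)


df = pd.DataFrame(
    {
        "a": [1.0, np.nan, np.nan, np.nan],
        "b": [1.0, 2.0, 3.0, np.nan],
        "c": [1.0, 2.0, 3.0, 4.0],
    }
)

cols_kept = dropna_frac(df, 0.5, axis=1)  # columns at least half filled
rows_kept = dropna_frac(df, 0.5, axis=0)  # rows at least half filled
print(list(cols_kept.columns))
print(list(rows_kept.index))
```

Because the helper takes the frame as its first argument, it can also be used in a chain as `df.pipe(dropna_frac, 0.5, axis=1)`.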
API breaking implications
The thresh parameter only needs to be extended to accept floats as well as ints.
Describe alternatives you've considered
None. IMHO this solution is too simple and effective to consider other options.
Additional context
Example
import numpy as np
import pandas as pd
import missingno as msno
X = pd.DataFrame(
    {
        'ones': np.ones(50),
        'rand': np.random.normal(size=50),
        'linear': np.linspace(1, 50),
    }
)
y = 3 * np.sin(X.linear) + X.ones + X.rand
y = y.rename('sin_target')
X.loc[4:10, 'ones'] = np.nan
X.loc[4:20, 'rand'] = np.nan
X.loc[40:45, ['linear', 'ones', 'rand']] = np.nan
y.loc[23:25] = np.nan
Xy = pd.concat([X, y], axis=1)
msno.matrix(Xy)
Xy_t = Xy.dropna(thresh=0.7 * len(Xy), axis=1)
# Proposed: Xy_t = Xy.dropna(thresh=0.7, axis=1)
msno.matrix(Xy_t)
# Rows are judged over the number of columns, so this call uses axis=0.
Xy_t = Xy.dropna(thresh=0.4 * Xy.shape[1], axis=0)
# Proposed: Xy_t = Xy.dropna(thresh=0.4, axis=0)
msno.matrix(Xy_t)
This is clearly useful when the size of the df axis is liable to change, and when chaining piped calls, since it saves the extra line of code calculating thresh_int = 0.4 * Xy.shape[1].
Comment From: attack68
This seems like a reasonably sensible and simple extension, although it might need careful coding to distinguish 1 (int) from 1.00 (float); 0 (int) and 0.00 (float) give the same result.
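The int/float ambiguity can be made concrete with a small dispatch function. This is only a sketch under assumptions: `resolve_thresh` is a hypothetical name, floats are taken to be fractions in [0, 1], and fractional counts are rounded up.

```python
import math


def resolve_thresh(thresh, axis_len):
    """Hypothetical: turn an int count or float fraction into a non-NA count."""
    # bool is a subclass of int, so reject it before the int check.
    if isinstance(thresh, bool):
        raise TypeError("thresh must be an int or a float, not bool")
    if isinstance(thresh, int):
        return thresh  # absolute number of non-NA values
    if isinstance(thresh, float):
        if not 0.0 <= thresh <= 1.0:
            raise ValueError("a float thresh must be between 0 and 1")
        return math.ceil(thresh * axis_len)  # fraction of the axis length
    raise TypeError("thresh must be an int or a float")


print(resolve_thresh(1, 50))    # int 1: one non-NA value required
print(resolve_thresh(1.0, 50))  # float 1.0: all 50 values required
print(resolve_thresh(0, 50), resolve_thresh(0.0, 50))  # both require nothing
```

The sketch shows why the type matters: `1` and `1.0` compare equal in Python but would need to mean very different thresholds, while `0` and `0.0` happen to agree.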
I don't necessarily recommend this, but while I'm looking at it: for more advanced missing-data pipelining, there may be a case for thresh to actually be a callable, based on the row or column that is being sampled.
Comment From: jreback
this is a duplicate issue - pls search (prior one is closed as this is not a good api)
Comment From: jreback
duplicate of #35299
if you want to propose a new api go ahead, though am loath to add any additional keywords.
Comment From: rhshadrach
I don't see why the existing alternative mentioned here is not sufficient.
@attack68
I don't necessarily recommend this but while I'm looking at it there may be a case for more advanced missing data pipelining for thresh to actually be a callable, based on the row, column that is being sampled.
That shouldn't go in the dropna method though, right? That sounds more like a general filter (there was a recent issue on this).
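To illustrate what a general filter with a callable could look like outside of dropna, here is a rough sketch using only existing pandas calls (`DataFrame.apply` and `.loc`). The names `filter_columns` and `enough_data`, and the per-column rule itself, are hypothetical and chosen just for this example.

```python
import numpy as np
import pandas as pd


def filter_columns(df, predicate):
    """Hypothetical: keep the columns for which predicate(column) is True."""
    keep = df.apply(predicate, axis=0).astype(bool)  # one flag per column
    return df.loc[:, keep]


def enough_data(col):
    # Illustrative rule: column "c" must be 90% filled, all others 50%.
    required = 0.9 if col.name == "c" else 0.5
    return col.notna().mean() >= required


df = pd.DataFrame(
    {
        "a": [1.0, np.nan, np.nan, np.nan],
        "b": [1.0, 2.0, 3.0, np.nan],
        "c": [1.0, 2.0, 3.0, 4.0],
    }
)

filtered = filter_columns(df, enough_data)
print(list(filtered.columns))
```

The callable sees the whole column, including its name, so the rule can vary per column without adding anything to dropna.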
Comment From: attack68
@rhshadrach yes, probably right. I am not up to date with discussions on this; it was more a high-level view that more technical missing-data routines are becoming more necessary (particularly in my field) and pandas might get requests for flexibility in this regard. As for where this should go, I can agree that dropna might serve a more basic purpose and that it is best to keep it basic.
Comment From: jbrockmendel
The requested feature is roughly:
def new_func(self, thresh):
    pct_na = self.isna().sum(axis=0) / len(self)
    mask = pct_na <= thresh
    return self.loc[:, mask]
If there's a 3-liner available, I don't think this merits a new method.
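For reference, the three-liner can be used as a standalone function in a method chain via `DataFrame.pipe`. This is a sketch: `drop_sparse_columns` and `max_na_frac` are hypothetical names. Note that it takes a maximum fraction of NA values, which is the complement of the minimum non-NA fraction in the original request (at most 0.3 missing corresponds to at least 0.7 present).

```python
import numpy as np
import pandas as pd


def drop_sparse_columns(df, max_na_frac):
    """Hypothetical: keep columns whose NA fraction is at most max_na_frac."""
    pct_na = df.isna().mean(axis=0)  # same as isna().sum(axis=0) / len(df)
    return df.loc[:, pct_na <= max_na_frac]


df = pd.DataFrame(
    {
        "a": [1.0, np.nan, np.nan, np.nan],  # 75% missing
        "b": [1.0, 2.0, 3.0, np.nan],        # 25% missing
        "c": [1.0, 2.0, 3.0, 4.0],           # nothing missing
    }
)

# No intermediate thresh_int variable is needed inside the chain.
result = df.pipe(drop_sparse_columns, max_na_frac=0.3)
print(list(result.columns))
```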
Comment From: rhshadrach
Agree with @jbrockmendel , closing.