Pandas ENH: Allow dropna to accept floats [0, 1] as thresh values

Is your feature request related to a problem?

I wish I could use pandas to drop NaN values using a threshold that is a fraction of the total column/row, not an absolute number. See detailed example below.

Describe the solution you'd like

thresh : int or float, optional
    If int, require that many non-NA values. If float, require that fraction of non-NA values.
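One way the proposed semantics might map onto the existing integer threshold is sketched below. The helper name `resolve_thresh` and its error handling are assumptions made for illustration; nothing like it exists in pandas.

```python
import math


def resolve_thresh(thresh, axis_len):
    """Hypothetical helper: turn an int-or-float thresh into a required non-NA count.

    axis_len is the number of values inspected per row or column.
    """
    # bool is a subclass of int, so reject it explicitly
    if isinstance(thresh, bool):
        raise TypeError("thresh must be an int or a float, not a bool")
    if isinstance(thresh, int):
        # Existing behaviour: an absolute number of non-NA values
        return thresh
    if isinstance(thresh, float):
        if not 0.0 <= thresh <= 1.0:
            raise ValueError("a float thresh must lie in [0, 1]")
        # count >= frac * n holds exactly when count >= ceil(frac * n)
        return math.ceil(thresh * axis_len)
    raise TypeError("thresh must be an int or a float")


required = resolve_thresh(0.5, 50)  # half of 50 values must be non-NA
```

Rounding up keeps the result equivalent to the current workaround of passing `frac * n` directly, because the non-NA count is always an integer.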

API breaking implications

thresh only needs to be extended to accept floats as well as ints.

Describe alternatives you've considered

None. IMHO this solution is too simple and effective to consider other options.

Additional context

Example

import numpy as np
import pandas as pd
import missingno as msno


X = pd.DataFrame(
    {
        'ones': np.ones(50),
        'rand': np.random.normal(size=50),
        'linear': np.linspace(1, 50),
    }
)

y = 3 * np.sin(X.linear) + X.ones + X.rand
y = y.rename('sin_target')

X.loc[4:10, 'ones'] = np.nan
X.loc[4:20, 'rand'] = np.nan
X.loc[40:45, ['linear', 'ones', 'rand']] = np.nan
y.loc[23:25] = np.nan

Xy = pd.concat([X, y], axis=1)

msno.matrix(Xy)


Xy_t = Xy.dropna(thresh=0.7 * len(Xy), axis=1)
# Proposed: Xy_t = Xy.dropna(thresh=0.7, axis=1)
msno.matrix(Xy_t)


# Rows are compared against the number of columns, hence axis=0 with Xy.shape[1]
Xy_t = Xy.dropna(thresh=0.4 * Xy.shape[1], axis=0)
# Proposed: Xy_t = Xy.dropna(thresh=0.4, axis=0)
msno.matrix(Xy_t)


It should be obvious that this is very useful when the size of the df axis is liable to change, and when using piped functionality, since it saves the extra line of code that calculates thresh_int = 0.4 * Xy.shape[1].
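For comparison, the piped case can already be written with the current API by computing the threshold inside the chain. This is only a sketch on a small made-up frame; the column names and the 0.7 fraction are illustrative.

```python
import math

import numpy as np
import pandas as pd

df = pd.DataFrame(
    {
        "a": [1.0, np.nan, 3.0, 4.0],        # 3 of 4 values present
        "b": [np.nan, np.nan, np.nan, 4.0],  # 1 of 4 values present
        "c": [1.0, 2.0, 3.0, 4.0],           # fully populated
    }
)

# The lambda sees the frame as it exists at this point in the chain,
# so the threshold follows the current number of rows.
result = df.pipe(lambda d: d.dropna(thresh=math.ceil(0.7 * len(d)), axis=1))
```

Here the column `b` is dropped because only a quarter of its values are present, while `a` and `c` are kept.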

Comment From: attack68

This seems like a reasonably sensible and simple extension, although it might need careful coding around the int 1 versus the float 1.00 (the int 0 and the float 0.00 mean the same thing).
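To make the int 1 versus float 1.00 concern concrete, here is a small sketch. The float reading is emulated with the existing API, since `dropna` does not interpret fractions today.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    {
        "full": [1.0, 2.0, 3.0],
        "sparse": [1.0, np.nan, np.nan],
    }
)

# Existing meaning of thresh=1: keep columns with at least one non-NA value
kept_int = df.dropna(thresh=1, axis=1)

# Proposed meaning of thresh=1.0: keep columns that are 100% non-NA,
# emulated here by requiring len(df) non-NA values
kept_float = df.dropna(thresh=len(df), axis=1)

# The two values compare equal, so any dispatch would have to look at the type
values_equal = 1 == 1.0
```

Because `1 == 1.0` is true in Python, an implementation would have to branch on the type of the argument rather than its value, which is where the careful coding comes in.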

I don't necessarily recommend this but while I'm looking at it there may be a case for more advanced missing data pipelining for thresh to actually be a callable, based on the row, column that is being sampled.

Comment From: jreback

this is a duplicate issue - pls search (prior one is closed as this is not a good api)

Comment From: jreback

duplicate of #35299

if you want to propose a new api go ahead, though am loath to add any additional keywords.

Comment From: rhshadrach

I don't see why the existing alternative mentioned here is not sufficient.

@attack68

I don't necessarily recommend this but while I'm looking at it there may be a case for more advanced missing data pipelining for thresh to actually be a callable, based on the row, column that is being sampled.

That shouldn't go in the dropna method though, right? That sounds more like a general filter (there was a recent issue on this).

Comment From: attack68

@rhshadrach yes, probably right. I am not up to date with discussions on this; mine is more a high-level view that more technical missing data routines are becoming more necessary (particularly in my field) and pandas might get requests for flexibility in this regard. As for where this should go, I agree that dropna serves a more basic purpose and is best kept basic.

Comment From: jbrockmendel

The requested feature is roughly:

def new_func(self, thresh):
    # Fraction of NA values in each column
    pct_na = self.isna().sum(axis=0) / len(self)
    # Keep columns whose NA fraction does not exceed thresh
    mask = pct_na <= thresh
    return self.loc[:, mask]

If there's a 3-liner available, I don't think this merits a new method.
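Note that the function above treats thresh as a maximum fraction of NA values, whereas the request words it as a minimum fraction of non-NA values. A variant in the request's terms, usable with `pipe`, might look like the following; the name `drop_sparse_columns` and the sample frame are assumptions for illustration only.

```python
import numpy as np
import pandas as pd


def drop_sparse_columns(df, min_frac):
    """Hypothetical helper: keep columns with at least min_frac non-NA values."""
    # notna() gives booleans, so the column mean is the fraction present
    frac_present = df.notna().mean(axis=0)
    return df.loc[:, frac_present >= min_frac]


df = pd.DataFrame(
    {
        "a": [1.0, np.nan, 3.0, 4.0],        # 75% present
        "b": [np.nan, np.nan, np.nan, 4.0],  # 25% present
    }
)

result = df.pipe(drop_sparse_columns, min_frac=0.5)
```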

Comment From: rhshadrach

Agree with @jbrockmendel , closing.