Feature Type

  • [x] Adding new functionality to pandas

  • [ ] Changing existing functionality in pandas

  • [ ] Removing existing functionality in pandas

Problem Description

While working with pandas DataFrames during exploratory data analysis (EDA), analysts frequently perform the same manual steps to understand their dataset:

  • Count null and non-null values
  • Check unique value counts
  • Compute the percentage of missing values

These operations are often repeated multiple times, especially after data cleaning, filtering, or merging. Currently, users rely on combinations like:

df.isnull().sum()
df.nunique()
df.notnull().sum()

There is no single built-in pandas utility that offers this all-in-one diagnostic view.
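For illustration, the manual workflow described above might look like the sketch below. The sample data and the column labels (`unique`, `non_null`, `missing`, `missing_pct`) are assumptions chosen for this example, not names taken from pandas.

```python
import numpy as np
import pandas as pd

# Small illustrative frame (made-up data) with gaps in both columns.
df = pd.DataFrame(
    {
        "city": ["Oslo", "Lima", None, "Oslo"],
        "temp": [3.5, np.nan, np.nan, 4.0],
    }
)

# Three separate calls, each returning its own Series indexed by column name.
missing = df.isnull().sum()
unique = df.nunique()
non_null = df.notnull().sum()

# The percentage is a fourth step: the mean of the boolean mask is the
# fraction of missing values per column.
missing_pct = (df.isnull().mean() * 100).round(2)

# Stitching the pieces into one table is left to the user.
manual_summary = pd.concat(
    [unique, non_null, missing, missing_pct],
    axis=1,
    keys=["unique", "non_null", "missing", "missing_pct"],
)
print(manual_summary)
```

Each step is simple on its own, but the sequence has to be retyped or copied after every cleaning, filtering, or merge step.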

Feature Description

Add a utility function pd.check(df) that returns a concise column-wise summary of a DataFrame’s structure, including:

  • Unique values per column
  • Non-null value counts
  • Missing value counts
  • Missing percentages (rounded to 2 decimals by default)

This function is designed to streamline early-stage exploratory data analysis by combining several common pandas operations into one reusable utility.

Suggested API:

def check(df: pd.DataFrame, round_digits: int = 2) -> pd.DataFrame: ...

  • Optional round_digits parameter to control percentage precision
  • Returns a pandas DataFrame
  • No side effects (no printing)
  • Consistent with existing summary methods such as DataFrame.describe()
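A minimal sketch of how such a function could behave is shown below. It is written as a standalone function because `pd.check` does not exist in pandas today; the output column names are assumptions for illustration and may differ from those used in the pandas_eda_check package.

```python
import numpy as np
import pandas as pd


def check(df: pd.DataFrame, round_digits: int = 2) -> pd.DataFrame:
    """Return a per-column summary of unique, non-null, and missing values.

    Hypothetical sketch of the proposed API; column names are assumed.
    """
    is_missing = df.isna()
    return pd.DataFrame(
        {
            "unique": df.nunique(),  # NaN is excluded by default
            "non_null": df.notna().sum(),
            "missing": is_missing.sum(),
            # Mean of the boolean mask is the fraction missing per column.
            "missing_pct": (is_missing.mean() * 100).round(round_digits),
        }
    )


# Usage on a small made-up frame.
sample = pd.DataFrame({"a": [1.0, np.nan, np.nan], "b": ["x", "y", "x"]})
summary = check(sample)
coarse = check(sample, round_digits=0)
print(summary)
```

The result is an ordinary DataFrame indexed by the original column names, so it can be filtered or sorted like any other frame, for example `summary[summary["missing"] > 0]`.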

Alternative Solutions

There are existing pandas functions like:

  • df.info() – shows non-null counts and data types
  • df.describe() – provides statistical summaries (numeric columns only by default)
  • df.isnull().sum() – shows missing values per column
  • df.nunique() – shows unique counts

However, none of these provide a combined summary in a single DataFrame format. Users must manually combine several operations, which can be repetitive and error-prone.

Third-party options:

pandas-profiling (since renamed ydata-profiling) and sweetviz offer full data profiling, but they are heavyweight, generate HTML reports, and are not ideal for lightweight inspection or script-based pipelines.

My package pandas_eda_check implements this specific summary cleanly and could be a minimal addition to pandas.

Additional Context

Why in pandas?

  • Aligns with pandas’ mission of being a one-stop shop for tabular data operations
  • Adds convenience and consistency to common EDA workflows
  • Minimal overhead and easy to implement
  • Could serve as a precursor to a more comprehensive eda submodule in the future

Reference Implementation

I've implemented this in an open-source utility here: 🔗 https://github.com/CS-Ponkoj/pandas_eda_check

PyPI: https://pypi.org/project/pandas-eda-check/

Open to Feedback

I’d love to hear from the maintainers and community about:

  • Whether this function aligns with pandas’ philosophy
  • Suggestions to improve API or return format

If the proposal is accepted, I’m happy to submit a PR with tests and docs.

Thanks for your time and consideration.

Ponkoj Shill
PhD Candidate, ML Engineer
Email: csponkoj@gmail.com

Comment From: ishaan1234

take