Currently the repr of the DataFrame class (and any other class or method in the main namespace) shows the "full code path" of where the object is actually defined:

>>> import pandas as pd
>>> pd.DataFrame
<class 'pandas.core.frame.DataFrame'>

while we could also make it show the code path of how it is publicly exposed (and expected to be imported and used):

>>> pd.DataFrame
<class 'pandas.DataFrame'>

The above can be achieved by setting the __module__ attribute on the classes and methods. numpy has already done this for several years, so the repr of its top-level functions and objects shows "numpy.<..>" rather than things like "numpy.core.multiarray..". The main PR in numpy that implemented this: https://github.com/numpy/numpy/pull/12382
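For illustration, the effect can be seen on a toy class (a minimal sketch, not pandas' actual implementation — the class name here is just a stand-in):

```python
class DataFrame:
    """Stand-in for a class defined deep inside a package."""

# By default the repr shows the module where the class is defined:
print(repr(DataFrame))  # e.g. <class '__main__.DataFrame'>

# Overriding __module__ changes what the repr displays:
DataFrame.__module__ = "pandas"
print(repr(DataFrame))  # <class 'pandas.DataFrame'>
```

The class repr is built from `__module__` and `__qualname__`, so a single attribute assignment is enough to change the displayed path.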

I think the main benefits are:

  • Reduce the visual noise and hide implementation details from users (no regular user needs to know that the DataFrame class is defined in pandas/core/frame.py)
  • Discourage people from importing objects from where they are defined internally (e.g. from pandas.core.frame import DataFrame, a pattern we often see in downstream packages). I think this would also help with making pandas.core private (and potentially renaming it, xref https://github.com/pandas-dev/pandas/issues/27522, cc @rhshadrach)

The main disadvantage is that we thus mask where an object lives, which makes it harder for contributors to figure that out. On the draft PR, @jbrockmendel also commented:

inspired by similar implementation in numpy

Whenever I try to figure out how something in numpy works I have a hard time finding out where something is defined because they use patterns like from foo import * at the top level. I don't know if the pattern in this PR contributes to that pain point, but my spidey sense is tingling that it might.

This does not change any * imports (it only changes the visual repr), but that aside, it certainly hides a bit more where something is defined, making it harder to find the location (the file) in the source code. But this masking is the purpose of the proposal, with the idea that this is better for users (see bullet points above). It certainly comes with a drawback for contributors, but in making the trade-off, there are many more users than contributors, so I would personally prioritize that use case (and for contributors, there are still many other ways to find where something is defined: looking at our imports in the codebase, searching for "class DataFrame", ...).


Overview of objects:

  • [x] DataFrame: https://github.com/pandas-dev/pandas/pull/55171
  • [x] Series: https://github.com/pandas-dev/pandas/pull/60263
  • [x] Index classes: https://github.com/pandas-dev/pandas/pull/59909
  • [ ] dtype classes:
      • [x] https://github.com/pandas-dev/pandas/pull/59909
      • [ ] Remaining ones: StringDtype, nullable Int/Float/BooleanDtype
  • [x] Scalars: https://github.com/pandas-dev/pandas/pull/57976
  • [ ] all read_.. functions
  • [ ] concat, isna, merge, etc
  • [ ] date_range, timedelta_range etc
  • [ ] NamedAgg, IndexSlice
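The pattern applied to the checked items above is a small decorator; here is a minimal sketch modeled on numpy's set_module helper (pandas' actual decorator may differ in its details):

```python
def set_module(module):
    """Decorator overriding __module__ on a class or function for display purposes.

    Minimal sketch modeled on numpy's set_module; the real pandas helper
    may differ.
    """
    def decorator(obj):
        if module is not None:
            obj.__module__ = module
        return obj
    return decorator


@set_module("pandas")
class DataFrame:
    """Stand-in class; the real one lives in pandas/core/frame.py."""
```

Applying the decorator at the definition site keeps the override next to the class, rather than patching `__module__` in pandas/__init__.py.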

Comment From: rhshadrach

  • Discourage people from importing objects from where they are defined internally (e.g. from pandas.core.frame import DataFrame, a pattern we often see in downstream packages)

Wow: https://github.com/search?q=%22from+pandas.core.frame+import+DataFrame%22&type=code

I agree with the trade-off proposed here. I think not exposing the internal locations in such a visible way to users is more valuable than aiding developers in navigating the pandas codebase in this fashion.

Comment From: Dr-Irv

FWIW, for numpy, if you use VS Code and do a "Go to Definition" on any function that is defined in Python (not ones implemented in C), it does go to the correct file. So contributors could figure out the locations of code that way. I tested this with np.linspace(). So I imagine the same thing would happen for pandas if we made this change.
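A related note on why tooling can still find the definition: for plain Python functions the real source file stays recorded on the code object, independent of `__module__`. A small illustration (the function here is a toy stand-in):

```python
import inspect

def isna(obj):
    """Toy stand-in for a function defined deep inside a package."""
    return obj is None

isna.__module__ = "pandas"  # simulate the proposed change

# The display module changed, but the recorded source location did not:
print(isna.__module__)            # 'pandas'
print(isna.__code__.co_filename)  # the file where the function body lives

# inspect resolves functions via the code object, not __module__:
assert inspect.getfile(isna) == isna.__code__.co_filename
```

(For classes the situation differs: `inspect.getfile` on a class goes through `__module__`, so IDEs typically rely on static analysis or stubs instead.)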

Comment From: simonjayhawkins

@jorisvandenbossche can you clarify the benefits for the top-level functions? I'm able to explain what the user will see differently when we set the module on the objects, but it's not clear to me when it comes to the functions.

Comment From: simonjayhawkins

eg. for a function

>>> pd.isna.__module__
'pandas.core.dtypes.missing'
>>> repr(pd.isna)
'<function isna at 0x7efcf5e3cee0>'
>>> 

after setting __module__

>>> pd.isna.__module__
'pandas'
>>> repr(pd.isna)
'<function isna at 0x7ff349aaa7a0>'
>>> 

Comment From: jorisvandenbossche

That's a good question. Based on what you show above, it might be good to just focus on the classes.

However, it might depend on the console you are using, or how you are looking at it. It's strange that for isna, the result of repr(..) differs from what is actually displayed (but maybe that is specific to IPython):

In [23]: repr(pd.isna)
Out[23]: '<function isna at 0x7f462079be20>'

In [24]: pd.isna
Out[24]: <function pandas.core.dtypes.missing.isna(obj: 'object') -> 'bool | npt.NDArray[np.bool_] | NDFrame'>

So here the displayed output shows it as pandas.core.dtypes.missing.isna.

Now, it's certainly less important for functions (and also less common to look at them that way), but I think there is still some value in hiding the full path in the output above.

Comment From: simonjayhawkins

Now, it's certainly less important for functions (and also less common to look at them that way), but I think there is still some value is hiding the full path in the output above.

No problem. Some functions have been done too.

Comment From: rhshadrach

It's strange that for isna, the result of repr(..) is different from the actual repr (but so maybe that is specific to IPython):

This is IPython logic.

https://github.com/ipython/ipython/blob/0615526f80691452f2e282c363bce197c0141561/IPython/lib/pretty.py#L806
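In other words, IPython's pretty printer builds the display name itself from `__module__` and `__qualname__`, which is why setting `__module__` changes what IPython shows for functions even though the plain repr() is unchanged. A rough paraphrase of that logic (simplified, not IPython's exact code):

```python
import inspect

def pretty_function_repr(func):
    """Simplified paraphrase of IPython's function pretty-printer logic."""
    name = getattr(func, "__qualname__", func.__name__)
    mod = func.__module__
    # Prefix with the module, except for builtins:
    if mod and mod not in ("builtins", "__builtin__", "exceptions"):
        name = f"{mod}.{name}"
    try:
        sig = str(inspect.signature(func))
    except (ValueError, TypeError):
        sig = "(...)"
    return f"<function {name}{sig}>"

def isna(obj):
    return obj is None

isna.__module__ = "pandas"
print(pretty_function_repr(isna))  # <function pandas.isna(obj)>
```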

Comment From: simonjayhawkins

The discussion above lists the top-level objects exposed in pandas/core/api.py.

In addition, following @rhshadrach's comment https://github.com/pandas-dev/pandas/pull/60268#issuecomment-2466740792, the __module__ attribute for the public(ish) objects that are exposed in /pandas/api/typing/__init__.py should also be included in this task.

Comment From: dangreb

The addition of the set_module decorator for DataFrame and Series, as far as I managed to verify, seems to be causing issues with PyCharm's Type Renderers functionality. When a Custom Data View is created in the IDE, we need to specify the target type, and the settings engine validates the input against the stubs and saves the fully qualified name it gets from them. Since no such abbreviation is in place in the stubs, it records "pandas.core.frame.DataFrame" as the qualified name.

Then, when JetBrains's PyDev fork, which is the base for PyCharm's debugger, searches for Type Renderers customized for DataFrame, it compares the abbreviated qualified name from the runtime (basically f'{type(obj).__module__}.{type(obj).__name__}', which yields "pandas.DataFrame") with the one it recorded from the customization based on the stubs (which is "pandas.core.frame.DataFrame").

There are also some hard-coded references to pandas.core.frame.DataFrame and pandas.core.frame.Series in a "Table Provider" routine, used to determine PyCharm's "Data View" tool engine for pandas DataFrame and Series objects, but I believe that should be addressed over there?
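The mismatch described above boils down to a one-line string comparison; a toy reconstruction of the reported behavior (not PyCharm's actual code):

```python
class DataFrame:
    """Stand-in for pandas.core.frame.DataFrame."""

DataFrame.__module__ = "pandas"  # what the set_module decorator now does

obj = DataFrame()
# Name the debugger builds at runtime:
runtime_name = f"{type(obj).__module__}.{type(obj).__name__}"
# Name the IDE recorded from the stubs when the renderer was configured:
stub_name = "pandas.core.frame.DataFrame"

assert runtime_name == "pandas.DataFrame"
assert runtime_name != stub_name  # the mismatch that breaks the renderer lookup
```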

Anyway, as I'm currently experimenting with the nightly releases, I noticed this and thought I'd give you a heads up!

Comment From: jorisvandenbossche

@dangreb thanks for the heads up; I also answered at https://github.com/pandas-dev/pandas/issues/62121

One other thing from your comment: you mention how this now no longer matches with the stubs. So should we potentially do the same change over there? (I am not entirely sure how this works, or if that is possible) The DataFrame class in the stubs is at https://github.com/pandas-dev/pandas-stubs/blob/91c83bf560688192c49efdf1324024b258a624af/pandas-stubs/core/frame.pyi#L373. But that is not python code, so applying a decorator won't have any effect there, I suppose. Or should the class then be located in pandas/__init__.pyi to have that work? (which I suppose then is not that practical / has other downsides) (cc @Dr-Irv)

Comment From: Dr-Irv

One other thing from your comment: you mention how this now no longer matches with the stubs. So should we potentially do the same change over there? (I am not entirely sure how this works, or if that is possible) The DataFrame class in the stubs is at https://github.com/pandas-dev/pandas-stubs/blob/91c83bf560688192c49efdf1324024b258a624af/pandas-stubs/core/frame.pyi#L373. But that is not python code, so applying a decorator won't have any effect there, I suppose. Or should the class then be located in pandas/__init__.pyi to have that work? (which I suppose then is not that practical / has other downsides) (cc @Dr-Irv)

I certainly don't want to start moving the definitions of classes around. There are tools that look at where we define the stubs in relation to the source code, so keeping DataFrame in pandas-stubs/core/frame.pyi mirrors the source location of pandas/core/frame.py and will keep those tools working.

In VS Code, if you do "Go to Declaration" or "Go to Definition" or "Go to Type Definition" on pd.DataFrame, it pulls up the files frame.py or frame.pyi from the core directories in the 2 projects. IMHO, not much confusion for users. Not sure how PyCharm handles stubs vs. source. One thing that is true with VS Code is that pandas-stubs are bundled with the pylance extension, so people don't necessarily need to install pandas-stubs into their environment. It is there for "free".