- [X] I have checked that this issue has not already been reported.
- [X] I have confirmed this bug exists on the latest version of pandas.
- [X] I have confirmed this bug exists on the master branch of pandas.
Reproducible Example
# Case 1: unordered (nominal) categorical
import pandas as pd
cat = pd.Categorical(['a', 'a', 'c'], categories=['a', 'b', 'c'])
codes, uniques = pd.factorize(cat)
codes
# Output: array([0, 0, 1], dtype=int64)

# Case 2: ordered (ordinal) categorical
import pandas as pd
cat = pd.Categorical(['a', 'a', 'c'], categories=['a', 'b', 'c'], ordered=True)
codes, uniques = pd.factorize(cat)
codes
# Output: array([0, 0, 1], dtype=int64)
Issue Description
In case 1, where the categorical is nominal (unordered), the factorized codes are [0, 0, 1], which seems fine. In case 2, where the categorical is ordinal (ordered=True), the output is the same: [0, 0, 1].
Expected Behavior
For the ordered categorical in case 2, the codes should instead be [0, 0, 2].
Installed Versions
Comment From: jreback
In [296]: cat1 = pd.Categorical(['a', 'a', 'c'], categories=['a', 'b', 'c'])
In [297]: cat2 = pd.Categorical(['a', 'a', 'c'], categories=['a', 'b', 'c'], ordered=True)
In [298]: cat1.codes
Out[298]: array([0, 0, 2], dtype=int8)
In [299]: cat2.codes
Out[299]: array([0, 0, 2], dtype=int8)
I am not sure there is a case for actually factorizing a Categorical itself. It is likely not tested. A Categorical is by definition already factorized.
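As a minimal sketch of what "already factorized" means here (the variable names `category_codes` and `rebuilt` are illustrative only, not from the thread), the codes stored on the Categorical are positions into its `categories`, so the expected [0, 0, 2] can be read directly from `.codes`:

```python
import pandas as pd

cat = pd.Categorical(['a', 'a', 'c'], categories=['a', 'b', 'c'], ordered=True)

# .codes holds the position of each value within cat.categories,
# including categories that never occur in the data ('b' here).
category_codes = cat.codes

# Indexing the categories with those codes rebuilds the original values.
rebuilt = cat.categories.take(category_codes)

print(category_codes.tolist())  # [0, 0, 2]
print(list(rebuilt))            # ['a', 'a', 'c']
```

This sketch assumes there are no missing values; a missing value is stored as code -1, which `take` would not treat as missing by default.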
Comment From: jorisvandenbossche
The problem is that the "codes" returned by factorize are indices into the "uniques" part of the return. And that only contains the values that are present (regardless of the categories of the categorical):
In [5]: import pandas as pd
...: cat = pd.Categorical(['a', 'a', 'c'], categories=['a', 'b', 'c'], ordered=True)
...: codes, uniques = pd.factorize(cat)
In [6]: codes
Out[6]: array([0, 0, 1])
In [7]: uniques
Out[7]:
['a', 'c']
Categories (3, object): ['a' < 'b' < 'c']
So unless we change uniques, the codes are actually correct, and can't differ between the ordered=True/False cases.
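The relationship described above can be checked with a short sketch. The names `roundtrip` and `category_level_codes` are illustrative, and the example assumes no missing values in the data:

```python
import pandas as pd

cat = pd.Categorical(['a', 'a', 'c'], categories=['a', 'b', 'c'], ordered=True)
codes, uniques = pd.factorize(cat)

# factorize codes are positions into uniques, which holds only observed values.
roundtrip = uniques.take(codes)
print(codes.tolist())   # [0, 0, 1]
print(list(roundtrip))  # ['a', 'a', 'c']

# uniques is itself a Categorical that keeps all three categories, so its own
# .codes map each factorize code back to a position in the categories.
category_level_codes = uniques.codes[codes]
print(category_level_codes.tolist())  # [0, 0, 2]
```

So the [0, 0, 2] encoding is still recoverable from the factorize result, or more directly from `cat.codes`, without changing what `pd.factorize` returns.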
Comment From: jbrockmendel
I agree with @jorisvandenbossche, nothing to do here.