Pandas Categorical data not supported in plots of parallel coordinates

Just hoping to see if it might be possible to support categorical data in the fantastic parallel coordinates plot function.

I've uploaded an example script and dataset where the current code fails:

Archive.zip

Variable line widths that reflect frequency would also be highly appreciated. :)

Comment From: jreback

IIRC no other plot libraries have a similar function, @TomAugspurger

pull-requests are welcome

Comment From: TomAugspurger

This should be doable. @tantrev, interested in submitting a pull request? We might even want to just convert class_column to a Categorical. That could simplify some of the code later on, e.g. here

Comment From: TomAugspurger

@tantrev can you explain a bit more about what you expected, as I think parallel_coordinates is handling categoricals correctly.

The second argument to parallel_coordinates should be the class column, which I think should be your categorical column, and the y axis will be the measurement (which is maybe 'Resistance' for you). So I would expect you to do

parallel_coordinates(df, 'ccle_primary_site', cols=['Resistance'])

Comment From: troglotit

@TomAugspurger Excuse me for jumping in. I think the OP wanted something like this graph but with categories on the y axis (i.e. several y-axis denominations).

Comment From: TomAugspurger

@troglotit I think you're right, thanks. @tantrev does this look correct?

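In [30]: import numpy as np

In [31]: import pandas as pd

In [32]: from pandas.tools.plotting import parallel_coordinates

In [33]: import pandas.util.testing as tm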

In [34]: df = pd.concat([pd.Series(tm.makeCategoricalIndex(), name='d1'), pd.Series(tm.makeCategoricalIndex(), name='d2')], axis=1)

In [35]: df['v'] = np.linspace(0, 1, 10)

In [36]: parallel_coordinates(df, 'v', cols=['d1', 'd2'])

We could pretty easily take the codes of the categories as the position on the y axis. One issue here is how to handle unordered categories. My inclination is to raise a ValueError rather than assume an ordering where it might not be appropriate.
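A minimal sketch of that idea, for illustration only. The helper name `categorical_positions` is hypothetical and not part of pandas; it just shows codes being used as y positions and a ValueError being raised for an unordered categorical.

```python
import pandas as pd


def categorical_positions(column):
    """Return integer y positions for an ordered categorical column.

    Hypothetical helper: raises instead of guessing an order.
    """
    if not isinstance(column.dtype, pd.CategoricalDtype):
        raise TypeError("expected a categorical column")
    if not column.cat.ordered:
        raise ValueError("unordered categorical has no natural y-axis order")
    # The codes are the integer positions of each value in `categories`.
    return column.cat.codes


sizes = pd.Series(
    pd.Categorical(
        ["low", "high", "mid", "low"],
        categories=["low", "mid", "high"],
        ordered=True,
    ),
    name="size",
)
positions = categorical_positions(sizes)
```

Here `positions` holds one integer per row, and `sizes.cat.categories` would supply the matching tick labels.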

Comment From: tantrev

Hello! Thank you for all of your sincere help and my apologies for the delay - I have been ridiculously swamped these past two weeks.

@TomAugspurger - yes, the categorical columns I was referring to in the example dataset I uploaded were the first two - the so-called "ccle_primary_site" and "ccle_primary_hist" columns (ccle stands for cancer cell line encyclopedia and hist stands for histology). The resistance column was just meant to color-code the two datasets so they can be told apart.

With regards to your example dataset, I think we're on the same page: your 'v' column used for coloring was just uniformly distributed over a [0,1] range instead of a bimodal distribution like the one I provided.

The type of categorical visualization I envision is similar to the far-left column of the "progressive rendering" on this d3 page - it'd just be nice to be able to intermix categorical and numerical data across all columns. As far as ordering goes, perhaps a default alphabetical order might be forgiving to stupid users like myself?
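To make the alphabetical-default idea concrete, here is a rough sketch. The helper `with_alphabetical_order` is a made-up name, and falling back to sorted labels is only the suggestion above, not something pandas does by itself.

```python
import pandas as pd


def with_alphabetical_order(column):
    """Return `column` as an ordered categorical.

    Hypothetical helper: an already ordered categorical is kept as is,
    anything else gets its labels sorted alphabetically.
    """
    if isinstance(column.dtype, pd.CategoricalDtype) and column.cat.ordered:
        return column
    values = column.astype(object)
    labels = sorted(values.dropna().unique())
    ordered = pd.Categorical(values, categories=labels, ordered=True)
    return pd.Series(ordered, index=column.index, name=column.name)


sites = pd.Series(["skin", "lung", "breast", "lung"], name="site")
ordered_sites = with_alphabetical_order(sites)
```

The resulting codes follow alphabetical order of the labels, so "breast" would sit at the bottom of the axis and "skin" at the top.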

My final, extra icing-on-the-top request was to somehow reflect frequencies in our parallel coordinates plot. With numerical data, this isn't so much an issue because such data tends to be different enough that clusters are visually identifiable but with clear-cut categories, such natural "peak broadening" goes away (for example - there isn't any great way to discern prevalence within categories). Hence, I proposed artificially inflating categorical line widths based on frequencies - basically a mutant between a "parallel sets" plot and parallel coordinates. That being said, it might just be a horrible idea. :P
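As a rough illustration of the frequency idea, the line width for each distinct path through the categorical columns could be derived from how often that path occurs. The function name `path_line_widths`, the column names, and the linear scaling are all assumptions made for this sketch.

```python
import pandas as pd


def path_line_widths(df, columns, min_width=0.5, max_width=6.0):
    """Count each distinct combination of `columns` and scale it to a width.

    Hypothetical helper: widths grow linearly with frequency, and the
    most frequent combination gets `max_width`.
    """
    counts = (
        df.groupby(list(columns), observed=True)
        .size()
        .reset_index(name="count")
    )
    scaled = max_width * counts["count"] / counts["count"].max()
    counts["linewidth"] = scaled.clip(lower=min_width)
    return counts


cells = pd.DataFrame(
    {
        "site": ["lung", "lung", "lung", "skin"],
        "hist": ["carcinoma", "carcinoma", "carcinoma", "melanoma"],
    }
)
widths = path_line_widths(cells, ["site", "hist"])
```

Each row of `widths` is one line to draw, so a plotting routine would draw one polyline per row with that `linewidth` instead of one line per sample.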

Also, I'm a bit of a Github newb - would I first need to code a parallel coordinates replacement for a pull request? Thanks again for all of your help.

Comment From: TomAugspurger

> would I first need to code a parallel coordinates replacement for a pull request?

Not a full replacement, just modify the function in https://github.com/pydata/pandas/blob/cb43b2f20e21db25bb17f334bfa8f08a7292186d/pandas/tools/plotting.py#L669

You'll need to detect Categorical columns, use the codes as the y values and the categories as the tick labels. Keep in mind that the current implementation is super limited. We don't really have different axes per dimension, so everything is plotted on the same scale. It may take a bit of work to get things functioning properly.
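The data-preparation half of this could look roughly like the sketch below. It is not the code in plotting.py: `prepare_parallel_axes` is a hypothetical helper, it assumes no missing values, and it rescales every column to [0, 1] only as one possible way around the single shared y scale.

```python
import pandas as pd


def prepare_parallel_axes(df, columns):
    """Return (positions, tick_labels) for drawing on one shared y scale.

    Hypothetical helper. Categorical columns use their codes as y values
    and their categories as tick labels; numeric columns are min-max scaled.
    """
    positions = pd.DataFrame(index=df.index)
    tick_labels = {}
    for name in columns:
        column = df[name]
        if isinstance(column.dtype, pd.CategoricalDtype):
            categories = column.cat.categories
            span = max(len(categories) - 1, 1)
            positions[name] = column.cat.codes.astype(float) / span
            tick_labels[name] = {
                i / span: label for i, label in enumerate(categories)
            }
        else:
            low, high = column.min(), column.max()
            span = (high - low) or 1.0  # avoid dividing by zero
            positions[name] = (column - low) / span
    return positions, tick_labels


df = pd.DataFrame(
    {
        "site": pd.Categorical(
            ["lung", "skin", "breast"],
            categories=["breast", "lung", "skin"],
            ordered=True,
        ),
        "dose": [0.0, 5.0, 10.0],
    }
)
positions, tick_labels = prepare_parallel_axes(df, ["site", "dose"])
```

Each row of `positions` is then one polyline across the columns, and `tick_labels["site"]` maps a y position to the label that should be drawn next to that column's vertical line.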

Comment From: tacaswell

@phobson has implemented plots like that on top of mpl (I do not have a good link though).

Comment From: phobson

For completeness, the plots Tom is talking about are here: https://github.com/Geosyntec/wqio/blob/v0.3.2/wqio/utils/figutils.py#L795 and also here: https://gist.github.com/phobson/9de120cabde660ec734c

I don't think they would work with categorical data in the y-direction, but I might be surprised.

The function in wqio expects wide-form data.

Comment From: stangirala

@TomAugspurger @jreback Parallel sets is something that I'd like to have too as a standard option/function. I can work on this.

EDIT: I'd like to have a parallel_sets function that will in turn call parallel coordinates, will that work?

Comment From: TomAugspurger

Yes, I think so if you're willing to give it a shot.

On Fri, Apr 14, 2017 at 7:45 PM, Telt wrote:

> @TomAugspurger @jreback Parallel sets is something that I'd like to have too as a standard option/function. I can work on this.


Comment From: stangirala

@TomAugspurger @jreback Can you have a look at a WIP? I have some initial code with tests: https://github.com/stangirala/pandas/commit/378bef9f36de6267a733cf623e9392dd89d82a3b

I have a couple of questions. The signature is different from the parallel_coordinates method, and I was not sure if we need all of its arguments (for example axvlines).

Do you have a suggestion on how to test _connect_polyline()? I was thinking of simply using a mock object for the axes objects and verifying their attributes were being accessed.

EDIT: Some context, I ended up with a new method called parallel_sets() which converts all of its inputs to a string, as if expecting categorical data always, via a string_to_integer dictionary, and plots those integers.
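For readers who have not opened the commit, the general string-to-integer idea can be sketched as below. This is not the code from the WIP; `encode_as_integers` is a made-up name and only the `string_to_integer` dictionary comes from the description above.

```python
def encode_as_integers(values):
    """Map every value to an integer via its string form.

    Hypothetical helper: integers are assigned in order of first appearance.
    """
    string_to_integer = {}
    codes = []
    for value in values:
        key = str(value)
        if key not in string_to_integer:
            string_to_integer[key] = len(string_to_integer)
        codes.append(string_to_integer[key])
    return codes, string_to_integer


codes, string_to_integer = encode_as_integers(["lung", "skin", "lung", 3])
```

The integers in `codes` are what would get plotted, and inverting `string_to_integer` gives the tick labels. `pandas.factorize` does a similar first-appearance encoding without the string conversion.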

Comment From: shoyer

Some thoughts on this issue:

1. Why does parallel_sets need to be a separate function from parallel_coordinates? It seems like this is what you would always want when categorical values are given as a column. parallel_coordinates needs to support different axes/tick-labels for this, but that functionality would be a major improvement to the original visualization.
2. Why is class_column a required argument for parallel_coordinates? That seems very strange -- there's no inherent reason why parallel coordinates plots need to be colored.
3. The static graphs produced by matplotlib pale in comparison to the JavaScript visualizations with dynamic brushing. So I'm not opposed to fixing up this functionality, but really matplotlib is not the best tool here.

Comment From: stangirala

1. From what I have read about parallel coordinates, it assumes each sample is a point in some n-d space and not necessarily a member of a category, so in my PR it made sense to have two different functions. But support for different axes/tick-labels might be a better and simpler idea.
2. The class_column is used because the general use case for parallel_coordinates is visualization of multivariate data.
3. Indeed. Do you think it makes sense to fold in a js library (say Bokeh) as part of the standard visualization?