Create a class / method for number parsing. This would be a generalization of the code that remove the thousands separators from numbers (see: https://github.com/cancan101/pandas/blob/1703ef44cd6b98e17c785c9120e29bbeefdefd1c/pandas/io/parsers.py#L1502).

Other items include: 1. Alternative decimal point 2. Using parenthesis for negative numbers 3. Stripping currency character

I can submit PR for this. Do people have other suggestions for this?

Comment From: jreback

there is a decimal parameter already, but I not sure in doc/source/io.rst (though in doc string)....maybe add that doc?

Comment From: cancan101

The idea of factoring out this functionality is so that it can be used for parsing numbers in other locations without having to go through the TextParser.

Comment From: jreback

why would you not simply use replace for that, which can do all of what you are talking about? (if its not parser related)

Comment From: jreback

http://pandas.pydata.org/pandas-docs/dev/missing_data.html#string-regular-expression-replacement

Comment From: cancan101

How would you use replace for the parenthesis? Bear in mind that the value then needs to be negated.

Comment From: cancan101

I suppose I could replace the parenthesis with a leading negative sign and then convert that string to an int/ float.

Comment From: cancan101

That being said I was thinking of places where I wanted to be able to take an arbitrary string which might not be in a DataFrame / Series and parse a number.

Comment From: jreback

that would work; replace should infer the return types correctly

Comment From: cancan101

@jreback That addresses using replace for a DataFrame or Series but not in the general sense where we have a string to parse.

Comment From: jreback

out of scope for pandas

Comment From: cancan101

Fair enough. I was just trying to factor out code that could be used both for pandas and for what is out of scope for pandas.

Comment From: jreback

@cancan101 close this then? or do you want to try to do some of this in the actual parse itself? (which depending on the API/scope may of may not be useful). As its a small edge case and if you did have parens/dollar signs in your numbers you can just replace after you read it in, because there is a perf impact in even checking for it on more general parsing

Comment From: cancan101

I was going to / could write this such that there is no performance impact. I would instantiate a parser at the following location: https://github.com/cancan101/pandas/blob/1703ef44cd6b98e17c785c9120e29bbeefdefd1c/pandas/io/parsers.py#L1505 based on parsing options. Theoretically are a few more branches per FILE (not per cell or line).

Comment From: jreback

Well, so you want to parse (value) -> -value and $numeric -> numeric?

this is JUST for PythonParser, right? (if so I am -1 on this); the c-parser is most often the default/used

Comment From: cancan101

What do you mean -1 on this? Like you don't like the change? Or you care less about the performance hit because this code path isn't the default?

Comment From: cancan101

And yes, what I was thinking was just the PythonParser

Comment From: jreback

no, doing a change just for the PythonParser; the c-parser is the primary parser and getting them out of sync is IMHO a bad idea (as you can see by the recent fixes to put them BACK in sync). In reality the PythonParser is not even necessary (except for some back compat I think) (as the c-parser can parse virtually eveything)

Comment From: cancan101

This does not actually have to change the API for the PythonParser. All I am suggesting is that the underlying python code for the PythonParser is factored out and usable elsewhere.

Comment From: jreback

for what purpose?

Comment From: cancan101

There are cases in the HTML parser that I want to be able to use this functionality.

Comment From: jtratner

Can you show an example of input where this is necessary?

Comment From: jreback

@cancan101 I would not be averse to seeing a suite of converters, ala`io/date_converters``, e.g. functions to do some sort of conversions like that

Comment From: cancan101

Having a set of converters seems reasonable. Do you think it makes sense then for the PythonParser to use one of these converters (ie the one that handles the thousands separator)?

Comment From: jreback

when I mean converters, I really mean essentially a set of regular expressions/functions that you just pass to replace. You can certainly pass these via the converters argument if you REALLY want to. The thousands/decimal stuff seems more general purpose to me.

I mean you could argue that the parents/ignore dollar signs are general purpose too, but I have never seen people actually write those to csv's (and only use them as formatting).

Comment From: cancan101

Far more often I see these in HTML tables, for example: http://www.sec.gov/Archives/edgar/data/47217/000104746913006802/a2215416z10-q.htm#CCSCI I was just hoping to centralize the number parsing logic for CSV and for HTML in one place. It looks like logic for parsing HTML will need to be richer.

Comment From: jreback

HTML is a bit of a different beast. @cpcloud can comment on this when he is back, but the API is meant to be somewhat similar to read_csv, but the backend is completely different

read_csv is eseentially line-oriented, why HTML parsing is not

Comment From: cancan101

@jreback We still have not really answered the question as to whether we should copy and paste the following code from parsers.py into a converter, or whether they should share the code:

        nonnum = re.compile('[^-^0-9^%s^.]+' % self.thousands)
        ...    
                if (not isinstance(x, compat.string_types) or
                    self.thousands not in x or
                        nonnum.search(x.strip())):
                    rl.append(x)
                else:
                    rl.append(x.replace(self.thousands, ''))

Comment From: jtratner

@cancan101 also, keep in mind that it's often much simpler to change things around after converting into a DataFrame as opposed to trying to do it at the same time you parse the data. So if you have something like '(9)' as a string, it could be faster to allow it to come out as a string dtype and then manipulate it into integers.

Comment From: cancan101

@jtratner That makes sense. With that line of thinking, why even have this thousands parsing in any of the parsers? Why not deprecate that functionality / emulate it by loading string into a DataFrame and then performing a replace, etc? That would clean up (and speed up) the parser code, avoid issue like this, and make the various parsers more consistent?

Comment From: jreback

@cancan101 its all a trade-off. More complicated parsing code by MUCH faster for in-line conversions, (e.g. having thousands separators) is not so uncommon, and once you take them out you are left with a float.

Doing after conversions is much more memory intensive (and time intensive), you have to scan twice (or more).

Getting good/great performance is hard. Always solve the problem first, then optimize. That said it is keeping in mind that certain bottlenecks can be dealt with by more complicated code.

Comment From: cancan101

@jreback Fair enough. I was going to say there is there is something to be said about keeping the parsers about extracting structure from the info (column, indexes, etc) and leaving extracting meaning (i.e. number parsing) to the post processing. This would make writing new parsers simpler.

Comment From: jreback

@cancan101 you have a fair point; in this case it was all about performance when @wesm wrote the new parsers late last year. I am not sure of any gains in anew 'new' parsers thought. What 'new' parsers are needed?

Comment From: cancan101

At this point the only one on my horizon would be the XBRL parser (#4407), although I don't think that one will have issues with string data.

Comment From: cancan101

It also is not all about "new" parsers but maintaining and improving old parsers. This does contract jtratner's previous points about keeping the APIs simple. Admittedly there might be a valid performance reason here to do so.

Comment From: jreback

@cancan101

a lot of times people want different API's for things but the implementation can be a single one (with say options) to support these API's

the parsers are the reverse

there is bascially a single API, with different implementations (e.g. most use read_csv machinery, but excel puts a layer on top, and HTML is completely different, as is the stata and JSON implementations for that matter)

Comment From: cancan101

Here the number of rows is small (<100), so for me, I can definitely go the "read as string and then apply more complicated parsing logic" without worrying about performance.

Comment From: cancan101

@jreback Another new parser would be the PDF parser #4556

Comment From: jbrockmendel

Did a decision get reached on this?

Comment From: jorisvandenbossche

I think it would be nice to add a decimal option to pandas.to_numeric

Comment From: liverpool1026

following onto this.

should to_numeric support cases like thousand seperator? Or that is better handled outside to_numeric.

The issue I am having is sometimes the data comes in as "1,200" instead of "1200" or 1200.

So when I call to_numeric

  • "1,200" -> Error
  • "1200" -> 1200
  • 1200 -> 1200

When I would want "1,200" to be coverted to 1200 when I call to_numeric.

Comment From: AlexHodgson

take

Comment From: AlexHodgson

I see that to_numeric() uses precise_xstrtod() from tokenizer.c to do the string to numeric conversion, this function already has functionality to handle different characters for decimal and thousands, but the normal separators are hardcoded in pd_parser.c where the call to precise_xstrtod() occurs. I could add parameters to to_numeric() for decimal and thousand separators like read_csv() has, and then pass these down to the underlying converter?