Create a class / method for number parsing. This would be a generalization of the code that removes the thousands separators from numbers (see: https://github.com/cancan101/pandas/blob/1703ef44cd6b98e17c785c9120e29bbeefdefd1c/pandas/io/parsers.py#L1502).
Other items include:
1. Alternative decimal point
2. Using parentheses for negative numbers
3. Stripping currency characters
I can submit a PR for this. Do people have other suggestions?
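To make the proposal concrete, here is a minimal sketch of what such a standalone number parser might look like. The function name `parse_number` and its parameters are hypothetical and not part of pandas; it only illustrates the three items listed above plus thousands-separator removal.

```python
def parse_number(text, thousands=",", decimal=".", currency="$"):
    """Parse a formatted number string into an int or float (illustrative only)."""
    s = text.strip()
    # Accounting style: a value wrapped in parentheses is negative.
    negative = s.startswith("(") and s.endswith(")")
    if negative:
        s = s[1:-1]
    # Drop currency characters, then thousands separators.
    for ch in currency:
        s = s.replace(ch, "")
    s = s.replace(thousands, "")
    # Normalize an alternative decimal point to "." so float() accepts it.
    if decimal != ".":
        s = s.replace(decimal, ".")
    s = s.strip()
    value = float(s) if ("." in s or "e" in s.lower()) else int(s)
    return -value if negative else value
```

For example, `parse_number("(1,234.50)")` gives `-1234.5`, and `parse_number("1.234,5", thousands=".", decimal=",")` gives `1234.5`.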
Comment From: jreback
there is a `decimal` parameter already, but I'm not sure it is in doc/source/io.rst (though it is in the docstring)... maybe add that doc?
Comment From: cancan101
The idea of factoring out this functionality is so that it can be used for parsing numbers in other locations without having to go through the `TextParser`.
Comment From: jreback
why would you not simply use `replace` for that, which can do all of what you are talking about? (if it's not parser related)
Comment From: jreback
http://pandas.pydata.org/pandas-docs/dev/missing_data.html#string-regular-expression-replacement
Comment From: cancan101
How would you use replace for the parentheses? Bear in mind that the value then needs to be negated.
Comment From: cancan101
I suppose I could replace the parentheses with a leading negative sign and then convert that string to an int / float.
Comment From: cancan101
That being said, I was thinking of places where I wanted to be able to take an arbitrary string, which might not be in a `DataFrame` / `Series`, and parse a number.
Comment From: jreback
that would work; replace should infer the return types correctly
Comment From: cancan101
@jreback That addresses using replace for a DataFrame or Series but not in the general sense where we have a string to parse.
Comment From: jreback
out of scope for pandas
Comment From: cancan101
Fair enough. I was just trying to factor out code that could be used both for pandas and for what is out of scope for pandas.
Comment From: jreback
@cancan101 close this then? or do you want to try to do some of this in the actual parser itself? (which, depending on the API/scope, may or may not be useful). As it's a small edge case, if you did have parens/dollar signs in your numbers you can just replace after you read it in, because there is a perf impact in even checking for it in more general parsing.
Comment From: cancan101
I was going to / could write this such that there is no performance impact. I would instantiate a parser at the following location: https://github.com/cancan101/pandas/blob/1703ef44cd6b98e17c785c9120e29bbeefdefd1c/pandas/io/parsers.py#L1505 based on the parsing options. Theoretically there are a few more branches per FILE (not per cell or line).
Comment From: jreback
Well, so you want to parse `(value)` -> `-value` and `$numeric` -> `numeric`?
this is JUST for `PythonParser`, right? (if so I am -1 on this); the c-parser is most often the default/used
Comment From: cancan101
What do you mean -1 on this? Like you don't like the change? Or you care less about the performance hit because this code path isn't the default?
Comment From: cancan101
And yes, what I was thinking was just the PythonParser
Comment From: jreback
no, I mean doing a change just for the `PythonParser`; the c-parser is the primary parser and getting them out of sync is IMHO a bad idea (as you can see by the recent fixes to put them BACK in sync). In reality the `PythonParser` is not even necessary (except for some back compat I think), as the c-parser can parse virtually everything.
Comment From: cancan101
This does not actually have to change the API for the `PythonParser`. All I am suggesting is that the underlying Python code for the `PythonParser` is factored out and usable elsewhere.
Comment From: jreback
for what purpose?
Comment From: cancan101
There are cases in the HTML parser that I want to be able to use this functionality.
Comment From: jtratner
Can you show an example of input where this is necessary?
Comment From: jreback
@cancan101 I would not be averse to seeing a suite of `converters`, a la `io/date_converters`, e.g. functions to do some sort of conversions like that
Comment From: cancan101
Having a set of converters seems reasonable. Do you think it makes sense then for the `PythonParser` to use one of these converters (i.e. the one that handles the thousands separator)?
Comment From: jreback
when I say converters, I really mean essentially a set of regular expressions/functions that you just pass to `replace`. You can certainly pass these via the `converters` argument if you REALLY want to. The `thousands`/`decimal` stuff seems more general purpose to me.
I mean you could argue that the parens/ignore dollar signs are general purpose too, but I have never seen people actually write those to csv's (they only use them as formatting).
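A rough sketch of what "a set of regular expressions/functions" could look like, using only the standard library. The names `NUMBER_CLEANUPS` and `clean_number_string` are hypothetical and do not exist in pandas; a function like this is the kind of thing one could hand to the `converters` argument for a given column.

```python
import re

# Hypothetical cleanup table: (compiled pattern, replacement) pairs applied in order.
NUMBER_CLEANUPS = [
    (re.compile(r"^\((.*)\)$"), r"-\1"),  # (value) -> -value
    (re.compile(r"[$,]"), ""),            # drop dollar signs and thousands separators
]


def clean_number_string(text):
    """Return a string that float() can parse, e.g. '($1,000)' -> '-1000'."""
    out = text.strip()
    for pattern, repl in NUMBER_CLEANUPS:
        out = pattern.sub(repl, out)
    return out
```

The order of the table matters: the parentheses are rewritten to a leading minus sign before the currency and separator characters are removed.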
Comment From: cancan101
Far more often I see these in HTML tables, for example: http://www.sec.gov/Archives/edgar/data/47217/000104746913006802/a2215416z10-q.htm#CCSCI I was just hoping to centralize the number parsing logic for CSV and for HTML in one place. It looks like logic for parsing HTML will need to be richer.
Comment From: jreback
HTML is a bit of a different beast. @cpcloud can comment on this when he is back, but the API is meant to be somewhat similar to `read_csv`, while the backend is completely different:
`read_csv` is essentially line-oriented, while HTML parsing is not
Comment From: cancan101
@jreback We still have not really answered the question as to whether we should copy and paste the following code from `parsers.py` into a converter, or whether they should share the code:
    nonnum = re.compile('[^-^0-9^%s^.]+' % self.thousands)
    ...
    if (not isinstance(x, compat.string_types) or
            self.thousands not in x or
            nonnum.search(x.strip())):
        rl.append(x)
    else:
        rl.append(x.replace(self.thousands, ''))
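For illustration, the shared-code option could be a free function along these lines. This is a sketch, not the pandas implementation: the name `strip_thousands` is hypothetical, `str` stands in for `compat.string_types`, and the character class is rewritten with `re.escape` rather than copied verbatim.

```python
import re


def strip_thousands(rows, thousands=","):
    """Remove the thousands separator from cells that otherwise look numeric."""
    # Any character other than a digit, sign, decimal point or the separator
    # marks the cell as non-numeric.
    nonnum = re.compile(r"[^-0-9.%s]" % re.escape(thousands))
    result = []
    for row in rows:
        cleaned = []
        for cell in row:
            if (not isinstance(cell, str)
                    or thousands not in cell
                    or nonnum.search(cell.strip())):
                cleaned.append(cell)  # leave non-strings and non-numbers untouched
            else:
                cleaned.append(cell.replace(thousands, ""))
        result.append(cleaned)
    return result
```

Because it takes plain lists of rows, a function like this could be called from the CSV code path and from an HTML table reader alike.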
Comment From: jtratner
@cancan101 also, keep in mind that it's often much simpler to change things around after converting into a `DataFrame` as opposed to trying to do it at the same time you parse the data. So if you have something like `'(9)'` as a string, it could be faster to allow it to come out as a string dtype and then manipulate it into integers.
Comment From: cancan101
@jtratner That makes sense. With that line of thinking, why even have this `thousands` parsing in any of the parsers? Why not deprecate that functionality / emulate it by loading strings into a `DataFrame` and then performing a replace, etc.? That would clean up (and speed up) the parser code, avoid issues like this, and make the various parsers more consistent.
Comment From: jreback
@cancan101 it's all a trade-off. More complicated parsing code, but MUCH faster for in-line conversions (e.g. having thousands separators is not so uncommon, and once you take them out you are left with a float).
Doing conversions afterwards is much more memory intensive (and time intensive); you have to scan twice (or more).
Getting good/great performance is hard. Always solve the problem first, then optimize. That said, it is worth keeping in mind that certain bottlenecks can be dealt with by more complicated code.
Comment From: cancan101
@jreback Fair enough. I was going to say there is something to be said for keeping the parsers about extracting structure from the info (columns, indexes, etc.) and leaving the extraction of meaning (i.e. number parsing) to the post-processing. This would make writing new parsers simpler.
Comment From: jreback
@cancan101 you have a fair point; in this case it was all about performance when @wesm wrote the new parsers late last year. I am not sure there are any gains for 'new' parsers though. What 'new' parsers are needed?
Comment From: cancan101
At this point the only one on my horizon would be the XBRL parser (#4407), although I don't think that one will have issues with string data.
Comment From: cancan101
It also is not all about "new" parsers but about maintaining and improving old parsers. This does contradict jtratner's previous points about keeping the APIs simple. Admittedly there might be a valid performance reason here to do so.
Comment From: jreback
@cancan101 a lot of times people want different APIs for things, but the implementation can be a single one (with, say, options) to support these APIs.
The parsers are the reverse: there is basically a single API, with different implementations (e.g. most use the read_csv machinery, but excel puts a layer on top, and HTML is completely different, as are the stata and JSON implementations for that matter).
Comment From: cancan101
Here the number of rows is small (<100), so for me, I can definitely go the "read as string and then apply more complicated parsing logic" without worrying about performance.
Comment From: cancan101
@jreback Another new parser would be the PDF parser #4556
Comment From: jbrockmendel
Did a decision get reached on this?
Comment From: jorisvandenbossche
I think it would be nice to add a `decimal` option to `pandas.to_numeric`
Comment From: liverpool1026
Following on from this:
Should to_numeric support cases like a thousands separator? Or is that better handled outside to_numeric?
The issue I am having is that sometimes the data comes in as "1,200" instead of "1200" or 1200.
So when I call to_numeric:
- "1,200" -> Error
- "1200" -> 1200
- 1200 -> 1200
I would want "1,200" to be converted to 1200 when I call to_numeric.
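Until something like that exists, one possible workaround is to strip the separator before calling `to_numeric`. This is a sketch that assumes the comma is only ever used as a thousands separator in the column; the variable names are illustrative.

```python
import pandas as pd

values = pd.Series(["1,200", "1200", 1200])

# Cast to str first so the .str accessor can handle the mixed str/int column,
# then remove the separator as a literal (non-regex) replacement.
cleaned = values.astype(str).str.replace(",", "", regex=False)
result = pd.to_numeric(cleaned)
```

All three inputs then come out as the number 1200, at the cost of an extra pass over the data.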
Comment From: AlexHodgson
take
Comment From: AlexHodgson
I see that to_numeric() uses precise_xstrtod() from tokenizer.c to do the string-to-numeric conversion. This function already has functionality to handle different characters for decimal and thousands, but the normal separators are hardcoded in pd_parser.c where the call to precise_xstrtod() occurs. I could add parameters to to_numeric() for decimal and thousands separators, like read_csv() has, and then pass these down to the underlying converter?