npreadtext's code was never designed by me to be the fastest possible reader. Rather, it is designed to be complete (full unicode support, no special-casing of raw files), extensible, and fairly simple. Considering that, I think it is pretty sweet that it easily beats the pandas parser.
However, some optimizations are plausible (and probably more micro-optimizations exist).
So here I list some potential ones, some of which I explored a bit, but just don't really want to do (yet):
- Read user-opened files in chunks. This could be pretty worthwhile (maybe even ~30%) and adds no complexity at all. The only downside: on error, we don't just read one line too much, but an unspecified amount more. This also makes it incompatible with `max_rows=...`, where loadtxt ensures we only read as much as necessary (for user-opened files).
- Use bloom filters:
  - Adds very little complexity (I have a branch on my fork to try it; it mostly steals the filter setup/design from Python, the rest is nothing).
  - Worthwhile for whitespace handling: this seems worthwhile for all whitespace checks (both in the tokenizer and the numerical converters). It is a minor (but simple!) optimization for the numerical converters; it may actually be very worthwhile for `delimiter=None` (I did not test that).
  - It seems not worthwhile for `delimiter=","`, but it may be if we add e.g. an escape character as well.
- "Specialized versions": By duplicating the core loop for certain constant expressions, leaner versions could be created without much code churn. But it seems you need to be very specialized for this to be worthwhile. For example, make a loop specific to `delimiter=","`, `newline="\n"`, `quotechar=None`, `comment='#'` (`newline="\n"` denoting that universal-newline handling is not needed).
  - A quick check indicates that this only optimizes things by ~10% right now though, which seems too modest to make sense to me.
The following are more complicated ones that I would not aim for in NumPy, unless they prove very worthwhile (the first one is pretty reasonable though, if it is worthwhile on its own). I could imagine that combining a few of these makes a big difference (but it seems probably too complicated/too much churn for an "every day" parser).
- Some files could be read directly (without Python involvement) or in raw/bytes mode. Python seems pretty quick at reading, and this does not seem to be a huge portion of the whole time, so it is unclear whether it would help much:
  - Easy, if it were only used for `encoding="latin1"` or maybe `ascii` (i.e. equivalent to unicode stored in a single byte, UCS1).
  - Trickier, but possible for `utf-8` (would require utf-8 decoding and limiting all control characters to ASCII, like pandas). Of course, for numerical data only ASCII is used, and `latin1` is probably always OK, except for printing wrong errors when parsing fails.
- Converters could avoid UCS4: this is plausible, but using UCS4 means a single code path is viable. Avoiding it would effectively mean that converters have to provide UCS1, UCS2, and UCS4 versions.
  - This could be quite a lot faster, but it also seems like a lot of annoying churn/additional complexity.
  - Requires templating all converters for UCS1, (UCS2?), and UCS4. It also means any extension will need to handle all of these if made available.
  - Seems like quite a lot of churn, but if it makes a huge difference, it could be worthwhile.
  - Would require either:
    - Limiting to UCS1/UTF8(?), so that any character outside the range is an immediate failure (or even ASCII, for simplicity with reading raw UTF8 data).
    - Annoying logic, since Python unicode strings can be UCS1, UCS2, or UCS4, so reading a single file can easily switch between modes mid-file.
- Field-by-field parsing is often possible, but tricky because usecols supports negative indexing. It might be worthwhile in some cases. I somewhat expect this would mainly make sense if combined with one or both of the above complicated optimizations. (Otherwise it may improve cache locality slightly, and it simplifies field handling, since there is always only a single field.)
- No-copy parsing (a copy will sometimes be necessary, though) might be possible if loops are specialized to latin1/ascii only. Further, converters can currently rely on `\0` termination, which is difficult or impossible with no-copy.