npreadtext's code was never designed by me to be the fastest possible reader. Rather, it is designed to be complete (full unicode support, no special-casing of raw files), extensible, and fairly simple. Considering that, I think it is pretty sweet that it easily beats the pandas parser.
However, some optimizations are plausible (and probably more micro-optimizations exist).
So here I list some potential ones, some of which I explored a bit, but just don't really want to do (yet):
- Read user-opened files in chunks. This could be pretty worthwhile (maybe even ~30%) and adds no complexity at all. The only downside: on error, we don't just read one line too much, but an unspecified amount more. This also makes it incompatible with `max_rows=...`, where loadtxt ensures we only read as much as necessary (for user-opened files).
- Use bloom filters:
  - Adds very little complexity (I have a branch on my fork to try it; it mostly steals the filter setup/design from Python, the rest is nothing).
  - Worthwhile for whitespace handling: this seems worthwhile for all whitespace checks (both in the tokenizer and the numerical converters). It is a minor (but simple!) optimization for the numerical converters; it may actually be very worthwhile for `delimiter=None` (I did not test that).
  - It seems not worthwhile for `delimiter=","`, but it may be if we add e.g. an escape character as well.
- "Specialized versions": By duplicating the core loop for certain constant expressions, leaner versions could be created without much code churn. But it seems you need to be very specialized for this to be worthwhile. For example, make a loop specific to `delimiter=","`, `newline="\n"`, `quotechar=None`, `comment='#'` (`newline="\n"` denoting that universal-newline handling is not needed).
  - A quick check indicates that this only optimizes things by ~10% right now though, which seems too modest to make sense to me.
The following are more complicated ones that I would not aim for in NumPy, unless they prove very worthwhile (the first one is pretty reasonable though, if it is worthwhile on its own). I could imagine that combining a few of these makes a big difference (but it seems probably too complicated/too much churn for an "every day" parser).
- Some files could be read directly (without Python involvement) or in raw/bytes mode. Python seems pretty quick at reading, and this does not seem to be a huge portion of the whole time, so it is unclear whether it would help much:
  - Easy, if it were only used for `encoding="latin1"` or maybe `ascii` (i.e. equivalent to unicode stored in a single byte, UCS1).
  - Trickier, but possible for `utf-8` (would require utf-8 decoding and limiting all control characters to ASCII, like pandas). Of course, for numerical data only ASCII is used, and `latin1` is probably always OK, except for printing wrong errors when parsing fails.
- Converters could avoid UCS4: this is plausible, but using UCS4 means a single code path is viable. Avoiding it would effectively mean that converters have to provide UCS1, UCS2, and UCS4 versions.
  - This could be quite a lot faster, but it also seems like a lot of annoying churn/additional complexity.
  - Requires templating all converters for UCS1, (UCS2?), and UCS4. It also means any extension will need to handle all of these if made available.
  - Seems like quite a lot of churn, but if it makes a huge difference, it could be worthwhile.
  - Would require either:
    - Limiting to UCS1/UTF8(?), so that any character outside the range is an immediate failure (or even ASCII, for simplicity with reading raw UTF8 data).
    - Annoying logic, since Python unicode strings can be UCS1, UCS2, or UCS4, so reading a single file can easily switch between modes mid-file.
- Field-by-field parsing is often possible, but tricky because usecols supports negative indexing. It might be worthwhile in some cases. I somewhat expect this would mainly make sense if combined with one or both of the above complicated optimizations. (Otherwise it may improve cache locality slightly, and it simplifies field handling, since there is always only a single field.)
- No-copy parsing (a copy will sometimes be necessary, though) might be possible if loops are specialized to latin1/ascii only. Further, converters can currently rely on `\0` termination, which is difficult or impossible with no-copy.