Skip to content

Conversation

@incaseoftrouble
Copy link

An improvement of the Distribution using two arrays and binary search, yielding significantly smaller memory footprint. Avoiding repeated boxing and unboxing yields noticeable speed-ups.

Repeated calls to add() have a worse complexity (theta n per call), hence a bulk constructor and a builder with amortized n log n construction complexity (for the whole distribution) is added.

@merkste
Copy link
Contributor

merkste commented Apr 23, 2018

I have the following questions:

  1. Do you have any benchmarks or numbers that support and quantify the performance gain?
  2. Have you considered to inherit/implement Distribution instead of changing the class? Maybe we can make Distribution an interface with different implementations?
  3. Have you considered to simplify the builder concept by just copying an already build map-based distribution?

I am also very sceptical considering the fact that the entries you use for iteration are mutable. This is a sure source for hard-to-track bugs, as it imposes a very different contract than before. Also, you can write the underlying structure during iteration which might be undesirable.

@incaseoftrouble
Copy link
Author

incaseoftrouble commented Apr 23, 2018

Do you have any benchmarks or numbers that support and quantify the performance gain?

Only a few local tests. If exhaustive testing is wished for I can do it. The primary goal was memory efficiency, in the limit this implementation is probably worse than optimal hashing (log n vs constant ...)

Maybe we can make Distribution an interface with different implementations?

I'd love to, but that makes the changes quite global. The best / modern approach would be to have Distribution.of factory methods, choosing the right implementation. Since usually the distributions are only built once, they probably should be immutable. This allows for some optimizations. Creation can be implemented via separate builder classes, which then in turn can choose the right implementation again.

Have you considered to simplify the builder concept by just copying an already build map-based distribution?

Its slower due to boxing/unboxing. This basically is taken from my local branch where I replaced everything with primitive data structures. Can easily be adapted and is quite related to the the of factory methods.

I am also very sceptical considering the fact that the entries you use for iteration are mutable. This is a sure source for hard-to-track bugs, as it imposes a very different contract than before. Also, you can write the underlying structure during iteration which might be undesirable.

There are multiple things I think:
a) Returning one object which is mutated during construction: This is the way fastutil implements its iteration. The idea is that, since there are less objects, the GC has less work to do. But especially with the more modern GCs, this presumably is not relevant. The JVM might even figure out that the entries are value types and might handle them accordingly. So, in short, this can easily be changed and there is no strong opinion from my side.

b) Mutation through the entry: This can easily be controlled by exceptions. Wasn't it mutable before? map.entrySet().iterator() can setValue IIRC.

c) Mutation of the structure while iterating: Well, fail-fast is only required to be best-effort. It can be done with slightly more memory and runtime but I think its not worth it. Especially, since the object should be immutable anyway ...

@davexparker davexparker force-pushed the master branch 2 times, most recently from ca12ca0 to 6bf73df Compare January 12, 2024 14:23
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants