fix(spike): Replace posh with reactive DS #665

Closed
lambduhh wants to merge 3 commits into athensresearch:master from lambduhh:athens/spike/570-DB-performance-upgrade

Conversation

@lambduhh
Contributor

Proposed changes to eliminate posh from the code base, since it has known performance issues, and replace it with reactive datascript. athens.rxdb is meant to be a drop-in replacement for posh, with the exception that db/dsdb doesn't need to be provided as an argument.

There seem to be a lot of regression errors that result from posh functions returning a different spec than datascript; these can be fixed if it is determined that the performance upgrade is worth it. This seemed like a good point to stop and check before going any further.

@tangjeff0 Please pull down branch and run performance tests or tell me a method of reproducing a performance test so I can try it out myself.

@lambduhh lambduhh self-assigned this Feb 18, 2021
Comment thread src/cljs/athens/db.cljs


(d/listen! dsdb :history
#_(d/listen! dsdb :history
Contributor Author

Causing the app to crash, probably due to a difference in spec between posh/ds. Commented out as a short fix just to get the app running to test.

(dissoc :merge-prompt))
:timeout {:action :clear
:id :merge-prompt}})))
(try
Contributor Author

same issue as above, short fix just to test

Comment thread src/cljs/athens/util.cljs

;; TODO: move all these DOM utilities to a .cljs file instead of cljc
(defn scroll-top! [element pos]
(defn scroll-top!
Contributor Author

cljstyle did this?



(d/listen! dsdb :devtool/open listener)
#_(d/listen! dsdb :devtool/open listener)
Contributor Author

short fix, regressions

(ns athens.views.graph-page
(:require
["react-force-graph" :as rfg]
#_["react-force-graph" :as rfg]
Contributor Author

couldn't get this working locally, short fix

@lambduhh
Contributor Author

I've marked the somewhat irrelevant changes that I made to fix regressions in the short term (just to get things working so we can test). We may even need help putting everything back together, but it didn't make sense to put in any extra time before bringing this to the team for review and thoughts. Let me know what you think!

@jefftangx
Collaborator

It doesn't look like dbrx.cljs is part of this PR @lambduhh

Collaborator

@jefftangx jefftangx left a comment

https://www.loom.com/share/189603c0e90947359700061563960a60

Main points from the loom

  • The initial prototype could've been 95% fewer LoC changed
  • Could've thought through places that might be good to test performance before starting this out! (or asking)

Next steps: start a new branch with only 1 place where a posh query is replaced with the make-reaction. A/B test the performance with a transaction that should lead to a reaction.

@lambduhh
Contributor Author

Thanks for the code review! I appreciate your analysis of thinking like a scientist. It inspires me to tell the story of this PR and structure my response in the same way.

Observation
Athens is having performance issues related to large amounts of data. This has been observed and communicated by users in the Discord and by Jeff when uploading his Roam data (which is rather large).

Question
How can we improve (and how do we measure) the performance of Athens when large data is present?

Additional Information

  • The current implementation of Athens uses posh as a reactive database for datascript. Posh is no longer actively maintained and has been abandoned by its owner.
  • When posh was created, perhaps its speed was comparable to DS, but it is currently 11 releases behind the latest DS version, which has vastly improved in those updates.
  • Reviewing the posh codebase, it is observed that posh does a lot in addition to just serving as a reactive db for ds, most noticeably involving caching.
  • This led me to inquire on the Discord about the motivations for using posh, and whether it was due to posh's caching abilities or simply its role as a reactive db.
    The response was that it was chosen as a reactive db rather than for caching.

Hypothesis
H0 - Replacing posh with datascript's native reactive db capabilities will have no effect on the performance of Athens when a large amount of data is present.
H1 - Replacing posh with ds's native reactive db will improve Athens's performance by decreasing loading times when a large amount of data is present.

Testing

Trivial
Load various sizes of data into DS and Posh connections and measure the performance of queries, pulls, and transactions on:

  • Initial actions
  • Subsequent actions

Could use the time or simple-benchmark macros to measure individual runtimes of functions when running Athens with small amounts of data vs. large amounts of data. (Note: we should get some sort of standard file that represents the working definition of "large amount of data" that can be used as a benchmark for upgrades.)
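As a rough sketch of what that measurement could look like (assuming `db/dsdb` is the app's datascript connection; the query is the all-pages query quoted later in this thread):

```clojure
(require '[datascript.core :as d])

;; One-off wall-clock timing of a single query:
(time
  (d/q '[:find [?e ...]
         :where [?e :node/title ?t]]
       @db/dsdb))

;; cljs.core/simple-benchmark evaluates the body `iterations` times
;; and prints the total elapsed time:
(simple-benchmark [db @db/dsdb]
  (d/q '[:find [?e ...]
         :where [?e :node/title ?t]]
       db)
  100)
```

Running the same forms against a small db and a large db would give a first-order sense of how runtime scales with data size.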

The value proposition of Posh is that it performs a series of calculations to determine whether a query/pull that has been executed previously needs to be recalculated due to new transactions, or whether a cached version of the query can be returned instead. With no intermediate transactions, all supported Posh pulls and queries will be memoized.

  • The price paid for this is longer Posh transaction times as a substantial amount of work is done to index the datoms.
  • Additionally, even a modest amount of data churn will cause a substantial amount of new meta-calculations to determine if a cached response can be returned OR if a new query results need to be calculated.

Therefore, Posh is most suitable for long sessions of analytical workloads where

  • repeated queries are expected and
  • data churn is minimal.

There is an inflection point of expected data complexity and churn beyond which the speed of Datascript (currently the fastest known in-memory datalog processing engine, even faster than Datomic-mem) outweighs the possible value of cached posh queries.

Whether an application is above that threshold cannot be determined with trivial testing; it must be tested based on the behavior of the actual application.

Realistic

Comparing Posh vs. DS in a realistic scenario involves comparing the behavior of the application with a large volume of data under normal and abnormal operating conditions (using the methods listed above). Since Athens is a heavy user-input application, there is a significant amount of data churn. Therefore, there is a strong reason to suspect that reactive datascript (since it skips the extra indexing and cache-check calculations) would be more performant than posh.

Why is it not possible to replace only a subset of the functionality with reactive Datascript? (as suggested in the PR review as a minimum viable prototype)

(defn posh! [dcfg & conns]
  (let [posh-atom (atom {})]
    (reset! posh-atom
            (loop [n 0
                   conns conns
                   posh-tree (-> (p/empty-tree dcfg [:results])
                                 (assoc :ratoms {}
                                        :reactions {}))]
              (if (empty? conns)
                posh-tree
                (recur (inc n)
                       (rest conns)
                       (let [db-id (keyword (str "conn" n))]
                         (p/add-db posh-tree
                                   db-id
                                   (set-conn-listener! dcfg posh-atom (first conns) db-id)
                                   (:schema @(first conns))))))))))

Note that posh! replaces the existing connection with a p/empty-tree.
This p/empty-tree is in turn expected by make-query-reaction, which is called by p/pull, p/pull-many, and p/q.
Additionally, p/transact expects a p/empty-tree.

  • A reactive datascript conn, which is simply a Reagent atom, is therefore fundamentally incompatible with Posh functions. The call to p/posh! would destroy the reactive Datascript conn, and the reactive Datascript conn cannot be used by p/pull, p/pull-many, p/q, or p/transact.

Therefore, the only way to test the change in realistic circumstances is to replace all the Posh calls with equivalent reactive Datascript calls, as the p/empty-tree and the reactive Datascript conn are mutually exclusive. The current implementation is the least invasive method I could get to run.
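For context, here is a hedged sketch of what "reactive Datascript" means in this discussion: a query whose results live in a Reagent atom, refreshed by a d/listen! callback on every transaction, and exposed through reagent.ratom/make-reaction. The function name `reactive-q` is illustrative, not part of the PR:

```clojure
(ns example.rxdb
  (:require
    [datascript.core :as d]
    [reagent.ratom :as ratom]))

;; Illustrative sketch: keep query results in a Reagent atom that is
;; refreshed whenever the conn sees a transaction; components deref
;; the returned reaction and re-render when the results change.
(defn reactive-q
  "Returns a reaction over running `query` against `conn`."
  [conn query]
  (let [results (ratom/atom (d/q query @conn))]
    (d/listen! conn (keyword (gensym "rx"))
               (fn [_tx-report]
                 (reset! results (d/q query @conn))))
    (ratom/make-reaction #(deref results))))
```

Unlike posh, nothing here inspects the transaction data to decide whether recomputation is needed; the query simply re-runs and Reagent's equality check decides whether dependents update.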

@jefftangx
Collaborator

Really great write-up! I'm stuck on two assumptions:

Since Athens is a heavy user input application, there is a significant amount of data churn.

Does it have to be true that heavy user input ~> data churn? What if the user input was all on the same page? In that case, it seems that a lot of the caching would be useful.

Therefore, the only way to test the change in realistic circumstances is to replace all the Posh calls with equivalent reactive Datascript calls, as the p/empty-tree and the reactive Datascript conn are mutually exclusive.

My understanding is that the way Reagent/React works, only the queries on the virtual DOM are relevant. Changes to data ~> re-compute the necessary subscriptions/reactions ~> necessary UI changes.

For instance, all-tables has one query:

(d/q '[:find [?e ...]
       :where
       [?e :node/title ?t]]
     @db/dsdb)

And node-page or block-page likewise only have 2 queries at the root level:

  1. get-node-document/get-block-document
  2. get-linked-references

If I'm understanding this correctly, you're saying we wouldn't be able to test the performance of replacing one of these posh queries with a different query?


Regarding "large data", there are probably two main contexts. One would be if the current page had a lot of data. One would be if the db was large, but the current page was negligibly small.

For the former, we could use Jake's note, which he volunteered in Discord: https://discord.com/channels/708122962422792194/709246147549593601/809965655302340621. I copied and pasted this into this branch, but it seemed to crash Athens. I'm guessing it's a parser issue.

For the latter, we could upload a large public Roam, or just have a large enough db. My current index.transit is 4.8MB, but I probably shouldn't share it here. This is probably the use case that is more interesting. I think it's pretty strange that my db is noticeably slower even on a small page. The performance of a single page shouldn't be influenced by the size of the db, which is in the background. In order to test this out, we may want to download and import the edn file of a large public Roam DB such as https://roamresearch.com/#/app/roam-book-club-2. This is 30+MB, so it's quite a bit larger. You could do this by checking out branch #561 and using the upload feature there.

I think the biggest missing piece is how to actually benchmark performance. Would love to hear your thoughts on all the above!

@lambduhh
Copy link
Copy Markdown
Contributor Author

user-input → data churn

I agree completely with the point that having a cache is useful. I
propose that we can have better cache behavior without posh
and maintain all the benefits.

A few points to consider on this:

  1. posh cache behavior is non-deterministic and expensive. In posh,
    it is not obvious what the rules are which determine when a query
    will result in a cache hit vs. recomputation.

  2. re-frame and reagent offer fine-grained, deterministic cache
    control mechanisms with re-frame subscriptions and
    reagent.ratom/make-reaction, respectively.

    The rules for these mechanisms are very clear -- whenever the inputs
    change, the calculation is recomputed.

    There is also a paradigm for tiered levels of
    subscriptions/reactions for even finer grained control over this
    behavior.

    See re-frame subscriptions for more detail on this model. (Note: we can also accomplish this behavior with reagent.ratom/make-reaction, but this page captures the philosophy well.)

  3. Even without the cache, Datascript is so performant that it will
    almost never be necessary to use a cache unless we are updating
    hundreds of elements per second.

  4. We can assume that posh will not be getting any further updates,
    while reagent, re-frame, and datascript continue to receive
    community support and updates.
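A hedged sketch of what tiered reactions could look like (all names here are hypothetical): a coarse upstream source feeds narrower downstream reactions, so each downstream layer only recomputes when the value it actually derefs changes:

```clojure
(ns example.tiers
  (:require [reagent.ratom :as ratom]))

;; Tier 1: a coarse source of truth (a plain Reagent atom standing in
;; for the db in this sketch).
(def app-db (ratom/atom {:pages {"Welcome" {:blocks 3}}}))

;; Tier 2: extracts just the :pages map; recomputes when app-db changes.
(def pages-rx
  (ratom/make-reaction #(:pages @app-db)))

;; Tier 3: derived from tier 2, not from app-db directly, so an app-db
;; change that leaves :pages equal does not propagate past tier 2.
(def page-count-rx
  (ratom/make-reaction #(count @pages-rx)))
```

This is the same layered-signal-graph idea the re-frame subscription docs describe, expressed directly with make-reaction.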

When do components update?

For an authoritative answer on this question, here is the official documentation on when components update.

The important thing to note is that posh introduces a second layer
of cache on top of reagent/re-frame. posh first checks internally to
see if it thinks a query/pull needs to be recomputed based on
intermediate transactions. If so, posh will perform the new
calculation and pass the result to the reagent component /
subscription. Then the reagent/re-frame component/subscription will
perform the same check -- if the input is different than the previous
input, it will recompute the output.

Why can't we test one of the queries in isolation?

We can test non-reactive queries in isolation, but we can't test
reactive queries in isolation without introducing a data access
layer of abstraction. This is because the reactive
datascript reagent/atom and the posh tree are fundamentally
incompatible.

Data access layer of abstraction

While not the point of this PR, it's not a bad practice to adopt a
three-layer model of data-access abstraction, as it will make it easier in
the future if we need to change the underlying storage (e.g., GraphQL,
asami, datahike) again.

Example table component:

;; before
(defn table
  []
  (let [pages (r/atom (->> (d/q '[:find [?e ...]
                                  :where
                                  [?e :node/title ?t]]
                                @db/dsdb)
                           (p/pull-many db/dsdb '["*" :block/_refs {:block/children [:block/string] :limit 5}])
                           deref
                           (sort-by (fn [x] (count (:block/_refs x))))
                           reverse))]
    (fn [] ...)))
;; after
(defn table
  []
  (let [pages (athens.data.rx/top-n-pages 5)]
    (fn [] ...)))
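One possible shape for the hypothetical athens.data.rx/top-n-pages helper (illustrative only; this namespace does not exist in the PR, and it assumes `conn` is a reactive Datascript conn, i.e. a Reagent atom holding the db, so derefing it inside the reaction establishes the dependency):

```clojure
(ns athens.data.rx
  (:require
    [datascript.core :as d]
    [reagent.ratom :as ratom]))

;; Illustrative sketch: the data-access layer owns both the query and
;; the reactivity, so callers never touch posh or datascript directly.
(defn top-n-pages
  "Reaction over all pages sorted by ref count, pulling `n` child strings."
  [conn n]
  (ratom/make-reaction
    (fn []
      (let [db  @conn
            ids (d/q '[:find [?e ...] :where [?e :node/title ?t]] db)]
        (->> (d/pull-many db ["*" :block/_refs
                              {:block/children [:block/string] :limit n}]
                          ids)
             (sort-by #(count (:block/_refs %)) >))))))
```

Swapping the storage engine later would then only require changing this namespace, not every component.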

Specific test

Also note that in this case, it is not possible to test reactive
datascript in isolation. The reactive behavior of the page comes from

(p/pull-many db/dsdb '["*" :block/_refs {:block/children [:block/string] :limit 5}])
deref

not

(d/q '[:find [?e ...]
      :where
      [?e :node/title ?t]]
      @db/dsdb)   

The p/pull-many is the reactive part. That's the part that we would
need to change to reactive Datascript. The d/q is already plain
datascript. It would be more difficult to figure out how to replace
this one query than to simply replace all the queries with reactive
datascript.

Testing proposals

I like the idea of testing using a standard DB. Ultimately I think that
the user experience, not benchmarks, is what will be important, and your
observation that small page performance should not suffer when there is
a large database is key.

An important note here is that using reagent/re-frame deterministic
cache control, we will be able to isolate the behavior of the page from
the DB.

This comes back to the undesirable two-layer cache with posh -- posh
will update indexes regardless of whether a component is mounted or
not. reagent/re-frame will not check subscriptions for unmounted
components.

Therefore, if we are encountering behavioral bottlenecks ("clicking X
causes a page slowdown"), we can implement data-tiering on downstream
subscriptions/reactions to isolate the rerender.

The important thing, I think, is not the load time, but the user
experience. It doesn't matter if the loading is fast if it causes page
jank -- and it doesn't matter if the loading is slow if the user doesn't
notice and still has a good experience.

So I think we should go with your suggestion of testing using a DB that
is representative of the interesting cases of user behavior.

Then we can open a parallel reactive Datascript branch and work to bring
it up to parity with the main branch, implementing data-tiering to
isolate any slow behavior.

Once we are convinced there are no regressions and the user experience
is enhanced, we can sunset posh and make the switch.

@jefftangx
Collaborator

jefftangx commented Feb 23, 2021

This is really interesting @lambduhh. Thank you for your deep dive on all this!!


For tiered subscriptions, are you suggesting tiers of datascript queries? That's interesting.


Even without the cache, Datascript is so performant that it will
almost never be necessary to use a cache unless we are updating
hundreds of elements per second.

I agree here, I'm not 100% certain we need a cache yet.


It would be more difficult to figure out how to replace
this one query than to simply replace all the queries with reactive
datascript.

I don't understand this one. Why wouldn't we be able to swap just this one out? Either way, I don't think all-pages is actually a good query to test. You make the point later on that UX matters more than benchmarks. In this case, UX around writing is more important than all-pages.


Ultimately I think that
the user experience, not benchmarks, is what will be important

Great point. 100% agree


This comes back to the undesirable two-layer cache with posh -- posh
will update indexes regardless of whether a component is mounted or
not. reagent~/~re-frame will not check subscriptions for unmounted
components.

This implementation would definitely explain why posh is slow on small pages but big db. I don't totally understand how posh's caching system works, so I will take your word for it right now. If it is true, I would definitely be off by a metric boatload saying this could be done in 95% less LoC 😅

That being said, I wasn't able to test this on my personal DB without it crashing (https://www.loom.com/share/98dcfc61c84344dea3b22ccc1dafa232). I don't know the best way to get you a large db, other than downloading a large public one and then importing it off of this branch #561.

I don't totally understand how make-reaction works either. I do remember @jeroenvandijk tried to improve block performance a long while ago with cursor, which I see uses the same thing under the hood, though I think that had more to do with react than datascript. Wonder if Jeroen has any thoughts there, or on any other points in this conversation!

@pithyless
Contributor

pithyless commented Mar 3, 2021

I did a little snooping and found there is a significant constant overhead with athens.effects/walk-transact, which may be interesting to explore. Eliding several rabbit holes and dead ends I journeyed, here's a quick repro:

  1. given the initial athens db (with the Welcome page)
  2. opening and closing nested blocks on the Welcome page (what is demoed on the animation attached to "You can open and close blocks that have children.")
  3. timing (transact! db/dsdb final-tx-data) inside athens.effects/walk-transact takes ~30ms (on my machine)
  4. The transaction, though, is just doing this:
[[:db/add [:block/uid "6aecd4172"] :block/open true]]
  5. Adding a bit of debug code to the function:
(let [more-tx-data  (parse-for-links with-tx)
      final-tx-data (vec (concat tx-data more-tx-data))]
   (pprint final-tx-data)

   (let [fake-db (-> (d/datoms @db/dsdb :eavt)
                     (d/init-db  db/schema)
                     (d/conn-from-db))]
     (prn "first datascript transact!")
     (time (d/transact! fake-db final-tx-data))

     (prn "second datascript transact!")
     (time (d/transact! fake-db final-tx-data)))

   (prn "first posh transact!")
   (time (transact! db/dsdb final-tx-data))

   (prn "second posh transact!")
   (time (transact! db/dsdb final-tx-data))

   (let [outputs (:tx-data (transact! db/dsdb final-tx-data))]
     (ph-link-created! outputs)))
  6. We see outputs like this:
[[:db/add [:block/uid "7e409b1cb"] :block/open false]]

core.cljs:198 "first datascript transact!"
core.cljs:198 "Elapsed time: 0.220000 msecs"

core.cljs:198 "second datascript transact!"
core.cljs:198 "Elapsed time: 0.115000 msecs"

core.cljs:198 "first posh transact!"
core.cljs:198 "Elapsed time: 42.965000 msecs"

core.cljs:198 "second posh transact!"
core.cljs:198 "Elapsed time: 3.050000 msecs"

Looks to me like something is adding quite a lot of overhead for a seemingly simple update (the second runs measure noop overhead). This is time spent toggling a flag, before we even start parsing and rendering the blocks; seems to be a potentially janky experience even without lots of data.

If you've got a bigger database lying around, I wonder if the overhead is constant/linear/exponential with more data (irrespective of how small the transaction change is). And perhaps the problem is not with posh itself, but with a subscription? No idea, have not explored this further. But maybe this will be a useful anecdote as a jumping-off point for further digging. :)

@lambduhh
Contributor Author

lambduhh commented Mar 3, 2021

From @pithyless on discord pasting so all info is in one place:
"Side note: concat always raises an eyebrow; this (vec (concat tx-data more-tx-data)) should probably be (into tx-data more-tx-data), or if tx-data is not guaranteed to be a vector, perhaps (reduce into [] [tx-data more-tx-data]).

PPS. I forgot to mention it in the PR, but +1 to pushing for a "Data access layer of abstraction"; this level of indirection would really help with identifying performance bottlenecks and testing alternative strategies, but also with documenting where the app is changing state."

totally agree with both points 💯
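For reference, the suggested rewrite is behavior-preserving; a quick sketch of all three forms on sample tx data:

```clojure
;; (vec (concat a b)) builds a lazy seq and then copies it into a new
;; vector; (into a b) conj's b's elements directly onto the vector a.
(def tx-data      [[:db/add 1 :block/open true]])
(def more-tx-data [[:db/add 2 :block/open false]])

(= (vec (concat tx-data more-tx-data))
   (into tx-data more-tx-data)
   (reduce into [] [tx-data more-tx-data]))
;; => true -- same result; `into` just skips the intermediate lazy seq
```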

@pithyless
Contributor

I dug a little further into why things are taking so long.

There exists a function posh.lib.pull-analyze/pull-analyze, which calls out to posh.lib.datom-matcher/reduce-pattern, which in turn is a recursive function with a helper (posh.lib.datom-matcher/combine-entids) that is very eager and builds up a seq-of-seq-of-seqs...

This is how often a single show/hide block action will trigger the recursion on the Welcome page:

[screenshot: cache-welcome]

And this is the same action, but on a much larger markdown file:

[screenshot: cache-markdown]

Essentially, posh is working very hard to make sure we only query the bare-minimum; so in this case literally enumerating certain attributes for any potentially related blocks on the entire page. Which totally defeats the purpose, because it would have been faster to even naively read in the entire db, render the blocks, and rely solely on React vdom diffing.

This kind of detailed caching strategy is appropriate when querying the database is prohibitively expensive, but that's not the case for Athens. It's also in stark contrast to the approach e.g. Fulcro uses, where it only tracks which ids may have changed and then queries the DB directly.

All the recursions, conses, seqs, mapcats, and concats made me also wonder about how much GC pressure is being generated by this kind of operation. And sure enough, it's clearly visible (screenshot includes some debugging vars and prn; feel free to ignore those):

[screenshot: pull-vs-patterns]

Turns out calculating which attributes may have changed in this specific example was 8x more expensive than the actual datascript query to pull all that data back out.

This is definitely something that could be potentially refactored and improved upstream in posh, but irrespective I agree that the kinds of operations and updates Athens is interested in do not necessarily benefit from posh caching strategies and perhaps it would be better to avoid them altogether.

@jefftangx
Collaborator

Is this still being worked on @lambduhh? If no, I'm sure someone would be happy to build off of it. My DB is getting really slow 😿

@lambduhh
Contributor Author

lambduhh commented Mar 8, 2021

Yeah, since time is a concern it may be quicker to let somebody else take the reins (I've been preoccupied with a death in my family). Perhaps @pithyless?

@jefftangx
Collaborator

Sorry to hear that. Let us know if you need anything ❤️

@pithyless
Contributor

@lambduhh First and foremost, take care of yourself and your family. ❤️

I think I can find the time to take this PR off your hands and to run with it.

pithyless added a commit to pithyless/athens that referenced this pull request Mar 9, 2021
Solution was originally proposed by @lambduhh in
athensresearch#665.
See PR discussion for details why posh is considered
slow and a mismatch for Athens UI performance right now.

This version of the solution introduces a new namespace
`athens.posh` as a form of indirection. This way the rest
of the codebase does not need any further changes (aside from
the namespace require) and also allows easy switching between
the two different databases to compare performance and
correctness (via `athens.posh/version`).

The assumption is that later this hybrid duality can be elided
and `athens.posh` refactored into a more robust data fetching layer.
@pithyless pithyless mentioned this pull request Mar 9, 2021
@jefftangx jefftangx closed this Apr 16, 2021