fix(spike): Replace posh with reactive DS #665

Closed
lambduhh wants to merge 3 commits into athensresearch:master from lambduhh:athens/spike/570-DB-performance-upgrade

Conversation

@lambduhh
Contributor

Proposed changes to eliminate posh from the code base, since it has known performance issues, and replace it with reactive datascript. athens.rxdb is meant to be a drop-in replacement for posh, with the exception that db/dsdb doesn't need to be provided as an argument.

There seem to be a lot of regression errors that result from posh functions returning a different spec than datascript; these can be fixed if it is determined that the performance upgrade is worth it. This seemed like a good point to stop and check before going any further.

@tangjeff0 Please pull down branch and run performance tests or tell me a method of reproducing a performance test so I can try it out myself.

@lambduhh lambduhh self-assigned this Feb 18, 2021
Comment thread src/cljs/athens/db.cljs


(d/listen! dsdb :history
#_(d/listen! dsdb :history
Contributor Author

Causing the app to crash, probably due to a difference in spec between posh/ds. Commented out as a short fix just to get the app running to test.

(dissoc :merge-prompt))
:timeout {:action :clear
:id :merge-prompt}})))
(try
Contributor Author

same issue as above, short fix just to test

Comment thread src/cljs/athens/util.cljs

;; TODO: move all these DOM utilities to a .cljs file instead of cljc
(defn scroll-top! [element pos]
(defn scroll-top!
Contributor Author

cljstyle did this?



(d/listen! dsdb :devtool/open listener)
#_(d/listen! dsdb :devtool/open listener)
Contributor Author

short fix, regressions

(ns athens.views.graph-page
(:require
["react-force-graph" :as rfg]
#_["react-force-graph" :as rfg]
Contributor Author

couldn't get this working locally, short fix

@lambduhh
Contributor Author

I've marked the somewhat irrelevant changes that I made to fix regressions in the short term (just to get things working so we can test). We may even need help putting everything back together, but it didn't make sense to put in any extra time before bringing this to the team for review and thoughts. Let me know what you think!

@jefftangx
Collaborator

It doesn't look like dbrx.cljs is part of this PR @lambduhh

Collaborator

@jefftangx jefftangx left a comment

https://www.loom.com/share/189603c0e90947359700061563960a60

Main points from the loom

  • The initial prototype could've been 95% fewer LoC changed
  • Could've thought through places that might be good to test performance before starting this out! (or asking)

Next steps: start a new branch with only 1 place where a posh query is replaced with the make-reaction. A/B test the performance with a transaction that should lead to a reaction.

@lambduhh
Contributor Author

Thanks for the code review! I appreciate your analysis of thinking like a scientist. It inspires me to tell the story of this PR and structure my response in the same way.

Observation
Athens is having performance issues related to large amounts of data. This has been observed and communicated by users in the Discord and by Jeff when uploading his Roam data (which is rather large).

Question
How can we improve (and how do we measure) the performance of Athens when large data is present?

Additional Information

  • The current implementation of Athens uses posh as a reactive database for datascript. Posh is no longer actively maintained and has been abandoned by its owner.
  • When posh was created, perhaps its speed was comparable to DS, but it is currently 11 releases behind the latest DS version, which has vastly improved in those updates.
  • Reviewing the posh codebase, it is observed that posh does a lot in addition to just serving as a reactive db for ds, most noticeably involving caching.
  • This led me to inquire on the Discord about the motivations for using posh, and whether it was due to posh's caching abilities or simply its role as a reactive db.
    The response was that it was chosen as a reactive db rather than for caching.

Hypothesis
H0 - Replacing posh with datascript's native reactive db capabilities will have no effect on the performance of Athens when a large amount of data is present.
H1 - Replacing posh with ds's native reactive db will improve Athens's performance by decreasing loading times when a large amount of data is present.

Testing

Trivial
Load various sizes of data into DS and Posh connections and measure the performance of queries, pulls, and transactions on:

  • Initial actions
  • Subsequent actions

Could use the time or simple-benchmark macros to measure individual runtimes of functions when running Athens with small amounts of data vs. large amounts of data. (Note: we should get some sort of standard file that represents the working definition of "large amount of data" that can be used as a benchmark for upgrades.)
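As a rough sketch of what that measurement could look like (assuming `db/dsdb` is the app's datascript connection; the query is the all-pages query quoted later in this thread):

```clojure
(require '[datascript.core :as d])

;; One-off wall-clock timing of a single query:
(time
  (d/q '[:find [?e ...]
         :where [?e :node/title ?t]]
       @db/dsdb))

;; cljs.core/simple-benchmark evaluates the body `iterations` times
;; and prints the total elapsed time:
(simple-benchmark [db @db/dsdb]
  (d/q '[:find [?e ...]
         :where [?e :node/title ?t]]
       db)
  100)
```

Running the same forms against a small db and a large db would give a first-order sense of how runtime scales with data size.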

The value proposition of Posh is that it performs a series of calculations to determine whether a query/pull that has been executed previously needs to be recalculated due to new transactions, or whether a cached version of the query can be returned instead. With no intermediate transactions, all supported Posh pulls and queries will be memoized.

  • The price paid for this is longer Posh transaction times as a substantial amount of work is done to index the datoms.
  • Additionally, even a modest amount of data churn will cause a substantial amount of new meta-calculations to determine if a cached response can be returned OR if a new query results need to be calculated.

Therefore, Posh is most suitable for long sessions of analytical workloads where

  • repeated queries are expected and
  • data churn is minimal.

There is an inflection point of expected data complexity and churn beyond which the speed of Datascript (currently the fastest known in-memory datalog processing engine, even faster than Datomic-mem) outweighs the possible value of cached posh queries.

Whether an application is above that threshold cannot be determined with trivial testing; it must be tested based on the behavior of the actual application.

Realistic

Comparing Posh vs. DS in a realistic scenario involves comparing the behavior of the application with a large volume of data under normal and abnormal operating conditions (using the methods listed above). Since Athens is a heavy user-input application, there is a significant amount of data churn. Therefore, there is a strong reason to suspect that reactive datascript (since it skips the extra indexing and cache-check calculations) would be more performant than posh.

Why is it not possible to replace only a subset of the functionality with reactive Datascript? (as suggested in the PR review as a minimum viable prototype)

(defn posh! [dcfg & conns]
  (let [posh-atom (atom {})]
    (reset! posh-atom
            (loop [n 0
                   conns conns
                   posh-tree (-> (p/empty-tree dcfg [:results])
                                 (assoc :ratoms {}
                                        :reactions {}))]
              (if (empty? conns)
                posh-tree
                (recur (inc n)
                       (rest conns)
                       (let [db-id (keyword (str "conn" n))]
                         (p/add-db posh-tree
                                   db-id
                                   (set-conn-listener! dcfg posh-atom (first conns) db-id)
                                   (:schema @(first conns))))))))))

Note that posh! replaces the existing connection with a p/empty-tree.
This p/empty-tree is in turn expected by make-query-reaction, which is called by p/pull, p/pull-many, and p/q.
Additionally, p/transact expects a p/empty-tree.

  • A reactive datascript conn, which is simply a Reagent atom, is therefore fundamentally incompatible with Posh functions. The call to p/posh! would destroy the reactive Datascript conn, and the reactive Datascript conn cannot be used by p/pull, p/pull-many, p/q, or p/transact.

Therefore, the only way to test the change in realistic circumstances is to replace all the Posh calls with equivalent reactive Datascript calls, as the p/empty-tree and the reactive Datascript conn are mutually exclusive. The current implementation is the least invasive method I could get to run.
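For context, here is a hedged sketch of what "reactive Datascript" means in this discussion: a query whose results live in a Reagent atom, refreshed by a d/listen! callback on every transaction, and exposed through reagent.ratom/make-reaction. The function name `reactive-q` is illustrative, not part of the PR:

```clojure
(ns example.rxdb
  (:require
    [datascript.core :as d]
    [reagent.ratom :as ratom]))

;; Illustrative sketch: keep query results in a Reagent atom that is
;; refreshed whenever the conn sees a transaction; components deref
;; the returned reaction and re-render when the results change.
(defn reactive-q
  "Returns a reaction over running `query` against `conn`."
  [conn query]
  (let [results (ratom/atom (d/q query @conn))]
    (d/listen! conn (keyword (gensym "rx"))
               (fn [_tx-report]
                 (reset! results (d/q query @conn))))
    (ratom/make-reaction #(deref results))))
```

Unlike posh, nothing here inspects the transaction data to decide whether recomputation is needed; the query simply re-runs and Reagent's equality check decides whether dependents update.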

@jefftangx
Collaborator

Really great write-up! I'm stuck on two assumptions:

Since Athens is a heavy user input application, there is a significant amount of data churn.

Does it have to be true that heavy user input ~> data churn? What if the user input was all on the same page? In that case, it seems that a lot of the caching would be useful.

Therefore, the only way to test the change in realistic circumstances is to replace all the Posh calls with equivalent reactive Datascript calls, as the p/empty-tree and the reactive Datascript conn are mutually exclusive.

My understanding is that the way Reagent/React works, only the queries on the virtual DOM are relevant. Changes to data ~> re-compute the necessary subscriptions/reactions ~> necessary UI changes.

For instance, all-tables has one query:

(d/q '[:find [?e ...]
       :where
       [?e :node/title ?t]]
     @db/dsdb)

And node-page or block-page likewise only have 2 queries at the root level:

  1. get-node-document/get-block-document
  2. get-linked-references

If I'm understanding this correctly, you're saying we wouldn't be able to test the performance of replacing one of these posh queries with a different query?


Regarding "large data", there are probably two main contexts. One would be if the current page had a lot of data. One would be if the db was large, but the current page was negligibly small.

For the former, we could use Jake's note, which he volunteered in Discord: https://discord.com/channels/708122962422792194/709246147549593601/809965655302340621. I copied and pasted this into this branch, but it seemed to crash Athens. I'm guessing it's a parser issue.

For the latter, we could upload a large public Roam, or just have a large enough db. My current index.transit is 4.8MB, but I probably shouldn't share it here. This is probably the use case that is more interesting. I think it's pretty strange that my db is noticeably slower even on a small page. The performance of a single page shouldn't be influenced by the size of the db, which is in the background. In order to test this out, we may want to download and import the edn file of a large public Roam DB such as https://roamresearch.com/#/app/roam-book-club-2. This is 30+MB, so it's quite a bit larger. You could do this by checking out branch #561 and using the upload feature there.

I think the biggest missing piece is how to actually benchmark performance. Would love to hear your thoughts on all the above!

@lambduhh
Copy link
Copy Markdown
Contributor Author

user-input → data churn

I agree completely with the point that having a cache is useful. I
propose that we can have better cache behavior without posh
and maintain all the benefits.

A few points to consider on this:

  1. posh cache behavior is non-deterministic and expensive. In posh,
    it is not obvious what the rules are which determine when a query
    will result in a cache hit vs. recomputation.

  2. re-frame and reagent offer fine-grained, deterministic cache
    control mechanisms with re-frame subscriptions and
    reagent.ratom/make-reaction, respectively.

    The rules for these mechanisms are very clear -- whenever the inputs
    change, the calculation is recomputed.

    There is also a paradigm for tiered levels of
    subscriptions/reactions for even finer grained control over this
    behavior.

    See re-frame subscriptions for more detail on this model. (Note: we can also accomplish this behavior with reagent.ratom/make-reaction, but this page captures the philosophy well.)

  3. Even without the cache, Datascript is so performant that it will
    almost never be necessary to use a cache unless we are updating
    hundreds of elements per second.

  4. We can assume that posh will not be getting any further updates,
    while reagent, re-frame, and datascript continue to receive
    community support and updates.
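A hedged sketch of what tiered reactions could look like (all names here are hypothetical): a coarse upstream source feeds narrower downstream reactions, so each downstream layer only recomputes when the value it actually derefs changes:

```clojure
(ns example.tiers
  (:require [reagent.ratom :as ratom]))

;; Tier 1: a coarse source of truth (a plain Reagent atom standing in
;; for the db in this sketch).
(def app-db (ratom/atom {:pages {"Welcome" {:blocks 3}}}))

;; Tier 2: extracts just the :pages map; recomputes when app-db changes.
(def pages-rx
  (ratom/make-reaction #(:pages @app-db)))

;; Tier 3: derived from tier 2, not from app-db directly, so an app-db
;; change that leaves :pages equal does not propagate past tier 2.
(def page-count-rx
  (ratom/make-reaction #(count @pages-rx)))
```

This is the same layered-signal-graph idea the re-frame subscription docs describe, expressed directly with make-reaction.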

When do components update?

For an authoritative answer on this question, here is the official documentation on when components update.

The important thing to note is that posh introduces a second layer
of cache on top of reagent/re-frame. posh first checks internally to
see if it thinks a query/pull needs to be recomputed based on
intermediate transactions. If so, posh will perform the new
calculation and pass the result to the reagent component /
subscription. Then the reagent/re-frame component/subscription will
perform the same check -- if the input is different than the previous
input, it will recompute the output.

Why can't we test one of the queries in isolation?

We can test non-reactive queries in isolation, but we can't test
reactive queries in isolation without introducing a data access
layer of abstraction. This is because the reactive
datascript reagent/atom and the posh tree are fundamentally
incompatible.

Data access layer of abstraction

While not the point of this PR, it's not a bad practice to adopt a
three-layer model of data-access abstraction, as it will make it easier in
the future if we need to change the underlying storage (e.g., GraphQL,
asami, datahike) again.

Example table component:

;; before
(defn table
  []
  (let [pages (r/atom (->> (d/q '[:find [?e ...]
                                  :where
                                  [?e :node/title ?t]]
                                @db/dsdb)
                           (p/pull-many db/dsdb '["*" :block/_refs {:block/children [:block/string] :limit 5}])
                           deref
                           (sort-by (fn [x] (count (:block/_refs x))))
                           reverse))]
    (fn [] ...)))
;; after
(defn table
  []
  (let [pages (athens.data.rx/top-n-pages 5)]
    (fn [] ...)))
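One possible shape for the hypothetical athens.data.rx/top-n-pages helper (illustrative only; this namespace does not exist in the PR, and it assumes `conn` is a reactive Datascript conn, i.e. a Reagent atom holding the db, so derefing it inside the reaction establishes the dependency):

```clojure
(ns athens.data.rx
  (:require
    [datascript.core :as d]
    [reagent.ratom :as ratom]))

;; Illustrative sketch: the data-access layer owns both the query and
;; the reactivity, so callers never touch posh or datascript directly.
(defn top-n-pages
  "Reaction over all pages sorted by ref count, pulling `n` child strings."
  [conn n]
  (ratom/make-reaction
    (fn []
      (let [db  @conn
            ids (d/q '[:find [?e ...] :where [?e :node/title ?t]] db)]
        (->> (d/pull-many db ["*" :block/_refs
                              {:block/children [:block/string] :limit n}]
                          ids)
             (sort-by #(count (:block/_refs %)) >))))))
```

Swapping the storage engine later would then only require changing this namespace, not every component.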

Specific test

Also note that in this case, it is not possible to test reactive
datascript in isolation. The reactive behavior of the page comes from

(p/pull-many db/dsdb '["*" :block/_refs {:block/children [:block/string] :limit 5}])
deref

not

(d/q '[:find [?e ...]
      :where
      [?e :node/title ?t]]
      @db/dsdb)   

The p/pull-many is the reactive part. That's the part that we would
need to change to reactive Datascript. The d/q is already plain
datascript. It would be more difficult to figure out how to replace
this one query than to simply replace all the queries with reactive
datascript.

Testing proposals

I like the idea of testing using a standard DB. Ultimately I think that
the user experience, not benchmarks, is what will be important, and your
observation that small page performance should not suffer when there is
a large database is key.

An important note here is that using reagent/re-frame deterministic
cache control, we will be able to isolate the behavior of the page from
the DB.

This comes back to the undesirable two-layer cache with posh -- posh
will update indexes regardless of whether a component is mounted or
not. reagent/re-frame will not check subscriptions for unmounted
components.

Therefore, if we are encountering behavioral bottlenecks ("clicking X
causes a page slowdown"), we can implement data-tiering on downstream
subscriptions/reactions to isolate the rerender.

The important thing, I think, is not the load time, but the user
experience. It doesn't matter if the loading is fast if it causes page
jank -- and it doesn't matter if the loading is slow if the user doesn't
notice and still has a good experience.

So I think we should go with your suggestion of testing using a DB that
is representative of the interesting cases of user behavior.

Then we can open a parallel reactive Datascript branch and work to bring
it up to parity with the main branch, implementing data-tiering to
isolate any slow behavior.

Once we are convinced there are no regressions and the user experience
is enhanced, we can sunset posh and make the switch.

@jefftangx
Collaborator

jefftangx commented Feb 23, 2021

This is really interesting @lambduhh. Thank you for your deep dive on all this!!


For tiered subscriptions, are you suggesting tiers of datascript queries? That's interesting.


Even without the cache, Datascript is so performant that it will
almost never be necessary to use a cache unless we are updating
hundreds of elements per second.

I agree here, I'm not 100% certain we need a cache yet.


It would be more difficult to figure out how to replace
this one query than to simply replace all the queries with reactive
datascript.

I don't understand this one. Why wouldn't we be able to swap just this one out? Either way, I don't think all-pages is actually a good query to test. You make the point later on that UX matters more than benchmarks. In this case, UX around writing is more important than all-pages.


Ultimately I think that
the user experience, not benchmarks, is what will be important

Great point. 100% agree


This comes back to the undesirable two-layer cache with posh -- posh
will update indexes regardless of whether a component is mounted or
not. reagent~/~re-frame will not check subscriptions for unmounted
components.

This implementation would definitely explain why posh is slow on small pages but big db. I don't totally understand how posh's caching system works, so I will take your word for it right now. If it is true, I would definitely be off by a metric boatload saying this could be done in 95% less LoC 😅

That being said, I wasn't able to test this on my personal DB without it crashing (https://www.loom.com/share/98dcfc61c84344dea3b22ccc1dafa232). I don't know the best way to get you a large db, other than downloading a large public one and then importing it off of this branch #561.

I don't totally understand how make-reaction works either. I do remember @jeroenvandijk tried to improve block performance a long while ago with cursor, which I see uses the same thing under the hood, though I think that had more to do with react than datascript. Wonder if Jeroen has any thoughts there, or on any other points in this conversation!

@pithyless
Contributor

pithyless commented Mar 3, 2021

I did a little snooping and found there is a significant constant overhead with athens.effects/walk-transact, which may be interesting to explore. Eliding several rabbit holes and dead ends I journeyed, here's a quick repro:

  1. given the initial athens db (with the Welcome page)
  2. opening and closing nested blocks on the Welcome page (what is demoed on the animation attached to "You can open and close blocks that have children.")
  3. timing (transact! db/dsdb final-tx-data) inside athens.effects/walk-transact takes ~30ms (on my machine)
  4. The transaction, though, is just doing this:
[[:db/add [:block/uid "6aecd4172"] :block/open true]]
  5. Adding a bit of debug code to the function:
(let [more-tx-data  (parse-for-links with-tx)
      final-tx-data (vec (concat tx-data more-tx-data))]
   (pprint final-tx-data)

   (let [fake-db (-> (d/datoms @db/dsdb :eavt)
                     (d/init-db  db/schema)
                     (d/conn-from-db))]
     (prn "first datascript transact!")
     (time (d/transact! fake-db final-tx-data))

     (prn "second datascript transact!")
     (time (d/transact! fake-db final-tx-data)))

   (prn "first posh transact!")
   (time (transact! db/dsdb final-tx-data))

   (prn "second posh transact!")
   (time (transact! db/dsdb final-tx-data))

   (let [outputs (:tx-data (transact! db/dsdb final-tx-data))]
     (ph-link-created! outputs)))
  6. We see outputs like this:
[[:db/add [:block/uid "7e409b1cb"] :block/open false]]

core.cljs:198 "first datascript transact!"
core.cljs:198 "Elapsed time: 0.220000 msecs"

core.cljs:198 "second datascript transact!"
core.cljs:198 "Elapsed time: 0.115000 msecs"

core.cljs:198 "first posh transact!"
core.cljs:198 "Elapsed time: 42.965000 msecs"

core.cljs:198 "second posh transact!"
core.cljs:198 "Elapsed time: 3.050000 msecs"

Looks to me like something is adding quite a lot of overhead for a seemingly simple update (the second runs measure noop overhead). This is time spent toggling a flag, before we even start parsing and rendering the blocks; seems to be a potentially janky experience even without lots of data.

If you've got a bigger database lying around, I wonder if the overhead is constant/linear/exponential with more data (irrespective of how small the transaction change is). And perhaps the problem is not with posh itself, but with a subscription? No idea, have not explored this further. But maybe this will be a useful anecdote as a jumping-off point for further digging. :)

@lambduhh
Contributor Author

lambduhh commented Mar 3, 2021

From @pithyless on discord pasting so all info is in one place:
"Side note: concat always raises an eyebrow; this (vec (concat tx-data more-tx-data)) should probably be (into tx-data more-tx-data), or if tx-data is not guaranteed to be a vector, perhaps (reduce into [] [tx-data more-tx-data]).

PPS. I forgot to mention it in the PR, but +1 to pushing for a "Data access layer of abstraction"; this level of indirection would really help with identifying performance bottlenecks and testing alternative strategies, but also with documenting where the app is changing state."

totally agree with both points 💯
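For reference, the suggested rewrite is behavior-preserving; a quick sketch of all three forms on sample tx data:

```clojure
;; (vec (concat a b)) builds a lazy seq and then copies it into a new
;; vector; (into a b) conj's b's elements directly onto the vector a.
(def tx-data      [[:db/add 1 :block/open true]])
(def more-tx-data [[:db/add 2 :block/open false]])

(= (vec (concat tx-data more-tx-data))
   (into tx-data more-tx-data)
   (reduce into [] [tx-data more-tx-data]))
;; => true -- same result; `into` just skips the intermediate lazy seq
```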

@pithyless
Contributor

I dug a little further into why things are taking so long.

There exists a function posh.lib.pull-analyze/pull-analyze, which calls out to posh.lib.datom-matcher/reduce-pattern, which in turn is a recursive function with a helper (posh.lib.datom-matcher/combine-entids) that is very eager and builds up a seq-of-seq-of-seqs...

This is how often a single show/hide block action will trigger the recursion on the Welcome page:

[screenshot: cache-welcome]

And this is the same action, but on a much larger markdown file:

[screenshot: cache-markdown]

Essentially, posh is working very hard to make sure we only query the bare-minimum; so in this case literally enumerating certain attributes for any potentially related blocks on the entire page. Which totally defeats the purpose, because it would have been faster to even naively read in the entire db, render the blocks, and rely solely on React vdom diffing.

This kind of detailed caching strategy is appropriate when querying the database is prohibitively expensive, but that's not the case for Athens. It's also in stark contrast to the approach e.g. Fulcro uses, where it only tracks which ids may have changed and then queries the DB directly.

All the recursions, conses, seqs, mapcats, and concats made me also wonder about how much GC pressure is being generated by this kind of operation. And sure enough, it's clearly visible (screenshot includes some debugging vars and prn; feel free to ignore those):

[screenshot: pull-vs-patterns]

Turns out calculating which attributes may have changed in this specific example was 8x more expensive than the actual datascript query to pull all that data back out.

This is definitely something that could be potentially refactored and improved upstream in posh, but irrespective I agree that the kinds of operations and updates Athens is interested in do not necessarily benefit from posh caching strategies and perhaps it would be better to avoid them altogether.

@jefftangx
Collaborator

Is this still being worked on @lambduhh? If no, I'm sure someone would be happy to build off of it. My DB is getting really slow 😿

@lambduhh
Contributor Author

lambduhh commented Mar 8, 2021

Yeah, since time is a concern it may be quicker to let somebody else take the reins (I've been preoccupied with a death in my family). Perhaps @pithyless?

@jefftangx
Collaborator

Sorry to hear that. Let us know if you need anything ❤️

@pithyless
Contributor

@lambduhh First and foremost, take care of yourself and your family. ❤️

I think I can find the time to take this PR off your hands and to run with it.

pithyless added a commit to pithyless/athens that referenced this pull request Mar 9, 2021
Solution was originally proposed by @lambduhh in
athensresearch#665.
See PR discussion for details why posh is considered
slow and a mismatch for Athens UI performance right now.

This version of the solution introduces a new namespace
`athens.posh` as a form of indirection. This way the rest
of the codebase does not need any further changes (aside from
the namespace require) and also allows easy switching between
the two different databases to compare performance and
correctness (via `athens.posh/version`).

The assumption is that later this hybrid duality can be elided
and `athens.posh` refactored into a more robust data fetching layer.
@pithyless pithyless mentioned this pull request Mar 9, 2021
@jefftangx jefftangx closed this Apr 16, 2021