Just recently stumbled on SciDocs, a benchmark of tasks designed specifically for evaluating scientific document embeddings. We should benchmark the leading models (including our own) against it to decide which one to use in this repo.
It would also be worth writing a small set of scripts to automate the evaluation, so we can quickly re-run it whenever a new model comes along.
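As a starting point, the harness could look something like the sketch below. This is only a scaffold to show the shape of the automation, not the real SciDocs pipeline: the actual benchmark (allenai/scidocs) ships its own data and metrics, and the `toy_embedder` and toy retrieval task here are placeholders you would swap for real models and the SciDocs tasks.

```python
"""Hypothetical scaffold for benchmarking document-embedding models.

Assumptions: the toy character-frequency embedder and the top-1
retrieval task are stand-ins; the real harness would call the SciDocs
evaluation code and plug in each candidate model's encode function.
"""
import math
from typing import Callable, Dict, List, Tuple

Embedder = Callable[[str], List[float]]


def cosine(a: List[float], b: List[float]) -> float:
    # Plain cosine similarity; returns 0.0 for zero vectors.
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0


def toy_embedder(text: str) -> List[float]:
    # Placeholder model: 26-dim character-frequency vector.
    vec = [0.0] * 26
    for ch in text.lower():
        if "a" <= ch <= "z":
            vec[ord(ch) - ord("a")] += 1.0
    return vec


def evaluate(
    models: Dict[str, Embedder],
    docs: List[str],
    gold: List[Tuple[str, int]],
) -> Dict[str, float]:
    """Score each model on a toy retrieval task.

    gold pairs each query with the index of its relevant doc; the score
    is top-1 retrieval accuracy under cosine similarity.
    """
    results: Dict[str, float] = {}
    for name, embed in models.items():
        doc_vecs = [embed(d) for d in docs]
        correct = 0
        for query, rel in gold:
            q = embed(query)
            best = max(range(len(docs)), key=lambda i: cosine(q, doc_vecs[i]))
            correct += int(best == rel)
        results[name] = correct / len(gold)
    return results
```

Adding a new model would then just mean registering another name-to-encode-function entry in `models`, which keeps the comparison loop untouched.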