Centralisation of resources

Please add links in the corresponding section or a new section if you find something interesting.

These can be papers, blog posts, podcasts, models, videos, courses, books, etc. Demos developed within the HU (e.g. markdown files demonstrating a package) should also be included. If possible, please include a short description/key takeaway(s).

Science

  • The Palindrome. Stories about how mathematics impacts science and technology.
  • Intro to modern statistics
  • Consensus.app lets you ask a question and the app finds papers that answer yes or no to this question.
  • Connected papers. Finds papers that are similar to each other and presents them in a graph (these papers can cite each other but don't have to).

General Data Science / Machine Learning

  • The Gradient. Digital magazine about trends in AI/machine learning, founded by Stanford Artificial Intelligence Laboratory. A lot of essay-style articles (a few are linked below).
  • Discovering novel algorithms with AlphaTensor. Discovery of novel algorithms by AI. Broad possibilities for application in data science/computer science in general. The article links to the GitHub repo and the paper.
  • Illustrated Machine Learning. A website illustrating a variety of ML concepts, in a bit of a "back-of-the-envelope" drawing style.
  • New machine setup. Tutorial on how to set up a new machine for Data Science. Based on macOS.
  • Equality of odds. One technique to prevent bias in ML.
  • Bruno Rodrigues blog. Nice technical blog, more on the software engineering side. Posts include a 4-parter on Nix for reproducible science.
  • Telling stories with data. Haven't read it all, but seems like a nice book/reference for data analysis, reproducibility and communication.
  • ML papers explained. Links to a collection of Medium posts to explain ML papers/concepts, e.g. Transformers and all the LLM zoo, CNNs, vision models, etc.
  • Four kinds of optimization. A short dive into how to optimize the running time of your programs. Explores trade-offs of 4 solutions: use a better algorithm, a better data structure, a lower-level programming language (e.g. rewrite some Python in Rust) or accept a less precise solution.
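
As a rough illustration of the "better data structure" option from the last item above, the sketch below swaps a list for a set in a membership test. It is not taken from the linked post; the function names and data are made up for this example.

```python
import timeit

def count_hits_list(haystack, needles):
    """Membership test against a list: each lookup scans the list."""
    return sum(1 for n in needles if n in haystack)

def count_hits_set(haystack, needles):
    """Same logic after building a set: each lookup is a hash probe."""
    lookup = set(haystack)
    return sum(1 for n in needles if n in lookup)

haystack = list(range(0, 4_000, 2))  # even numbers only
needles = list(range(1_000))

# Both versions must return the same answer; only the running time differs.
assert count_hits_list(haystack, needles) == count_hits_set(haystack, needles)

t_list = timeit.timeit(lambda: count_hits_list(haystack, needles), number=3)
t_set = timeit.timeit(lambda: count_hits_set(haystack, needles), number=3)
print(f"list: {t_list:.4f}s  set: {t_set:.4f}s")
```

Timings depend on the machine, so none are quoted here; the point is only that the result is unchanged while the lookup cost drops.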

Natural Language Processing

RAG

  • paperQA. Minimal package to do QA on PDFs.
  • LlamaIndex. Data framework for LLM applications.
  • LlamaParse. Connected to the previous link, with JSON parsing capabilities for tables, text (and images?).
  • Tips & Tricks for RAG. From LlamaIndex, YouTube video for putting RAG in production.
  • Real time RAG chatbot. Blogpost on building RAG with "real time" updates of the knowledge base.
  • Production RAG. Presentation on choices to make while building a RAG (also on evaluation, deployment, budget).
  • Embeddings quantization. Cost and latency reductions thanks to embeddings quantization.
  • Retrieval Augmented fine tuning. Approach to fine tune LLMs to retrieve relevant documents.
  • Rerankers. Low-dependency Python package providing a unified interface to the most common reranker models.
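
To give a feel for the embeddings quantization item above, here is a minimal pure-Python sketch of scalar int8 quantization. It assumes one scale factor per vector, which is only one of several possible schemes and not necessarily the one used in the linked post; the vectors are invented for the example.

```python
def quantize_int8(vec):
    """Map floats to integers in [-127, 127] using one scale per vector."""
    scale = max(abs(x) for x in vec) / 127 or 1.0  # fall back to 1.0 for all-zero vectors
    return [round(x / scale) for x in vec], scale

def dequantize(qvec, scale):
    """Recover approximate floats from the quantized integers."""
    return [q * scale for q in qvec]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

# Hypothetical 4-dimensional embeddings, for illustration only.
doc = [0.12, -0.40, 0.33, 0.05]
query = [0.10, -0.35, 0.30, 0.00]

q_doc, scale = quantize_int8(doc)
exact = dot(doc, query)
approx = dot(dequantize(q_doc, scale), query)
print(f"exact={exact:.4f}  approx={approx:.4f}")
```

Each quantized value fits in one byte instead of the four bytes of a float32, which is where the storage saving comes from; the price is a small error in the similarity score.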

Models

  • SAPBERT. HuggingFace model for biomedical entities representation.
  • PubMedBERT
  • REBEL. Relationship extraction model. HuggingFace model, pluggable on spaCy. Ranked 3rd on relationship extraction on the Adverse Drug Events dataset.
  • BioGPT. Pre-trained on biomedical literature, claims to outperform state-of-the-art models on most biomedical NLP tasks.
  • Efficient Transformers. How to make LLMs more efficient. Covers knowledge distillation and fine-tuning.
  • PKPDAI. Suite of models specialized for PK modeling. Document classifier and NER.

Corpora

  • EuropePMC. Annotated full text corpus for genes/proteins, diseases and organisms. Link is to the bioRxiv paper, with details on how to access and reuse the resource.

Papers/Books

Interesting articles

Talks

  • NLP Summit. A collection of talks about NLP, particularly focused on biomedical/healthcare. Look into the tab "Watch past summits".

Graph ML

Bayesian

  • Statistical Rethinking. Course developed by Richard McElreath focused on (Bayesian) modeling. Given live once a year, the link contains the most recent (2023) material.

Neural networks

Software/Packages/Tools

Technical tutorials

Productivity tools

  • Foam. Knowledge management and sharing tool.
  • Explainpaper. Upload a paper, highlight confusing terms and get an explanation returned (provided by a GPT-3 model).

Python packages

  • Polars. New library claimed to be much faster than Pandas for dataframes in Python.
  • FAISS. A library for efficient similarity search between vectors. Written in C++ with Python wrappers.
  • SafeTensors. New (as of June 2023) default in HuggingFace to save/load models, as it has improved security (vs Pickle).
  • Guardrails AI. Python package to verify structure and quality of LLM outputs. Can be useful to check for bias or type errors, but not a resource to verify facts.
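
For readers new to vector similarity search (the problem FAISS addresses), the sketch below shows the brute-force version in plain Python. It does not use the FAISS API; the function names and toy vectors are assumptions made for this example, and a library like FAISS exists precisely because this exhaustive scan does not scale to large collections.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def top_k(query, vectors, k=2):
    """Return the indices of the k vectors most similar to the query."""
    scored = [(cosine(query, v), i) for i, v in enumerate(vectors)]
    scored.sort(reverse=True)  # highest similarity first
    return [i for _, i in scored[:k]]

# Toy 2-dimensional "embeddings", for illustration only.
vectors = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [-1.0, 0.0]]
nearest = top_k([1.0, 0.05], vectors, k=2)
print(nearest)
```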

R packages

  • RPolars. The above library now also for R. Again, claims to be much faster than all tidyverse tools, but I find the grammar of it quite repulsive TBH.
  • WebR. R package to run R code directly in the browser from a website.
  • Rang. R package to help reproducibility of old code in R.
  • messydates. A package to make date formats tidy.

Miscellaneous

For interesting information that is not necessarily DS-related. Mostly dataviz TBH.

  • Bioicons. A free library of science icons and logos (SVG) that goes beyond your standard PPT icons.
  • Designing color keys. Blogpost explaining how to create easy-to-read color keys (=legends) for your charts.
  • Friends don't let friends. An opinionated guide on common mistakes in the use of (scientific) graphs.