Centralisation of resources

Please add links in the corresponding section or a new section if you find something interesting.

These can be papers, blog posts, podcasts, models, videos, courses, books, etc. Demos developed within the HU (e.g. markdown files demonstrating a package) should also be included. If possible, please include a short description/key takeaway(s).

Science

The Palindrome. Stories about how mathematics impacts science and technology.
Intro to modern statistics
Consensus.app lets you ask a question and the app finds papers that answer yes or no to this question.
Connected papers. Finds papers that are similar to each other and presents them in a graph (these papers can cite each other but don't have to).

General Data Science / Machine Learning

The Gradient. Digital magazine about trends in AI/machine learning, founded by Stanford Artificial Intelligence Laboratory. A lot of essay-style articles (a few are linked below).
Discovering novel algorithms with AlphaTensor. Discovery of novel algorithms by AI. Broad possibilities for application in data science/computer science in general. The post links to both the GitHub repo and the article.
Illustrated Machine Learning. A website illustrating a variety of ML concepts, in a bit of a "back of the envelope" drawing style.
New machine setup. Tutorial on how to set up a new machine for Data Science. Based on macOS.
Equality of odds. One technique to prevent bias in ML.
Bruno Rodrigues blog. Nice technical blog, more on the software engineering side. Posts include a 4-parter on Nix for reproducible science.
Telling stories with data. Haven't read it all, but seems like a nice book/reference for data analysis, reproducibility and communication.
ML papers explained. Links to a collection of Medium posts to explain ML papers/concepts, e.g. Transformers and all the LLM zoo, CNNs, vision models, etc.
Four kinds of optimization. A short dive into how to optimize the running time of your programs. Explores trade-offs of 4 solutions: use a better algorithm, a better data structure, a lower-level programming language (e.g. rewrite some Python in Rust) or accept a less precise solution.
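
As a quick illustration of the "Equality of odds" entry above: the criterion asks that a classifier's true positive rate and false positive rate be (approximately) the same across groups. The snippet below is a minimal sketch for checking this on binary labels, not code taken from the linked resource; the function names `group_rates` and `equalized_odds_gap` and the toy data are hypothetical.

```python
from collections import defaultdict


def group_rates(y_true, y_pred, groups):
    """Return {group: (true_positive_rate, false_positive_rate)} for binary labels."""
    counts = defaultdict(lambda: {"tp": 0, "fn": 0, "fp": 0, "tn": 0})
    for truth, pred, group in zip(y_true, y_pred, groups):
        if truth == 1:
            key = "tp" if pred == 1 else "fn"
        else:
            key = "fp" if pred == 1 else "tn"
        counts[group][key] += 1

    rates = {}
    for group, c in counts.items():
        positives = c["tp"] + c["fn"]
        negatives = c["fp"] + c["tn"]
        # A rate is undefined (NaN) when a group has no positives or no negatives.
        tpr = c["tp"] / positives if positives else float("nan")
        fpr = c["fp"] / negatives if negatives else float("nan")
        rates[group] = (tpr, fpr)
    return rates


def equalized_odds_gap(y_true, y_pred, groups):
    """Largest between-group difference in TPR or FPR (0 means the rates match)."""
    rates = group_rates(y_true, y_pred, groups)
    tprs = [tpr for tpr, _ in rates.values()]
    fprs = [fpr for _, fpr in rates.values()]
    return max(max(tprs) - min(tprs), max(fprs) - min(fprs))


# Toy data (made up for illustration): two groups, "a" and "b".
y_true = [1, 1, 0, 0, 1, 1, 0, 0]
y_pred = [1, 0, 0, 0, 1, 1, 1, 0]
groups = ["a", "a", "a", "a", "b", "b", "b", "b"]

rates = group_rates(y_true, y_pred, groups)
gap = equalized_odds_gap(y_true, y_pred, groups)
print(rates, gap)
```

A large gap suggests the model treats the groups differently; what counts as "large" depends on the application.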

Natural Language Processing

Creating knowledge graphs from unstructured text. FAIR cookbook recipe with all steps (and tools) from literature to knowledge graph.
Prodigy Open AI recipes. Recipes to obtain a high-quality dataset from an LLM with a small annotation effort.
Understanding LLM: A reading list. Reading list of founding papers for Transformers-based language models.
LLM applications for production. Very interesting post with tips for prompt engineering, evaluation and optimization, model finetuning, and best practices for LLM use in general.
Impact of LLMs on scientific discovery. A Microsoft Research paper investigating the performance of GPT-4 for various scientific tasks (drug discovery, materials design, molecular simulations, etc.). 230 pages...
Tackling hallucinations in LLMs. A blog post with multiple links to research papers detailing how to deal with (and reduce) LLMs generating factually wrong information.
The Pipe: Python package that does markdown extraction from a variety of formats, including PDFs and Word.

RAG

paperQA. Minimal package to do QA on PDFs.
LlamaIndex. Data framework for LLM applications.
LlamaParse. Connected to the previous link, with JSON parsing capabilities for tables, text (and images?).
Tips & Tricks for RAG. From LlamaIndex, YouTube video for putting RAG in production.
Real time RAG chatbot. Blogpost on building RAG with "real time" updates of the knowledge base.
Production RAG. Presentation on choices to make while building a RAG (also on evaluation, deployment, budget).
Embeddings quantization. Cost and latency reductions thanks to embeddings quantization.
Retrieval Augmented fine tuning. Approach to fine tune LLMs to retrieve relevant documents.
Rerankers: Low-dependency Python package providing a unified interface to the most common reranker models.
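
To make the "Embeddings quantization" entry above more concrete, here is a minimal sketch of one common form, binary quantization: each float dimension is reduced to a single sign bit, and vectors are compared with Hamming distance instead of a float dot product. This is an illustration under that assumption, not necessarily the exact method of the linked post; `binarize`, `hamming` and the toy vectors are hypothetical.

```python
def binarize(embedding):
    """Pack the sign of each dimension into one integer (1 bit per dimension)."""
    bits = 0
    for value in embedding:
        bits = (bits << 1) | (1 if value > 0 else 0)
    return bits


def hamming(a, b):
    """Number of differing bits between two packed binary embeddings."""
    return bin(a ^ b).count("1")


# Toy 4-dimensional embeddings (made up for illustration).
query = binarize([0.9, -0.2, 0.4, -0.7])
doc_a = binarize([0.8, -0.1, 0.3, -0.9])   # same sign pattern as the query
doc_b = binarize([-0.5, 0.6, -0.3, 0.2])   # opposite sign pattern

dist_a = hamming(query, doc_a)
dist_b = hamming(query, doc_b)
print(dist_a, dist_b)
```

Storing one bit instead of a 32-bit float per dimension is where the memory and latency savings come from. The price is some loss in retrieval accuracy, which is often recovered by rescoring the top candidates with the full-precision vectors.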

Models

SAPBERT. HuggingFace model for biomedical entities representation.
PubMedBERT
REBEL. Relationship extraction model. HuggingFace model, pluggable on spaCy. Ranked 3rd on relationship extraction on the Adverse Drug Events dataset.
BioGPT. Pre-trained on biomedical literature, claims to outperform state-of-the-art models on most biomedical NLP tasks.
Efficient Transformers. How to make LLMs more efficient. Covers knowledge distillation and fine-tuning.
PKPDAI. Suite of models specialized for PK modeling. Document classifier and NER.

Corpora

EuropePMC. Annotated full-text corpus for genes/proteins, diseases and organisms. The link points to the bioRxiv paper, with details on how to access and reuse the resource.

Papers/Books

Natural Language Processing with Transformers

Interesting articles

Designing a GPT-3 for science. Key TAs: death to PDF format! Articles are a substrate for information combination.
Lessons from the GPT-4chan controversy. Article about ethics in AI. Interesting TA: possibility to "gate" potentially harmful models so that they are only accessible to researchers.

Talks

NLP Summit. A collection of talks about NLP, particularly focused on biomedical/healthcare. Look into the "Watch past summits" tab.

Graph ML

Bayesian

Statistical Rethinking. Course developed by Richard McElreath focused on (Bayesian) modeling. Given live once a year; the link contains the most recent (2023) material.

Neural networks

Understanding LSTMs
What are embeddings? A deep (82 pages!) dive into embeddings and what they are on a conceptual, mathematical and engineering level.

Software/Packages/Tools

Technical tutorials

Installing PyTorch Geometric on Mac M1 with Accelerated GPU Support. Building an environment for graph machine learning on Macbook with M1.
Setting Python Development Environment with VScode and Docker.

Productivity tools

Foam. Knowledge management and sharing tool.
Explainpaper. Upload a paper, highlight confusing terms and get an explanation returned (provided by a GPT-3 model).

Python packages

Polars. New library claimed to be much faster than Pandas for dataframes in Python.
FAISS. A library for efficient similarity search between vectors. Written in C++, with Python wrappers.
SafeTensors. New (as of June 2023) default in HuggingFace to save/load models, as it has improved security (vs Pickle).
Guardrails AI. Python package to verify structure and quality of LLM outputs. Can be useful to check for bias or type errors, but not a resource to verify facts.

R packages

RPolars. The above library, now also for R. Again, claims to be much faster than all tidyverse tools, but I find its grammar quite repulsive TBH.
WebR. R package to run R code directly in browser from a website.
Rang. R package to help with the reproducibility of old R code.
messydates. A package to make date formats tidy.

Miscellaneous

For interesting information that is not necessarily DS-related. Mostly dataviz TBH.

Bioicons. A free library of science icons and logos (SVG) that goes beyond your standard PPT icons.
Designing color keys. Blogpost explaining how to create easy-to-read color keys (=legends) for your charts.
Friends don't let friends. An opinionated guide on common mistakes in the use of (scientific) graphs.
