Please add links in the corresponding section or a new section if you find something interesting.
These can be papers, blog posts, podcasts, models, videos, courses, books, etc. Demos developed within the HU (e.g. markdown files demonstrating a package) should also be included. If possible, please include a short description/key takeaway(s).
- The Palindrome. Stories about how mathematics impacts science and technology.
- Intro to modern statistics
- Consensus.app lets you ask a question and the app finds papers that answer yes or no to this question.
- Connected papers. Finds papers that are similar to each other and presents them in a graph (these papers can cite each other but don't have to).
- The Gradient. Digital magazine about trends in AI/machine learning, founded by Stanford Artificial Intelligence Laboratory. A lot of essay-style articles (a few are linked below).
- Discovering novel algorithms with AlphaTensor. Discovery of novel algorithms by AI. Broad possibilities for application in data science/computer science in general. Link to the GitHub repo and article in the article.
- Illustrated Machine Learning. A website illustrating a variety of ML concepts, in a bit of a "back of the envelope" drawing style.
- New machine setup. Tutorial on how to set up a new machine for Data Science. Based on macOS.
- Equality of odds. One technique to prevent bias in ML.
- Bruno Rodrigues blog. Nice technical blog, more on the software engineering side. Posts include a 4-parter on Nix for reproducible science.
- Telling stories with data. Haven't read it all, but seems like a nice book/reference for data analysis, reproducibility and communication.
- ML papers explained. Links to a collection of Medium posts to explain ML papers/concepts, e.g. Transformers and all the LLM zoo, CNNs, vision models, etc.
- Four kinds of optimization. A short dive into how to optimize the running time of your programs. Explores trade-offs of 4 solutions: use a better algorithm, a better data structure, a lower-level programming language (e.g. rewrite some Python in Rust) or accept a less precise solution.
- Creating knowledge graphs from unstructured text. FAIR cookbook recipe with all steps (and tools) from literature to knowledge graph.
- Prodigy Open AI recipes. Recipes to obtain a high-quality dataset from an LLM with a small annotation effort.
- Understanding LLM: A reading list. Reading list of founding papers for Transformer-based language models.
- LLM applications for production. Very interesting post with tips for prompt engineering, evaluation and optimization, model finetuning, and best practices for LLM use in general.
- Impact of LLMs on scientific discovery. A Microsoft Research paper investigating the performance of GPT-4 for various scientific tasks (drug discovery, materials design, molecular simulations, etc.). 230 pages...
- Tackling hallucinations in LLMs. A blog post with multiple links to research papers detailing how to deal with (and reduce) LLMs generating factually wrong information.
- The Pipe: Python package that does markdown extraction from a variety of formats, including PDFs and Word.
- paperQA. Minimal package to do QA on PDFs.
- LlamaIndex. Data framework for LLM applications.
- LlamaParse. Connected to the previous link, with JSON parsing capabilities for tables, text (and images?).
- Tips & Tricks for RAG. From LlamaIndex, YouTube video for putting RAG in production.
- Real time RAG chatbot. Blogpost on building RAG with "real time" updates of the knowledge base.
- Production RAG. Presentation on choices to make while building a RAG (also on evaluation, deployment, budget).
- Embeddings quantization. Cost and latency reductions thanks to embeddings quantization.
- Retrieval Augmented fine tuning. Approach to fine-tune LLMs to retrieve relevant documents.
- Rerankers: Low-dependency Python package that provides a unified interface to the most common reranker models.
- SAPBERT. HuggingFace model for biomedical entities representation.
- PubMedBERT
- REBEL. Relationship extraction model. HuggingFace model, pluggable on spaCy. Ranked 3rd on relationship extraction on the Adverse Drug Events dataset.
- BioGPT. Pre-trained on biomedical literature, claims to outperform state-of-the-art models on most biomedical NLP tasks.
- Efficient Transformers. How to make LLMs more efficient. Covers knowledge distillation and fine-tuning.
- PKPDAI. Suite of models specialized for PK modeling. Document classifier and NER.
- EuropePMC. Annotated full-text corpus for genes/proteins, diseases and organisms. Link is to the bioRxiv paper, with details on how to access and reuse the resource.
- Designing a GPT-3 for science. Key takeaways: death to the PDF format! Articles are a substrate for information combination.
- Lessons from the GPT-4chan controversy. Article about ethics in AI. Interesting takeaway: possibility to "gate" potentially harmful models so that they are only accessible to researchers.
- NLP Summit. A collection of talks about NLP, particularly focused on biomedical/healthcare. Look into the tab "Watch past summits".
- Statistical Rethinking. Course developed by Richard McElreath focused on (Bayesian) modeling. Given live once a year, the link contains the most recent (2023) material.
- Understanding LSTMs
- What are embeddings? A deep (82 pages!) dive into embeddings and what they are on a conceptual, mathematical and engineering level.
- Installing PyTorch Geometric on Mac M1 with Accelerated GPU Support. Building an environment for graph machine learning on a MacBook with M1.
- Setting Python Development Environment with VS Code and Docker.
- Foam. Knowledge management and sharing tool.
- Explainpaper. Upload a paper, highlight confusing terms and get an explanation returned (provided by a GPT-3 model).
- Polars. New library claimed to be much faster than Pandas for dataframes in Python.
- FAISS. A library for efficient similarity search between vectors. Written in C++, with Python wrappers.
- SafeTensors. New (as of June 2023) default in HuggingFace to save/load models, as it has improved security (vs Pickle).
- Guardrails AI. Python package to verify structure and quality of LLM outputs. Can be useful to check for bias or type errors, but not a resource to verify facts.
- RPolars. The Polars library, now also for R. Again, claims to be much faster than all tidyverse tools, but I find the grammar of it quite repulsive TBH.
- WebR. R package to run R code directly in browser from a website.
- Rang. R package to help reproducibility of old code in R.
- messydates. A package to make date formats tidy.
For interesting information that is not necessarily DS-related. Mostly dataviz TBH.
- Bioicons. A free library of science icons and logos (SVG) that goes beyond your standard PPT icons.
- Designing color keys. Blogpost explaining how to create easy-to-read color keys (=legends) for your charts.
- Friends don't let friends. An opinionated guide on common mistakes in the use of (scientific) graphs.