Data analysis and inference in the paper Quantifying the dynamics of memory B cells and plasmablasts in healthy individuals
arxiv: https://arxiv.org/pdf/2510.02812
- Phad data: contains all the scripts to download the dataset of https://www.nature.com/articles/s41590-022-01230-1, process the B-cell repertoire data of memory cells and plasmablasts, and infer the different models discussed in the paper.
- Mikelov data: the same type of analysis as for the Phad data, but for https://elifesciences.org/articles/79254.
- func_py: Python functions used in the analysis.
- func_build: C++ functions that are wrapped in Python scripts.
Within the two folders Phad data and Mikelov data there are numbered scripts that:
- Scripts 1.: download the FASTQ files
- Scripts 2. to 5.: process the FASTQ files and align the IG sequences
- Scripts 6.: cluster the sequences into clonal families
- Scripts 7.: run the noise inference and plot the marginals
- Scripts 8.: run the Geometric Brownian Motion inference for memory cells and plot the marginals
- Scripts 9.: run the inference of the coupled memory-plasmablast system and plot the marginals
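As background for the Geometric Brownian Motion model inferred by scripts 8, here is a minimal illustrative simulation of clone-size trajectories. This is not the repository's inference code, and all parameter values below are made up:

```python
import numpy as np

def simulate_gbm(x0, mu, sigma, dt, n_steps, n_clones, rng):
    """Simulate Geometric Brownian Motion in log-space:
    log x(t+dt) = log x(t) + (mu - sigma^2/2)*dt + sigma*sqrt(dt)*xi, xi ~ N(0,1)."""
    log_x = np.full(n_clones, np.log(x0))
    for _ in range(n_steps):
        xi = rng.standard_normal(n_clones)
        log_x += (mu - 0.5 * sigma**2) * dt + sigma * np.sqrt(dt) * xi
    return np.exp(log_x)

# Hypothetical parameters, chosen only for illustration
rng = np.random.default_rng(0)
sizes = simulate_gbm(x0=100.0, mu=0.1, sigma=0.5, dt=0.01,
                     n_steps=1000, n_clones=5000, rng=rng)
# Over total time T = 10, the mean log-size drifts by roughly (mu - sigma^2/2) * T
print(np.log(sizes).mean())
```

The log-space update is the standard exact discretization of GBM; averaging over many simulated clones recovers the drift term, which is the kind of population-level signal the inference scripts estimate from the repertoire snapshots.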
Note that the numbering is consistent across the two folders, but the specific type of analysis can differ between the two datasets.
The analysis splits into two parts: data processing (scripts 1 to 6) and model inference (scripts 7 to 9).
The repository does not contain intermediate files (for size reasons), and the scripts have to be run in the precise sequence to obtain the final tables. To run the scripts one needs the following software:
- Cell Ranger (version 7.1.0): for the single-cell data analysis of the Phad dataset.
- Change-O toolkit (https://changeo.readthedocs.io/en/stable/): for the Ig sequence alignment using IgBlast (version 1.22.0) for both datasets.
- Hilary (https://github.com/statbiophys/hilary): for the clonal family assignment of both datasets. The final tables of clone clusters (the output of scripts 6) are uploaded to the repository. This makes it possible to run the downstream inference analysis directly, without re-running the previous pipeline.
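To give an idea of how the uploaded clone-cluster tables can be consumed downstream, here is a hedged sketch using pandas. The column names (`sequence_id`, `clone_id`, `cell_type`) are hypothetical, and a stand-in table is built inline; the repository's actual tables from scripts 6 may use different names:

```python
import io
import pandas as pd

# Stand-in for a clone-cluster table; the real tables are the outputs of scripts 6
csv = io.StringIO(
    "sequence_id,clone_id,cell_type\n"
    "s1,c1,memory\n"
    "s2,c1,plasmablast\n"
    "s3,c2,memory\n"
)
clones = pd.read_csv(csv)

# Clone-family sizes: the basic count from which dynamics can be inferred
sizes = clones.groupby("clone_id").size()
print(sizes.to_dict())
```

Grouping by the clone identifier and counting rows yields the family-size distribution, which is the natural starting point for the inference notebooks.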
All the intermediate files generated by those scripts are saved in the repo, so each notebook can be run independently. Here we provide a Dockerfile that installs all the dependencies and directly launches the Python notebooks with JupyterLab.
To build and run the Dockerfile, clone the repository and make sure Docker (https://www.docker.com/) is installed. Then change directory into the repo and type the following commands:
```shell
sudo docker build -t infer-b-cell-nb .
sudo docker run -p 8888:8888 -v $(pwd):/workspace infer-b-cell-nb
```

The last command starts Jupyter, but you still need to open it in your browser: copy the second of the two URLs printed in the terminal below the sentence "Or copy and paste one of these URLs:", and paste it into your browser.
For each figure we point to the notebook (in both dataset folders) used to generate it.
- Fig. 2: 7.3_infer_noise_marginals.ipynb
- Fig. 3: 8.3_gbm_marginals.ipynb
- Fig. 4: 9.2_memplasm_marginals.ipynb
- Figs. S1, S2: 5_seq_info.ipynb
- Figs. S3, S4: 7.2_infer_noise_results.ipynb
- Fig. S5: 7.3_infer_noise_marginals.ipynb
- Fig. S6: 8.2_infer_gbm.ipynb (Phad data)
- Fig. S7: 8.2_infer_gbm.ipynb (Mikelov data)
- Fig. S8: 8.3_gbm_marginals.ipynb
- Fig. S9: 8.3_gbm_marginals.ipynb
- Fig. S10: 9.1_infer_memplasm.ipynb (Mikelov data)
- Fig. S11: 9.1_infer_memplasm.ipynb (Phad data)
This code can be used to reproduce all the analyses and figures of the paper. I made some effort to comment it and explain how to execute it; however, some parts may still be hard for external users to read or run. If you are interested in running it and are having a hard time, please send me an email at andrea.mazzolini.90@gmail.com