FM framework for guided de-novo drug design implemented with PytorchLightning

ghaith-mq/MSFlow

MSFlow: De novo molecular structure elucidation from mass spectra via flow matching

This is the codebase for our preprint: MSFlow: De novo molecular structure elucidation from mass spectra via flow matching. To run the code, follow the instructions below.

Environment installation:

  • Install conda/miniconda if needed
  • Use flow.yml to create the necessary conda environment for using this codebase:
    conda env create -f flow.yml
    conda activate flow
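
As a quick sanity check before running any of the scripts, the following minimal sketch reports whether the flow environment is active. It relies on the CONDA_DEFAULT_ENV variable that conda sets on activation; the helper name active_conda_env is illustrative and not part of this codebase.

```python
import os


def active_conda_env(environ=os.environ):
    """Return the name of the active conda environment, or None if none is active."""
    # conda sets CONDA_DEFAULT_ENV when an environment is activated.
    return environ.get("CONDA_DEFAULT_ENV") or None


if __name__ == "__main__":
    name = active_conda_env()
    if name == "flow":
        print("conda environment 'flow' is active")
    else:
        print(f"expected conda environment 'flow', found {name!r}")
```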

Data download/preprocessing

  • To download the data used for training MSFlow, follow the same download/preprocessing steps as illustrated in the DiffMS repository. You need to clone the DiffMS repository into the ms_scripts directory to obtain identical train, validation and test sets.
  • Then, you can derive CDDD representations for all datasets as illustrated in the CDDDs repository.

Encoder training:

  • You can use the CANOPUS and MassSpecGym training and validation data to train the MS-to-CDDD encoder.
  • To retrain MIST, follow the original repository and use the provided script train_mist.py, but with CDDD representations.

Decoder training:

  • After downloading the necessary training data, use the convert_smiles_to_safe.py script to preprocess the decoder training and validation datasets and convert SMILES into the SAFE representation.
  • To train the flow decoder, run cfg_pretrain.py. You will need to set the paths in config.py to match your data directory.
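
For orientation, the SMILES-to-SAFE step can be sketched as below. This is not the contents of convert_smiles_to_safe.py. It is a minimal illustration that assumes the safe package is installed and uses its safe.encode function; the helper names load_safe_encoder and convert_smiles are hypothetical. Molecules that fail to encode are collected separately rather than aborting the run.

```python
from typing import Callable, Iterable, List, Tuple


def load_safe_encoder() -> Callable[[str], str]:
    """Return safe.encode (assumes the `safe` package is installed)."""
    import safe  # imported lazily so convert_smiles works with any encoder

    return safe.encode


def convert_smiles(
    smiles: Iterable[str], encode: Callable[[str], str]
) -> Tuple[List[str], List[str]]:
    """Encode each SMILES string; return (converted, failed) lists."""
    converted, failed = [], []
    for smi in smiles:
        smi = smi.strip()
        if not smi:
            continue  # skip blank lines
        try:
            converted.append(encode(smi))
        except Exception:
            # Keep the offending SMILES so it can be inspected later.
            failed.append(smi)
    return converted, failed
```

With the package available, a call such as convert_smiles(["CCO"], load_safe_encoder()) returns the encoded strings together with any inputs that could not be converted.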

Inference with model weights

We provide weights for our encoder-decoder pipeline for running inference here.

  • For MS-to-CDDD inference: inside the ms_scripts directory, clone the DiffMS repo, install its environment and the repo as a package, and download the benchmarks following the authors' instructions up to and including the preprocessing/downloading of the NPLIB1 and MSG benchmarks. We advise creating a separate conda environment for encoder inference, following the authors' instructions. Then you can use the condition_inference.py script to run inference with our provided checkpoints and save the MS embeddings to an output dataframe.
  • Additionally, inference.py provides some examples of running decoder inference. To use them, download the checkpoint and store it in the existing checkpoints placeholder directory.
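
Before launching decoder inference, it can help to confirm that the downloaded weights are where the scripts expect them. The sketch below lists checkpoint files in a directory. The default directory name checkpoints follows the text above, while the .ckpt suffix (the PyTorch Lightning default) and the helper name list_checkpoints are assumptions.

```python
from pathlib import Path
from typing import List


def list_checkpoints(directory: str = "checkpoints", suffix: str = ".ckpt") -> List[Path]:
    """Return the checkpoint files found directly inside `directory`, sorted by name."""
    root = Path(directory)
    if not root.is_dir():
        return []  # directory missing: nothing to load
    return sorted(p for p in root.iterdir() if p.is_file() and p.suffix == suffix)


if __name__ == "__main__":
    found = list_checkpoints()
    if found:
        for path in found:
            print(f"found checkpoint: {path}")
    else:
        print("no checkpoint files found in 'checkpoints'")
```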

License

MSFlow is released under the MIT license.

Contact

If you have any inquiries, please reach out to ghaith.mqawass@tum.de
