Data submission - Decarboxylative Olefination (Machine learning optimized)#237

Merged
bdeadman merged 4 commits into open-reaction-database:#237 from DrHermit:DecarboxylativeOlefination_Submission
Apr 8, 2026

@DrHermit

This pull request contains two datasets generated from optimization of a decarboxylative olefination reaction between aldehydes and malonic acid derivatives:

  1. 120-experiment dataset generated from Bayesian optimization of 5 substrates
  2. 136-experiment dataset generated from transfer learning optimization of 26 substrates

The data were generated in the Alan Healy lab at NYU Abu Dhabi.

Here is the link to the ChemRxiv pre-print of the paper:
https://chemrxiv.org/doi/full/10.26434/chemrxiv.15001213/v1

Thank you for taking time to review my data and I hope I made the correct pull request this time.

@bdeadman bdeadman changed the base branch from main to #237 March 26, 2026 14:42
@bdeadman
Collaborator

@DrHermit - many thanks for this, and also for already correcting the inputs in your reactions.

I'll do the formal review tomorrow and let you know if there are any changes we recommend. The pull request is looking as I expect though so we're hopefully on the home stretch now.

@bdeadman
Collaborator

Review reports:

review_#237_1.ipynb (Transfer learning dataset)
review_#237_2.ipynb (BO dataset)

@bdeadman
Collaborator

@DrHermit I've done my review and have a few 'must fix' issues, and some suggestions to consider. I'll list them here in decreasing order of importance. You may also be interested in checking the Jupyter Notebooks I've attached above - these go through my review checks and will include many of the comments below:

For both datasets

  • It looks like you enumerated SMILES fields after adding them using the lookup tool in the reaction editor. As a result I'm seeing some mismatches between the text name (which wasn't updated) and the SMILES (which was replaced during enumeration).
    • Options to correct this:
      • Delete the details from the pre-template reaction and then enumerate again so the placeholder name doesn't get included in the final dataset.
      • Add a SMILES field directly to the reaction (i.e. not using lookup) and then enumerate.
      • If you want to include the name (can be helpful for human readability), best practice would be to add a separate identifier field for name and enumerate this (instead of relying on the details field that lookup populates).
  • Add the pre-print DOI to the provenance in the reactions.
  • The DMAP component is missing a reaction role. I suggest using REAGENT as the most generic option we have.
  • Ideally, we'd include the workup steps in the reactions as well. I've already mocked up a quick example for your described method and can attach the code for it in a later message so it should be easy to incorporate into your template before redoing the enumeration.
  • The internal standard should really be removed from the reaction inputs to the workups since it is added post-reaction.
  • Adding a name field to the molecular sieves input component would help with human readability. Keep the CAS number as well - we can have multiple identifiers.
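
The first and third points above can be checked mechanically. The following is a minimal sketch, assuming each reaction has been loaded as a plain Python dict shaped like a JSON export of the dataset. The key names used here (`inputs`, `components`, `identifiers`, `reactionRole`) and the toy data are assumptions for illustration, not a confirmed description of the ORD export format or of the review notebooks.

```python
from collections import defaultdict


def iter_components(reaction):
    """Yield (input_key, component) for every input component of one reaction dict."""
    for key, reaction_input in reaction.get("inputs", {}).items():
        for component in reaction_input.get("components", []):
            yield key, component


def identifier(component, id_type):
    """Return the first identifier value of the given type, or None."""
    for ident in component.get("identifiers", []):
        if ident.get("type") == id_type:
            return ident.get("value")
    return None


def names_with_conflicting_smiles(reactions):
    """Map each NAME that appears with more than one SMILES to those SMILES."""
    seen = defaultdict(set)
    for reaction in reactions:
        for _, component in iter_components(reaction):
            name = identifier(component, "NAME")
            smiles = identifier(component, "SMILES")
            if name and smiles:
                seen[name].add(smiles)
    return {name: sorted(values) for name, values in seen.items() if len(values) > 1}


def components_missing_role(reactions):
    """List (reaction_index, input_key) for components without a reaction role."""
    missing = []
    for index, reaction in enumerate(reactions):
        for key, component in iter_components(reaction):
            if component.get("reactionRole", "UNSPECIFIED") == "UNSPECIFIED":
                missing.append((index, key))
    return missing


def toy_reaction(aldehyde_smiles):
    """Build a toy reaction whose aldehyde name was never updated (hypothetical data)."""
    return {
        "inputs": {
            "aldehyde": {"components": [{
                "identifiers": [
                    {"type": "NAME", "value": "benzaldehyde"},
                    {"type": "SMILES", "value": aldehyde_smiles},
                ],
                "reactionRole": "REACTANT",
            }]},
            # DMAP entered without a reaction role
            "base": {"components": [{
                "identifiers": [{"type": "SMILES", "value": "CN(C)c1ccncc1"}],
            }]},
        }
    }


reactions = [toy_reaction("O=Cc1ccccc1"), toy_reaction("COc1ccc(C=O)cc1")]
conflicts = names_with_conflicting_smiles(reactions)
missing_roles = components_missing_role(reactions)
print(conflicts)
print(missing_roles)
```

A name that maps to several SMILES strings is the signature of a placeholder name surviving enumeration, so an empty `conflicts` dict is the result to aim for after re-enumerating.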

For the Bayesian Optimisation dataset:

  • My review notebook found benzene as one of the chemical inputs when the SMILES strings were visualised. I don't know which specific reactions this would be in, but suspect this may have been a typo for the toluene SMILES string. My compound list for this dataset didn't include toluene.
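
To locate the specific reactions, a simple scan over the inputs is enough. This is a rough sketch under the same assumption as before (reactions loaded as plain dicts with hypothetical key names). It matches SMILES by exact string, so it only catches the spellings listed in `BENZENE_FORMS`; a cheminformatics toolkit that canonicalizes SMILES would be more robust.

```python
# Two common ways of writing benzene; other valid spellings would be missed.
BENZENE_FORMS = {"c1ccccc1", "C1=CC=CC=C1"}


def reactions_containing(reactions, smiles_forms):
    """Return (reaction_index, input_key) pairs whose SMILES is in smiles_forms."""
    hits = []
    for index, reaction in enumerate(reactions):
        for key, reaction_input in reaction.get("inputs", {}).items():
            for component in reaction_input.get("components", []):
                for ident in component.get("identifiers", []):
                    if ident.get("type") == "SMILES" and ident.get("value") in smiles_forms:
                        hits.append((index, key))
    return hits


def solvent_reaction(smiles):
    """Toy reaction with a single solvent input (hypothetical data)."""
    return {"inputs": {"solvent": {"components": [
        {"identifiers": [{"type": "SMILES", "value": smiles}]}
    ]}}}


# Reaction 0 uses toluene; reaction 1 has lost the methyl group.
toy = [solvent_reaction("Cc1ccccc1"), solvent_reaction("c1ccccc1")]
benzene_hits = reactions_containing(toy, BENZENE_FORMS)
print(benzene_hits)
```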

If you are unsure about how to implement any of these suggestions then do reach out to me. We could solve them quite quickly on a call, and I can send you some snippets of code to paste into the reaction editor for some bits.

@bdeadman
Collaborator

I was also going to say thanks for revealing the SMILES/name mismatch to me. I hadn't considered that possibility before so it is very useful to see that this can happen in normal usage of the reaction editor enumerator tool. I'll have to give some thought to how we might prevent that in future, through changes to the app and/or training.

@bdeadman
Collaborator

I'd also like to strengthen the dataset names and descriptions to help more users find them. I can put together a suggestion for you to consider, and I can make that change at the last minute before the dataset goes into the public database.

@bdeadman
Collaborator

bdeadman commented Mar 30, 2026

Suggested name/descriptions:

Bayesian optimization of 5 decarboxylative Knoevenagel condensation reactions

The Knoevenagel condensation between 5 pairings of aldehydes and malonic acid half-thioesters was studied in a Bayesian optimization campaign of 120 reaction datapoints. For each pairing, the catalyst, solvent, temperature and equivalents were optimized across 4 rounds of 6 experiments. Reactions performed by the Alan R. Healy group at New York University Abu Dhabi, and the pre-print publication is available on ChemRxiv at https://doi.org/10.26434/chemrxiv.15001213/v1. This dataset was used as training data for a subsequent transfer learning optimization of similar aldehydes and malonic acid derivatives (XXXX - will add dataset id here during submission processing).
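
The counts quoted in the description can be sanity-checked with a little arithmetic: dividing the total number of datapoints by the number of experiments per pairing gives the number of pairings. The sketch below uses only the figures stated in this thread (120 datapoints, 4 rounds, 6 experiments per round); the variable names are illustrative.

```python
total_datapoints = 120
rounds_per_pairing = 4
experiments_per_round = 6

# Each pairing contributes rounds x experiments datapoints.
per_pairing = rounds_per_pairing * experiments_per_round

# A non-zero remainder would mean the stated counts are inconsistent.
pairings, remainder = divmod(total_datapoints, per_pairing)
print(per_pairing, pairings, remainder)  # 24 5 0
```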

@DrHermit
Author

DrHermit commented Mar 30, 2026

@bdeadman Thanks a lot for the careful review and thoughtful suggestions!!
I have implemented the following changes and have attached the data files:

  • Corrected the names of the aldehydes (as well as the solvents and catalysts, which I had also forgotten) by adding an identifier to the name field. This was a field I forgot to enumerate, thanks for noticing!
  • DOI link is now added to the provenance
  • DMAP is now a REAGENT type.
  • The internal standard is moved to the workup step
  • The molecular sieve now has a new identifier with NAME '4Å Molecular Sieve'.
  • The benzene-toluene issue is now fixed. It likely arose because the methyl group was not selected when I copied structures from ChemDraw.

For the workup steps, I added the following steps. Let me know if you have any suggestions for this:

  • FILTRATION (keep the organic phase)
  • wash (1.5 mL ethyl acetate)
  • concentration
  • addition (0.05 mmol of ethylene carbonate and 0.7 mL of CDCl3)

And thanks for the suggestions to the name and description of the dataset! I saw your comment while I was writing this reply. I have already implemented the changes to both the BO and the transfer learning dataset.

Please take a look at my revisions and let me know if there are more changes you would like to see! Thank you very much for the help!!

Bayesian optimization of 5 decarboxylative Knoevenagel condensation reactions.json
Transfer Learning Optimization of 26 decarboxylative Knoevenagel condensation reactions.json

@bdeadman
Collaborator

@DrHermit - Thanks for the speedy corrections. I've checked it over and it is looking good now. If you don't mind making some final tweaks I would suggest the following for the workup descriptions:

  1. Add a temperature change step as the first workup ("The product mixture was cooled to 25 C"). Unfortunately the online editor doesn't support reordering workup steps so it is a bit fiddly to insert a step. To make it easier I have reordered the workups in a text editor and attached a single reaction for each dataset.
  2. Check the volume of ethyl acetate. The SI method states 3 × 1 mL but the ORD reaction has 1.5 mL.
  3. Check the volume_includes_solutes: false setting on the internal standard in CDCl3 addition. If you add 0.7 mL of a pre-prepared stock solution of ethylene carbonate in CDCl3 then I would set this as true.
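
The three tweaks above amount to small edits on an ordered list of workup steps. The following is a simplified sketch of that idea using plain dicts; the keys (`type`, `details`, `volume_ml`, `volume_includes_solutes`) are a flattened, hypothetical stand-in for the real reaction message, not its actual structure.

```python
def insert_workup(workups, step, position=0):
    """Return a new workup list with `step` inserted at `position`."""
    return workups[:position] + [step] + workups[position:]


# Workup sequence as described earlier in the thread (simplified).
workups = [
    {"type": "FILTRATION", "details": "keep the organic phase"},
    {"type": "WASH", "details": "ethyl acetate", "volume_ml": 1.5},
    {"type": "CONCENTRATION"},
    {
        "type": "ADDITION",
        "details": "0.05 mmol ethylene carbonate and 0.7 mL CDCl3",
        "volume_includes_solutes": False,
    },
]

# Item 1: cooling becomes the first workup step.
cooling = {"type": "TEMPERATURE", "details": "The product mixture was cooled to 25 C"}
reordered = insert_workup(workups, cooling)

# Item 2: the SI method states 3 washes of 1 mL each.
wash = next(step for step in reordered if step["type"] == "WASH")
wash["volume_ml"] = 3 * 1.0

# Item 3 is left for the author to decide: set the flag to True only if the
# 0.7 mL was a pre-prepared stock solution of ethylene carbonate in CDCl3.
print([step["type"] for step in reordered])
```

Building a new list rather than mutating in place keeps the original order available for comparison; note that the step dicts themselves are shared between the two lists, so the wash volume changes in both.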

Below is a zip file with a single reaction for each dataset. I've inserted a temp change as the first workup, changed the ethyl acetate volume to 3mL and added a few more details in the text descriptions. Note that I haven't done anything for item 3 above.

reordered_reaction_workups.zip

You should be able to upload these reactions (separately) to your ORD editor, check and make any final changes you want, then turn them into templates for enumeration. If any of this doesn't work smoothly then please come back to me rather than struggle with it. We're only making small optional improvements to the dataset now so we'll take the datasets as is if necessary.

For the next steps:

  1. You can replace the previous versions of your 2 datasets in your branch folder and commit the changes through GitHub. I've not tried it with the json form of an ORD dataset yet, so I'd recommend using the txtpb or binpb forms to be safe. Just paste in the new dataset files, and delete the earlier version of your datasets. After you push the change online it should appear in this pull request as a new commit.
  2. We (you or me) can then approve the merge of your datasets into the Data submission - Decarboxylative Olefination (Machine learning optimized) #237 branch.
  3. From there we will set up another pull request from the Data submission - Decarboxylative Olefination (Machine learning optimized) #237 branch to main, and this will trigger some automated processing including the assignment of formal ORD dataset and reaction IDs. During this step I will also swap the dataset names and descriptions to whatever we agree on (we can iterate a bit in GitHub until we're both happy), and importantly include the cross-referencing of the dataset IDs.

@DrHermit
Author

DrHermit commented Apr 3, 2026

@bdeadman Hi! Sorry for the delay in making the changes. I was busy on some other projects for the past few days.

Thank you very much for creating the workup procedure for me!! It was really helpful. I've updated the workup steps and uploaded the new dataset. Please check and let me know if I did everything correctly.
If I understand correctly, the new files should appear automatically without creating a new pull request, right?

I'm looking forward to having the dataset online! Thank you very much for all the suggestions and help!!

@bdeadman
Collaborator

bdeadman commented Apr 8, 2026

Thanks @DrHermit - I'll merge those into the branch now.

Notes:

  • the check_file_types and count_reactions checks are failing because the datasets are in the newer txtpb format, but the scripts are expecting the older pbtxt format. I have created issue ord-schema#782 ("Update dataset validation and submission scripts to also accept txtpb and binpb formats") to update these scripts for future datasets. The submission was checked manually and approved by me.
  • once datasets are in the public repo branch I'll swap them out with files using the txtpb extension at the same time that I update the dataset descriptions.
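
The kind of extension check that trips here can be illustrated as follows. This is a hypothetical sketch, not the actual ord-schema script: the function name, the set of accepted suffixes and the file names are all assumptions chosen to mirror the formats mentioned above.

```python
from pathlib import Path

# Suffixes named in this thread; the real scripts may accept others.
ACCEPTED_SUFFIXES = {".pbtxt", ".txtpb", ".binpb"}


def unsupported_files(filenames, accepted=ACCEPTED_SUFFIXES):
    """Return the file names whose final suffix is not in `accepted`."""
    return [name for name in filenames if Path(name).suffix.lower() not in accepted]


submitted = ["bo_dataset.txtpb", "transfer_dataset.txtpb", "notes.json"]

# A check that only knows the older suffix rejects every txtpb file.
rejected_by_old_check = unsupported_files(submitted, accepted={".pbtxt"})

# Widening the accepted set lets the newer format through.
rejected_by_new_check = unsupported_files(submitted)
print(rejected_by_old_check)
print(rejected_by_new_check)
```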

@bdeadman bdeadman merged commit 6c8261a into open-reaction-database:#237 Apr 8, 2026
3 of 4 checks passed

2 participants