
Upload Suzuki-coupling reaction dataset #226

Open
weiqiz3 wants to merge 1 commit into open-reaction-database:main from weiqiz3:suzuki-coupling-submission

weiqiz3 commented May 23, 2025

Hi Ben, this is Weiqi Zhang. We had a meeting about exporting Suzuki-coupling reaction data last month. Here's the dataset and the template.

The attempts_dataset.pbtxt file contains all the Suzuki-coupling reactions from the paper https://www.nature.com/articles/s41586-024-07892-1.
template+data.zip

Collaborator

bdeadman commented Jun 4, 2025

Hi @weiqiz3, many thanks for the updated submission. It's looking much better. I've reviewed it now and have the following feedback.

  • Please improve the dataset name and description. I can add these for you, or show you how to do it using ord-schema or ord-app. They should be sufficiently unique and descriptive that users can gain some insight into your dataset. For your dataset I would suggest including the following in the description:
    • AI guided synthesis of 44 products on a 2nd generation Burke-type automated small molecule synthesis machine.
    • DOI to the paper.
    • A short description of the reaction type/class and/or the purpose (e.g. for organic electronics).
  • The result measurements have type YIELD but have been given mass measurements. Please clarify whether these numbers are measured masses, percentage yields (gravimetric), or something else. Based on your response I'll advise on how to represent this in the schema.
  • The outcomes should have at least one analysis method associated with them. For gravimetric yield you can use analysis type WEIGHT.
    • NMR and LC should be specified as separate analysis messages if there is relevant data to add.
  • The "Reaction Note / Purification / Characterization" feature should ideally be split into 3 separate messages.
    • Reaction Note could go in as a note.
    • Purification can be put in as a workup of CUSTOM type.
    • Characterization would probably go in as an outcome.analysis.data entry, preferably split by NMR/LC etc.
    • If you can export the data into a table with these fields as separate columns, then we can split the dataset (by types of data present) and process the splits separately using modified templates.

Author

weiqiz3 commented Jan 5, 2026

Hi @bdeadman, sincere apologies for the delayed response. We have updated our dataset according to your specifications. You can find both the dataset and the spreadsheet here:

Suzuki-Coupling for light-harvesting materials.json

https://docs.google.com/spreadsheets/d/1ES6cmpph9pYGcF3P2MeAww4hT4tk0G_4TLcVykMWY9w/edit?gid=0#gid=0

Thank you!

Collaborator

bdeadman commented Feb 3, 2026

Review Comments

Hi @weiqiz3 - this is getting close to being ready now. I have a couple of changes that I will strongly request, and some more that I would recommend to improve the usability of the ORD dataset. Let me know if you have any questions.

Mandatory changes

  • Please clarify what the yield measurement data is. My interpretation (based on how the schema has been used) is that LC was used to estimate the weight of product obtained. Was this calibrated for each product, or is this an estimate from a CAD or ELSD detector?
    • The details field associated with this data is currently "Weight". This details description could be made more descriptive to address this concern. For example "isolated yield", "weight determined by LC-ELSD", "weight determined by LC with calibration against an external standard".
    • Note that there is also a WEIGHT type of analysis in ORD (instead of the LC type currently selected) which we normally use for isolated yields measured by weighing on scales. Change to this if it is more appropriate.
    • Also please double check that the values being reported in the ORD dataset are what you expect. The yields reported in your ORD dataset don't match the yield column in your source data so I suspect you have applied a conversion. That's OK, but please check them as I cannot without knowing more about your conversion.
  • The catalyst is currently defined with a name, and a CAS number. Could we please add a SMILES and/or InChI string - I know that they don't always work perfectly for organometallic catalysts, but they will allow the ORD viewer to make an approximate visual preview for the molecule.
    • for Pd XPhos G4 use:
      • O=S(=O)(C)OPd(N(C)C1=CC=2)P(C(=CC1)C(=CC=1)C(C(=CC1C(C)C)C(C)C)=C(C=1)C(C)C)(C(CCC1)CC1)C(CCC1)CC1
      • InChI=1B/C47H64NO3PPdS/c1-33(2)36-31-42(34(3)4)47(43(32-36)35(5)6)41-27-16-19-29-45(41)52(37-21-11-9-12-22-37,38-23-13-10-14-24-38)53(51-54(8,49)50)46-30-20-17-26-40(46)39-25-15-18-28-44(39)48(53)7/h15-20,25-35,37-38,52H,9-14,21-24H2,1-8H3
    • for Pd P(tBu)3 G4 use:
      • O=S(C)(=O)OPd(N(C)C1=CC=2)P(C(C)(C)C)(C(C)(C)C)C(C)(C)C
      • InChI=1B/C26H42NO3PPdS/c1-24(2,3)31(25(4,5)6,26(7,8)9)32(30-33(11,28)29)23-19-15-13-17-21(23)20-16-12-14-18-22(20)27(32)10/h12-19,27H,1-11H3
    • for Pd XPhos G2 use:
      • ClPd(C(=CC3)C(C1=CC=2)=CC=3)P(C(CCC1)CC1)(C(CCC1)CC1)C(=CC1)C(=CC=1)C(C(=CC1C(C)C)C(C)C)=C(C=1)C(C)C
      • InChI=1B/C45H59ClNPPd/c1-31(2)34-29-40(32(3)4)45(41(30-34)33(5)6)39-25-14-17-27-43(39)48(35-19-9-7-10-20-35,36-21-11-8-12-22-36)49(46)44-28-18-15-24-38(44)37-23-13-16-26-42(37)47-49/h13-18,23-33,35-36H,7-12,19-22,47H2,1-6H3
    • If you don't want to edit your translation code, it might be possible to do a simple find and replace on the .pbtxt dataset to add the additional identifiers to each of these three catalysts.
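The find-and-replace idea above can be sketched in plain Python. This is only an illustration under stated assumptions: it assumes each catalyst appears in the .pbtxt as an `identifiers` block with `type: NAME` immediately followed by `value: "<name>"`, the helper name `add_identifier` is hypothetical, and `PLACEHOLDER_SMILES` stands in for the string given in the list above.

```python
import re


def add_identifier(pbtxt, name, id_type, id_value):
    """Add a sibling identifiers block after each NAME identifier equal to `name`."""
    # Assumed layout: "identifiers {", "type: NAME", 'value: "<name>"' and "}"
    # on consecutive lines, as protobuf text format usually prints them.
    pattern = re.compile(
        r'^(?P<indent>[ \t]*)identifiers \{\n'
        r'(?P=indent)[ \t]+type: NAME\n'
        r'(?P=indent)[ \t]+value: "' + re.escape(name) + r'"\n'
        r'(?P=indent)\}\n',
        re.MULTILINE,
    )
    # Escape backslashes and quotes so the value stays a valid quoted string.
    escaped = id_value.replace("\\", "\\\\").replace('"', '\\"')

    def _append(match):
        indent = match.group("indent")
        extra = (
            f'{indent}identifiers {{\n'
            f'{indent}  type: {id_type}\n'
            f'{indent}  value: "{escaped}"\n'
            f'{indent}}}\n'
        )
        # Skip blocks that already carry the new identifier (safe to re-run).
        if match.string.startswith(extra, match.end()):
            return match.group(0)
        return match.group(0) + extra

    return pattern.sub(_append, pbtxt)


# Hypothetical fragment of a dataset file, for illustration only.
sample = (
    'components {\n'
    '  identifiers {\n'
    '    type: NAME\n'
    '    value: "Pd XPhos G4"\n'
    '  }\n'
    '}\n'
)
updated = add_identifier(sample, "Pd XPhos G4", "SMILES", "PLACEHOLDER_SMILES")
print(updated)
```

If the real file is formatted differently the pattern will simply not match, so it is worth counting the inserted blocks and re-validating the dataset afterwards.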

Recommended Changes

  • I will suggest some minor edits to the name and description for clarity.
    • Name: Automated Suzuki-coupling to prepare light-harvesting molecules.
    • Description: AI-guided Suzuki-coupling to prepare 44 light-harvesting products on a custom-made automated small molecule synthesis machine. The closed loop optimization campaign targeted increasing photostability of the products. Further details about the closed loop optimization are available in Nature, 2024, DOI: 10.1038/s41586-024-07892-1.
  • Add reaction identifiers to help users (and yourselves) connect this dataset to other data you have on this project. Use the CUSTOM type and give the identifier a name and description in the details field.
    • the data spreadsheet has a hexadecimal "_id"
    • Extended Data Table 1 in the preprint paper has "Round-ID"
  • For the same reason as above, the products could be given custom identifiers
    • Extended Data Table 1 in the preprint paper has "DBA_Name" - how do these map onto the synthesised molecules?
  • For the same reason as above, the halide and boron reactants could have their names (DB_01 and A_010) added to the respective input components as identifiers of CUSTOM type. This would help map reactions and reactants from the ORD dataset to your source data.
  • Clarify how the 44 reactions have been selected from the full set of 78 in your source data table. The ORD encourages inclusion of 'negative results' unless there is a known problem with the quality of the data (e.g. vial spilled).
    • This also answers some of the questions for ORD in the source data spreadsheet: yes, ORD would normally take reactions where synthesized==FALSE and/or synthesized==TRUE but purified==FALSE.
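As a rough sketch of the CUSTOM identifiers suggested above, the snippet below builds plain dictionaries shaped like an ORD identifier (`type`, `details`, `value`). The helper name `custom_identifier`, the description wording, and the `PLACEHOLDER_*` values are assumptions for illustration; only the `_id`, `Round-ID`, and `DB_01` names come from the comments above.

```python
import json


def custom_identifier(name, description, value):
    """Return a dict shaped like an ORD identifier of CUSTOM type."""
    # The name and a short description go into the details field.
    return {"type": "CUSTOM", "details": f"{name}: {description}", "value": value}


# Reaction-level identifiers (values are placeholders, not real data).
reaction_identifiers = [
    custom_identifier("_id", "record ID in the source data spreadsheet", "PLACEHOLDER_HEX_ID"),
    custom_identifier("Round-ID", "ID from Extended Data Table 1", "PLACEHOLDER_ROUND_ID"),
]

# Component-level identifier for one input component.
component_identifier = custom_identifier(
    "building block name", "internal name used in the source data", "DB_01"
)

print(json.dumps(reaction_identifiers + [component_identifier], indent=2))
```

Keeping the name and description together in `details` means a reader of the dataset can tell what each otherwise opaque token refers to.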

Tidying up the MML to ORD conversion code

The following are some awkward 'bugs' that I found in your ORD dataset. While we can wave them through this time (or just exclude them from the dataset), we should put some more thought into how your internal reaction data is mapped into the ORD schema.

  • The parsing of additional info into the 'procedure details' of ORD notes is not working correctly. Probably better to drop this information if it cannot be fixed.
    • if 'extra_additions' is empty then it should not be added to the ORD procedure details. It looks like a blank 'extra_additions' is included if any of the other concatenated fields are included in ORD procedure details.
    • same for 'observations' - this does not appear to be used in this ORD dataset, yet the empty tag is included in many reaction records.
    • 'solvents' looks like it is associated with "purification_step_1" and is for MPLC or Prep LC purification. This should technically go into ORD as a WORKUP message. If it is to be left in 'procedure details', it should also include the associated "purification step method" from your table. Otherwise this could be interpreted as additional solvents added to the reaction.
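One way the concatenation could avoid the empty tags described above is sketched here. It is not the authors' conversion code: the function name `build_procedure_details`, the `"; "` separator, and the example row values are assumptions, and the `labels` argument is just one possible way to make the 'solvents' entry self-explanatory.

```python
def build_procedure_details(fields, labels=None):
    """Join non-empty source fields into a single procedure details string."""
    labels = labels or {}
    parts = []
    for key, value in fields.items():
        text = "" if value is None else str(value).strip()
        if not text:
            # Skip empty fields so no blank tag reaches the ORD record.
            continue
        parts.append(f"{labels.get(key, key)}: {text}")
    return "; ".join(parts)


# Hypothetical source row: two empty fields and one purification solvent entry.
row = {"extra_additions": "", "observations": None, "solvents": "PLACEHOLDER_SOLVENTS"}
details = build_procedure_details(
    row, labels={"solvents": "purification_step_1 solvents"}
)
print(details)
```

A row where every field is empty yields an empty string, which the caller can use as the signal to leave procedure details unset.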

Collaborator

bdeadman commented Feb 3, 2026

Answering some of the more specific questions tagged for ORD in your source data spreadsheet here. Some of the responses will also be evident from my suggestions in the review comments.

catalyst: Can we find the mass of the catalyst in the spreadsheet? If not, can we determine it? Is it okay to name an input Catalyst in the ORD database?

This one is a little complex. Yes you could have an input named as 'Catalyst' in the ORD and that would be fine. However, for your dataset the catalyst needs to be included in the same input as the other chemical components since we don't have a defined addition order in your method. When the chemical components are listed in separate inputs this shows definitive information about the order of addition, and how the chemicals were added into the vessel.

name_boron, name_halide, molecule_id_boron, molecule_id_halide: Will these be useful to insert into database since they are only tokens?

Yes I think these are useful to include in the ORD dataset. While they won't be directly applicable to external users of the dataset, they will have use to your group, and they are useful for mapping the ORD dataset onto any other associated data you may publish in the future. They can be included as CUSTOM type identifiers at the component level, and include a name and brief description in the associated details field.

ratio_B_to_X: Is it necessary to export this since it can be calculated?

This is not necessary for ORD since the specific quantities are included with the input components. If you did want to include it you can add a FEATURE (type: NUMBER, data: the ratio, description: "ratio of boron reactant to halide") to the boron input component.
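If the ratio is included, it could be derived from the input amounts rather than copied from the spreadsheet. The sketch below is an assumption-laden illustration: the function name `ratio_feature`, the feature key, the rounding, and the example amounts are hypothetical, and the dictionary only mirrors a numeric value with a description.

```python
def ratio_feature(boron_mmol, halide_mmol):
    """Build a feature-like dict holding the boron to halide molar ratio."""
    if halide_mmol <= 0:
        raise ValueError("halide amount must be positive")
    return {
        "ratio_B_to_X": {
            # Rounded to avoid floating-point noise in the stored value.
            "float_value": round(boron_mmol / halide_mmol, 3),
            "description": "ratio of boron reactant to halide",
        }
    }


# Hypothetical amounts in mmol, for illustration only.
feature = ratio_feature(boron_mmol=0.15, halide_mmol=0.10)
print(feature)
```

Computing the value from the recorded quantities keeps the feature consistent with the input components if the amounts are ever corrected.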

purification_loss: Is it necessary to insert this into the database since it's all 0?

This is not necessary for ORD without more context about what it means. In this dataset since it is always ZERO it can be ignored.

person_name: Should we normalize all names, such as David Friday, Dave, dave, etc?

ORD preference would be to name the individual experimenters if possible. If this is not possible then a lab name with contact email will be OK.

_id: Is this a good value to use for the reaction ID?

Yes, please do include that as a reaction ID. If nothing else it will help us run spot checks on the ORD dataset against the source data spreadsheet.

(calculated descriptors): What do we have available? What does ORD want?

If you want to add calculated descriptors to your molecules these can be included as a FEATURE under the associated input or outcome component chemical. Be sure to include a useful description of what it means. Features can be of type NUMBER, TEXT, URL, or even a file UPLOAD.

gas: Is the way we've represented it correct?

Yes it is correct to put the atmosphere gas under conditions - pressure - atmosphere. The control type could be specified as SEALED. e.g.

"""
pressure {
  control {
    type: SEALED
  }
  atmosphere {
    type: NITROGEN
  }
}
"""

purification steps: Are we representing these correctly in the ORD data model?

In the long term I think we need to have another look at how your purification steps are mapped into the ORD data model. It would be much better if the translation script parsed them, and created the appropriate WORKUP messages in the ORD dataset. That said, we can postpone this for now and focus on getting this dataset 'good enough' for release.

synthesized: Does ORD want rows where synthesized==FALSE?

Yes we would take these in the database unless there is a known problem with the execution of the reaction.

purified: Does ORD want rows where synthesized==TRUE but purified==FALSE?

Yes we would take these in the database unless there is a known problem with the execution of the reaction.

notes: Does anything in this field belong in ORD?

The examples shown in your source data look like they would be a good fit for OBSERVATIONS in the ORD schema.

Collaborator

bdeadman commented Feb 3, 2026

Attaching my review notes in notebook form here. No need for the dataset authors to use this file since all relevant comments have been copied out into the GitHub issue.

MMLI_suzuki_review.ipynb



Projects

Status: Formal Review


2 participants