@lionel42

lionel42 commented Nov 6, 2025

Hello,

We are the Laboratory for Air Pollution at Empa, and we would like to contribute our spectra to MassBank.

I wanted to test the format locally but ran into issues with the validation software; see MassBank/MassBank-web#414 and MassBank/MassBank-web#413.

This is just a draft for now. We have hundreds of spectra to upload, but we first wanted to ask about the format and the metadata.

I created a name and an identifier for our lab: EAP, for Empa Air Pollution.

Happy to receive any feedback ;)

@lionel42
Author

lionel42 commented Nov 6, 2025

I have opened an issue to ask for help.

I will be away for one week (holidays), so I will continue working on this when I am back.

@lionel42
Author

lionel42 commented Nov 6, 2025

One point that I find weird is that the validator does not seem to like the Accession strings.

@schymane
Member

schymane commented Nov 6, 2025

> One point that I find weird is that the validator does not seem to like the Accession strings.

You can find the details about how to construct the Accession IDs here:
https://github.com/MassBank/MassBank-web/blob/main/Documentation/MassBankRecordFormat.md#2.1.1

It appears that you've put the name in the Accession, whereas we expect a number, e.g.: ACCESSION: MSBNK-AAFC-AC000101
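A quick local pre-check for the Accession format could look like the sketch below. The regular expression is an assumption that paraphrases section 2.1.1 of the record format document (the `MSBNK-` prefix, then a contributor ID, then a record ID); the linked specification remains the authoritative source for the allowed characters and lengths. The second accession in the usage lines is a made-up example of a name-based ID, not one taken from this PR.

```python
import re

# Assumed pattern: "MSBNK-" + contributor ID + "-" + record ID.
# Verify the character classes and length limits against section 2.1.1.
ACCESSION_RE = re.compile(r"MSBNK-[A-Za-z0-9_]{1,32}-[A-Z0-9_]{1,64}")


def is_valid_accession(accession: str) -> bool:
    """Return True if the whole string matches the assumed accession pattern."""
    return ACCESSION_RE.fullmatch(accession) is not None


# The example from the comment above passes the check.
print(is_valid_accession("MSBNK-AAFC-AC000101"))
# Hypothetical name-based ID: rejected because of the lowercase record ID.
print(is_valid_accession("MSBNK-EAP-dichloromethane"))
```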

@schymane
Member

schymane commented Nov 6, 2025

> I have opened an issue to ask for help.
>
> I will be away for one week (holidays), so I will continue working on this when I am back.

Please note that we have detailed record specifications to help explain what is needed in the various record entries:
https://github.com/MassBank/MassBank-web/blob/main/Documentation/MassBankRecordFormat.md#table-1--massbank-record-format-summary
...and then lots of details and examples in subsequent subsections.

It seems from the validation output that at least one other compulsory field is missing: AC$INSTRUMENT
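Before running the full validator, a record can be screened for missing tags with a few lines of Python. This is only a sketch: the `required` list here contains just the two tags named in this thread (`ACCESSION` and `AC$INSTRUMENT`), the sample record is a made-up fragment, and the complete list of compulsory fields has to come from Table 1 of the record format document.

```python
def missing_tags(record_text: str, required: list[str]) -> list[str]:
    """Return the required tags that never appear as 'TAG: value' in the record."""
    present = set()
    for line in record_text.splitlines():
        tag, sep, _value = line.partition(": ")
        if sep:  # only lines of the form "TAG: value" carry a tag
            present.add(tag)
    return [tag for tag in required if tag not in present]


# Hypothetical record fragment, for illustration only.
RECORD = """\
ACCESSION: MSBNK-AAFC-AC000101
AC$CHROMATOGRAPHY: KOVATS_RTI 818
"""

# Only the tags mentioned in this thread; extend from the specification.
print(missing_tags(RECORD, ["ACCESSION", "AC$INSTRUMENT"]))
```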

The IPB Halle team are at BioHackEU25 this week, so they are a bit distracted, but will look into this once they are back.

@lionel42
Author

@schymane Thanks for the answers. I managed to fix the format of our files.

Before I add the whole library, is it possible to confirm/register our laboratory and the prefix? Do you need any additional information from our side?

@meier-rene
Collaborator

Hi Lionel,
Do you consider this contribution complete? At the moment there are just two small issues left: one space too many, and an empty peak annotation table that needs to go. If so, I can fix these minor things and merge your contribution. We also maintain a table of our contributors: https://github.com/MassBank/MassBank-data/blob/dev/List_of_Contributors_Prefixes_and_Projects.md. Please tell me what you would like to see there; otherwise I will guess something for you.
Best, Rene

@lionel42
Author

> Hi Lionel, do you consider this contribution complete? At the moment there are just two small issues left: one space too many, and an empty peak annotation table that needs to go. If so, I can fix these minor things and merge your contribution. We also maintain a table of our contributors: https://github.com/MassBank/MassBank-data/blob/dev/List_of_Contributors_Prefixes_and_Projects.md. Please tell me what you would like to see there; otherwise I will guess something for you. Best, Rene

Hi Rene,

Thanks for reaching out.

We still need more time: we want to go through all files manually to do a quality check.
Also, we build the files automatically, so I will try to fix the two issues in our code.

About the table of contributors, we discussed it and suggest the following:

  • Database: Empa_Air_Pollution
  • Research Group / Research Project: Empa - Laboratory for Air Pollution / Environmental Technology
  • Country: Switzerland
  • Prefix of ID: EAP
  • Project Tag: HALOHUNTER

I had initially also changed that file in this PR. Should I do it this way, or would you prefer to update it in a separate PR?

We will notify you when ready to merge ;)

AC$CHROMATOGRAPHY: KOVATS_RTI 818
PK$SPLASH: splash10-000t-9000000000-90ef1466a5c67cf33c97
PK$ANNOTATION: m/z formula_count exact_mass error(ppm) tentative_formula intensity_fraction
49.98421 1 49.99178 151.43 H3CCl+ 0.76

Delete or review. H3CCl+ is unlikely given the structure, and if it were real, its isotope signal is missing.

69.94142 1 69.93716 -60.95 Cl2+ 1.00
71.93848 1 71.93421 -59.40 Cl[37Cl]+ 0.77
81.94018 1 81.93716 -36.89 CCl2+ 1.00
83.94540 1 83.93421 -133.34 CCl[37Cl]+ 0.60

This is weird: 83.94540 m/z appears twice in a row with two different assignments? Also, the NIST spectrum has a strong signal at 83 m/z (HCCl2), but it is absent here?

Why is more than one formula generally assigned to a given mass? I see up to 3 formulas assigned per mass for this compound (some other compounds have up to 4 assignments). Do we want that? It seems weird to me.
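To get an overview of how often this happens, the annotation rows can be grouped by m/z. The sketch below uses four rows from this record's diff and assumes the six whitespace-separated columns named in the `PK$ANNOTATION` header; the function name is made up for illustration.

```python
from collections import defaultdict

# Rows from this record; columns follow the PK$ANNOTATION header:
# m/z formula_count exact_mass error(ppm) tentative_formula intensity_fraction
ROWS = """\
81.94018 1 81.93716 -36.89 CCl2+ 1.00
83.94540 1 83.93421 -133.34 CCl[37Cl]+ 0.60
83.94540 2 83.95281 88.23 H2CCl2+ 0.86
85.94626 1 85.94986 41.85 H2CCl[37Cl]+ 1.00
"""


def multi_assigned(rows: str) -> dict[str, list[str]]:
    """Map each m/z that has more than one tentative formula to those formulas."""
    by_mz = defaultdict(list)
    for line in rows.splitlines():
        fields = line.split()
        if len(fields) != 6:  # skip anything that is not an annotation row
            continue
        mz, _count, _exact, _error, formula, _fraction = fields
        by_mz[mz].append(formula)
    return {mz: formulas for mz, formulas in by_mz.items() if len(formulas) > 1}


print(multi_assigned(ROWS))
```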

81.94018 1 81.93716 -36.89 CCl2+ 1.00
83.94540 1 83.93421 -133.34 CCl[37Cl]+ 0.60
83.94540 2 83.95281 88.23 H2CCl2+ 0.86
85.94626 1 85.94986 41.85 H2CCl[37Cl]+ 1.00

NIST has a strong 85 m/z signal (maybe H3CCl2), but it is absent here? --> Oh no, I see that the unassigned peaks are listed separately below. Still, it seems silly to me that two of the most abundant peaks, 83 and 85 m/z, are not assigned and not listed here.

93.93877 1 93.93716 -17.17 C2Cl2+ 0.57
94.94653 1 94.94498 -16.31 HC2Cl2+ 1.00
95.95457 1 95.94834 -64.96 HC[13C]Cl2+ 0.01
95.95457 2 95.95281 -18.37 H2C2Cl2+ 1.00

How is it decided which assignment is number 1 and which is number 2? It looks like intensity_fraction shows how much of the peak is assignable to the formula; wouldn't it make more sense to give number 1 to the assignment with the higher intensity_fraction?
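The reordering suggested here could be sketched as follows: group the rows by m/z and renumber formula_count so that the highest intensity_fraction comes first. This is an illustration of the suggestion only, not how alpinac or the export script currently numbers the assignments, and the function name is hypothetical.

```python
from collections import defaultdict


def rerank_by_fraction(rows: list[str]) -> list[str]:
    """Renumber formula_count per m/z, highest intensity_fraction first."""
    groups = defaultdict(list)
    for row in rows:
        mz, _count, exact, error, formula, fraction = row.split()
        groups[mz].append((float(fraction), exact, error, formula, fraction))
    reranked = []
    for mz, candidates in groups.items():  # dicts keep first-seen m/z order
        candidates.sort(key=lambda c: c[0], reverse=True)
        for rank, (_value, exact, error, formula, fraction) in enumerate(candidates, start=1):
            reranked.append(f"{mz} {rank} {exact} {error} {formula} {fraction}")
    return reranked


rows = [
    "95.95457 1 95.94834 -64.96 HC[13C]Cl2+ 0.01",
    "95.95457 2 95.95281 -18.37 H2C2Cl2+ 1.00",
]
for line in rerank_by_fraction(rows):
    print(line)
```

With the two rows above, H2C2Cl2+ (fraction 1.00) becomes assignment 1 and HC[13C]Cl2+ (fraction 0.01) becomes assignment 2.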

AC$CHROMATOGRAPHY: KOVATS_RTI 566
PK$SPLASH: splash10-002o-9000000000-17c33adb4eb05f58d77f
PK$ANNOTATION: m/z formula_count exact_mass error(ppm) tentative_formula intensity_fraction
23.98798 1 0.00000 0.00 - 0.00

what's happening here? something seems wrong...

AC$CHROMATOGRAPHY: KOVATS_RTI 396
PK$SPLASH: splash10-0udi-3900000000-5d7701f39c27b4d50277
PK$ANNOTATION: m/z formula_count exact_mass error(ppm) tentative_formula intensity_fraction
42.99847 1 42.99785 -14.31 C2F+ 0.98

so many peaks and so few assignments, what's going on here?

ok, not sure what those masses <40 m/z are, CH-fragments in NIST?

Author

HCl? But that would have to come from a weird recombination effect during fragmentation?

OK, but CO(?) is missing, which is in NIST. The ratio for the S+ isotopes is also a bit off: 3.3% instead of 4.3%?

I see that the molecular ion C10H22 is there, so why is it not being assigned to a fragment correctly? 142.1721 is -0.36 ppm...?
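For reference, the arithmetic behind a ppm figure like this can be reproduced as below. Two things are assumptions: the sign convention (measured minus exact; the annotation rows in this PR appear to use the opposite sign), and that the quoted -0.36 ppm was computed against the neutral C10H22 mass. The exact masses listed in the annotation rows look electron-corrected, so the cation mass is shown as well. Atomic masses are standard monoisotopic values rounded to 8 decimals.

```python
# Monoisotopic masses in u (standard values, rounded).
MASS_C = 12.0
MASS_H = 1.00782503
MASS_ELECTRON = 0.00054858


def ppm_error(measured: float, exact: float) -> float:
    """Relative mass error in ppm; sign convention (measured - exact) is assumed."""
    return (measured - exact) / exact * 1e6


neutral_c10h22 = 10 * MASS_C + 22 * MASS_H       # neutral molecule
cation_c10h22 = neutral_c10h22 - MASS_ELECTRON   # singly charged molecular ion

measured = 142.1721  # m/z quoted in the comment above
print(f"vs neutral mass: {ppm_error(measured, neutral_c10h22):.2f} ppm")
print(f"vs cation mass:  {ppm_error(measured, cation_c10h22):.2f} ppm")
```

Under these assumptions the error is about -0.36 ppm against the neutral mass and about 3.5 ppm against the cation mass, so the choice of reference mass matters when comparing against the assignment tolerance.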

Could be improved. Missing the Cl isotopes for the two highest signals, HC2F3Cl+ and HC2F4Cl+, and all masses below 50 m/z.

add NIST spectrum 1,2-dichloro-1,1-difluoroethane!

Author

There are 2 NIST spectra already available (HCFC-132a and HCFC-132b).

Should we rename ours to include the a or the b?
Or extract both spectra separately?
We also have an HCFC-132c registered in the target list, and NIST lists 6 compounds for that formula (https://webbook.nist.gov/cgi/cbook.cgi?Formula=H2C2F2Cl2&NoIon=on&Units=SI).

Do you think we should clean that up?

CHECK! something really off in the ratios, missing peaks

weird: prominence of 63.00332 HC2F2+ masses in the 30 m/z range...

issues? missing 83 m/z H2C2F3+ and 32 m/z HCF+? because we filter out 32?

relative intensities pretty different from NIST

missing [13C]F+ and [13C]F2+ isotope signals

ok, missing 32 HCF+, relative intensity of 64.01210 too low?

ok but sparse on low intensity ions, maybe noise threshold set too high?

ok but also maybe noise threshold set too high?

missing again 32 HCF+

again missing 32 HCF+

again missing 32 CHF+ and also 44 HC2F+ - probably overlap with CO2


@lionel42
Author

@Alina-beal I had a look at your comments; many of the issues are similar.

It is good that you have opened these threads for each spectrum. Now we can comment, spectrum by spectrum, on how we want to go forward.

Identification of fragments by alpinac

You mentioned empty lines.
In a first step, I added only the identified fragments to the identification part.
In a second step, I thought it would also be good to add the masses that were not identified. Those are the empty lines with 0.

I guess we need to decide whether we want to include the unidentified masses (and maybe manually increase the mass uncertainty of those fragments so that alpinac can identify them?).

Modifying spectra files

For the files where the spectrum is wrong, we will need to discuss how we coordinate the modifications to the current files.
The easiest would be to extract from a better sample (like a standard or high-signal air).
We can also check important missing fragments manually to see whether the issue comes from the extraction algorithm or from something else.

Masses below 20 or too high

Since we have the cutoff and we document it in the metadata, I assume it is okay that we miss masses for some compounds under 20.

@Alina-beal

We could proceed as follows:

  • download missing NIST spectra, compare
  • re-add filtered masses or adjust mass filtering window for 32 CHF+ (vs O2), 44 HC2F+ (vs CO2), 83 m/z H2C2F3+ (Kr?), missing CO+ (COS) where necessary/appropriate
  • manually check and maybe re-extract HFC-152, HCFC-124, HCFC-141a, ...
  • assign unassigned peaks or clarify why sometimes clearly assignable ones are not being assigned like in the case of decane...
  • retract Mr and Ms
  • I would go through the non-NIST compounds that are still not viewed manually
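For the mass filtering point, one option is to replace a nominal-mass exclusion with a narrow exact-mass window around the interfering ion. The sketch below illustrates this for CHF+ versus O2+ at m/z 32. It assumes the current filter removes the whole nominal mass and that the instrument's mass accuracy supports a window this narrow; both need checking against the real data. The function name and the 0.008 u half-width are made up for illustration; the atomic masses are standard monoisotopic values.

```python
# Monoisotopic masses in u (standard values, rounded).
MASS_H = 1.00782503
MASS_C = 12.0
MASS_O = 15.99491462
MASS_F = 18.99840316
MASS_ELECTRON = 0.00054858

O2_PLUS = 2 * MASS_O - MASS_ELECTRON                   # about 31.98928
CHF_PLUS = MASS_C + MASS_H + MASS_F - MASS_ELECTRON    # about 32.00568


def keep_peak(mz: float, interferences: list[float], half_width: float) -> bool:
    """Keep a peak unless it lies within half_width of a known interference."""
    return all(abs(mz - ref) > half_width for ref in interferences)


# A nominal-mass window (+/- 0.5 u) removes CHF+ together with O2+ ...
print(keep_peak(CHF_PLUS, [O2_PLUS], half_width=0.5))
# ... while a narrow window (hypothetical +/- 0.008 u) keeps CHF+ and still drops O2+.
print(keep_peak(CHF_PLUS, [O2_PLUS], half_width=0.008))
print(keep_peak(O2_PLUS, [O2_PLUS], half_width=0.008))
```

The two ions are only about 0.016 u apart, so the usable half-width is bounded by the mass error seen at low m/z in these spectra.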
