get_by_id() - Identifying Scans by an ID Dictionary #30

CCranney · 2021-03-05T21:02:53Z

Hello!

In a program I'm writing, I need to retrieve several hundred MS spectra from an mzXML file, given a list of scan numbers. I found the get_by_id() function in the pyteomics XML class, which generally does this. Looking at the code, it appears there are two options: either it parses from the beginning of the file until it finds a spectrum with the scan number passed in as a parameter, or it looks the spectrum up in a dictionary. The latter is what I'm looking for, because starting from the beginning of the file causes huge delays for large scan numbers. However, I have not been able to activate the dictionary lookup.

Here is what I have tried so far. My test for whether a method works is to check whether it still takes significantly longer to retrieve spectra with large scan numbers than spectra with small scan numbers. A dictionary lookup should make retrieval time roughly independent of the scan number.

1. Set the use_index parameter of the mzxml.read() function to True. This does not appear to work.
2. Call the xml.build_id_cache() function to initialize the desired dictionary before calling xml.get_by_id(). This does not appear to work.
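The timing test described above could be sketched roughly as follows. This is only an illustration: time_lookup and compare_lookups are hypothetical helper names, and fetch stands for whatever callable performs the lookup (for example, a reader's get_by_id method).

```python
import time


def time_lookup(fetch, scan_id, repeats=3):
    """Return the best wall-clock time (in seconds) of calling fetch(scan_id)."""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        fetch(scan_id)
        best = min(best, time.perf_counter() - start)
    return best


def compare_lookups(fetch, early_id, late_id, repeats=3):
    """Time the lookup of an early scan and a late scan with the same callable."""
    return {
        "early": time_lookup(fetch, early_id, repeats),
        "late": time_lookup(fetch, late_id, repeats),
    }
```

If the two timings are of similar magnitude for a large file, the lookup is probably not scanning from the start of the file. A large gap in favor of the early scan suggests sequential parsing.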

I am continuing to investigate the issue and ways to bypass it, but I thought I'd reach out and ask about this. Is there a more straightforward way to trigger this identification by dictionary method? Am I misunderstanding the possible uses of the xml.get_by_id() function? Do you have recommendations for how to move forward?

Answered by levitsky

levitsky · 2021-03-05T22:40:11Z

Maintainer

Hi!

Generally, you shouldn't need to do anything special to achieve this. use_index=True is needed if you create the parser with mzxml.read(); if you call mzxml.MzXML instead, indexing is enabled by default.

In any case, the created object should have _offset_index populated with the byte offsets of elements. You can check len(reader) to get the number of items in the index.
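To illustrate the general idea of a byte-offset index, here is a toy sketch in plain Python. It is not how pyteomics implements it: the regular expression, the helper names build_offset_index and read_scan, and the in-memory sample data are all assumptions made for the example, and it ignores nested scan elements.

```python
import io
import re

# Matches the opening tag of a scan and captures its "num" attribute.
SCAN_START = re.compile(rb'<scan\s+num="(\d+)"')


def build_offset_index(stream):
    """Read the stream once and map each scan ID to the byte offset of its opening tag."""
    stream.seek(0)
    data = stream.read()  # acceptable for a toy; a real indexer would read in chunks
    return {m.group(1).decode(): m.start() for m in SCAN_START.finditer(data)}


def read_scan(stream, index, scan_id):
    """Seek directly to a scan and return its raw bytes up to the closing tag."""
    stream.seek(index[scan_id])
    chunk = stream.read()
    end = chunk.index(b"</scan>") + len(b"</scan>")
    return chunk[:end]


# Tiny stand-in for an mzXML file (flat scans only).
toy = io.BytesIO(
    b"<msRun>"
    b'<scan num="1"><peaks>AAA</peaks></scan>'
    b'<scan num="2"><peaks>BBB</peaks></scan>'
    b'<scan num="3"><peaks>CCC</peaks></scan>'
    b"</msRun>"
)
index = build_offset_index(toy)
```

Once the index exists, read_scan(toy, index, "3") seeks straight to the third scan without parsing the first two, which is why lookup time stops depending on the scan number.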

The behavior you need is indeed defined in get_by_id, but in the version on IndexedXML rather than the one on the XML class. You don't need to call it directly, though: you can just use dict-like syntax on the reader object. You should even be able to get all your spectra at once by requesting a list of IDs.
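A minimal sketch of the dict-like lookup might look like this. The helper names open_indexed, fetch_scans, and demo are hypothetical, and it assumes that mzXML scan IDs are the scan numbers as strings, so integers are converted before the lookup.

```python
def open_indexed(path):
    """Open an mzXML file with indexing requested explicitly."""
    from pyteomics import mzxml  # imported lazily; requires pyteomics to be installed

    return mzxml.MzXML(path, use_index=True)


def fetch_scans(reader, scan_numbers):
    """Look up spectra one by one with dict-like access on the reader."""
    # Assumption: scan IDs are strings such as "15", not integers.
    return [reader[str(number)] for number in scan_numbers]


def demo(path, scan_numbers):
    """Open the file, report the index size, and fetch the requested scans."""
    with open_indexed(path) as reader:
        print(len(reader), "items in the index")
        # Per the answer above, reader[[...list of string IDs...]] should also
        # return all requested spectra in a single call.
        return fetch_scans(reader, scan_numbers)
```

For example, demo("run.mzXML", [15, 20431, 88012]) would fetch three spectra, where the file name and scan numbers are placeholders.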

build_id_cache is an obsolete mechanism and I don't recommend relying on it.

I hope this helps somewhat. If not, I suggest sharing some of the relevant code where you do the lookups.

CCranney Mar 5, 2021
Author

Thank you! You are absolutely right, I was looking at the wrong class. I'm not sure what I did differently, but I tried setting use_index=True and it worked this time. I appreciate it, and thanks for making such a great package!
