
lfoppiano (Member) commented Jul 17, 2025

This PR extends the existing functionality that recognises URLs and provides a clean target URI so that it also covers cases where the URL is not identified by the regex (in the example below, the DATAmic13 token is annotated with a URL):

image

Here is the result:

```xml
<div xmlns="http://www.tei-c.org/ns/1.0">
    <head>Data availability</head> All 43,191  
    <p>genomes recovered in this study, the GOMC database containing 
        <ref type="bibr" target="#b23">24,</ref>195  unique genomes and other supporting data can be interactively accessed at China National GeneBank DataBase (CNGBdb) (
        <ref type="url" target="https://db.cngb.org/maya/datasets/MDB0000002">https://db.cngb.org/maya/datasets/MDB0000002</ref>). The previously available public marine bacterial and archaeal genomes in NCBI have been also collected and backed up in China National Gen-eBank Sequence Archive (CNSA) under the accession DATAmic13. The two marine microbial genome catalogues OMD and OceanDNA were downloaded from OMD (
        <ref type="url" target="https://microbiomics.io/ocean/">https://microbiomics.io/ocean/</ref>) and figshare (OceanDNA, 
        <ref type="url" target="https://doi.org/10.6084/m9.figshare.c.5564844.v1">https://doi.org/10.6084/m9.figshare.c.5564844.  v1</ref>). The Earth's Microbiomes (GEM) catalogue and Tibetan Glacier Genome and Gene (TG2G) catalogue were downloaded from 
        <ref type="url" target="https://genome.jgi.doe.gov/GEM">https://  genome.jgi.doe.gov/GEM</ref> and 
        <ref type="url" target="https://www.biosino.org/node/project/detail/OEP003083">https://www.biosino.org/node/project/  detail/OEP003083,</ref> respectively. The BiG-FAM database can be accessed at 
        <ref type="url" target="https://bigfam.bioinformatics.nl/">https://bigfam.bioinformatics.nl/</ref>. Additional materials generated in this study are available on request.
    </p>
</div>
```
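
In the output above, several `target` attributes differ from the displayed text: spaces introduced by line breaks are gone and a trailing comma is dropped. As a rough illustration of that kind of normalisation, here is a minimal Python sketch. The helper name `clean_target` and the set of stripped characters are assumptions made for this example, not the code in this PR.

```python
import re

# Hypothetical helper for illustration only; not the implementation in this PR.
# Characters treated as sentence punctuation when they end the extracted text.
_TRAILING_PUNCT = ".,;:)"


def clean_target(raw_text: str) -> str:
    """Derive a clean URI from URL text as it was extracted from the PDF."""
    # Remove whitespace that line breaks introduced inside the URL.
    uri = re.sub(r"\s+", "", raw_text)
    # Drop trailing punctuation that belongs to the sentence, not to the URL.
    return uri.rstrip(_TRAILING_PUNCT)


examples = [
    "https://  genome.jgi.doe.gov/GEM",
    "https://www.biosino.org/node/project/  detail/OEP003083,",
    "https://doi.org/10.6084/m9.figshare.c.5564844.  v1",
]
for text in examples:
    print(clean_target(text))
```

Note that stripping a trailing `)` is a simplification: a URL that legitimately ends with a closing parenthesis would be truncated by this sketch.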

We've got a workable version; however, this functionality should be integrated with the current one that uses the regex considering
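
One way to picture combining regex-based detection with annotation-based detection is sketched below in Python. Everything here is an assumption for illustration: the `UrlSpan` type, the simplified `URL_PATTERN`, the `example.org` destination for DATAmic13, and the policy of keeping regex matches and adding only the annotated spans that do not overlap them. The thread does not state which source should take precedence.

```python
import re
from dataclasses import dataclass

# Deliberately simple pattern; the project's actual regex is not shown in this thread.
URL_PATTERN = re.compile(r"https?://\S+")


@dataclass(frozen=True)
class UrlSpan:
    start: int   # character offset where the span begins
    end: int     # character offset just past the span
    target: str  # URI to emit in the target attribute
    source: str  # "regex" or "annotation"


def regex_spans(text):
    """Find URLs that are spelled out in the text."""
    return [
        UrlSpan(m.start(), m.end(), m.group().rstrip(".,;:)"), "regex")
        for m in URL_PATTERN.finditer(text)
    ]


def merge_spans(regex_found, annotated):
    """Keep every regex span; add annotated spans that overlap none of them."""
    merged = list(regex_found)
    for ann in annotated:
        overlaps = any(ann.start < r.end and r.start < ann.end for r in regex_found)
        if not overlaps:
            merged.append(ann)
    return sorted(merged, key=lambda span: span.start)


text = "under the accession DATAmic13. See https://microbiomics.io/ocean/ for OMD."
start = text.index("DATAmic13")
# Made-up destination: the real link behind DATAmic13 is not given in this thread.
annotated = [UrlSpan(start, start + len("DATAmic13"), "https://example.org/DATAmic13", "annotation")]

spans = merge_spans(regex_spans(text), annotated)
for span in spans:
    print(span.source, text[span.start:span.end], span.target)
```

With this policy the annotated token gains a target even though no regex could match it, while URLs already found by the regex are left untouched.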

coveralls commented Jul 18, 2025

Coverage Status

coverage: 40.482% (+0.09%) from 40.394%
when pulling 9f10930 on feature/extract-any-url-in-fulltext
into 01fe109 on master.

lfoppiano added this to the 0.9.0 milestone on Nov 6, 2025
lfoppiano self-assigned this on Nov 7, 2025