Extract URLs from annotations in any part of the fulltext #1315

lfoppiano · 2025-07-17T13:14:07Z

This PR extends the existing functionality for recognising URLs and providing a clean target URI to cover cases where the URL is not identified by the regex (in this example, the DATAmic13 token is annotated with a URL).

Here is the result:

<div
                    xmlns="http://www.tei-c.org/ns/1.0">
                    <head>Data availability</head> All 43,191  
                    <p>genomes recovered in this study, the GOMC database containing 
                        <ref type="bibr" target="#b23">24,</ref>195  unique genomes and other supporting data can be interactively accessed at China National GeneBank DataBase (CNGBdb) (
                        <ref type="url" target="https://db.cngb.org/maya/datasets/MDB0000002">https://db.cngb.org/maya/datasets/MDB0000002</ref>). The previously available public marine bacterial and archaeal genomes in NCBI have been also collected and backed up in China National Gen-eBank Sequence Archive (CNSA) under the accession DATAmic13. The two marine microbial genome catalogues OMD and OceanDNA were downloaded from OMD (
                        <ref type="url" target="https://microbiomics.io/ocean/">https://microbiomics.io/ocean/</ref>) and figshare (OceanDNA, 
                        <ref type="url" target="https://doi.org/10.6084/m9.figshare.c.5564844.v1">https://doi.org/10.6084/m9.figshare.c.5564844.  v1</ref>). The Earth's Microbiomes (GEM) catalogue and Tibetan Glacier Genome and Gene (TG2G) catalogue were downloaded from 
                        <ref type="url" target="https://genome.jgi.doe.gov/GEM">https://  genome.jgi.doe.gov/GEM</ref> and 
                        <ref type="url" target="https://www.biosino.org/node/project/detail/OEP003083">https://www.biosino.org/node/project/  detail/OEP003083,</ref> respectively. The BiG-FAM database can be accessed at 
                        <ref type="url" target="https://bigfam.bioinformatics.nl/">https://bigfam.bioinformatics.nl/</ref>. Additional materials generated in this study are available on request.
                    </p>
                </div>
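
To check output like the fragment above programmatically, the `ref` elements of type `url` can be listed with a standard XML parser. The following is a minimal sketch in Python and is not part of this PR; the `extract_url_refs` helper name and the shortened sample fragment are assumptions made for illustration.

```python
import xml.etree.ElementTree as ET

# TEI namespace, as declared on the <div> in the output above.
TEI_NS = {"tei": "http://www.tei-c.org/ns/1.0"}

# Shortened sample modelled on the output above (illustrative only).
SAMPLE = """<div xmlns="http://www.tei-c.org/ns/1.0">
  <head>Data availability</head>
  <p>downloaded from
    <ref type="url" target="https://genome.jgi.doe.gov/GEM">https://  genome.jgi.doe.gov/GEM</ref> and
    <ref type="bibr" target="#b23">24,</ref>
    <ref type="url" target="https://bigfam.bioinformatics.nl/">https://bigfam.bioinformatics.nl/</ref>.
  </p>
</div>"""


def extract_url_refs(tei_fragment: str) -> list[tuple[str, str]]:
    """Return (target, surface text) pairs for every <ref type="url">."""
    root = ET.fromstring(tei_fragment)
    pairs = []
    for ref in root.iterfind(".//tei:ref[@type='url']", TEI_NS):
        # Collapse runs of whitespace in the surface text; the target
        # attribute is returned unchanged.
        surface = " ".join("".join(ref.itertext()).split())
        pairs.append((ref.get("target"), surface))
    return pairs


if __name__ == "__main__":
    for target, surface in extract_url_refs(SAMPLE):
        print(target, "<-", surface)
```

Comparing each `target` with its surface text makes it easy to spot cases where the visible text contains layout spaces while the target URI is clean.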

We have a workable version; however, this functionality should be integrated with the current one that uses the regex, considering…
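
One way to picture combining annotation-backed URLs with regex-detected ones is a conservative merge over character spans: annotation-backed spans are kept as they are, and a regex match is added only when it does not overlap any of them. The sketch below is in Python and is purely illustrative; it is not the code in this PR. The `UrlSpan` type, the helper names, the simplified regex, and the `https://example.org/DATAmic13` target are all assumptions (the real destination of the DATAmic13 annotation is not given here).

```python
import re
from dataclasses import dataclass

# Simplified pattern for illustration; not the regex used by the project.
URL_RE = re.compile(r"https?://[^\s()]+")


@dataclass(frozen=True)
class UrlSpan:
    start: int   # inclusive character offset
    end: int     # exclusive character offset
    target: str  # clean target URI
    source: str  # "annotation" or "regex"


def span_for(text: str, surface: str, target: str) -> UrlSpan:
    """Build an annotation-backed span for the first occurrence of `surface`."""
    start = text.index(surface)
    return UrlSpan(start, start + len(surface), target, "annotation")


def regex_spans(text: str) -> list[UrlSpan]:
    """Spans for URLs that are visible in the text itself."""
    return [UrlSpan(m.start(), m.end(), m.group(), "regex")
            for m in URL_RE.finditer(text)]


def overlaps(a: UrlSpan, b: UrlSpan) -> bool:
    return a.start < b.end and b.start < a.end


def merge_spans(annotated: list[UrlSpan], from_regex: list[UrlSpan]) -> list[UrlSpan]:
    """Keep all annotation spans; add regex spans only if they overlap none."""
    merged = list(annotated)
    for candidate in from_regex:
        if not any(overlaps(candidate, kept) for kept in annotated):
            merged.append(candidate)
    return sorted(merged, key=lambda s: s.start)


TEXT = ("accession DATAmic13; OMD (https://microbiomics.io/ocean/) "
        "and https://bigfam.bioinformatics.nl/ catalogue")

ANNOTATED = [
    # Placeholder target: the actual annotation URI is an assumption.
    span_for(TEXT, "DATAmic13", "https://example.org/DATAmic13"),
    span_for(TEXT, "https://microbiomics.io/ocean/", "https://microbiomics.io/ocean/"),
]

if __name__ == "__main__":
    for span in merge_spans(ANNOTATED, regex_spans(TEXT)):
        print(span.source, TEXT[span.start:span.end], "->", span.target)
```

In this sketch the token without a visible URL is covered only by its annotation, the URL that has both an annotation and a regex match is reported once, and the URL without an annotation still comes through from the regex.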

coveralls · 2025-07-18T12:47:37Z

coverage: 40.482% (+0.09%) from 40.394%
when pulling 9f10930 on feature/extract-any-url-in-fulltext
into 01fe109 on master.

lfoppiano added 4 commits July 17, 2025 14:14

add method to extract urls without regex, using only PDF annotations

c8a6600

cleanup edges

423c693

cleanup

2aae458

consolidate URL extraction with the previous implementation regex based

aaa371c

lfoppiano added 12 commits July 20, 2025 11:27

remove wrong filter

9632308

avoid running out of tokens

76fa5e0

fix tests

d9bc31d

merge non-annotation backed URLs into the annotated-based URLs

1770319

more conservative merging

3324e12

Merge branch 'master' into feature/extract-any-url-in-fulltext

842b2c5

avoid overlapping of figures and tables unique identifiers in body and annex

1f8acb2

avoid that equations identifiers are duplicated

bd0ba9f

fix the figure and table again id again broken

bb54412

conservative indexing

326b9a4

reindex figures, tables and equations

719b071

add annex figure, tables and equations in the document object

9f10930

lfoppiano added this to the 0.9.0 milestone Nov 6, 2025

lfoppiano self-assigned this Nov 7, 2025

3 participants
