The HTML samples in the WebMainBench 545-sample subset were saved from a browser
with annotation plugins active. The plugins injected non-standard DOM elements and
attributes into the saved HTML, which makes the parser evaluation unfair. Most samples contain <marked-tail> / <marked-text> elements; trafilatura probably strips these tags and, as a result, recalls less text.
Fix Method
import re

# Opening and closing tags of the custom elements injected by the annotation plugin.
_ANNOTATION_TAG_RE = re.compile(
    r'</?(?:marked-tail|marked-text|marked-inline)[^>]*>', re.IGNORECASE
)
# The data-anno-uid attribute the plugin adds to ordinary elements.
_ANNO_ATTR_RE = re.compile(r'\s+data-anno-uid="[^"]*"', re.IGNORECASE)


def clean_html(html: str) -> str:
    """Strip browser annotation plugin artifacts from saved HTML."""
    html = _ANNOTATION_TAG_RE.sub('', html)
    html = _ANNO_ATTR_RE.sub('', html)
    return html
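As a quick sanity check, the snippet below applies the same two patterns to a small annotated fragment. The fragment is hypothetical, modelled on the plugin output described in this write-up rather than taken from a benchmark sample, and the function is repeated so the snippet runs on its own.

```python
import re

# Same patterns as in the fix, repeated so this snippet is self-contained.
_ANNOTATION_TAG_RE = re.compile(
    r'</?(?:marked-tail|marked-text|marked-inline)[^>]*>', re.IGNORECASE
)
_ANNO_ATTR_RE = re.compile(r'\s+data-anno-uid="[^"]*"', re.IGNORECASE)


def clean_html(html: str) -> str:
    """Strip browser annotation plugin artifacts from saved HTML."""
    html = _ANNOTATION_TAG_RE.sub('', html)
    html = _ANNO_ATTR_RE.sub('', html)
    return html


# Hypothetical annotated fragment (not a real benchmark sample).
annotated = (
    '<p data-anno-uid="p1"><marked-text data-anno-uid="abc">Hello, </marked-text>'
    '<b data-anno-uid="b1">world</b>'
    '<marked-tail data-anno-uid="def">!</marked-tail></p>'
)

cleaned = clean_html(annotated)
print(cleaned)  # <p>Hello, <b>world</b>!</p>
```

The wrapper tags are removed but their text is kept, so the extractor sees the same content the original page had. Running the function a second time on already-cleaned HTML leaves it unchanged.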
Evaluation Results
| Metric | trafilatura(fixed) | trafilatura(official) | magic-html(fixed) | magic-html(official) |
| --- | --- | --- | --- | --- |
| text_edit | 0.7795 | 0.6887 | 0.7800 | 0.7791 |
| code_edit | 0.1687 | 0.1305 | 0.4149 | 0.4117 |
Contamination Sources
1. Content Annotation Plugin (<marked-tail> / <marked-text>)
A Chrome-based content annotation extension (similar to ContentCat) wraps every
text node in custom elements to assign unique IDs for selection and labelling:
<!-- Original HTML -->
<pre><code>import numpy as np
arr = np.array([1, 2, 3])
print(arr)</code></pre>
<!-- After annotation plugin injection -->
<pre><code><marked-text data-anno-uid="abc">import numpy as np</marked-text>
<marked-tail data-anno-uid="def">arr = np.array([1, 2, 3])</marked-tail>
<marked-text data-anno-uid="ghi">print(arr)</marked-text></code></pre>
<marked-tail> wraps lxml "tail text" (text after a child element)
<marked-text> wraps lxml "text" (direct text content of an element)
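For readers unfamiliar with the text/tail distinction, here is a minimal sketch. It uses the standard-library `xml.etree.ElementTree` as a stand-in, on the assumption that lxml is not installed; lxml's `etree` exposes the same `.text` and `.tail` attributes. The fragment is made up for illustration.

```python
import xml.etree.ElementTree as ET

# A paragraph with text before and after a child element.
p = ET.fromstring('<p>Hello, <b>world</b>!</p>')
b = p.find('b')

# "text": direct text content at the start of an element.
print(repr(p.text))  # 'Hello, '
print(repr(b.text))  # 'world'

# "tail": text that follows an element's closing tag, stored on that element.
print(repr(b.tail))  # '!'
```

Going by the two definitions above, the plugin would wrap `Hello, ` and `world` in <marked-text> and the trailing `!` in <marked-tail>.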
Prevalence in the 545-sample subset:
| Artifact | Affected Samples |
| --- | --- |
| <marked-tail> tags | 526 / 545 (96.5%) |
| <marked-text> tags | 516 / 545 (94.7%) |
| data-anno-uid attributes | 545 / 545 (100.0%) |
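Counts like these could be reproduced with a per-sample scan along the following lines. This is only a sketch: the `ARTIFACT_PATTERNS` table and the `count_affected` helper are hypothetical names, the three-item corpus is made up, and the script actually used to produce the figures above is not shown in this description.

```python
import re

# Hypothetical detectors, one per artifact in the prevalence table.
ARTIFACT_PATTERNS = {
    "marked-tail": re.compile(r"<marked-tail\b", re.IGNORECASE),
    "marked-text": re.compile(r"<marked-text\b", re.IGNORECASE),
    "data-anno-uid": re.compile(r"\sdata-anno-uid=", re.IGNORECASE),
}


def count_affected(samples):
    """Return {artifact: number of samples that contain it at least once}."""
    counts = {name: 0 for name in ARTIFACT_PATTERNS}
    for html in samples:
        for name, pattern in ARTIFACT_PATTERNS.items():
            if pattern.search(html):
                counts[name] += 1
    return counts


# Tiny made-up corpus, only to show the shape of the output.
samples = [
    '<p data-anno-uid="a"><marked-text data-anno-uid="b">one</marked-text></p>',
    '<p data-anno-uid="c">x<b data-anno-uid="e">y</b>'
    '<marked-tail data-anno-uid="d">two</marked-tail></p>',
    '<p>clean</p>',
]

counts = count_affected(samples)
for name, n in counts.items():
    print(f"{name}: {n} / {len(samples)} ({n / len(samples):.1%})")
```

Each sample is counted at most once per artifact, which matches the "affected samples" framing of the table rather than a count of individual tags.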
2. Immersive Translate Extension
A browser translation extension left residual elements in the saved HTML.
| Artifact | Affected Samples |
| --- | --- |
| notranslate / immersive-translate elements | 167 / 545 (30.6%) |
Approximately 5% of samples contain injected translation text, which
can cause extractors to output duplicated or translated content.