Annotation tag contamination reduced trafilatura's evaluation score #68

The HTML samples in the WebMainBench 545-sample subset were saved from a browser
with **annotation plugins active**. This injected non-standard DOM elements and
attributes into the saved HTML, which affects the fairness of the parser evaluation.
Most of the HTML files contain `<marked-tail>` / `<marked-text>` tags; trafilatura
probably strips these unknown tags and therefore recalls less text.

## Fix Method

```python
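import re

# Opening and closing annotation tags; the text they wrap is kept.
_ANNOTATION_TAG_RE = re.compile(
    r'</?(?:marked-tail|marked-text|marked-inline)[^>]*>', re.IGNORECASE
)
# Annotation IDs attached to ordinary elements (double-quoted values only).
_ANNO_ATTR_RE = re.compile(r'\s+data-anno-uid="[^"]*"', re.IGNORECASE)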


def clean_html(html: str) -> str:
    """Strip browser annotation plugin artifacts from saved HTML."""
    html = _ANNOTATION_TAG_RE.sub('', html)
    html = _ANNO_ATTR_RE.sub('', html)
    return html
```
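As a quick sanity check, the cleaner can be run on a small annotated snippet to confirm that it restores the original markup. This is a minimal sketch: the sample strings and the `class="lead"` paragraph are made up for illustration, and the regexes are copied from the block above so the example runs on its own.

```python
import re

_ANNOTATION_TAG_RE = re.compile(
    r'</?(?:marked-tail|marked-text|marked-inline)[^>]*>', re.IGNORECASE
)
_ANNO_ATTR_RE = re.compile(r'\s+data-anno-uid="[^"]*"', re.IGNORECASE)


def clean_html(html: str) -> str:
    """Strip browser annotation plugin artifacts from saved HTML."""
    html = _ANNOTATION_TAG_RE.sub('', html)
    html = _ANNO_ATTR_RE.sub('', html)
    return html


# Illustrative input: a code block as saved with the annotation plugin active.
annotated = (
    '<pre><code><marked-text data-anno-uid="abc">import numpy as np</marked-text>\n'
    '<marked-tail data-anno-uid="def">arr = np.array([1, 2, 3])</marked-tail>\n'
    '<marked-text data-anno-uid="ghi">print(arr)</marked-text></code></pre>'
)
# The markup we expect back once the wrappers are removed.
original = (
    '<pre><code>import numpy as np\n'
    'arr = np.array([1, 2, 3])\n'
    'print(arr)</code></pre>'
)

cleaned = clean_html(annotated)
print(cleaned == original)

# The attribute pattern also handles IDs on ordinary elements.
print(clean_html('<p data-anno-uid="x1" class="lead">Hello</p>'))
```

Only the wrapper tags are removed, so the wrapped text stays in place. Note that the attribute pattern matches double-quoted values only; single-quoted or unquoted `data-anno-uid` values would need an extended pattern.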

## Evaluation Results

| Metric | trafilatura(fixed) | trafilatura(official) | magic-html(fixed) | magic-html(fixed) |
|---|---|---|---|---|
| text_edit | 0.7795 | 0.6887 | 0.7800 | 0.7791 |
| code_edit | 0.1687 | 0.1305 | 0.4149 | 0.4117 |
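To make the size of the effect on trafilatura explicit, the gains can be computed from the two trafilatura columns. This is a small sketch that assumes "fixed" means the run on cleaned HTML and "official" the run on the unmodified samples; the `improvement` helper is introduced here for illustration only.

```python
# Scores copied from the trafilatura columns of the table above.
scores = {
    "text_edit": {"fixed": 0.7795, "official": 0.6887},
    "code_edit": {"fixed": 0.1687, "official": 0.1305},
}


def improvement(fixed: float, official: float) -> tuple:
    """Return (absolute gain, relative gain in percent) of fixed over official."""
    delta = fixed - official
    return round(delta, 4), round(100 * delta / official, 1)


for metric, s in scores.items():
    abs_gain, rel_gain = improvement(s["fixed"], s["official"])
    print(f"{metric}: +{abs_gain:.4f} absolute, +{rel_gain:.1f}% relative")
```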

## Contamination Sources

### 1. Content Annotation Plugin (`<marked-tail>` / `<marked-text>`)

A Chrome-based content annotation extension (similar to ContentCat) wraps every
text node in custom elements to assign unique IDs for selection and labelling:

```html
<!-- Original HTML -->
<pre><code>import numpy as np
arr = np.array([1, 2, 3])
print(arr)</code></pre>

<!-- Saved with the annotation plugin active -->
<pre><code><marked-text data-anno-uid="abc">import numpy as np</marked-text>
<marked-tail data-anno-uid="def">arr = np.array([1, 2, 3])</marked-tail>
<marked-text data-anno-uid="ghi">print(arr)</marked-text></code></pre>
```

- `<marked-tail>` wraps lxml "tail text" (text after a child element)
- `<marked-text>` wraps lxml "text" (direct text content of an element)

**Prevalence in the 545-sample subset:**

| Artifact | Affected Samples |
|---|---|
| `<marked-tail>` tags | 526 / 545 (96.5%) |
| `<marked-text>` tags | 516 / 545 (94.7%) |
| `data-anno-uid` attributes | 545 / 545 (100.0%) |
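Counts like the ones in this table could be reproduced with a simple scan over the saved HTML strings. The sketch below is not the script used for this issue; `ARTIFACT_PATTERNS`, `count_affected`, and `format_prevalence` are hypothetical names, and loading the 545 samples is left to the caller.

```python
import re
from typing import Dict, Iterable

# One pattern per artifact; a sample counts once no matter how often it matches.
ARTIFACT_PATTERNS = {
    "marked-tail": re.compile(r"<marked-tail\b", re.IGNORECASE),
    "marked-text": re.compile(r"<marked-text\b", re.IGNORECASE),
    "data-anno-uid": re.compile(r"\bdata-anno-uid\s*=", re.IGNORECASE),
}


def count_affected(samples: Iterable[str]) -> Dict[str, int]:
    """Count how many samples contain each artifact at least once."""
    counts = {name: 0 for name in ARTIFACT_PATTERNS}
    for html in samples:
        for name, pattern in ARTIFACT_PATTERNS.items():
            if pattern.search(html):
                counts[name] += 1
    return counts


def format_prevalence(affected: int, total: int) -> str:
    """Format a count in the same style as the table, e.g. '526 / 545 (96.5%)'."""
    return f"{affected} / {total} ({affected / total:.1%})"


# Tiny illustrative input; real use would pass the saved HTML of every sample.
demo = [
    '<p data-anno-uid="1"><marked-text data-anno-uid="2">a</marked-text></p>',
    '<p data-anno-uid="3">b<br><marked-tail data-anno-uid="4">c</marked-tail></p>',
    '<p>clean</p>',
]
for name, n in count_affected(demo).items():
    print(name, format_prevalence(n, len(demo)))
```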

### 2. Immersive Translate Extension

A browser translation extension left residual elements in the saved HTML.

| Artifact | Affected Samples |
|---|---|
| `notranslate` / immersive-translate elements | 167 / 545 (30.6%) |

Approximately 5% of samples actually contain injected translation text, which
can cause extractors to output duplicated or translated content.
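One way to flag candidate samples is to look for class tokens left by the translation extension. The sketch below uses only the standard-library HTML parser; treating any class token that equals `notranslate` or starts with `immersive-translate` as residue is an assumption made for illustration, and the sample markup is invented.

```python
from html.parser import HTMLParser


class TranslateResidueFinder(HTMLParser):
    """Collect start tags whose class attribute looks like translation residue."""

    def __init__(self):
        super().__init__()
        self.hits = []  # list of (tag, class attribute value)

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            if name != "class" or not value:
                continue
            tokens = value.split()
            # Assumed markers: the `notranslate` token or an `immersive-translate*` token.
            if any(t == "notranslate" or t.startswith("immersive-translate") for t in tokens):
                self.hits.append((tag, value))


def has_translate_residue(html: str) -> bool:
    """Return True if any element carries a suspected translation-extension class."""
    finder = TranslateResidueFinder()
    finder.feed(html)
    finder.close()
    return bool(finder.hits)


# Invented examples: one with suspected residue, one without.
print(has_translate_residue(
    '<p>Hello <font class="notranslate immersive-translate-target-wrapper">Bonjour</font></p>'
))
print(has_translate_residue('<p class="lead">Hello</p>'))
```

A `notranslate` class can also be set by the site author to opt out of machine translation, so a hit only marks a sample for review. Telling empty residue apart from injected translation text would require inspecting the flagged elements' content.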
