These are PHP bindings for HTML Inspector.
<?php
function extract_anchors(string $html_utf8, string $document_uri): void
{
    $doc = new HtmlInspector\HtmlDocument($html_utf8);

    // Look for a <base> element under html > head.
    $base_node = $doc->select(0)->child()->name('html')->child()->name('head')->child()
        ->name('base')->iterate();

    // Resolve its href against the document URI; fall back to the document URI itself.
    $base = HtmlInspector\resolve_iri($doc->get_attribute($base_node, 'href'), $document_uri);
    $base ??= $document_uri;

    // Select all <a> elements whose href does not start with '#'.
    $selector = $doc->select(0)->descendant()->name('a')->attribute_starts_with('href', '#')->not();

    // iterate() returns -1 once no nodes are left.
    while (($node_a = $selector->iterate()) !== -1) {
        $href = $doc->get_attribute($node_a, 'href');
        $uri = HtmlInspector\resolve_iri($href, $base);
        print("$uri\n");
    }
}
I went back and forth on whether to implement PHP iterators for looping through nodes. PHP's
iterator design is awkward. First, two redundant implementations are needed: one to support
looping with foreach and one to implement the Iterator interface. Second, the interface splits
iteration into two methods, next (which returns nothing) and current, instead of one. As a
result, we have to cache both the current value and the validity state of the iterator, and
current conditionally has to perform one implicit iteration. Python handles this more elegantly
with a single __next__ method that advances the iterator and returns the current value. Another
complication is how to encode the absence of a node. With PHP iterators, we would need to use the
value false, and the get_* methods would need union type hints and a corresponding check to keep
the syntax concise. Without iterators, we can use the value -1 and pass it to the C functions
without further checks.
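
To make the trade-off concrete, here is a minimal userland sketch, assuming PHP 8.0 or later. FakeSelector and NodeIterator are hypothetical names invented for this illustration and are not part of these bindings; FakeSelector only imitates the iterate() convention of returning -1 when no nodes are left. The adapter shows the caching that the Iterator interface forces on top of a single-step iterate() call.

```php
<?php
declare(strict_types=1);

// Hypothetical stand-in for a selector: iterate() returns the next node id,
// or -1 once the nodes are exhausted.
final class FakeSelector
{
    private int $pos = 0;

    /** @param int[] $nodes */
    public function __construct(private array $nodes) {}

    public function iterate(): int
    {
        return $this->nodes[$this->pos++] ?? -1;
    }
}

// Hypothetical adapter to PHP's Iterator interface. Because next() returns
// nothing and current() may be called first, the node has to be cached and
// fetched lazily.
final class NodeIterator implements Iterator
{
    private ?int $current = null; // null means "not fetched yet"
    private int $key = -1;

    public function __construct(private FakeSelector $selector) {}

    // Performs the implicit iteration step if no node is cached.
    private function fetch(): void
    {
        if ($this->current === null) {
            $this->current = $this->selector->iterate();
            $this->key++;
        }
    }

    public function current(): mixed
    {
        $this->fetch();
        // The absence of a node has to be encoded as false here.
        return $this->current === -1 ? false : $this->current;
    }

    public function key(): mixed
    {
        $this->fetch();
        return $this->key;
    }

    public function next(): void
    {
        $this->fetch();
        $this->current = null; // drop the cached node so the next call fetches
    }

    public function valid(): bool
    {
        $this->fetch();
        return $this->current !== -1;
    }

    public function rewind(): void
    {
        // This sketch is forward-only, so rewind() does nothing.
    }
}

// Sentinel style: one call per step, and -1 ends the loop.
$selector = new FakeSelector([3, 7, 12]);
$plain = [];
while (($node = $selector->iterate()) !== -1) {
    $plain[] = $node;
}

// Iterator style: the same nodes, but through the five-method adapter.
$viaIterator = [];
foreach (new NodeIterator(new FakeSelector([3, 7, 12])) as $node) {
    $viaIterator[] = $node;
}
// Both arrays now hold [3, 7, 12].
```

Both loops visit the same nodes. The sentinel loop needs only iterate(), while the adapter needs a cached value, a lazy fetch, and a separate encoding for "no node".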