Summary
When requesting an HTTP resource using the DOM or SimpleXML extensions, the wrong content-type header is used to determine the charset if the requested resource performs a redirect.
Details
When the HTTP stream wrapper follows a redirect, it does not clear the list of captured headers before performing the subsequent requests. As a result, the returned array of response headers contains the headers of every request in the redirect chain, stored one after another. The headers of the final request come last in this array.
The php_libxml_input_buffer_create_filename() / php_libxml_sniff_charset_from_stream() functions scan the header array from top to bottom and return after finding the first content-type header. This content-type header does not necessarily belong to the response whose HTML body is being parsed.
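The difference between the two scanning strategies can be sketched in plain PHP, without any network access. This is only an illustration under stated assumptions: the header list is made up to resemble what the PoC would produce, and firstContentType() and finalContentType() are hypothetical helpers, not the actual C implementation or a proposed patch.

```php
<?php
// Hypothetical header list: the redirect response comes first,
// the final response comes last (values are illustrative only).
$headers = [
    'HTTP/1.1 302 Found',
    'Content-Type: text/html;charset=utf-16',
    'Location: http://example.com',
    'HTTP/1.1 200 OK',
    'Content-Type: text/html; charset=UTF-8',
];

// Mimics the scan described above: stop at the first content-type header.
function firstContentType(array $headers): ?string
{
    foreach ($headers as $header) {
        if (stripos($header, 'content-type:') === 0) {
            return trim(substr($header, strlen('content-type:')));
        }
    }
    return null;
}

// Alternative: only keep a content-type seen after the last status line.
function finalContentType(array $headers): ?string
{
    $result = null;
    foreach ($headers as $header) {
        if (strncmp($header, 'HTTP/', 5) === 0) {
            $result = null; // a new response starts; forget earlier headers
        } elseif (stripos($header, 'content-type:') === 0) {
            $result = trim(substr($header, strlen('content-type:')));
        }
    }
    return $result;
}

echo firstContentType($headers), PHP_EOL;
echo finalContentType($headers), PHP_EOL;
```

With this made-up list, the first helper picks the utf-16 value sent by the redirect response, while the second picks the value that belongs to the final response.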
PoC
redirect.php
<?php
header('content-type: text/html;charset=utf-16');
header('location: http://example.com');
Run php -S localhost:8080 in the directory containing redirect.php, and then execute:
<?php
// Or using DOMDocument / SimpleXML
$document = \Dom\HTMLDocument::createFromFile("http://localhost:8080/redirect.php");
if (\str_contains($document->querySelector('body')->textContent, 'Example')) {
    throw new Exception('Refusing to store example content');
}
var_dump(\str_contains($document->saveHtml(), 'Example')); // bool(true)
Impact
This allows an attacker to cause a document to be parsed with the wrong charset, changing its meaning and possibly bypassing validation. When such a document is exported with ->saveHtml(), it is returned with the original charset.
Users who request documents via HTTP using the DOM or SimpleXML extensions are impacted.