Merged
24 changes: 17 additions & 7 deletions README.md
@@ -89,15 +89,25 @@ The following namespaces are pre-registered for XPath queries:

### ESI Tag Support

The library handles Edge Side Includes (ESI) tags, converting empty ESI tags to self-closing format:
The library preserves Edge Side Includes (ESI) tags verbatim during HTML5 processing. ESI tags present multiple challenges:

```php
// Input
'<esi:include src="url"></esi:include>'
1. **Self-closing syntax**: Tags like `<esi:include src="..." />` don't exist in HTML5
2. **Arbitrary interleaving**: ESI tags can span across HTML element boundaries
3. **Attribute encoding**: Characters like `&` must not become `&amp;`

The [ESI Language Specification 1.0](https://www.w3.org/TR/esi-lang/) describes ESI as "XML-based" (Section 1), but also states that documents containing ESI markup are not valid. From Section 1.1:

> the markup that is emitted by the origin server is not valid; it contains interposed elements from the ESI namespace

ESI elements can be arbitrarily interleaved with the underlying content, which does not even need to be HTML. The standard makes no statements about whether HTML entities must be applied. Since XML parsing is not feasible for such documents, assuming XML encoding rules is not warranted.

This library wraps every ESI tag (opening, closing, or self-closing) in an HTML comment using the ESI comment syntax defined in Section 3.7 of the ESI specification (`<!--esi ... -->`). This hides the tags from the HTML5 parser while preserving them verbatim.

> [!IMPORTANT]
> During processing, ESI tags appear as Comment nodes in the DOM. If RewriteHandler
> transformations move or delete these comment nodes, the final result may not
> match expectations.

// Output
'<esi:include src="url" />'
```
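
The wrap-and-restore approach described in this section can be sketched in a few lines of plain PHP. This is a simplified illustration rather than the library's public API: the function names `wrapEsiTags()` and `unwrapEsiTags()` and the constant `ESI_COMMENT_PREFIX` are hypothetical, and the pattern assumes that attribute values contain no `>`.

```php
<?php

declare(strict_types=1);

// Hypothetical marker: the text after "<!--" only has to be unique enough
// to be found again after the HTML5 parser has done its work.
const ESI_COMMENT_PREFIX = 'esi html5-tagrewriter ';

/** Hides every ESI tag (opening, closing, self-closing) inside an HTML comment. */
function wrapEsiTags(string $html): string
{
    return preg_replace_callback(
        '#</?esi:[a-z]+[^>]*>#',
        static fn (array $m): string => '<!--'.ESI_COMMENT_PREFIX.$m[0].'-->',
        $html
    ) ?? $html;
}

/** Strips the comment wrapper again, leaving the original tag untouched. */
function unwrapEsiTags(string $html): string
{
    return preg_replace(
        '#<!--'.preg_quote(ESI_COMMENT_PREFIX, '#').'(.+?)-->#s',
        '$1',
        $html
    ) ?? $html;
}

// ESI tags that cross element boundaries, plus a raw "&" in an attribute value.
$input = '<p>Start <esi:remove>a</p><p>b</esi:remove> <esi:include src="/f?x=1&y=2" /></p>';
$wrapped = wrapEsiTags($input);

echo $wrapped, "\n";
echo unwrapEsiTags($wrapped) === $input ? "roundtrip ok\n" : "roundtrip failed\n";
```

In the real pipeline the HTML5 parser and the registered handlers run between the two steps. The sketch only shows that the tags themselves, including the unescaped `&`, come back byte for byte.
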
## Credits, Copyright and License

This library is based on internal work that we have been using at webfactory GmbH, Bonn, at least
78 changes: 78 additions & 0 deletions src/Implementation/EsiTagProcessor.php
@@ -0,0 +1,78 @@
<?php

declare(strict_types=1);

namespace Webfactory\Html5TagRewriter\Implementation;

/**
* Preserves ESI (Edge Side Includes) tags verbatim during HTML5 parsing and serialization.
*
* ESI tags present multiple challenges for HTML5 parsing:
*
* 1. Self-closing syntax: ESI tags like <esi:include src="..." /> use self-closing syntax,
* which does not exist in HTML5. The parser treats them as opening tags without closing
* tags, causing all following content to be incorrectly nested inside the ESI element.
*
* 2. Arbitrary interleaving: ESI tags can span across HTML element boundaries in ways that
* violate well-formedness rules. For example, an opening ESI tag might appear in one
* HTML element while its closing tag appears in another. HTML5 parsers would "repair"
* such structures, breaking the intended ESI behavior.
*
* 3. Attribute preservation: ESI tags must not be modified because they may be processed
* on a text basis by an intermediary (e.g., a caching proxy or CDN) that does not
* apply HTML rules. Any transformation - such as encoding & as &amp; in attribute
* values - would break the ESI processor's ability to parse the tag correctly.
*
* This class solves these problems by wrapping every ESI tag (opening, closing, or
* self-closing) in an HTML comment during pre-processing, using the ESI comment syntax
* defined in Section 3.7 of the ESI Language Specification. The original tags are restored
* verbatim during post-processing.
*
* Important: During processing, ESI tags do not appear as Elements in the DOM, but as
* Comment nodes. If RewriteHandler transformations move or delete these comment nodes,
* the final result may not match expectations.
*/
final class EsiTagProcessor
{
private const COMMENT_PREFIX = 'esi html5-tagrewriter ';

/**
* Wraps all ESI tags in HTML comments.
*
* Each ESI tag (opening, closing, or self-closing) is wrapped as
* <!--esi html5-tagrewriter <original-tag> --> to hide it from the HTML5 parser
* while preserving the original content verbatim.
*/
public function preProcess(string $html): string
{
// Match opening tags: <esi:name ...>
// Match closing tags: </esi:name>
// Match self-closing tags: <esi:name ... />
// Note: The [^>]*? pattern does not correctly handle ">" inside quoted attribute
// values (e.g., <esi:include src="a>b" />). This is a known limitation that we
// ignore for now, as such attribute values are uncommon in practice.
return preg_replace_callback(
'#<(/?)esi:([a-z]+)([^>]*?)(/?)>#',
function (array $matches): string {
return '<!--'.self::COMMENT_PREFIX.$matches[0].'-->';
},
$html
) ?? $html;
}

/**
* Restores original ESI tags from HTML comments.
*/
public function postProcess(string $html): string
{
$prefix = preg_quote(self::COMMENT_PREFIX, '#');

return preg_replace_callback(
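// The "s" modifier lets "." match line breaks, so ESI tags whose attributes
// span several lines are restored as well.
'#<!--'.$prefix.'(.+?)-->#s',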
function (array $matches): string {
return $matches[1];
},
$html
) ?? $html;
}
}
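
The known limitation noted in `preProcess()` (a `>` inside a quoted attribute value) can be reproduced in isolation. The sketch below compares the pattern from the class with a quote-aware variant. The second pattern is a hypothetical alternative and not part of this change. It assumes that quotes inside a tag are always balanced.

```php
<?php

declare(strict_types=1);

$tag = '<esi:include src="a>b" />';

// Pattern as used in preProcess(): the match ends at the first ">".
$simple = '#<(/?)esi:([a-z]+)([^>]*?)(/?)>#';

// Hypothetical alternative: consume each quoted attribute value as one unit,
// so a ">" inside quotes no longer terminates the match.
$quoteAware = '#</?esi:[a-z]+(?:"[^"]*"|\'[^\']*\'|[^>"\'])*>#';

preg_match($simple, $tag, $simpleMatch);
echo $simpleMatch[0], "\n"; // <esi:include src="a>

preg_match($quoteAware, $tag, $quoteAwareMatch);
echo $quoteAwareMatch[0], "\n"; // <esi:include src="a>b" />
```

With the simple pattern, only the part up to `src="a>` would be wrapped in the comment, and the remainder `b" />` would reach the HTML5 parser as ordinary text.
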
22 changes: 8 additions & 14 deletions src/Implementation/Html5TagRewriter.php
@@ -5,7 +5,6 @@
namespace Webfactory\Html5TagRewriter\Implementation;

use Dom\Document;
use Dom\Element;
use Dom\HTMLDocument;
use Dom\Node;
use Dom\XPath;
@@ -27,11 +26,13 @@ public function register(RewriteHandler $handler): void
#[Override]
public function process(string $html5): string
{
$document = HTMLDocument::createFromString($this->convertEsiSelfClosingTagsToEmptyElements($html5), LIBXML_NOERROR);
$esiProcessor = new EsiTagProcessor();

$document = HTMLDocument::createFromString($esiProcessor->preProcess($html5), LIBXML_NOERROR);

$this->applyHandlers($document, $document);

return $this->convertEsiEmptyElementsToSelfClosingTags($document->saveHtml());
return $esiProcessor->postProcess($document->saveHtml());
}

#[Override]
@@ -47,15 +48,17 @@ public function processBodyFragment(string $html5Fragment): string
* handling of fragments to such inputs that can equally be considered to be
* placed directly after the `<body>` tag.
*/
$esiProcessor = new EsiTagProcessor();

$document = HTMLDocument::createFromString('', overrideEncoding: 'utf-8');
$container = $document->body;
assert($container !== null);

$container->innerHTML = $this->convertEsiSelfClosingTagsToEmptyElements($html5Fragment);
$container->innerHTML = $esiProcessor->preProcess($html5Fragment);

$this->applyHandlers($document, $container);

return $this->convertEsiEmptyElementsToSelfClosingTags($container->innerHTML);
return $esiProcessor->postProcess($container->innerHTML);
}

private function applyHandlers(Document $document, Node $context): void
@@ -75,13 +78,4 @@ private function applyHandlers(Document $document, Node $context): void
}
}

private function convertEsiSelfClosingTagsToEmptyElements(string $html): string
{
return preg_replace('#(<esi:([a-z]+)(?:[^>]*))/>#i', '$1></esi:\\2>', $html) ?? $html;
}

private function convertEsiEmptyElementsToSelfClosingTags(string $html): string
{
return preg_replace('#(<esi:([a-z]+)(?:[^>]*))></esi:\\2>#i', '$1 />', $html) ?? $html;
}
}
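
Because wrapped ESI tags reach the handlers as comment nodes, code that walks the DOM may want to recognize them so it can leave them in place. The following is a minimal sketch of that check, assuming PHP 8.4 or later (the `Dom\` namespace). It works on an already pre-processed fragment and is not part of the library. The variable names are illustrative only.

```php
<?php

declare(strict_types=1);

use Dom\Comment;
use Dom\HTMLDocument;

// Same marker that EsiTagProcessor::preProcess() puts in front of each tag.
$prefix = 'esi html5-tagrewriter ';

// A fragment as it looks after pre-processing: the ESI tag is inside a comment.
$fragment = '<p>Hi</p><!--'.$prefix.'<esi:include src="/f?x=1&y=2" />-->';

$document = HTMLDocument::createFromString('', LIBXML_NOERROR, 'utf-8');
$body = $document->body;
assert($body !== null);
$body->innerHTML = $fragment;

// Collect the original tags from all comment nodes that carry the marker.
$wrappedTags = [];
foreach ($body->childNodes as $node) {
    if ($node instanceof Comment && str_starts_with($node->data, $prefix)) {
        $wrappedTags[] = substr($node->data, strlen($prefix));
    }
}

var_dump($wrappedTags);
```

A handler that moves or deletes such a comment node changes where, or whether, the ESI tag appears in the final output.
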
91 changes: 91 additions & 0 deletions tests/Implementation/EsiTagProcessorTest.php
@@ -0,0 +1,91 @@
<?php

declare(strict_types=1);

namespace Webfactory\Html5TagRewriter\Tests\Implementation;

use PHPUnit\Framework\Attributes\DataProvider;
use PHPUnit\Framework\Attributes\Test;
use PHPUnit\Framework\TestCase;
use Webfactory\Html5TagRewriter\Implementation\EsiTagProcessor;

final class EsiTagProcessorTest extends TestCase
{
private EsiTagProcessor $processor;

protected function setUp(): void
{
$this->processor = new EsiTagProcessor();
}

#[Test]
#[DataProvider('providePreProcessCases')]
public function preProcess(string $input, string $expected): void
{
$result = $this->processor->preProcess($input);

self::assertSame($expected, $result);
}

public static function providePreProcessCases(): iterable
{
yield 'wraps self-closing tag in comment' => [
'<esi:include src="url" />',
'<!--esi html5-tagrewriter <esi:include src="url" />-->',
];

yield 'wraps opening tag in comment' => [
'<esi:remove>',
'<!--esi html5-tagrewriter <esi:remove>-->',
];

yield 'wraps closing tag in comment' => [
'</esi:remove>',
'<!--esi html5-tagrewriter </esi:remove>-->',
];

yield 'wraps opening and closing tags in separate comments' => [
'<esi:remove>content</esi:remove>',
'<!--esi html5-tagrewriter <esi:remove>-->content<!--esi html5-tagrewriter </esi:remove>-->',
];

yield 'handles multiple tags' => [
'<esi:include src="a" /><esi:include src="b" />',
'<!--esi html5-tagrewriter <esi:include src="a" />--><!--esi html5-tagrewriter <esi:include src="b" />-->',
];

yield 'preserves non-esi content' => [
'<div><p>Hello</p><esi:include src="url" /><span>World</span></div>',
'<div><p>Hello</p><!--esi html5-tagrewriter <esi:include src="url" />--><span>World</span></div>',
];

yield 'handles esi tags spanning html element boundaries' => [
'<p>Start <esi:remove>content</p><p>more</esi:remove> end</p>',
'<p>Start <!--esi html5-tagrewriter <esi:remove>-->content</p><p>more<!--esi html5-tagrewriter </esi:remove>--> end</p>',
];
}

#[Test]
#[DataProvider('provideRoundtripCases')]
public function roundtrip(string $html): void
{
$preProcessed = $this->processor->preProcess($html);
$result = $this->processor->postProcess($preProcessed);

self::assertSame($html, $result);
}

public static function provideRoundtripCases(): iterable
{
yield 'self-closing tag without attributes' => ['<esi:include />'];
yield 'self-closing tag with attribute' => ['<esi:include src="url" />'];
yield 'self-closing tag with multiple attributes' => ['<esi:include src="url" alt="fallback" onerror="continue" />'];
yield 'self-closing tag with ampersand in query string' => ['<esi:include src="url?foo=bar&bar=baz" />'];
yield 'multiple self-closing tags' => ['<esi:include src="a" /><esi:include src="b" />'];
yield 'opening and closing tags' => ['<esi:remove>content</esi:remove>'];
yield 'nested esi structure' => ['<esi:try><esi:attempt><esi:include src="url" /></esi:attempt><esi:except><esi:include src="fallback" /></esi:except></esi:try>'];
yield 'esi tags spanning html boundaries' => ['<p>Start <esi:remove>content</p><p>more</esi:remove> end</p>'];
yield 'esi wrapping partial html' => ['<p><esi:remove><b>Important:</esi:remove>text<esi:remove></b></esi:remove></p>'];
yield 'mixed esi and html content' => ['<div><esi:include src="header" /><p>Content</p><esi:include src="footer" /></div>'];
}
}
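
One case the roundtrip provider does not exercise is an ESI tag whose attributes are spread over several lines. Whether such a tag is restored depends on the PCRE `s` (dot-all) modifier of the restoring pattern. The standalone sketch below compares both variants. It rebuilds the pattern by hand instead of calling `EsiTagProcessor`.

```php
<?php

declare(strict_types=1);

$prefix = preg_quote('esi html5-tagrewriter ', '#');

// A wrapped ESI tag with a line break between tag name and attribute.
$wrapped = "<!--esi html5-tagrewriter <esi:include\n    src=\"url\" />-->";

// Without "s", "." does not match "\n": the pattern finds nothing and the
// comment wrapper stays in the output.
$withoutDotAll = preg_replace('#<!--'.$prefix.'(.+?)-->#', '$1', $wrapped);

// With "s", "." also matches "\n" and the tag is restored verbatim.
$withDotAll = preg_replace('#<!--'.$prefix.'(.+?)-->#s', '$1', $wrapped);

var_dump($withoutDotAll === $wrapped);
var_dump($withDotAll === "<esi:include\n    src=\"url\" />");
```

Both `var_dump()` calls print `bool(true)`, so a test case of this shape would be a useful addition to `provideRoundtripCases()`.
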
15 changes: 5 additions & 10 deletions tests/Implementation/Html5TagRewriterTest.php
@@ -184,16 +184,16 @@ public static function providePreservedFragments(): iterable
"<pre> Line 1\n Line 2\n Line 3</pre>",
];

yield 'ESI tag' => [
'<esi:include src="url?foo=bar&amp;bar=baz" />',
yield 'ESI tag kept literally, since it may be processed as raw text, not under HTML rules' => [
'<esi:include src="url?foo=bar&bar=baz" />',
];

yield 'ESI tags in context' => [
yield 'multiple ESI tags with context' => [
'<div><p>test</p>
<hr>
<esi:include src="/_fragment?_hash=123&amp;foo=bar" />
<esi:include src="/_fragment?_hash=123&foo=bar" />
<hr>
<esi:include src="/_fragment?_hash=456" />
<esi:include src="/_fragment?_hash=456&bar=baz" />
</div>'
];
}
@@ -209,11 +209,6 @@ public function processBodyFragment_preserves_fragment(string $fragment): void

public static function provideFragmentsCleanedUp(): iterable
{
yield 'empty ESI include tag' => [
'<esi:include src="url?foo=bar&amp;bar=baz"/>',
'<esi:include src="url?foo=bar&amp;bar=baz" />',
];

yield 'quoted entities are replaced' => [
'<p>&lt;script&gt; &amp; &quot;quotes&quot;</p>',
'<p>&lt;script&gt; &amp; "quotes"</p>',