The lexer is responsible for providing tokens to the parser. Typical use of the library does not require direct
interaction with the lexer, as an appropriate lexer is created by PhpParser\ParserFactory
. The tokens produced
by the lexer can then be retrieved using PhpParser\Parser::getTokens()
.
While this library implements a custom parser, it relies on PHP's ext/tokenizer
extension to perform lexing. However,
this extension only supports lexing code for the PHP version you are running on, while this library also wants to support
parsing newer code. For that reason, the lexer performs additional "emulation" in three layers:
First, PhpParser uses the PhpToken
based representation introduced in PHP 8.0, rather than the array-based tokens from
previous versions. The PhpParser\Token
class either extends PhpToken
(on PHP 8.0) or a polyfill implementation. The
polyfill implementation will also perform two emulations that are required by the parser and cannot be disabled:
- Single-line comments use the PHP 8.0 representation that does not include a trailing newline. The newline will be
part of a following
T_WHITESPACE
token. - Namespaced names use the PHP 8.0 representation using
T_NAME_FULLY_QUALIFIED
,T_NAME_QUALIFIED
andT_NAME_RELATIVE
tokens, rather than the previous representation using a sequence ofT_STRING
andT_NS_SEPARATOR
. This means that certain code that is legal on older versions (namespaced names including whitespace, such asA \ B
) will not be accepted by the parser.
Second, the PhpParser\Lexer
base class will convert &
tokens into the PHP 8.1 representation of either
T_AMPERSAND_FOLLOWED_BY_VAR_OR_VARARG
or T_AMPERSAND_NOT_FOLLOWED_BY_VAR_OR_VARARG
. This is required by the parser
and cannot be disabled.
Finally, PhpParser\Lexer\Emulative
performs other, optional emulations. This lexer is parameterized by PhpVersion
and will try to emulate ext/tokenizer
output for that version. This is done using separate TokenEmulator
s for each
emulated feature.
Emulation is usually used to support newer PHP versions, but there is also very limited support for reverse emulation to older PHP versions, which can make keywords from newer versions non-reserved.
The Lexer::tokenize()
method returns an array of PhpParser\Token
s. The most important parts of the interface can be
summarized as follows:
class Token {
/** @var int Token ID, either T_* or ord($char) for single-character tokens. */
public int $id;
/** @var string The textual content of the token. */
public string $text;
/** @var int The 1-based starting line of the token (or -1 if unknown). */
public int $line;
/** @var int The 0-based starting position of the token (or -1 if unknown). */
public int $pos;
/** @param int|string|(int|string)[] $kind Token ID or text (or array of them) */
public function is($kind): bool;
}
Unlike PHP's own PhpToken::tokenize()
output, the token array is terminated by a sentinel token with ID 0.
The lexer is normally invoked implicitly by the parser. In that case, the tokens for the last parse can be retrieved
using Parser::getTokens()
.
Nodes in the AST produced by the parser always corresponds to some range of tokens. The parser adds a number of positioning attributes to allow mapping nodes back to lines, tokens or file offsets:
startLine
: Line in which the node starts. Used by$node->getStartLine()
.endLine
: Line in which the node ends. Used by$node->getEndLine()
.startTokenPos
: Offset into the token array of the first token in the node. Used by$node->getStartTokenPos()
.endTokenPos
: Offset into the token array of the last token in the node. Used by$node->getEndTokenPos()
.startFilePos
: Offset into the code string of the first character that is part of the node. Used by$node->getStartFilePos()
.endFilePos
: Offset into the code string of the last character that is part of the node. Used by$node->getEndFilePos()
.
Note that start
/end
here are closed rather than half-open ranges. This means that a node consisting of a single
token will have startTokenPos == endTokenPos
rather than startTokenPos + 1 == endTokenPos
. This also means that a
zero-length node will have startTokenPos -1 == endTokenPos
.
Note: The example in this section is outdated in that this information is directly available in the AST: While
$property->isPublic()
does not distinguish betweenpublic
andvar
, directly checking$property->flags
for the$property->flags & Class_::VISIBILITY_MODIFIER_MASK) === 0
allows making this distinction without resorting to tokens. However, the general idea behind the example still applies in other cases.
The token offset information is useful if you wish to examine the exact formatting used for a node. For example the AST
does not distinguish whether a property was declared using public
or using var
, but you can retrieve this
information based on the token position:
/** @param PhpParser\Token[] $tokens */
function isDeclaredUsingVar(array $tokens, PhpParser\Node\Stmt\Property $prop) {
$i = $prop->getStartTokenPos();
return $tokens[$i]->id === T_VAR;
}
In order to make use of this function, you will have to provide the tokens from the lexer to your node visitor using code similar to the following:
class MyNodeVisitor extends PhpParser\NodeVisitorAbstract {
private $tokens;
public function setTokens(array $tokens) {
$this->tokens = $tokens;
}
public function leaveNode(PhpParser\Node $node) {
if ($node instanceof PhpParser\Node\Stmt\Property) {
var_dump(isDeclaredUsingVar($this->tokens, $node));
}
}
}
$parser = (new PhpParser\ParserFactory())->createForHostVersion($lexerOptions);
$visitor = new MyNodeVisitor();
$traverser = new PhpParser\NodeTraverser($visitor);
try {
$stmts = $parser->parse($code);
$visitor->setTokens($parser->getTokens());
$stmts = $traverser->traverse($stmts);
} catch (PhpParser\Error $e) {
echo 'Parse Error: ', $e->getMessage();
}