PHP, like many other languages used for web applications, is an interpreted language. When running an application written in PHP, we usually don’t think about what really happens to its code during execution. In this article you will learn how the finished code is processed by the PHP interpreter.
Compilation and interpretation
Compiled languages such as C and C++ differ from interpreted languages in that their source code is translated into machine code only once. After compilation, you can run the application many times without compiling it again, so there is no additional processing overhead on subsequent runs. The trade-off is a slower development cycle, because every change requires recompilation. On the other side are interpreted languages such as PHP, Python and Ruby. They are less efficient because their code is processed by a separate application (an interpreter) that translates the application code “on the fly”. This strategy means lower performance and longer execution time, but it allows for greater flexibility and ease of software development. So, let us take a closer look at how the PHP interpreter works.
Zend Engine is the heart of the PHP language. It consists of a compiler that turns source code into bytecode and a virtual machine that executes this bytecode. It ships with PHP: when you install PHP, you install Zend Engine at the same time. It is responsible for the whole code processing, from the moment your HTTP server hands it a PHP script to execute until the HTML code is generated and returned to the server. To put it simply, the interpreter processes a PHP script in four stages:
- lexical analysis (lexing),
- syntax analysis (parsing),
- compilation,
- execution.
With the OPcache mechanism, the compiled code is cached, so on subsequent requests the stages before execution can be skipped and processing goes straight to the last step: executing the application on the virtual machine. PHP 8 goes one step further with its JIT compiler, which can compile PHP code into machine code. As a result, that code can run directly on the processor, bypassing interpretation by the virtual machine.
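If you want to check whether OPcache and the JIT are active in your own environment, a small script can help. The sketch below is only an illustration: the helper name describeOpcache() is made up for this article, and it assumes the array returned by opcache_get_status() contains the opcache_enabled key and, on PHP 8, a jit entry.

```php
<?php
// Hypothetical helper: summarises whether OPcache and the JIT are active.
function describeOpcache(): string
{
    // The function only exists when the OPcache extension is loaded.
    if (!function_exists('opcache_get_status')) {
        return 'OPcache extension is not loaded';
    }

    // false = do not include per-script statistics in the result.
    $status = @opcache_get_status(false);
    if (!is_array($status) || empty($status['opcache_enabled'])) {
        return 'OPcache is loaded but not enabled here';
    }

    // Assumption: the "jit" entry is only reported by PHP 8 and later.
    $jit = !empty($status['jit']['enabled']) ? 'on' : 'off';

    return "OPcache enabled, JIT {$jit}";
}

echo describeOpcache(), PHP_EOL;
```

Keep in mind that the result depends on how PHP is started: the command-line interpreter and the web server often use different settings.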
In the past there was another curious option: transpiling the code, e.g. into C++. This approach was used in HipHop for PHP, created by Facebook's programmers, which is no longer being developed. It was later replaced by the HHVM (HipHop Virtual Machine) project, based on just-in-time (JIT) compilation.
Nevertheless, let us check out what the individual interpretation steps look like in their most basic form.
Lexical analysis (Lexing)
Sometimes also called tokenizing, this phase converts the string of characters that makes up the PHP source code into a sequence of tokens, each describing what the encountered value means. The tokens generated in this way help the interpreter process the code further.
PHP uses the re2c lexer generator with the definition file zend_language_scanner.l. In its basic form, it runs regular expressions against the input file to identify individual code elements, e.g. language keywords such as “if”, “switch”, “function”, etc.
If you would like to better understand how such tokens are generated, picture a loop that tries a list of regular expressions against the remaining input and emits a token for the first one that matches.
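A minimal sketch of such a regex-based tokenizer is shown here. The token names, the patterns and the tokenize() function are assumptions made up for this illustration and do not come from zend_language_scanner.l.

```php
<?php
// Hypothetical token names and patterns, chosen only for illustration.
const TOKEN_PATTERNS = [
    'T_WHITESPACE' => '/\G\s+/',
    'T_KEYWORD'    => '/\G(?:if|else|switch|function|return)\b/',
    'T_VARIABLE'   => '/\G\$[A-Za-z_][A-Za-z0-9_]*/',
    'T_NUMBER'     => '/\G\d+/',
    'T_IDENTIFIER' => '/\G[A-Za-z_][A-Za-z0-9_]*/',
    'T_SYMBOL'     => '/\G[=;:?(){}<>+\-*\/.,]/',
];

function tokenize(string $source): array
{
    $tokens = [];
    $offset = 0;
    $line = 1;
    $length = strlen($source);

    while ($offset < $length) {
        $matched = false;
        foreach (TOKEN_PATTERNS as $name => $pattern) {
            // \G anchors the match exactly at the current offset.
            if (preg_match($pattern, $source, $m, 0, $offset) === 1) {
                if ($name !== 'T_WHITESPACE') {
                    // Store the token type, its text and the line it starts on.
                    $tokens[] = ['type' => $name, 'text' => $m[0], 'line' => $line];
                }
                $line += substr_count($m[0], "\n");
                $offset += strlen($m[0]);
                $matched = true;
                break;
            }
        }
        if (!$matched) {
            throw new RuntimeException("Unexpected character '{$source[$offset]}' on line {$line}");
        }
    }

    return $tokens;
}

print_r(tokenize('if ($a > 1) { return $a; }'));
```

The order of the patterns matters: keywords are tried before generic identifiers, so “if” is reported as a keyword rather than as an ordinary name.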
Of course, the real lexer doesn’t work exactly this way, but this should give you some idea of how the code is analysed. To see the tokens PHP actually generates for a piece of code, you can pass it to the built-in token_get_all() function, which returns the list of tokens.
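A short script is enough to list the real tokens. In this sketch the sample snippet and the describeTokens() helper are made up for illustration, and the tokenizer extension (enabled by default) is assumed to be available.

```php
<?php
// Assumption: the tokenizer extension is available (it is enabled by default).
function describeTokens(string $code): array
{
    $lines = [];
    foreach (token_get_all($code) as $token) {
        if (is_array($token)) {
            // Named token: [token id, matched text, line number].
            [$id, $text, $line] = $token;
            if ($id === T_WHITESPACE) {
                continue; // skip whitespace to keep the listing short
            }
            $lines[] = sprintf('%s %s (line %d)', token_name($id), trim($text), $line);
        } else {
            // Single-character token returned as a plain string.
            $lines[] = sprintf('"%s"', $token);
        }
    }

    return $lines;
}

// Sample snippet chosen for illustration.
echo implode(PHP_EOL, describeTokens('<?php $result = $flag ? 1 : 0;')), PHP_EOL;
```

Named tokens come back as arrays, which is why the helper checks is_array() before reading the identifier, the text and the line number.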
In the list of tokens, you may notice that not every element is a named token. Single characters such as =, ;, :, ? are returned as they are and are considered tokens by themselves.
Interestingly, the lexer not only converts the code into tokens, but also stores the value carried by each token and the number of the line on which it was found. This is used, among other things, to generate the stack trace of an application.
Syntax analysis (parsing)
This is the next process which, like lexing, consists of converting the generated tokens into a more ordered and organised data structure. As with lexing, PHP relies here on an external tool, GNU Bison, driven by a BNF file containing the grammar of the language (zend_language_parser.y). Bison converts this context-free grammar into a deterministic parser. It uses the LALR(1) method: the parser reads the input from left to right, looks ahead n tokens (1 in the case of PHP) and produces a rightmost derivation in reverse. Through this process, the parser matches tokens to the grammar rules defined in the BNF file and, while doing so, validates whether the tokens form correct syntax constructs.
The final product of this phase is an abstract syntax tree (AST) generated by the parser. It is a tree representation of the source code that will be used in the compilation phase. With the php-ast extension, you can preview this structure for any code snippet: the result is a tree of nodes, each describing its kind, its line number and its child nodes.
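To get a feeling for how tokens turn into a tree without installing any extension, you can look at a toy parser. The sketch below is not how the Bison-generated LALR(1) parser works: it is a simple recursive-descent parser for arithmetic expressions, and the MiniParser class and its node layout are assumptions made up for this example.

```php
<?php
// Hypothetical mini-parser for expressions such as "1 + 2 * 3".
final class MiniParser
{
    private array $tokens;
    private int $pos = 0;

    public function __construct(string $source)
    {
        // Lexing step: numbers, operators and parentheses; anything else is ignored.
        preg_match_all('/\d+|[-+*\/()]/', $source, $m);
        $this->tokens = $m[0];
    }

    public function parse(): array
    {
        $ast = $this->expression();
        if ($this->peek() !== null) {
            throw new RuntimeException("Unexpected token '{$this->peek()}'");
        }
        return $ast;
    }

    // Look at the next token without consuming it (one token of lookahead).
    private function peek(): ?string
    {
        return $this->tokens[$this->pos] ?? null;
    }

    // expression := term (('+' | '-') term)*
    private function expression(): array
    {
        $node = $this->term();
        while (in_array($this->peek(), ['+', '-'], true)) {
            $op = $this->tokens[$this->pos++];
            $node = ['kind' => 'binary_op', 'op' => $op, 'left' => $node, 'right' => $this->term()];
        }
        return $node;
    }

    // term := factor (('*' | '/') factor)*
    private function term(): array
    {
        $node = $this->factor();
        while (in_array($this->peek(), ['*', '/'], true)) {
            $op = $this->tokens[$this->pos++];
            $node = ['kind' => 'binary_op', 'op' => $op, 'left' => $node, 'right' => $this->factor()];
        }
        return $node;
    }

    // factor := NUMBER | '(' expression ')'
    private function factor(): array
    {
        $token = $this->peek();
        if ($token === null) {
            throw new RuntimeException('Unexpected end of input');
        }
        $this->pos++;
        if ($token === '(') {
            $node = $this->expression();
            if ($this->peek() !== ')') {
                throw new RuntimeException('Missing closing parenthesis');
            }
            $this->pos++;
            return $node;
        }
        if (ctype_digit($token)) {
            return ['kind' => 'number', 'value' => (int) $token];
        }
        throw new RuntimeException("Unexpected token '{$token}'");
    }
}

print_r((new MiniParser('1 + 2 * 3'))->parse());
```

Note how the multiplication ends up deeper in the tree than the addition: operator precedence is encoded in the shape of the AST, so later phases do not need to think about it again.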
While this structure may not tell you much from a programmer’s point of view, it is useful for carrying out static code analysis using tools like Phan.
AST is the last stage of the analysis — in the next step the code in this form is transferred for compilation.
Compilation
Without JIT, PHP in its standard form compiles the generated AST into OPCode rather than into machine code, as is the case with JIT. Compilation is carried out by recursively traversing the AST, and some optimisations are made along the way. Most often, simple arithmetic on constants is evaluated at compile time, or expressions such as strlen("test") are replaced directly with the value int(4).
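The idea of walking the tree recursively and evaluating constant expressions on the way can be sketched in a few lines. Everything here is hypothetical: the node layout, the opcode names (ADD, SUB, MUL, RETURN) and the functions fold(), compileNode() and compile() are made up for illustration and are much simpler than what Zend Engine really does.

```php
<?php
// Hypothetical AST node shapes and opcode names, for illustration only.

// Fold constant sub-expressions, e.g. 2 * 3 becomes 6 at compile time.
function fold(array $node): array
{
    if ($node['kind'] !== 'binary_op') {
        return $node;
    }
    $left = fold($node['left']);
    $right = fold($node['right']);
    if ($left['kind'] === 'number' && $right['kind'] === 'number') {
        switch ($node['op']) {
            case '+': return ['kind' => 'number', 'value' => $left['value'] + $right['value']];
            case '-': return ['kind' => 'number', 'value' => $left['value'] - $right['value']];
            case '*': return ['kind' => 'number', 'value' => $left['value'] * $right['value']];
        }
    }
    return ['kind' => 'binary_op', 'op' => $node['op'], 'left' => $left, 'right' => $right];
}

// Recursively walk the AST, append opcodes and return the operand holding the result.
function compileNode(array $node, array &$opcodes, int &$temps): string
{
    switch ($node['kind']) {
        case 'number':
            return (string) $node['value'];
        case 'var':
            return '$' . $node['name'];
        case 'binary_op':
            $left = compileNode($node['left'], $opcodes, $temps);
            $right = compileNode($node['right'], $opcodes, $temps);
            $result = '~' . $temps++; // a new temporary variable
            $names = ['+' => 'ADD', '-' => 'SUB', '*' => 'MUL'];
            $opcodes[] = [$names[$node['op']], $left, $right, $result];
            return $result;
    }
    throw new InvalidArgumentException("Unknown node kind: {$node['kind']}");
}

function compile(array $ast): array
{
    $opcodes = [];
    $temps = 0;
    $result = compileNode(fold($ast), $opcodes, $temps);
    $opcodes[] = ['RETURN', $result, null, null];
    return $opcodes;
}

// AST for the expression: $x + 2 * 3
$ast = [
    'kind' => 'binary_op', 'op' => '+',
    'left' => ['kind' => 'var', 'name' => 'x'],
    'right' => [
        'kind' => 'binary_op', 'op' => '*',
        'left' => ['kind' => 'number', 'value' => 2],
        'right' => ['kind' => 'number', 'value' => 3],
    ],
];
print_r(compile($ast));
```

In this sketch the multiplication never reaches the opcode list: it is replaced by the constant 6 before any opcode is emitted, which is the same kind of shortcut as replacing strlen("test") with 4.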
As with the previous phases, there are tools for previewing the generated OPCode, among them VLD and OPcache. Take, for example, a VLD dump of a compiled Greeting class that provides a sayhello method. A skilled PHP developer can understand the structure of such a dump on a basic level. It defines the class and the method, followed by:
- receiving the value passed to the function,
- creating a temporary variable,
- concatenating the strings held by the variables,
- printing the temporary variable,
- returning from the function after its completion.
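To connect these steps with source code, here is what such a class might look like. This is a hypothetical reconstruction: only the names Greeting and sayhello come from the text above, while the parameter name and the greeting text are assumptions.

```php
<?php
// Hypothetical reconstruction: only the class and method names come from the article.
class Greeting
{
    // The method receives the value passed by the caller.
    public function sayhello($to)
    {
        // The result of the concatenation is kept in a temporary variable,
        // which is then printed by echo.
        echo 'Hello ' . $to;
        // Reaching the end of the method returns to the caller.
    }
}

(new Greeting())->sayhello('World');
```

If the VLD extension is installed, saving this as a file (say, greeting.php) and running php -d vld.active=1 -d vld.execute=0 greeting.php should print the opcodes without executing the script.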
Describing the entire issue of OPCode and its components exhaustively would definitely go beyond the scope of this article. If you want to learn more about it, the official documentation will help you get started.
Execution
This is the last phase of the interpretation process. At this stage, the generated OPCode is actually run on the Zend virtual machine (Zend Engine VM). The end result is whatever the script was supposed to generate, i.e. the output of commands such as echo or print. From the point of view of web applications, it is usually the finished HTML of a web page.
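At its core, a virtual machine of this kind is a loop that takes one opcode after another and performs the matching action. The sketch below shows that idea only: the opcode format, the operand types (const, var, tmp) and the execute() function are assumptions made up for this example, and the real Zend VM, written in C, is far more elaborate.

```php
<?php
// Hypothetical opcode format: [name, operand1, operand2, result slot].
// Operands are pairs: ['const', value], ['var', name] or ['tmp', slot].
function execute(array $opcodes, array $variables = []): string
{
    $temps = [];
    $output = '';

    // Resolve an operand to the value it stands for.
    $read = function (array $operand) use (&$temps, $variables) {
        [$type, $value] = $operand;
        switch ($type) {
            case 'const': return $value;
            case 'var':   return $variables[$value];
            case 'tmp':   return $temps[$value];
        }
        throw new RuntimeException("Unknown operand type: {$type}");
    };

    // The dispatch loop: fetch an opcode, perform its action, move on.
    foreach ($opcodes as [$name, $op1, $op2, $result]) {
        switch ($name) {
            case 'CONCAT':
                $temps[$result] = $read($op1) . $read($op2);
                break;
            case 'ADD':
                $temps[$result] = $read($op1) + $read($op2);
                break;
            case 'ECHO':
                $output .= $read($op1);
                break;
            case 'RETURN':
                return $output;
            default:
                throw new RuntimeException("Unknown opcode: {$name}");
        }
    }

    return $output;
}

$program = [
    ['CONCAT', ['const', 'Hello '], ['var', 'to'], 0],
    ['ECHO', ['tmp', 0], null, null],
    ['RETURN', null, null, null],
];

echo execute($program, ['to' => 'World']), PHP_EOL;
```

The output is collected in a string here to keep the example easy to follow; what matters is the fetch-and-dispatch loop in the middle.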
Most of us do not think about how the PHP code is actually analysed and run on a server — especially when we entrust server and application monitoring to external service providers. Nevertheless, it is good to understand what really happens to the code of your application when it is transferred to an interpreter. Such knowledge can help with both the security and performance analysis of a project developed in PHP.
Originally published at https://www.droptica.com.