A typical Python programmer rarely bothers about .pyc files, and ideally never has to look into what happens inside them, as it is all part of how Python optimizes its execution process.
But is there a compiler for Python?
When we hear the word compiler, we immediately think of C-style compilation, where the compiler converts the source code into a machine-level executable. That is true of low-level languages, which have no interpreter to execute the source directly. High-level interpreted languages work differently: since we write code in a dynamic, reusable manner, most of them produce an intermediate bytecode to optimize the execution process.
Why intermediate bytecode?
Because reducing the redundant work makes systems more efficient.
Imagine an application with hundreds of modules and tens of packages serving millions of requests at the same time; in such time-critical systems, every millisecond counts. The compilation process may not take long enough for a user to notice, but it does add latency to each request.
In CPython, the execution process involves several steps:

1. Parse the source code into a parse tree
2. Transform the parse tree into an Abstract Syntax Tree (AST)
3. Transform the AST into a Control Flow Graph
4. Emit bytecode based on the Control Flow Graph
5. Evaluate the bytecode instructions with a stack-based interpreter (virtual machine)
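The first four steps can be observed directly from the interpreter: the built-in compile() function runs the compiler and returns the resulting code object, and the dis module disassembles the bytecode it emitted. A minimal sketch:

```python
import dis

# compile() performs steps 1-4 and returns a code object
source = "x = 1 + 2\nprint(x)"
code = compile(source, "<example>", "exec")

# dis shows the bytecode instructions emitted in step 4,
# which step 5 would feed to the virtual machine
dis.dis(code)
```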
Whenever a program is executed, these steps happen in sequence.
Each step is equally important in the process. The first four steps (up to bytecode generation) make up the compilation process; there is nothing we can do about step 5, as that is where execution itself happens.
What if we compile a program once, store the outcome, and reuse it until the source or the Python version changes? This is one of the best ways to reduce redundancy in the process: it lets us skip the first four steps of execution. So, to avoid repeating those steps every time, we can cache the result of step 4 and reuse it whenever required. This is where .pyc file generation comes in.
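This caching step is exposed directly by the standard library's py_compile module. A minimal sketch, using a throwaway module written to a temporary directory (the module name hello.py is just for illustration):

```python
import pathlib
import py_compile
import tempfile

# write a tiny throwaway module in a temp directory
tmpdir = pathlib.Path(tempfile.mkdtemp())
source = tmpdir / "hello.py"
source.write_text("print('hello')\n")

# py_compile runs steps 1-4 and writes the cached bytecode to disk,
# returning the path of the generated .pyc file
pyc_path = py_compile.compile(str(source))
print(pyc_path)  # something like .../__pycache__/hello.cpython-XY.pyc
```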
Pyc files hold the serialized Python code objects, along with some metadata identifying the time and Python version with which they were generated.
Code object? Code objects are the outcome of the Python compilation process.
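A code object carries the bytecode plus everything the interpreter needs to evaluate it. A quick way to poke at one (the filename "demo.py" below is just a label passed to compile()):

```python
# compile() gives us a code object to inspect
code = compile("a, b = 1, 2\nprint(a + b)", "demo.py", "exec")

print(code.co_filename)  # the label we passed in: demo.py
print(code.co_names)     # module-level names used, e.g. ('a', 'b', 'print')
print(code.co_consts)    # constants referenced by the bytecode
print(code.co_code[:8])  # a slice of the raw bytecode itself
```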
For Python 3.7, the bytecode file header contains the data below; the layout is not the same across versions. The header is 16 bytes (4 × 32 bits) long, bytes 0 to 15. Let's say I have a pyc file.
Open the file in binary mode.
>>> import struct
>>> file_pointer = open('test.cpython-37.pyc', 'rb')
Read the header bytes…
==> Bytes 0 to 3
The first four bytes are the magic number. The magic number is used to reject .pyc files generated by other Python versions; it changes with each incompatible change to the bytecode.
>>> magic = file_pointer.read(4)
>>> struct.unpack('<Hcc', magic)
(3394, b'\r', b'\n')
3394 is the magic number for this Python version; you can check the magic number of the running interpreter as below.
>>> import sys
>>> sys.version_info
sys.version_info(major=3, minor=7, micro=1, releaselevel='final', serial=0)
>>> import imp, struct
>>> struct.unpack('<Hcc', imp.get_magic())
(3394, b'\r', b'\n')
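Note that the imp module is deprecated on recent versions; importlib.util.MAGIC_NUMBER exposes the same four magic bytes:

```python
import importlib.util
import struct

# the running interpreter's magic bytes: a 16-bit number followed by b'\r\n'
number, cr, lf = struct.unpack("<Hcc", importlib.util.MAGIC_NUMBER)
print(number, cr, lf)
```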
==> Bytes 4 to 7
The next four bytes are a bit field. If the value is 0, it is a traditional timestamp-based header; if the lowest bit of the bit field is set, the pyc is a hash-based pyc. For more info, refer to PEP 552: https://www.python.org/dev/peps/pep-0552/
>>> dbytes = file_pointer.read(4)
>>> struct.unpack('<I', dbytes)
(0,)
==> Bytes 8 to 11
If the bit field is 0, the next four bytes hold the last-modified timestamp of the source file. This is used to invalidate the pyc file whenever the source is newer than the timestamp stored in the pyc.
>>> ts = file_pointer.read(4)
>>> struct.unpack('<I', ts)
(1544717733,)
>>> from datetime import datetime
>>> datetime.utcfromtimestamp(struct.unpack('<I', ts)[0]).strftime('%Y-%m-%d %H:%M:%S')
'2018-12-13 16:15:33'
==> Bytes 12 to 15
The next four bytes represent the source size.

>>> source_size = struct.unpack('<I', file_pointer.read(4))
>>> source_size
(89,)
After the first 16 header bytes, the remaining bytes represent the serialized code object.
>>> import marshal
>>> marshal.load(file_pointer)
<code object <module> at 0x102dc04b0, file "./test.py", line 1>
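Putting the walkthrough together, here is a small script that generates a fresh pyc and parses the whole header end to end. It assumes Python 3.7+, where the header is 16 bytes, and pins the timestamp-based invalidation mode so the layout is deterministic:

```python
import marshal
import pathlib
import py_compile
import struct
import tempfile

# generate a fresh pyc to parse, from a throwaway one-line module
tmpdir = pathlib.Path(tempfile.mkdtemp())
src = tmpdir / "test.py"
src.write_text("x = 1\n")
pyc = py_compile.compile(
    str(src), invalidation_mode=py_compile.PycInvalidationMode.TIMESTAMP
)

with open(pyc, "rb") as fp:
    magic, cr, lf = struct.unpack("<Hcc", fp.read(4))  # bytes 0-3: magic number
    flags = struct.unpack("<I", fp.read(4))[0]         # bytes 4-7: PEP 552 bit field
    mtime = struct.unpack("<I", fp.read(4))[0]         # bytes 8-11: source mtime
    size = struct.unpack("<I", fp.read(4))[0]          # bytes 12-15: source size
    code = marshal.load(fp)                            # the serialized code object

print(magic, flags, mtime, size)
print(code)
```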
In Python 2, the pyc file stays in the same directory where the .py file exists. From Python 3 this changed a bit: the files moved into a new __pycache__ directory, and the file name format changed to include the interpreter tag (for example, test.cpython-37.pyc). If we compile the same source file with multiple Python versions, all of them are stored side by side in the same __pycache__ directory, each under its own version tag.
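The version-tagged cache path for a given source file can be computed with importlib.util.cache_from_source; the tag in the result depends on the interpreter you run it with:

```python
import importlib.util

# maps a source path to its version-tagged bytecode cache path
path = importlib.util.cache_from_source("test.py")
print(path)  # e.g. __pycache__/test.cpython-37.pyc, tag varies by version
```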