Whatever you need to do with Python, you can probably import a library for it.
But what exactly happens when you use that import statement?
How does a source file that you've installed or written become a Python module
object, providing methods or classes for you to play with?
While the import mechanism is relatively well-documented in the reference and
dozens of PEPs, sometimes even Python veterans are caught by surprise.
And some details are little-known: did you know you can import from zip
archives? Write CPython modules in C, or a dialect of Lisp?
Or even import from URLs (which might not be a good idea)?
This talk explains exactly what can happen when you use the import statement –
from the mundane machinery of searching PYTHONPATH through subtle details
of packages and import loops, to deep internals of custom importers and
C extension loading.
Import deep dive
- The Basics: File-based importing
- Import statement
- __import__ function
- importlib
- imp
- Finding
- PEP 235 (Import on Case-Insensitive Platforms)
- Code
- PEP 263 (Defining Python Source Code Encodings)
- PEP 3120 (Using UTF-8 as the default source encoding)
- Cache
- PEP 3147 (PYC Repository Directories)
- Filename tags
- sys.path
- Reloading modules
- What reload() does
- IPython's %autoreload
- Reload in Web frameworks
- Packages
- Absolute vs. relative imports
- PEP 328 (Imports: Multi-Line and Absolute/Relative)
- Import loops
- Namespace packages
- Classic: __path__ manipulation
- Now: PEP 420 (Implicit Namespace Packages)
- The machinery
- PEP 302 (New Import Hooks)
- PEP 451 (A ModuleSpec Type for the Import System)
- The Meta Path
- Finders
- ModuleSpec
- Loaders
- Importers (Finder+Loader combined)
- Built-in Modules
- Frozen Modules
- Module Matrix (compiled-in vs external ; Python vs C)
- Extension modules
- Extension modules
- Module cache
- ABI tags
- PEP 3149 (ABI version tagged .so files)
- PEP 489 (Redesigning extension module loading)
- Custom finders
- Zip imports (and zipapp)
- PEP 441 (Improving Python ZIP Application Support)
- Cython's pyximport
- importlib.abc
- InspectLoader
- URL imports
- (Git imports)
- Other Loader Functionality
- get_data() (ResourceLoader)
- get_code() (InspectLoader)
- The -m switch and runpy
- How the __main__ module is special
- PEP 338 (Executing modules as scripts)
*******************************************************************************
__import__ function
- why is it there?
- use it?
sys.meta_path
finders
sys.path
path hooks
sys.path_importer_cache
loaders
Circular imports
__pycache__ / .pyc
removing the .py file now works
Module types
* source
* built-in
* extension
* frozen
Reloading
It took Brett Cannon 2.5 years to write importlib
*******************************************************************************
[4m ] intro & high level
[4m ] packages ; circular imports
[4m-] the meta_path ; logging imports
[4m ] sys.path & path hooks ; file types
[4m-] PYC cache ; how that works
[4m ] import loops
*******************************************************************************
*******************************************************************************
[5min] __import__ && sys.modules
[5min] Find, common case.
[5min] Load, common case. Packages. Import loops.
[5min] Loaders in detail; sys.path; sys.meta_path; Different loaders (module types). Finders in detail; different finders
[5min] Reloading; __main__
*******************************************************************************
Whatever you need to do with Python, you can probably import a library for it.
But what exactly happens when you use that import statement?
Let's say I import some random Python module.
import random
Now, of all the kinds of statements in Python, `import` is, on its own, one of
the least magical -- right after `pass`, `assert`, and Python 2's `print`.
What do I mean by "least magical"? Well, it's very easy to replace it by
a function:
random = __import__('random')
That's all import does: call a function (which is largely written in Python),
and assign the result to a variable.
If you're importing a submodule
import urllib.parse
the top-level module is actually returned and assigned:
urllib = __import__('urllib.parse')
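A quick way to check this for yourself; the sketch below uses the standard library's `urllib.parse` purely as an illustration:

```python
import sys

# __import__ with a dotted name imports the submodule,
# but hands back the top-level package.
top = __import__('urllib.parse')
print(top.__name__)

# The submodule itself is reachable as an attribute, or via sys.modules.
parse = sys.modules['urllib.parse']
print(top.parse is parse)
```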
It *can* get a bit more complicated if you're importing individual items from
a module, like:
from random import randint, randrange
which roughly corresponds to:
_random = __import__('random', fromlist=['randint', 'randrange'])
randint = _random.randint
randrange = _random.randrange
Yet more complicated is the case of a relative import from a package:
from . import util
in which case, the __import__ function needs a bit more context:
_package = __import__('', globals=globals(), fromlist=['util'], level=1)
util = _package.util
And it gets more complicated if you're importing everything from a module:
from math import *
where the algorithm for figuring out what actually gets assigned is not
entirely trivial:
_math = __import__('math', fromlist=['*'])
try:
    _names = _math.__all__
except AttributeError:
    _names = [n for n in dir(_math) if not n.startswith('_')]
for _name in _names:
    globals()[_name] = getattr(_math, _name)
... okay, so maybe I lied about the import statement not being very magical.
But, well, it's pretty close.
The point is, it figures out arguments to an __import__ function,
calls it, and gets back a module object.
Then, it assigns either the module or some of its attributes to names.
Here's the full signature:
__import__(name, globals=None, locals=None, fromlist=(), level=0)
Now, the __import__ function is a bit of historical baggage.
You can replace it with your own implementation, but doing that sanely
is quite messy, and there are better ways to customize importing,
which I'll talk about later.
You can also call it yourself, but there's a newer alternative,
`importlib.import_module`, which is better suited for programmatic use
and isn't bogged down by backwards compatibility requirements.
You simply give it the module name:
>>> import importlib
>>> math = importlib.import_module('math')
>>> math
<module 'math' from ...>
For relative imports, you need to give it the name of the package you're
importing from:
>>> mod = importlib.import_module('..mod', 'pkg.subpkg')
>>> mod
<module 'pkg.mod' from ...>
Et voilà, out pops a module object, ready to use.
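Unlike `__import__`, `import_module` hands back the module you actually named, even for dotted names. A small check, using the standard library's `urllib.parse` as a stand-in for `pkg.mod`:

```python
import importlib

# Absolute: the submodule itself is returned, not the top-level package.
parse = importlib.import_module('urllib.parse')
print(parse.__name__)

# Relative: the leading dot is resolved against the `package` argument.
same = importlib.import_module('.parse', package='urllib')
print(same is parse)
```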
import_module is just as powerful as __import__; in fact,
both are just front-ends for the *import machinery*.
So, where are we? When you run your import statement, two things happen:
* call the *import machinery*, out pops a module
* assign the result (or parts of it) to names
The rest of this talk will be about *import machinery*.
It's called a "deep dive", because we start at the top, and delve into the
details of the importing algorithm.
Now, the first thing the import machinery does when asked to load a module
is to look into the `sys.modules` dictionary.
* call the *import machinery*
    * is it in `sys.modules`?
        * yes -> return
        * no
            * import it
* assign the result (or parts of it) to names
This is Python's record of all modules that are currently loaded.
If the requested module is already there, it just gets returned.
Look at this: if I import a module, stash it somewhere, then import it again,
I get the same exact module object.
>>> import math
>>> previous = math
>>> import math
>>> math is previous
True
No kind of re-loading takes place, so the second import is very fast.
Now, this record can be easily invalidated: `sys.modules` is just a regular
dictionary:
>>> import sys
>>> sys.modules
{'__main__': <module '__main__' (built-in)>,
...
'math': <module 'math' from '...'>,
'sys': <module 'sys' (built-in)>,}
I can certainly delete a module from this dict, which will cause the
import machinery to load it again!
>>> import math
>>> previous = math
>>> del sys.modules['math']
>>> import math
>>> math is previous
False
We can even "poison the cache", putting a non-module object in.
The import machinery doesn't care in the slightest:
>>> sys.modules['impostor'] = "Hey! I'm not a module!"
>>> import impostor
>>> impostor
"Hey! I'm not a module!"
This is considered a feature.
Some wacky modules actually go so far as to replace themselves in sys.modules
*while they're being imported*, tricking the import machinery to return
a replacement object.
In fact, the import algorithm is carefully designed to allow this:
after doing the hard work of importing, it explicitly reaches into
sys.modules to retrieve the object it returns.
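A minimal sketch of that trick, assuming a throwaway module written to a temporary directory. The module name `selfswap_demo` and the class `Replacement` are invented for this illustration.

```python
import importlib
import sys
import tempfile
import textwrap
from pathlib import Path

# Source of a module that swaps its own sys.modules entry during import.
SOURCE = textwrap.dedent('''
    import sys

    class Replacement:
        answer = 42

    sys.modules[__name__] = Replacement()
''')

with tempfile.TemporaryDirectory() as tmp:
    sys.path.insert(0, tmp)
    try:
        Path(tmp, "selfswap_demo.py").write_text(SOURCE)
        importlib.invalidate_caches()
        # The import statement binds whatever ends up in sys.modules.
        import selfswap_demo
    finally:
        sys.path.remove(tmp)

print(type(selfswap_demo).__name__)
print(selfswap_demo.answer)
```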
* call the *import machinery*
    * is it in `sys.modules`?
        * yes
            * return it!
        * no
            * do the import (which puts it in sys.modules)
            * return the sys.modules entry
    * out pops a module
* assign the result (or parts of it) to names
*******************************************************************************
Now. How do we do an import?
There are two main things to do. When importing some random module,
import random
First, Python *finds* a corresponding file – on my system, it's in some system
directory in this case:
/usr/lib64/python3.4/random.py
and once it's found that, it reads and compiles and runs the file to give me
a module object. This is known as *loading*.
* call the *import machinery*
    * if it is in `sys.modules`, return it!
    * ???
    * find source file
    * load the module (which puts it in sys.modules)
    * return the sys.modules entry
    -> out pops a module
* assign the result (or parts of it) to names
And there's also an additional step before finding the source.
Can anyone guess what it is?
* call the *import machinery*
    * if it is in `sys.modules`, return it!
    * ??? (packages)
    * if it is in `sys.modules`, return it!
    * find source file
    * load the module (which puts it in sys.modules)
    * return the sys.modules entry
    -> out pops a module
* assign the result (or parts of it) to names
Well, there's the usual sys.modules check, but there's one more step still,
and it has to do with packages.
We'll get to it in due time; now let's examine how finding and loading works.
First, finding. Finding is done by a Finder, and for the common case of
loading from a Python file on disk, it goes like this:
* Loop through sys.path
* Once a module is found, construct a "ModuleSpec"
There's a list of all the filesystem paths where Python modules may be found,
in order of priority.
You can add to it at runtime, or set the PYTHONPATH environment
variable before starting Python to add extra paths to it.
On my system, it looks like this:
>>> sys.path
['', '/usr/lib64/python34.zip', '/usr/lib64/python3.4',
'/usr/lib64/python3.4/plat-linux', '/usr/lib64/python3.4/lib-dynload',
'/home/petr/.local/lib/python3.4/site-packages',
'/usr/lib64/python3.4/site-packages', '/usr/lib/python3.4/site-packages']
*** *** *** TODO *** *** *** virtualenv comes in here
*** *** *** TODO *** *** *** importlib.util.find_spec()
Is there a `random.py` in the current directory? No.
Is there a `random.py` in `/usr/lib64/python34.zip`? No.
Is there a `random.py` in `/usr/lib64/python3.4`? Why, yes! Yes it's there!
In reality, not just `.py` files are checked. There are also `.pyc` files
for sourceless compiled modules, `.so` or `.pyd` files for extension modules,
and some monster extensions for version-tagged extension modules:
.cpython-36dm-x86_64-linux-gnu.so
.abi3.so
.so
.py
.pyc
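The suffixes a particular interpreter recognises can be listed from `importlib.machinery`. This is only a sketch: the concatenation order below is chosen to mirror the list above, and the exact extension suffixes differ between platforms and builds.

```python
import importlib.machinery as machinery

# Extension modules first, then source, then sourceless bytecode.
search_order = (
    machinery.EXTENSION_SUFFIXES
    + machinery.SOURCE_SUFFIXES
    + machinery.BYTECODE_SUFFIXES
)
for suffix in search_order:
    print(suffix)
```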
So the search goes:
Is there a `random.cpython-36dm-x86_64-linux-gnu.so` in the current directory? No.
Is there a `random.abi3.so` in the current directory? No.
Is there a `random.so` in the current directory? No.
Is there a `random.py` in the current directory? No.
Is there a `random.pyc` in the current directory? No.
Is there a `random.cpython-36dm-x86_64-linux-gnu.so` in `/usr/lib64/python34.zip`? No.
... ... ...
And this happens for each directory. There's some caching going on,
so that Python doesn't actually ask the file system each time.
Python is smart enough to invalidate the cache when files are added to or
removed from the directory, but just in case, if you ever add or remove module
files at runtime, you should clear this cache:
importlib.invalidate_caches()
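As a sketch of the situation where this matters, the following writes a module file after the interpreter has started and imports it right away. The module name `fresh_demo` and the temporary directory are assumptions made for the example.

```python
import importlib
import sys
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as tmp:
    sys.path.insert(0, tmp)
    try:
        # The module file appears only now, at runtime...
        Path(tmp, "fresh_demo.py").write_text("VALUE = 'freshly written'\n")
        # ...so drop any stale directory listings before importing it.
        importlib.invalidate_caches()
        fresh_demo = importlib.import_module("fresh_demo")
    finally:
        sys.path.remove(tmp)

print(fresh_demo.VALUE)
```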
Anyway, if all this searching doesn't find a module, then other Finders
are given a chance – I'll talk about that later – and if the module isn't found
at all you get an ImportError.
If the module *is* found, on the other hand, the finder writes up a formal
report of its findings: a ModuleSpec.
Think of it as a ticket, a prescription, a passport, documenting the basic
properties of the module and how it is to be loaded, something that the rest of the
machinery can look at and follow as the module is created, and that acts as
a permanent record of how the module was loaded.
With the ModuleSpec created, the Finder's job is done.
Now, the ModuleSpec is quite a simple object. It has a bunch of data
attributes, which should only ever be set by the finder. These are:
ModuleSpec:
    name
        The all-important name of the module, the key for the sys.modules dict.
    origin
        The path where the file should be loaded from.
    cached
        The path where a pre-compiled version of the file is to be found.
    loader
        The object responsible for loading the module, chosen (in this case)
        based on the file extension.
    loader_state
        Some further info for use by the loader, which is actually unused in all
        the loaders that come with Python.
When you're importing a submodule, the parent package's name is also
put in the ModuleSpec:
    parent
And when importing a package, there's additional info about where to find
submodules:
    submodule_search_locations
That's all the info the Finder finds, bundled up and ready to go to the Loader.
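You can ask for a ModuleSpec without loading anything by calling `importlib.util.find_spec`. A small sketch that prints the attributes described above; the paths shown will depend on your installation:

```python
import importlib.util

spec = importlib.util.find_spec("random")
for attr in ("name", "origin", "cached", "loader", "parent",
             "submodule_search_locations"):
    print(attr, "=", getattr(spec, attr))

# For a package, submodule_search_locations lists where submodules live,
# and a submodule's spec names its parent package.
print(importlib.util.find_spec("json").submodule_search_locations)
print(importlib.util.find_spec("json.decoder").parent)
```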
* call the *import machinery*
    * if it is in `sys.modules`, return it!
    * ??? (packages)
    * if it is in `sys.modules`, return it!
    * find source file
        * go through sys.path until a suitable file is found
        * produce a ModuleSpec
    * load the module (which puts it in sys.modules)
    * return the sys.modules entry
    -> out pops a module
* assign the result (or parts of it) to names
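The steps so far can be sketched as a small function. This is a simplification, not the real algorithm: `toy_import` is a made-up name, it only handles top-level modules (the packages step is skipped), and it ignores the import lock.

```python
import importlib.util
import sys

def toy_import(name):
    """Simplified import of a top-level module (no packages, no locking)."""
    # 1. Already loaded? Return the sys.modules entry.
    if name in sys.modules:
        return sys.modules[name]
    # 2. Find: ask the finders for a ModuleSpec.
    spec = importlib.util.find_spec(name)
    if spec is None:
        raise ImportError(f"No module named {name!r}")
    # 3. Load: create the module, register it, then execute its code.
    module = importlib.util.module_from_spec(spec)
    sys.modules[name] = module
    try:
        spec.loader.exec_module(module)
    except BaseException:
        sys.modules.pop(name, None)
        raise
    # 4. Return whatever is in sys.modules now.
    return sys.modules[name]

sys.modules.pop("colorsys", None)   # force a fresh load for the demo
colorsys = toy_import("colorsys")
print(colorsys.rgb_to_hsv(1.0, 0.0, 0.0))
```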
*******************************************************************************
As for loading, the process is described quite well in PEP 451.
Here's the overview:
* create a module object (Loader.create_module)
First, the loader creates a module object. This is similar to any old object
that you can put attributes on. A __name__ attribute is filled in at this step.
* set module attributes
    * __name__
    * __loader__
    * __package__
    * __spec__
    * __file__
    * __cached__
Then, some special attributes are copied over from the ModuleSpec.
This is common to all modules; the Loader doesn't have a say in it.
And then comes the hard work: the Loader is asked to "execute" the module,
running its code to populate it with the classes and functions we're importing it for.
In our case, it's a lot of work indeed:
* Loader.exec_module
    * Check pre-compiled bytecode file (from __pycache__)
    * If it doesn't exist:
        * read origin (__file__)
        * compile the code
        * write the bytecode (-B permitting)
    * Execute the bytecode “inside” the module object
__builtins__, __doc__
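A rough sketch of the compile-and-execute part, done by hand rather than through a real loader. The file name `greeting_demo.py` is invented, and the bytecode-cache check and write are left out; `importlib.util.cache_from_source` only shows where the cached file would go.

```python
import importlib.util
import tempfile
import types
from pathlib import Path

with tempfile.TemporaryDirectory() as tmp:
    source_path = Path(tmp, "greeting_demo.py")
    source_path.write_text("MESSAGE = 'hello from ' + __name__\n")

    # Where the pre-compiled bytecode for this source file would be cached.
    cache_path = importlib.util.cache_from_source(str(source_path))

    # Create an empty module, compile the source, and run the code
    # with the module's namespace as its globals.
    module = types.ModuleType("greeting_demo")
    module.__file__ = str(source_path)
    code = compile(source_path.read_text(), str(source_path), "exec")
    exec(code, module.__dict__)

print(cache_path)
print(module.MESSAGE)
print("__builtins__" in module.__dict__)
```

Executing the code is also what adds `__builtins__` to the module's namespace.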
*** [Don't forget] Loader functions to get partial results
*** [Don't forget] sys.path_hooks
*** [Don't forget] __path__ & attributes for submodules *** ... os.path vs urllib.parse
*** [Don't forget] sys.modules entry replacement ***
*** [Don't forget] thread lock (do I get concurrent imports if there's an import statement inside a function?)
*** [Don't forget] why keep init files small
*** [Don't forget] zip files are basically interchangeable with directories
*** [Don't forget] WHY! Top-down approach; if you want to read the source, here's a map
*** [forget?] Namespace packages
*******************************************************************************
podcast notes
- The machinery is all in Python now
tutorial notes
- zip files, __main__