This is a Python 3.6+ library that mainly does what the WHATWG HTML spec calls 'prescan'.
- Check for a UTF-8 or UTF-16 BOM first
- Prescan (parse <meta> tag to get Encoding Name)
- Resolve the retrieved Name to a Python codec name
Note: it just returns the Python codec name, not a codec object.
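The BOM check in the first step can be sketched as below. This is only a minimal illustration, not the library's actual code; the function name sniff_bom is hypothetical.

```python
import codecs

# BOM byte sequences and the WHATWG Encoding Names they indicate.
_BOMS = (
    (codecs.BOM_UTF8, "UTF-8"),         # EF BB BF
    (codecs.BOM_UTF16_BE, "UTF-16BE"),  # FE FF
    (codecs.BOM_UTF16_LE, "UTF-16LE"),  # FF FE
)


def sniff_bom(buf):
    """Return the Encoding Name indicated by a leading BOM, or None."""
    for bom, name in _BOMS:
        if buf.startswith(bom):
            return name
    return None
```

When a BOM is found, there is no need to look at any <meta> tag, which is why this check comes first.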
$ pip install html5prescan
html5prescan.get(buf, length=1024, jsonfile=None)
Parse the input byte string buf, and return (Scan, buf).
Scan is a namedtuple with fields:
- label: Encoding Label
- name: Encoding Name
- pyname: Python codec name
- start: start position of the match
- end: end position of the match
- match: matched substring
The match runs from '<meta' to the byte position where parsing successfully returned.
Encoding Label and Encoding Name are defined in WHATWG Encoding.
The site provides an encodings.json file for convenience,
and the library uses its copy of that file when the jsonfile argument is None.
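How a label resolves to a name can be sketched as below. It is an illustration under stated assumptions, not the library's code: SAMPLE is a tiny hand-written excerpt in the same shape as encodings.json, and build_label_map and label_to_name are hypothetical helper names.

```python
import json

# A tiny excerpt in the same shape as WHATWG's encodings.json
# (the real file lists many more encodings and labels).
SAMPLE = """
[
  {
    "heading": "The Encoding",
    "encodings": [
      {"name": "UTF-8", "labels": ["unicode-1-1-utf-8", "utf-8", "utf8"]}
    ]
  },
  {
    "heading": "Legacy single-byte encodings",
    "encodings": [
      {"name": "ISO-8859-7", "labels": ["greek", "greek8", "iso-8859-7"]}
    ]
  }
]
"""

ASCII_WHITESPACE = "\t\n\x0c\r "


def build_label_map(json_text):
    """Flatten the grouped JSON into a {label: name} dictionary."""
    table = {}
    for group in json.loads(json_text):
        for enc in group["encodings"]:
            for label in enc["labels"]:
                table[label] = enc["name"]
    return table


def label_to_name(label, table):
    """Strip ASCII whitespace, lowercase, and look the label up."""
    # str.lower() is a simplification of ASCII case-insensitive matching.
    key = label.strip(ASCII_WHITESPACE).lower()
    return table.get(key)
```

For example, label_to_name(" Greek\n", build_label_map(SAMPLE)) gives 'ISO-8859-7', matching the label and name pair in the command-line example.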
See the docstring of html5prescan.get for the details (e.g. $ pydoc 'html5prescan.get').
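Since only a codec name is returned, decoding is left to the caller. The sketch below shows one way to use the fields. The Scan here is a stand-in namedtuple built by hand with the field names listed above (in real use it would come from html5prescan.get), and decode_with_scan and its utf-8 fallback are assumptions for illustration.

```python
from collections import namedtuple

# Stand-in with the documented field names; in real use the value
# would be the first item returned by html5prescan.get(buf).
Scan = namedtuple("Scan", "label name pyname start end match")


def decode_with_scan(scan, buf, fallback="utf-8"):
    """Decode buf with scan.pyname, or with fallback if pyname is None."""
    codec = scan.pyname if scan.pyname is not None else fallback
    return buf.decode(codec, errors="replace")


buf = b"<meta charset=greek>\xe1\xe2\xe3"
scan = Scan("greek", "ISO-8859-7", "ISO-8859-7", 0, 20, "<meta charset=greek>")

# start and end are byte offsets into buf that delimit the match.
matched_bytes = buf[scan.start:scan.end]
text = decode_with_scan(scan, buf)
```

Here the three trailing bytes decode to Greek lowercase alpha, beta, gamma, because the declared encoding is ISO-8859-7.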
---
As a command-line script, if there are no arguments,
it reads standard input and returns Scan.
$ html5prescan
<meta charset=greek>
(CTRL+D)
Scan(label='greek', name='ISO-8859-7', pyname='ISO-8859-7',
start=0, end=20, match='<meta charset=greek>')
In all other cases, it just prints a help message.
To test, run make test.
The test data files are derived from the html5lib/encoding/tests*.dat files.
The original tests are for the main HTML parser, not for the prescan parser,
so I edited and renamed them (prescan1.dat and prescan2.dat).
See the first six commits for the diffs.
I also added some more tests ad hoc (prescan3.dat).
Then, I ran the test data against well-known libraries (validator, jsdom, html5lib).
I reported all inconsistencies upstream,
and the validator and jsdom maintainers confirmed my interpretations.
So I believe my library and tests are in a good state.
For the details, see test/resource/memo/201910-comparison.rst.
The library imitates the WHATWG prescan algorithm in Python code (countless small byte slices and copies), so it is naturally slow. But it is better to know how slow.
scrapy/w3lib uses a well-maintained, and therefore relatively complex, regex search
to get the encoding declaration.
(I think regex is mostly done in C or below in Python.)
In my humble tests, the library was about 20 times slower than w3lib.
I think this is within the expected range: not good, but not bad either.
For the details, see test/resource/memo/201910-performance.rst.
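The difference in style can be illustrated with a toy comparison. Everything here is hypothetical: the pattern is far simpler than what w3lib uses, the slicing loop is not this library's algorithm, and the timings it prints say nothing about the 20 times figure above. It only shows a regex search and a position-by-position Python scan finding the same declaration.

```python
import re
import timeit

DATA = b"<html><head>" + b"<!-- padding -->" * 20 + b"<meta charset=greek></head>"

# Illustrative pattern only; real-world patterns handle many more cases.
PATTERN = re.compile(rb"<meta[^>]*charset=([\w-]+)", re.IGNORECASE)


def find_with_regex(buf):
    """Let the regex engine do the scanning."""
    m = PATTERN.search(buf)
    return m.group(1) if m else None


def find_with_slicing(buf):
    """Walk the buffer one position at a time, slicing small byte strings."""
    needle = b"charset="
    i = 0
    while i < len(buf):
        if buf[i:i + 5].lower() == b"<meta":
            j = buf.find(b">", i)
            tag = buf[i:j] if j != -1 else buf[i:]
            k = tag.lower().find(needle)
            if k != -1:
                return tag[k + len(needle):]
        i += 1
    return None


if __name__ == "__main__":
    for fn in (find_with_regex, find_with_slicing):
        seconds = timeit.timeit(lambda: fn(DATA), number=1000)
        print(fn.__name__, round(seconds, 4))
```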
Around 2013, WHATWG introduced a new encoding called 'replacement'.
It masks some insecure non-ASCII-compatible encodings,
and it just decodes to a single U+FFFD character for any length of input bytes.
Python doesn't have a codec corresponding to this encoding,
so this library returns None for pyname.
Users may need to add an extra check for this encoding.
The library includes an implementation of this codec (replacement.py),
so in very rare cases users may want to look at it.
To register this codec, call replacement.register().
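The extra check might look like the sketch below. It is a standalone illustration and does not use or reproduce the bundled replacement.py; decode_replacement and decode are hypothetical names. Empty input is assumed to decode to an empty string, as in the WHATWG decoder.

```python
REPLACEMENT_CHAR = "\ufffd"


def decode_replacement(data):
    """Collapse any non-empty input to a single U+FFFD."""
    # Empty input stays empty; everything else becomes one character.
    return REPLACEMENT_CHAR if data else ""


def decode(name, pyname, data):
    """Decode data, special-casing the 'replacement' Encoding Name."""
    if name == "replacement":
        # pyname is None here, so data.decode(pyname) would fail.
        return decode_replacement(data)
    return data.decode(pyname)
```

The check is on the Encoding Name rather than on pyname being None, so that the caller does not confuse this case with any other reason for a missing codec name.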
https://github.com/zackw/html5-chardet
It is a C version of validator's MetaScanner.java.
He also uses html5lib tests edited for prescan,
so I am obviously following his path.
The relevant WHATWG HTML specs for prescan are:
- https://html.spec.whatwg.org/multipage/parsing.html#prescan-a-byte-stream-to-determine-its-encoding
- https://html.spec.whatwg.org/multipage/parsing.html#concept-get-attributes-when-sniffing
- https://html.spec.whatwg.org/multipage/urls-and-fetching.html#extracting-character-encodings-from-meta-elements
It is just a part of the initial encoding determination process.
---
validator, jsdom, html5lib, w3lib:
- https://github.com/validator/htmlparser
- https://github.com/jsdom/html-encoding-sniffer
- https://github.com/html5lib/html5lib-python
- https://github.com/scrapy/w3lib
The software is licensed under The MIT License. See LICENSE.