Prescan a byte string and return WHATWG and Python encoding names

openandclose/html5prescan

html5prescan

This is a Python 3.6+ library that mainly does what the WHATWG HTML spec calls 'prescan'.

  1. Check for a UTF-8 or UTF-16 BOM first
  2. Prescan (parse <meta> tags to get an Encoding Name)
  3. Resolve the retrieved Name to a Python codec name

Note: It returns only the Python codec name, not a codec object.
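
Step 1 above can be illustrated with a minimal sketch. This is not the library's own code: `sniff_bom` is a hypothetical helper name, and the returned strings are the WHATWG encoding names that correspond to each BOM.

```python
import codecs

# BOM byte sequences from the standard library, paired with WHATWG names.
_BOMS = (
    (codecs.BOM_UTF8, "UTF-8"),         # EF BB BF
    (codecs.BOM_UTF16_BE, "UTF-16BE"),  # FE FF
    (codecs.BOM_UTF16_LE, "UTF-16LE"),  # FF FE
)


def sniff_bom(buf):
    """Return the encoding name implied by a leading BOM, or None."""
    for bom, name in _BOMS:
        if buf.startswith(bom):
            return name
    return None
```

A BOM match makes the <meta> scan unnecessary, which is why this check comes first.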

Install

$ pip install html5prescan

API

html5prescan.get(buf, length=1024, jsonfile=None)

Parse input byte string buf, and return (Scan, buf).

Scan is a namedtuple with fields:

label:  Encoding Label
name:   Encoding Name
pyname: Python codec name
start:  start position of the match
end:    end position of the match
match:  matched substring

The match runs from '<meta' to the byte position at which parsing succeeded and returned.

Encoding Label and Encoding Name are defined in the WHATWG Encoding standard. The site provides an encodings.json file for convenience, and the library uses its own copy of that file when the jsonfile argument is None.
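
To make the label-to-name relationship concrete, here is a rough sketch of a lookup over data shaped like encodings.json. The JSON fragment is heavily abbreviated (the real file lists many more encodings and labels), and `build_label_map` and `label_to_name` are hypothetical names, not part of this library.

```python
import json

# Abbreviated fragment in the shape of WHATWG's encodings.json.
ENCODINGS_JSON = """
[
  {"heading": "The Encoding",
   "encodings": [
     {"name": "UTF-8", "labels": ["unicode-1-1-utf-8", "utf-8", "utf8"]}
   ]},
  {"heading": "Legacy single-byte encodings",
   "encodings": [
     {"name": "ISO-8859-7", "labels": ["greek", "greek8", "iso-8859-7"]}
   ]}
]
"""


def build_label_map(groups):
    """Flatten the grouped structure into a {label: name} dict."""
    return {
        label: enc["name"]
        for group in groups
        for enc in group["encodings"]
        for label in enc["labels"]
    }


def label_to_name(label, label_map):
    """Strip ASCII whitespace, lowercase, then look the label up."""
    key = label.strip("\t\n\f\r ").lower()
    return label_map.get(key)


LABEL_MAP = build_label_map(json.loads(ENCODINGS_JSON))
```

With this map, `label_to_name(" Greek ", LABEL_MAP)` yields the Encoding Name 'ISO-8859-7', and an unknown label yields None.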

See the docstring of html5prescan.get for the details (e.g. $ pydoc 'html5prescan.get').
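
As an illustration of how the Scan fields might be consumed, the sketch below decodes a buffer with the reported pyname. To stay runnable without the library installed, it builds a stand-in namedtuple with the same field names and fills it with the values shown in the command-line example in this README; in real use the tuple would come from html5prescan.get(buf). `decode_html` is a hypothetical helper, and returning None when pyname is None is a choice of this sketch.

```python
from collections import namedtuple

# Stand-in with the same fields as the library's Scan (an assumption made
# only so that this sketch is self-contained).
Scan = namedtuple("Scan", "label name pyname start end match")


def decode_html(scan, buf):
    """Decode buf with scan.pyname; return None if no codec is known."""
    if scan.pyname is None:
        return None
    return buf.decode(scan.pyname, errors="replace")


buf = "<meta charset=greek>αβγ".encode("iso-8859-7")
# Values copied from the command-line example in this README.
scan = Scan("greek", "ISO-8859-7", "ISO-8859-7", 0, 20, "<meta charset=greek>")
text = decode_html(scan, buf)
```

Because pyname is only a name, the decode step goes through Python's normal codec lookup.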

---

As a command-line script, when given no arguments, it reads standard input and prints the Scan.

$ html5prescan
<meta charset=greek>
(CTRL+D)
Scan(label='greek', name='ISO-8859-7', pyname='ISO-8859-7',
    start=0, end=20, match='<meta charset=greek>')

In all other cases, it just prints a help message.

Testing

To test, run make test.

Test Data

The test data files are derived from the html5lib/encoding/tests*.dat files. The original tests are for the main HTML parser, not for the prescan parser, so I edited and renamed them (prescan1.dat and prescan2.dat).

See the first six commits for the diffs.

I also added some more tests ad hoc (prescan3.dat).

Then, I tested the test data against well-known libraries (validator, jsdom, html5lib). I reported all inconsistencies upstream, and validator and jsdom maintainers confirmed my interpretations.

So I believe my library and tests are in a good state.

For the details, see test/resource/memo/201910-comparison.rst.

Performance

The library imitates the WHATWG prescan algorithm in Python code (countless small byte-string slices and copies), so it is naturally slow. Still, it is better to know how slow.

scrapy/w3lib uses a well-maintained, and therefore relatively complex, regex search to get the encoding declaration. (I think regex matching in Python is mostly done in C or below.)

In my modest tests, the library was about 20 times slower than w3lib.

I think this is within the expected range: not good, but not bad either.

For the details, see test/resource/memo/201910-performance.rst.
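
For contrast with the byte-by-byte prescan, a regex-based search can be sketched as follows. This is a deliberately simplified pattern written for illustration, not w3lib's actual expression, and `regex_charset` is a hypothetical name. It ignores cases that the prescan algorithm is designed to handle, such as declarations inside comments.

```python
import re

# Simplified: find "charset=" inside a <meta ...> tag and capture the value.
_CHARSET_RE = re.compile(
    rb'<meta[^>]*?charset\s*=\s*["\']?\s*([A-Za-z0-9_.:-]+)',
    re.IGNORECASE,
)


def regex_charset(buf, length=1024):
    """Return the first charset label found in buf[:length], or None."""
    m = _CHARSET_RE.search(buf[:length])
    return m.group(1).decode("ascii") if m else None
```

A single compiled search like this runs mostly inside the regex engine, which is the likely source of the speed gap described above.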

Replacement Encoding

Around 2013, WHATWG introduced a new encoding called 'replacement'. It masks some insecure non-ASCII-compatible encodings, and it decodes input bytes of any length to a single U+FFFD character.

Python doesn't have a codec corresponding to this encoding, so this library returns None for pyname. Users may need to add an extra check for this encoding.

The library includes an implementation of this codec (replacement.py), which users may want to look at in the rare cases where it is needed.

To register this codec, call replacement.register().
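
The extra check mentioned above might look like the sketch below. It does not use the bundled replacement.py. Instead, `replacement_decode` is a hypothetical stand-in that mimics the described behavior (one U+FFFD for any non-empty input), and it assumes the Scan name for this encoding is 'replacement', as in the WHATWG Encoding standard.

```python
def replacement_decode(data):
    """Mimic the 'replacement' decoder: one U+FFFD for non-empty input."""
    return "\ufffd" if data else ""


def decode_checked(name, pyname, data):
    """Decode data, special-casing the codec-less 'replacement' encoding."""
    if pyname is not None:
        return data.decode(pyname)
    if name == "replacement":
        return replacement_decode(data)
    # No usable codec and not 'replacement': leave the decision to the caller.
    raise LookupError("no Python codec for encoding name %r" % (name,))
```

Here name and pyname stand for the corresponding Scan fields.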

Similar projects

https://github.com/zackw/html5-chardet

It is a C version of validator's MetaScanner.java. The author also uses html5lib tests edited for prescan, so I am clearly following his path.

Reference

Relevant WHATWG html specs for prescan are:

It is just a part of the initial encoding determination process.

---

validator, jsdom, html5-lib, w3lib:

License

The software is licensed under The MIT License. See LICENSE.
