This repository has been archived by the owner on Sep 11, 2023. It is now read-only.

MHTML Utils for working with Chrome/Chromium Blink saved webarchives (.mhtml)


MHTML Utils


Copyright (c) 2019 Querela. All rights reserved.

See the end of this file for further copyright and license information.

  • Free software: MIT license

This package contains utilities for working with web archives saved by Chrome/Chromium (Blink) in the .mhtml format. It may later also handle arbitrary .mht, .mhtm and .mhtml files, but for now it strictly targets the Blink implementation. See: Chromium Blink on GitHub or Chromium Blink on GoogleSource, Chromium Blink.git on GoogleSource.

This package was developed because the MIME / email utilities of the Python standard library were mangling binary content, e.g. in images: they tried to convert \r and \n line-break characters according to some policy, and attempts to switch or disable this behaviour were not successful. This package will not work for arbitrary MIME messages, but it aims to be completely safe to use with MHTML files saved by the Blink engine.

This package does not currently try to fully parse the MHTML file; instead it provides a view onto the raw binary content. Extracting a resource amounts to taking the slice between two byte offsets. The header detection should work for almost any MHTML data, but it still has to be tried on input files from other sources to be sure.
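The offset-based approach can be illustrated with a minimal sketch. This is not the package's own API: `find_parts` and the `SAMPLE` archive are hypothetical names and data made up for illustration, and the code assumes a quoted `boundary="..."` parameter and CRLF line endings, as in files written by Blink.

```python
import re

# Hypothetical hand-written sample; real Blink archives carry more headers.
SAMPLE = (
    b'From: <Saved by Blink>\r\n'
    b'MIME-Version: 1.0\r\n'
    b'Content-Type: multipart/related;\r\n'
    b'\ttype="text/html";\r\n'
    b'\tboundary="----MultipartBoundary--abc123----"\r\n'
    b'\r\n'
    b'\r\n'
    b'------MultipartBoundary--abc123----\r\n'
    b'Content-Type: text/html\r\n'
    b'Content-Location: https://example.com/\r\n'
    b'\r\n'
    b'<html><body>hi</body></html>\r\n'
    b'------MultipartBoundary--abc123----\r\n'
    b'Content-Type: image/png\r\n'
    b'Content-Transfer-Encoding: binary\r\n'
    b'Content-Location: https://example.com/a.png\r\n'
    b'\r\n'
    b'\x89PNG\r\n\x1a\n\x00\r\x00\n'
    b'\r\n'
    b'------MultipartBoundary--abc123------\r\n'
)


def find_parts(data):
    """Return a list of (header_start, body_start, body_end) byte offsets."""
    match = re.search(rb'boundary="([^"]+)"', data)
    if match is None:
        raise ValueError("no multipart boundary found")
    # Per RFC 2046 the CRLF before "--boundary" belongs to the delimiter.
    delimiter = b"\r\n--" + match.group(1)
    parts = []
    pos = data.find(delimiter, match.end())
    while pos != -1:
        after = pos + len(delimiter)
        if data.startswith(b"--", after):
            break  # closing delimiter: no more parts
        header_start = after + 2  # skip the CRLF that ends the delimiter line
        next_pos = data.find(delimiter, header_start)
        if next_pos == -1:
            break  # truncated archive: ignore the unterminated tail
        blank = data.find(b"\r\n\r\n", header_start, next_pos)
        if blank != -1:
            parts.append((header_start, blank + 4, next_pos))
        pos = next_pos
    return parts


# Each body is a plain slice of the original bytes, so binary data
# (including stray \r and \n bytes) is never rewritten.
for header_start, body_start, body_end in find_parts(SAMPLE):
    print(len(SAMPLE[body_start:body_end]), "bytes")
```

Because nothing is decoded or re-encoded, the extracted bytes are identical to what is stored in the file. The sketch ignores transfer encodings such as base64 or quoted-printable, which a real tool would still have to decode after slicing.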

This package contains several example scripts that show how it can be used. These include dumping embedded resources into a directory, extracting the main web page, and listing all the resources in an MHTML archive. It is also possible to remove existing resources from an MHTML file, e.g. to strip adverts, images etc., and to insert new resources from another MHTML file. (Later it may be possible to create resources from any file.)
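Listing and removing resources can be sketched at the byte level as follows. This is only an illustration of the idea, not one of the bundled example scripts: `split_archive`, `location_of`, `list_resources`, `remove_resource` and the `SAMPLE` data are hypothetical, and the code assumes a well-formed archive with a quoted boundary, CRLF line endings and unfolded `Content-Location` headers.

```python
import re

# Hypothetical hand-written sample archive with two resources.
SAMPLE = (
    b'MIME-Version: 1.0\r\n'
    b'Content-Type: multipart/related;\r\n'
    b'\tboundary="----MultipartBoundary--abc123----"\r\n'
    b'\r\n'
    b'\r\n'
    b'------MultipartBoundary--abc123----\r\n'
    b'Content-Type: text/html\r\n'
    b'Content-Location: https://example.com/\r\n'
    b'\r\n'
    b'<html><body>hi</body></html>\r\n'
    b'------MultipartBoundary--abc123----\r\n'
    b'Content-Type: image/png\r\n'
    b'Content-Location: https://example.com/a.png\r\n'
    b'\r\n'
    b'\x89PNG\r\n\x1a\n\r\n'
    b'------MultipartBoundary--abc123------\r\n'
)


def split_archive(data):
    """Split raw MHTML bytes at every delimiter; return (delimiter, segments)."""
    match = re.search(rb'boundary="([^"]+)"', data)
    if match is None:
        raise ValueError("no multipart boundary found")
    delimiter = b"\r\n--" + match.group(1)
    segments = data.split(delimiter)
    # Expect: archive header, one or more parts, closing "--" remainder.
    if len(segments) < 3:
        raise ValueError("archive has no parts or no closing delimiter")
    return delimiter, segments


def location_of(segment):
    """Return the Content-Location of one part, or None if it has none."""
    headers = segment.partition(b"\r\n\r\n")[0]
    found = re.search(rb"(?im)^Content-Location:[ \t]*(.*?)\r?$", headers)
    return found.group(1).decode("ascii", "replace") if found else None


def list_resources(data):
    """List the Content-Location of every part in the archive."""
    _, segments = split_archive(data)
    return [location_of(segment) for segment in segments[1:-1]]


def remove_resource(data, location):
    """Return a copy of the archive without the part at `location`."""
    delimiter, segments = split_archive(data)
    kept = [segments[0]]
    kept += [s for s in segments[1:-1] if location_of(s) != location]
    kept.append(segments[-1])
    # Re-joining with the same delimiter leaves all other bytes untouched.
    return delimiter.join(kept)


print(list_resources(SAMPLE))
stripped = remove_resource(SAMPLE, "https://example.com/a.png")
print(list_resources(stripped))
```

Inserting a resource from another archive would follow the same pattern in reverse: take the part's segment from the source file and splice it in before the closing segment. The boundary strings of the two files differ, so only the segment between delimiters is copied, not the delimiter itself.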

Since Chrome disables JavaScript and strips all unnecessary content from a newly created MHTML file, it is not really possible to make an interactive MHTML file containing a directory and linked pages. A work in progress is the ability to alter a resource, so that client scripts can be written to combine multiple MHTML files into a single one and display the whole content.

It may later also be possible to create an MHTML archive from a given list or description, but this is not a priority.

This project requires at least Python 3.5. It has no other dependencies.

To work with the source and run tests with py.test etc., the package offers several development dependencies that can be installed:

pip install -e .[dev]

Tests can be run with:

python setup.py test

Run style checks:

python setup.py flake8
python setup.py pylint

Clean up:

python setup.py clean --all

Copyright (c) 2019 Querela. All rights reserved.

See the file "LICENSE" for information on the history of this software, terms & conditions for usage, and a DISCLAIMER OF ALL WARRANTIES.

All trademarks referenced herein are property of their respective holders.