python wayback machine downloader

Downloading archived web pages from the Wayback Machine.

The Internet Archive is a useful source of OSINT information. This tool is a work in progress for querying and fetching archived web pages.

This tool allows you to download content from the Wayback Machine (archive.org). You can use it to download either the latest version or all versions of web page snapshots within a specified range.

Content

➡️ Installation
➡️ notes / issues / hints
➡️ import
➡️ cli
➡️ Usage
➡️ Examples
➡️ Output
➡️ Contributing

Installation

Pip

Install the package
pip install pywaybackup
Run the tool
waybackup -h

Manual

Clone the repository
git clone https://github.com/bitdruid/python-wayback-machine-downloader.git
Install
pip install .
- in a virtual env, or use --break-system-packages

notes / issues / hints

Linux recommended: On Windows machines, the path length is limited. Files that exceed the path length will not be downloaded.
The tool uses a sqlite database to handle snapshots. The database will only persist while the download is running.
If you query an explicit file (e.g. a query-string ?query=this or login.html), the --explicit argument is recommended, as a wildcard query may lead to an empty result.
Downloading directly into a network share is not recommended. The sqlite locking mechanism may cause issues. If you need to download into a network share, set the --metadata argument to a local path.

import

You can import pywaybackup into your own scripts and run it. The arguments are the same as for the CLI.

Additional args:

silent (default False): If True, suppresses all output to the console.
debug (default True): If False, disables writing errors to the error log file.

Use:

run()
status()
paths()
stop()

from pywaybackup import PyWayBackup

backup = PyWayBackup(
  url="https://example.com",
  all=True,
  start="20200101",
  end="20201231",
  silent=False,
  debug=True,
  log=True,
  keep=True
)

backup.run()
backup_paths = backup.paths(rel=True)
print(backup_paths)

output:

{
  'snapshots': 'output/example.com',
  'cdxfile': 'output/waybackup_example.cdx',
  'dbfile': 'output/waybackup_example.com.db',
  'csvfile': 'output/waybackup_https.example.com.csv',
  'log': 'output/waybackup_example.com.log',
  'debug': 'output/waybackup_error.log'
}

... or run it asynchronously and print the current status or stop it whenever needed.

import time
from pywaybackup import PyWayBackup

backup = PyWayBackup( ... )
backup.run(daemon=True)
print(backup.status())
time.sleep(10)
print(backup.status())
backup.stop()

output:

{
  'task': 'downloading snapshots',
  'current': 15,
  'total': 84,
  'progress': '18%'
}
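
The status dictionary shown above can be turned into a one-line progress message. This is a minimal sketch: `format_status` is a hypothetical helper and not part of pywaybackup, and it assumes the keys `task`, `current`, `total`, and `progress` appear as in the sample output.

```python
def format_status(status: dict) -> str:
    """Render a status dict (shaped like the sample output) as one line."""
    # Assumption: the keys match the sample output; fall back to
    # placeholders so a missing key does not raise.
    task = status.get("task", "unknown task")
    current = status.get("current", "?")
    total = status.get("total", "?")
    progress = status.get("progress", "?")
    return f"{task}: {current}/{total} ({progress})"


sample = {
    "task": "downloading snapshots",
    "current": 15,
    "total": 84,
    "progress": "18%",
}
print(format_status(sample))
```

In a script, the same helper could be called on the result of `backup.status()` between `time.sleep()` calls.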

cli

-h, --help: Show the help message and exit.
-v, --version: Show information about the tool and exit.

Required

-u, --url:
The URL of the web page to download. This argument is required.

Mode Selection (Choose One)

-a, --all:
All timestamps. Gives one folder per timestamp.
-l, --last:
Last version. Gives one folder containing the last version of each file within the specified --range.
-f, --first:
First version. Gives one folder containing the first version of each file within the specified --range.

Optional query parameters

Parameters for archive.org CDX query. No effect on snapshot download itself.

-e, --explicit:
Only the explicit URL. No wildcard subdomains or paths. For example, to get only the root (https://example.com) or a specific file (login.html, ?query=this).
--limit <count>:
Limits the snapshots fetched from archive.org CDX. (Will have no effect on existing CDX files)
Range Selection:
Set the query range in years (range) or as timestamps (start and/or end). If range is given, start and end are ignored. Format for timestamps: YYYYMMDDhhmmss. A timestamp can be as specific as needed (year 2019, year+month+day 20190101, ...).
- -r, --range:
  Specify the range in years for which to search and download snapshots.
- --start:
  Timestamp to start searching.
- --end:
  Timestamp to end searching.
Filtering:
- --filetype <filetype>:
  Specify filetypes to download. Example: --filetype jpg,css,js. You can only filter filetypes which are stored by archive.org (.html mostly not)
- --statuscode <statuscode>:
  Specify HTTP status codes to download. Example: --statuscode 200,301. PyWayBackup will always skip 404 and 301.
  Common status codes you may want to handle/filter:
  - 200 (OK)
  - 301 (Moved Permanently)
  - 404 (Not Found - snapshot seems to be empty)
  - 500 (Internal Server Error - snapshot is at least for now not available)
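
The timestamp format described under Range Selection (YYYYMMDDhhmmss, truncated as needed) can be produced from a Python `datetime`. The sketch below is only an illustration: `to_wayback_timestamp` and its `precision` parameter are hypothetical names, not part of pywaybackup.

```python
from datetime import datetime


def to_wayback_timestamp(moment: datetime, precision: str = "second") -> str:
    """Format a datetime as a (possibly truncated) YYYYMMDDhhmmss string."""
    # Hypothetical precision names; each maps to a prefix of YYYYMMDDhhmmss.
    formats = {
        "year": "%Y",
        "day": "%Y%m%d",
        "second": "%Y%m%d%H%M%S",
    }
    return moment.strftime(formats[precision])


start = to_wayback_timestamp(datetime(2019, 1, 1), "day")
print(start)
```

The resulting string can be passed as the value of --start or --end.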

Optional Behavior Manipulation

Parameters will change the download behavior for snapshots.

-o, --output:
The folder where downloaded files will be saved. Defaults to waybackup_snapshots in the current directory.
-m, --metadata:
Folder where metadata will be saved (cdx/db/csv/log). If you are downloading into a network share, you SHOULD set this to a local path because the sqlite locking mechanism may cause issues with network shares.
--verbose:
Increase output verbosity.
--log:
Saves a log file named waybackup_<sanitized_url>.log into the output-dir.
--progress:
Shows a progress bar instead of the default output.
--workers <count>:
Number of simultaneous download workers. Default is 1; up to about 10 is generally safe. Too many workers may lead to refused connections by archive.org.
--no-redirect:
Disables following redirects of snapshots. Can prevent timestamp-folder mismatches caused by redirects.
--retry <attempts>:
Retry attempts for failed downloads.
--delay <seconds>:
Delay between download requests in seconds. Default is no delay (0).

Job Handling:

--reset:
If set, the job will be reset, and cdx, db, csv files will be deleted. This allows you to start the job from scratch.
--keep:
If set, cdx and db files will be kept after the job is finished. Otherwise they will be deleted.

Usage

Handling Interrupted Jobs

pywaybackup resumes interrupted jobs. The tool automatically continues from where it left off.

Jobs are only resumed if:

.cdx and .db files already exist in the output dir
the command is identical in URL, mode, and optional query parameters

Note: Changing URL, mode selection, query parameters or output prevents automatic resumption.
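
The file-based condition above can be checked from a script before starting a job. This is a rough sketch: `has_resume_files` is a hypothetical helper, and it only looks for the presence of .cdx and .db files. It does not verify that the command matches the earlier run, which the tool itself decides.

```python
from pathlib import Path


def has_resume_files(directory: str) -> bool:
    """Return True if the directory holds at least one .cdx and one .db file."""
    folder = Path(directory)
    # Assumption: the metadata files sit directly in this directory.
    has_cdx = any(folder.glob("*.cdx"))
    has_db = any(folder.glob("*.db"))
    return has_cdx and has_db
```

If --metadata points to a different folder, that folder is the one to check.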

Examples

Download a specific single snapshot of all available files (starting from root):
waybackup -u https://example.com -a --start 20210101000000 --end 20210101000000
Download a specific single snapshot of all available files (starting from a subdirectory):
waybackup -u https://example.com/subdir1/subdir2/assets/ -a --start 20210101000000 --end 20210101000000
Download a specific single snapshot of the exact given URL (no subdirs):
waybackup -u https://example.com -a --start 20210101000000 --end 20210101000000 --explicit
Download all snapshots of all available files in the given range:
waybackup -u https://example.com -a --start 20210101000000 --end 20231122000000
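
The commands above can also be assembled from a script. The following sketch builds the argument list using only the flags documented in this README; `build_waybackup_command` is a hypothetical helper, not something the package provides.

```python
import shlex


def build_waybackup_command(url, mode="all", start=None, end=None, explicit=False):
    """Build a waybackup argument list from the documented CLI flags."""
    # Mode flags as listed under "Mode Selection".
    mode_flags = {"all": "-a", "last": "-l", "first": "-f"}
    command = ["waybackup", "-u", url, mode_flags[mode]]
    if start:
        command += ["--start", start]
    if end:
        command += ["--end", end]
    if explicit:
        command.append("--explicit")
    return command


cmd = build_waybackup_command(
    "https://example.com",
    start="20210101000000",
    end="20210101000000",
    explicit=True,
)
# Print the command as a shell-safe string for inspection.
print(shlex.join(cmd))
```

If waybackup is installed, the list can be passed to `subprocess.run(cmd, check=True)`.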

Output

Path Structure

The output path is currently structured as follows, using this query as an example:
http://example.com/subdir1/subdir2/assets/

For the first and last version (-f or -l):

Will only include all files/folders starting from your query-path.

your/path/waybackup_snapshots/
└── the_root_of_your_query/ (example.com/)
    └── subdir1/
        └── subdir2/
            └── assets/
                ├── image.jpg
                ├── style.css
                ...

For all versions (-a):

Will create a folder named as the root of your query. Inside this folder, you will find all timestamps and per timestamp the path you requested.

your/path/waybackup_snapshots/
└── the_root_of_your_query/ (example.com/)
    ├── yyyymmddhhmmss/
    │   ├── subdir1/
    │   │   └── subdir2/
    │   │       └── assets/
    │   │           ├── image.jpg
    │   │           └── style.css
    ├── yyyymmddhhmmss/
    │   ├── subdir1/
    │   │   └── subdir2/
    │   │       └── assets/
    │   │           ├── image.jpg
    │   │           └── style.css
    ...
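
The two layouts above can be expressed as a small path computation. This sketch only mirrors the trees shown in this section and is not the tool's own code: `expected_snapshot_path` is a hypothetical helper, and it assumes the root folder is the host part of the URL.

```python
from pathlib import PurePosixPath
from urllib.parse import urlsplit


def expected_snapshot_path(output_dir, url, timestamp=None):
    """Return where a file would land according to the trees above."""
    parts = urlsplit(url)
    # Assumption: the root folder is named after the host (example.com).
    base = PurePosixPath(output_dir) / parts.netloc
    if timestamp is not None:
        # -a mode: one folder per timestamp below the root.
        base = base / timestamp
    return base / parts.path.lstrip("/")


url = "http://example.com/subdir1/subdir2/assets/image.jpg"
print(expected_snapshot_path("waybackup_snapshots", url))
print(expected_snapshot_path("waybackup_snapshots", url, "20210101000000"))
```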

CSV

The CSV contains a snapshot per row:

[
   {
      "file": "/your/path/waybackup_snapshots/example.com/yyyymmddhhmmss/index.html",
      "id": 1,
      "redirect_timestamp": "yyyymmddhhmmss",
      "redirect_url": "http://web.archive.org/web/yyyymmddhhmmssid_/http://example.com/",
      "response": 200,
      "timestamp": "yyyymmddhhmmss",
      "url_archive": "http://web.archive.org/web/yyyymmddhhmmssid_/http://example.com/",
      "url_origin": "http://example.com/"
   },
    ...
]
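
The snapshot records can be post-processed with Python's standard csv module. The sketch below assumes the CSV file has a header row using the field names shown above (such as `response` and `url_origin`); `successful_rows` is a hypothetical helper and the sample rows are invented for illustration.

```python
import csv
import io


def successful_rows(csv_file):
    """Return the rows whose response column is 200."""
    # Assumption: a header row names the columns as in the listing above.
    reader = csv.DictReader(csv_file)
    # csv yields strings, so compare against "200" rather than the integer.
    return [row for row in reader if row.get("response") == "200"]


# Invented sample data standing in for a real waybackup CSV file.
sample = io.StringIO(
    "timestamp,url_origin,response\n"
    "20210101000000,http://example.com/,200\n"
    "20210102000000,http://example.com/missing,404\n"
)
for row in successful_rows(sample):
    print(row["timestamp"], row["url_origin"])
```

For a real file, open it with `open(path, newline="")` and pass the file object in the same way.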

Log

Verbose:

-----> Worker: 2 - Attempt: [1/1] Snapshot ID: [23/81]
SUCCESS   -> 200 OK
          -> URL:  https://web.archive.org/web/20240225193302id_/https://example.com/assets/css/custom-styles.css
          -> FILE: /home/manjaro/Stuff/python-wayback-machine-downloader/waybackup_snapshots/example.com/20240225193302id_/assets/css/custom-styles.css

Non-verbose:

55/81 - W:2 - SUCCESS - 20240225193302 - https://example.com/assets/css/custom-styles.css
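
A non-verbose log line like the one above can be split into fields with a regular expression. This is a sketch based only on that single sample line, so the pattern is an assumption about the format and may need adjusting; `parse_log_line` is a hypothetical helper.

```python
import re

# Pattern derived from the sample line: counter, worker, result, timestamp, URL.
LOG_LINE = re.compile(
    r"^(?P<current>\d+)/(?P<total>\d+) - W:(?P<worker>\d+) - "
    r"(?P<result>\S+) - (?P<timestamp>\d{14}) - (?P<url>\S+)$"
)


def parse_log_line(line):
    """Parse one non-verbose log line; return None if it does not match."""
    match = LOG_LINE.match(line.strip())
    if match is None:
        return None
    fields = match.groupdict()
    # Convert the numeric fields from strings to integers.
    for key in ("current", "total", "worker"):
        fields[key] = int(fields[key])
    return fields


line = "55/81 - W:2 - SUCCESS - 20240225193302 - https://example.com/assets/css/custom-styles.css"
print(parse_log_line(line))
```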

Debugging

Exceptions will be written into waybackup_error.log (each run overwrites the file).

Contributing

I'm always happy to receive feature requests that improve the usability of this tool. Feel free to make suggestions and report issues. The project is still far from perfect.
