Please read the design thoroughly before modifying the code. It lists some gotchas which may be helpful to know ahead of time.
Each command goes into the subcommands module. Each has a docopt string for the command-specific arguments. The .command() function is invoked with the result of merging the global arguments dict (from the main script) with the command-specific arguments dict.
If the module provides the args_schema variable, the command-specific arguments will be passed through the schema validation.
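The flow above can be pictured with the following minimal sketch. It is not the project's actual code: the example command, its --limit option, the invoke helper and the stand-in schema class are all hypothetical, and the docopt parsing done by the main script is left out.

```python
"""
Usage:
  example [--limit=<n>] <group_name>

Options:
  --limit=<n>  Hypothetical option, present only for illustration.
"""
# Sketch of a subcommand module. The docstring above plays the role of the
# docopt string; parsing it is left to the main script and is not shown here.


class LimitSchema:
    """Stand-in for a real schema object; only a validate() method is assumed."""

    def validate(self, args):
        validated = dict(args)
        if validated.get('--limit') is not None:
            # int() raises ValueError on bad input
            validated['--limit'] = int(validated['--limit'])
        return validated


# Optional: when present, the command-specific arguments are validated with it.
args_schema = LimitSchema()


def command(arguments):
    """Entry point; receives the merged global + command-specific arguments."""
    return (arguments['<group_name>'], arguments['--limit'], arguments['--verbose'])


def invoke(global_args, command_args, schema=None):
    """Hypothetical helper mimicking what the main script does before command()."""
    if schema is not None:
        command_args = schema.validate(command_args)
    merged = dict(global_args)
    # Assumption: command-specific values win if a key appears in both dicts.
    merged.update(command_args)
    return command(merged)
```

Which dict takes precedence on a key clash is an assumption here; check the main script before relying on it.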
Scraping is done with Selenium to allow for scraping private sites.
The Yahoo Groups undocumented JSON API is used:
https://groups.yahoo.com/api/v1/groups/<group_name>/messages?count=1&sortOrder=desc&direction=-1
to get the total number of messages.
https://groups.yahoo.com/api/v1/groups/<group_name>/messages/<message_number>
to get the data, with HTML content, for the given message.
https://groups.yahoo.com/api/v1/groups/<group_name>/messages/<message_number>/raw
to get the data, with raw content, for the given message.
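For reference, the three URLs above could be built with small helpers like these. The function names are made up for illustration; only the URL shapes come from this document.

```python
from urllib.parse import quote, urlencode

API_ROOT = "https://groups.yahoo.com/api/v1/groups"


def message_count_url(group_name):
    """URL of the request used to find the total number of messages."""
    query = urlencode({"count": 1, "sortOrder": "desc", "direction": -1})
    return "%s/%s/messages?%s" % (API_ROOT, quote(group_name, safe=""), query)


def message_url(group_name, message_number, raw=False):
    """URL of one message, with HTML content by default or raw content if raw=True."""
    url = "%s/%s/messages/%d" % (API_ROOT, quote(group_name, safe=""), message_number)
    return url + "/raw" if raw else url
```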
All the message data from the API is combined and inserted into a MongoDB database with the same name as the group. Data is stored as returned from the API, except that the message id is stored in the _id field.
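A rough sketch of that combination step follows. The key names msgId and rawEmail are guesses made for illustration, not taken from the API or from this design, and whether the id is copied or moved into _id is also an assumption.

```python
def to_mongo_document(html_message, raw_message, id_field="msgId", raw_field="rawEmail"):
    """Combine the two API responses for one message into a single document.

    Assumptions: both arguments are dicts decoded from the JSON responses,
    the message id lives under id_field, and the raw body lives under
    raw_field. Both key names are hypothetical.
    """
    doc = dict(html_message)
    doc[raw_field] = raw_message.get(raw_field)
    # The message id doubles as the primary key (copied here, not moved).
    doc["_id"] = doc[id_field]
    return doc
```

With pymongo, a document like this could then be passed to a collection's insert_one method; the collection name is not specified in this design.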
Files are scraped through the human-consumable interface (i.e. the website) as I couldn't figure out the JSON API calls for them.
They are stored in a GridFS instance with the name <group_name>_gridfs.
All the group data, both messages and files, can be dumped into a static site which is viewable without any internet connection whatsoever, and without needing to run a local web server.
The static site is an AngularJS app. The message index data is stored as a separate .js file and loaded with JSONP. This allows us to essentially load data from the local filesystem.
The messages themselves are stored in batches of n messages (default 500) and loaded on-demand.
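The batching and JSONP wrapping could look roughly like this. It is a sketch only: the callback name and the helper names are assumptions, not what dump_site actually emits.

```python
import json


def batches(messages, n=500):
    """Yield consecutive lists of at most n messages."""
    for start in range(0, len(messages), n):
        yield messages[start:start + n]


def render_jsonp(callback, data):
    """Wrap JSON data in a function call so a <script> tag can load it."""
    return "%s(%s);" % (callback, json.dumps(data))
```

Because a JSONP file is just a script, the browser can load it from the local filesystem without making an XHR request.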
A site is "dumped" by copying everything but the data from the static_site_template directory, and rendering the necessary data into jsonp files.
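The copy step might be sketched as follows, assuming the data lives in a directory called data inside the template. That name is a guess and copy_template is a hypothetical helper.

```python
import shutil


def copy_template(template_dir, dest_dir, data_dir_name="data"):
    """Copy the site template to dest_dir, skipping the data directory.

    data_dir_name is an assumed name. Note that ignore_patterns matches
    entries with that name at any depth, not just at the top level.
    dest_dir must not exist yet.
    """
    shutil.copytree(template_dir, dest_dir,
                    ignore=shutil.ignore_patterns(data_dir_name))
```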
The group files are copied into the files directory, and it is left up to the browser to display the contents, as if browsing any other local directory.
Angular templates are in the modules subdirectory. The tricky part is that there is no way to load these .html files if the page is being served from the file-system. Modern browsers prevent access to arbitrary files due to security concerns, and they won't load the <script type="text/ng-template" ...> tags.
To work around this without being forced to include all the .html in one file, there's the modules/load-templates.js file. This preloads all the templates into $templateCache. This file is automatically generated by dump_site, which reads the modules/*.html files and inserts their data into the file.
NOTE: Template ids all start with ./, e.g. modules/foo.html will be loaded as the template called ./modules/foo.html. If you include a template without a leading ./, it will not work on the generated site.
Paths are followed recursively.
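To make the template preloading concrete, here is a sketch of how a file like modules/load-templates.js could be generated. The function name and the AngularJS module name "app" are placeholders, and the real dump_site output may be laid out differently; only $templateCache.put and the ./ prefix convention come from the notes above.

```python
import json
import os


def render_template_loader(modules_dir, app_name="app"):
    """Return JavaScript that preloads every .html file under modules_dir.

    app_name is a placeholder for the site's AngularJS module name.
    """
    parent = os.path.dirname(os.path.abspath(modules_dir))
    puts = []
    for dirpath, dirnames, filenames in os.walk(modules_dir):
        dirnames.sort()  # deterministic traversal order
        for name in sorted(filenames):
            if not name.endswith(".html"):
                continue
            path = os.path.join(dirpath, name)
            # Template ids are relative to the site root and start with "./".
            rel = os.path.relpath(path, parent).replace(os.sep, "/")
            with open(path, encoding="utf-8") as f:
                html = f.read()
            # json.dumps produces valid JavaScript string literals.
            puts.append("  $templateCache.put(%s, %s);"
                        % (json.dumps("./" + rel), json.dumps(html)))
    return ("angular.module(%s).run(['$templateCache', function($templateCache) {\n"
            "%s\n}]);\n" % (json.dumps(app_name), "\n".join(puts)))
```

Writing the returned string to modules/load-templates.js and including it with a plain script tag keeps the templates available even when the page is opened from the filesystem.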