Group Call Summary 2017 02 08
From @danielballan:
I invite others to comment on or edit the following. After discussion maybe it can move to a README or somewhere more permanent. Apologies if it duplicates info available elsewhere.
There is a spreadsheet with a list of 30 000 URLs of government web pages of interest. The pages at these URLs are captured every ~3 days for PageFreezer. Roughly 80% of these URLs happen to be on the same ~150 domains and subdomains (and all .gov). Weekly, these subdomains are captured recursively and stored in zip files.
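As a rough way to check how concentrated the URL list is on a small set of hosts, a sketch like the following could be run against the spreadsheet's URL column. The function names and the sample URLs are made up for illustration; only the standard library is used.

```python
from collections import Counter
from urllib.parse import urlsplit


def hostname(url):
    """Return the host part of a URL (lowercased by urlsplit), or '' if absent."""
    return urlsplit(url).hostname or ""


def coverage_by_host(urls, top_n):
    """Fraction of URLs that sit on the top_n most common hosts."""
    counts = Counter(hostname(u) for u in urls)
    covered = sum(n for _, n in counts.most_common(top_n))
    return covered / len(urls) if urls else 0.0


# Made-up sample; the real input would be the URL column of the spreadsheet.
sample = [
    "https://www.example.gov/a",
    "https://www.example.gov/b",
    "https://www.example.gov/c",
    "https://data.example.gov/x",
    "https://other.example.gov/y",
]

print(coverage_by_host(sample, top_n=1))
```

With the real list, `coverage_by_host(urls, 150)` would give the share of URLs on the 150 most common hosts, which can be compared against the rough 80% estimate.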
Current workflow: Versionista flags changes to 25 000 URLs. Our versionista-outputter runs and writes a row for each changed URL into a series of CSVs, one for each of the ~100 domains being tracked. These CSVs are distributed to analysts, who copy the rows into Google spreadsheets (a step that should be automated). Analysts then hand-label each row with 12 different "types of changes" and 6 different "types of significance"; each row can receive multiple labels. This system, which uses no filters, typically produces ~3 000-5 000 changes over the course of 3 days from the total 25 000 URLs.
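To make the "one CSV per domain" step concrete, here is a minimal sketch of grouping changed-URL rows by host and rendering each group as CSV text. This is not the actual versionista-outputter code; the column names (`url`, `diff_link`) and the sample rows are assumptions chosen for illustration.

```python
import csv
import io
from collections import defaultdict
from urllib.parse import urlsplit

# Hypothetical column names; the real CSVs may use different ones.
FIELDS = ["url", "diff_link"]


def rows_by_domain(rows):
    """Group row dicts by the host of their 'url' field."""
    grouped = defaultdict(list)
    for row in rows:
        host = urlsplit(row["url"]).hostname or "unknown"
        grouped[host].append(row)
    return dict(grouped)


def to_csv_text(rows):
    """Render rows as CSV text with a header line."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=FIELDS, lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buf.getvalue()


# Made-up rows for illustration.
changed = [
    {"url": "https://www.example.gov/a", "diff_link": "https://diffs.example.org/1"},
    {"url": "https://data.example.gov/x", "diff_link": "https://diffs.example.org/2"},
    {"url": "https://www.example.gov/b", "diff_link": "https://diffs.example.org/3"},
]

for host, rows in rows_by_domain(changed).items():
    print(host)
    print(to_csv_text(rows))
```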
Our goal is to filter and/or prioritize those rows to direct analysts to important differences, and also to provide more useful columns that will help them judge which rows to follow up on. Additionally, one column should be a link to a visual diff, something which Versionista currently provides but PageFreezer currently does not.
To start, two-tier prioritization:
- Automatically filter and prioritize the 30 000, directing analyst effort to the most likely important entries.
- Automatically highlight any major changes in the documents in the ~150 domains/subdomains that aren't in those 30 000.
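One possible way to express the two tiers in code is sketched below: a URL on the curated list is tier 1, and a URL that is not on the list but sits on one of the tracked domains/subdomains is tier 2. The function name and the sample data are hypothetical.

```python
from urllib.parse import urlsplit


def assign_tier(url, listed_urls, tracked_hosts):
    """Return 1 for a listed URL, 2 for an unlisted URL on a tracked host, else None."""
    if url in listed_urls:
        return 1
    if (urlsplit(url).hostname or "") in tracked_hosts:
        return 2
    return None


# Made-up data standing in for the spreadsheet and the tracked domain list.
listed_urls = {"https://www.example.gov/a"}
tracked_hosts = {"www.example.gov"}

print(assign_tier("https://www.example.gov/a", listed_urls, tracked_hosts))
print(assign_tier("https://www.example.gov/new-page", listed_urls, tracked_hosts))
```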
From @titantiumbones:
For tier 1, how is this summary of the architecture we need:
pf-diff: command-line service capable of carrying out diffs on ca. 10^4 pages at a time. Outputs a JSON file that (hopefully) matches the format of the PageFreezer API.
pf-filter: filtration tool that accepts the output of pf-diff and adds priority ranking (can be Boolean for now)
pf-outputter: outputs a JSON or CSV report (preferably both) of changes for use by analysts
pf-viewer: certain fields in the records created by pf-outputter link to diff views of the archived pages. pf-viewer serves them over the web.
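To illustrate how the pf-filter and pf-outputter stages might hand data to each other, here is a minimal sketch. The record fields (`url`, `changed_fraction`, `diff_link`), the threshold rule, and the function names are all assumptions; they are not the PageFreezer API format, and the real priority logic is still to be decided.

```python
import csv
import io
import json

# Hypothetical report columns.
FIELDS = ["url", "changed_fraction", "priority", "diff_link"]


def add_priority(records, threshold=0.1):
    """pf-filter stand-in: return copies of the records with a Boolean 'priority'."""
    return [dict(r, priority=r["changed_fraction"] >= threshold) for r in records]


def to_json(records):
    """pf-outputter stand-in: JSON report."""
    return json.dumps(records, indent=2)


def to_csv(records):
    """pf-outputter stand-in: CSV report with the same records."""
    buf = io.StringIO()
    writer = csv.DictWriter(
        buf, fieldnames=FIELDS, lineterminator="\n", extrasaction="ignore"
    )
    writer.writeheader()
    writer.writerows(records)
    return buf.getvalue()


# Made-up pf-diff output for illustration.
diffs = [
    {"url": "https://www.example.gov/a", "changed_fraction": 0.4,
     "diff_link": "https://viewer.example.org/diff/1"},
    {"url": "https://www.example.gov/b", "changed_fraction": 0.01,
     "diff_link": "https://viewer.example.org/diff/2"},
]

flagged = add_priority(diffs)
print(to_json(flagged))
print(to_csv(flagged))
```

Keeping the priority as an extra field on each record, rather than dropping low-priority rows, leaves the decision about what to hide to the outputter or the analyst.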