Epic Story: Automated front page coding by topic #198

Open
schock opened this issue Jul 21, 2015 · 1 comment

Comments

@schock

schock commented Jul 21, 2015

(Future User Story): I'm a PageOneX user, and I want to simply type in a phrase or keyword, choose my newspapers and date range, and see an automatically generated PageOneX visualization of front page coverage.

Notes:

  • step 1: solve keyword search (use LexisNexis, Google News, Media Cloud, etc. to find the dates of stories containing the keywords, with the search limited to page A1).
  • step 2: use OCR to locate the keywords on the front pages for those dates.
  • step 3: use machine learning to train software to identify the spatial boundaries of the article that contains each keyword, and select those boundaries :)
  • voila! automated PageOneX :)
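
A minimal sketch of the hand-off between step 2 and step 3, assuming the OCR engine returns one box per word in reading order. The `WordBox` type and `find_phrase_boxes` function are hypothetical names made up for illustration, not part of PageOneX. It only locates the phrase itself; growing that box out to the full article boundary is the part step 3 leaves to machine learning.

```python
from dataclasses import dataclass
import string


@dataclass(frozen=True)
class WordBox:
    """One OCR'd word with its position on the page (hypothetical structure)."""
    text: str
    left: float
    top: float
    width: float
    height: float


def _norm(token):
    # Compare words case-insensitively and ignore surrounding punctuation.
    return token.strip(string.punctuation).lower()


def find_phrase_boxes(words, phrase):
    """Return one (left, top, right, bottom) box per occurrence of the phrase.

    `words` is assumed to be in reading order, as most OCR output is.
    """
    target = [t for t in (_norm(p) for p in phrase.split()) if t]
    if not target:
        return []
    tokens = [_norm(w.text) for w in words]
    n = len(target)
    hits = []
    for i in range(len(tokens) - n + 1):
        if tokens[i:i + n] == target:
            span = words[i:i + n]
            # Union of the word boxes that make up this occurrence.
            hits.append((
                min(w.left for w in span),
                min(w.top for w in span),
                max(w.left + w.width for w in span),
                max(w.top + w.height for w in span),
            ))
    return hits


words = [
    WordBox("Climate", 10, 20, 60, 12),
    WordBox("change,", 75, 20, 55, 12),
    WordBox("today", 135, 20, 40, 12),
]
boxes = find_phrase_boxes(words, "climate change")
```

Each returned box could then serve as a seed point for whatever article-boundary detector step 3 ends up using.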
@numeroteca
Member

This approach, which uses information extracted from PDFs, looks promising: https://github.com/samzhang111/frontpages/
You first have to set up the script to download front pages from the Newseum daily.

@samzhang111 has also worked with some Python libraries to detect spatial boundaries in PDFs.
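
To make the PDF idea concrete, here is a rough sketch that assumes a PDF library has already produced text blocks with bounding boxes. The dictionary layout (`text`, `x0`, `y0`, `x1`, `y1`) and both function names are assumptions for illustration and are not taken from the frontpages repository. It picks the blocks that mention a keyword and reports the share of the page they occupy.

```python
def blocks_matching(blocks, keyword):
    """Return the text blocks whose text contains the keyword (case-insensitive)."""
    kw = keyword.lower()
    return [b for b in blocks if kw in b["text"].lower()]


def coverage_share(blocks, page_width, page_height):
    """Fraction of the page area covered by the given blocks.

    Assumes the blocks do not overlap each other.
    """
    area = sum((b["x1"] - b["x0"]) * (b["y1"] - b["y0"]) for b in blocks)
    return area / (page_width * page_height)


# Hypothetical blocks for a 600 x 800 point front page.
page_blocks = [
    {"text": "Election results are in", "x0": 0, "y0": 0, "x1": 300, "y1": 200},
    {"text": "Weather", "x0": 300, "y0": 0, "x1": 600, "y1": 200},
]
hits = blocks_matching(page_blocks, "election")
share = coverage_share(hits, page_width=600, page_height=800)
```

A block-level match like this would skip OCR entirely for papers whose PDFs carry embedded text, though image-only PDFs would still need the OCR route.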
