Add a "rescraped_at" column to sources_source table? #788

Open
philbudne opened this issue Sep 16, 2024 · 1 comment

@philbudne
Contributor

User Story:
When there is a volume issue for a source, my first impulse is to want to know when it was last rescraped for feeds!

An ancillary bit of data for the feeds table would be when each feed was last seen in a rescrape (i.e., when the RSS file or sitemap was last discoverable or advertised), but I'm prone to adding columns to tables (look at the rss-fetcher feeds table):

id                  | 905793
sources_id          | 665279
name                | La Provincia
url                 | http://www.laprovinciacr.it/rss.jsp?sezione=503
active              | t
last_fetch_attempt  | 2024-09-16 10:53:40.698808
last_fetch_success  | 2024-09-16 10:53:40.698808
last_fetch_hash     | 8091191156edd9fd2906c551378c89a5
last_fetch_failures | 0
created_at          | 2022-12-23 17:44:04.736466
http_etag           | 
http_last_modified  | 
next_fetch_attempt  | 2024-09-17 10:53:41.189757
queued              | f
system_enabled      | t
update_minutes      | 
http_304            | 
system_status       | Working
last_new_stories    | 2024-05-29 03:03:12.928357
rss_title           | La Provincia
poll_minutes        | 1440
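
For illustration, a minimal sketch of what adding the proposed column might look like, assuming sources_source is a Django-managed table (as the name suggests); the app/model names and the migration dependency are guesses, and a plain ALTER TABLE ... ADD COLUMN would do the same job:

from django.db import migrations, models

class Migration(migrations.Migration):
    # placeholder dependency; the real previous migration name will differ
    dependencies = [
        ("sources", "0001_initial"),
    ]

    operations = [
        migrations.AddField(
            model_name="source",
            name="rescraped_at",
            # nullable, since existing rows have no record of a rescrape yet
            field=models.DateTimeField(null=True, blank=True),
        ),
    ]
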
@philbudne
Contributor Author

Once this column is available, it should be easy to write a query against a collection to return recently rescraped sources that have no (active) feeds. This could be just a canned SQL query, or a button on the web site; a rough sketch follows.
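
For illustration only, a sketch of the canned query, assuming direct Postgres access via psycopg2; the sources_feed table and its source_id/active columns are guesses from the naming above, and filtering by collection would need a join against whatever table links collections to sources:

import psycopg2

QUERY = """
SELECT s.id, s.name, s.rescraped_at
  FROM sources_source s
 WHERE s.rescraped_at > now() - interval '30 days'
   AND NOT EXISTS (SELECT 1
                     FROM sources_feed f
                    WHERE f.source_id = s.id
                      AND f.active)
 ORDER BY s.rescraped_at DESC;
"""

def recently_rescraped_without_active_feeds(dsn: str):
    # (id, name, rescraped_at) for sources rescraped in the last 30 days
    # that currently have no active feeds
    with psycopg2.connect(dsn) as conn, conn.cursor() as cur:
        cur.execute(QUERY)
        return cur.fetchall()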

From there, someone could do full-site sitemap crawls (the code already exists in sitemap-tools, along with a command-line tool) to look for possible "deep links" for useful sitemaps, but interpreting the results might require experimentation or developing some heuristics (sketched in code after the list), like:

  1. has "last modified" on links, and they're recent
  2. has google news tags
  3. name ends in _1.xml (seems to be a convention for the most recent in a series of files)
  4. is reasonably small
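
Not the sitemap-tools code, just an illustrative scoring of a candidate sitemap against those heuristics; the namespaces are the standard sitemap and Google News ones, while the function name, weights, and the 7-day / 1 MB thresholds are made up for the sketch:

import re
import xml.etree.ElementTree as ET
from datetime import datetime, timedelta, timezone

NS = {
    "sm": "http://www.sitemaps.org/schemas/sitemap/0.9",
    "news": "http://www.google.com/schemas/sitemap-news/0.9",
}

def score_sitemap(url: str, xml_bytes: bytes) -> int:
    """Crude score for a candidate sitemap: higher = more likely useful."""
    score = 0
    root = ET.fromstring(xml_bytes)

    # 1. has recent <lastmod> values on links
    cutoff = datetime.now(timezone.utc) - timedelta(days=7)
    for lastmod in root.iterfind(".//sm:lastmod", NS):
        try:
            when = datetime.fromisoformat(lastmod.text.strip().replace("Z", "+00:00"))
        except (AttributeError, ValueError):
            continue
        if when.tzinfo is None:
            when = when.replace(tzinfo=timezone.utc)
        if when >= cutoff:
            score += 2
            break

    # 2. has Google News tags
    if root.find(".//news:news", NS) is not None:
        score += 2

    # 3. name ends in _1.xml (often the newest file in a series)
    if re.search(r"_1\.xml$", url):
        score += 1

    # 4. is reasonably small
    if len(xml_bytes) <= 1_000_000:
        score += 1

    return score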
