Add a "rescraped_at" column to sources_source table? #788

Open
philbudne opened this issue Sep 16, 2024 · 1 comment

@philbudne
Contributor

User Story:
When there is a volume issue for a source, my first impulse is to want to know when it was last rescraped for feeds!

An ancillary bit of data for the feeds table would be when each feed was last seen in a rescrape (i.e., when the RSS file or sitemap was last discoverable or advertised), but I'm prone to adding columns to tables (look at the rss-fetcher feeds table):

id                  | 905793
sources_id          | 665279
name                | La Provincia
url                 | http://www.laprovinciacr.it/rss.jsp?sezione=503
active              | t
last_fetch_attempt  | 2024-09-16 10:53:40.698808
last_fetch_success  | 2024-09-16 10:53:40.698808
last_fetch_hash     | 8091191156edd9fd2906c551378c89a5
last_fetch_failures | 0
created_at          | 2022-12-23 17:44:04.736466
http_etag           | 
http_last_modified  | 
next_fetch_attempt  | 2024-09-17 10:53:41.189757
queued              | f
system_enabled      | t
update_minutes      | 
http_304            | 
system_status       | Working
last_new_stories    | 2024-05-29 03:03:12.928357
rss_title           | La Provincia
poll_minutes        | 1440
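
For illustration, a minimal sketch of what adding the proposed column might look like, assuming sources_source is a Django-managed table (as the name suggests); the app/model names and the migration dependency are guesses, and a plain ALTER TABLE ... ADD COLUMN would do the same job:

from django.db import migrations, models

class Migration(migrations.Migration):
    # placeholder dependency; the real previous migration name will differ
    dependencies = [
        ("sources", "0001_initial"),
    ]

    operations = [
        migrations.AddField(
            model_name="source",
            name="rescraped_at",
            # nullable, since existing rows have no record of a rescrape yet
            field=models.DateTimeField(null=True, blank=True),
        ),
    ]
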
@philbudne
Contributor Author

Once this column is available, it should be easy to write a query against a collection to return recently rescraped sources that have no (active) feeds. This could be just a canned SQL query, or a button on the web site; a rough sketch follows.
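
For illustration only, a sketch of the canned query, assuming direct Postgres access via psycopg2; the sources_feed table and its source_id/active columns are guesses from the naming above, and filtering by collection would need a join against whatever table links collections to sources:

import psycopg2

QUERY = """
SELECT s.id, s.name, s.rescraped_at
  FROM sources_source s
 WHERE s.rescraped_at > now() - interval '30 days'
   AND NOT EXISTS (SELECT 1
                     FROM sources_feed f
                    WHERE f.source_id = s.id
                      AND f.active)
 ORDER BY s.rescraped_at DESC;
"""

def recently_rescraped_without_active_feeds(dsn: str):
    # (id, name, rescraped_at) for sources rescraped in the last 30 days
    # that currently have no active feeds
    with psycopg2.connect(dsn) as conn, conn.cursor() as cur:
        cur.execute(QUERY)
        return cur.fetchall()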

From there, someone could do full-site sitemap crawls (the code already exists in sitemap-tools, along with a command-line tool) to look for possible "deep links" for useful sitemaps, but interpreting the results might require experimentation or developing some heuristics (sketched in code after the list), like:

  1. has "last modified" on links, and they're recent
  2. has google news tags
  3. name ends in _1.xml (seems to be a convention for the most recent in a series of files)
  4. is reasonably small
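
Not the sitemap-tools code, just an illustrative scoring of a candidate sitemap against those heuristics; the namespaces are the standard sitemap and Google News ones, while the function name, weights, and the 7-day / 1 MB thresholds are made up for the sketch:

import re
import xml.etree.ElementTree as ET
from datetime import datetime, timedelta, timezone

NS = {
    "sm": "http://www.sitemaps.org/schemas/sitemap/0.9",
    "news": "http://www.google.com/schemas/sitemap-news/0.9",
}

def score_sitemap(url: str, xml_bytes: bytes) -> int:
    """Crude score for a candidate sitemap: higher = more likely useful."""
    score = 0
    root = ET.fromstring(xml_bytes)

    # 1. has recent <lastmod> values on links
    cutoff = datetime.now(timezone.utc) - timedelta(days=7)
    for lastmod in root.iterfind(".//sm:lastmod", NS):
        try:
            when = datetime.fromisoformat(lastmod.text.strip().replace("Z", "+00:00"))
        except (AttributeError, ValueError):
            continue
        if when.tzinfo is None:
            when = when.replace(tzinfo=timezone.utc)
        if when >= cutoff:
            score += 2
            break

    # 2. has Google News tags
    if root.find(".//news:news", NS) is not None:
        score += 2

    # 3. name ends in _1.xml (often the newest file in a series)
    if re.search(r"_1\.xml$", url):
        score += 1

    # 4. is reasonably small
    if len(xml_bytes) <= 1_000_000:
        score += 1

    return score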
