User Story:
When there is a volume issue for a source, my first impulse is to want to know when it was last rescraped for feeds!
An ancillary bit of data would belong in the feeds table: when the feed was last seen in a rescrape (i.e., when the RSS file or sitemap was last discoverable or advertised). Then again, I'm prone to adding columns to tables (look at the rss-fetcher feeds table):
id | 905793
sources_id | 665279
name | La Provincia
url | http://www.laprovinciacr.it/rss.jsp?sezione=503
active | t
last_fetch_attempt | 2024-09-16 10:53:40.698808
last_fetch_success | 2024-09-16 10:53:40.698808
last_fetch_hash | 8091191156edd9fd2906c551378c89a5
last_fetch_failures | 0
created_at | 2022-12-23 17:44:04.736466
http_etag |
http_last_modified |
next_fetch_attempt | 2024-09-17 10:53:41.189757
queued | f
system_enabled | t
update_minutes |
http_304 |
system_status | Working
last_new_stories | 2024-05-29 03:03:12.928357
rss_title | La Provincia
poll_minutes | 1440
Once this column is available it should be easy to write a query against a collection to return recently scraped sources that have no (active) feeds. This could be just a canned SQL query, or a button on the web site.
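As a rough sketch of what such a canned query might look like, here it is run against an in-memory SQLite database so it can be tried in isolation. The schema is an assumption: only `feeds.sources_id` and `feeds.active` appear in the table dump above, while the `last_rescraped` column name and the `collection_sources` association table are hypothetical stand-ins for whatever the real database uses.

```python
import sqlite3

# Hypothetical schema: only feeds.sources_id and feeds.active come from the
# issue; last_rescraped and collection_sources are assumed names.
SCHEMA = """
CREATE TABLE sources (id INTEGER PRIMARY KEY, name TEXT, last_rescraped TEXT);
CREATE TABLE collection_sources (collection_id INTEGER, source_id INTEGER);
CREATE TABLE feeds (id INTEGER PRIMARY KEY, sources_id INTEGER, active INTEGER);
"""

# Sources in one collection, rescraped on or after :since, with no active feed.
QUERY = """
SELECT s.id, s.name, s.last_rescraped
FROM sources AS s
JOIN collection_sources AS cs ON cs.source_id = s.id
WHERE cs.collection_id = :collection_id
  AND s.last_rescraped >= :since
  AND NOT EXISTS (
      SELECT 1 FROM feeds AS f
      WHERE f.sources_id = s.id AND f.active = 1
  )
ORDER BY s.last_rescraped DESC
"""


def sources_without_active_feeds(conn, collection_id, since):
    """Return (id, name, last_rescraped) rows; `since` is an ISO date string."""
    params = {"collection_id": collection_id, "since": since}
    return conn.execute(QUERY, params).fetchall()


def demo_connection():
    """Build a small in-memory database with made-up rows for illustration."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    conn.executemany(
        "INSERT INTO sources VALUES (?, ?, ?)",
        [
            (1, "A", "2024-09-10"),  # has an active feed
            (2, "B", "2024-09-12"),  # only an inactive feed
            (3, "C", "2024-09-14"),  # no feeds at all
            (4, "D", "2023-01-01"),  # no feeds, but not recently rescraped
            (5, "E", "2024-09-15"),  # no feeds, different collection
        ],
    )
    conn.executemany(
        "INSERT INTO collection_sources VALUES (?, ?)",
        [(1, 1), (1, 2), (1, 3), (1, 4), (2, 5)],
    )
    conn.executemany(
        "INSERT INTO feeds VALUES (?, ?, ?)",
        [(10, 1, 1), (11, 2, 0)],
    )
    return conn
```

Calling `sources_without_active_feeds(demo_connection(), 1, "2024-09-01")` returns sources B and C from the made-up data. The same `NOT EXISTS` shape should carry over to PostgreSQL once the real column and table names are substituted.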
From there, someone could do full-site sitemap crawls (the code already exists in sitemap-tools, along with a command line tool) to look for possible "deep links" for useful sitemaps, but interpreting the results might require experimentation, or development of some heuristics, like:
has "last modified" on links, and they're recent
has google news tags
name ends in _1.xml (seems to be a convention for the most recent in a series of files)
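The three heuristics above could be checked mechanically with something like the following sketch. It uses only the Python standard library rather than sitemap-tools (whose API is not shown here). The function name, the seven-day "recent" window, and the returned dictionary keys are all assumptions made for illustration. The two namespace URIs are the standard sitemap and Google News sitemap namespaces.

```python
import xml.etree.ElementTree as ET
from datetime import datetime, timedelta, timezone

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"
NEWS_NS = "http://www.google.com/schemas/sitemap-news/0.9"


def _parse_lastmod(text):
    """Parse a <lastmod> value; return an aware datetime, or None if unparseable."""
    try:
        dt = datetime.fromisoformat(text.strip().replace("Z", "+00:00"))
    except ValueError:
        return None
    # Treat values without a timezone as UTC (an assumption).
    return dt if dt.tzinfo else dt.replace(tzinfo=timezone.utc)


def sitemap_signals(url, xml_text, now=None, max_age_days=7):
    """Report which of the three heuristics a sitemap satisfies.

    Hypothetical helper: max_age_days=7 is an arbitrary choice of "recent".
    """
    now = now or datetime.now(timezone.utc)
    cutoff = now - timedelta(days=max_age_days)
    root = ET.fromstring(xml_text)

    # Heuristic 1: at least one <lastmod> is recent.
    lastmods = [
        _parse_lastmod(el.text or "")
        for el in root.iter(f"{{{SITEMAP_NS}}}lastmod")
    ]
    recent = any(dt is not None and dt >= cutoff for dt in lastmods)

    # Heuristic 2: any element in the Google News sitemap namespace.
    has_news = any(el.tag.startswith(f"{{{NEWS_NS}}}") for el in root.iter())

    # Heuristic 3: file name ends in _1.xml (query string ignored).
    path = url.split("?", 1)[0]

    return {
        "recent_lastmod": recent,
        "google_news_tags": has_news,
        "first_in_series": path.endswith("_1.xml"),
    }
```

How to weigh the three signals against each other is exactly the part that would need experimentation. The sketch only reports them separately and does not attempt a combined score.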