regularly delete RSS feeds that go into "system disabled"? #680
Comments
rahulbot added the question (Further information is requested), directory, and low (low priority) labels on Jul 3, 2024
NOTE: There isn't currently code to detect if a feed has been removed
from the web-search (mcweb) feed table....
Tallies of the last system_status for system-disabled feeds:
```
***@***.***:~/rss-fetcher$ cat status-disabled.psql
select system_status, count(1)
from feeds
where not system_enabled
group by system_status
order by count desc;
***@***.***:~/rss-fetcher$ psql rss-fetcher < !$
psql rss-fetcher < status-disabled.psql
Pseudo-terminal will not be allocated because stdin is not a terminal.
system_status | count
--------------------------------------------------------------+-------
HTTP 404 Not Found | 14494
HTTP 403 Forbidden | 12250
Working | 11978
parse error | 11009
unknown hostname | 2847
HTTP 410 Gone | 1424
SSL error | 1130
read timeout | 937
connect timeout | 894
HTTP 500 Internal Server Error | 740
connection error | 517
DNS error | 240
HTTP 429 Too Many Requests | 228
too many redirects | 196
HTTP 503 Service Unavailable | 189
HTTP 401 Unauthorized | 152
HTTP 400 Bad Request | 102
HTTP 404 | 101
HTTP 405 Method Not Allowed | 71
HTTP 405 Not Allowed | 62
HTTP 502 Bad Gateway | 59
HTTP 503 Service Temporarily Unavailable | 56
HTTP 522 | 43
HTTP 404 File Not Found | 33
HTTP 501 Origin hit suppressed (0) | 30
HTTP 404 404 Not Found | 22
HTTP 521 | 20
fetch error | 17
HTTP 409 Conflict | 14
HTTP 202 Accepted | 14
HTTP 418 Unknown Error | 13
HTTP 404 Not found | 13
HTTP 523 | 13
HTTP 520 | 11
job timeout | 10
HTTP 403 OK | 10
HTTP 530 | 8
HTTP 526 | 7
HTTP 404 Not Fround | 7
HTTP 404 404 | 6
HTTP 503 Backend fetch failed | 6
bad URL | 5
HTTP 404 NOT FOUND | 5
HTTP 423 Locked | 5
HTTP 404 OK | 4
HTTP 404 not found | 4
HTTP 504 Gateway Time-out | 4
HTTP 524 | 3
HTTP 406 Not Acceptable | 3
HTTP 401 Restricted | 3
HTTP 404 Unknown site | 3
HTTP 500 500 Service unavailable (with message) | 3
HTTP 204 No Content | 3
HTTP 404 Page not found | 2
HTTP 101 Switching Protocols | 2
HTTP 403 | 2
HTTP 500 | 2
HTTP 405 Not allowed. | 2
HTTP 503 Under Maintenance | 1
HTTP 509 | 1
HTTP 520 Origin Server Unavailable | 1
HTTP 403 Site Disabled | 1
HTTP 403 HTTP Forbidden | 1
HTTP 403 Access Denied | 1
HTTP 402 | 1
HTTP 401 HTTP Forbidden | 1
HTTP 302 Found | 1
HTTP 413 Request Entity Too Large | 1
HTTP 410 Not Found | 1
HTTP 418 I'm a teapot | 1
HTTP 410 Disparu | 1
HTTP 421 Misdirected Request | 1
HTTP 423 | 1
HTTP 451 | 1
HTTP 502 | 1
HTTP 404 Página no encontrada | 1
HTTP 404 Page not found: /rss.xml | 1
HTTP 503 Backend unavailable, connection timeout | 1
HTTP 404 Page not found: /feed/latest-rss.xml | 1
HTTP 503 Service unavailable | 1
HTTP 404 Page Not Found | 1
HTTP 503 Service Unavailable: Back-end server is at capacity | 1
(82 rows)
```
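Many rows in the tally above differ only in the reason phrase the server sent (for example, the several `HTTP 404 ...` variants). As a rough sketch for an audit, the counts could be collapsed by numeric HTTP code before ranking them. The function names and `sample_rows` below are illustrative and not part of rss-fetcher; the sample values are copied from a few rows of the table.

```python
import re
from collections import Counter

# Matches statuses such as "HTTP 404 Not Found" or a bare "HTTP 404".
_HTTP_RE = re.compile(r"^HTTP (\d{3})\b")


def status_bucket(system_status: str) -> str:
    """Return "HTTP <code>" for HTTP statuses, else the status unchanged."""
    status = system_status.strip()
    match = _HTTP_RE.match(status)
    return f"HTTP {match.group(1)}" if match else status


def collapse_tallies(rows):
    """Sum (status, count) pairs into one total per bucket."""
    totals = Counter()
    for status, count in rows:
        totals[status_bucket(status)] += count
    return totals


# A few rows copied from the psql output above (hypothetical input shape).
sample_rows = [
    ("HTTP 404 Not Found", 14494),
    ("HTTP 404", 101),
    ("HTTP 404 File Not Found", 33),
    ("HTTP 410 Gone", 1424),
    ("HTTP 410 Disparu", 1),
    ("parse error", 11009),
]

collapsed = collapse_tallies(sample_rows)
for bucket, total in collapsed.most_common():
    print(f"{bucket:<20} {total}")
```

Non-HTTP statuses such as `parse error` or `unknown hostname` pass through unchanged, so they remain separate buckets.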
The RSS Fetcher can indicate to us that an RSS feed is not working by marking the "system enabled?" value as false. In these cases it sticks a machine-generated note in the "System Status" field. Should we audit the most common statuses and investigate if they mean those feeds should be deleted?
I noted this while looking at CNN: some feeds say "HTTP 410 Gone", which indicates that the feed no longer exists. Some examples appear at https://search.mediacloud.org/sources/1095/feeds.
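To make the question concrete, here is a minimal sketch of how deletion candidates might be flagged, assuming a feed is only considered when it is system-disabled and its last status carries a code that signals permanent removal. The set `PERMANENT_HTTP_CODES` and the function `is_deletion_candidate` are hypothetical names, and limiting the set to 410 is an assumption; which other statuses belong there is exactly what an audit would need to decide.

```python
import re

# Assumption: only 410 Gone is treated as permanent in this sketch.
PERMANENT_HTTP_CODES = {410}

_HTTP_RE = re.compile(r"^HTTP (\d{3})\b")


def is_deletion_candidate(system_enabled: bool, system_status: str) -> bool:
    """Flag a feed only if it is system-disabled and its last status
    is an HTTP code listed in PERMANENT_HTTP_CODES."""
    if system_enabled:
        return False
    match = _HTTP_RE.match(system_status.strip())
    return bool(match) and int(match.group(1)) in PERMANENT_HTTP_CODES


# Hypothetical (system_enabled, system_status) pairs for illustration.
feeds = [
    (False, "HTTP 410 Gone"),
    (False, "HTTP 404 Not Found"),
    (False, "Working"),
    (True, "HTTP 410 Gone"),
]
candidates = [f for f in feeds if is_deletion_candidate(*f)]
print(candidates)
```

Keeping the permanent codes in one set makes it easy to widen or narrow the rule once the common statuses have been reviewed.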