sitemaps links are not returned for some websites (ex: https://www.sainsburys.co.uk) #17

Open
kvamsij opened this issue Aug 15, 2024 · 0 comments

When you try to get the sitemap for a website like https://www.sainsburys.co.uk, it returns an empty array. However, I checked https://www.sainsburys.co.uk/robots.txt, and the sitemap URL is listed there.

After a little digging, I found that the server was denying the request. This was the response:

`https://www.sainsburys.co.uk/robots.txt

<TITLE>Access Denied</TITLE>

Access Denied

You don't have permission to access "http://www.sainsburys.co.uk/robots.txt" on this server.


Reference #18.878f7b5c.1723733159.22628e9

https://errors.edgesuite.net/18.878f7b5c.1723733159.22628e9

`

No headers were being sent when requesting the robots.txt URL. I added the following headers to the get.concat call, and it worked for me:
headers: { 'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36', 'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,*/*;q=0.8', 'Accept-Language': 'en-US,en;q=0.9', 'Accept-Encoding': 'gzip, deflate, br' }
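
To illustrate the idea, here is a minimal sketch of how the request options could be built and how sitemap entries could be read from the returned robots.txt body. The names `BROWSER_HEADERS`, `buildRobotsRequest` and `extractSitemaps` are hypothetical and are not taken from this project's code. The sketch only assumes that the options object is passed to `get.concat` as mentioned above.

```javascript
// Hypothetical sketch, not the project's actual implementation.
// Browser-like headers copied from this issue. Note: advertising an encoding
// in Accept-Encoding only makes sense if the HTTP client can decode it, so
// verify that before keeping 'br' in the list.
const BROWSER_HEADERS = {
  'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36',
  'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,*/*;q=0.8',
  'Accept-Language': 'en-US,en;q=0.9',
  'Accept-Encoding': 'gzip, deflate, br'
};

// Build the options object for the robots.txt request (hypothetical helper).
function buildRobotsRequest(siteUrl) {
  return {
    url: new URL('/robots.txt', siteUrl).href,
    headers: { ...BROWSER_HEADERS }
  };
}

// Collect the URLs from "Sitemap:" lines of a robots.txt body (hypothetical helper).
function extractSitemaps(robotsTxt) {
  return robotsTxt
    .split(/\r?\n/)
    .map((line) => line.trim())
    .filter((line) => /^sitemap:/i.test(line))
    .map((line) => line.slice('sitemap:'.length).trim());
}
```

The object returned by `buildRobotsRequest('https://www.sainsburys.co.uk')` would then be passed as the first argument to `get.concat`, and the response body (converted to a string) handed to `extractSitemaps`. An "Access Denied" HTML page like the one above contains no `Sitemap:` lines, which would explain the empty array.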

I'd be happy to contribute a fix, as it is a small change.
Regards,
Vamse.
