Help Scraping GEVI #228

JPH71 · 2023-03-21T22:00:27Z

For this film - https://gayeroticvideoindex.com/video/67509

I cannot seem to get the Studio via XPath - the Studio being Johnny Rapid

I have tried (GEVI - init.py, line 282):
original - //a[contains(@href, "/company/")]/parent::td//text()[normalize-space()]
first change - //a[contains(@href, "/company/")]//text()[normalize-space()]
second change - //td[@Class="ad"]//text()[normalize-space()]

All of them work when I use Inspect in Chrome/Edge, but not through Python.

Anyone with any ideas?

Jason

fourstix · 2023-03-22T11:23:10Z

fourstix
Mar 22, 2023

It's been so long since I've struggled with XPath. XPath under python makes me shudder.

I would think that the second expression, //a[contains(@href, "/company/")]//text()[normalize-space()], would return a list of two strings: "CashModels" and "Johnny Rapid". Is this what you're looking for? It seems to me to be the closest one to what you want.

This is from me examining the page source on GEVI in Firefox. (And again, I'm far from an XPath expert, so I may be all wet.)

I would suggest debugging this the old-fashioned way. Start with the first match parameter on the left and print out the element returned, then add the next search parameter and print again, and so on, walking the chain until it fails to match what you expect. Tedious, but this is the only way I know.
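A minimal sketch of that step-by-step approach, assuming the agent parses pages with lxml. The markup below is a made-up stand-in, not the real GEVI page, so the printed counts only illustrate the method:

```python
from lxml import html

# Hypothetical markup, loosely modelled on a table of company links.
# It is NOT copied from the GEVI page.
SAMPLE = """
<table>
  <tr><td class="ad"><a href="/company/1">CashModels</a></td></tr>
  <tr><td class="ad"><a href="/company/2">Johnny Rapid</a></td></tr>
</table>
"""

tree = html.fromstring(SAMPLE)

# Grow the expression one step at a time and look at what each stage matches.
steps = [
    '//a',
    '//a[contains(@href, "/company/")]',
    '//a[contains(@href, "/company/")]/parent::td',
    '//a[contains(@href, "/company/")]/parent::td//text()[normalize-space()]',
]

results = {}
for xpath in steps:
    matches = tree.xpath(xpath)
    results[xpath] = matches
    print(len(matches), xpath)
    for match in matches:
        if isinstance(match, str):
            # text() steps return strings
            print('   ', repr(match))
        else:
            # element steps return elements; show their serialized markup
            print('   ', html.tostring(match, encoding='unicode').strip())
```

The first step whose output stops looking right is the one to fix. Running the same loop against the HTML that Python actually downloaded, rather than a sample, is what would narrow down the real problem.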

Sorry I can't be more help, but I have always struggled with XPath. Also be aware that browsers will insert missing tags and otherwise pretty-print malformed HTML, so what one parses with XPath may not always match what one would expect from looking at the browser. That one has thrown me for a loop in the past.
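One common case of this is tables: browsers add a tbody element when building the DOM, while lxml keeps the table as written in the source. A small sketch, assuming lxml and a made-up snippet rather than the real page:

```python
from lxml import html

# Hypothetical source markup with no <tbody>, as many hand-written pages have.
SAMPLE = "<table><tr><td><a href='/company/2'>Johnny Rapid</a></td></tr></table>"
tree = html.fromstring(SAMPLE)

# A path copied from browser dev tools usually includes the <tbody>
# that the browser inserted into its own DOM.
browser_style = tree.xpath('//table/tbody/tr/td/a/text()')

# lxml did not insert a <tbody>, so the path has to follow the source.
source_style = tree.xpath('//table/tr/td/a/text()')

print(browser_style)
print(source_style)

# Dump the tree Python actually sees, to compare with the browser inspector.
print(html.tostring(tree, pretty_print=True, encoding='unicode'))
```

Printing the parsed tree like this, or saving the downloaded HTML to a file, shows the structure the XPath is really running against.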

JPH71 · 2023-03-22T11:59:39Z

That was what I was expecting. I end up with only CashModels, twice. This is rather irritating. I have a feeling that the web browser auto-corrects bad HTML but that this does not happen when accessing it in Python. I will try this with other titles and see. Thanks for taking the time to look at it.

fourstix Mar 22, 2023

Hi,
Unfortunately, XPath was built for XML, which is a strict standard, while HTML is interpreted very loosely by almost all browsers today. That makes it painful in situations where one doesn't control the input, like parsing a web site, which is arguably the most common use case today.

I think an XPath expression that begins with // always starts at the top of the document and works down, even when it is called on a sub-element; a search relative to the current element has to start with .// instead. That may be why it's repeating the first element for each match. Please check out the answer to this question on Stack Overflow: https://stackoverflow.com/questions/321805/how-can-you-identify-multiple-elements-with-the-same-name-in-xpath It seems to be related to the same general issue.
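Here is a small sketch of how that could produce the same name twice, assuming lxml and made-up markup. Whether the GEVI agent actually loops over rows like this is a guess on my part:

```python
from lxml import html

# Hypothetical markup, not the real GEVI page.
SAMPLE = """
<table>
  <tr><td><a href="/company/1">CashModels</a></td></tr>
  <tr><td><a href="/company/2">Johnny Rapid</a></td></tr>
</table>
"""
tree = html.fromstring(SAMPLE)
rows = tree.xpath('//tr')

# '//' restarts from the document root, so every row sees every link,
# and taking [0] yields the first link each time.
absolute = [str(row.xpath('//a/text()')[0]) for row in rows]

# './/' searches only beneath the current row.
relative = [str(row.xpath('.//a/text()')[0]) for row in rows]

print(absolute)  # ['CashModels', 'CashModels']
print(relative)  # ['CashModels', 'Johnny Rapid']
```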

You may need to add some parentheses to get XPath to parse the way a human would expect.
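To illustrate the parentheses point, here is a sketch assuming lxml and made-up markup. A positional predicate such as [2] is counted per parent unless the whole expression is wrapped in parentheses first:

```python
from lxml import html

# Hypothetical markup, not the real GEVI page: one link per table cell.
SAMPLE = """
<table>
  <tr><td><a href="/company/1">CashModels</a></td></tr>
  <tr><td><a href="/company/2">Johnny Rapid</a></td></tr>
</table>
"""
tree = html.fromstring(SAMPLE)

# Without parentheses, [2] means "the second matching <a> within its own
# parent". Each <td> holds only one link, so nothing matches.
per_parent = tree.xpath('//a[contains(@href, "/company/")][2]/text()')

# With parentheses, [2] indexes the full result set in document order.
whole_set = tree.xpath('(//a[contains(@href, "/company/")])[2]/text()')

print(per_parent)  # []
print(whole_set)   # ['Johnny Rapid']
```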

Best regards,
Gaston

Charlotte-br560 · 2024-03-28T09:45:20Z

Try this XPath:

//a[contains(@href, "/company/")]/text()[normalize-space()]
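A quick way to compare this with the earlier //text() variant, sketched with lxml on made-up markup. The nested b tag is an assumption chosen to show the difference, not something taken from the GEVI page:

```python
from lxml import html

# Hypothetical markup: the second link wraps its label in a <b> tag.
SAMPLE = (
    '<p><a href="/company/1">CashModels</a> '
    '<a href="/company/2"><b>Johnny Rapid</b></a></p>'
)
tree = html.fromstring(SAMPLE)

# '/text()' returns only text nodes that are direct children of the link.
direct = tree.xpath('//a[contains(@href, "/company/")]/text()[normalize-space()]')

# '//text()' also returns text nested deeper inside the link.
nested = tree.xpath('//a[contains(@href, "/company/")]//text()[normalize-space()]')

print([s.strip() for s in direct])  # ['CashModels']
print([s.strip() for s in nested])  # ['CashModels', 'Johnny Rapid']
```

If the label sits directly inside the link, both forms return the same strings, so the choice only matters when the page nests extra tags there.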

For consistency, ensure your Python scraping code handles dynamic content. Consider using Crawlbase for easier scraping.
