Add pause / wait to load before scrape #111

Closed
JReming85 opened this issue Sep 6, 2018 · 1 comment

JReming85 commented Sep 6, 2018

Expected Behavior

I am rewriting certain URLs to go to outline.com/https://website.com

However, outline.com takes a few moments to clean up the page and display the results. Is there any way to hold off the scrape until it finishes loading, bypassing paywalls, etc.?

Current Behavior

It scrapes the loading page.

Steps to Reproduce

URL - https://www.wsj.com/articles/the-nfls-best-players-are-getting-richer-than-ever-1536163544

{
    "type": "xpath",
    "xpath": [
        "div[@class='article-wrapper']"
    ],
    "reformat": [
        {
            "type": "regex",
            "pattern": "/.+.com/",
            "replace": "https://outline.com/https://wsj.com"
        }
    ]
}
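To make the effect of the `reformat` rule concrete, here is a minimal sketch of the rewrite applied to the URL above. Python's `re.sub` stands in for whatever regex engine the plugin actually uses (that substitution is an assumption), and the `rewrite_url` helper is hypothetical. The pattern is copied as written, with the surrounding `/` delimiters dropped.

```python
import re

# Pattern and replacement copied from the config above.
# Note: the "." before "com" is unescaped, so it matches any character.
PATTERN = r".+.com"
REPLACEMENT = "https://outline.com/https://wsj.com"


def rewrite_url(url: str) -> str:
    """Hypothetical helper: apply the regex replacement to a URL."""
    return re.sub(PATTERN, REPLACEMENT, url)


ORIGINAL = (
    "https://www.wsj.com/articles/"
    "the-nfls-best-players-are-getting-richer-than-ever-1536163544"
)
REWRITTEN = rewrite_url(ORIGINAL)
print(REWRITTEN)
```

The greedy `.+` consumes everything up to the last `com` in the string, so the rule only behaves as intended when the article path itself contains no `com`; a stricter pattern such as `^https?://[^/]+` would be one way to avoid that.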

@dugite-code

Currently there is no way to add a delay to the HTML body fetch. I have hacked php-curl into FeedIron in the past by adding it to the function at line 271. That said, I'm not 100% sure you could get the desired result from curl.

The other idea, which I mentioned in #38, is one I had been working on but have put on hold for the moment: adding the ability to call PhantomJS or Selenium. Both are potentially complex and would require significant rework of the code base to integrate. I might revisit them when I can break configs in version 2.
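As a rough illustration of the delay idea discussed above, the following sketch re-fetches a page until it stops looking like a loading screen. It is not FeedIron code: `fetch_when_ready`, its parameters, and the readiness check are all hypothetical, and the fetch itself is passed in as a callable so no particular HTTP library is assumed. Re-fetching only helps if the server eventually returns the finished HTML; if the page is filled in by JavaScript in the browser, a plain HTTP fetch will keep seeing the loading page, which is where a headless browser would come in.

```python
import time
from typing import Callable


def fetch_when_ready(
    fetch: Callable[[], str],
    is_ready: Callable[[str], bool],
    attempts: int = 5,
    delay_seconds: float = 2.0,
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Call fetch() until is_ready(body) is true or attempts run out.

    All names here are illustrative; nothing is taken from FeedIron.
    """
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(attempts):
        body = fetch()
        if is_ready(body):
            return body
        # Wait before retrying, but not after the final attempt.
        if attempt < attempts - 1:
            sleep(delay_seconds)
    raise TimeoutError(f"content not ready after {attempts} attempts")


if __name__ == "__main__":
    # Stand-in for a real fetch: two loading pages, then the article.
    pages = iter([
        "<p>Loading...</p>",
        "<p>Loading...</p>",
        "<div class='article-wrapper'>Article text</div>",
    ])
    html = fetch_when_ready(
        fetch=lambda: next(pages),
        is_ready=lambda body: "article-wrapper" in body,
        delay_seconds=0.1,
    )
    print(html)
```

Checking for the same `article-wrapper` element that the XPath targets keeps the readiness test aligned with what the scrape actually needs.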
