Releases: crwlrsoft/crawler

v1.6.1

16 Feb 22:28

otsch

Changed

Make method HttpLoader::addToCache() public, so steps can update a cached response with an extended version.

v1.6.0

13 Feb 02:04

otsch

Added

Enable dot notation in Step::addToResult(), so you can get data from nested output, like: $step->addToResult(['url' => 'response.url', 'status' => 'response.status', 'foo' => 'bar']).
When a step adds output properties to the result, and the output contains objects, it tries to serialize those objects to arrays, by calling __serialize(). If you want an object to be serialized differently for that purpose, you can define a toArrayForAddToResult() method in that class. When that method exists, it's preferred to the __serialize() method.
Implemented the above-mentioned toArrayForAddToResult() method in the RespondedRequest class, so on every step that yields a RespondedRequest object, you can use the keys url, uri, status, headers and body with the addToResult() method. Previously this only worked for Http steps, because they define output key aliases (HttpBase::outputKeyAliases()). Now, combined with the ability to use dot notation when adding data to the result, if your custom step returns nested output like ['response' => RespondedRequest, 'foo' => 'bar'], you can add response data to the result like this: $step->addToResult(['url' => 'response.url', 'body' => 'response.body']).
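
The lookup described above can be sketched in plain PHP. This is only an illustration of the idea, not the library's implementation: FakeResponse, toArrayForResult(), getByDotNotation() and mapToResult() are hypothetical names, and PHP 8.0+ is assumed. Only the method names toArrayForAddToResult() and __serialize() come from the release notes.

```php
<?php

// Hypothetical stand-in for a response object; NOT the library's RespondedRequest.
final class FakeResponse
{
    public function __construct(private string $url, private int $status) {}

    public function __serialize(): array
    {
        return ['url' => $this->url, 'status' => $this->status, 'internal' => true];
    }

    // Preferred over __serialize() when adding data to a result.
    public function toArrayForAddToResult(): array
    {
        return ['url' => $this->url, 'status' => $this->status];
    }
}

// Convert an object to an array, preferring toArrayForAddToResult().
function toArrayForResult(mixed $value): mixed
{
    if (!is_object($value)) {
        return $value;
    }
    if (method_exists($value, 'toArrayForAddToResult')) {
        return $value->toArrayForAddToResult();
    }
    if (method_exists($value, '__serialize')) {
        return $value->__serialize();
    }
    return $value;
}

// Resolve a path like 'response.url' against nested output; null if missing.
function getByDotNotation(array $output, string $path): mixed
{
    $current = $output;
    foreach (explode('.', $path) as $segment) {
        $current = toArrayForResult($current);
        if (!is_array($current) || !array_key_exists($segment, $current)) {
            return null;
        }
        $current = $current[$segment];
    }
    return $current;
}

// Build result data from a mapping of result key => output path.
function mapToResult(array $output, array $mapping): array
{
    $result = [];
    foreach ($mapping as $resultKey => $outputPath) {
        $result[$resultKey] = getByDotNotation($output, $outputPath);
    }
    return $result;
}

$output = ['response' => new FakeResponse('https://www.example.com/', 200), 'foo' => 'bar'];
$result = mapToResult($output, ['url' => 'response.url', 'status' => 'response.status', 'foo' => 'foo']);
```

In this sketch, the 'internal' key from __serialize() never reaches the result, because toArrayForAddToResult() takes precedence when both methods exist.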

Fixed

Improved the timing of when the crawler calls a store (Store class instance) with a final crawling result. When a step initiates a crawling result (that is, addToResult() was called on the step instance), the crawler has to wait for all child outputs resulting from one step input before it calls the store, because the child outputs can all add data to the same final result object. Previously, the crawler waited not only for the child outputs descending from the step where addToResult() was called, but for all children of one initial crawler input. With this change, the store is in many cases called earlier with finished Result objects, and memory usage is lower.

v1.5.3

07 Feb 14:49

otsch

Fixed

Merge HttpBaseLoader back to HttpLoader. It's probably not a good idea to have multiple loaders. At least not multiple loaders just for HTTP. It should be enough to publicly expose the HeadlessBrowserLoaderHelper via HttpLoader::browserHelper() for the extension steps. But keep the HttpBase step, to share the general HTTP functionality implemented there.

v1.5.2

07 Feb 10:08

otsch

Fixed

Issue in GetUrlsFromSitemap (Sitemap::getUrlsFromSitemap()) step when XML content has no line breaks.

v1.5.1

06 Feb 22:40

otsch

Fixed

To make it easier to build a separate headless browser loader (in an extension package), extract the most basic HTTP loader functionality to a new HttpBaseLoader and the functionality important for the headless browser loader to a new HeadlessBrowserLoaderHelper. Also share functionality from the Http steps via a new abstract HttpBase step. This is considered a fix because there is no new functionality, just a refactoring of existing code for better extensibility.

v1.5.0

29 Jan 15:41

otsch

Added

The DomQuery class (parent of CssSelector (Dom::cssSelector) and XPathQuery (Dom::xPath)) has a new method formattedText() that uses the new crwlr/html-2-text package to convert the HTML to formatted plain text. You can also provide a customized instance of the Html2Text class to the formattedText() method.

Fixed

The Http::crawl() step won't yield a page again if a newly found URL responds with a redirect to a previously loaded URL.

v1.4.0

14 Jan 13:27

otsch

Added

The QueryParamsPaginator can now also increase and decrease nested (non-top-level) query param values like foo[bar][baz]=5 using dot notation: QueryParamsPaginator::paramsInUrl()->increaseUsingDotNotation('foo.bar.baz', 5).
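
To illustrate what changing a nested query param by a dot-notation key amounts to, here is a minimal standalone sketch using only PHP built-ins. It is not the paginator's actual code: the function name increaseNestedQueryParam() is hypothetical, and the sketch ignores URL fragments for brevity.

```php
<?php

// Hypothetical helper: change a (possibly nested) query param addressed by a
// dot-notation key such as 'foo.bar.baz'. A negative $by decreases the value.
function increaseNestedQueryParam(string $url, string $dotKey, int $by): string
{
    $urlParts = explode('?', $url, 2);
    $params = [];
    parse_str($urlParts[1] ?? '', $params);

    // Walk down to the addressed value by reference; missing levels are
    // created on the fly and start at 0.
    $ref = &$params;
    foreach (explode('.', $dotKey) as $segment) {
        $ref = &$ref[$segment];
    }
    $ref = (int) $ref + $by;
    unset($ref);

    // http_build_query() percent-encodes the square brackets.
    return $urlParts[0] . '?' . http_build_query($params);
}

$next = increaseNestedQueryParam('https://www.example.com/list?foo[bar][baz]=5&page=1', 'foo.bar.baz', 5);
```

Parsing the query string of $next again with parse_str() yields foo[bar][baz] = 10, while page stays at 1.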

v1.3.5

20 Dec 01:13

otsch

Fixed

The FileCache can now also read uncompressed cache files when compression is activated.
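
One way such a fallback could work is sketched below. This is an assumption-laden illustration, not the FileCache implementation: it assumes gzip compression (the release notes do not say which format is used), requires the zlib extension, and readCacheContent() is a hypothetical name.

```php
<?php

// Hypothetical helper: return the decoded content of a cache file that may or
// may not be gzip-compressed. Assumes gzip; the real cache format may differ.
function readCacheContent(string $raw): string
{
    // Gzip streams start with the magic bytes 0x1f 0x8b.
    if (strncmp($raw, "\x1f\x8b", 2) === 0) {
        $decoded = @gzdecode($raw);

        if ($decoded !== false) {
            return $decoded;
        }
    }

    // Not compressed (or not decodable): use the content as it is.
    return $raw;
}

$fromCompressed = readCacheContent(gzencode('cached response'));
$fromPlain = readCacheContent('cached response');
```

Both calls return the same string, so files written before compression was activated stay readable.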

v1.3.4

19 Dec 12:22

otsch

Fixed

Reset paginator state after finishing paginating for one base input, to enable paginating multiple listings of the same structure.

v1.3.3

01 Dec 01:07

otsch

Fixed

Add forgotten getter method to get the DOM query that is attached to an InvalidDomQueryException instance.
