
Commit 1d71641

Sync new Zyte API features (#218)
1 parent 3ec31bd commit 1d71641


15 files changed, with 750 additions and 174 deletions.


.github/workflows/test.yml

Lines changed: 2 additions & 0 deletions
@@ -29,6 +29,8 @@ jobs:
   toxenv: pinned-scrapy-2x5
 - python-version: '3.9'
   toxenv: pinned-scrapy-2x6
+- python-version: '3.9'
+  toxenv: pinned-scrapy-2x7
 - python-version: '3.10'
 - python-version: '3.11'
 - python-version: '3.12'

CHANGES.rst

Lines changed: 45 additions & 0 deletions
@@ -1,6 +1,51 @@
 Changes
 =======

+Unreleased
+----------
+
+* Added :ref:`automatic mapping <automap>` support for new Zyte API request
+  fields:
+  :http:`request:customAttributes`,
+  :http:`request:customAttributesOptions`,
+  :http:`request:ipType`,
+  :http:`request:followRedirect`,
+  :http:`request:forumThread`,
+  :http:`request:forumThreadOptions`,
+  :http:`request:jobPostingNavigation`,
+  :http:`request:jobPostingNavigationOptions`,
+  :http:`request:networkCapture`,
+  :http:`request:serp`,
+  :http:`request:serpOptions`,
+  :http:`request:session`,
+  :http:`request:tags`.
+
+* You will now be warned when using their default values unnecessarily.
+
+* By default, the following fields no longer affect request fingerprinting
+  (i.e. 2 requests identical except for the value of that field are now
+  considered duplicate requests): :http:`request:ipType`,
+  :http:`request:session`.
+
+* When enabling :http:`request:serp`, :http:`request:httpResponseBody` and
+  :http:`request:httpResponseHeaders` will no longer be enabled by default,
+  and header mapping is disabled.
+
+* Session pool IDs, of server-managed sessions (:http:`request:sessionContext`)
+  or :ref:`set through the session management API <session-pools>`, now affect
+  request fingerprinting: 2 requests identical except for their session pool ID
+  are *not* considered duplicate requests any longer.
+
+* When it is not clear whether a request will use browser rendering or not,
+  e.g. an :ref:`automatic extraction request <zapi-extract>` without an
+  :http:`extractFrom <request:productOptions.extractFrom>` value, the URL
+  fragment is now taken into account for request fingerprinting, i.e.
+  ``https://example.com#a`` and ``https://example.com#b`` are *not* considered
+  duplicate requests anymore in those scenarios.
+
+* Fixes ``"auto"`` being considered the default value of :http:`request:device`
+  instead of ``"desktop"``.
+
 0.27.0 (2025-02-04)
 -------------------
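
The URL fragment entry in the changelog above can be illustrated with a minimal sketch. This is not the scrapy-zyte-api implementation: the function name ``url_for_fingerprint`` and the ``may_use_browser`` flag are hypothetical, and only the standard library is used to keep or drop the fragment.

```python
from urllib.parse import urldefrag


def url_for_fingerprint(url: str, may_use_browser: bool) -> str:
    """Return the URL form that a fingerprinter could hash.

    Hypothetical helper: the fragment is kept only when the request may be
    rendered in a browser, where the fragment can change the page content.
    """
    if may_use_browser:
        return url
    # urldefrag() splits off the "#fragment" part of the URL.
    return urldefrag(url).url


# Plain HTTP request: both URLs collapse to the same value.
same = url_for_fingerprint("https://example.com#a", False) == url_for_fingerprint(
    "https://example.com#b", False
)
# Possible browser request: the fragments keep the URLs apart.
different = url_for_fingerprint("https://example.com#a", True) != url_for_fingerprint(
    "https://example.com#b", True
)
```

Under these assumptions, a request whose rendering mode is unclear is treated like a browser request, which matches the "may be a browser request" wording used in this commit.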

docs/reference/fingerprint-params.rst

Lines changed: 38 additions & 19 deletions
@@ -7,49 +7,68 @@ Request fingerprinting parameters
 The request fingerprinter class of scrapy-zyte-api generates request
 fingerprints for Zyte API requests based on the following Zyte API parameters:

-- :http:`request:url` (:func:`canonicalized <w3lib.url.canonicalize_url>`)
+- :http:`request:url` (:func:`canonicalized <w3lib.url.canonicalize_url>`).

   For URLs that include a URL fragment, like ``https://example.com#foo``, URL
-  canonicalization keeps the URL fragment if :http:`request:browserHtml` or
-  :http:`request:screenshot` are enabled, or if extractFrom_ is set to
-  ``browserHtml``.
-
-  .. _extractFrom: https://docs.zyte.com/zyte-api/usage/extract.html#extraction-source
+  canonicalization keeps the URL fragment if the request *may* be a browser
+  request.

 - Request attribute parameters (:http:`request:httpRequestBody`,
   :http:`request:httpRequestText`, :http:`request:httpRequestMethod`), except
-  headers
+  headers.

   Equivalent :http:`request:httpRequestBody` and
   :http:`request:httpRequestText` values generate the same signature.

 - Output parameters (:http:`request:browserHtml`,
   :http:`request:httpResponseBody`, :http:`request:httpResponseHeaders`,
-  :http:`request:responseCookies`, :http:`request:screenshot`, and
+  :http:`request:responseCookies`, :http:`request:screenshot`,
   :ref:`automatic extraction outputs <zapi-extract-fields>` like
-  :http:`request:product`)
+  :http:`request:product`, and :http:`request:customAttributes`).
+
+  Same for :http:`request:networkCapture`, although it is not a proper output
+  parameter (it needs to be combined with another browser rendering parameter
+  to work).

 - Rendering option parameters (:http:`request:actions`,
   :http:`request:device`, :http:`request:javascript`,
   :http:`request:screenshotOptions`, :http:`request:viewport`, and automatic
-  extraction options like :http:`request:productOptions`)
+  extraction options like :http:`request:productOptions` or
+  :http:`request:customAttributesOptions`).
+
+- :http:`request:geolocation`.
+
+- :http:`request:sessionContext`.
+
+  When using the :ref:`session management API <session>`, :ref:`session pool
+  IDs <session-pools>` are treated the same as
+  :http:`request:sessionContext`.

-- :http:`request:geolocation`
+- :http:`request:followRedirect`.

-- :http:`request:echoData`
+- :http:`request:echoData`.
+
+- :http:`request:tags`.

 The following Zyte API parameters are *not* taken into account for request
-fingerprinting:
+fingerprinting by default:

 - Request header parameters (:http:`request:customHttpRequestHeaders`,
-  :http:`request:requestHeaders`)
+  :http:`request:requestHeaders`).

 - Request cookie parameters (:http:`request:cookieManagement`,
-  :http:`request:requestCookies`)
+  :http:`request:requestCookies`).
+
+- :http:`request:sessionContextParameters`.
+
+  When using the :ref:`session management API <session>`, :ref:`session
+  initialization parameters <session-init>` are treated the same as
+  :http:`request:sessionContextParameters`.
+
+- :http:`request:session.id`.

-- Session handling parameters (:http:`request:sessionContext`,
-  :http:`request:sessionContextParameters`)
+- :http:`request:ipType`.

-- :http:`request:jobId`
+- :http:`request:jobId`.

-- Experimental parameters (:http:`experimental.* <request:experimental>`)
+- Experimental parameters (:http:`experimental.* <request:experimental>`).
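
The effect of the two lists above can be sketched with a simplified fingerprint function. This is an illustration only, not the fingerprinter shipped with scrapy-zyte-api: ``sketch_fingerprint`` and ``SKIPPED_BY_DEFAULT`` are hypothetical names, nested keys such as ``session.id`` are reduced to their top-level key, and URL canonicalization is left out.

```python
import hashlib
import json

# Parameters that, per the list above, do not affect fingerprints by default.
# Simplification for illustration: nested keys such as session.id are
# reduced to their top-level name here.
SKIPPED_BY_DEFAULT = {
    "customHttpRequestHeaders",
    "requestHeaders",
    "cookieManagement",
    "requestCookies",
    "sessionContextParameters",
    "session",
    "ipType",
    "jobId",
    "experimental",
}


def sketch_fingerprint(api_params: dict) -> str:
    """Hash only the parameters that are not skipped (hypothetical helper)."""
    relevant = {k: v for k, v in api_params.items() if k not in SKIPPED_BY_DEFAULT}
    # sort_keys makes the hash independent of dict insertion order.
    payload = json.dumps(relevant, sort_keys=True).encode()
    return hashlib.sha1(payload).hexdigest()


base = {"url": "https://example.com", "httpResponseBody": True}
with_ip_type = {**base, "ipType": "residential"}
with_tags = {**base, "tags": {"team": "a"}}

# ipType is skipped, so these two requests count as duplicates.
ip_type_ignored = sketch_fingerprint(base) == sketch_fingerprint(with_ip_type)
# tags is fingerprinted, so these two requests are distinct.
tags_counted = sketch_fingerprint(base) != sketch_fingerprint(with_tags)
```

In this sketch, adding ``sessionContext`` to a request would also change its fingerprint, in line with the session pool change described in this commit.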

docs/usage/automap.rst

Lines changed: 1 addition & 1 deletion
@@ -45,7 +45,7 @@ Changing parameters

 You may set :reqmeta:`zyte_api_automap` in :attr:`Request.meta
 <scrapy.http.Request.meta>` to a :class:`dict` of Zyte API parameters to add,
-modify, or remove (by setting to ``False``) automatic request parameters. This
+modify, or remove (by setting to ``None``) automatic request parameters. This
 also works in :ref:`transparent mode <transparent>`.

 Enabling :http:`request:browserHtml`, :http:`request:screenshot`, or an
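
The add, modify, or remove semantics described in this hunk can be sketched as a plain dictionary merge. This is a rough model rather than the library's code: ``apply_automap_overrides`` is a hypothetical helper, and the ``automatic`` dictionary is an assumed set of automatically mapped parameters.

```python
def apply_automap_overrides(automatic: dict, overrides: dict) -> dict:
    """Merge overrides into automatic parameters (hypothetical helper).

    A value of None removes the parameter; any other value adds or
    replaces it.
    """
    merged = dict(automatic)
    for name, value in overrides.items():
        if value is None:
            merged.pop(name, None)
        else:
            merged[name] = value
    return merged


# Assumed automatic parameters, for illustration only.
automatic = {"httpResponseBody": True, "httpResponseHeaders": True}
# Request metadata as it could be passed to scrapy.Request(meta=...).
meta = {"zyte_api_automap": {"httpResponseHeaders": None, "geolocation": "US"}}
result = apply_automap_overrides(automatic, meta["zyte_api_automap"])
```

Here ``httpResponseHeaders`` is dropped because its override is ``None``, while ``geolocation`` is added.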

docs/usage/default.rst

Lines changed: 3 additions & 0 deletions
@@ -17,6 +17,9 @@ in all requests:
 - :setting:`ZYTE_API_DEFAULT_PARAMS`, for :ref:`manual request parameters
   <manual>`.

+- :setting:`ZYTE_API_PROVIDER_PARAMS`, for :ref:`dependency injection
+  <scrapy-poet>`.
+
 For example, if you set :setting:`ZYTE_API_DEFAULT_PARAMS` to
 ``{"geolocation": "US"}`` and :reqmeta:`zyte_api` to ``{"browserHtml": True}``,
 ``{"url": "…", "geolocation": "US", "browserHtml": True}`` is sent to Zyte API.
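
The example above amounts to a dictionary merge, which can be sketched as follows. ``build_params`` is a hypothetical helper, the URL is a placeholder, and the sketch assumes that request-level values take precedence over the defaults.

```python
def build_params(url: str, default_params: dict, request_params: dict) -> dict:
    """Combine defaults with per-request parameters (hypothetical helper)."""
    # Later entries win, so request-level values override the defaults.
    return {"url": url, **default_params, **request_params}


ZYTE_API_DEFAULT_PARAMS = {"geolocation": "US"}
zyte_api_meta = {"browserHtml": True}
sent = build_params("https://example.com", ZYTE_API_DEFAULT_PARAMS, zyte_api_meta)
```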

docs/usage/fingerprint.rst

Lines changed: 3 additions & 3 deletions
@@ -17,10 +17,10 @@ Use :setting:`ZYTE_API_FALLBACK_REQUEST_FINGERPRINTER_CLASS` to define a custom
 request fingerprinting for requests that do not go through Zyte API.


-Request fingerprinting before Scrapy 2.7
-----------------------------------------
+Request fingerprinting below Scrapy 2.7
+---------------------------------------

-If you have a Scrapy version older than Scrapy 2.7, Zyte API parameters are not
+If you have a Scrapy version lower than Scrapy 2.7, Zyte API parameters are not
 taken into account for request fingerprinting. This can cause some Scrapy
 components, like the filter of duplicate requests or the HTTP cache extension,
 to interpret 2 different requests as being the same.

docs/usage/session.rst

Lines changed: 6 additions & 1 deletion
@@ -71,7 +71,7 @@ By default, scrapy-zyte-api will maintain up to 8 sessions per domain, each
 initialized with a :ref:`browser request <zapi-browser>` targeting the URL
 of the first request that will use the session. Sessions are automatically
 rotated among requests, and refreshed as they expire or get banned. You can
-customize most of this logic though request metadata, settings and
+customize most of this logic through request metadata, settings and
 :ref:`session config overrides <session-configs>`.

 For session management to work as expected, your
@@ -269,6 +269,11 @@ sessions:
 initialization request <session-init>` is triggered to replace that
 session in the session pool.

+The session pool assigned to a request affects the :ref:`fingerprint
+<fingerprint>` of the request. 2 requests with a different session pool ID are
+considered different requests, i.e. not duplicate requests, even if they are
+otherwise identical.
+

 .. _optimize-sessions:
