diff --git a/CHANGES.rst b/CHANGES.rst index b890acfe..43ea783a 100644 --- a/CHANGES.rst +++ b/CHANGES.rst @@ -1,13 +1,18 @@ Changes ======= -N.N.N (YYYY-MM-DD) ------------------- +0.22.0 (2024-07-DD) +------------------- -* ``scrapy-zyte-api[provider]`` now requires zyte-common-items >= 0.20.0. +* ``scrapy-zyte-api[provider]`` now requires :doc:`zyte-common-items + ` 0.20.0+. * Added the :setting:`ZYTE_API_AUTO_FIELD_STATS` setting. +* Added the :func:`~scrapy_zyte_api.is_session_init_request` function. + +* Added the :data:`~scrapy_zyte_api.session_config_registry` variable. + 0.21.0 (2024-07-02) ------------------- @@ -120,8 +125,7 @@ N.N.N (YYYY-MM-DD) * The ``Accept``, ``Accept-Encoding``, ``Accept-Language``, and ``User-Agent`` headers are now dropped automatically during :ref:`header mapping ` unless they have user-defined values. This fix can improve - success rates on some websites when using :ref:`HTTP requests - `. + success rates on some websites when using :ref:`HTTP requests `. 
0.18.1 (2024-04-19) ------------------- diff --git a/docs/reference/fingerprint-params.rst b/docs/reference/fingerprint-params.rst index 1e8cc144..604be54e 100644 --- a/docs/reference/fingerprint-params.rst +++ b/docs/reference/fingerprint-params.rst @@ -26,7 +26,7 @@ fingerprints for Zyte API requests based on the following Zyte API parameters: - Output parameters (:http:`request:browserHtml`, :http:`request:httpResponseBody`, :http:`request:httpResponseHeaders`, :http:`request:responseCookies`, :http:`request:screenshot`, and - :ref:`automatic extraction outputs ` like + :ref:`automatic extraction outputs ` like :http:`request:product`) - Rendering option parameters (:http:`request:actions`, diff --git a/docs/reference/request.rst b/docs/reference/request.rst index 70c04daf..fbecdef5 100644 --- a/docs/reference/request.rst +++ b/docs/reference/request.rst @@ -222,10 +222,10 @@ combinations that Zyte API does not currently support, and may never support: :http:`request:requestHeaders`. - You can set :http:`request:httpResponseBody` to ``True`` or use - :ref:`automatic extraction from httpResponseBody `, + :ref:`automatic extraction from httpResponseBody `, and also set :http:`request:browserHtml` or :http:`request:screenshot` to ``True`` or use :ref:`automatic extraction from browserHtml - `. In this case, :attr:`Request.headers + `. In this case, :attr:`Request.headers ` is mapped both as :http:`request:customHttpRequestHeaders` and as :http:`request:requestHeaders`, and :http:`request:browserHtml` is used as diff --git a/docs/reference/settings.rst b/docs/reference/settings.rst index 77922721..563052f9 100644 --- a/docs/reference/settings.rst +++ b/docs/reference/settings.rst @@ -15,10 +15,10 @@ Default: ``False`` Enables stats that indicate which requested fields :ref:`obtained through scrapy-poet integration ` come directly from -:ref:`zyte-api-extract`. +:ref:`zapi-extract`. 
If for any request no page object class is used to override -:ref:`zyte-api-extract` fields for a given item type, the following stat is +:ref:`zapi-extract` fields for a given item type, the following stat is set: .. code-block:: python @@ -29,7 +29,7 @@ set: all fields. If for any request a custom page object class is used to override some -:ref:`zyte-api-extract` fields, the following stat is set: +:ref:`zapi-extract` fields, the following stat is set: .. code-block:: python @@ -434,7 +434,7 @@ ZYTE_API_SESSION_MAX_ERRORS Default: ``1`` Maximum number of :ref:`unsuccessful responses -` allowed for any given session before +` allowed for any given session before discarding the session. You might want to increase this number if you find that a session may continue diff --git a/docs/setup.rst b/docs/setup.rst index 5efe15ec..58f9187f 100644 --- a/docs/setup.rst +++ b/docs/setup.rst @@ -16,7 +16,7 @@ Requirements You need at least: - A :ref:`Zyte API ` subscription (there’s a :ref:`free trial - `). + `). - Python 3.8+ diff --git a/docs/usage/manual.rst b/docs/usage/manual.rst index e2f5da4e..ace3e1d0 100644 --- a/docs/usage/manual.rst +++ b/docs/usage/manual.rst @@ -64,4 +64,4 @@ remember to also request :http:`request:httpResponseHeaders`: # "…" To learn more about Zyte API parameters, see the upstream :ref:`usage -` and :ref:`API reference ` pages. +` and :ref:`API reference ` pages. diff --git a/docs/usage/retry.rst b/docs/usage/retry.rst index b0a23a46..ee45b15b 100644 --- a/docs/usage/retry.rst +++ b/docs/usage/retry.rst @@ -4,7 +4,7 @@ Retries ======= -To make :ref:`error handling ` easier, scrapy-zyte-api lets +To make :ref:`error handling ` easier, scrapy-zyte-api lets you :ref:`handle successful Zyte API responses as usual `, but :ref:`implements a more advanced retry mechanism for rate-limiting and unsuccessful responses `. @@ -14,7 +14,7 @@ unsuccessful responses `. 
Retrying successful Zyte API responses ====================================== -When a :ref:`successful Zyte API response ` is +When a :ref:`successful Zyte API response ` is received, a Scrapy response object is built based on the upstream website response (see :ref:`response`), and passed to your :ref:`downloader middlewares ` and :ref:`spider callback `. @@ -30,8 +30,8 @@ them using Scrapy’s built-in retry middleware Retrying non-successful Zyte API responses ========================================== -When a :ref:`rate-limiting ` or an :ref:`unsuccessful -` Zyte API response is received, no Scrapy +When a :ref:`rate-limiting ` or an :ref:`unsuccessful +` Zyte API response is received, no Scrapy response object is built. Instead, a :ref:`retry policy ` is followed, and if the policy retries are exhausted, a :class:`zyte_api.RequestError` exception is raised. diff --git a/docs/usage/session.rst b/docs/usage/session.rst index 6dd1b151..09a7d792 100644 --- a/docs/usage/session.rst +++ b/docs/usage/session.rst @@ -6,10 +6,10 @@ Session management Zyte API provides powerful session APIs: -- :ref:`Client-managed sessions ` give you full control +- :ref:`Client-managed sessions ` give you full control over session management. -- :ref:`Server-managed sessions ` let Zyte API +- :ref:`Server-managed sessions ` let Zyte API handle session management for you. When using scrapy-zyte-api, you can use these session APIs through the @@ -17,11 +17,11 @@ corresponding Zyte API fields (:http:`request:session`, :http:`request:sessionContext`). However, scrapy-zyte-api also provides its own session management API, similar -to that of :ref:`server-managed sessions `, but -built on top of :ref:`client-managed sessions `. +to that of :ref:`server-managed sessions `, but +built on top of :ref:`client-managed sessions `. 
scrapy-zyte-api session management offers some advantages over -:ref:`server-managed sessions `: +:ref:`server-managed sessions `: - You can perform :ref:`session validity checks `, so that the sessions of responses that do not pass those checks are refreshed, and the @@ -35,11 +35,11 @@ scrapy-zyte-api session management offers some advantages over :ref:`optimize-sessions` and :ref:`session-configs`. However, scrapy-zyte-api session management is not a replacement for -:ref:`server-managed sessions ` or -:ref:`client-managed sessions `: +:ref:`server-managed sessions ` or +:ref:`client-managed sessions `: -- :ref:`Server-managed sessions ` offer a longer - life time than the :ref:`client-managed sessions ` +- :ref:`Server-managed sessions ` offer a longer + life time than the :ref:`client-managed sessions ` that scrapy-zyte-api session management uses, so as long as you do not need one of the scrapy-zyte-api session management features, server-managed sessions can be significantly more efficient (fewer total sessions needed @@ -49,7 +49,7 @@ However, scrapy-zyte-api session management is not a replacement for website. With scrapy-zyte-api session management, you need to :ref:`handle optimization yourself `. -- :ref:`Client-managed sessions ` offer full control +- :ref:`Client-managed sessions ` offer full control over session management, while scrapy-zyte-api session management removes some of that control to provide an easier API for supported use cases. @@ -68,7 +68,7 @@ override `. .. _session-init-default: By default, scrapy-zyte-api will maintain up to 8 sessions per domain, each -initialized with a :ref:`browser request ` targeting the URL +initialized with a :ref:`browser request ` targeting the URL of the first request that will use the session. Sessions are automatically rotated among requests, and refreshed as they expire or get banned. 
You can customize most of this logic through request metadata, settings and @@ -134,7 +134,7 @@ To change the :ref:`default session initialization parameters :reqmeta:`zyte_api_session_params` request metadata key. It works similarly to :http:`request:sessionContextParams` from - :ref:`server-managed sessions `, but it supports + :ref:`server-managed sessions `, but it supports arbitrary Zyte API parameters instead of a specific subset. If it does not define a ``"url"``, the URL of the request :ref:`triggering @@ -210,7 +210,7 @@ initialization request. If your session checking implementation relies on the response body (e.g. it uses CSS or XPath expressions), you should make sure that you are getting one, which might not be the case if you are mostly using :ref:`Zyte API automatic -extraction `, e.g. when using :doc:`Zyte spider templates +extraction `, e.g. when using :doc:`Zyte spider templates `. For example, you can use :setting:`ZYTE_API_AUTOMAP_PARAMS` and :setting:`ZYTE_API_PROVIDER_PARAMS` to force :http:`request:browserHtml` or :http:`request:httpResponseBody` to be set @@ -288,7 +288,7 @@ Here are some things you can try: (:setting:`ZYTE_API_SESSION_POOL_SIZE`). The more different sessions you use, the more slowly you send requests through each session. - Mind, however, that :ref:`client-managed sessions ` + Mind, however, that :ref:`client-managed sessions ` expire after `15 minutes since creation or 2 minutes since the last request `_. At a certain point, increasing :setting:`ZYTE_API_SESSION_POOL_SIZE` @@ -298,7 +298,7 @@ Here are some things you can try: counterproductive. - By default, sessions are discarded as soon as an :ref:`unsuccessful - response ` is received. + response ` is received. However, on some websites sessions may remain valid even after a few unsuccessful responses. 
If that is the case, you might want to increase @@ -308,9 +308,9 @@ Here are some things you can try: If you do not need :ref:`session checking ` and your :ref:`initialization parameters ` are only :http:`request:browserHtml` and :http:`request:actions`, :ref:`server-managed -sessions ` might be a more cost-effective choice, as +sessions ` might be a more cost-effective choice, as they live much longer than :ref:`client-managed sessions -`. +`. .. _session-configs: @@ -371,7 +371,7 @@ To include cookies in session initialization requests, use :http:`request:requestCookies` in :ref:`session initialization parameters `. But mind that those cookies are only set during that request, :ref:`they are not added to the session cookie jar -`. +`. Session retry policies @@ -441,7 +441,7 @@ The following stats exist for scrapy-zyte-api session management: ``scrapy-zyte-api/sessions/pools/{pool}/init/failed`` Number of times that initializing a session for pool ``{pool}`` resulted in - an :ref:`unsuccessful response `. + an :ref:`unsuccessful response `. ``scrapy-zyte-api/sessions/pools/{pool}/init/param-error`` Number of times that initializing a session for pool ``{pool}`` triggered @@ -473,7 +473,7 @@ The following stats exist for scrapy-zyte-api session management: ``scrapy-zyte-api/sessions/pools/{pool}/use/failed`` Number of times that a request that used a session from pool ``{pool}`` - got an :ref:`unsuccessful response `. + got an :ref:`unsuccessful response `. ``scrapy-zyte-api/sessions/use/disabled`` Number of processed requests for which session management was disabled. diff --git a/docs/usage/stats.rst b/docs/usage/stats.rst index 392838e4..3d2e7039 100644 --- a/docs/usage/stats.rst +++ b/docs/usage/stats.rst @@ -9,11 +9,11 @@ Stats from :doc:`python-zyte-api ` are exposed as For example, ``scrapy-zyte-api/status_codes/`` stats indicate the status code of Zyte API responses (e.g. ``429`` for :ref:`rate limiting -` or ``520`` for :ref:`temporary download errors -`). 
+` or ``520`` for :ref:`temporary download errors +`). .. note:: The actual status code that is received from the target website, i.e. the :http:`response:statusCode` response field of a :ref:`Zyte API - successful response `, is accounted for in + successful response `, is accounted for in the ``downloader/response_status_count/`` stat, as with any other Scrapy response. diff --git a/scrapy_zyte_api/_annotations.py b/scrapy_zyte_api/_annotations.py index 0c87c5d8..20336b59 100644 --- a/scrapy_zyte_api/_annotations.py +++ b/scrapy_zyte_api/_annotations.py @@ -4,7 +4,7 @@ class ExtractFrom(str, Enum): """:ref:`Annotation ` to specify the :ref:`extraction source - ` of an automatic extraction :ref:`input `, + ` of an automatic extraction :ref:`input `, such as :class:`~zyte_common_items.Product` or :class:`~zyte_common_items.Article`. diff --git a/scrapy_zyte_api/responses.py b/scrapy_zyte_api/responses.py index 1a8c4cef..dd5cb55a 100644 --- a/scrapy_zyte_api/responses.py +++ b/scrapy_zyte_api/responses.py @@ -57,7 +57,7 @@ def replace(self, *args, **kwargs): def raw_api_response(self) -> Optional[Dict]: """Contains the raw API response from Zyte API. - For the full list of parameters, see :ref:`zyte-api-reference`. + For the full list of parameters, see :ref:`zapi-reference`. """ return self._raw_api_response