Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix search engine API response handling #22075

Closed
wants to merge 1 commit into from

Conversation

biskweet
Copy link

Added manual escaping of HTML entities before automatic replacement.
Now, "Ubuntu 22.04.5 LTS ("Jammy Jellyfish")" becomes "Ubuntu 22.04.5 LTS (\"Jammy Jellyfish\")" instead of "Ubuntu 22.04.5 LTS ("Jammy Jellyfish")".
Closes #22074.

@glassez glassez requested a review from Chocobo1 December 28, 2024 05:20
@glassez glassez added the Search engine Issues related to the search engine/search plugins functionality label Dec 28, 2024
@glassez
Copy link
Member

glassez commented Dec 28, 2024

Fix issue #22074 is incorrect commit/PR title according to qBittorrent contribution guidelines.

@glassez
Copy link
Member

glassez commented Dec 28, 2024

IMO, the problem is that htmlentitydecode() is automatically applied in retrieve_url() independently from retrieved data format. Either retrieve_url() should check data format before apply it or applying it should be matter of plugins themselves.
@Chocobo1, what do you think?

@Chocobo1
Copy link
Member

Either retrieve_url() should check data format before apply it or applying it should be matter of plugins themselves.

I agree with the latter option.
Ideally dataStr = htmlentitydecode(dataStr) should be removed but I'm not sure whether it will break existing plugins. If it will, then the function could add a parameter to control whether htmlentitydecode(dataStr) is invoked and it defaults to True. The plugin can choose to turn it off.
Or we can just remove htmlentitydecode(dataStr) and require plugins to update/follow.

@biskweet
Copy link
Author

biskweet commented Dec 28, 2024

If it will, then the function could add a parameter to control whether htmlentitydecode(dataStr) is invoked and it defaults to True. The plugin can choose to turn it off.

If we are willing to change the function signature then maybe it would be simpler to add a boolean in arguments that defaults to false, and which determines whether we want to manually escape " or not before invoking htmlentitydecode. Something like

def retrieve_url(url: str, custom_headers: Mapping[str, Any] = {}, request_data: Optional[Any] = None, should_escape_quotes=False) -> str:
    # ...

    if should_escape_quotes:
        dataStr = dataStr.replace('"', '\\"')
    dataStr = htmlentitydecode(dataStr)
    return dataStr

Thus allowing plugins to control this behavior. I don't think applying htmlentitydecode is a problem per se since it's generally needed. It's just that in this particular case, it breaks JSON correctness.

@biskweet biskweet changed the title Fix issue #22074 Fix seach engine API response handling Dec 28, 2024
@glassez
Copy link
Member

glassez commented Dec 28, 2024

It's just that in this particular case, it breaks JSON correctness.

For me, the need to decode HTML entities is a special case (although it may be more common than others), and in the general case it can retrieve data in arbitrary format (not only HTML), which should keep HTML entities as-is.
Anyway, any assumptions about retrieved data is matter of specific plugins. Even if 90% of plugins would call htmlentitydecode() after retrieve_url() it doesn't seem bad for me.

the function could add a parameter to control whether htmlentitydecode(dataStr) is invoked and it defaults to True. The plugin can choose to turn it off.
Or we can just remove htmlentitydecode(dataStr) and require plugins to update/follow.

I would choose first option for v5.0.x and v5.1.x, and second one for v5.2.x and above.

@biskweet
Copy link
Author

biskweet commented Dec 28, 2024

For me, the need to decode HTML entities is a special case (although it may be more common than others), and in the general case it can retrieve data in arbitrary format (not only HTML), which should keep HTML entities as-is.

I agree it should be a togglable option, however I don't think it should be all-or-nothing. Almost any request to apibay.org yields JSON results containing HTML entities (at least &, try this query with Moana 2, warning NSFW results). I believe it is important to correctly escape quotes while still parsing entities.

Another, maybe better option would be to parse these entities later in the process, when using the JSON data in the UI. That would make more sense since parsing entities is just a matter of making data human-readable.

@Chocobo1
Copy link
Member

I would choose first option for v5.0.x and v5.1.x, and second one for v5.2.x and above.

Just a side note. Let's limit the change only for >= 5.1.x. Backporting doesn't seem like a good idea.

@glassez
Copy link
Member

glassez commented Dec 29, 2024

Backporting doesn't seem like a good idea.

Why, considering that existing plugins are not supposed to be affected?
However, I don't mind (as long as the v5.1 release doesn't take too long).

@Chocobo1
Copy link
Member

Chocobo1 commented Dec 29, 2024

Why, considering that existing plugins are not supposed to be affected?

The fastest way to fix #22074 is for the plugin to fetch the web data by itself (duplicate/copy the code) and not rely on qbt helpers. Either backport to v5.0 or releasing v5.1 will still take a lot of time to reach users.
Also I would like to avoid breaking anything (in v5.0) if something goes wrong.

@xavier2k6 xavier2k6 changed the title Fix seach engine API response handling Fix search engine API response handling Dec 29, 2024
@Chocobo1
Copy link
Member

Chocobo1 commented Jan 4, 2025

The fastest way to fix #22074 is for the plugin to fetch the web data by itself (duplicate/copy the code) and not rely on qbt helpers.

@biskweet
It seems that we (main contributors) agreed on this course. Would you mind making the changes to the plugin directly instead? The affected plugins is located in another repo: https://github.com/qbittorrent/search-plugins/blob/master/nova3/engines/piratebay.py
As for helpers.py, I'll take care of it.

@biskweet
Copy link
Author

biskweet commented Jan 7, 2025

Created a new pull request here on the qbittorrent/search-plugins repo.

Chocobo1 pushed a commit to qbittorrent/search-plugins that referenced this pull request Jan 12, 2025
## Following
- [this issue](qbittorrent/qBittorrent#22074) on the main qBittorrent repository
- and the discussion on [this subsequent pull request](qbittorrent/qBittorrent#22075) 

Here is the fix for the piratebay search engine. A gist of the code is available [here](https://gist.github.com/biskweet/f06ff7b260ef1ce3a31d27ac1a9edcbf) for testing.

## Recalling the problem:
API apibay.org returns weird JSON that causes the piratebay search engine to crash when handling its response. If some search results contain `"` (quotation marks) characters, the server escapes them by replacing `"` with `"` HTML entities in order to still provide a syntactically valid JSON response. While this is not incorrect, it would be best if apibay.org returned properly escaped quotes, i.e. using backslashes.

When handling the response data, functions [`retrieve_url`](https://github.com/LightDestory/qBittorrent-Search-Plugins/blob/master/src/helpers.py#L75-L117) and [`htmlentitydecode`](https://github.com/LightDestory/qBittorrent-Search-Plugins/blob/master/src/helpers.py#L75-L117) blindly unescape all entities thereby corrupting previously valid JSON. As a consequence, `json.loads` crashes. For example:
```json
{
  "title": "Ubuntu 22.04.5 LTS ("Jammy Jellyfish")"
}
```
becomes
```json
{
  "title": "Ubuntu 22.04.5 LTS ("Jammy Jellyfish")"
}
```

## Solution proposed
We no longer use the `retrieve_url` function -- instead, I created a dedicated `retrieve_url` function (which is almost a copy-paste of the original) that fixes the problem by manually escaping quotes *before* escaping the rest of the data.

PR #331.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Search engine Issues related to the search engine/search plugins functionality
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Search plugins fail to parse API from apibay.org due to invalid handling of JSON response
3 participants