2023-05-30 9pm in the middle of the night
===

## Endpoints
* `GET/PUT /episodes`
  * returns only changed episodes
  * parameter `since`
* ~~`GET/PUT /episodes/{guid-hash}`~~
  * Don't allow this endpoint, to prevent problems with duplicate GUIDs
* `GET /subscriptions/{guid}/episodes`
  * parameter `since`
  * parameter `guid`?
* `GET/PUT /subscriptions/{guid}/episodes/{fetch-hash}` (hash: SHA-1?)
  * On a fetch-hash clash, the server is expected to return `400 Bad Request`
  * A hash is used here because GUIDs can be arbitrary strings

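As a sketch of how a client might call the delta endpoint above (the `since` parameter comes from these notes; the base URL and the timestamp format are assumptions, not settled spec):

```python
from urllib.parse import urlencode

def episodes_url(base: str, since: str) -> str:
    # `since` as an ISO-8601 timestamp is an assumption; the spec may
    # settle on epoch seconds instead.
    return f"{base}/episodes?{urlencode({'since': since})}"

print(episodes_url("https://sync.example.org", "2023-05-30T21:00:00Z"))
```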

We want to explain in the spec why we nest endpoints 'under' subscriptions, and why the server might refuse updates (i.e. how this helps avoid the pitfalls of the gPodder API).

## Episode endpoint

The episode endpoint is required to synchronize playback positions and played status for specific episodes. At a minimum, the endpoint should accept and return the following:

1. The episode's **Podcast GUID** (most recent)
2. The episode's **GUID** (sent by the client if found in the RSS feed, or generated by the server if not): a string, not necessarily GUID- or URL-formatted
3. A **Status** field containing lifecycle statuses, e.g.:
   * `New`
   * `Played`
   * `Ignored`
   * `Queued`
4. A **Playback position** marker, updated by a PUT request
5. A **timestamp** of the last time the episode was played/paused (used for conflict resolution on the playback position)
6. A **Favorite** field to mark episodes
7. A **timestamp** of the last time any metadata (except the playback position) was updated

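A minimal payload carrying the fields above might look as follows. This is only a sketch: every field name and the timestamp format are our assumptions, not settled spec.

```python
import json

# Hypothetical episode payload; all field names are illustrative.
episode = {
    "podcast_guid": "917393e3-1b1e-5cef-ace4-edaa54e1f810",
    "guid": "https://example.com/episodes/42",
    "status": "Played",                          # New | Played | Ignored | Queued
    "playback_position": 1234,                   # e.g. seconds into the episode
    "playback_updated": "2023-05-30T21:00:00Z",  # conflict resolution for position
    "favorite": False,
    "metadata_updated": "2023-05-30T21:00:00Z",  # last metadata change (not position)
}
print(json.dumps(episode))
```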
We discussed whether it makes sense to use episode numbers, but they are not part of the feed, so we neither have nor need that information.

https://www.rssboard.org/rss-specification#ltguidgtSubelementOfLtitemgt


### Episode identification
#### Fetch-hash vs GUID
We discussed whether to generate a new (static?) identifier per episode and use that for synchronisation (clients would then have to store it additionally per episode?), or to use existing GUIDs as the sync identifier and generate them where none is present (one endpoint then needs GUIDs to be passed as a hash or Base64-encoded, for REST compliance).

#### Fetch-hash
Fetch-hash creation: SHA-1/MD5 hash of
1. `<guid>` https://www.rssboard.org/rss-specification#ltguidgtSubelementOfLtitemgt
2. `<link>` https://www.rssboard.org/rss-specification#hrelementsOfLtitemgt
3. `<enclosure>` (aka media file URL) https://www.rssboard.org/rss-specification#ltenclosuregtSubelementOfLtitemgt

The priority of the latter two is TBD: `<link>` might be less likely to be unique, while `<enclosure>` might be less stable (more likely to change).
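A sketch of the fallback order (assuming `<guid>` wins, then `<enclosure>` before `<link>`; that ordering is exactly the open priority question):

```python
import hashlib
from typing import Optional

def fetch_hash(guid: Optional[str], link: Optional[str],
               enclosure: Optional[str]) -> Optional[str]:
    # Take the first non-empty source; enclosure-before-link is an
    # assumption, since the notes leave the priority TBD.
    source = guid or enclosure or link
    if not source:
        return None
    return hashlib.sha1(source.encode("utf-8")).hexdigest()
```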

Consideration: why not Base64? (It is REST-compliant and can be decoded again, so the hash wouldn't have to be stored on the server.)
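The Base64 idea could be sketched like this: the server can decode the path segment back to the original GUID, so nothing extra has to be stored. (Function names are ours, not from the spec.)

```python
import base64

def encode_guid(guid: str) -> str:
    # URL-safe alphabet so the token can be used as a path segment
    return base64.urlsafe_b64encode(guid.encode("utf-8")).decode("ascii")

def decode_guid(token: str) -> str:
    return base64.urlsafe_b64decode(token.encode("ascii")).decode("utf-8")
```

One wrinkle to decide on: standard Base64 padding (`=`) in URLs may need extra handling.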

Good practice/required: store all three (GUID, link, media file URL). This allows episodes to be matched later even if one or two of these are missing. For example, if a completely new client connects to a server, and an episode has no GUID and its `<link>` has changed, matching is still possible via the media file URL. (If we don't do this, finding the right episode locally might be hard when receiving a fetch-hash that isn't unique, or a GUID that's missing. We know the podcast, and within each podcast there'll be only a limited set of 'wrong' episodes, so a client would only have to hash a few episodes to find a match. But still, not very economical.)

<details>
  <summary markdown="span">Matching proposal in pseudo-code (click to expand)</summary>

```pseudo-code
are_episodes_equal(client-episode c, server-episode s):
    // this filters out any potential GUID duplicates
    if c.podcast_guid != s.podcast_guid then
        return False

    // if a GUID is present, decide exclusively according to it
    if c.guid not empty then
        return c.guid == s.guid

    // if the enclosure matches, probably the same (since they share the media file)
    if c.enclosure not empty && c.enclosure == s.enclosure then
        return True

    // case: no media file
    if c.enclosure empty then
        // no guid, enclosure or link -> not matchable
        if c.link empty then
            return False

        // no media file, but episode URL matches - very probably the same
        // (how large is the error here?)
        if c.link == s.link then
            return True

    // All other cases: not matching
    return False
```
</details><br>
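For clarity, here is a runnable sketch of the same matching heuristic (the `Episode` type is a stand-in, not a spec object):

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Episode:
    podcast_guid: str
    guid: Optional[str] = None
    enclosure: Optional[str] = None
    link: Optional[str] = None

def are_episodes_equal(c: Episode, s: Episode) -> bool:
    # Different podcasts never match (filters out duplicate GUIDs)
    if c.podcast_guid != s.podcast_guid:
        return False
    # A present GUID decides exclusively
    if c.guid:
        return c.guid == s.guid
    # Shared media file: very probably the same episode
    if c.enclosure and c.enclosure == s.enclosure:
        return True
    # No media file: fall back to the episode page URL
    if not c.enclosure:
        if not c.link:
            return False
        if c.link == s.link:
            return True
    return False
```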

?? Each field that is empty/not present in the RSS is stored & sent empty. ~~The fetch-hash is only used when sending a request about a specific episode.~~ (That wouldn't work well for batch updates; see below.) Payloads don't contain fetch-hashes, only the three separate fields.

Two options for identifying episodes in communication
[I don't think these are the only options, see [here](#Fetch-hash-vs-GUID)]:
* For each episode (e.g. in the queue; batch updates), all three fields/tags are included. A lot of (unnecessary) data exchange.
* Each episode gets a calculated fetch-hash, which is used for communication. Clients can decide to store it or generate it on the fly. (Generating on the fly is dangerous; the episode identifier should stay static even if the episode changes.)

The server creates the fetch-hash, similarly to the creation of the Podcast GUID, based on the logic described above.

Why do we trust the server to create the hash more than the client? Because for each user there's probably just one server in the game, but likely multiple clients. So even if the server messes it up, there's still a single outcome per user.

#### GUID
Why shouldn't the server simply create a GUID (seeded with the available payload or the whole episode; it can also be just random) and send it back to the client? (The client would map the episode using `<enclosure>` and `<link>`, then store this GUID.)
[Advantage: fewer payload fields; only `<enclosure>`, `<link>` and `<guid>`, and after the first sync only `<guid>` (`guid-hash` only for `PUT /subs../{guid}/epis../{guid-hash}`)]
[Further advantage: easier to implement for clients, as they probably already have an `episode_guid` field in their DB]

Only create a GUID if none is present; otherwise use the existing one.
Always identify an episode by `podcast_guid`+`episode_guid` (e.g. when referencing queue items, settings, ...)
[PodcastIndex seems to handle this [the same way](https://podcastindex-org.github.io/docs-api/#get-/episodes/byguid)]

The workflow when a new client connects could then be:
1. Get subscriptions & fetch feeds
2. Get episodes
3. Feed with GUIDs: map by GUID
4. Feed without GUIDs: map via the matching algorithm [[above](#Matching-proposal-in-pseudo-code)], then store the GUID from the sync server
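Step 4 could look roughly like this on the client side (dictionaries stand in for DB rows, and matching is simplified to the enclosure URL instead of the full heuristic):

```python
def adopt_server_guids(feed_eps, server_eps):
    # Index server episodes by enclosure URL (a simplified stand-in
    # for the full matching heuristic)
    by_enclosure = {s["enclosure"]: s["guid"]
                    for s in server_eps if s.get("enclosure")}
    for ep in feed_eps:
        if not ep.get("guid") and ep.get("enclosure") in by_enclosure:
            # Store the server-generated GUID locally
            ep["guid"] = by_enclosure[ep["enclosure"]]
    return feed_eps
```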

#### Deduplication

Two options:
a. agree on deduplication logic as part of the spec, to be executed at server level (hard to 'enforce')
b. let clients figure out deduplication, and spec the calls that allow clients to merge episodes

To be discussed further. The latter is easier for us :-)
The latter should be in the spec in either case, so that we don't have to change the whole spec if some podcast feeds mess up in a way we never anticipated; clients can adapt a lot faster.

#### New GUID/Fetch-hash logic
Necessary for changing GUIDs; can it also be used for deduplication?

Options:
1. `PUT /episodes` with an additional field `old_fetch-hash` (or `old_guid`)
2. `PUT /subscriptions/{guid}/episodes/{guid-/fetch-hash}` with an additional field `new_fetch-hash` (or `new_guid`)

Case where both episodes are contained in the feed (the episode didn't change, but the podcasters published it twice): to mark the duplicate, an additional boolean `is_duplicate` makes the server handle the `fetch-hash`/`guid` of both as aliases (tombstoning one; if either is requested, the aliases are returned in a field/array `aliases`/`duplicate_fetch-hashes/guids`).

In both cases, the server changes the fetch-hash/GUID of the episode entry, sets a `fetch-hash/GUID_changed` timestamp, and creates a tombstone for the old value.
[On `GET /episodes`, the old value is in `fetch-hash`/`guid` and the new value in `new_fetch-hash`/`new_guid`; same behaviour as for Subscriptions]

Case to handle:
1. Client 1 marks {`fetch-hash2`/`guid2`} as the new GUID of {`fetch-hash1`/`guid1`}
2. Client 2 receives & stores this
3. Client 2 marks {`fetch-hash1`/`guid1`} as the new GUID of {`fetch-hash2`/`guid2`}

(This could happen through slightly different podcast feeds, e.g. one feed contains MP3s, the other AACs, but the podcast GUID is the same.)
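Whatever the server stores, resolving an old identifier through its tombstone chain has to guard against exactly this ping-pong case. A sketch (the alias-map representation is our assumption):

```python
def resolve(guid: str, aliases: dict) -> str:
    # Follow old -> new alias links; stop on a revisit, which is what
    # the two-client ping-pong case above would otherwise cause.
    seen = set()
    while guid in aliases and guid not in seen:
        seen.add(guid)
        guid = aliases[guid]
    return guid
```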


## Excursus: database schema in the specs

* We should focus on the format of the communication, not on how the database is stored
* All field data types are specified anyway in the API endpoint specification
* We can leave the proposed database schema in as an example


###### tags: `project-management` `meeting-notes` `OpenPodcastAPI`