Note:
=====
This document is old. Instead, see:
http://pubsubhubbub.googlecode.com/svn/trunk/pubsubhubbub-core-0.1.html
============================================================================
>> Overview
============================================================================
An open, simple web-scale pubsub protocol, along with an open source
reference implementation targeting Google App Engine. Notably,
however, nothing in the protocol is centralized, or Google- or App
Engine-specific. Anybody can play.
As opposed to more developed (and more complex) pubsub specs like
XEP-0060, this spec's base profile (the barrier-to-entry to speak it)
is dead simple. The fancy bits required for high-volume publishers
and subscribers are optional. The base profile is HTTP-based, as
opposed to XMPP (see more on this below).
To dramatically simplify the spec in several places where we had to
choose between supporting A or B, we took it upon ourselves to say
"only A", rather than making it an implementation decision.
We offer this spec in hopes that it fills a need or at least advances
the state of the discussion in the pubsub space. Polling sucks. We
think a decentralized pubsub layer is a fundamental, missing layer in
the Internet architecture today and its existence, more than just
enabling the obvious lower latency feed readers, would enable many
cool applications, most of which we can't even imagine. But we're
looking forward to decentralized social networking.
<MOVED TO XML>
============================================================================
>> Terminology
============================================================================
Topic: an Atom feed URL. The unit to which one can subscribe to
changes. RSS isn't supported for simplicity. Further, the spec
currently only addresses public (unauthenticated) Atom feed URLs.
Pubsub Hub ("the hub"): the server (URL) which implements this protocol.
We're currently implementing this and running a server at
http://pubsubhubbub.appspot.com/ that's at least for now open for anybody
to use, as either a publisher or subscriber. Any hub is free to
implement its own policies on who can use it.
Publisher: an owner of a topic. Notifies the pubsub hub when the topic
(Atom feed) has been updated. Just notifies that it _has_ been updated,
but not how. As in almost all pubsub systems, the publisher is unaware
of the subscribers, if any.
Subscriber: an entity (person or program) that wants to be notified of
changes on a topic. Must be directly network-accessible, not behind
a NAT. PubSubHubbub is a server-to-server protocol. If you're behind a
NAT, you're a client, out-of-scope for this protocol. (Browser channels
or long-polling a server would be more appropriate for you.)
Subscription: a tuple (Topic URL, Subscriber). For network-accessible
subscribers, the subscription's unique key is actually the tuple
(Topic URL, Subscriber Callback URL). For NAT'd subscribers,
the unique key for a subscription is (Topic URL, SubscriberToken).
In both cases, subscriptions may (at the hub's decision) have expiration
times akin to DHCP leases and then must be renewed.
Event: an event that's visible to multiple topics. For each event
that happens (e.g. "Brad posted to the Linux Community."), multiple
topics could be affected (e.g. "Brad posted." and "Linux community
has new post"). Publisher events update topics, and the hub looks
up all subscriptions for all affected topics, sending out
notifications to subscribers.
Notification: a delta on a topic, computed by the hub and sent to all
subscribers. (TBD: format of this delta. likely: an Atom feed
itself with just the new or changed stuff, and gravestones for
removed items?) The notification can be the result of a publisher
telling the hub of an update, or the hub proactively polling a topic
feed, perhaps for a subscriber subscribing to a topic that's not
pubsub-aware. Note also that a notification to a subscriber can be
a payload consisting of updates for multiple topics. Publishers MAY
choose to send multi-topic notifications as an optimization for
heavy subscribers, but subscribers MUST understand them.
</MOVED TO XML>
<FAQ>
============================================================================
>> Notes:
============================================================================
* There is no relationship or hierarchy between topics. In the future
such an Atom extension could exist, but that's entirely out of this
spec, both now and then. Non-goal. If a publisher wants to offer
a hierarchy, they need to offer 'n' Atom feeds.
* For HTTP callback subscribers, the add-subscription part of the
protocol requires that the hub verifies (via a pingback: "did you
really mean that?") before actually adding the subscription. This
is to prevent people from DoS'ing each other by subscribing victims
to many and/or high-volume publishers.
* In the same way OpenID was bootstrappable with a simple <link> tag, it
should be similarly easy for publishers to delegate to their pubsub hub
with a simple link tag. Example:
<link rel="hub.subscribe" href="http://pubsubhubbub.com/subscribe" />
* Multi-protocol would be nice, but simple would probably win... HTTP
only at first. XMPP later. XMPP has a few advantages, but really
only authentication. A good HTTP implementation can do long polling,
pingbacks, etc.
* Loops. Perhaps Atom child element (repeated) of all the Atom Entry
IDs that entry used to be or came from or is. Neat to see the
HTTP-like TRACE. (perhaps extension to Atom, not part of this spec)
(** Looked it up, and <atom:link rel="via"> implies this, but it
only works if all of the feeds in the trace correctly supply a 'via'
tag. Then it's on the client to iteratively follow the trace).
</FAQ>
============================================================================
>> High-level protocol flow:
============================================================================
<MOVED>
* Publishers POST a ping to their hub(s) URLs when their topic(s)
change.
* Subscribers POST to one or more of the advertised hubs for a topic they're interested in. Alternatively, some hubs may offer auto-polling capability, to let {their,any} subscribers subscribe to topics which don't advertise a hub.
* The hub caches minimal metadata (id, date, entry digest) about each topic's previous state. When the hub refetches a topic feed (on its own initiative or as a result of a publisher's ping) and finds a delta, it enqueues a notification to all registered subscribers. Subscribers can be notified of topic deltas in a variety of ways:
</MOVED>
<APPENDIX>
- In the base profile, subscribers must be directly network accessible (not behind a NAT),
running a listening webserver, and can receive an HTTP callback to notify them their topic
changed. To avoid authentication issues with HTTP, this
callback doesn't include any payload but rather just a note for
the subscriber to check the hub for the topic URL (which
presumably they trust, if they subscribed to it in the first
place). In the future, this HTTP callback could include a
signed (OAuth?) payload, avoiding the need for the extra HTTP
request in the other direction. In any high transaction
scenario, though, it's hoped that all parties (hub, publisher,
subscriber) would make proper use of HTTP Keep-Alive
connections, negating the ugliest part of the multiple HTTP
requests (new TCP connections: 3-way handshake, slow start,
ephemeral port exhaustion, etc).
- Also in the base profile, but slightly lower priority for us
implementation-wise, is support for NAT'd subscribers unable
to run a publicly accessible listening webserver. Instead,
these subscribers need to connect to the hub to retrieve their
enqueued notifications. A smart hub implementation here would
support HTTP long-polling (aka "comet") so the client doesn't
need to make HTTP requests often to get low-latency updates.
(TODO/FUTURE: define recommendations for this long-polling behavior
on both client and server: ideally server just does it, hanging
after the GET, but then what's the recommendation for the client's
HTTP client timeout value, which might not be under their control?
Ignore that and document it? Separate URL for long polling?
Then subscriber caches hub's long-polling ability? Server includes
X- header to signal that it did or wants to do long polling?)
- Fancier implementations may choose to use HTTP long polling
("comet") or XMPP. We're punting on this for now in the
interest of getting something basic working for the common case.
</APPENDIX>
============================================================================
>> Atom details
============================================================================
Notification and source formats will be Atom. More detail follows this example.
<atom:feed>
# ... source, title, etc ...
<link rel="hub.subscribe" href="http://myhub.com/subscribe" />
<link rel="self" href="http://publisher.com/happycats.xml" />
<updated>2008-08-11T02:15:01Z</updated>
# Example of a full entry.
<entry>
<title>Heathcliff</title>
<link href="http://publisher.com/happycat25.xml" />
<id>http://publisher.com/happycat25.xml</id>
<updated>2008-08-11T02:15:01Z</updated>
<content>
What a happy cat. Full content goes here.
</content>
</entry>
# Example of an entry that isn't full/is truncated. This is implied
# by the lack of a <content> element and a <summary> element instead.
<entry>
<title>Heathcliff</title>
<link href="http://publisher.com/happycat25.xml" />
<id>http://publisher.com/happycat25.xml</id>
<updated>2008-08-11T02:15:01Z</updated>
<summary>
What a happy cat!
</summary>
</entry>
# Meta-data only; implied by the lack of <content> and <summary> elements.
<entry>
<title>Garfield</title>
<link rel="alternate" href="http://publisher.com/happycat24.xml" />
<id>http://publisher.com/happycat24.xml</id>
<updated>2008-08-11T02:15:01Z</updated>
</entry>
# Context entry that's meta-data only and not new. Implied because the
# update time on this entry is before the //atom:feed/updated time.
<entry>
<title>Nermal</title>
<link rel="alternate" href="http://publisher.com/happycat23s.xml" />
<id>http://publisher.com/happycat23s.xml</id>
<updated>2008-07-10T12:28:13Z</updated>
</entry>
</atom:feed>
The publisher decides whether to include the full body, a truncated
body, or only the metadata of the most recent event(s). One of:
URL + metadata
URL + metadata + truncated
URL + metadata + full
The trade-off between including all content in outgoing notifications
or having the thundering herd (by clients who fetch the
//atom:feed/entry/link in response to a notification) is up to the
publisher.
Entries of most recent 10 events (for recipient to know whether or not
they'd missed any recent items... like TCP SACK) will be provided as
context. This is implied by the difference between the
//atom:feed/updated field and the //atom:feed/entry/updated
fields. The //atom:feed/updated field will be set to the time of the
*oldest* <entry> in the list that is new. All <entry> items with
<updated> times before then are context; all with times equal to or
after are new. This also lets subscribers know how long it has been
from when the notification was first sent by the publisher to when
they actually received it from the hub.
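As a rough illustration of the rule above, the sketch below splits a notification's entries into "new" and "context" by comparing each entry's <updated> time against the feed-level <updated> time. It assumes the feed declares the standard Atom namespace and uses only "Z"-suffixed timestamps; the names split_entries, parse_time, and SAMPLE are hypothetical and not part of this spec.

```python
import xml.etree.ElementTree as ET
from datetime import datetime

ATOM = "{http://www.w3.org/2005/Atom}"

# Hypothetical two-entry notification, modeled on the example feed above.
SAMPLE = """<feed xmlns="http://www.w3.org/2005/Atom">
  <updated>2008-08-11T02:15:01Z</updated>
  <entry>
    <title>Heathcliff</title>
    <updated>2008-08-11T02:15:01Z</updated>
  </entry>
  <entry>
    <title>Nermal</title>
    <updated>2008-07-10T12:28:13Z</updated>
  </entry>
</feed>"""


def parse_time(text):
    # Assumption: only the "Z" (UTC) timestamp form is handled in this sketch.
    return datetime.strptime(text.strip(), "%Y-%m-%dT%H:%M:%SZ")


def split_entries(feed_xml):
    """Return (new_titles, context_titles) for one notification feed."""
    feed = ET.fromstring(feed_xml)
    cutoff = parse_time(feed.findtext(ATOM + "updated"))
    new, context = [], []
    for entry in feed.findall(ATOM + "entry"):
        updated = parse_time(entry.findtext(ATOM + "updated"))
        title = entry.findtext(ATOM + "title")
        # Entries at or after the feed-level time are new; older ones are context.
        (new if updated >= cutoff else context).append(title)
    return new, context
```

A subscriber could call split_entries(SAMPLE) and process only the first list, using the second list to check whether it missed earlier items.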
The //atom:feed/link[@rel="self"] element will indicate the original
URL for the entire event stream with no truncation (if available).
The //atom:feed/link[@rel="hub.delegate"] element indicates the URL
that the hub should use for retrieving new notifications from a
publisher. The publisher can make this delegate URL contain a
meta-data only or truncated view of the feed. If a hub.delegate is not
provided, then the 'self' URL is used as both the source of
notifications and the source for the topic URL feed.
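A minimal sketch of how a hub might pick the URL to fetch, following the fallback described above: use the hub.delegate link if present, otherwise the 'self' link. It assumes a namespaced Atom feed; feed_links, fetch_url, and both sample feeds are illustrative names and data, not part of this spec.

```python
import xml.etree.ElementTree as ET

ATOM = "{http://www.w3.org/2005/Atom}"

# Hypothetical feeds: one without and one with a hub.delegate link.
PLAIN = """<feed xmlns="http://www.w3.org/2005/Atom">
  <link rel="hub.subscribe" href="http://myhub.com/subscribe" />
  <link rel="self" href="http://publisher.com/happycats.xml" />
</feed>"""

DELEGATED = """<feed xmlns="http://www.w3.org/2005/Atom">
  <link rel="self" href="http://publisher.com/happycats.xml" />
  <link rel="hub.delegate" href="http://publisher.com/all-topics.xml" />
</feed>"""


def feed_links(feed_xml):
    """Map each feed-level link rel to its list of hrefs (entry links are ignored)."""
    feed = ET.fromstring(feed_xml)
    links = {}
    for link in feed.findall(ATOM + "link"):
        links.setdefault(link.get("rel", "alternate"), []).append(link.get("href"))
    return links


def fetch_url(feed_xml):
    """URL the hub should pull: the delegate if advertised, else the 'self' URL."""
    links = feed_links(feed_xml)
    delegates = links.get("hub.delegate")
    if delegates:
        return delegates[0]
    return links["self"][0]
```

The same feed_links helper also exposes the hub.subscribe hrefs, which is what a subscriber would read to discover the hub.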
Topic URLs must be unique, but multiple topics may use the same
hub.delegate. In this situation, the delegate URL may serve a
<OLD_INFO>MIME multipart response, each part of which will contain a
separate Atom document for an individual topic</OLD_INFO>. The hub
must understand this delegation. Once it has fetched the topic URL
once to see this delegation is present, it will use the delegation url
to pull the feed. This allows the publisher to be more efficient at
publishing across many topics at once with a single fetch from the
hub.
TODO: How do you indicate to the hub that you no longer want to have a
delegate URL?
Requirement is that topic URLs and delegate URLs can never overlap!
More info on atom:link tag meanings here:
http://intertwingly.net/wiki/pie/LinkTagMeaning
============================================================================
>> Subscribing
============================================================================
There are multiple ways to subscribe, depending on the type and
needs of the subscriber. Roughly, the types are as follows:
1. Internet-accessible subscriber using HTTP callback
(new subscriptions need to be verified to prevent using
the hub to DoS others)
1.1. verification synchronously
1.2. verification asynchronously ("deferred")
2. NAT'd subscribers or those without an HTTP server
(no verification necessary)
Flow for subscription, using the following example URLs:
http://subr.com/notify-callback.php
http://pubr.com/happycats.xml
http://hub.com/hubpoint
1. Subr does POST to /hubpoint with payload:
& hub.mode=subscribe
& hub.callback = http://subr.com/notify-callback.php
& hub.topic = http://pubr.com/happycats.xml
(may be repeated for large subscriptions)
& hub.verify = async,sync
& hub.verify_token = [opaque]
The hub.verify is an optional comma-separated list of the
subscriber's ordered preferences and capabilities,
verification-wise. One of:
sync -- Subr only supports synchronous verification.
async -- Subr only supports async verification.
WARNING: it's not required that servers support
async, so this type of subscription may fail.
sync,async -- Subr prefers sync to async.
async,sync -- Subr prefers async to sync.
The optional hub.verify_token is opaque to the hub and is simply
echoed back to the subscriber in the verification request.
Subscribers can put whatever they want in it: database primary
keys, encrypted data, etc... anything that makes processing the
hub.mode=subverify request easier.
2. Hub sends new request "oh do you want this topic?" to
/notify-callback.php with x-requester-ip: 1.2.3.4 (so DoSing
clients can be detected).
POST /notify-callback.php
Host: subr.com
hub.mode=subverify &
hub.topic=whatever
NOTE: Maybe this should be a GET to the callback URL instead of a POST, since
it represents a steady state for the subscriber? We should probably be rigid
about the 204 here, if possible; otherwise it's really hard to differentiate
between a callback success and just pointing at a random good page on the
web that will return a 200 no matter what you throw at it.
3. Subr says, "yes, I really do want this topic":
HTTP/1.1 204 No Content
4. Hub responds to Subr with "okay". Either 204 if the
subscription(s) were verified and created, or 202 if the
subscriptions were enqueued to be verified later.
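The subscriber's side of steps 1 to 3 could look roughly like the sketch below: one function builds the form-encoded subscribe payload, and another decides how to answer the hub's hub.mode=subverify request. The function names, the pending_topics set, and the 400/404 responses for unexpected requests are assumptions made for illustration; this document only specifies the 204 for "yes, I want this topic".

```python
from urllib.parse import urlencode, parse_qs


def build_subscribe_body(callback, topics, verify=None, verify_token=None):
    """Form-encode the step 1 payload; hub.topic repeats once per topic."""
    params = [("hub.mode", "subscribe"), ("hub.callback", callback)]
    params += [("hub.topic", topic) for topic in topics]
    if verify is not None:            # optional, e.g. "async,sync"
        params.append(("hub.verify", verify))
    if verify_token is not None:      # optional, opaque to the hub
        params.append(("hub.verify_token", verify_token))
    return urlencode(params)


def handle_subverify(body, pending_topics):
    """Return the HTTP status the callback sends for a verification request."""
    params = parse_qs(body)
    if params.get("hub.mode") != ["subverify"]:
        return 400  # assumption: not specified in this document
    topics = params.get("hub.topic", [])
    if topics and all(topic in pending_topics for topic in topics):
        return 204  # "yes, I really do want this topic"
    return 404      # assumption: refuse topics we never asked for
```

For example, build_subscribe_body("http://subr.com/notify-callback.php", ["http://pubr.com/happycats.xml"], verify="async,sync") yields the body to POST to /hubpoint.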
TODO: Somewhere in here we should require the subscriber to re-confirm their subscription after a certain amount of time. We need to convey to them what the expiration period of their subscription will be.
If verification is being done asynchronously, steps 2 and 3 above are
skipped and Hub's 2xx response in step 4 is really just saying,
"Potential subscription enqueued for later verification."
Publisher must provide synchronous capability at a minimum.
Sub | Situation                                              | Results
SA  | fetch succeed                                          | 204 (no content)
SA  | fetch fail or server prefers async, async logged       | 202 (accepted): best effort. min 1 retry in the future only.
SA  | fetch fail, async not supported                        | 501 (not implemented)
AS  | async supported                                        | 202 (accepted): best effort.
AS  | async not supported (or not preferred) + fetch success | 204 (no content): success!
AS  | async not supported (or not preferred) + fetch failure | 5xx
S   | fetch succeed                                          | 204 (no content): success!
S   | fetch failed                                           | 5xx
A   | server supports                                        | 202 (accepted): best effort later
A   | server doesn't support                                 | 501 (not implemented)
TODO: 5xx on fetch failure isn't clear enough. Maybe we should use 409 ("Conflict") to indicate when a synchronous subscription request tries to confirm and fails. Then it's clearly the requestor's fault and not a server error.
In the case of temporary server error, the server should return 503.
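The response table can be read as a small decision function. The sketch below is one possible encoding of it, not a normative one: it ignores the "server prefers async" nuance, and because the table only says "5xx" for a failed synchronous fetch, it uses 500 as a placeholder. subscribe_status and FETCH_FAILURE_STATUS are hypothetical names.

```python
# Assumption: the table only says "5xx"; 500 is a placeholder for this sketch.
FETCH_FAILURE_STATUS = 500


def subscribe_status(verify, async_supported, sync_verified):
    """Map a hub.verify preference and the hub's situation to a response code.

    verify          -- "sync", "async", "sync,async" or "async,sync"
    async_supported -- whether this hub can defer verification
    sync_verified   -- whether the synchronous callback fetch succeeded
    """
    prefs = [part.strip() for part in verify.split(",")]
    if prefs == ["sync"]:
        return 204 if sync_verified else FETCH_FAILURE_STATUS
    if prefs == ["async"]:
        return 202 if async_supported else 501
    if prefs == ["sync", "async"]:
        if sync_verified:
            return 204
        return 202 if async_supported else 501
    if prefs == ["async", "sync"]:
        if async_supported:
            return 202
        return 204 if sync_verified else FETCH_FAILURE_STATUS
    return 400  # bogus verify option
```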
============================================================================
>> Subscribe Protocol
============================================================================
POST
http://publisher.com/subpoint?
callback=http://subscriber.com/callback.php
topic=http://publisher.com/foo.xml
async={AS, SA, A, S}
mode=unsubscribe (optional: default is 'subscribe')
Error cases:
* If callback is invalid: TODO
* If topic isn't handled by this pubsubhub: TODO
- Probably if it's an unknown topic, issue a 404
* Async option is bogus (400 bad request)
TODO: What about support for multi-part data for the subscriber? For
very simple subscribers, we probably don't even want to do multipart
form-data, because it's more complex to parse? Or is it a minimum
requirement that the post body will always be multipart?
============================================================================
>> Publishing
============================================================================
Overview:
A publisher pings the hub with the URL(s) which have been updated
and the hub schedules them to be fetched & diffed. Because it's
just a ping to wake up the hub, no authentication from the publisher
is required.
Protocol:
POST
http://pubsubhubbub.com/hubpoint?
hub.mode=publish &
hub.url=http://publisher.com/topic1.xml &
hub.url=http://publisher.com/topic2.xml &
...
The 'hub.url' field can be repeated for any combination of topic URLs
or delegate URLs. The hub should deal properly with duplicate URLs.
Response cases:
* Topic(s) known/accepted. -> 204 No content.
* Topic(s) unknown/unaccepted -> 4xx Bad Request / Forbidden.
This will enqueue a feed-fetch for sometime in the future, followed by
pushing the new notifications of potential deltas to all subscribers.
The hub may decide to combine this publish notification with any
earlier publish notifications that have not yet been pushed to
subscribers (this could happen if events are coming in faster than the
hub will allow).
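To make the ping format concrete, here is a small sketch of both ends: a publisher encoding a ping with repeated hub.url fields, and a hub extracting the URLs while collapsing duplicates. The function names are hypothetical, and whether the fields travel in the query string or the POST body is left open, as it is above.

```python
from urllib.parse import urlencode, parse_qs


def build_publish_params(urls):
    """Encode a publish ping; hub.url repeats once per updated URL."""
    return urlencode([("hub.mode", "publish")] + [("hub.url", url) for url in urls])


def urls_to_fetch(encoded):
    """Hub side: pull out the pinged URLs, dropping duplicates but keeping order."""
    params = parse_qs(encoded)
    if params.get("hub.mode") != ["publish"]:
        return []
    # dict.fromkeys preserves first-seen order while removing repeats.
    return list(dict.fromkeys(params.get("hub.url", [])))
```

Each URL returned by urls_to_fetch would then get a single enqueued feed-fetch, no matter how many times it was pinged.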
The hub's GET request of the Atom topic URL may include a Google
Reader Feed-fetcher style thing where there is a statistics header on
the request for the feed every time we pull it. Then the publisher
always knows how many subscribers are on the hub. Example:
GET /foo.xml HTTP/1.1
Host: publisher.com
X-Hub-Subscribers: 120
============================================================================
>> Receive Events
============================================================================
POST
http://subscriber.com/callback.php
Post body will be the Atom notification feed described above. The hub will
keep track of the last N known <atom:id> elements for the topic, and send
updates only for the newest <atom:entry> elements (along with N entries for
context).
The subscriber will know the topic URL by looking at the
//atom:feed/link[@rel="self"] value? Or maybe we'll make it rel="source" for
the notifications?
The subscriber should return 200 or 204 on successful acceptance of
the event. 4xx and 5xx responses will be considered errors (and
delivery will be attempted again later). TODO: What should we do with
3xx responses?
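The hub's handling of callback responses might be sketched as follows. The outcome labels are hypothetical, and anything this section leaves open (3xx and other 2xx codes) is reported as "unspecified" rather than guessed at.

```python
def delivery_outcome(status):
    """Classify a subscriber's HTTP response to an event POST."""
    if status in (200, 204):
        return "delivered"
    if 400 <= status <= 599:
        return "retry-later"   # 4xx and 5xx: attempt delivery again later
    return "unspecified"       # e.g. 3xx, which is still a TODO
```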
===========================================================================
>> Meeting notes from 2008-09-16:
===========================================================================
Priorities:
- ignore for now NAT'ed token polling (requires https anyway)
- ignore for now XMPP (requires XMPP anyway)
- ignore for now huge subscribers:
- multi-topic notifications
- long-lived connections,
- one HTTP in-flight at a time,
- ignore for now huge publishers:
- publishing tons of updated URLs at a time (e.g. Blogger)
- ignore for now (until v2) all authentication issues:
- no pushing payloads to subscribers. send them notification
to poll us instead. perhaps with token.
- ignore for now private Atom URLs/topics. public topics for now.
OAuth or something later.
Keep atomid of all feed entries we've seen on an Atom URL in the past.
(or just the immediate past one perhaps? or 'n' days of them?). keep
(atomid, date, digest)
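One way to use the stored (atomid, date, digest) records is sketched below: hash each fetched entry and report the ids whose digest is new or different from last time. The choice of SHA-1 over the raw entry text, and the names entry_digest and compute_delta, are assumptions for illustration only.

```python
import hashlib


def entry_digest(entry_xml):
    # Assumption: digest the raw entry text with SHA-1; any stable hash would do.
    return hashlib.sha1(entry_xml.encode("utf-8")).hexdigest()


def compute_delta(seen, entries):
    """Return atomids that are new or changed, and update the stored state.

    seen    -- dict mapping atomid -> (date, digest) from the previous fetch
    entries -- list of (atomid, date, entry_xml) from the current fetch
    """
    delta = []
    for atomid, date, entry_xml in entries:
        digest = entry_digest(entry_xml)
        previous = seen.get(atomid)
        if previous is None or previous[1] != digest:
            delta.append(atomid)
        seen[atomid] = (date, digest)
    return delta
```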
Lexicon:
topicid: an Atom URL
topicdeltaid: a diff of two Atom URLs (t1 and t2).
POST /pubber/?topic_url=http://lolcats/lolcatz.xml
SELECT subberid FROM subbers WHERE topicid=? LIMIT 1
("does anybody give a shit?")
If no,
return "Thanks bye! 200 OK!" (optionally tell google
crawlers, based on publisher's preference. TODO: put this
in spec somehow. perhaps reuse the term "noindex"?)
If yes,
enqueue a poll-this-url-later record. one insert. bounded
latency. return 200 OK
Cron:
GET /do-some-work/fetch-updated-feeds-and-find-deltas
pull feed,
compute digests. find ids, dates. compute deltas from our copy of that thing's previous value.
INSERT INTO topicdeltapayloads
SET topicdeltaid="yyyymmhhddmss.mmss:topicid",
payload=..., topicid=....
INSERT INTO topics_what_are_new_but_people_need_to_be_notified
SET topicdeltaid=?, subid-where-i-left-off=""
GET /do-some-work/send-notifications
SELECT topicdeltaid, subid-where-i-left-off FROM topics_what_are_new_but_people_need_to_be_notified LIMIT 500
RANDOMIZE LIST
Foreach topicid:
try-to-get-lock {
SELECT the topicdeltaid payload
SELECT subscribers WHERE topicid = ? AND subid > subid-where-i-left-off
BATCH urlfetch POST to them all,
scatter-gather errors.
For those that fail from the 100-some batch, create
to-do-later (notification) records. increase subid if the
selected count == the previous limit,
else DELETE FROM
topics_what_are_new_but_people_need_to_be_notified WHERE
topicdeltaid = ?
} // end lock
XMPP:
* in the future, if/when App Engine supports it. but it's a special
thingy. HTTP is base and required. XMPP support for pubbers and
subbers is optional.
Polling mode for subscribers:
* a) callbacks won't always work (subscribers behind NATs, etc)
* b) callbacks won't always fit all subscriber's model (not easy for them)
* so must have poll mode.
* in the future: can be long-poll, when App Engine supports it. maybe.
* needs auth
* 1MB payload on responses, so server needs ability to paginate and set "but there's more!" flag w/ continuation token.
The hub notifies all subbers: POST /callback/url/ "yo, something's new for you. don't trust me. fetch: http://pubsubhubbub.appspot.com/poll-for-new-shit/?subid=234&token=23482903482340923849023840923i4"
Large subscribers: (may be v2)
* one in-flight HTTP POST to subscribers at a time. use a memcache lock of 10 seconds or so.
* if another POST is attempted while another is already in flight, enqueue/append the payloadid to a new
table, contentious_or_big_subscriber. still mark that (topicdeltaid, subid) pair as done for the purposes of
/do-some-work/send-notifications
* new do-some-work:
/do-some-work/sent-notifications-to-big-peeps
* optional property on subscriptions for big subscribers to say, "Yo, it's okay to mix my subscriptions together
in one HTTP payload post." in which case it's atom-stream.xml style (updates.sixapart.com) and the payloads are mixed:
Misc notes:
----------------
* can subscribe to anything, regardless of whether or not there are any publishers.
* server's choice whether or not to actually poll proactively for changes vs. getting notified.
Discovery:
--------------
in Atom.xml:
<link rel="hub.subscribe" href="http://pubsubhubbub.appspot.com/subscribe" />
(repeated. client should pick one)
in /index.html
<link rel="hub.subscribe" href="http://pubsubhubbub.appspot.com/publish" />
<link rel="alternate" type="application/atom+xml" href="http://lolcats.xml" />
then bookmarklet to ping the publish URL.
===========================================================================
end meeting notes from 2008-09-16
===========================================================================
=== Open issues... ===
Is there an existing standard for aggregators to specify how many readers they're requesting on behalf of?