design-doc.txt

Note:
=====

This document is old.  Instead, see:

http://pubsubhubbub.googlecode.com/svn/trunk/pubsubhubbub-core-0.1.html


============================================================================
>> Overview
============================================================================

An open, simple web-scale pubsub protocol, along with an open source
reference implementation targeting Google App Engine.  Notably,
however, nothing in the protocol is centralized, or Google- or App
Engine-specific.  Anybody can play.

As opposed to more developed (and more complex) pubsub specs like
XEP-0060, this spec's base profile (the barrier-to-entry to speak it)
is dead simple.  The fancy bits required for high-volume publishers
and subscribers are optional.  The base profile is HTTP-based, as
opposed to XMPP (see more on this below).

To dramatically simplify the spec in several places where we had to
choose between supporting A or B, we took it upon ourselves to say
"only A", rather than making it an implementation decision.

We offer this spec in hopes that it fills a need or at least advances
the state of the discussion in the pubsub space.  Polling sucks.  We
think a decentralized pubsub layer is a fundamental, missing layer in
the Internet architecture today and its existence, more than just
enabling the obvious lower latency feed readers, would enable many
cool applications, most of which we can't even imagine.  But we're
looking forward to decentralized social networking.

<MOVED TO XML>

============================================================================
>> Terminology
============================================================================

Topic: an Atom feed URL.  The unit to which one can subscribe to
  changes.  RSS isn't supported for simplicity.  Further, the spec
  currently only addresses public (unauthenticated) Atom feed URLs.

Pubsub Hub ("the hub"):  the server (URL) which implements this protocol.
  We're currently implementing this and running a server at
  http://pubsubhubbub.appspot.com/ that's at least for now open for anybody
  to use, as either a publisher or subscriber.  Any hub is free to
  implement its own policies on who can use it.

Publisher:  an owner of a topic.  Notifies the pubsub hub when the topic
  (Atom feed) has been updated.  Just notifies that it _has_ been updated,
  but not how.  As in almost all pubsub systems, the publisher is unaware
  of the subscribers, if any.

Subscriber: an entity (person or program) that wants to be notified of
  changes on a topic.  Must be directly network-accessible, not behind
  a NAT.  PubSubHubbub is a server-to-server protocol.  If you're behind
  a NAT, you're a client, out-of-scope for this protocol.  (Browser channels
  or long-polling a server would be more appropriate for you.)

Subscription: a tuple (Topic URL, Subscriber).  For network-accessible
  subscribers, the subscription's unique key is actually the tuple
  (Topic URL, Subscriber Callback URL).  For NAT'd subscribers,
  the unique key for a subscription is (Topic URL, SubscriberToken).
  In both cases, subscriptions may (at the hub's decision) have expiration
  times akin to DHCP leases and then must be renewed.

Event: an event that's visible to multiple topics.  For each event
  that happens (e.g. "Brad posted to the Linux Community."), multiple
  topics could be affected (e.g. "Brad posted." and "Linux community
  has new post").  Publisher events update topics, and the hub looks
  up all subscriptions for all affected topics, sending out
  notifications to subscribers.

Notification: a delta on a topic, computed by the hub and sent to all
  subscribers.  (TBD: format of this delta.  likely: an Atom feed
  itself with just the new or changed stuff, and gravestones for
  removed items?)  The notification can be the result of a publisher
  telling the hub of an update, or the hub proactively polling a topic
  feed, perhaps for a subscriber subscribing to a topic that's not
  pubsub-aware.  Note also that a notification to a subscriber can be
  a payload consisting of updates for multiple topics.  Publishers MAY
  choose to send multi-topic notifications as an optimization for
  heavy subscribers, but subscribers MUST understand them.

</MOVED TO XML>

<FAQ>

============================================================================
>> Notes:
============================================================================

* There is no relationship or hierarchy between topics.  In the future
  such an Atom extension could exist, but that's entirely out of this
  spec, both now and then.  Non-goal.  If a publisher wants to offer
  a hierarchy, they need to offer 'n' Atom feeds.

* For HTTP callback subscribers, the add-subscription part of the
  protocol requires that the hub verifies (via a pingback: "did you
  really mean that?") before actually adding the subscription.  This
  is to prevent people from DoS'ing each other by subscribing victims
  to many and/or high-volume publishers.

* In the same way OpenID was bootstrappable with a simple <link> tag, it
  should be similarly easy for publishers to delegate to their pubsub hub
  with a simple link tag. Example:
    <link rel="hub.subscribe" href="http://pubsubhubbub.com/subscribe" />

* Multi-protocol would be nice, but simple would probably win... HTTP
  only at first.  XMPP later.  XMPP has a few advantages, but really
  only authentication.  A good HTTP implementation can do long polling,
  pingbacks, etc.

* Loops.  Perhaps a (repeated) Atom child element listing all the Atom
  Entry IDs that the entry used to be, came from, or is.  It would be
  neat to see an HTTP-like TRACE.  (Perhaps an extension to Atom, not
  part of this spec.)  (** Looked it up, and <atom:link rel="via">
  implies this, but it only works if all of the feeds in the trace
  correctly supply a 'via' tag.  Then it's on the client to iteratively
  follow the trace.)

</FAQ>

============================================================================
>> High-level protocol flow:
============================================================================

<MOVED>

* Publishers POST a ping to their hub(s) URLs when their topic(s)
  change.

* Subscribers POST to one or more of the advertised hubs for a topic
  they're interested in.  Alternatively, some hubs may offer
  auto-polling capability, to let {their,any} subscribers subscribe to
  topics which don't advertise a hub.

* The hub caches minimal metadata (id, date, entry digest) about each
  topic's previous state.  When the hub refetches a topic feed (on its
  own initiative or as a result of a publisher's ping) and finds a
  delta, it enqueues a notification to all registered subscribers.
  Subscribers can be notified of topic deltas in a variety of ways:

</MOVED>

<APPENDIX>

    - In the base profile, subscribers must be directly network
      accessible (not behind a NAT), running a listening webserver,
      and able to receive an HTTP callback notifying them that their
      topic changed.  To avoid authentication issues with HTTP, this
      callback doesn't include any payload but rather just a note for
      the subscriber to check the hub for the topic URL (which
      presumably they trust, if they subscribed to it in the first
      place).  In the future, this HTTP callback could include a
      signed (OAuth?)  payload, avoiding the need for the extra HTTP
      request in the other direction.  In any high transaction
      scenario, though, it's hoped that all parties (hub, publisher,
      subscriber) would make proper use of HTTP Keep-Alive
      connections, negating the ugliest part of the multiple HTTP
      requests (new TCP connections: 3-way handshake, slow start,
      ephemeral port exhaustion, etc).

    - Also in the base profile, but slightly lower priority for us
      implementation-wise, is support for NAT'd subscribers unable
      to run a publicly accessible listening webserver.  Instead,
      these subscribers need to connect to the hub to retrieve their
      enqueued notifications.  A smart hub implementation here would
      support HTTP long-polling (aka "comet") so the client doesn't
      need to make HTTP requests often to get low-latency updates.
      (TODO/FUTURE: define recommendations for this long-polling behavior
      on both client and server:  ideally server just does it, hanging
      after the GET, but then what's the recommendation for the client's
      HTTP client timeout value, which might not be under their control?
      Ignore that and document it?  Separate URL for long polling?
      Then subscriber caches hub's long-polling ability?  Server includes
      X- header to signal that it did or wants to do long polling?)

    - Fancier implementations may choose to use HTTP long polling
      ("comet") or XMPP.  We're punting on this for now in the
      interest of getting something basic working for the common case.

</APPENDIX>

============================================================================
>> Atom details
============================================================================

Notification and source formats will be Atom. More detail follows this example.

  <atom:feed>
    # ... source, title, etc ...

    <link rel="hub.subscribe" href="http://myhub.com/subscribe" />
    <link rel="self" href="http://publisher.com/happycats.xml" />
    <updated>2008-08-11T02:15:01Z</updated>

    # Example of a full entry.
    <entry>
      <title>Heathcliff</title>
      <link href="http://publisher.com/happycat25.xml" />
      <id>http://publisher.com/happycat25.xml</id>
      <updated>2008-08-11T02:15:01Z</updated>
      <content>
        What a happy cat. Full content goes here.
      </content>
    </entry>

    # Example of an entry that isn't full/is truncated. This is implied
    # by the presence of a <summary> element in place of a <content> element.
    <entry>
      <title>Heathcliff</title>
      <link href="http://publisher.com/happycat25.xml" />
      <id>http://publisher.com/happycat25.xml</id>
      <updated>2008-08-11T02:15:01Z</updated>
      <summary>
        What a happy cat!
      </summary>
    </entry>
    
    # Meta-data only; implied by the lack of <content> and <summary> elements.
    <entry>
      <title>Garfield</title>
      <link rel="alternate" href="http://publisher.com/happycat24.xml" />
      <id>http://publisher.com/happycat24.xml</id>
      <updated>2008-08-11T02:15:01Z</updated>
    </entry>

    # Context entry that's meta-data only and not new. Implied because the
    # update time on this entry is before the //atom:feed/updated time.
    <entry>
      <title>Nermal</title>
      <link rel="alternate" href="http://publisher.com/happycat23.xml" />
      <id>http://publisher.com/happycat23.xml</id>
      <updated>2008-07-10T12:28:13Z</updated>
    </entry>

  </atom:feed>

The publisher decides whether to include the full body, a truncated
body, or only the metadata of the most recent event(s).  One of:

  URL + metadata
  URL + metadata + truncated
  URL + metadata + full

The trade-off between including all content in outgoing notifications
or having the thundering herd (by clients who fetch the
//atom:feed/entry/link in response to a notification) is up to the
publisher.

Entries for the 10 most recent events will be provided as context, so
that the recipient knows whether or not they've missed any recent
items (like TCP SACK). This is implied by the difference between the
//atom:feed/updated field and the //atom:feed/entry/updated
fields. The //atom:feed/updated field will be set to the time of the
*oldest* <entry> in the list that is new. All <entry> items with
<updated> times before then are context; all with times equal to or
after are new. This also lets subscribers know how long it has been
from when the notification was first sent by the publisher to when
they actually received it from the hub.
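
As a rough sketch of the rule above (not part of the spec), a
subscriber could separate new entries from context entries by
comparing each entry's <updated> time with the feed-level <updated>
time.  The function names, the SAMPLE document, and the assumption
that timestamps always look like 2008-08-11T02:15:01Z are illustrative
only.

```python
import xml.etree.ElementTree as ET
from datetime import datetime

ATOM = "{http://www.w3.org/2005/Atom}"


def parse_time(text):
    # Assumption: UTC timestamps with a trailing "Z" and no fractional seconds.
    return datetime.strptime(text.strip(), "%Y-%m-%dT%H:%M:%SZ")


def split_entries(feed_xml):
    """Return (new_ids, context_ids) for an Atom notification document."""
    feed = ET.fromstring(feed_xml)
    cutoff = parse_time(feed.findtext(ATOM + "updated"))
    new_ids, context_ids = [], []
    for entry in feed.findall(ATOM + "entry"):
        entry_id = entry.findtext(ATOM + "id")
        updated = parse_time(entry.findtext(ATOM + "updated"))
        # Entries at or after the feed-level time are new; older ones are context.
        (new_ids if updated >= cutoff else context_ids).append(entry_id)
    return new_ids, context_ids


# Hypothetical notification, trimmed to the fields the rule needs.
SAMPLE = """<feed xmlns="http://www.w3.org/2005/Atom">
  <updated>2008-08-11T02:15:01Z</updated>
  <entry><id>http://publisher.com/happycat25.xml</id>
    <updated>2008-08-11T02:15:01Z</updated></entry>
  <entry><id>http://publisher.com/happycat23.xml</id>
    <updated>2008-07-10T12:28:13Z</updated></entry>
</feed>"""
```

Calling split_entries(SAMPLE) puts the happycat25 entry in the new
list and the happycat23 entry in the context list.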

The //atom:feed/link[@rel="self"] element will indicate the original
URL for the entire event stream with no truncation (if available).

The //atom:feed/link[@rel="hub.delegate"] element indicates the URL
that the hub should use for retrieving new notifications from a
publisher. The publisher can make this delegate URL contain a
meta-data only or truncated view of the feed. If a hub.delegate is not
provided, then the 'self' URL is used as both the source of
notifications and the source for the topic URL feed.
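
A minimal sketch of that fallback, assuming a namespaced Atom document
and hypothetical helper names (link_href, notification_source).  The
delegate URL in the sample is made up for illustration.

```python
import xml.etree.ElementTree as ET

ATOM = "{http://www.w3.org/2005/Atom}"


def link_href(feed, rel):
    """Return the href of the first feed-level <link> with this rel, or None."""
    for link in feed.findall(ATOM + "link"):
        if link.get("rel") == rel:
            return link.get("href")
    return None


def notification_source(feed_xml):
    """URL the hub would fetch to look for new notifications."""
    feed = ET.fromstring(feed_xml)
    # Prefer the delegate when one is advertised; otherwise fall back to 'self'.
    return link_href(feed, "hub.delegate") or link_href(feed, "self")


WITH_DELEGATE = """<feed xmlns="http://www.w3.org/2005/Atom">
  <link rel="self" href="http://publisher.com/happycats.xml" />
  <link rel="hub.delegate" href="http://publisher.com/delegate.xml" />
</feed>"""

SELF_ONLY = """<feed xmlns="http://www.w3.org/2005/Atom">
  <link rel="self" href="http://publisher.com/happycats.xml" />
</feed>"""
```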

Topic URLs must be unique, but multiple topics may use the same
hub.delegate. In this situation, the delegate URL may serve a
<OLD_INFO>MIME multipart response, each part of which will contain a
separate Atom document for an individual topic</OLD_INFO>. The hub
must understand this delegation. Once it has fetched the topic URL
and seen that a delegation is present, it will use the delegate URL
to pull the feed.  This allows the publisher to be more efficient at
publishing across many topics at once with a single fetch from the
hub.

TODO: How do you indicate to the hub that you no longer want to have a
delegate URL?

Requirement is that topic URLs and delegate URLs can never overlap!

More info on atom:link tag meanings here:
  http://intertwingly.net/wiki/pie/LinkTagMeaning

============================================================================
>> Subscribing
============================================================================

There are multiple ways to subscribe, depending on the type and
needs of the subscriber.  Roughly, the types are as follows:

    1. Internet-accessible subscriber using HTTP callback
       (new subscriptions need to be verified to prevent using
        the hub to DoS others)
       1.1. verification synchronously
       1.2. verification asynchronously ("deferred")
    2. NAT'd subscribers or those without an HTTP server
       (no verification necessary)

Flow for subscription, using the following example URLs:

http://subr.com/notify-callback.php
http://pubr.com/happycats.xml
http://hub.com/hubpoint

   1. Subr does POST to /hubpoint with payload:

        hub.mode=subscribe &
        hub.callback=http://subr.com/notify-callback.php &
        hub.topic=http://pubr.com/happycats.xml &
              (may be repeated for large subscriptions)
        hub.verify=async,sync &
        hub.verify_token=[opaque]

      The hub.verify is an optional comma-separated list of the
      subscriber's ordered preferences and capabilities,
      verification-wise.  One of:

          sync  -- Subr only supports synchronous verification.
          async -- Subr only supports async verification.
                   WARNING: it's not required that servers support
                   async, so this type of subscription may fail.
          sync,async -- Subr prefers sync to async.
          async,sync -- Subr prefers async to sync.

      The optional hub.verify_token is opaque to the hub and is simply
      echoed back to the subscriber in the verification request.
      Subscribers can put whatever they want in it: database primary
      keys, encrypted data, etc... anything that makes processing the
      hub.mode=subverify request easier.

   2. Hub sends new request "oh do you want this topic?" to
      /notify-callback.php with x-requester-ip: 1.2.3.4 (so DoSing
      clients can be detected).

        POST /notify-callback.php
        Host: subr.com

        hub.mode=subverify &
        hub.topic=whatever
  
   NOTE: Maybe this should be a GET to the callback URL instead of a POST, since
   it represents a steady state for the subscriber? We should probably be rigid
   about the 204 here, if possible; otherwise it's really hard to differentiate
   between a callback success and just pointing at a random good page on the
   web that will return a 200 no matter what you throw at it.

   3. Subr says, "yes, I really do want this topic":

        HTTP/1.1 204 No Content

   4. Hub responds to Subr with "okay".  Either 204 if the
      subscription(s) were verified and created, or 202 if the
      subscriptions were enqueued to be verified later.
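
Step 1 of the flow above could be sketched as follows.  This assumes
the payload is sent as an ordinary form-encoded POST body, which this
document does not pin down; build_subscribe_body is a hypothetical
helper name.

```python
from urllib.parse import parse_qs, urlencode


def build_subscribe_body(callback, topics, verify=(), verify_token=None):
    """Build the subscribe payload as a form-encoded string."""
    params = [("hub.mode", "subscribe"), ("hub.callback", callback)]
    # hub.topic may be repeated for large subscriptions.
    params += [("hub.topic", topic) for topic in topics]
    if verify:
        # Ordered preference list, e.g. ("async", "sync").
        params.append(("hub.verify", ",".join(verify)))
    if verify_token is not None:
        # Opaque to the hub; echoed back in the verification request.
        params.append(("hub.verify_token", verify_token))
    return urlencode(params)


body = build_subscribe_body(
    "http://subr.com/notify-callback.php",
    ["http://pubr.com/happycats.xml"],
    verify=("async", "sync"),
    verify_token="row-42",
)
```

The resulting string would be POSTed to http://hub.com/hubpoint; a
204 or 202 response then maps to step 4.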

TODO: Somewhere in here we should require the subscriber to re-confirm
their subscription after a certain amount of time. We need to convey
to them what the expiration period of their subscription will be.

If verification is being done asynchronously, steps 2 and 3 above are
skipped and Hub's 2xx response in step 4 is really just saying,
"Potential subscription enqueued for later verification."

Publisher must provide synchronous capability at a minimum.
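
Seen from the subscriber's side, steps 2 and 3 of the flow might look
roughly like this.  Only the 204 "yes" answer comes from this
document; the 400 and 404 refusals and the handle_subverify name are
assumptions made for the sketch.

```python
def handle_subverify(params, wanted_topics):
    """Decide how to answer the hub's "do you want this topic?" request.

    params: dict of decoded form fields from the hub's request.
    wanted_topics: set of topic URLs this subscriber actually asked for.
    """
    if params.get("hub.mode") != "subverify":
        return 400  # assumption: not a verification request at all
    if params.get("hub.topic") not in wanted_topics:
        return 404  # assumption: refuse subscriptions we never requested
    return 204  # "yes, I really do want this topic"


wanted = {"http://pubr.com/happycats.xml"}
```

Refusing unknown topics is what stops a third party from subscribing
a victim's callback URL to high-volume feeds.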

(Sub is the subscriber's hub.verify preference: S = sync, A = async,
SA = sync,async, AS = async,sync.)

Sub | Situation                                              | Result
----+--------------------------------------------------------+--------------------------------
SA  | fetch succeeds                                         | 204 (no content)
SA  | fetch fails or server prefers async, async logged      | 202 (accepted): best effort,
    |                                                        | min 1 retry in the future only
SA  | fetch fails, async not supported                       | 501 (not implemented)
AS  | async supported                                        | 202 (accepted): best effort
AS  | async not supported (or not preferred) + fetch success | 204 (no content): success!
AS  | async not supported (or not preferred) + fetch failure | 5xx
S   | fetch succeeds                                         | 204 (no content): success!
S   | fetch fails                                            | 5xx
A   | server supports async                                  | 202 (accepted): best effort later
A   | server doesn't support async                           | 501 (not implemented)

TODO: 5xx on fetch failure isn't clear enough. Maybe we should use 409
("Conflict") to indicate when a synchronous subscription request tries
to confirm and fails. Then it's clearly the requester's fault and not
a server error.

In the case of temporary server error, the server should return 503.
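
The result rules above can be approximated by walking the subscriber's
preference list in order.  This is a sketch under assumptions: it
ignores the "server prefers async" variant, uses 500 wherever the
rules only say 5xx, and subscribe_status is a hypothetical name.

```python
def subscribe_status(prefs, async_supported, sync_verify):
    """Pick the HTTP status for a subscribe request.

    prefs: ordered list such as ["sync", "async"].
    async_supported: whether this hub can defer verification.
    sync_verify: callable returning True if the synchronous fetch succeeded.
    """
    status = 400  # nothing usable in the preference list
    for mode in prefs:
        if mode == "sync":
            if sync_verify():
                return 204  # verified and created
            status = 500  # the rules only say "5xx"; 500 is an assumption
        elif mode == "async":
            if async_supported:
                return 202  # enqueued for later verification
            status = 501
        else:
            return 400  # bogus option
    # Every option failed: report the failure of the last one tried.
    return status


def ok():
    return True


def fail():
    return False
```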

============================================================================
>> Subscribe Protocol
============================================================================

POST
http://publisher.com/subpoint?
    callback=http://subscriber.com/callback.php
    topic=http://publisher.com/foo.xml
    async={AS, SA, A, S}
    mode=unsubscribe   (optional: default is 'subscribe')
    
  Error cases:
    * If callback is invalid: TODO
    * If topic isn't handled by this pubsubhub: TODO
      - Probably if it's an unknown topic, issue a 404
    * Async option is bogus (400 bad request)

TODO: What about support for multi-part data for the subscriber? For
very simple subscribers, we probably don't even want to do multipart
form-data, because it's more complex to parse? Or is it a minimum
requirement that the post body will always be multipart?

============================================================================
>> Publishing
============================================================================

Overview:

  A publisher pings the hub with the URL(s) which have been updated
  and the hub schedules them to be fetched & diffed.  Because it's
  just a ping to wake up the hub, no authentication from the publisher
  is required.

Protocol:

POST
http://pubsubhubbub.com/hubpoint?
    hub.mode=publish &
    hub.url=http://publisher.com/topic1.xml &
    hub.url=http://publisher.com/topic2.xml &
    ...

  The 'hub.url' field can be repeated for any combination of topic URLs
  or delegate URLs. The hub should deal properly with duplicate URLs.

  Response cases:
    * Topic(s) known/accepted -> 204 No Content.
    * Topic(s) unknown/unaccepted -> 4xx Bad Request / Forbidden.
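
A hub-side sketch of handling the publish ping, assuming a
form-encoded body.  The choice of 400 for a malformed ping and 403 for
an unaccepted topic is an assumption (the notes above only say 4xx),
and handle_publish, accepted_topics and fetch_queue are hypothetical
names.

```python
from urllib.parse import parse_qs


def handle_publish(body, accepted_topics, fetch_queue):
    """Validate a publish ping and schedule its URLs for fetching."""
    params = parse_qs(body)
    if params.get("hub.mode") != ["publish"]:
        return 400
    # Drop duplicate URLs while keeping their original order.
    urls = list(dict.fromkeys(params.get("hub.url", [])))
    if not urls:
        return 400
    if any(url not in accepted_topics for url in urls):
        return 403
    for url in urls:
        if url not in fetch_queue:
            fetch_queue.append(url)  # to be fetched and diffed later
    return 204


accepted = {"http://publisher.com/topic1.xml", "http://publisher.com/topic2.xml"}
queue = []
ping = ("hub.mode=publish"
        "&hub.url=http://publisher.com/topic1.xml"
        "&hub.url=http://publisher.com/topic1.xml"
        "&hub.url=http://publisher.com/topic2.xml")
```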

This will enqueue a feed-fetch for sometime in the future, followed by
pushing the new notifications of potential deltas to all subscribers.
The hub may decide to combine this publish notification with any
earlier publish notifications that have not yet been pushed to
subscribers (this could happen if events are coming in faster than the
hub will allow).

The hub's GET request of the Atom topic URL may include a statistics
header, in the style of Google Reader's feed fetcher, every time the
hub pulls the feed. Then the publisher always knows how many
subscribers are on the hub. Example:

   GET /foo.xml HTTP/1.1
   Host: publisher.com
   X-Hub-Subscribers: 120


============================================================================
>> Receive Events
============================================================================

POST
http://subscriber.com/callback.php

Post body will be the Atom notification feed described above. The hub will
keep track of the last N known <atom:id> elements for the topic, and send
updates only for the newest <atom:entry> elements (along with N entries for
context).

The subscriber will know the topic URL by looking at the
//atom:feed/link[@rel="self"] value? Or maybe we'll make it rel="source" for
the notifications?

The subscriber should return 200 or 204 on successful acceptance of
the event. 4xx and 5xx responses will be considered errors (and
delivery will be attempted again later).  TODO: What should we do with
3xx responses?
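
The delivery rule above, as a tiny sketch.  The delivery_outcome name
and the "undefined" bucket are assumptions; 3xx handling is still an
open TODO, so the sketch deliberately does not decide it.

```python
def delivery_outcome(status):
    """Classify a subscriber's response to a notification POST."""
    if status in (200, 204):
        return "delivered"
    if 400 <= status < 600:
        return "retry"  # error: attempt delivery again later
    return "undefined"  # e.g. 3xx, which these notes leave open
```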

===========================================================================
>> Meeting notes from 2008-09-16:
===========================================================================

Priorities:
   - ignore for now NAT'ed token polling (requires https anyway)
   - ignore for now XMPP (requires XMPP anyway)
   - ignore for now huge subscribers:
        - multi-topic notifications
        - long-lived connections
        - one HTTP in-flight at a time
   - ignore for now huge publishers:
        - publishing tons of updated URLs at a time (e.g. Blogger)
   - ignore for now (until v2) all authentication issues:
        - no pushing payloads to subscribers.  send them notification
          to poll us instead.  perhaps with token.
   - ignore for now private Atom URLs/topics.  public topics for now.
     OAuth or something later.

Keep atomid of all feed entries we've seen on an Atom URL in the past.
(or just the immediate past one perhaps?  or 'n' days of them?).  keep
(atomid, date, digest)
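
One way the (atomid, date, digest) records might be used to find a
delta, as a sketch only.  SHA-1 as the digest and the function names
are assumptions; the notes do not say how the digest is computed.

```python
import hashlib


def entry_record(atomid, date, content):
    """Build the (atomid, date, digest) tuple kept for one feed entry."""
    digest = hashlib.sha1(content.encode("utf-8")).hexdigest()
    return (atomid, date, digest)


def compute_delta(previous, current):
    """Return the records in `current` that are new or whose digest changed."""
    seen = {atomid: digest for atomid, _date, digest in previous}
    return [rec for rec in current if seen.get(rec[0]) != rec[2]]


old = [entry_record("a", "2008-08-10", "one"),
       entry_record("b", "2008-08-10", "two")]
new = [entry_record("a", "2008-08-10", "one"),
       entry_record("b", "2008-08-11", "two, edited"),
       entry_record("c", "2008-08-11", "three")]
```

Here compute_delta(old, new) keeps "b" (changed) and "c" (new) and
drops "a", whose digest is unchanged.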

Lexicon:
   topicid:  an Atom URL
   topicdeltaid:  a diff of two Atom URLs (t1 and t2).

POST /pubber/?topic_url=http://lolcats/lolcatz.xml
     SELECT subberid FROM subbers WHERE topicid=? LIMIT 1
       ("does anybody give a shit?")
      If no,
          return "Thanks bye! 200 OK!"  (optionally tell google
          crawlers, based on publisher's preference.  TODO: put this
          in spec somehow. perhaps reuse the term "noindex"?)
      If yes,
          enqueue a poll-this-url-later record.  one insert.  bounded
          latency.  return 200 OK

Cron:
GET /do-some-work/fetch-updated-feeds-and-find-deltas
     pull feed,
     compute digests.  find ids, dates.  compute deltas from our copy of that thing's previous value.
     INSERT INTO topicdeltapayloads
           SET topicdeltaid="yyyymmhhddmss.mmss:topicid",
                   payload=..., topicid=....
     INSERT INTO topics_what_are_new_but_people_need_to_be_notified
           SET topicdeltaid=?, subid-where-i-left-off=""

 
GET /do-some-work/send-notifications
    SELECT topicdeltaid, subid-where-i-left-off FROM topics_what_are_new_but_people_need_to_be_notified LIMIT 500
    RANDOMIZE LIST
    Foreach topicdeltaid:
       try-to-get-lock {
           SELECT the topicdeltaid payload
           SELECT subscribers WHERE topicid = ? AND subid > subid-where-i-left-off
           BATCH urlfetch POST to them all,
             scatter-gather errors.

           For those that fail from the 100-some batch, create
           to-do-later (notification) records.  increase subid if the
           selected count == the previous limit,
               else DELETE FROM
                   topics_what_are_new_but_people_need_to_be_notified WHERE
                   topicdeltaid = ?

       }  // end lock
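
The cursor logic inside the lock could be sketched in memory like
this.  It assumes integer subids starting above 0 and a post callable
that returns True on success; send_batch and its arguments are
hypothetical, and the datastore, locking, and retry records are left
out.

```python
def send_batch(subscribers, cursor, limit, post):
    """Notify one batch of subscribers for a single topic delta.

    subscribers: list of (subid, callback_url) sorted by subid.
    cursor: the subid where the previous batch left off.
    Returns (new_cursor, failed_subids, done).
    """
    batch = [sub for sub in subscribers if sub[0] > cursor][:limit]
    # Failures would become to-do-later notification records.
    failed = [subid for subid, url in batch if not post(url)]
    # A short batch means we ran out of subscribers: the work row can go.
    done = len(batch) < limit
    new_cursor = batch[-1][0] if batch else cursor
    return new_cursor, failed, done


subs = [(1, "http://a.example/cb"), (2, "http://b.example/cb"), (3, "http://c.example/cb")]


def fake_post(url):
    # Stand-in for the batched urlfetch: pretend subscriber b is down.
    return url != "http://b.example/cb"
```

With a limit of 2, the first call covers subids 1 and 2 and is not
done; the second call covers subid 3 and reports done.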

XMPP:

* in the future, if/when App Engine supports it.  but it's a special
  thingy.  HTTP is base and required.  XMPP support for pubbers and
  subbers is optional.

Polling mode for subscribers:

* a) callbacks won't always work (subscribers behind NATs, etc)
* b) callbacks won't always fit all subscriber's model (not easy for them)
* so must have poll mode.
* in the future:  can be long-poll, when App Engine supports it.  maybe.
* needs auth
* 1MB payload on responses, so server needs ability to paginate and set "but there's more!" flag w/ continuation token.
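
A sketch of the pagination idea in the last point.  The continuation
token here is just a list index, which is an assumption made for
brevity (a real hub would presumably hand out something opaque), and
next_page is a hypothetical name.

```python
def next_page(pending, token=0, max_bytes=1_000_000):
    """Return (page, more, next_token) for one poll response.

    pending: list of bytes payloads waiting for this subscriber.
    token: position to resume from (0 on the first request).
    """
    page, size, i = [], 0, token
    while i < len(pending):
        item = pending[i]
        # Always send at least one item so an oversized payload cannot stall.
        if page and size + len(item) > max_bytes:
            break
        page.append(item)
        size += len(item)
        i += 1
    more = i < len(pending)  # the "but there's more!" flag
    return page, more, (i if more else None)


payloads = [b"a" * 600, b"b" * 600, b"c" * 100]
```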

The hub notifies all subbers:  POST /callback/url/ "yo, something's
new for you.  don't trust me. fetch:
http://pubsubhubbub.appspot.com/poll-for-new-shit/?subid=234&token=23482903482340923849023840923i4"

Large subscribers:  (may be v2)

* one in-flight HTTP POST to subscribers at a time.  use memcache 10 second or so lock.
* if another POST is attempted while another is already in flight, enqueue/append the payloadid to a new
  table, contentious_or_big_subscriber.  still mark that (topicdeltaid, subid) pair as done for the purposes of
  /do-some-work/send-notifications
* new do-some-work:
      /do-some-work/send-notifications-to-big-peeps
* optional property on subscriptions for big subscribers to say, "Yo, it's okay to mix my subscriptions together
  in one HTTP payload post."  in which case it's atom-stream.xml style (updates.sixapart.com) and the payloads are mixed:


Misc notes:
----------------
* can subscribe to anything, regardless of whether or not there are any publishers.
* server's choice whether or not to actually poll proactively for changes vs. getting notified.


Discovery:
--------------
in Atom.xml:
    <link rel="hub.subscribe" href="http://pubsubhubbub.appspot.com/subscribe" />
        (repeated.  client should pick one)

in /index.html
   <link rel="hub.subscribe" href="http://pubsubhubbub.appspot.com/publish" />
   <link rel="alternate" type="application/atom+xml" href="http://lolcats.xml" />

then bookmarklet to ping the publish URL.

===========================================================================
              end meeting notes from 2008-09-16
===========================================================================

=== Open issues... ===

Is there an existing standard for aggregators to specify how many readers they're requesting on behalf of?