wukong dataflow notes from similar tools
-
Data window views:
win:length
win:length_batch
win:time
win:time_batch
win:time_length_batch
win:time_accum
win:ext_timed
ext:sort_window
ext:time_order
std:unique
std:groupwin
std:lastevent
std:firstevent
std:firstunique
win:firstlength
win:firsttime
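As a rough illustration of how two of the views listed above differ, here is a minimal Ruby sketch. It assumes the usual semantics: `win:length` keeps a sliding window of the last N events, while `win:length_batch` collects N events and releases them together. The class names `LengthWindow` and `LengthBatchWindow` are hypothetical and belong to no library.

```ruby
# Sliding window: always holds the most recent `size` events.
class LengthWindow
  attr_reader :events

  def initialize(size)
    @size = size
    @events = []
  end

  # Adds an event and returns the events that fell out of the window.
  def push(event)
    @events << event
    @events.length > @size ? @events.shift(@events.length - @size) : []
  end
end

# Batch window: buffers events and releases them `size` at a time.
class LengthBatchWindow
  def initialize(size)
    @size = size
    @buffer = []
  end

  # Returns a full batch once `size` events have arrived, otherwise nil.
  def push(event)
    @buffer << event
    return nil if @buffer.length < @size

    batch = @buffer
    @buffer = []
    batch
  end
end

sliding = LengthWindow.new(3)
(1..5).each { |n| sliding.push(n) }
sliding.events # the last three events: [3, 4, 5]

batching = LengthBatchWindow.new(2)
batches = (1..5).map { |n| batching.push(n) }.compact
batches # [[1, 2], [3, 4]]; event 5 is still buffered
```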
-
Views that derive statistics:
std:size
stat:uni
stat:linest
stat:correl
stat:weighted_avg
Statement syntax:
[annotations]
[expression_declarations]
[context context_name]
[insert into insert_into_def]
select select_list
from stream_def [as name] [, stream_def [as name]] [,...]
[where search_conditions]
[group by grouping_expression_list]
[having grouping_search_conditions]
[output output_specification]
[order by order_by_expression_list]
[limit num_rows]
-
Filter criteria. The following operators are highly optimized through indexing and are the preferred means of filtering in high-volume event streams:
- equals =
- not equals !=
- comparison operators < , > , >=, <=
- ranges: these come in the following 4 varieties. A round bracket () excludes an endpoint and a square bracket [] includes it. The low point and the high point of the range are separated by the colon : character.
- Open ranges that contain neither endpoint (low:high)
- Closed ranges that contain both endpoints [low:high]. The equivalent 'between' keyword also defines a closed range.
- Half-open ranges that contain the low endpoint but not the high endpoint [low:high)
- Half-closed ranges that contain the high endpoint but not the low endpoint (low:high]
- use the between keyword for a closed range where both endpoints are included
- use the in keyword and round () or square brackets [] to control how endpoints are included
- for inverted ranges use the not keyword and the between or in keywords
- list-of-values checks using the in keyword or the not in keywords followed by a comma-separated list of values
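The four range varieties above can be mimicked in plain Ruby. This is only a sketch of the endpoint rules, not of how any engine indexes them; `in_range?` and its keyword arguments are hypothetical names chosen for illustration.

```ruby
# Hypothetical helper: tests a value against a range whose endpoints
# may each be included (square bracket) or excluded (round bracket).
def in_range?(value, low, high, include_low:, include_high:)
  above = include_low  ? value >= low  : value > low
  below = include_high ? value <= high : value < high
  above && below
end

in_range?(10, 10, 20, include_low: false, include_high: false) # (10:20) open: false
in_range?(10, 10, 20, include_low: true,  include_high: true)  # [10:20] closed: true
in_range?(20, 10, 20, include_low: true,  include_high: false) # [10:20) half-open: false
in_range?(20, 10, 20, include_low: false, include_high: true)  # (10:20] half-closed: true

# Ruby's built-in ranges cover two of the four cases directly.
(10..20).cover?(20)  # closed range, true
(10...20).cover?(20) # high endpoint excluded, false
```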
-
where clause
...where fraud.severity = 5 and amount > 500
...where (orderItem.orderId is null) or (orderItem.class != 10)
...where (orderItem.orderId = null) or (orderItem.class <> 10)
...where itemCount / packageCount > 10
-
aggregate --
aggregate_function( [all | distinct] expression)
- group by
- having
- last, all, first
Notes from the Rack specification -- a pure expression of the middleware dataflow paradigm
...
This specification aims to formalize the Rack protocol. You can (and should) use Rack::Lint to enforce it. When you develop middleware, be sure to add a Lint before and after to catch all mistakes.
A Rack application is a Ruby object (not a class) that responds to call. It takes exactly one argument, the environment, and returns an Array of exactly three values: the status, the headers, and the body.
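A minimal sketch of such an application, written as a lambda and called by hand. The environment here is a deliberately incomplete stand-in for what a real server would supply, and no Rack library is loaded.

```ruby
# A Rack application is any object that responds to `call`; a lambda qualifies.
app = lambda do |env|
  body = ["Hello from #{env['PATH_INFO']}\n"]
  [200, { 'Content-Type' => 'text/plain' }, body]
end

# Hand-built environment (assumption: only the keys this example reads).
env = {
  'REQUEST_METHOD' => 'GET',
  'SCRIPT_NAME'    => '',
  'PATH_INFO'      => '/hello',
  'QUERY_STRING'   => ''
}

# The return value is always the three-element array: status, headers, body.
status, headers, body = app.call(env)
body.each { |chunk| print chunk }
```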
The environment must be an instance of Hash that includes CGI-like headers. The application is free to modify the environment. The environment is required to include these variables (adopted from PEP333), except when they’d be empty, but see below.
- REQUEST_METHOD: The HTTP request method, such as "GET" or "POST". This cannot ever be an empty string, and so is always required.
- SCRIPT_NAME: The initial portion of the request URL's "path" that corresponds to the application object, so that the application knows its virtual "location". This may be an empty string, if the application corresponds to the "root" of the server.
- PATH_INFO: The remainder of the request URL's "path", designating the virtual "location" of the request's target within the application. This may be an empty string, if the request URL targets the application root and does not have a trailing slash. This value may be percent-encoded when originating from a URL.
- QUERY_STRING: The portion of the request URL that follows the ?, if any. May be empty, but is always required!
- SERVER_NAME, SERVER_PORT: When combined with SCRIPT_NAME and PATH_INFO, these variables can be used to complete the URL. Note, however, that HTTP_HOST, if present, should be used in preference to SERVER_NAME for reconstructing the request URL. SERVER_NAME and SERVER_PORT can never be empty strings, and so are always required.
- HTTP_ Variables: Variables corresponding to the client-supplied HTTP request headers (i.e., variables whose names begin with HTTP_). The presence or absence of these variables should correspond with the presence or absence of the appropriate HTTP header in the request.
- rack.version: The Array [1,1], representing this version of Rack.
- rack.url_scheme: http or https, depending on the request URL.
- rack.input: See below, the input stream.
- rack.errors: See below, the error stream.
- rack.multithread: true if the application object may be simultaneously invoked by another thread in the same process, false otherwise.
- rack.multiprocess: true if an equivalent application object may be simultaneously invoked by another process, false otherwise.
- rack.run_once: true if the server expects (but does not guarantee!) that the application will only be invoked this one time during the life of its containing process. Normally, this will only be true for a server based on CGI (or something similar).
- rack.logger: A common object interface for logging messages. The object must implement:
  info(message, &block)
  debug(message, &block)
  warn(message, &block)
  error(message, &block)
  fatal(message, &block)
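The note above about preferring HTTP_HOST over SERVER_NAME when reconstructing the request URL can be sketched as follows. `request_url` is a hypothetical helper, and leaving out the default port (80 for http, 443 for https) is an assumption made here for readability, not something the text requires.

```ruby
# Hypothetical helper: rebuilds the request URL from a Rack-style environment.
def request_url(env)
  scheme = env['rack.url_scheme']
  # Prefer HTTP_HOST when the client sent it; otherwise fall back to
  # SERVER_NAME and SERVER_PORT, which are always present.
  host = env['HTTP_HOST'] || begin
    default_port = scheme == 'https' ? '443' : '80'
    port = env['SERVER_PORT'].to_s
    port == default_port ? env['SERVER_NAME'] : "#{env['SERVER_NAME']}:#{port}"
  end
  url = "#{scheme}://#{host}#{env['SCRIPT_NAME']}#{env['PATH_INFO']}"
  query = env['QUERY_STRING'].to_s
  query.empty? ? url : "#{url}?#{query}"
end

env = {
  'rack.url_scheme' => 'http',
  'SERVER_NAME'     => 'localhost',
  'SERVER_PORT'     => '8080',
  'SCRIPT_NAME'     => '/app',
  'PATH_INFO'       => '/items',
  'QUERY_STRING'    => 'page=2'
}

without_host = request_url(env)
# "http://localhost:8080/app/items?page=2"
with_host = request_url(env.merge('HTTP_HOST' => 'example.com'))
# "http://example.com/app/items?page=2"
```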
The server or the application can store their own data in the environment, too. The keys must contain at least one dot, and should be prefixed uniquely. The prefix rack. is reserved for use with the Rack core distribution and other accepted specifications, and must not be used otherwise. The environment must not contain the keys HTTP_CONTENT_TYPE or HTTP_CONTENT_LENGTH (use the versions without HTTP_). The CGI keys (named without a period) must have String values. There are the following restrictions:
- rack.version must be an array of Integers.
- rack.url_scheme must either be 'http' or 'https'.
- There must be a valid input stream in rack.input.
- There must be a valid error stream in rack.errors.
- The REQUEST_METHOD must be a valid token.
- The SCRIPT_NAME, if non-empty, must start with '/'.
- The PATH_INFO, if non-empty, must start with '/'.
- The CONTENT_LENGTH, if given, must consist of digits only.
- One of SCRIPT_NAME or PATH_INFO must be set. PATH_INFO should be '/' if SCRIPT_NAME is empty. SCRIPT_NAME never should be '/', but instead be empty.
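Several of the restrictions above are mechanical enough to check in a few lines. The sketch below covers only a subset (it ignores the stream and token checks) and is not a substitute for Rack::Lint; `env_problems` is a hypothetical name.

```ruby
# Hypothetical checker: returns a list of violated restrictions (empty if none).
def env_problems(env)
  problems = []
  version = env['rack.version']
  unless version.is_a?(Array) && version.all? { |v| v.is_a?(Integer) }
    problems << 'rack.version must be an Array of Integers'
  end
  unless %w[http https].include?(env['rack.url_scheme'])
    problems << 'rack.url_scheme must be http or https'
  end
  %w[SCRIPT_NAME PATH_INFO].each do |key|
    value = env[key]
    if value && !value.empty? && !value.start_with?('/')
      problems << "#{key} must start with /"
    end
  end
  if env['CONTENT_LENGTH'] && env['CONTENT_LENGTH'] !~ /\A\d+\z/
    problems << 'CONTENT_LENGTH must consist of digits only'
  end
  unless env.key?('SCRIPT_NAME') || env.key?('PATH_INFO')
    problems << 'one of SCRIPT_NAME or PATH_INFO must be set'
  end
  problems << 'SCRIPT_NAME must be empty rather than /' if env['SCRIPT_NAME'] == '/'
  %w[HTTP_CONTENT_TYPE HTTP_CONTENT_LENGTH].each do |key|
    problems << "#{key} must not be present" if env.key?(key)
  end
  problems
end

good_env = { 'rack.version' => [1, 1], 'rack.url_scheme' => 'http',
             'SCRIPT_NAME' => '', 'PATH_INFO' => '/', 'CONTENT_LENGTH' => '12' }
bad_env  = good_env.merge('rack.url_scheme' => 'ftp', 'PATH_INFO' => 'items',
                          'CONTENT_LENGTH' => '12a', 'HTTP_CONTENT_TYPE' => 'text/plain')

env_problems(good_env) # no problems reported
env_problems(bad_env)  # four problems, one per broken key
```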
The input stream is an IO-like object which contains the raw record. When applicable, its external encoding must be “ASCII-8BIT” and it must be opened in binary mode, for Ruby 1.9 compatibility. The input stream must respond to gets, each, read and rewind.
- gets must be called without arguments and return a string, or nil on EOF.
- read behaves like IO#read. Its signature is read([length, [buffer]]). If length is given, it must be a non-negative Integer (>= 0) or nil, and buffer must be a String and may not be nil. If length is given and not nil, then this method reads at most length bytes from the input stream. If length is not given or nil, then this method reads all data until EOF. When EOF is reached, this method returns nil if length is given and not nil, or "" if length is not given or is nil. If buffer is given, then the read data will be placed into buffer instead of a newly created String object.
- each must be called without arguments and only yield Strings.
- rewind must be called without arguments. It rewinds the input stream back to the beginning. It must not raise Errno::ESPIPE: that is, it may not be a pipe or a socket. Therefore, handler developers must buffer the input data into some rewindable object if the underlying input stream is not rewindable.
- close must never be called on the input stream.
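One way a handler might satisfy the rewind requirement is to copy a pipe or socket into a binary StringIO. The sketch below does that and then exercises the gets, read, each and rewind behaviour described above; `rewindable_input` is a hypothetical helper, and reading the whole stream into memory is a simplification that a real handler may not be able to afford.

```ruby
require 'stringio'

# Hypothetical handler helper: copies a non-rewindable stream (pipe, socket)
# into a binary (ASCII-8BIT) StringIO so that `rewind` is safe to call.
def rewindable_input(io)
  StringIO.new(io.read.b)
end

# A pipe stands in for the non-rewindable stream handed over by the server.
reader, writer = IO.pipe
writer.write("first line\nsecond line\n")
writer.close

input = rewindable_input(reader)
reader.close

first = input.gets                     # one line, including its newline
chunk = input.read(6)                  # at most 6 bytes
rest  = input.read                     # everything up to EOF
at_eof_with_length    = input.read(10) # nil: length given at EOF
at_eof_without_length = input.read     # "":  no length given at EOF

input.rewind                           # back to the beginning
lines = []
input.each { |line| lines << line }    # yields Strings only
```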
The error stream must respond to puts, write and flush.
- puts must be called with a single argument that responds to to_s.
- write must be called with a single argument that is a String.
- flush must be called without arguments and must be called in order to make the error appear for sure.
- close must never be called on the error stream.
The Response
- status: an HTTP status. When parsed as integer (to_i), it must be greater than or equal to 100.
- headers: must respond to each, and yield values of key and value. The header keys must be Strings. The header must not contain a Status key, must not contain keys with : or newlines in their name, and must not contain key names that end in - or _; keys may only consist of letters, digits, _ or - and must start with a letter. The values of the header must be Strings, consisting of lines (for multiple header values, e.g. multiple Set-Cookie values) separated by "\n". The lines must not contain characters below 037.
- Content-Type: There must be a Content-Type, except when the Status is 1xx, 204 or 304, in which case there must be none given.
- Content-Length: There must not be a Content-Length header when the Status is 1xx, 204 or 304.
- body: must respond to each and must only yield String values. The Body itself should not be an instance of String, as this will break in Ruby 1.9. If the Body responds to close, it will be called after iteration. If the Body responds to to_path, it must return a String identifying the location of a file whose contents are identical to that produced by calling each; this may be used by the server as an alternative, possibly more efficient way to transport the response. The Body commonly is an Array of Strings, the application instance itself, or a File-like object.
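The response rules above can be sketched as a small checker. `response_problems` is a hypothetical name, the check is a partial reading of the rules rather than a replacement for Rack::Lint, and it iterates the body, so it suits test code rather than a live request path.

```ruby
# Hypothetical checker for a [status, headers, body] triple.
# Returns a list of problems (empty if none were found).
def response_problems(status, headers, body)
  problems = []
  code = status.to_i
  problems << 'status must be >= 100' if code < 100
  bodiless = (100..199).cover?(code) || [204, 304].include?(code)

  names = []
  headers.each do |key, value|
    unless key.is_a?(String) && key.match?(/\A[A-Za-z][A-Za-z0-9_-]*\z/) &&
           !key.end_with?('-', '_')
      problems << "bad header name #{key.inspect}"
      next
    end
    names << key.downcase
    problems << 'Status header not allowed' if key.downcase == 'status'
    # Values are lines separated by "\n"; no line may hold a character below 037.
    unless value.is_a?(String) && value.split("\n").none? { |l| l.match?(/[\000-\036]/) }
      problems << "bad value for #{key}"
    end
  end

  has_type = names.include?('content-type')
  problems << 'Content-Type required' if !bodiless && !has_type
  problems << 'Content-Type not allowed' if bodiless && has_type
  problems << 'Content-Length not allowed' if bodiless && names.include?('content-length')

  if body.respond_to?(:each)
    body.each { |part| problems << 'body yielded a non-String' unless part.is_a?(String) }
  else
    problems << 'body must respond to each'
  end
  problems
end

ok      = response_problems(200, { 'Content-Type' => 'text/plain' }, ['ok'])
no_body = response_problems(204, { 'Content-Type' => 'text/plain', 'Content-Length' => '0' }, [])
broken  = response_problems(200, { 'X-Bad-' => 'v' }, 'a bare String')
```

The last call shows why the body should not be a bare String: a String does not respond to each in Ruby 1.9 and later.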
From Kafka docs
Most messaging systems keep metadata about what messages have been consumed on the broker. That is, as a message is handed out to a consumer, the broker records that fact locally. This is a fairly intuitive choice, and indeed for a single machine server it is not clear where else it could go. Since the data structures used for storage in many messaging systems scale poorly, this is also a pragmatic choice -- since the broker knows what is consumed it can immediately delete it, keeping the data size small.
What is perhaps not obvious is that getting the broker and consumer to agree about what has been consumed is not a trivial problem. If the broker records a message as consumed immediately every time it is handed out over the network, then if the consumer fails to process the message (say because it crashes or the request times out) that message will be lost. To solve this problem, many messaging systems add an acknowledgement feature, which means that messages are only marked as sent, not consumed, when they are sent; the broker waits for a specific acknowledgement from the consumer to record the message as consumed. This strategy fixes the problem of losing messages, but creates new problems. First of all, if the consumer processes the message but fails before it can send an acknowledgement then the message will be consumed twice. The second problem is performance: the broker must now keep multiple states about every single message (first to lock it so it is not given out a second time, and then to mark it as permanently consumed so that it can be removed). Tricky problems must be dealt with, like what to do with messages that are sent but never acknowledged.
So clearly there are multiple possible message delivery guarantees that could be provided:
- At most once: this handles the first case described. Messages are immediately marked as consumed, so they can't be given out twice, but many failure scenarios may lead to losing messages.
- At least once: this is the second case, where we guarantee each message will be delivered at least once, but in failure cases may be delivered twice.
- Exactly once: this is what people actually want; each message is delivered once and only once.
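The first two guarantees differ only in whether the message is marked consumed before or after it is processed. The toy simulation below makes that visible; `deliver`, `mark_first` and `crash_on` are hypothetical names, and a single in-memory array stands in for the broker, so this is a sketch of the ordering problem rather than of any real messaging system.

```ruby
# Toy consumer loop. `mark_first` selects when a message is marked consumed;
# a consumer crash is simulated once, while handling the `crash_on` message.
def deliver(messages, mark_first:, crash_on:)
  consumed  = [] # what the broker believes has been consumed
  processed = [] # what the consumer actually finished processing
  crashed   = false

  queue = messages.dup
  until queue.empty?
    msg = queue.first
    begin
      if mark_first
        consumed << queue.shift # at most once: mark, then process
        raise 'crash' if msg == crash_on && !crashed
        processed << msg
      else
        processed << msg        # at least once: process, then acknowledge
        raise 'crash' if msg == crash_on && !crashed
        consumed << queue.shift
      end
    rescue RuntimeError
      crashed = true            # the consumer restarts and resumes
    end
  end
  processed
end

at_most_once  = deliver(%w[a b c], mark_first: true,  crash_on: 'b')
# ["a", "c"]: b was marked consumed but never processed, so it is lost
at_least_once = deliver(%w[a b c], mark_first: false, crash_on: 'b')
# ["a", "b", "b", "c"]: b was processed, the ack was lost, and b came again
```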