# Protoscope Language Specification.
# Protoscope is a text format for representing valid Protobuf wire-format
# encodings, directly inspired by https://github.com/google/der-ascii, and
# has a significant overlap in syntax with it.
#
# Protoscope has two design goals. First, it is reversible, so all encoding
# variations must be represented in the language directly. This includes the
# distinctions between different integer encodings, packed primitive fields,
# and groups.
#
# Second, Protoscope is intended to create both valid and invalid encodings. It
# has minimal knowledge of the wire format, but it is ignorant of actual
# schemata specified by DescriptorProtos. Elements in the input file may be
# freely replaced by raw byte strings, and there is no requirement that the
# resulting output is anything resembling a valid proto.
#
# Protoscope is *not* a replacement for the text format; instead, it is intended
# to be used for manipulating encoded protos where the precise encoding is
# relevant, such as debugging a codec or creating test data.
#
# This specification is a valid Protoscope file.
# A Protoscope file is a sequence of tokens. Most tokens resolve to a byte
# string which is emitted as soon as it is processed.
# Tokens are separated by whitespace, which is defined to be space (0x20), TAB
# (0x09), CR (0x0d), and LF (0x0a). Apart from acting as a token separator,
# whitespace is not significant.
# Comments begin with # and run to the end of the line. Comments are treated as
# whitespace.
# Quoted strings.
"Quoted strings are delimited by double quotes. Backslash denotes escape
sequences. Legal escape sequences are: \\ \" \x00 \000 \n. \x00 consumes two
hex digits and emits a byte. \000 consumes one to three octal digits and emits
a byte (rejecting values that do not fit in a single octet). Otherwise, any
byte before the closing quote, including a newline, is emitted as-is."
# Tokens in the file are emitted one after another, so the following lines
# produce the same output:
"hello world"
"hello " "world"
# The Protobuf wire format only deals in UTF-8 when it deals with text at all,
# so there is no equivalent of DER-ASCII's UTF-16/32 string literals.
# Hex literals.
# Backticks denote hex literals. Either uppercase or lowercase is legal, but no
# characters other than hexadecimal digits may appear. A hex literal emits the
# decoded byte string.
`00`
`abcdef`
`AbCdEf`
# Integers.
# Tokens which match /-?[0-9]+/ or /-?0x[0-9a-fA-F]+/ are integer tokens.
# They encode into a Protobuf varint (base 128).
456
-0xffFF
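# As a worked example (derived from the base-128 varint rules, not taken from
# the original specification), 456 is 3 * 128 + 72. The low seven bits (72, or
# 0x48) are emitted first with the continuation bit set, giving 0xc8, followed
# by the remaining bits, 0x03. The following lines should therefore be
# equivalent:
456
`c803`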
# Signed integers encode as their 64-bit two's complement by default. If an
# integer is suffixed with z, it uses the zigzag encoding instead.
-2z 3 # Equivalent tokens.
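# Zigzag maps n to (n << 1) ^ (n >> 63), so small negative numbers stay small
# on the wire. A few more pairs of equivalent tokens, worked out by hand from
# that formula:
-1z 1
1z 2
2z 4
# Without the suffix, -1 is encoded from its 64-bit two's complement and so
# takes the full ten bytes. These two lines should be equivalent:
-1
`ffffffffffffffffff01`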
# An integer may instead be suffixed with i32 or i64, which indicates it should
# be encoded as a fixed-width integer.
0i32
-23i64
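# Assuming the usual little-endian layout of Protobuf fixed-width integers, the
# two tokens above should be equivalent to the following hex literals (worked
# by hand; -23 is 0xffffffffffffffe9 in 64-bit two's complement):
`00000000`
`e9ffffffffffffff`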
# An integer may follow a 'long-form:N' token. This will cause the varint to
# have N more bytes than it needs to successfully encode. For example, the
# following are equivalent:
long-form:3 3
`83808000`
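# The padding also applies to varints that already need several bytes. As a
# hand-worked illustration, 456 normally encodes as c8 03, so one extra byte
# sets the continuation bit on 03 and appends a zero byte. The following lines
# should be equivalent:
long-form:1 456
`c88300`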
# Booleans.
# The following tokens emit `01` and `00`, respectively.
true
false
# Floats.
# Tokens that match /-?[0-9]+\.[0-9]+([eE]-?[0-9]+)?/ or
# /-?0x[0-9a-fA-F]+\.[0-9a-fA-F]+([pP]-?[0-9]+)?/ are floating-point
# tokens. They encode to an IEEE 754 binary64 value.
1.0
9.423e-2
-0x1.ffp52
# Decimal floats are only guaranteed a particular encoding when conversion from
# decimal to binary is exact. Hex floats always have an exact conversion. The
# i32 suffix from above may be used to specify a 32-bit float (i64 is permitted,
# but redundant).
1.5i32
0xf.fi64
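# Assuming floats are emitted little-endian like fixed-width integers, a
# hand-worked illustration: 1.0 is 0x3ff0000000000000 as a binary64 and 1.5 is
# 0x3fc00000 as a binary32, so each line below should hold two equivalent
# tokens:
1.0 `000000000000f03f`
1.5i32 `0000c03f`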
# The strings inf32, inf64, -inf32, and -inf64 are recognized as shorthands for
# 32-bit and 64-bit infinities. There is no shorthand for NaN (since there are
# so many of them), and it is best spelled out as a fixed-size hex int.
inf32
-inf64
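# Assuming the usual IEEE 754 bit patterns, the two shorthands above should be
# equivalent to these hex literals:
`0000807f`
`000000000000f0ff`
# One way to spell a NaN, following the advice above, is the common quiet NaN
# bit pattern written as a fixed-width hex integer. This particular payload is
# only an illustration; any NaN bit pattern can be written the same way.
0x7fc00000i32
0x7ff8000000000000i64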
# Tag expressions.
# An integer followed by a : denotes a tag expression. This encodes the tag of
# a Protobuf field. It is encoded like an ordinary integer, except that a
# wire type between 0 and 7 is packed into the low three bits via the expression
#
# tag := int << 3 | wireType
#
# The integer specifies the field number, while what follows the : specifies
# the wire type. In the examples below, no whitespace may be
# present around the : rune.
#
# Field numbers may be hex, signed, and zigzag per the syntax above, but not
# fixed-width. They may have a long-form prefix.
1:VARINT # A varint.
2:I64 # A fixed-width, 64-bit blob.
3:LEN # A length-prefixed blob.
4:SGROUP # A start-group marker.
5:EGROUP # An end-group marker.
6:I32 # A fixed-width, 32-bit blob.
0x10:0 # Also a varint, explicit value for the type.
8:6 # Invalid wire type (6 and 7 are unused).
# This is an error: the wire type must be between 0 and 7.
# 9:8
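# Working the tag expression by hand for the examples above (field number
# shifted left by three, OR'd with the wire type, then varint-encoded), each
# line below should hold two equivalent tokens. Note that field 16 is the first
# whose tag needs two bytes.
1:VARINT `08`
2:I64 `11`
3:LEN `1a`
4:SGROUP `23`
5:EGROUP `2c`
6:I32 `35`
0x10:0 `8001`
8:6 `46`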
# If the : is instead followed by any rune not matching /[\w-]/, the scanner
# will seek forward to the next token. If it is a fixed-width integer or a
# float, the wire type will be inferred to be I32 or I64 as appropriate; if it
# is a {, or a 'long-form:N' followed by a {, the type is inferred as LEN;
# if it is a '!', the type is inferred as SGROUP; otherwise, it defaults to
# VARINT.
1: 55z
2: 1.23
3: {"text"}
6: -1i32
8: !{42}
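# Spelled with explicit wire types, the five examples above should be
# equivalent to the following lines (a sketch based on the rules in this
# section; the last line writes the group markers by hand):
1:VARINT 55z
2:I64 1.23
3:LEN 4 "text"
6:I32 -1i32
8:SGROUP 42 8:EGROUP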
# Length prefixes.
# Matching curly brace tokens denote length prefixes. They emit a varint-encoded
# length prefix followed by the encoding of the brace contents.
#
# The opening brace may optionally be preceded by 'long-form:N', as an integer would, to
# introduce redundant bytes in the encoding of the length prefix.
# This is a string field. Note that 23:'s type is inferred to be LEN.
23: {"my cool string"}
# This is a nested message field.
24: {
1: 5
2: {"nested string"}
}
# This is a packed repeated int field.
25: { 1 2 3 4 5 6 7 }
# This string field's length prefix will be three bytes, rather than one.
23: long-form:2 {"non-minimally-prefixed"}
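# Written with explicit lengths, which shows what the braces compute, the
# string, packed, and long-form examples above should be equivalent to the
# following lines (byte counts worked by hand):
23:LEN 14 "my cool string"
25:LEN 7 1 2 3 4 5 6 7
23:LEN long-form:2 22 "non-minimally-prefixed"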
# Groups.
# If matching curly braces are prefixed with a ! (no spaces before the first
# {), they denote a group. Encoding a group requires a field number, so the !{}
# must come immediately after a tag expression without an explicit type (which
# will be inferred to be SGROUP). The closing brace will generate a
# corresponding EGROUP-typed tag to match the SGROUP tag.
26: !{
1: 55z
2: 1.4
3: {"abcd"}
}
# long-form:N may be the last token between a group's braces, which will be
# applied to the EGROUP tag.
27: !{long-form:3}
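# Written with explicit markers, the two groups above should be equivalent to
# the following (a sketch based on the rules above; the long-form prefix moves
# onto the EGROUP tag):
26:SGROUP
1: 55z
2: 1.4
3: {"abcd"}
26:EGROUP
27:SGROUP long-form:3 27:EGROUP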
# Examples.
# These primitives may be combined with raw byte strings to produce other
# encodings.
# This is another way to write a message, using an explicit length.
2:LEN 4
"abcd"
# This allows us to insert the wrong length.
2:LEN 5
"abcd"
# The wrong wire type can be used with no consequences.
5:I64 "stuff"
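# For reference, the three examples above should assemble to the following
# byte strings (worked by hand). The second differs from the first only in its
# length byte, and the third is an I64 tag followed by five bytes of text
# instead of eight bytes of payload.
`120461626364`
`120561626364`
`297374756666`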
# Disassembler.
# Although the conversion from Protoscope to a byte string is well-defined, the
# inverse is not. A given byte string may have multiple disassemblies. The
# disassembler heuristically attempts to give a useful conversion for its
# input.
#
# It is a goal that any valid protobuf input will be decoded reasonably,
# although this is impossible in general, because length-prefixed blobs can be
# either strings or protobufs, and fixed-width ints can also be floats. We try
# to strike a balance that produces mostly readable output.
#
# Note that the output of the disassembler is not stable and is highly heuristic.
# The only guarantee is that it will reassemble to the original input byte for
# byte.
#
# The algorithm is as follows:
#
# 1. Greedily parse tags out of the input. If an invalid tag is found, encode
# the remaining bytes as quoted strings or hex literals. Wire types 6 and 7
# are treated as "invalid".
#
# 2. Encode the tag as N:, unless the wire type is 4, in which case encode it as
# N:EGROUP. However, see below for cases when a wire type 3 tag precedes it.
#
# 3. If the wire type is 0, parse a varint and encode it as if it were an int64.
# There is no useful way to distinguish sint32/sint64 here.
#
# 4. If the wire type is 1 or 5, parse eight or four bytes, respectively, and
# interpret them as a float of the appropriate size.
#
# a. If the float is a NaN, print the bytes as a fixed-width hex integer.
#
# b. If the float is infinite, print inf32/inf64 as appropriate.
#
# c. If the float is zero or has an exponent that is not close to the largest
# or smallest possible exponents, print as a decimal float.
#
# i. If this would produce a non-round-trip-able value, print as a hex
# float instead.
#
# d. Otherwise, print a fixed-width decimal integer.
#
# 5. If the wire type is 2:
#
# a. Try to parse the contents of the field as a message, and print those
# fields wrapped in {} if no failures occur. (-all-fields-are-messages will
# instead cause all fields that parsed successfully to be emitted followed
# by hex strings with the remaining content.)
#
# b. Otherwise, output the contents as quoted strings or hex literals.
#
# 6. If the wire type is 3:
#
# a. Encode !{ to begin a group and save the field number on the group stack.
#
# b. Upon coming to a wire type 4 tag, check if it matches the top of the
# group stack. (Pop the stack unconditionally.)
#
# c. If it does, close the group with a } and do not emit a tag expression.
# Emit a long-form:N as necessary.
#
# d. If it doesn't, re-encode the wire type 3 tag as N:SGROUP, and encode the
# wire type 4 tag as above. If a type 2 message ends before the group is
# closed, or the input reaches EOF, this step also applies.
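#
# As a hedged illustration of why disassembly is ambiguous, every line below
# should assemble to the same four bytes (0a 02 08 00). Which spelling the
# disassembler prints is up to the heuristics above; step 5.a suggests the
# first, since the contents parse as a message.
1: {1: 0}
1: {`0800`}
1: {"\x08\x00"}
1:LEN 2 `0800`
`0a020800`
# Similarly, under step 4.d the bytes 01 00 00 00 in an I32 field would
# presumably print as an integer rather than a float, since as a binary32 they
# form a subnormal with the smallest possible exponent:
6: 1i32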
#
# However, Protoscope offers the option of providing a descriptor to aid
# disassembly. In this case the heuristic becomes much more intelligent.
# Steps 1 and 2 remain the same, but from 3 onwards:
#
# 3. If the wire type is 0, parse a varint and encode it as:
#
# a. true/false if the field is bool-typed and 0- or 1-valued.
#
# b. A sint64 (a 42z literal) if the field is sint32/sint64-typed.
#
# c. A uint64 if the field is any of the unsigned integer types.
#
# d. An int64 if none of the above apply.
#
# 4. If the wire type is 1 or 5, parse eight or four bytes and encode as:
#
# a. A fixed64/fixed32 (resp) if the field is any of the unsigned integer
# types.
#
# b. A sfixed64/sfixed32 (resp) if the field is any of the other integer
# types.
#
# c. The same way floats are encoded in the schema-less algorithm above, step
# 4.a-4.c, but printing subnormals as floats, too.
#
# 5. If the wire type is 2:
#
# a. If the field type is a scalar, print as a packed field (numeric literals
# inside of braces) per the steps 3 and 4 above.
#
# b. If the field is a message or group, use the algorithm described in the
# schema-less scheme (step 5).
#
# c. If the field is bytes or string, print as quoted strings or hex
# literals as suits the disassembler's fancy.
#
# 6. If the wire type is 3, proceed per the instructions for a group given for
# the schema-less algorithm.
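#
# As a hedged illustration of step 3 with a descriptor (the field types named
# here are hypothetical), each line below holds two spellings of the same
# bytes. The first is what the schema-less algorithm would print; the second
# is what the steps above suggest if field 1 is a bool and field 2 is a
# sint64.
1: 1 1: true
2: 3 2: -2z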