
Cirq Circuit Efficient Serialization #3438

Closed · mpharrigan opened this issue Oct 21, 2020 · 14 comments
Labels: area/json · kind/feature-request · priority/p1 · triage/accepted

Comments

@mpharrigan (Collaborator) commented Oct 21, 2020

Cirq supports JSON serialization of all its objects. This is highly portable and extensible but can be somewhat verbose. As part of the XEB refactor I'm working on (quantumlib/ReCirq#85), we need to generate a whole suite of large (wide and deep) circuits and save them to persistent storage. Using the vanilla JSON serialization is both prohibitively slow and prohibitively large on disk. To see why this may be a problem, the simple circuit of:

q0, q1 = cirq.LineQubit.range(2)
circuit = cirq.Circuit(
    cirq.H(q0),
    cirq.CNOT(q0, q1),
)

turns into:

{
  "cirq_type": "Circuit",
  "moments": [
    {
      "cirq_type": "Moment",
      "operations": [
        {
          "cirq_type": "GateOperation",
          "gate": {
            "cirq_type": "HPowGate",
            "exponent": 1.0,
            "global_shift": 0.0
          },
          "qubits": [
            {
              "cirq_type": "LineQubit",
              "x": 0
            }
          ]
        }
      ]
    },
    {
      "cirq_type": "Moment",
      "operations": [
        {
          "cirq_type": "GateOperation",
          "gate": {
            "cirq_type": "CXPowGate",
            "exponent": 1.0,
            "global_shift": 0.0
          },
          "qubits": [
            {
              "cirq_type": "LineQubit",
              "x": 0
            },
            {
              "cirq_type": "LineQubit",
              "x": 1
            }
          ]
        }
      ]
    }
  ],
  "device": {
    "cirq_type": "_UnconstrainedDevice"
  }
}

For example: ten circuits on nine qubits of depth 100 take approximately 28 MiB on disk. Using xz, this compresses down to 240 kiB (gzip yields 417 kiB).
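
A rough sketch of how to reproduce numbers like these, using cirq.testing.random_circuit as a stand-in for the actual XEB circuits (exact sizes will differ):

import gzip
import lzma

import cirq

# Stand-in for an XEB-style circuit: 9 qubits, depth 100, fully packed moments.
circuit = cirq.testing.random_circuit(qubits=9, n_moments=100, op_density=1.0)
raw = cirq.to_json(circuit).encode()

print("raw JSON:", len(raw), "bytes")
print("gzip:    ", len(gzip.compress(raw)), "bytes")
print("xz:      ", len(lzma.compress(raw)), "bytes")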

Possible designs

  1. Bespoke circuit text representation -- something like qasm where we define a full grammar for circuits that is less redundant

  2. Like (1), but just use qasm

  3. Wrapper or flag for JSON serialization that lets the Circuit give itself as compressed JSON. The specifics would have to be worked out, but you can imagine having "cirq_type": "CompressedCircuit" and a data field which is a hex-encoded string of the gzipped bytes of the normal cirq JSON (see the sketch after this list)

  4. Have a compression option in to_json and read_json that will read/write the entire file gzipped

  5. Same as (4) but implement in ReCirq; i.e. punt on this issue in Cirq and let dependent libraries figure out their own solution.

[edited:]
6a. Re-use the existing cirq.google protobuf serialization. I don't think this option is very viable because most gates aren't currently supported, and the process for extending it to new gates in a backwards-compatible way is questionable.

6b. Use a binary serialization format like protobuf. Depending on how it's composed with JSON serialization, this could be similar to (1), but not text; or similar to (3), but with the data being the binary representation.
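
To make (3) concrete, a rough, untested sketch (the "CompressedCircuit" type and helper names are hypothetical):

import gzip
import json

import cirq

def to_compressed_json(circuit: cirq.Circuit) -> str:
    # Hex-encode the gzipped bytes of the normal cirq JSON.
    payload = gzip.compress(cirq.to_json(circuit).encode())
    return json.dumps({'cirq_type': 'CompressedCircuit', 'data': payload.hex()})

def from_compressed_json(text: str) -> cirq.Circuit:
    obj = json.loads(text)
    raw = gzip.decompress(bytes.fromhex(obj['data']))
    return cirq.read_json(json_text=raw.decode())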

Benefits and Drawbacks

(1) could be more broadly useful (e.g. for listing a circuit in a manuscript) but would be a lot of work and maintenance burden. There will be questions about what swath of gates we'd support. (2) piggybacks off of qasm, but I don't know how faithful a representation of a cirq Circuit we can get with it.

(3) could work, but integrating this into the serialization API might be very unergonomic for the user. There would then be two types of json-serialized circuits floating around in the wild.

I think (4) or (5) are easy and solve the issue without sacrificing much. The sacrifice is that there would now be two file types readable by Cirq, and the files would no longer be human-readable (though some tools, e.g. vim, will auto-decompress gzipped text files).

[edited:]
I suspect (6a) is a non-starter, since the two formats have different goals.

For 6b, we'd have to weigh how to compose (or not!) with JSON serialization of other cirq objects. For example, I exploit the fact that I can serialize anything in one cirq JSON file by storing a list of circuits, some qubits, result data, numpy arrays, etc. We'd either have to provide protobuf objects for everything in cirq [and users would have to define a schema in protobuf instead of throwing things into a python dictionary], OR certain objects would get a protobuf form that you could slot into a JSON document by hex-encoding the bytes, like option (3).

@mpharrigan added the kind/feature-request label Oct 21, 2020
@cduck (Collaborator) commented Oct 22, 2020

(1) You could include a header with the set of gates used, specified with their JSON encodings. The circuit encoding could then be something really simple like `a 0 2,b 1,c 4 3 5;a 1 2,a 3 5;...`.
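
For illustration, a toy decoder for such a string (the gate header here is a hypothetical example: 'a' = CZ, 'b' = X, 'c' = CCNOT):

import cirq

GATES = {'a': cirq.CZ, 'b': cirq.X, 'c': cirq.CCNOT}  # hypothetical header

def decode(text: str) -> cirq.Circuit:
    # Moments are separated by ';', operations by ',', tokens by spaces.
    moments = []
    for moment_text in text.split(';'):
        ops = []
        for op_text in moment_text.split(','):
            name, *qubits = op_text.split()
            ops.append(GATES[name](*[cirq.LineQubit(int(q)) for q in qubits]))
        moments.append(cirq.Moment(ops))
    return cirq.Circuit(moments)

print(decode('a 0 2,b 1,c 4 3 5;a 1 2,a 3 5'))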

Compression: Some libraries allow you to specify an initial or commonly known vocabulary to improve file size.
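
For example, zlib supports preset dictionaries; a minimal sketch, with a hypothetical dictionary seeded from strings that recur in cirq JSON:

import zlib

# Hypothetical preset dictionary; a real one would be tuned on actual circuits.
ZDICT = b'"cirq_type": "Moment" "cirq_type": "GateOperation" "cirq_type": "LineQubit"'

def compress(data: bytes) -> bytes:
    c = zlib.compressobj(zdict=ZDICT)
    return c.compress(data) + c.flush()

def decompress(blob: bytes) -> bytes:
    # The same dictionary must be supplied on the way back out.
    d = zlib.decompressobj(zdict=ZDICT)
    return d.decompress(blob) + d.flush()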

@balopat (Contributor) commented Oct 22, 2020

It is interesting that you discounted protobuf - that would have been my first thought for efficiency.
Why do you say protobufs shouldn't be used for long-term storage?

@mpharrigan (Collaborator, Author) commented

I was under the impression that it was for messages in flight (like when doing an RPC call) and lacked some of the backwards-compatibility/stability guarantees required for long-term use.

In any event, I was discounting repurposing the existing quantum-engine protobuf serialization because it only works for certain gates. I'll edit the list of options above.

@mpharrigan (Collaborator, Author) commented

@cduck yes, that would be an option for a grammar for a (1)-type implementation.

Re: compression: even with "off the shelf" tools, it works well enough for me.

@mpharrigan (Collaborator, Author) commented

I edited option 6 into 6a and 6b. My commentary on 6b ["use protobuf"] is reproduced below:

For 6b, we'd have to weigh how to compose (or not!) with JSON serialization of other cirq objects. For example, I exploit the fact that I can serialize anything in one cirq JSON file by storing a list of circuits, some qubits, result data, numpy arrays, etc. We'd either have to provide protobuf objects for everything in cirq [and users would have to define a schema in protobuf instead of throwing things into a python dictionary], OR certain objects would get a protobuf form that you could slot into a JSON document by hex-encoding the bytes, like option (3).

@balopat (Contributor) commented Oct 22, 2020

The new edits make sense - I agree with 6b not being a good drop-in solution.

While we should compare options and keep pushing the boundaries as much as possible, it would be useful to capture the thresholds at which things become unacceptable, acceptable, and ideal, specifically for the XEB applications, e.g.:

Typical XEB instances: 100 x 10-qubit circuits of depth 100

Requirement          Acceptable   Ideal
serialization time   < 5 s        < 1 s
size on disk         < 10 MB      < 500 kB

?

For the above, you could probably use something like https://msgpack.org/ to get into the acceptable range. It's a better version of BSON (binary JSON), which is the storage format for MongoDB, and is a binary drop-in replacement for JSON. However, it might not achieve as much compression as xz + JSON, which is above 100x.

A quick tryout (it seems we already have msgpack as a dependency) shows that msgpack is comparable to JSON in speed but at least 2x smaller. It is not 100x, though:

import time

import cirq
import msgpack

c = cirq.testing.random_circuit(20, 1000, 1)

# Baseline: cirq's JSON serialization.
t = time.monotonic()
print(len(cirq.to_json(c)))
print(time.monotonic() - t)

# msgpack, reusing CirqEncoder to convert cirq objects to plain dicts.
encoder = cirq.protocols.json_serialization.CirqEncoder()
t = time.monotonic()
print(len(msgpack.packb(c, default=encoder.default)))
print(time.monotonic() - t)

Output:

4990245
0.4884163020000001
2048313
0.4313675290000001

@maffoo (Contributor) commented Oct 23, 2020

I'd vote for option 4. It might even be possible to autodetect whether we were passed a gzip or plain-text file by looking for the gzip magic number, so this could be done in a fairly transparent way. It might be worth looking into a binary-first alternative like BSON, but those seem like they'd take substantially more work.
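
Something like this, untested (the helper name is just for illustration):

import gzip
import json

GZIP_MAGIC = b'\x1f\x8b'  # first two bytes of any gzip stream

def read_json_autodetect(filename: str):
    # Peek at the first two bytes to decide how to open the file.
    with open(filename, 'rb') as f:
        is_gzip = f.read(2) == GZIP_MAGIC
    opener = gzip.open if is_gzip else open
    with opener(filename, 'rt') as f:
        return json.load(f)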

I'd also say that while proto encoding seems like it should be a good choice, the way we use protos currently is very inefficient because we include a bunch of strings in the protos (gate names, argument names, etc.) rather than putting those names in the proto schema. I would guess that our proto-encoded circuits are currently comparable in size to json-encoded circuits, in addition to the problem of not all gates being supported, as @mpharrigan noted.

@mpharrigan (Collaborator, Author) commented

In case it wasn't clear by the slanted way I wrote the original post, I also like option 4 :)

@balopat -- I agree that it's important to actually create a benchmark to measure what we want. We should probably budget for 50 circuits of depth 200 on 50 qubits for XEB.

@balopat (Contributor) commented Oct 23, 2020

I like that too, now that I've realized that to_json works for any object, not just Cirq objects. :)
I think we have a winner. Marking it as triage/accepted. What's the priority on this one?

@balopat added the area/json and triage/accepted labels Oct 23, 2020
@mpharrigan (Collaborator, Author) commented

This is high priority for my work. I'll take this issue.

One question I have is how to deal with filenames and design the API.

How should the user specify compression when writing? One option would be to have to_json(file_or_fn, *, compress=None), where None means "no compression" and "gzip" means gzip. This can only work when passed a filename (not when returning a string or writing to an already-opened file). If we do this, should the function append ".gz" to the filename if it isn't already there? Should we have been appending ".json" to filenames that don't already have it this whole time? This is how np.save works: https://numpy.org/devdocs/reference/generated/numpy.save.html

The other option would be to use the filename to determine compression. If the filename ends in ".gz" use gzip compression.

How should the user specify compression when reading? You could argue that during saving you must specify the compress option, but during reading it should be more permissive and try to open filenames with ".gz" extensions using gzip.

@balopat (Contributor) commented Oct 29, 2020

I think we should have symmetry between the two directions and have both convention based and explicit methods.

Signatures:

def to_json(obj: Any,
            file_or_fn: Union[None, IO, pathlib.Path, str] = None,
            *,
            compress: Optional[str] = None,
            indent: int = 2,
            cls: Type[json.JSONEncoder] = CirqEncoder) -> Optional[str]


def read_json(file_or_fn: Union[None, IO, pathlib.Path, str] = None,
              *,
              compress: Optional[str] = None,
              json_text: Optional[str] = None,
              resolvers: Optional[Sequence[JsonResolver]] = None) -> Any:

Explicit (compress is not None)

If compress="no" or compress="gzip" is specified, that behavior is forced ("no" = plain text). No fallbacks.

Convention based (compress is None)

If the filename ends with ".gz", it tries gzip; if that fails, or if the filename does not end with ".gz", it tries to deserialize without compression.

Examples:

# convention based: writes gzip-compressed output because the file name ends in ".gz"
to_json(obj, file_or_fn="file.gz")
# forced no-compression despite the file name
to_json(obj, file_or_fn="file.gz", compress="no")
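
A rough, untested sketch of the read-side dispatch (the helper name is illustrative only):

import gzip
import pathlib

def _read_text(file_or_fn, compress=None) -> str:
    path = pathlib.Path(file_or_fn)
    if compress == "gzip":  # explicit: forced, no fallback
        with gzip.open(path, "rt") as f:
            return f.read()
    if compress is None and path.suffix == ".gz":  # convention based
        try:
            with gzip.open(path, "rt") as f:
                return f.read()
        except OSError:  # not actually gzipped -> fall back to plain text
            pass
    return path.read_text()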

WDYT?

@balopat added the priority/p1 label Oct 29, 2020
@Ashalynd (Collaborator) commented

+1 to trying gzipping before anything else.

@95-martin-orion (Collaborator) commented

@mpharrigan, what remains to be done here? It looks like gzip support was added, and repetitive circuits can be further condensed using CircuitOperations.
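
For reference, the condensing with CircuitOperation looks roughly like this (a minimal sketch; the repetition count is arbitrary):

import cirq

q0, q1 = cirq.LineQubit.range(2)
layer = cirq.FrozenCircuit(cirq.H(q0), cirq.CNOT(q0, q1))
# The subcircuit is serialized once, plus a repetition count,
# instead of repeating its moments in the JSON.
repeated = cirq.Circuit(cirq.CircuitOperation(layer, repetitions=100))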

@mpharrigan (Collaborator, Author) commented

Yes, this should be considered done with the linked solutions.
