Bulk parsing json metadata #3087
jeromekelleher
started this conversation in
Show and tell
Your use case here is to fetch a single column as a numpy array? Or possibly "fetch a list of keys as a set of numpy arrays"? If we make it a requirement that there is a schema and only support basic types, I think those use cases would be doable by doing the schema checking in Python, then passing the columns-with-types request down to a C JSON parser, maybe cJSON or YAJL (both under MIT-style licenses). The C code would error out if the wrong type was encountered, and otherwise pass back the numpy arrays.
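To make the proposed split concrete, here is a rough Python-only sketch. It assumes the metadata schema is a JSON Schema dict with a top-level `properties` mapping; the `fetch_columns` function and the `BASIC_TYPES` table are hypothetical names used only for illustration, and the per-record loop is a stand-in for the C parser, not a suggestion for how it would be written.

```python
import json

import numpy as np

# Hypothetical mapping from basic JSON Schema types to (Python type, NumPy dtype).
BASIC_TYPES = {
    "integer": (int, np.int64),
    "number": ((int, float), np.float64),
}


def fetch_columns(records, schema, keys):
    """Return {key: numpy array} for the requested keys; raise on bad types."""
    # Schema checking in Python: every requested key must have a basic type.
    plan = {}
    for key in keys:
        declared = schema["properties"][key]["type"]
        if declared not in BASIC_TYPES:
            raise TypeError(f"unsupported type {declared!r} for key {key!r}")
        plan[key] = BASIC_TYPES[declared]
    columns = {
        key: np.empty(len(records), dtype=dtype)
        for key, (_, dtype) in plan.items()
    }
    # Stand-in for the C parser: decode each record and check the value types.
    for row, raw in enumerate(records):
        obj = json.loads(raw)
        for key, (py_type, _) in plan.items():
            value = obj[key]
            # bool is a subclass of int in Python, so reject it explicitly.
            if isinstance(value, bool) or not isinstance(value, py_type):
                raise TypeError(f"row {row}: wrong type for key {key!r}")
            columns[key][row] = value
    return columns
```

A real implementation would hand the validated `plan` to the C layer and fill the arrays there, so that no Python objects are created per record.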
For large tables, accessing metadata row-by-row using the tskit API can be prohibitively slow. Sometimes we want columnar access to the data.
One way to do this is to use the PyArrow read_json function, although it requires some jiggery-pokery.
First we need to insert newlines between all the metadata records from the table, so that PyArrow can read it. Then, we have to write that to a file.
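A minimal sketch of the newline-insertion step, assuming the metadata is available as the usual ragged pair of columns (a flat byte array plus an offsets array with one more entry than there are rows). The function name `insert_newlines` is hypothetical and this is not the code used for the timings in this post.

```python
import numpy as np


def insert_newlines(metadata, metadata_offset):
    """Return a uint8 array with a newline appended after every record."""
    md = np.asarray(metadata).view(np.uint8)
    # Cast to int64 so that adding the row index cannot promote to float.
    off = np.asarray(metadata_offset, dtype=np.int64)
    num_rows = len(off) - 1
    # Start with a buffer full of newlines, then overwrite the other slots.
    out = np.full(len(md) + num_rows, ord("\n"), dtype=np.uint8)
    # Record j ends at off[j + 1] and has been shifted right by j newlines.
    newline_pos = off[1:] + np.arange(num_rows)
    keep = np.ones(len(out), dtype=bool)
    keep[newline_pos] = False
    out[keep] = md
    return out
```

The result can be written to disk with `out.tofile(path)`. Note that rows with empty metadata come out as blank lines, which the downstream reader may or may not tolerate.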
It is pretty fast, at about 5 seconds end-to-end to parse about 1 GB of quite loosely structured JSON metadata. I think the read_json bit could be a good bit faster if the parser were given an explicit schema and told which specific fields to pull out.
However, it's still slower than it needs to be, as there's quite a lot of copying going on. Unfortunately Arrow only seems to support reading from actual files, so we have to go out to the file system. If there was some way of getting Arrow to read directly from the buffer we're making, then maybe it would be worth trying to make this path smoother.
This isn't quite what I need currently, so I'm not going to pursue it any further. However, I thought that this might prove useful to someone in the future, so I posted it here.
If it is, we should totally add support for inserting newlines between the metadata records in tskit, as this is a very reasonable thing to want.