-
A pointer in C/C++ is actually just a memory address.
-
@saudet @guangster Generally the way you would do this is to pass buffers back and forth via pointer addresses. Make sure you're managing the GC, though: depending on what references are present, the pointers can be GCed by either JavaCPP or Python, and that can cause crashes.
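To make that concrete, here is a minimal, self-contained Python sketch of the hazard; the ctypes buffer below just stands in for whatever object actually owns the native memory:

```python
import ctypes
import numpy as np

# A view built from a raw address does NOT keep the underlying buffer alive.
# If the owning object gets collected (on either the Java or the Python side)
# while a view still exists, the view points at freed memory.
owner = (ctypes.c_float * 4)(1.0, 2.0, 3.0, 4.0)   # stand-in for the buffer's owner
address = ctypes.addressof(owner)
view = np.ctypeslib.as_array((ctypes.c_float * 4).from_address(address))
del owner        # nothing references the owner any more, so its memory is freed
print(view)      # undefined behavior: may print garbage or crash
```

So whichever side hands out the address needs to keep the owning object referenced until every consumer of that memory is done with it.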
-
@guangster is there a way to make that clearer? What were you looking at that indicated that? Maybe the old website? Make sure you look at the new website. We have a python4j framework that wraps all of this, which might be interesting for you, and there is also a dedicated page about it on the website.
-
Goal
I have been looking for a way to efficiently ship data from Java data pipelines to machine learning (ML) frameworks in Python (specifically PyTorch).
The goal is to pass a large amount of data (per training batch) to PyTorch running in Python efficiently, so that I can train and run experiments in Python.
@saudet has kindly provided guidance in bytedeco/javacpp-presets#1107 and I have an initial prototype working (see below), but I thought this might be a more appropriate place for further discussion. It is very fast, but I know JavaCPP may not have been designed for this, so I would like to hear advice from experts in this area on where things could break and on potential flaws. Thanks in advance!
Current solution with JavaCPP
Where JavaCPP comes in is that it lets us create a large native (C) data array from Java (via JavaCPP's FloatPointer) and then wrap that same array in our training batch in PyTorch, so we don't waste time copying the data multiple times.
Specifically, I use JavaCPP's Pointer classes (e.g. FloatPointer) on top of a Java data iterator to fill a native float buffer and expose its address and length to the Python side.
Then, when we want to train in Python, I use Python's ctypes to wrap the memory behind the returned tensor.address() into an array arr (a rough sketch of this is below). That arr can be wrapped by PyTorch without allocating new memory and without the data ever being converted into Python objects, which makes this method very fast. After the batch is done, I can simply call tensor.close(), perhaps in a try-finally block. Here I use PyJNIus to expose the Java function, but @saudet has suggested simpler/better options in bytedeco/javacpp-presets#1107.
I want to emphasize, though, that the value of using JavaCPP here is not to expose functions but to pass data efficiently. IMO this is a bit different from the related discussion bytedeco/javacpp#17, as I am looking to build a Java program with JavaCPP, but my use case is to pass data efficiently to PyTorch in Python without converting the data to Python. For a back-of-the-envelope calculation, a typical batch may have 1e4 samples, each with around 1e3 features, which means the FloatPointer is carrying 1e7 4-byte floats (about 40 MB). I have tried other options, but this is by far the fastest.