
Moving to memset instead of a raw loop massively improves performance. #7

Merged (1 commit) on Feb 2, 2021

Conversation

@tritoke (Contributor) commented Feb 1, 2021

So in new_node there is a raw loop writing a single value into every byte of the payload; this is exactly what the memset function does, except memset goes brrrrr and raw loops do not :)

On my machine it runs 5 seconds faster than the Java version.
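For reference, the change boils down to the following. This is a minimal sketch: the struct and the explicit size parameter are simplified stand-ins for the repo's definitions (which derive the size from almost_pseudo_random), not the actual code.

```c
#include <stdlib.h>
#include <string.h>

/* Simplified stand-in for the Node struct in the repo. */
typedef struct Node {
  long id;
  int size;
  char *payload;
} Node;

/* new_node with the raw fill loop replaced by memset. */
Node *new_node_memset(long id, int size) {
  Node *node = malloc(sizeof(Node));
  node->id = id;
  node->size = size;
  node->payload = malloc(size);
  /* memset stores the same byte into every position of the buffer,
   * which is exactly what the original loop did one byte at a time. */
  memset(node->payload, (char) id, size);
  return node;
}
```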

@tritoke (Contributor, Author) commented Feb 2, 2021

So I've done a bit of looking into this and I'm confused why the compiler doesn't optimise away the loop normally...

void loop_memset_naive(char * mem, int charId, int size) {
  for (int i = 0; i < size; i++) {
    mem[i] = charId;
  }
}

Node *new_node(long id) {
  int size = (int) (almost_pseudo_random(id) * MAX_PAYLOAD_SIZE);
  int charId = (char) id;
  Node *node = malloc(sizeof(NodeDef));
  node->id = id;
  node->size = size;
  node->payload = malloc(sizeof(char) * size);
  loop_memset_naive(node->payload, charId, size);
  return node;
}

I tried just writing a naive memset using a loop and replacing it with a call to that, and it gets optimised to using memset straight away...

[screenshots omitted]

This is opposed to when I leave the loop as is:

[screenshot omitted]

It is plain to see that malloc actually represents only a small portion of the runtime, and that the fill loop is the true performance limitation of this program.

@tritoke (Contributor, Author) commented Feb 2, 2021

For reference this is the tool I'm using to profile them: https://github.com/KDAB/hotspot

They need to be compiled with debug info, so just add -g to the list of arguments to gcc:

#!/usr/bin/env sh

mkdir -p build/c
gcc -O3 -g -o build/c/almost_pseudo_random src/main/c/almost_pseudo_random.c -lm
gcc -O3 -g -o build/c/java_2_times_faster_than_c src/main/c/java_2_times_faster_than_c.c -lm

Then just run hotspot, find the executable, and click "start recording" :)

@morisil (Member) commented Feb 2, 2021

Nice research! Originally, when I wrote this code, I was just trying to simulate an input of variable size which might arrive to be processed. I was thinking about filling it with random data, but then decided not to focus so much on it in an example which is supposed to be more conceptual. But you are right that the implementations in all the other languages already use some idiomatic way of filling a byte array with a single value. Let me merge your PR and rerun my tests.

@morisil morisil merged commit e24d7f9 into xemantic:main Feb 2, 2021
@morisil (Member) commented Feb 2, 2021

@tritoke I am happy to learn about statistical profiling in C. I've been using statistical profilers a lot in distributed systems running on the JVM. My team was also responsible for writing our own internal statistical profilers, one of which started as a hackathon project called Spy Goat. But what mattered most for us, at the end of the day, was the quality of the custom metrics we sent to Datadog to analyze performance across the whole stack of microservices. So this PR has great sentimental value for me. :)

It also disproved the hypothesis, which I picked up too quickly, that the performance difference is related to differences in memory management. I am thinking now about how to rewrite the readme to keep the full history of this research. It would also be possible to do the following in Java:

UNSAFE.setMemory(payload, Unsafe.ARRAY_BYTE_BASE_OFFSET, size, (byte) id);

But I have a feeling that it's getting too far into specific microoptimizations, which is not something I wanted to convey with this project.

@tritoke (Contributor, Author) commented Feb 2, 2021

Personally I'd be really interested to see how this affects the performance of the Java version. I have very little experience with optimising Java code, mainly because my association with Java is school / uni work that I've had to do out of necessity, but it's definitely something I'm interested in learning about if you have any resources you'd recommend?

morisil added a commit that referenced this pull request Feb 4, 2021
…ay takes most of the time, therefore modus operandi changed from the intended one: processing big amount of small data structures of variable size
@morisil (Member) commented Feb 4, 2021

@tritoke I had some time to update the algorithm, and also added clang to compare with gcc. I was also experimenting a bit with Kotlin Native, which is actually interesting for me for my future projects and seems to depend on clang infrastructure.

Regarding Java optimizations, it's possible to use sun.misc.Unsafe with some tricks. In general it is used internally by the JDK, and an instance cannot be obtained directly unless one does something like:

import java.lang.reflect.Field;
import sun.misc.Unsafe;

// ...
Field f = Unsafe.class.getDeclaredField("theUnsafe");
f.setAccessible(true);
Unsafe unsafe = (Unsafe) f.get(null);

This opens up possibilities of direct memory manipulation and writing one's own allocators. Still, I would assume that it will have the extra overhead of JNI calls, so it might be counterproductive in many cases. Here are some interesting use cases:

http://mishadoff.com/blog/java-magic-part-4-sun-dot-misc-dot-unsafe/

Another option I see for more direct memory access in Java is the javacpp project. I use it in my work for accessing devices like the Kinect. In order to wrap any C/C++ library as something consumable in Java, it provides a layer of abstraction over the constructs of these languages. For example, this would be the memset equivalent:

http://bytedeco.org/javacpp/apidocs/org/bytedeco/javacpp/Pointer.html#fill-int-

I have a surprisingly positive experience with javacpp, paradoxically in "art projects", which sound kind of far away from the hardware. Javacpp is also an ecosystem of packages for common native libraries, like freenect or opencv. For my interactive installations I bind my shaders together in Kotlin using OPENRNDR, and supply them with space-sensing data from the Kinect. With a standard maven or gradle build, all the native libraries get packaged inside a single jar file specific to the architecture. Even though I develop on linux, my clients receiving this specialized software on mac or windows can just use it out of the box, without installing any additional drivers or libraries.

@morisil (Member) commented Feb 4, 2021

@tritoke Regarding statistical profiling of code running on the JVM, even on production machines, I used this tool a lot:

https://en.wikipedia.org/wiki/JDK_Mission_Control

It has been open source for some time:
https://github.com/openjdk/jmc

It used to require starting the JVM with options like:

java -XX:+UnlockCommercialFeatures -XX:+FlightRecorder ...

Maybe it has changed.

@tritoke (Contributor, Author) commented Feb 4, 2021

Because the hotspot profiler depends only on perf, I thought it might be interesting to see what running the Java program through it looked like. And... yeah...

[screenshot omitted]

not exactly helpful 😂

I'm currently following this blog post to attempt to get a more relevant profile that I can put into hotspot and compare directly with the C ones I have :)

@tritoke (Contributor, Author) commented Feb 4, 2021

The only thing that currently concerns me is the proportion of the program that almost_pseudo_random takes up.
In both the C and Rust versions it makes up more than half of the program's entire runtime: 56% for Rust and 59% for C (clang).
Clang:
[screenshot omitted]

Rust:
[screenshot omitted]

@morisil (Member) commented Feb 4, 2021

Yes, I am surprised by the performance of trigonometry. I would have assumed that it would be executed directly on the CPU as the floating point instruction FSIN. But apparently almost_pseudo_random is expensive to calculate; this is the reason why I made a separate test for it, and time-wise the results of calling it in a loop are comparable for C and Java, with Java being slightly faster here. My hypothesis is that calling fmod requires a library call which the C compiler cannot inline, while HotSpot can inline it. This is one of those scenarios where a VM can gain performance over native code, potentially inlining library calls. To my surprise the Go version is almost 2x faster here, but its results are also slightly different. I checked the code and it is using a custom implementation of the trigonometric functions, which would mean that there is much more arithmetic involved than just the FSIN introduced with the first FPUs for the intel architecture.

@morisil (Member) commented Feb 4, 2021

So my assumption was that if C and Java are quite comparable in terms of trigonometry speed, then the further performance drop in native code must come from a different source, most likely from memory management, as was pointed out in the discussion in #1. It's not that surprising taking into account that when I run this code on my computer, there are 2 garbage collector threads allocated to clean up in parallel. In other words, the VM is "cheating" by turning memory releasing into a concurrent operation. I guess at the end of execution, when the JVM is terminated, the whole heap is simply disposed of, including whatever hasn't been collected yet.
