GraalPython performance is terrible when Pandas is used #334
I tried a slightly different version with a loop in an attempt to "warm up" GraalPython, but CPython is still a LOT faster.
Here are the numbers from CPython, and here are the numbers from GraalPython: |
Third attempt to warm up GraalPython: I added sleep() inside the loop and made the loop run 100 times instead of 10.
The numbers are now slightly better, but still 100 times worse than CPython. Here are the numbers from CPython: |
Yes, this is about what is to be expected from this benchmark. There is a lot of boundary crossing between Python code and C code. If this is truly representative of what your workload does, GraalPy has little chance of beating CPython here. |
For reference, I modified your script as such:

```python
import pandas as pd
import time

def run():
    data_dict = []
    for i in range(100):
        data_dict.append({
            'name': 'User' + str(i),
            'score1': i,
            'score2': i
        })
    df = pd.DataFrame(data_dict)
    mean = df['score1'].mean()
    return mean

total = 0
while True:
    for i in range(50):
        start = time.time() * 1000
        run()
        end = time.time() * 1000
    total += 50
    print(total, ": 50 iterations took ", end - start, "ms")
```

CPython averages around 1 ms per 50 iterations here for me. Overall, this workload is simply too small for us to offset the additional data copies and boundary crossings we incur for each element of the dict when creating the pandas frame. |
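As an editorial aside (a sketch, not from the thread): one way to reduce the per-element boundary crossings described above is to build the frame from columnar data instead of a list of per-row dicts, so pandas receives a few flat sequences rather than touching each of the 100 dicts individually:

```python
import pandas as pd

# Build columns directly instead of a list of per-row dicts.
# pandas then ingests three flat sequences in one go, rather than
# iterating over every dict element while constructing the frame.
n = 100
df = pd.DataFrame({
    'name': ['User' + str(i) for i in range(n)],
    'score1': list(range(n)),
    'score2': list(range(n)),
})
mean = df['score1'].mean()
print(mean)  # 49.5
```

Whether this helps GraalPy specifically is an assumption; it is a generally cheaper construction path on CPython as well.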
Well, my actual application workload is a lot more complex, but I just showed this workload as a simple reproducible example. I can try running my workload with a few thousand dummy values to see if that helps it warm up. Also, in this experiment, I called $JAVA_HOME/languages/python/bin/python directly. However, my real application is a Java-based server application running on GraalVM, and the Python code is embedded as a polyglot guest language. So there is even more data conversion going from the Python guest language to the Java host language. |
If you can, I would recommend you run your actual workload with |
Ah, I just found something. We actually have a deopt loop in this benchmark - the following two deopts repeat 100s of times per second:
and
I will try to take a look (/cc fyi @fangerer) |
There is a small bug in the code @timfel posted above: it is still measuring the time for one iteration, not 50 iterations as intended. The start and end time measurements must be outside the inner loop. Here is the new version of the script; I also added some more statements to the run() function to match my real-world workload:

```python
import pandas as pd
import pandas_ta as ta
import time

def run():
    data_dict = []
    for i in range(100):
        data_dict.append({
            'name': 'User' + str(i),
            'score1': i,
            'score2': i
        })
    df = pd.DataFrame(data_dict)
    sma = ta.sma(df["score1"], length=10)
    ema = ta.ema(df['score1'], length=10)
    lastSma = sma.iat[-1].item()
    lastEma = ema.iat[-1].item()
    return (lastSma + lastEma) / 2

total = 0
while True:
    start = time.time() * 1000
    for i in range(50):
        run()
    end = time.time() * 1000
    total += 50
    print(total, ": 50 iterations took ", end - start, "ms")
```
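As an editorial aside (my suggestion, not from the thread): `time.perf_counter()` is the clock recommended for benchmarking in the Python standard library; it is monotonic and typically has higher resolution than `time.time()`. A minimal timing helper along those lines:

```python
import time

def bench(fn, iterations=50):
    """Time `iterations` calls of `fn` and return elapsed milliseconds."""
    start = time.perf_counter()
    for _ in range(iterations):
        fn()
    return (time.perf_counter() - start) * 1000.0

# Example usage with a trivial stand-in workload:
elapsed_ms = bench(lambda: sum(range(100)))
print(f"50 iterations took {elapsed_ms:.3f} ms")
```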
First, here is the result for CPython:

3150 : 50 iterations took 49.900146484375 ms

Now, here is the result for GraalPy after 14,000 warm-up iterations:

14450 : 50 iterations took 2208.0 ms

Then I also ran the same script from Java, which is what really happens in my app:

```java
public void testPythonFunction() throws Exception {
    String pythonCode = """
            import pandas as pd
            import pandas_ta as ta
            import polyglot

            def run():
                data_dict = []
                for i in range(100):
                    data_dict.append({
                        'name': 'User' + str(i),
                        'score1': i,
                        'score2': i
                    })
                df = pd.DataFrame(data_dict)
                sma = ta.sma(df["score1"], length=5)
                ema = ta.ema(df['score1'], length=5)
                lastSma = sma.iat[-1].item()
                lastEma = ema.iat[-1].item()
                return (lastSma + lastEma) / 2

            polyglot.export_value("run", run)
            """;
    try (Context context = Context.newBuilder()
            .allowAllAccess(true)
            .option("python.Executable", VENV_EXECUTABLE)
            .option("python.ForceImportSite", "true")
            .build()) {
        long parsingStart = System.currentTimeMillis();
        Source source = Source.newBuilder("python", pythonCode, "script.py").build();
        Value func = context.eval(source);
        long parsingEnd = System.currentTimeMillis();
        System.out.println("Parsing script took: " + (parsingEnd - parsingStart) + " ms");
        int total = 0;
        while (true) {
            long start = System.currentTimeMillis();
            for (int i = 0; i < 50; i++) {
                func.execute().asDouble();
            }
            long end = System.currentTimeMillis();
            total += 50;
            System.out.println(total + ": 50 iterations took " + (end - start) + " ms");
        }
    }
}
```

These are the results from Java after 7,000 warm-up iterations:

7000: 50 iterations took 4978 ms

So if we compare the time from Java (about 4000 ms) to CPython (about 40 ms), that is about 100 times slower. In addition, the time for initial code parsing and context initialization was also much longer than expected. |
Yes, I suspect with this benchmark we won't get much better than what I had above (20-30x slower than CPython; maybe with some work it will be only 10x slower). This particular benchmark is dominated by copying data from Python to native memory. |
Thanks for the comments. Please note that when observed from Java, it was 100x slower (not just 20-30x slower) than CPython: the time for CPython was about 40 ms, and the time from Java was about 4000 ms.

Also, I could not completely understand why this benchmark is dominated by copying data from Python to native memory. The code above calculates a simple moving average (SMA) and an exponential moving average (EMA) with a rolling window of 5 data points on a Pandas dataframe with 100 rows. Besides creating the dummy data and loading it into the Pandas dataframe, I don't see data being copied anywhere else in the code. What would be an example of code that is not dominated by copying data from Python to native memory?

Anyway, let me know if there is a new build to try this again. In the meantime, I will now have to start exploring other ways to run user-supplied, insecure Python code in a Java server app. |
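As an editorial aside, a hypothetical example of a workload that is *not* dominated by copying (a sketch, not from the thread): create the frame once, then run many computations against it, so the per-element copy into native memory happens only a single time while the numeric work repeats:

```python
import pandas as pd

# Data crosses the Python -> native boundary once, at frame creation...
df = pd.DataFrame({'score1': list(range(100))})

# ...after which repeated rolling-window computations run inside
# native pandas/numpy code without copying the rows again.
results = []
for _ in range(1000):
    results.append(df['score1'].rolling(5).mean().iat[-1])
print(results[-1])  # mean of 95..99 = 97.0
```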
The calculation in this benchmark is nothing compared to loading all the data into a pandas dataframe on GraalPy. I measured the time around creating the dummy data, creating the dataframe, and then calculating the mean.
While we are indeed slower on everything involving native code here, the largest hit for us is creating the dataframe. That's because all of the data is copied from Java into native memory, and then copied again by pandas. CPython only copies once, from the native CPython objects into pandas. |
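The phase-by-phase measurement described above can be sketched like this (an editorial reconstruction, not the commenter's actual harness):

```python
import time
import pandas as pd

def timed(label, fn):
    """Run fn once, print the elapsed wall time, and return its result."""
    start = time.perf_counter()
    result = fn()
    print(f"{label}: {(time.perf_counter() - start) * 1000:.3f} ms")
    return result

data = timed("create dummy data", lambda: [
    {'name': 'User' + str(i), 'score1': i, 'score2': i} for i in range(100)
])
df = timed("create dataframe", lambda: pd.DataFrame(data))
mean = timed("calculate mean", lambda: df['score1'].mean())
print(mean)  # 49.5
```

Comparing the three printed times on CPython vs GraalPy should show which phase dominates on each runtime.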
I installed GraalVM 22.3.1, the Python language component, and NumPy and Pandas as described in:
https://www.graalvm.org/latest/reference-manual/python/

Here is my sample test.py:
++++

Running it:

```shell
$ $JAVA_HOME/languages/python/bin/python test.py
Time Taken(ms): 1139.0
```

If I run the same script with default CPython, it finishes in under 5 ms. The difference of 5 ms vs. 1139 ms is huge, so the performance is not even comparable.