GraalPython performance is terrible when Pandas is used #334

Open
subashatreya opened this issue May 8, 2023 · 11 comments

subashatreya commented May 8, 2023

I installed GraalVM 22.3.1, the Python language component, and NumPy and Pandas as described in:
https://www.graalvm.org/latest/reference-manual/python/

Here is my sample test.py

import pandas as pd
import time
start = time.time() * 1000
data_dict = []
for i in range(100):
    data_dict.append({
            'name': 'User'+str(i),
            'score1': i,
            'score2': i
    })

df = pd.DataFrame(data_dict)
mean = df['score1'].mean()

end = time.time() * 1000

print("Time Taken(ms): " + str(end-start))

++++

$JAVA_HOME/languages/python/bin/python test.py
Time Taken(ms): 1139.0

If I run the same script with the default CPython, it finishes in under 5ms. The difference between 5ms and 1139ms is huge, so the performance is not even comparable.

subashatreya (Author) commented May 8, 2023

I tried a slightly different version with a loop in an attempt to "warm up" GraalPython, but CPython is still a LOT faster.
Also, note that I am not measuring "Total Script Time"; I am measuring time after the script has already started running, just for the Pandas function calls. So, per my understanding, compilation should have already happened before I even measure the time, and hence it should not have caused this delay.

import pandas as pd
import time

data_dict = []
for i in range(100):
    data_dict.append({
            'name': 'User'+str(i),
            'score1': i,
            'score2': i
    })


for i in range(10):
    start = time.time() * 1000
    df = pd.DataFrame(data_dict)
    mean = df['score1'].mean()
    end = time.time() * 1000
    print("Time Taken(ms): " + str(end-start))

Here are the numbers from CPython
Time Taken(ms): 4.468994140625
Time Taken(ms): 0.679443359375
Time Taken(ms): 0.52490234375
Time Taken(ms): 0.4150390625
Time Taken(ms): 0.39501953125
Time Taken(ms): 0.4150390625
Time Taken(ms): 0.385009765625
Time Taken(ms): 0.447021484375
Time Taken(ms): 0.378173828125
Time Taken(ms): 0.36279296875

And here are the numbers from GraalPython
Time Taken(ms): 536.0
Time Taken(ms): 292.0
Time Taken(ms): 294.0
Time Taken(ms): 183.0
Time Taken(ms): 172.0
Time Taken(ms): 260.0
Time Taken(ms): 151.0
Time Taken(ms): 140.0
Time Taken(ms): 285.0
Time Taken(ms): 167.0

subashatreya (Author) commented:

Third attempt to warm up GraalPython. I added sleep() inside the loop and made the loop run 100 times instead of 10.

import pandas as pd
import time

data_dict = []
for i in range(100):
    data_dict.append({
            'name': 'User'+str(i),
            'score1': i,
            'score2': i
    })


for i in range(100):
    start = time.time() * 1000
    df = pd.DataFrame(data_dict)
    mean = df['score1'].mean()
    end = time.time() * 1000
    print("Time Taken(ms): " + str(end-start))
    time.sleep(1)

The numbers are now better, but still roughly 40 times slower than CPython (about 18ms vs. 0.4ms per call).
Time Taken(ms): 16.0
Time Taken(ms): 18.0
Time Taken(ms): 19.0
Time Taken(ms): 18.0
Time Taken(ms): 17.0
Time Taken(ms): 16.0
Time Taken(ms): 19.0
Time Taken(ms): 20.0
Time Taken(ms): 18.0
Time Taken(ms): 20.0

Here are the numbers from CPython
Time Taken(ms): 0.679443359375
Time Taken(ms): 0.52490234375
Time Taken(ms): 0.4150390625
Time Taken(ms): 0.39501953125
Time Taken(ms): 0.4150390625
Time Taken(ms): 0.385009765625
Time Taken(ms): 0.447021484375
Time Taken(ms): 0.378173828125
Time Taken(ms): 0.36279296875

timfel (Member) commented May 8, 2023

Yes, this is about what is to be expected from this benchmark. There is a lot of boundary crossing between Python code and C code. If this is truly representative of all that your workload does, GraalPy has little chance of beating CPython here.

timfel (Member) commented May 8, 2023

For reference, I modified your script as follows:

import pandas as pd
import time

def run():
    data_dict = []
    for i in range(100):
        data_dict.append({
            'name': 'User'+str(i),
            'score1': i,
            'score2': i
        })
    df = pd.DataFrame(data_dict)
    mean = df['score1'].mean()
    return mean

total = 0
while True:
    for i in range(50):
        start = time.time() * 1000
        run()
        end = time.time() * 1000
    total += 50
    print(total, ": 50 iterations took ", end - start, "ms")

CPython averages around 1ms/50 iterations here for me.
GraalPy 22.3 takes quite some time to warm up. After about 7000 iterations it is pretty steady for me at ~24ms/50 iterations.
GraalPy 23.0 with our new native execution mode warms up faster, but peaks at 33ms/50 iterations after 400 iterations total.

Overall, this workload is simply too small for us to offset the additional data copies and boundary crossings we have for each element of the dict when creating the pandas frame.
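
As an aside, here is a minimal sketch of a columnar variant that hands pandas three flat lists instead of 100 per-row dicts; whether that actually reduces the per-element boundary crossings on GraalPy is an assumption rather than something measured in this issue, and the name run_columnar is made up for illustration.

import pandas as pd
import time

def run_columnar():
    # Build the columns directly instead of a list of per-row dicts.
    names = ['User' + str(i) for i in range(100)]
    scores = list(range(100))
    df = pd.DataFrame({'name': names, 'score1': scores, 'score2': scores})
    return df['score1'].mean()

start = time.time() * 1000
for _ in range(50):
    run_columnar()
end = time.time() * 1000
print("50 iterations took", end - start, "ms")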

subashatreya (Author) commented:

Well, my actual application workload is a lot more complex; I just showed this workload as a simple reproducible example.

I can try running my workload with a few thousand dummy values to see if that helps it warm up.
My requirement is not necessarily to beat CPython, but to execute one Pandas-based computation within 10ms. The number I am observing now is about 400ms. This number must go down from 400ms to 10ms for the application to be practically viable.

Also, in this experiment, I called $JAVA_HOME/languages/python/bin/python directly. However, my real application is a Java-based server application running on GraalVM, and the Python code is embedded as a polyglot guest language. So there is even more data conversion going from the guest Python language to the host Java language.

timfel (Member) commented May 8, 2023

If you can, I would recommend you run your actual workload with --cpusampler to see where time is actually spent. Then I could check if that is dominated by something we can fix or not.
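
A minimal invocation, assuming the same launcher used earlier in this issue (--cpusampler is a standard GraalVM tools option):

$JAVA_HOME/languages/python/bin/python --cpusampler test.py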

timfel (Member) commented May 9, 2023

Ah, I just found something. We actually have a deopt loop in this benchmark - the following two deopts repeat 100s of times per second:

[Deoptimization initiated
    name: PyErr_GetExcInfo (SubstrateCompilation-6793)
    sp: 0x7ffe1e11f440  ip: 0x7f669fbec397
    reason: TransferToInterpreter  action: InvalidateReprofile
    debugId: 13  speculation: jdk.vm.ci.meta.SpeculationLog$NoSpeculationReason@c2584d3
    stack trace that triggered deoptimization:
        at com.oracle.graal.python.nodes.util.ExceptionStateNodes$GetCaughtExceptionNode.executeFromNative(ExceptionStateNodes.java:125)
        at com.oracle.graal.python.builtins.modules.cext.PythonCextErrBuiltins$PyTruffleErr_GetExcInfo.info(PythonCextErrBuiltins.java:364)

and

[Deoptimization initiated
    name: PyTruffle_ByteArrayToNative (SubstrateCompilation-6794)
    sp: 0x7ffe1e11ea20  ip: 0x7f669fbd3b33
    reason: BoundsCheckException  action: InvalidateReprofile
    debugId: 17  speculation: jdk.vm.ci.meta.SpeculationLog$NoSpeculationReason@c2584d3
    stack trace that triggered deoptimization:
        at com.oracle.truffle.host.HostObject$ArrayGet.doByte(HostObject.java:3268)
        at com.oracle.truffle.host.HostObjectFactory$ArrayGetNodeGen$Inlined.execute(HostObjectFactory.java:473)
        at com.oracle.truffle.host.HostObject$ReadArrayElement.doArray(HostObject.java:800)
        at com.oracle.truffle.host.HostObjectGen$InteropLibraryExports$Cached.readArrayElement(HostObjectGen.java:2304)
        at com.oracle.truffle.api.interop.InteropLibraryGen$CachedDispatch.readArrayElement(InteropLibraryGen.java:8819)
        at com.oracle.truffle.llvm.runtime.nodes.intrinsics.interop.LLVMPolyglotRead$LLVMPolyglotGetArrayElement.doIntrinsic(LLVMPolyglotRead.java:124)

I will try to take a look (/cc fyi @fangerer)

subashatreya (Author) commented May 9, 2023

There is a small bug in the code @timfel posted above: it is still measuring the time for one iteration, not 50 iterations as intended. The start and end measurements must be taken outside the inner loop.

Here is the new version of the script. I added some more statements in the run() function to better match my real-world workload.

import pandas as pd
import pandas_ta as ta
import time

def run():
    data_dict = []
    for i in range(100):
        data_dict.append({
            'name': 'User' + str(i),
            'score1': i,
            'score2': i
        })
    df = pd.DataFrame(data_dict)
    sma = ta.sma(df["score1"], length=10)
    ema = ta.ema(df['score1'], length=10)
    lastSma = sma.iat[-1].item()
    lastEma = ema.iat[-1].item()
    return (lastSma + lastEma) / 2

total = 0
while True:
    start = time.time() * 1000
    for i in range(50):
        run()
    end = time.time() * 1000
    total += 50
    print(total, ": 50 iterations took ", end - start, "ms")
    

First, here is the result for CPython:

3150 : 50 iterations took 49.900146484375 ms
3200 : 50 iterations took 49.91796875 ms
3250 : 50 iterations took 56.847900390625 ms
3300 : 50 iterations took 51.123779296875 ms
3350 : 50 iterations took 49.0888671875 ms
3400 : 50 iterations took 44.664794921875 ms
3450 : 50 iterations took 49.506591796875 ms
3500 : 50 iterations took 47.31103515625 ms
3550 : 50 iterations took 52.6630859375 ms
3600 : 50 iterations took 45.040771484375 ms
3650 : 50 iterations took 43.5322265625 ms
3700 : 50 iterations took 45.740966796875 ms
3750 : 50 iterations took 48.583251953125 ms
3800 : 50 iterations took 48.193115234375 ms
3850 : 50 iterations took 43.912109375 ms
3900 : 50 iterations took 43.459716796875 ms
3950 : 50 iterations took 54.65380859375 ms
4000 : 50 iterations took 51.01416015625 ms
4050 : 50 iterations took 47.460693359375 ms

Now, here is the result for GraalPy after 14,000 warm-up iterations

14450 : 50 iterations took 2208.0 ms
14500 : 50 iterations took 2091.0 ms
14550 : 50 iterations took 2121.0 ms
14600 : 50 iterations took 2016.0 ms
14650 : 50 iterations took 2105.0 ms
14700 : 50 iterations took 2121.0 ms
14750 : 50 iterations took 8746.0 ms
14800 : 50 iterations took 1976.0 ms
14850 : 50 iterations took 2429.0 ms
14900 : 50 iterations took 2044.0 ms
14950 : 50 iterations took 2225.0 ms
15000 : 50 iterations took 2146.0 ms

Then I also ran the same script from Java, which is what really happens in my app.

	// Requires org.graalvm.polyglot.Context, Source, and Value; VENV_EXECUTABLE
	// is the path to the Python executable of the GraalPy virtualenv.
	public void testPythonFunction() throws Exception{
		String pythonCode = """
	import pandas as pd
	import pandas_ta as ta
	import polyglot
	def run():
	    data_dict = []
	    for i in range(100):
	        data_dict.append({
	            'name': 'User' + str(i),
	            'score1': i,
	            'score2': i
	        })
	    df = pd.DataFrame(data_dict)
	    sma = ta.sma(df["score1"], length=5)
	    ema = ta.ema(df['score1'], length=5)
	    lastSma = sma.iat[-1].item()
	    lastEma = ema.iat[-1].item()
	    return (lastSma + lastEma) / 2
	polyglot.export_value("run", run)
				""";
		
		try (Context context = Context.newBuilder()
				.allowAllAccess(true)
				.option("python.Executable", VENV_EXECUTABLE)
                .option("python.ForceImportSite", "true")
				.build()) {

			long parsingStart = System.currentTimeMillis();
			Source source = Source.newBuilder("python", pythonCode, "script.py").build();
			Value func = context.eval(source);
			long parsingEnd = System.currentTimeMillis();
			System.out.println("Parsing Script took: " + (parsingEnd - parsingStart) + " ms" );
			
			int total = 0;
			while(true) {
				
				long start = System.currentTimeMillis();
				for(int i=0; i < 50; i++) {
					func.execute().asDouble();					
				}
				long end = System.currentTimeMillis();
				total += 50;
				System.out.println(total+ ": 50 iterations took " + (end - start) + " ms");
				
			}

		}
	}

This is the results from Java after 7000 warm-up iterations

7000: 50 iterations took 4978 ms
7050: 50 iterations took 4552 ms
7100: 50 iterations took 4382 ms
7150: 50 iterations took 4660 ms
7200: 50 iterations took 7241 ms
7250: 50 iterations took 4610 ms
7300: 50 iterations took 4784 ms
7350: 50 iterations took 4697 ms
7400: 50 iterations took 4496 ms
7450: 50 iterations took 4707 ms
7500: 50 iterations took 4100 ms
7550: 50 iterations took 4267 ms
7600: 50 iterations took 4208 ms
7650: 50 iterations took 4196 ms
7700: 50 iterations took 4125 ms
7750: 50 iterations took 4187 ms
7800: 50 iterations took 3847 ms
7850: 50 iterations took 4319 ms
7900: 50 iterations took 3847 ms
7950: 50 iterations took 3850 ms
8000: 50 iterations took 4116 ms

So if we compare time from Java (about 4000ms) to CPython (40ms), that is about 100 times slower.

In addition, the time for initial code parsing and context initialization was also much longer than expected.
Parsing Script took: 139267 ms

timfel (Member) commented May 9, 2023

Yes, I suspect with this benchmark we won't get much better than what I had above (20-30x slower than CPython; maybe with some work it will be only 10x slower). This particular benchmark is dominated by copying data from Python to native memory.

subashatreya (Author) commented:

Thanks for the comments. Please note that when observed from Java, it was 100x slower (not just 20-30x slower) than CPython: the time for CPython was about 40ms and the time from Java was about 4000ms.

Also, I could not completely understand why this benchmark is dominated by copying data from Python to native memory. The code above calculates a simple moving average (SMA) and an exponential moving average (EMA) with a rolling window of 5 data points on a Pandas dataframe with 100 rows. Besides creating the dummy data and loading it into the Pandas dataframe, I don't see data being copied anywhere else in the code.

What would be an example of code that is not dominated by copying data from Python to native memory?

Anyway, let me know if there is a new build to try this again. In the meantime, I will now have to start exploring other ways to run user-supplied, untrusted Python code in a Java server app.
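
As an illustration of the distinction, a workload that is not dominated by Python-to-native copying would be a pure-Python loop with no C-extension calls, such as the sketch below; whether GraalPy actually beats CPython on it is an assumption here, not something measured in this thread.

import time

# Hypothetical pure-Python workload: plain arithmetic, no C extensions,
# so no Python-to-native data copying is involved.
def estimate_pi(n):
    # Leibniz series for pi.
    total = 0.0
    sign = 1.0
    for k in range(n):
        total += sign / (2 * k + 1)
        sign = -sign
    return 4 * total

start = time.time() * 1000
estimate_pi(1_000_000)
end = time.time() * 1000
print("Time Taken(ms): " + str(end - start))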

timfel (Member) commented May 11, 2023

Also, I could not completely understand why this benchmark is dominated by copying data from Python to native memory

The calculation in this benchmark is nothing compared to loading all the data into a pandas dataframe on GraalPy. I measured the time around creating the dummy data, creating the dataframe, and then calculating the mean.

On GraalPy:
Creating data took: 0.0 ms
Creating data frame took: 62.00003623962402 ms
Calculating mean took 3.999948501586914 ms
1000 : 50 iterations took  2965.0 ms

On CPython:
Creating data took: 0.05245208740234375 ms
Creating data frame took: 0.6833076477050781 ms
Calculating mean took 0.34689903259277344 ms
1000 : 50 iterations took  60.477294921875 ms

While we are indeed slower on everything involving native code here, the largest hit for us is creating the dataframe. That's because the entire data is copied from Java into native memory, and then copied again by pandas. On CPython, the data is only copied from the native CPython objects into pandas.
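
For reference, a minimal sketch of the kind of per-phase timing that would produce the labels above; the exact instrumented script was not posted, so its structure here is an assumption.

import pandas as pd
import time

def run_instrumented():
    t0 = time.time() * 1000
    data_dict = [{'name': 'User' + str(i), 'score1': i, 'score2': i}
                 for i in range(100)]
    t1 = time.time() * 1000
    df = pd.DataFrame(data_dict)
    t2 = time.time() * 1000
    mean = df['score1'].mean()
    t3 = time.time() * 1000
    print("Creating data took:", t1 - t0, "ms")
    print("Creating data frame took:", t2 - t1, "ms")
    print("Calculating mean took", t3 - t2, "ms")
    return mean

run_instrumented()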
