
NLP-papers-tools-discussion

Anything useful goes here

Possible research directions and Fundamental discussions

Structural probing: visualizing and understanding deep self-supervised language models, by Manning @ Stanford NLP - very important read

On interpretability of BERT - similar in spirit to the work above

The 4 Biggest Open Problems in NLP Today

What innate priors should we build into the architecture of deep learning systems? A debate between Prof. Y. LeCun and Prof. C. Manning

SOTA

For practically anything involving LM:

XLNet: The King is dead, long live the King! Sorry BERT, you were cool, but now you are obsolete :)

GPT-2: Too dangerous to release to the public. Well here it is, with the weights and all.

TRANSFORMER-XL: Still the only one you can use for realistically large documents. In the long term, IMHO this paper is a much more important contribution than BERT.

BERT: The great multi-tasker, trained to do a number of things really well. Great theoretical contributions -- the pinnacle of attention mechanisms.

OpenAI Transformer, aka GPT-1

New Libraries we care about

PAIR: People + AI Research by Google Brain

StellarGraph: a Python library for machine learning on graph-structured or network-structured data

A repository collecting tools for text classification with deep learning

GluonNLP

Representation learning on large graphs using stochastic graph convolutions.

Wow, this is good! ULMFit for graphs! This person has a ton of other stuff -- more productive than some institutes.

The Big-&-Extending-Repository-of-Transformers: Pretrained PyTorch models for Google's BERT, OpenAI GPT & GPT-2, Google/CMU Transformer-XL.

Flair - LM/Embedding/General NLP lib. Fastest growing NLP project on GitHub

FAIRseq - Facebook AI Research toolkit for sequence modeling; features multi-GPU (distributed) training on one machine or across multiple machines. PyTorch

AllenNLP - An open-source NLP research library, built on PyTorch by the Allen Institute for AI

FastAI - ULMFit, Transformer, TransformerXL implementations and more

Visualization Tools

People + AI Research (PAIR) by the Google Brain team:

What If...you could inspect a machine learning model, with minimal coding required?

FACETS - KNOW YOUR DATA

General:

TensorboardX for PyTorch

Visdom - similar to tensorboard

Sequential:

LSTMVis: Visualizng LSTM

Seq2Seq Vis: Visualization for Sequential Neural Networks with Attention

Attention:

BERTViz - tool for visualizing attention in BERT and OpenAI GPT-2

tensor2tensor: attention visualization from the Transformer paper

How to do visualization or highly visual articles:

A.I. Experiments: Visualizing High-Dimensional Space

Guide to visualization of NLP representations and neural nets by C.Olah @ Google Brain

Data Visualization

The Annotated Transformer

Deconstructing BERT: Distilling 6 Patterns from 100 Million Parameters, Part 1

Deconstructing BERT: Distilling 6 Patterns from 100 Million Parameters, Part 2

The illustrated BERT, ELMo, ULMFit, and Transformer

Visualizing Representations: Deep Learning for Human Beings

Jay Alammar: Visualizing machine learning one concept at a time

Attention? Attention!

My Bag of Tricks for Deep Learning performance - add yours too:

First, make sure you got all the NVIDIA stuff:

NVIDIA Apex: A PyTorch Extension: Tools for easy mixed precision and distributed training in PyTorch (see the sketch after this list)

NVIDIA cuDNN: provides highly tuned implementations for standard routines such as forward and backward convolution, pooling, normalization, and activation layers.

NVIDIA NCCL: The NVIDIA Collective Communications Library (NCCL) implements multi-GPU and multi-node collective communication primitives that are performance optimized for NVIDIA GPUs

NVIDIA DALI: A library containing both highly optimized building blocks and an execution engine for data pre-processing in deep learning applications

Consider using these:

NVIDIA optimized and tuned containers for various frameworks
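
As a rough idea of what mixed precision with Apex looks like in practice, here is a minimal sketch using the apex.amp interface; MyModel, train_loader, and loss_fn are placeholders you would supply, and the details depend on your setup:

import torch
from apex import amp

model = MyModel().cuda()            # MyModel is a placeholder for your network
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

# "O1" = mixed precision: most ops run in FP16, numerically
# sensitive ones stay in FP32
model, optimizer = amp.initialize(model, optimizer, opt_level="O1")

for inputs, targets in train_loader:            # train_loader is a placeholder DataLoader
    optimizer.zero_grad()
    loss = loss_fn(model(inputs.cuda()), targets.cuda())
    # scale the loss so FP16 gradients do not underflow
    with amp.scale_loss(loss, optimizer) as scaled_loss:
        scaled_loss.backward()
    optimizer.step()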

Next we do parallelism:

pandas.DataFrame.swifter.apply

Swifter will automatically apply the fastest method available (or so it says; more on that later). Make sure you have things like Dask installed, since it chooses between vectorization, Dask, and traditional pandas.apply.

$ pip install -U pandas
$ pip install swifter

import pandas as pd
import swifter

# swifter picks the fastest backend it can to apply anyfunction element-wise
df['outCol'] = df['inCol'].swifter.apply(anyfunction)

DASK - parallelizing numpy, pandas, plain Python, scikit-learn... literally everything.
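
A taste of what that looks like, as a minimal sketch (the CSV path and column names are hypothetical): the API mirrors pandas, but the data is split into partitions and nothing runs until you call .compute().

import dask.dataframe as dd

# lazily read a (hypothetical) large CSV in partitions
ddf = dd.read_csv('big_file.csv')

# same syntax as pandas, but nothing is computed yet
result = ddf.groupby('some_column')['some_value'].mean()

# trigger the actual parallel computation
print(result.compute())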

Modin: An alternative to DASK, but only for pandas - much simpler and lighter if I/O is what you need. It will process a 10 GB DataFrame in seconds.

# replace the following line
#import pandas as pd
# with
import modin.pandas as pd

You are done; pandas is 10-30 times faster on some tasks, but it will sometimes crash :)

Mini-batch data parallelism, which is more or less the default way to use multiple GPUs in PyTorch (a minimal sketch below)
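
A minimal sketch, assuming a single machine with several GPUs; MyModel, train_loader, and loss_fn are placeholders. torch.nn.DataParallel splits each mini-batch across the visible GPUs and gathers the outputs back:

import torch
import torch.nn as nn

model = MyModel()                       # MyModel is a placeholder for your network
if torch.cuda.device_count() > 1:
    # each forward pass scatters the batch across all visible GPUs
    model = nn.DataParallel(model)
model = model.cuda()

for inputs, targets in train_loader:    # train_loader is a placeholder DataLoader
    outputs = model(inputs.cuda())      # scatter -> parallel forward -> gather
    loss = loss_fn(outputs, targets.cuda())
    loss.backward()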

Compile your code the easy way

Numba: your numerical Python is JIT-compiled into highly optimized machine code, often beating plain numpy (even Cython is slower).

Best of all, you still code in Python; you just need a decorator on top of the time-consuming function. MAKE SURE IT IS TIME-CONSUMING - spamming @njit everywhere will do the opposite of what you want, since initializing numba costs resources!

from numba import njit, jit, vectorize, int32

# a 4-letter magic word that will make many functions that
# take 20-30 seconds finish in 5 or so!

@njit
def function0(a, b):
    # your loop or numerically intensive computations
    result = a + b
    return result

# here we declare argument and return types and force nopython mode,
# so the compiler skips the Python interpreter and goes straight to
# machine code (harder to debug, but SO much faster). With
# parallel=True, vector ops are distributed across the cores of your CPU.

@jit(int32(int32, int32), nopython=True, parallel=True)
def function1(a, b):
    # your loop or numerically intensive computations
    result = a * b
    return result

# with @vectorize we are saying "take this scalar function and run it
# over whole arrays, element by element, in parallel where possible" --
# numba is smart enough to figure out the best way to do so

@vectorize
def function2(c):
    # your scalar computation, broadcast over an array
    return c + 1

Eliminate memory leaks

ipyexperiments - will save you 20-30% video and 10-15% system memory

ipyexperiments usage examples in some kaggle contest code I wrote

Make sure to either use IPyTorchExperiments all the time, or IPyCPUExperiments if you don't care about the GPU. If you are using a GPU, make sure you use IPyTorchExperiments and that the text printed after the cell tells you it is indeed using the GPU backend.
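
A minimal sketch of the notebook workflow, using the class name mentioned above; the exact class names and API have shifted between ipyexperiments versions, so treat this as an assumption and check the linked usage examples:

# cell 1: start an experiment scope (assumed class name, per the note above)
from ipyexperiments import IPyTorchExperiments
exp = IPyTorchExperiments()

# cells 2..n: run your training / feature engineering here;
# per-cell GPU and system memory usage is printed under each cell

# last cell: end the scope so temporary variables are reclaimed
del exp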

Speed up your loops

In general, using numpy operations is preferred, e.g. np.sum() beats iterating.
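
A quick illustration of that gap, on a hypothetical array of a million floats:

import numpy as np

x = np.random.rand(1_000_000)

# slow: a Python-level loop over a million floats
total = 0.0
for v in x:
    total += v

# fast: one call into optimized, vectorized C code
total = np.sum(x)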

Avoiding if-else branches by using np.where is another big one. Here is an example of going from about a million Python-level function calls to roughly a thousand vectorized array operations; on large data that is the difference between minutes and milliseconds.

# X is a numpy array; the DataFrame column (or list) 'data' holds 1000 of them.
# If each array is also 1000 long, that is 1000 * 1000 = ~1,000,000 element-wise operations.
def fun(x):
    if x > 0:
        x += 1
    else:
        x = 0
    return x

# ~1,000,000 Python-level function calls: a pure-Python double loop
output = []
for X in data:
    for x in X:
        output.append(fun(x))

# ~1000 calls: pandas does the outer loop for you, but it is still
# single-threaded and still visits every element in Python
df['data'].apply(lambda X: [fun(x) for x in X])

# This is very fast: ~1000 vectorized array operations, using the CPU's
# vector math extensions (one pass per array on a Xeon or i9 with MKL installed)
def fun2(x):
    x[np.where(x > 0)] += 1
    x[np.where(x <= 0)] = 0
    return x

df['data'].swifter.apply(fun2)

Assume data contains some items that we abstract as ... . In general, follow this rule of thumb:

  1. Slowest: for i in range(len(data)):

  2. OK: for d in data:

  3. Faster: [d for d in data]

  4. Fastest: (d for d in data) (lazy - it does no work until you consume it)
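
A quick way to check this rule of thumb on your own data is timeit; note the generator expression "wins" largely because it is lazy and defers the actual work:

from timeit import timeit

data = list(range(1_000_000))

def by_index():
    out = []
    for i in range(len(data)):
        out.append(data[i] * 2)
    return out

def by_item():
    out = []
    for d in data:
        out.append(d * 2)
    return out

def by_listcomp():
    return [d * 2 for d in data]

def by_genexpr():
    # lazy: nothing is computed until the generator is consumed
    return (d * 2 for d in data)

for fn in (by_index, by_item, by_listcomp, by_genexpr):
    print(fn.__name__, timeit(fn, number=10))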

Help with Class Balance and Distribution Issues

Learning to Reweight Examples for Robust Deep Learning - an implementation of the "Learning to Reweight..." paper

PyTorch imbalanced-dataset-toolkit

A Python Package to Tackle the Curse of Imbalanced Datasets in Machine Learning

CVPR, Kaggle Winner: "Large Scale Fine-Grained Categorization and Domain-Specific Transfer Learning" with Imbalanced Class Labels

8 Tactics to Combat Imbalanced Classes in Your Machine Learning Dataset

Probability calibration

Training on validation set when train and test are different distributions

Deep Learning Unbalanced Data

Other useful tools

mlxtend - a library with useful extensions to a variety of ML/NLP tools

DataFrameSummary: An extension to the pandas DataFrame describe() function

Paper and Technical Writing HOWTO

On writing research papers:

"How to Write an Introduction" by Dr. Om Gnawali

Some of the best examples of technical writing (papers & blogs go hand in hand!):

How to trick a neural network into thinking a panda is a vulture

Picking an optimizer for Style Transfer

How do we 'train' Neural Networks?

Must-read papers and technical articles

Disciplined Training of Neural Networks

On Language Models:

NLP's ImageNet moment has arrived

BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding

Attention is All You Need

Improving Language Understanding by Generative Pre-Training

Deep contextualized word representations

Universal Language Model Fine-tuning for Text Classification

Transformer-XL: Attentive Language Models Beyond a Fixed-Length Context

Comparing complex NLP models for complex languages on a set of real tasks

You don't need RNNs: When Recurrent Models Don't Need to be Recurrent

Other:

Deep Graph Methods Survey

Baselines and Bigrams: Simple, Good Sentiment and Topic Classification

Blogs You Should Follow

Stanford AI Salon: a bi-weekly discussion on important topics in AI/NLP/ML, closed to the public due to space restrictions, but notes and videos are now posted in a blog. Previous guests include LeCun, Hinton, Ng, and others.

Andrej Karpathy

Vitaliy Bushaev

Sylvain Gugger

Sebastian Ruder

Jeremy Howard

Jay Alammar
