Do profiling of the fixed-ms2 score SSVM #10

Open
bachi55 opened this issue Nov 26, 2020 · 5 comments

bachi55 commented Nov 26, 2020

... where do we spend most of the run-time?

@bachi55 bachi55 self-assigned this Nov 26, 2020

bachi55 commented Dec 10, 2020

So far we have achieved a speed-up by a factor of 4 with very simple modifications:

  • use SQLite instead of Pandas to ensure the output order of the feature vectors
  • parse the fingerprint strings stored in the DB faster while loading
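
One way to let SQLite fix the row order is sketched below. The table name `fingerprints` and its columns `molecule` and `fingerprint` are hypothetical and are not taken from the project's schema. The requested IDs are written to a temporary table together with their position, and the join is ordered by that position, so no re-indexing in Pandas is needed.

```python
import sqlite3


def fetch_features_in_order(conn, molecule_ids):
    """Return (molecule, fingerprint) rows in the order given by `molecule_ids`.

    Assumes a hypothetical table fingerprints(molecule, fingerprint).
    """
    # Temporary helper table: position in the request -> molecule ID.
    conn.execute(
        "CREATE TEMP TABLE IF NOT EXISTS requested "
        "(pos INTEGER PRIMARY KEY, molecule TEXT)"
    )
    conn.execute("DELETE FROM requested")
    conn.executemany(
        "INSERT INTO requested (pos, molecule) VALUES (?, ?)",
        enumerate(molecule_ids),
    )
    # The join keeps duplicates in the request; ORDER BY restores the order.
    return conn.execute(
        "SELECT r.molecule, f.fingerprint "
        "FROM requested AS r "
        "INNER JOIN fingerprints AS f ON f.molecule = r.molecule "
        "ORDER BY r.pos"
    ).fetchall()


# Small in-memory demo with made-up data.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE fingerprints (molecule TEXT PRIMARY KEY, fingerprint TEXT)")
conn.executemany(
    "INSERT INTO fingerprints VALUES (?, ?)",
    [("A", "1:2,7:1"), ("B", "3:1"), ("C", "5:4")],
)
rows = fetch_features_in_order(conn, ["C", "A", "B"])
```

IDs that are missing from `fingerprints` are silently dropped by the inner join; a `LEFT JOIN` would keep them with a `NULL` fingerprint instead.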

There is still room for improvement:

  • store the fingerprints in the DB in a NumPy-compatible binary format
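
A minimal sketch of what such a binary format could look like: the raw bytes of a fixed-dtype NumPy vector stored as a BLOB. The table name `fingerprints_bin`, the column names and the choice of `float64` are assumptions for illustration only.

```python
import sqlite3

import numpy as np

FP_DTYPE = np.float64  # assumed dtype; writer and reader must agree on it


def store_fingerprint(conn, molecule, fp):
    """Store a 1-D fingerprint vector as raw bytes (BLOB)."""
    blob = np.ascontiguousarray(fp, dtype=FP_DTYPE).tobytes()
    conn.execute(
        "INSERT OR REPLACE INTO fingerprints_bin (molecule, fp) VALUES (?, ?)",
        (molecule, blob),
    )


def load_fingerprint(conn, molecule):
    """Load a fingerprint vector without any string parsing."""
    (blob,) = conn.execute(
        "SELECT fp FROM fingerprints_bin WHERE molecule = ?", (molecule,)
    ).fetchone()
    # np.frombuffer returns a read-only view on the bytes object.
    return np.frombuffer(blob, dtype=FP_DTYPE)


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE fingerprints_bin (molecule TEXT PRIMARY KEY, fp BLOB)")
store_fingerprint(conn, "A", [0, 2, 0, 5])
fp = load_fingerprint(conn, "A")
```

The raw bytes carry neither dtype nor shape, so both have to be fixed by convention or stored in a separate column.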

The most time-consuming parts of the run:

| Function | Call Count | Time (ms) | Own Time (ms) | Comment |
| --- | --- | --- | --- | --- |
| sqlite3.Connection (execute) | 261257 | 26353 | 26353 (29.6%) | Data loading from the SQLite |
| scipy.spatial._distance_wrap.cdist_citiblock | 87952 | 8358 | 8358 (9.4%) | Molecule kernel calculation |
| _get_molecule_feature_matrix | 85789 | 14610 | 8167 (9.2%) | Data loading from the SQLite |
| numpy.ufunc (reduce) | 363454 | 2834 | 2834 (3.2%) | |
| generalized_tanimoto_kernel | 87952 | 30925 | 2597 (2.9%) | Molecule kernel calculation |
| str (split) | 16533569 | 2259 | 2259 (2.5%) | |
| _get_lambda_delta | 84784 | 34755 | 2024 | |
| listcomp | 89728 | 1815 | 1815 | |
| sqlite3.Cursor (fetchall) | 171882 | 1741 | 1741 | |
| check_array | 176366 | 10176 | 1652 | |
| numpy.array | 923289 | 1531 | 1530 | |
| _assert_all_finite | 176366 | 5104 | 1483 | |
| inner_f | 440877 | 25802 | 1312 | |
| cdist | 87952 | 11057 | 897 | |
| get_molecule_features_by_molecule_id | 84928 | 41548 | 888 | |
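
A table of this kind can be produced with the standard-library profiler. The sketch below is not necessarily the tool used for the numbers above, and `profile_call` is a hypothetical helper name. In `cProfile` terms, `tottime` is the time spent in a function itself (own time) and `cumtime` includes all sub-calls.

```python
import cProfile
import io
import pstats


def profile_call(func, *args, n_top=15, **kwargs):
    """Run `func` under cProfile and return (result, text report).

    The report lists the `n_top` functions with the largest own time.
    """
    profiler = cProfile.Profile()
    result = profiler.runcall(func, *args, **kwargs)

    stream = io.StringIO()
    stats = pstats.Stats(profiler, stream=stream)
    stats.sort_stats("tottime").print_stats(n_top)
    return result, stream.getvalue()


def toy_workload(n):
    """Stand-in for the training loop; replace with the real entry point."""
    return sum(len(str(i).split("1")) for i in range(n))


result, report = profile_call(toy_workload, 10_000, n_top=5)
print(report)
```

Sorting by `"cumtime"` instead shows which high-level functions dominate, which helps to tell data loading apart from kernel computation.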


bachi55 commented Dec 21, 2020

We could gain a further improvement by optimizing the kernel computation:

  • Using Numba
  • or, calling the C-wrapper of SciPy directly.

The biggest speed-up so far, though, came from computing the preference values for all candidates associated with one sequence (that is, for all nodes) in one go.
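
The batching idea can be sketched as follows, assuming the molecule kernel is the generalized Tanimoto (MinMax) kernel on non-negative features. The function names and the handling of all-zero vectors are illustrative assumptions, not the project's code. Because min(a, b) = (a + b - |a - b|) / 2 and max(a, b) = (a + b + |a - b|) / 2, the kernel only needs row sums and one city-block `cdist` call, and that single call can cover the candidates of all nodes at once.

```python
import numpy as np
from scipy.spatial.distance import cdist


def minmax_kernel(A, B):
    """Generalized Tanimoto (MinMax) kernel between the rows of A and B.

    Assumes non-negative features:
    k(a, b) = sum_i min(a_i, b_i) / sum_i max(a_i, b_i)
            = (|a|_1 + |b|_1 - |a - b|_1) / (|a|_1 + |b|_1 + |a - b|_1)
    """
    A = np.asarray(A, dtype=float)
    B = np.asarray(B, dtype=float)
    row_sums = A.sum(axis=1)[:, np.newaxis] + B.sum(axis=1)[np.newaxis, :]
    l1_dist = cdist(A, B, metric="cityblock")
    denominator = row_sums + l1_dist
    K = np.ones_like(l1_dist)  # assumed convention: k(0, 0) = 1
    np.divide(row_sums - l1_dist, denominator, out=K, where=denominator > 0)
    return K


def kernels_per_node(A, candidates_per_node):
    """One kernel call for all nodes of a sequence, then split per node."""
    sizes = [len(C) for C in candidates_per_node]
    K_all = minmax_kernel(A, np.vstack(candidates_per_node))
    return np.split(K_all, np.cumsum(sizes)[:-1], axis=1)


rng = np.random.default_rng(0)
A = rng.integers(0, 5, size=(4, 6))  # e.g. active candidates
candidates = [rng.integers(0, 5, size=(n, 6)) for n in (3, 5, 2)]  # one block per node
blocks = kernels_per_node(A, candidates)
```

Each entry of `blocks` equals the kernel matrix a separate per-node call would return, but the per-call overhead (input checks, Python dispatch) is paid only once per sequence.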


bachi55 commented Dec 25, 2020

| Function | n_A | n_B | d | time (4 cores) | time (1 core) |
| --- | --- | --- | --- | --- | --- |
| alg_1__numba | 100 | 5000 | 307 | 0.057894 | 0.262477 |
| cdist__FAST | 100 | 5000 | 307 | 0.122171 | 0.122416 |
| ufunc | 100 | 5000 | 307 | 0.137654 | 0.323284 |
| --- | --- | --- | --- | --- | --- |
| alg_1__numba | 1000 | 1000 | 307 | 0.313120 | 0.539815 |
| cdist__FAST | 1000 | 1000 | 307 | 0.203876 | 0.193676 |
| ufunc | 1000 | 1000 | 307 | 0.357854 | 0.593559 |
| --- | --- | --- | --- | --- | --- |
| alg_1__numba | 100 | 5000 | 7500 | 4.285039 | 9.505218 |
| cdist__FAST | 100 | 5000 | 7500 | 3.065871 | 3.094869 |
| ufunc | 100 | 5000 | 7500 | 2.244482 | 6.121194 |
| --- | --- | --- | --- | --- | --- |
| alg_1__numba | 1000 | 1000 | 7500 | 13.003921 | |
| cdist__FAST | 1000 | 1000 | 7500 | 5.654260 | |
| ufunc | 1000 | 1000 | 7500 | 4.224893 | |

We used 4 cores for the parallel implementations ('alg_1__numba' and 'ufunc'). Which algorithm performs best depends on the dimensions of the input data.

For example, when the cost per kernel matrix element increases (d=7500), the ufunc implementation performs best on 4 cores.
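
A comparison like this can be rerun with a small timing harness. The two implementations below are simplified single-threaded stand-ins written for this sketch (a `cdist`-based variant and a row-wise NumPy ufunc variant); they are assumptions and not the benchmarked 'cdist__FAST' or 'ufunc' code, so absolute numbers will differ.

```python
import timeit

import numpy as np
from scipy.spatial.distance import cdist


def kernel_cdist(A, B):
    """MinMax kernel via row sums and the city-block distance."""
    s = A.sum(axis=1)[:, np.newaxis] + B.sum(axis=1)[np.newaxis, :]
    d = cdist(A, B, metric="cityblock")
    return (s - d) / (s + d)


def kernel_ufunc(A, B):
    """MinMax kernel via np.minimum / np.maximum, one row of A at a time."""
    K = np.empty((A.shape[0], B.shape[0]))
    for i, a in enumerate(A):
        K[i] = np.minimum(a, B).sum(axis=1) / np.maximum(a, B).sum(axis=1)
    return K


def benchmark(implementations, shapes, repeat=3, seed=0):
    """Best-of-`repeat` wall time per implementation and (n_A, n_B, d)."""
    rng = np.random.default_rng(seed)
    results = {}
    for n_a, n_b, d in shapes:
        # Random positive features, so no all-zero rows occur.
        A = rng.random((n_a, d))
        B = rng.random((n_b, d))
        for name, func in implementations.items():
            times = timeit.repeat(lambda: func(A, B), number=1, repeat=repeat)
            results[(name, n_a, n_b, d)] = min(times)
    return results


results = benchmark(
    {"cdist": kernel_cdist, "ufunc": kernel_ufunc},
    shapes=[(100, 500, 307), (200, 200, 307)],
)
for key, seconds in results.items():
    print(key, f"{seconds:.6f}")
```

Taking the minimum over repeats reduces noise from other processes; the shapes are kept small here so the sketch runs quickly.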


bachi55 commented Dec 25, 2020

When the output matrix is rectangular with n_A < n_B, 'alg_1__numba' performs best. This case arises when the preference values are calculated: n_A is the number of active candidates and n_B is the number of candidates for which the preference values are calculated.

| Function | n_A | n_B | d | time |
| --- | --- | --- | --- | --- |
| alg_1__numba | 100 | 100 | 307 | 0.001172 |
| cdist__FAST | 100 | 100 | 307 | 0.002104 |
| ufunc | 100 | 100 | 307 | 0.008020 |
| --- | --- | --- | --- | --- |
| alg_1__numba | 100 | 200 | 307 | 0.002162 |
| cdist__FAST | 100 | 200 | 307 | 0.003946 |
| ufunc | 100 | 200 | 307 | 0.009165 |
| --- | --- | --- | --- | --- |
| alg_1__numba | 100 | 500 | 307 | 0.005299 |
| cdist__FAST | 100 | 500 | 307 | 0.009801 |
| ufunc | 100 | 500 | 307 | 0.018691 |
| --- | --- | --- | --- | --- |
| alg_1__numba | 100 | 1000 | 307 | 0.010685 |
| cdist__FAST | 100 | 1000 | 307 | 0.019640 |
| ufunc | 100 | 1000 | 307 | 0.030049 |
| --- | --- | --- | --- | --- |
| alg_1__numba | 100 | 5000 | 307 | 0.052604 |
| cdist__FAST | 100 | 5000 | 307 | 0.117722 |
| ufunc | 100 | 5000 | 307 | 0.125857 |
| --- | --- | --- | --- | --- |
| alg_1__numba | 100 | 10000 | 307 | 0.103035 |
| cdist__FAST | 100 | 10000 | 307 | 0.234538 |
| ufunc | 100 | 10000 | 307 | 0.245509 |
| --- | --- | --- | --- | --- |
| alg_1__numba | 100 | 100000 | 307 | 1.234449 |
| cdist__FAST | 100 | 100000 | 307 | 2.443562 |
| ufunc | 100 | 100000 | 307 | 2.576871 |


bachi55 commented Dec 25, 2020

N = 24, n_cand=50, L_max=15

Numpy: 22.856s
Numba: 28.291s

N = 48, n_cand=50, L_max=15

Numpy: 97.751s
Numba: 97.639s

N = 96, n_cand=50, L_max=15

Numpy: 321.269s
Numba: 292.787s

N = 24, n_cand=50, L_max=25

Numpy: 27.407s
Numba: 33.160s

N = 24, n_cand=100, L_max=15

Numpy: 37.855s
Numba: 42.473s

@bachi55 bachi55 transferred this issue from another repository Apr 12, 2021
@bachi55 bachi55 transferred this issue from another repository Apr 14, 2021