Implementing core functionalities in low-level components #286
Replies: 2 comments
-
No, this is really not a path to go down, for multiple reasons. First, it would induce a large jump in complexity: of the codebase, of packaging, and of maintenance. Second, this is not really the way to speed up dirty-cat. Most speed-ups in dirty-cat should be algorithmic (in this sense the SimilarityEncoder is a dead end: from a statistical-modeling perspective it is a rather naive approach, and the GapEncoder or MinHashEncoder should be preferred). Other speed-ups should come from using the right primitives, implemented and optimized in our dependencies, for instance scikit-learn, numpy, scipy, and pandas. The operations in dirty-cat are assembled from fairly standard building blocks, such as k-means or k-NN. These must be optimized in the dependencies rather than by trying to do our own optimization, which will not perform as well.
-
We have not concluded exactly that. I do not know dirty_cat's internals: I simply know that processing strings in Python is not the most efficient option. I agree with @GaelVaroquaux and would add that profiling the implementations comes before any performance-improvement work. Sometimes the bottleneck is algorithmic, and it can be alleviated by restructuring the implementation.
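To make the profiling point concrete, here is a minimal sketch using only the standard library's `cProfile` and `pstats`. The functions `ngrams`, `ngram_similarity`, `encode`, and `profile_encode` are hypothetical stand-ins written for illustration; they are not dirty_cat code, and the Jaccard similarity used here is an assumption rather than the SimilarityEncoder's actual formula.

```python
import cProfile
import io
import pstats


def ngrams(string, n=3):
    """Return the set of character n-grams of a string."""
    return {string[i:i + n] for i in range(len(string) - n + 1)}


def ngram_similarity(a, b, n=3):
    """Jaccard similarity between the n-gram sets of two strings."""
    grams_a, grams_b = ngrams(a, n), ngrams(b, n)
    union = grams_a | grams_b
    return len(grams_a & grams_b) / len(union) if union else 0.0


def encode(values, categories):
    """Toy stand-in for a similarity encoding: one row per input value."""
    return [[ngram_similarity(v, c) for c in categories] for v in values]


def profile_encode(values, categories, top=5):
    """Run encode() under cProfile and return (result, text report)."""
    profiler = cProfile.Profile()
    profiler.enable()
    encoded = encode(values, categories)
    profiler.disable()

    # Write the report to a string instead of stdout so it can be inspected.
    stream = io.StringIO()
    stats = pstats.Stats(profiler, stream=stream)
    stats.sort_stats("cumulative").print_stats(top)
    return encoded, stream.getvalue()


if __name__ == "__main__":
    values = [f"employee position {i}" for i in range(200)]
    categories = [f"position title {i}" for i in range(50)]
    _, report = profile_encode(values, categories)
    print(report)
```

The report lists the functions with the largest cumulative time, which helps tell whether the cost sits in per-string Python work or in the overall structure of the computation before deciding where to optimize.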
-
We discussed performance with @jjerphan the other day, and we concluded that it might be a good idea to implement some of the core computations of dirty_cat in a low-level language, which could perhaps speed things up.
I'm especially thinking of the manipulations done in the SimilarityEncoder. I guess the most common option is a C++ component, but as far as I remember (I haven't worked on low-level code in a while), string manipulation in C-like languages is kind of messy.
Perhaps Rust would be a good option? Searching around, I found ngrammatic, and Julien mentioned vtext, which is a Python project using Rust. PyO3 seems to be the reference for this.
Later in the year - probably after the summer - I will have a little bit of time, and I could work on a PoC, if you see value in that.
Please let me know what you think!