Proposal for architectural rework of feature extraction and rule application #238

copernico · 2021-07-22T07:53:44Z

copernico
Jul 22, 2021
Maintainer

It has become evident that the roles of feature extraction and rule application overlap. In some cases we ended up duplicating code, because a feature extraction routine did more or less what a rule needed, but not in quite the same way.

Also, in these first few months of "Prospector 2" development, the priority has shifted from a pure ML-based approach to a more pragmatic (but equally or more effective) rule-based approach. The question is how to reconcile the two. Initially, the CommitWithFeatures class was meant to represent data records that would end up as dataframe rows, for ML use. But what rules do, at the end of the day, is also feature extraction; the only difference is that rules compute features that are immediately meaningful to humans and come with a human-readable explanation.

Proposal

1. We merge feature extraction with rule application.
2. ML will consume the result of this merged component.
3. The commit data structure will carry annotations (the ML module will pick the ones it needs). There is no need for the extra CommitWithFeatures class, because the datamodel.Commit class with its 'annotations' is flexible enough.
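
A minimal sketch of what the merged component could look like, assuming annotations are a plain mapping from a rule ID to its explanation. Everything here is hypothetical and only for illustration: the Commit class is a stripped-down stand-in for datamodel.Commit, and the rule names, the apply_rules function and the to_feature_row helper are not taken from the Prospector code base.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional


@dataclass
class Commit:
    """Hypothetical, stripped-down stand-in for datamodel.Commit."""

    commit_id: str
    message: str = ""
    changed_files: List[str] = field(default_factory=list)
    # RULE_ID -> human-readable explanation
    annotations: Dict[str, str] = field(default_factory=dict)


# A rule looks at a commit and a vulnerability ID; it returns an
# explanation when it matches and None otherwise.
Rule = Callable[[Commit, str], Optional[str]]


def rule_vuln_id_in_message(commit: Commit, vuln_id: str) -> Optional[str]:
    if vuln_id.lower() in commit.message.lower():
        return f"The commit message mentions {vuln_id}"
    return None


def rule_changes_test_files(commit: Commit, vuln_id: str) -> Optional[str]:
    tests = [path for path in commit.changed_files if "test" in path.lower()]
    if tests:
        return f"The commit changes {len(tests)} test file(s)"
    return None


# Hypothetical rule registry; the IDs are made up for this example.
RULES: Dict[str, Rule] = {
    "VULN_ID_IN_MESSAGE": rule_vuln_id_in_message,
    "CHANGES_TEST_FILES": rule_changes_test_files,
}


def apply_rules(commit: Commit, vuln_id: str, rules: Dict[str, Rule] = RULES) -> Commit:
    """Single pass that both extracts features and records explanations."""
    for rule_id, rule in rules.items():
        explanation = rule(commit, vuln_id)
        if explanation is not None:
            commit.annotations[rule_id] = explanation
    return commit


def to_feature_row(commit: Commit, rule_ids: List[str]) -> Dict[str, int]:
    """The ML side picks the annotations it needs as 0/1 features."""
    return {rule_id: int(rule_id in commit.annotations) for rule_id in rule_ids}
```

With this shape, the human-facing report reads the explanations from commit.annotations, while the ML module only needs to_feature_row(commit, list(RULES)) to build its input.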

geryxyz · 2021-07-23T13:55:58Z

geryxyz
Jul 23, 2021
Collaborator

Just a side note on proposal #3. If we end up stuffing all extracted properties or data into the annotations dict, I think it will be a dynamically typed nightmare. The keys of the dictionary are hard for any IDE or developer to check, and since the types of their values can vary, bugs are hard to spot; with the "help" of implicit type conversions, maybe impossible.

I suggest keeping strongly typed properties or attributes as the storage for extracted/computed values (maybe CommitWithFeatures could be merged into the Commit class, or it could become a data member of Commit, like Commit.computed_features). We could use custom-made decorators or wrappers to attach their human-readable explanation, if any. Finally, there could be a method that exports this data as a data frame for ML (it could be cached for speed).
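
A rough sketch of this idea, assuming Python dataclasses and using field metadata in place of a custom decorator. All names here (ComputedFeatures, the feature helper, explanations, to_row) are hypothetical, and the Commit class is only a placeholder for the real one.

```python
from dataclasses import asdict, dataclass, field, fields
from typing import Any, Dict, Optional


def feature(default: Any, explanation: Optional[str] = None) -> Any:
    """Declare a typed feature, optionally with a human-readable explanation."""
    return field(default=default, metadata={"explanation": explanation})


@dataclass
class ComputedFeatures:
    # Each feature has a fixed name and type, so IDEs and type checkers can help.
    references_vuln_id: bool = feature(
        False, "The commit message mentions the vulnerability ID"
    )
    changed_file_count: int = feature(0)  # no explanation: only useful for ML


@dataclass
class Commit:
    """Hypothetical placeholder for the real Commit class."""

    commit_id: str
    computed_features: ComputedFeatures = field(default_factory=ComputedFeatures)

    def explanations(self) -> Dict[str, str]:
        """Explanations of the features that are set and have one attached."""
        result = {}
        for f in fields(self.computed_features):
            text = f.metadata.get("explanation")
            if text and getattr(self.computed_features, f.name):
                result[f.name] = text
        return result

    def to_row(self) -> Dict[str, Any]:
        """One record for ML; a list of these can be turned into a data frame."""
        return asdict(self.computed_features)
```

A typo such as commit.computed_features.changed_file_cnt would be flagged by a type checker here, whereas a mistyped dictionary key would go unnoticed until runtime.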

copernico · 2021-07-23T14:56:49Z

copernico
Jul 23, 2021
Maintainer Author

Good points; however, annotations are just (RULE_ID, TEXTUAL_EXPLANATION) pairs, so essentially pairs of strings so far. The problem you refer to will only arise if we decide to put more complex structures in there, which I do not see the need for at this time. I agree that if we go down that route, we will have to use types to our advantage.

But I would not touch that with this refactoring; rather, the goal would be to streamline the process by removing the overlap in the responsibilities of different stages of the process (commit processing vs. rule application). To me, the extract_* functions are basically equivalent to apply_rule_* functions, except they do not produce user-readable explanations.
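
To illustrate that equivalence with a made-up example (the two functions below are hypothetical and not copied from the code base): a rule can be seen as an extractor plus an explanation.

```python
import re
from typing import List, Optional


def extract_ghissue_refs(message: str) -> List[str]:
    """Feature extraction: GitHub issue numbers mentioned in a commit message."""
    return re.findall(r"#(\d+)", message)


def apply_rule_references_ghissue(message: str) -> Optional[str]:
    """Rule application: same computation, plus a user-readable explanation."""
    refs = extract_ghissue_refs(message)
    if not refs:
        return None
    issues = ", ".join(f"#{ref}" for ref in refs)
    return f"The commit message references GitHub issue(s): {issues}"
```

Once every extractor is wrapped like this, there is no need to keep two separate stages.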

copernico · 2021-09-16T07:51:43Z

copernico
Sep 16, 2021
Maintainer Author

This is all implemented as of 3122cb6
