readme: add figures: ast hierarchic split, hierarchic overview, expand-collapse overview
eladn committed Feb 23, 2022
1 parent fd3c958 commit 0a81dfb
README.md: 11 additions & 1 deletion
A combination of various neural code models for tackling programming-related tasks (like LogVar, function name prediction, and VarMisUse) using deep neural networks. The principal implementation is the *Hierarchic* model, which is designed especially for tasks involving data-flow-like information-propagation requirements. We also provide a full implementation of [`code2seq`](https://github.com/tech-srl/code2seq) by Alon et al., and a partial implementation of [*Learning to Represent Programs with Graphs*](https://miltos.allamanis.com/publications/2018learning/) by Allamanis et al.

## Hierarchical code representation
This project contains the full implementation of the [*Hierarchic Code Encoder*](https://bit.ly/3vgzclc) in `PyTorch`. The idea of the hierarchic model is to (i) shorten the distances between a procedure's related elements for effective information propagation; (ii) utilize the effective paths-based information-propagation approach; (iii) scale efficiently to lengthy procedures; and (iv) provide a fine-grained exploration framework for identifying the elements/relations in the code's underlying structure that most benefit the model on the overall task.

The procedure's code is broken at the statement level to form a two-level hierarchic structure. The upper level consists of control structures (loops, if-statements, block-statements), while the lower level consists of expression-statements and conditions. In fact, each statement in the lower level is associated with a node in the procedure's statement-level CFG (control-flow graph).
![Upper-Lower AST Hierarchic Leveling Figure](https://gitfront.io/r/user-5760758/c3d4b342f8f48fe8e3764b2c30925ea140b99535/NDFA/raw/doc/figures/upper-lower-ast-split-figure.webp "Upper-Lower AST Hierarchic Leveling Figure")
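
To make the split concrete, here is a toy illustration (hypothetical; the layout and identifiers below are ours, not the project's actual data structures): a short procedure decomposed into its upper-level control skeleton and its lower-level statements, each mapped to a statement-level CFG node.

```python
# Hypothetical illustration only: a toy procedure and its two-level split.
procedure = """
int sum(int[] a) {
    int s = 0;                          // statement  -> CFG node 0
    for (int i = 0; i < a.length; i++)  // condition  -> CFG node 1
        s += a[i];                      // statement  -> CFG node 2
    return s;                           // statement  -> CFG node 3
}
"""

# Upper level: the control skeleton (control structures only).
upper_level = ["block-stmt", "loop"]

# Lower level: one expression-statement / condition per CFG node.
lower_level = {
    0: "int s = 0;",
    1: "i < a.length",
    2: "s += a[i];",
    3: "return s;",
}
```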

The hierarchic encoder first applies a *micro* operator to obtain a local encoding for each individual statement (CFG node), then employs a *macro* operator to propagate information globally (between CFG nodes), then updates the local encodings by mixing them with the *globally-aware* encodings of the related CFG nodes, and finally employs the micro operator once again. The *micro* operator is the *local*, independent encoder of the statements (statements are encoded stand-alone in the first stage). We supply multiple options for the micro operator, including the following: paths-based AST (with/without a *collapse* stage), TreeLSTM over the AST, GNN over the AST, AST leaves, and flat token sequence. The *macro* operator is responsible for *global* information propagation between CFG nodes. We supply the following macro operators: paths-based CFG, GNN over the CFG, upper control AST (paths-based, GNN, TreeLSTM, leaves only), a single sequence of CFG nodes (ordered by textual appearance in the code), and a set of CFG nodes.
![Hierarchic Framework Overview Figure](https://gitfront.io/r/user-5760758/c3d4b342f8f48fe8e3764b2c30925ea140b99535/NDFA/raw/doc/figures/hierarchic-framework-overview-figure.webp "Hierarchic Framework Overview Figure")
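
The following is a minimal `PyTorch` sketch of this four-stage flow under our own simplifying assumptions: GRUs stand in for the micro and macro operators (the macro here treats the CFG nodes as a single sequence ordered by textual appearance, one of the options listed above), and the class and argument names are hypothetical, not the project's API.

```python
import torch
import torch.nn as nn

class HierarchicEncoderSketch(nn.Module):
    """A minimal sketch (not the project's actual API) of the
    micro -> macro -> mix -> micro flow described above."""

    def __init__(self, d: int):
        super().__init__()
        # Stand-in micro operator: encodes each statement's embedded token
        # sequence independently (in place of paths-based AST, TreeLSTM, ...).
        self.micro = nn.GRU(d, d, batch_first=True)
        # Stand-in macro operator: propagates information between CFG nodes.
        self.macro = nn.GRU(d, d, batch_first=True)
        self.mix = nn.Linear(2 * d, d)

    def forward(self, stmt_tokens: torch.Tensor) -> torch.Tensor:
        # stmt_tokens: (num_cfg_nodes, max_stmt_len, d) embedded statement tokens.
        local, _ = self.micro(stmt_tokens)                  # (1) local statement encodings
        stmt_enc = local[:, -1, :]                          # last state as statement summary
        global_enc, _ = self.macro(stmt_enc.unsqueeze(0))   # (2) global propagation
        global_enc = global_enc.squeeze(0)                  # (num_cfg_nodes, d)
        # (3) mix each token's local encoding with its CFG node's global encoding.
        mixed = self.mix(torch.cat(
            [local, global_enc.unsqueeze(1).expand_as(local)], dim=-1))
        refined, _ = self.micro(mixed)                      # (4) micro once again
        return refined                                      # (num_cfg_nodes, max_stmt_len, d)
```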

## Execution parameters structure (and model's hyper-parameters)
The entire set of execution parameters is organized as a nested class structure rooted at the class [`ExecutionParameters`](ndfa/execution_parameters.py). The [`ExecutionParameters`](ndfa/execution_parameters.py) includes the [`ExperimentSetting`](ndfa/experiment_setting.py), which includes the [`CodeTaskProperties`](ndfa/code_tasks/code_task_properties.py), the [`NDFAModelHyperParams`](ndfa/ndfa_model_hyper_parameters.py), the [`NDFAModelTrainingHyperParams`](ndfa/ndfa_model_hyper_parameters.py), and the [`DatasetProperties`](ndfa/nn_utils/model_wrapper/dataset_properties.py).
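
A simplified, hypothetical mirror of this nesting (the class names follow the modules linked above, but the fields and defaults shown here are purely illustrative):

```python
from dataclasses import dataclass, field

@dataclass
class CodeTaskProperties:
    task_name: str = "var-misuse"       # illustrative default

@dataclass
class NDFAModelHyperParams:
    activation_fn: str = "relu"         # illustrative default

@dataclass
class NDFAModelTrainingHyperParams:
    batch_size: int = 16                # illustrative default

@dataclass
class DatasetProperties:
    name: str = "my-dataset"            # illustrative default

@dataclass
class ExperimentSetting:
    task: CodeTaskProperties = field(default_factory=CodeTaskProperties)
    model_hyper_params: NDFAModelHyperParams = field(default_factory=NDFAModelHyperParams)
    train_hyper_params: NDFAModelTrainingHyperParams = field(default_factory=NDFAModelTrainingHyperParams)
    dataset: DatasetProperties = field(default_factory=DatasetProperties)

@dataclass
class ExecutionParameters:
    experiment_setting: ExperimentSetting = field(default_factory=ExperimentSetting)
```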
Our input data is formed of several types of elements; that is, each pre-process…

Usually, training and evaluating neural networks is performed over batches of examples, following the SIMD (single instruction, multiple data) computational scheme to maximize the utilization of the accelerated processing units and make training feasible under the available resources. However, each preprocessed example is stored on its own, while it recurs in various batches during training. Therefore, batching takes place during data loading. Whenever a collection of examples is collated into a batch, contiguous tensors are created containing all the elements in the batch. As a result, the indices of these elements change, so the references to them have to be fixed accordingly to retain indexing consistency.
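
The sketch below shows this index-fixing idea in isolation, with hypothetical field names (`node_features`, `edges`) rather than the project's actual example schema: when per-example tensors are concatenated, every stored node index must be shifted by the number of nodes that precede its example in the batch.

```python
import torch

def collate(examples):
    """A minimal sketch of batch collation with index fixing."""
    node_feats, edge_refs, offset = [], [], 0
    for ex in examples:
        node_feats.append(ex["node_features"])   # (num_nodes, d)
        edge_refs.append(ex["edges"] + offset)   # (num_edges, 2) node indices, shifted
        offset += ex["node_features"].shape[0]   # shift subsequent examples' indices
    return {
        "node_features": torch.cat(node_feats, dim=0),
        "edges": torch.cat(edge_refs, dim=0),
    }

# Usage: two toy examples with 3 and 2 nodes; the second example's
# edge (0, 1) becomes (3, 4) in the batched tensors.
batch = collate([
    {"node_features": torch.randn(3, 8), "edges": torch.tensor([[0, 1], [1, 2]])},
    {"node_features": torch.randn(2, 8), "edges": torch.tensor([[0, 1]])},
])
```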

## Expand-Collapse
We extended the paths-based graph encoder (originally suggested by Alon et al. in [`code2seq`](https://github.com/tech-srl/code2seq)). Our *Expand-Collapse* graph-encoding framework expands the input graph into paths, uses a sequential encoder to process them (propagating information along each path individually), and then collapses the paths back into a nodes representation (the encodings of a node's occurrences scattered along the paths are folded back into a single node representation).

We integrate this approach into the hierarchic model both as a *micro* operator (applied over the top-level expression sub-ASTs) and as a *macro* operator (applied over the CFG).
![Expand Collapse Framework Figure](https://gitfront.io/r/user-5760758/c3d4b342f8f48fe8e3764b2c30925ea140b99535/NDFA/raw/doc/figures/expand-collapse-framework-figure.webp "Expand Collapse Framework Figure")
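
A minimal `PyTorch` sketch of the Expand-Collapse scheme under our own assumptions (a GRU as the sequential path encoder, mean-pooling as the collapse; the class name and signatures are hypothetical):

```python
import torch
import torch.nn as nn

class ExpandCollapseSketch(nn.Module):
    """A minimal sketch of Expand-Collapse: expand the graph into paths,
    encode each path sequentially, then fold each node's occurrences
    along the paths back into a single node representation."""

    def __init__(self, d: int):
        super().__init__()
        # Stand-in sequential encoder for the expanded paths.
        self.path_encoder = nn.GRU(d, d, batch_first=True)

    def forward(self, node_enc: torch.Tensor, paths: torch.Tensor) -> torch.Tensor:
        # node_enc: (num_nodes, d); paths: (num_paths, path_len) LongTensor of node ids.
        expanded = node_enc[paths]                    # expand: (num_paths, path_len, d)
        occurrences, _ = self.path_encoder(expanded)  # propagate along each path
        # Collapse: average each node's occurrences back into one vector.
        num_nodes, d = node_enc.shape
        flat_idx = paths.reshape(-1)
        summed = torch.zeros(num_nodes, d).index_add_(
            0, flat_idx, occurrences.reshape(-1, d))
        counts = torch.zeros(num_nodes).index_add_(
            0, flat_idx, torch.ones(flat_idx.numel()))
        return summed / counts.clamp(min=1).unsqueeze(-1)  # (num_nodes, d)

# Usage: 5 nodes, 3 paths of length 4 (node ids may repeat across paths).
enc = ExpandCollapseSketch(d=8)
out = enc(torch.randn(5, 8), torch.randint(0, 5, (3, 4)))
```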
