LibSA4Py v0.1.0

@mir-am mir-am released this 04 Mar 13:58
· 215 commits to master since this release

Added

Pipeline

  • Parallel pipeline that speeds up processing of a Python dataset by using all CPU cores.
  • Storing processed Python projects in JSON-formatted files.
  • Excluding duplicate files of a dataset from processing.
  • Adding the file set (train/test/validation) to a processed project, if given.
  • Applying standard NLP operations on identifiers in a module.
  • Excluding cached projects before running the pipeline if specified.
  • Throwing NullProjectException for projects that have no source code files.
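
To illustrate how these pipeline steps could fit together, here is a minimal sketch using only the Python standard library. It is not LibSA4Py's implementation: the function names, the JSON layout, and the exact-hash duplicate check are all assumptions made for this example. Only the exception name `NullProjectException` comes from the notes above.

```python
import hashlib
import json
from concurrent.futures import ProcessPoolExecutor, as_completed
from pathlib import Path


class NullProjectException(Exception):
    """Raised for a project that contains no Python source files."""


def unique_files(paths):
    """Keep the first file per content hash (assumption: exact duplicates only)."""
    seen, kept = set(), []
    for path in sorted(paths):
        digest = hashlib.sha256(Path(path).read_bytes()).hexdigest()
        if digest not in seen:
            seen.add(digest)
            kept.append(path)
    return kept


def process_project(project_dir, out_dir, skip_cached=False):
    """Write one JSON file per project and return its path as a string."""
    project_dir = Path(project_dir)
    out_path = Path(out_dir) / f"{project_dir.name}.json"
    if skip_cached and out_path.exists():
        return str(out_path)  # already processed in an earlier run
    files = unique_files(project_dir.rglob("*.py"))
    if not files:
        raise NullProjectException(str(project_dir))
    # Hypothetical record layout; a real pipeline would store extracted features.
    record = {
        "project": project_dir.name,
        "files": [str(f.relative_to(project_dir)) for f in files],
    }
    out_path.write_text(json.dumps(record, indent=2))
    return str(out_path)


def process_dataset(project_dirs, out_dir, workers=None):
    """Process projects in parallel; workers=None lets the pool pick a size from the CPU count."""
    written = []
    with ProcessPoolExecutor(max_workers=workers) as pool:
        futures = [pool.submit(process_project, p, out_dir) for p in project_dirs]
        for future in as_completed(futures):
            try:
                written.append(future.result())
            except NullProjectException:
                continue  # skip projects without source files
    return sorted(written)
```

When calling `process_dataset` from a script, place the call under an `if __name__ == "__main__":` guard so that worker processes can import the module safely.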

AST-based Extractor

  • Creating a normalized Seq2Seq representation of a source code file, aligned with a sequence of its identifiers' types.
  • Extracting import names of a module.
  • Extracting the name of global variables in a module with their type annotations (if present).
  • Calculating type annotation coverage for the whole project and its source code files.
  • Extracting the name of classes in a module.
  • Extracting the name of class variables and their type annotation (if present).
  • Extracting the name of functions in a module or in a class.
  • Extracting the name of functions' parameters and their type annotations (if present).
  • Extracting return expressions in functions.
  • Extracting the occurrence of a function's parameters in the function's body.
  • Extracting the return type of functions (if present).
  • Extracting docstring for functions' parameters and their return type.
  • Extracting short and long descriptions of functions in their docstring.

AST-based Transformers

  • Adding space around source code tokens for better tokenization.
  • Removing comments and docstrings from source code for its normalized Seq2Seq representation.
  • Removing string literals from source code for its normalized Seq2Seq representation.
  • Removing numeric literals from source code for its normalized Seq2Seq representation.
  • Removing type annotations from source code for its normalized Seq2Seq representation.
  • Propagating the types of functions' parameters in the function body, and the types of module-level constants.

Fixed

  • A special case where uninitialized variables with types caused exceptions.
  • A case where variables in a tuple couldn't be extracted in multiple assignments.
  • Handling nested tuples in multiple assignments when extracting variable names.
  • A case where a type-annotated class attribute is not initialized when removing its type annotation.
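
The tuple-related fixes above concern assignment targets that can nest arbitrarily. A minimal sketch of how such targets can be flattened with the standard `ast` module is shown below. The helper names `target_names` and `assigned_names` are hypothetical and are not part of LibSA4Py.

```python
import ast


def target_names(target):
    """Recursively collect plain variable names from an assignment target."""
    if isinstance(target, ast.Name):
        return [target.id]
    if isinstance(target, (ast.Tuple, ast.List)):
        return [name for elt in target.elts for name in target_names(elt)]
    if isinstance(target, ast.Starred):
        return target_names(target.value)
    return []  # attributes and subscripts are ignored in this sketch


def assigned_names(source):
    """List the names bound by module-level assignments, in source order."""
    names = []
    for stmt in ast.parse(source).body:
        if isinstance(stmt, ast.Assign):
            for target in stmt.targets:
                names += target_names(target)
        elif isinstance(stmt, ast.AnnAssign):
            # Covers annotated names without a value, such as `x: int`.
            names += target_names(stmt.target)
    return names


print(assigned_names("a, (b, (c, d)) = 1, (2, (3, 4))\nx: int\ny = z = 0"))
# prints ['a', 'b', 'c', 'd', 'x', 'y', 'z']
```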