Skip to content
/ SEMA Public

SEMA is based on angr, a symbolic execution engine used to extract API calls. Especially, we extend ANGR with strategies to create representative signatures based on System Call Dependency graph (SCDG). Those SCDGs can be exploited in machine learning modules to do classification/detection.

License

Notifications You must be signed in to change notification settings

csvl/SEMA

Repository files navigation

☠️ SEMA ☠️

ToolChain using Symbolic Execution for Malware Analysis.

  ██████ ▓█████  ███▄ ▄███▓ ▄▄▄
▒██    ▒ ▓█   ▀ ▓██▒▀█▀ ██▒▒████▄
░ ▓██▄   ▒███   ▓██    ▓██░▒██  ▀█▄
  ▒   ██▒▒▓█  ▄ ▒██    ▒██ ░██▄▄▄▄██
▒██████▒▒░▒████▒▒██▒   ░██▒ ▓█   ▓██▒
▒ ▒▓▒ ▒ ░░░ ▒░ ░░ ▒░   ░  ░ ▒▒   ▓▒█░
░ ░▒  ░ ░ ░ ░  ░░  ░      ░  ▒   ▒▒ ░
░  ░  ░     ░   ░      ░     ░   ▒
      ░     ░  ░       ░         ░  ░

Documentation Built by gendocs .github/workflows/.pre-commit-config.yaml CodeQL Documentation Generation pages-build-deployment Python application

Python Docker Debian

Star History Chart

Toolchain architecture

Our toolchain is represented in the following figure and works as follows:

  • A collection of labelled binaries from different malware families is collected and used as the input of the toolchain.
  • Angr, a framework for symbolic execution, is used to execute binaries symbolically and extract execution traces. For this purpose, different heuristics have been developed to optimize symbolic execution.
  • Several execution traces (i.e., API calls used and their arguments) corresponding to one binary are extracted with Angr and gathered together using several graph heuristics to construct a SCDG.
  • These resulting SCDGs are then used as input to graph mining to extract common graphs between SCDGs of the same family and create a signature.
  • Finally, when a new sample has to be classified, its SCDG is built and compared with SCDGs of known families using a simple similarity metric.

Toolchain Illustration

This repository contains a first version of a SCDG extractor. During the symbolic analysis of a binary, all system calls and their arguments found are recorded. After some stop conditions for symbolic analysis, a graph is built as follows: Nodes are system calls recorded, edges show that some arguments are shared between calls.

When a new sample has to be evaluated, its SCDG is first built as described previously. Then, gspan is applied to extract the biggest common subgraph and a similarity score is evaluated to decide if the graph is considered as part of the family or not. The similarity score S between graph G' and G'' is computed as follows: Since G'' is a subgraph of G', this is calculating how much G' appears in G''. Another classifier we use is the Support Vector Machine (SVM) with INRIA graph kernel or the Weisfeiler-Lehman extension graph kernel.

A web application is available and is called SemaWebApp. It allows to manage the launch of experiments on SemaSCDG and/or SemaClassifier.

Pre-commit

This repository uses pre-commit to ensure that the code is formatted correctly and that the code is clean. To install pre-commit, run the following command:

python3 -m pip install pre-commit
pre-commit install

Documentation

  • Complete README of the entire toolchain : Sema README

  • SCDG README : SCDG README

  • Classifier README : Classifier README

  • Web app README : Web app README

  • A Makefile is provided to ease the usage of the toolchain, run make help for more information about the available commands

Credentials

Main authors of the projects:

  • Charles-Henry Bertrand Van Ouytsel (UCLouvain)

  • Christophe Crochet (UCLouvain)

  • Khanh Huu The Dam (UCLouvain)

  • Oreins Manon (UCLouvain)

Under the supervision and with the support of Fabrizio Biondi (Avast)

Under the supervision and with the support of our professor Axel Legay (UCLouvain) (:heart:)

Linked papers

About

SEMA is based on angr, a symbolic execution engine used to extract API calls. Especially, we extend ANGR with strategies to create representative signatures based on System Call Dependency graph (SCDG). Those SCDGs can be exploited in machine learning modules to do classification/detection.

Topics

Resources

License

Stars

Watchers

Forks

Packages

No packages published