DataSystemsGroupUT/Process-Mining-Pipelines

CONTENTS


  • Introduction
  • Process Mining Pipeline: Workflow
  • Stack & Installation/Configuration Requirements
  • Contributors/Maintainers
  • Backlog (Shared Doc Link)
  • Troubleshooting

INTRODUCTION

  • This is a Proof of Concept (PoC) for distributed Process Mining (PM) data pipelines. It builds on the following concepts and tools:

  1. graphs:
    • Directly-Follows Graph (DFG)
    • Directed Acyclic Graph (DAG)
  2. database:
    • Neo4j graph database (accessed via the database .jar connector)
  3. data analysis:
    • distributed analytical engine, i.e., Apache Spark
      • interface: PySpark
    • PM4Py Process Mining algorithms (Python-based package)
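
To make the Directly-Follows Graph concept concrete, here is a minimal plain-Python sketch (not the PoC's Spark code). It assumes an event log of (case ID, activity, timestamp) tuples; the function name `directly_follows` and the sample log are hypothetical. An edge (a, b) counts how often activity b comes immediately after activity a within the same case. Unlike a DAG, a DFG may contain cycles when activities repeat.

```python
from collections import Counter, defaultdict


def directly_follows(events):
    """Count how often one activity is directly followed by another.

    events: iterable of (case_id, activity, timestamp) tuples.
    Returns a Counter keyed by (predecessor, successor).
    """
    # Collect the events of each case (trace) separately.
    traces = defaultdict(list)
    for case_id, activity, timestamp in events:
        traces[case_id].append((timestamp, activity))

    dfg = Counter()
    for steps in traces.values():
        steps.sort(key=lambda step: step[0])  # order each trace by timestamp
        activities = [activity for _, activity in steps]
        # Pair every activity with its immediate successor.
        for predecessor, successor in zip(activities, activities[1:]):
            dfg[(predecessor, successor)] += 1
    return dfg


# Hypothetical sample log, for illustration only.
log = [
    ("c1", "register", 1), ("c1", "check", 2), ("c1", "pay", 3),
    ("c2", "register", 1), ("c2", "pay", 2),
]
dfg = directly_follows(log)
```

Here `dfg` holds three edges, each with frequency 1: register→check, check→pay, and register→pay.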

PROCESS MINING PIPELINE: WORKFLOW

  • log files, i.e., .csv files
    • Apache Spark via PySpark interface
      • data partitioning (by case ID) and time windowing (by timestamps)
        • Directly-Follows Graph schema, i.e., predecessor(s) and successor(s) nodes
          • WRITE queries to a Neo4j graph database
            • format conversion for algorithmic analysis, i.e., Parent/Child nodes and their frequency
              • derive Petri net objects and store them in dataframes
                • PM4Py analysis: apply Process Mining algorithms to the DFGs in parallel, namely:
                  1. α-miner
                  2. heuristic miner
                  3. inductive miner
                    • plot evaluation metrics as comparative charts from the dataframes, which are obtained by converting Resilient Distributed Datasets (RDDs) to Pandas dataframes
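
The first stages of the workflow above (windowing, partitioning by case ID, building the DFG, and the Parent/Child frequency format) can be sketched in plain Python. This is a stand-in that runs without a cluster or a database, not the PoC's PySpark code. All names are assumptions: the fixed window width, the helper functions, and the Cypher statement with its `Activity` label and `DIRECTLY_FOLLOWS` relationship are hypothetical.

```python
from collections import Counter, defaultdict

# Hypothetical Cypher WRITE statement; label and relationship names are assumptions.
WRITE_EDGE = (
    "MERGE (a:Activity {name: $parent}) "
    "MERGE (b:Activity {name: $child}) "
    "MERGE (a)-[r:DIRECTLY_FOLLOWS]->(b) "
    "SET r.frequency = $frequency"
)


def window_of(timestamp, width):
    """Index of the fixed-width time window that contains the timestamp."""
    return timestamp // width


def partition(events, width):
    """Group events by (window, case ID), mimicking windowing and partitioning."""
    groups = defaultdict(list)
    for case_id, activity, timestamp in events:
        key = (window_of(timestamp, width), case_id)
        groups[key].append((timestamp, activity))
    return groups


def edges_per_window(events, width):
    """Build one directly-follows edge counter per time window."""
    per_window = defaultdict(Counter)
    for (window, _case_id), steps in partition(events, width).items():
        steps.sort(key=lambda step: step[0])  # order by timestamp
        names = [activity for _, activity in steps]
        for parent, child in zip(names, names[1:]):
            per_window[window][(parent, child)] += 1
    return per_window


def to_rows(edge_counts):
    """Convert edge counts to Parent/Child/frequency rows (query parameters)."""
    return [
        {"parent": parent, "child": child, "frequency": count}
        for (parent, child), count in sorted(edge_counts.items())
    ]


# Hypothetical sample log: (case_id, activity, timestamp).
log = [
    ("c1", "register", 0), ("c1", "check", 5), ("c1", "pay", 12),
    ("c2", "register", 1), ("c2", "check", 7),
    ("c3", "register", 11), ("c3", "pay", 14),
]
windows = edges_per_window(log, width=10)
rows = {window: to_rows(counts) for window, counts in windows.items()}
```

Each row dictionary matches the parameters of `WRITE_EDGE`, so it could be passed along with that statement to a Neo4j client. One design consequence of this sketch: an edge that crosses a window boundary (here, check→pay in case c1) is dropped, because each window is processed on its own.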

STACK & INSTALLATION/CONFIGURATION REQUIREMENTS

System config:

  • PySpark version 2.4.6
  • Java (JVM) version 8
  • Scala version 2.13

YAML file for setting up the Conda and Jupyter notebook environment:

name: pyspark-vs2
channels:
  - defaults
dependencies:
  - pip=20.2.4
  - python=3.7.9
  - pip:
    - matplotlib==3.3.4
    - numpy==1.20.0
    - pandas==1.2.1
    - pm4py==2.2.5
    - py4j==0.10.9.1
    - pydotplus==2.0.2
    - pyspark==2.4.7
    - python-graphviz==0.8.4


NOTE: if a package is not available in the required version, try installing it with pip or from conda-forge: $ conda install -c conda-forge [pkg_name]
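
To spot packages that are missing or installed in a different version than pinned, a small check along these lines may help. It is a sketch only: the function names are hypothetical and the `installed` mapping is hard-coded for illustration. In practice it could be filled from the output of `pip list --format=json`.

```python
def parse_pins(lines):
    """Turn 'name==version' strings into a {name: version} dict."""
    pins = {}
    for line in lines:
        name, _, version = line.strip().partition("==")
        if name and version:
            pins[name] = version
    return pins


def find_mismatches(pins, installed):
    """Return {name: (wanted, found)} for missing or differently versioned packages."""
    return {
        name: (wanted, installed.get(name))
        for name, wanted in pins.items()
        if installed.get(name) != wanted
    }


# Three of the pins from the YAML file above.
pins = parse_pins(["pm4py==2.2.5", "pandas==1.2.1", "pyspark==2.4.7"])

# Hypothetical installed versions, hard-coded for illustration.
installed = {"pm4py": "2.2.5", "pandas": "1.1.0"}

mismatches = find_mismatches(pins, installed)
```

With this sample data, `mismatches` reports pandas (wanted 1.2.1, found 1.1.0) and pyspark (wanted 2.4.7, not installed).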

CONTRIBUTORS/MAINTAINERS

BACKLOG (SHARED DOC LINK)

Backlog (Shared Doc)

