profileTR

datacorner edited this page Dec 4, 2023 · 4 revisions

profileTR Transformer

Purpose

This transformer analyses and profiles one or more datasets. It can provide several pieces of information about each dataset, such as:

  • The column names
  • The column types
  • The column patterns
  • The number of Null values
  • The number of distinct values
  • The top values
  • The frequency distribution of the values
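
To make the metrics above concrete, here is a minimal, dependency-free Python sketch of how such values could be computed for a single column. It is not pipelite's implementation: `profile_column`, `value_pattern` and `value_type` are hypothetical helpers, and encoding digits as N and letters as A is an assumption based on the patterns visible in the JSON example later on this page (such as N.N).

```python
from collections import Counter


def value_pattern(value):
    """Encode a value's shape: digits become N, letters become A (assumed convention)."""
    return "".join(
        "N" if ch.isdigit() else "A" if ch.isalpha() else ch
        for ch in str(value)
    )


def value_type(value):
    """Classify a value as 'null', 'number' or 'text' (hypothetical categories)."""
    if value is None:
        return "null"
    if isinstance(value, (int, float)):
        return "number"
    return "text"


def profile_column(name, values, max_value_counts=10):
    """Compute a small profile for one column given as a list of values."""
    present = [v for v in values if v is not None]
    return {
        "name": name,
        "null": len(values) - len(present),
        "distinct": len(set(present)),
        # Keep only the most frequent entries, as maxvaluecounts suggests.
        "top values": dict(Counter(str(v) for v in present).most_common(max_value_counts)),
        "pattern": dict(Counter(value_pattern(v) for v in present).most_common(max_value_counts)),
        "types": dict(Counter(value_type(v) for v in values)),
    }


if __name__ == "__main__":
    print(profile_column("id", [0.0, 1.0, 10.0, None, 1.0]))
```

Counting patterns rather than raw values is what makes a format problem visible: a column that mixes N.N and NN.N/A values stands out immediately even when every individual value is unique.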

It accepts several datasets as input and generates a single file, either JSON or HTML. The image below shows the HTML result:

Configuration

As for a data source, the settings specific to this transformer go in the parameters section of its entry in the configuration file:

  • directory: directory where the file generated will be created
  • filename: filename to generate
  • maxvaluecounts: maximum number of values listed in the top values, patterns and types analyses
  • output: html or json (format of the rendered file)

Data sources configuration

  • Inputs: one or more inputs
  • Outputs: 1 output

Configuration example:

    "transformers":  [     
    { 
        "id": "profiling",
        "classname": "pipelite.transformers.profileTR",
        "inputs" : [ "Source A", "test" ],
        "parameters": {
            "directory" : "tests/data/out",
            "filename" : "test.html",
            "maxvaluecounts": 10,
            "output": "html"
        }
    }
    ... ] ...
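
Before running a pipeline, it can help to sanity-check such an entry. The sketch below builds the same transformer entry in Python and checks its parameters against the list above. `check_profile_parameters` is a hypothetical helper, not part of pipelite, and treating all four parameters as required is an assumption.

```python
import json

ALLOWED_OUTPUTS = {"html", "json"}
# Assumption: all four documented parameters are mandatory.
REQUIRED = ("directory", "filename", "maxvaluecounts", "output")


def check_profile_parameters(params):
    """Return a list of problems found in a profileTR 'parameters' block."""
    problems = [f"missing parameter: {key}" for key in REQUIRED if key not in params]
    if "output" in params and params["output"] not in ALLOWED_OUTPUTS:
        problems.append(f"output must be html or json, got {params['output']!r}")
    count = params.get("maxvaluecounts")
    if count is not None and (not isinstance(count, int) or count < 1):
        problems.append("maxvaluecounts must be a positive integer")
    return problems


# Same entry as in the configuration example above.
transformer = {
    "id": "profiling",
    "classname": "pipelite.transformers.profileTR",
    "inputs": ["Source A", "test"],
    "parameters": {
        "directory": "tests/data/out",
        "filename": "test.html",
        "maxvaluecounts": 10,
        "output": "html",
    },
}

if __name__ == "__main__":
    print(check_profile_parameters(transformer["parameters"]))
    print(json.dumps({"transformers": [transformer]}, indent=4))
```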

JSON rendering

This is an example of a JSON file generated by this transformer.

{
  "sources": [
    {
      "id": "S",
      "profile": {
        "rows count": 14,
        "columns count": 16,
        "columns names": [
          "id",
          "concept:name",
          [... List here all the columns names ...]
        ],
        "columns": [
          {
            "id": "0",
            "name": "id",
            "type": "float64",
            "inferred": "float64",
            "distinct": 12,
            "nan": 2,
            "null": 2,
            "stats": {
              "count": 12.0,
              "mean": 6.166666666666667,
              "std": 4.323999271157397,
              "min": 0.0,
              "25%": 2.75,
              "50%": 6.0,
              "75%": 9.25,
              "max": 13.0
            },
            "top values": {
              "0.0": 1,
              "1.0": 1,
              [... All the Top Values ...]
            },
            "pattern": {
              "N.N": 9,
              "NN.N": 3,
              [... All the patterns ...]
            },
            "types": {
              "number": 12,
              "null": 2,
              [... All the types ...]
            }
          },
          [... All the other columns ...]
        ]
      }
    }
  ]
}
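
A generated JSON report can be consumed by other tools. As a minimal sketch, the Python below loads a report shaped like the example above and prints one summary line per source and per column. `summarize` is a hypothetical helper, and the sample is a trimmed copy of the example with the elided parts left out; a real report would be read from the file named by the filename parameter.

```python
import json


def summarize(report):
    """Return human-readable summary lines for a profiling report."""
    lines = []
    for source in report["sources"]:
        profile = source["profile"]
        lines.append(
            f"{source['id']}: {profile['rows count']} rows, "
            f"{profile['columns count']} columns"
        )
        for column in profile["columns"]:
            lines.append(
                f"  {column['name']} ({column['type']}): "
                f"{column['distinct']} distinct, {column['null']} null"
            )
    return lines


# Trimmed copy of the example report (only the fields used here).
sample = json.loads("""
{
  "sources": [
    {
      "id": "S",
      "profile": {
        "rows count": 14,
        "columns count": 16,
        "columns": [
          {"id": "0", "name": "id", "type": "float64", "distinct": 12, "null": 2}
        ]
      }
    }
  ]
}
""")

if __name__ == "__main__":
    for line in summarize(sample):
        print(line)
```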

Example

Example with HTML output

graph TD;
    id1[Read Data Source S1]-->id2[Dataset S1];
    id2[Dataset S1]-->id3[Profile S1];
    id11[Read Data Source S2]-->id21[Dataset S2];
    id21[Dataset S2]-->id31[Profile S2];
    id3[Profile S1]-->id4[Combine and create an HTML file];
    id31[Profile S2]-->id4[Combine and create an HTML file];

In this example, two data sources are read and profiled. A single HTML output file is then generated with all the computed results.
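
The flow in the diagram can be sketched in plain Python as follows. This is only an illustration of the read, profile and combine steps, not how pipelite implements them: `read_source`, `profile` and `render_html` are hypothetical helpers, the two sources are small inline CSV strings, and the profile is reduced to a row count and distinct counts.

```python
import csv
import io
from html import escape


def read_source(text):
    """Read CSV text into a list of row dictionaries."""
    return list(csv.DictReader(io.StringIO(text)))


def profile(rows):
    """Compute a tiny profile: row count and distinct values per column."""
    columns = list(rows[0].keys()) if rows else []
    return {
        "rows count": len(rows),
        "columns": {c: len({r[c] for r in rows}) for c in columns},
    }


def render_html(profiles):
    """Combine the profiles of all sources into one HTML page."""
    parts = ["<html><body>"]
    for source_id, prof in profiles.items():
        parts.append(f"<h2>{escape(source_id)} ({prof['rows count']} rows)</h2><ul>")
        for name, distinct in prof["columns"].items():
            parts.append(f"<li>{escape(name)}: {distinct} distinct values</li>")
        parts.append("</ul>")
    parts.append("</body></html>")
    return "\n".join(parts)


# Two made-up sources standing in for S1 and S2.
sources = {"S1": "id,name\n1,a\n2,b\n", "S2": "code\nx\nx\n"}
profiles = {sid: profile(read_source(text)) for sid, text in sources.items()}
page = render_html(profiles)

if __name__ == "__main__":
    print(page)
```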

Example with JSON output
