profileTR
This transformer analyses and profiles one or more datasets. It can provide several pieces of information about the dataset(s), such as:
- All the column names
- The column types
- The column patterns
- The number of null values
- The number of distinct values
- The top values
- The frequency distribution of the values
It works in a very simple way and can accept many datasets as input. The generated file can be a JSON file or an HTML file. The image below shows the HTML result:
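The metrics listed above can be computed per column with pandas. The sketch below is a hypothetical re-implementation for illustration only (the function names `profile_column` and `profile_dataset` are not part of pipelite's API); the actual transformer's internals may differ.

```python
import pandas as pd

def profile_column(series: pd.Series, max_value_counts: int = 10) -> dict:
    """Collect simple profiling metrics for a single column (sketch)."""
    return {
        "name": series.name,
        "type": str(series.dtype),
        "distinct": int(series.nunique(dropna=True)),
        "null": int(series.isna().sum()),
        # Most frequent values, capped like the maxvaluecounts parameter
        "top values": series.value_counts().head(max_value_counts).to_dict(),
    }

def profile_dataset(df: pd.DataFrame) -> dict:
    """Profile every column of a dataset (sketch)."""
    return {
        "rows count": len(df),
        "columns count": len(df.columns),
        "columns names": list(df.columns),
        "columns": [profile_column(df[c]) for c in df.columns],
    }

# Tiny usage example
df = pd.DataFrame({"id": [1.0, 2.0, None], "name": ["a", "a", "b"]})
profile = profile_dataset(df)
```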
The specific configuration (as for a Data Source) in the parameters section of the configuration file includes the following parameters:
- directory: directory where the generated file will be created
- filename: name of the file to generate
- maxvaluecounts: maximum number of values used for the Top Values, patterns, and type-detection analyses
- output: rendering format of the generated file, html or json
Data sources configuration
- Inputs: several inputs supported
- Outputs: 1 output
Configuration example:
```json
"transformers": [
    {
        "id": "profiling",
        "classname": "pipelite.transformers.profileTR",
        "inputs": [ "Source A", "test" ],
        "parameters": {
            "directory": "tests/data/out",
            "filename": "test.html",
            "maxvaluecounts": 10,
            "output": "html"
        }
    }
... ] ...
```
This is an example of a JSON file generated by this transformer.
```json
{
    "sources": [
        {
            "id": "S",
            "profile": {
                "rows count": 14,
                "columns count": 16,
                "columns names": [
                    "id",
                    "concept:name",
                    [... List here all the columns names ...]
                ],
                "columns": [
                    {
                        "id": "0",
                        "name": "id",
                        "type": "float64",
                        "inferred": "float64",
                        "distinct": 12,
                        "nan": 2,
                        "null": 2,
                        "stats": {
                            "count": 12.0,
                            "mean": 6.166666666666667,
                            "std": 4.323999271157397,
                            "min": 0.0,
                            "25%": 2.75,
                            "50%": 6.0,
                            "75%": 9.25,
                            "max": 13.0
                        },
                        "top values": {
                            "0.0": 1,
                            "1.0": 1,
                            [... All the Top Values ...]
                        },
                        "pattern": {
                            "N.N": 9,
                            "NN.N": 3,
                            [... All the patterns ...]
                        },
                        "types": {
                            "number": 12,
                            "null": 2,
                            [... All the types ...]
                        }
                    }
                ]
            }
        }
    ]
}
```
```mermaid
graph TD;
    id1[Read Data Source S1]-->id2[Dataset S1];
    id2[Dataset S1]-->id3[Profile S1];
    id11[Read Data Source S2]-->id21[Dataset S2];
    id21[Dataset S2]-->id31[Profile S2];
    id3[Profile S1]-->id4[Combine and create a HTML file];
    id31[Profile S2]-->id4[Combine and create a HTML file];
```
In this example, two data sources are read and profiled. A single output HTML file is then generated with all the computed results.