This documentation will provide you some explanation on how to use the catalog module and the different methods supported by this module.
It will include some examples but be aware that not all methods will be documented here.
To have a full view on the different API endpoints specific to the schema API, please refer to this API documentation.
Alternatively, you can use the docstring in the methods to have more information.
- Catalog module in aepp
Catalog is the system of record for data location and lineage within Adobe Experience Platform. Catalog Service does not contain the actual files or directories that contain the data. Instead, it holds the metadata and description of those files and directories.
Catalog acts as a metadata store or "catalog" where you can find information about your data within Experience Platform.
Simply put, Catalog acts as a metadata store or "catalog" where you can find information about your data within Experience Platform. You can use Catalog to answer the following questions:
- Where is my data located?
- At what stage of processing is this data?
- What systems or processes have acted on my data?
- How much data was successfully processed?
- What errors occurred during processing?
The data can be represented through different types of object depending the state of the data. Catalog is an exhaustive endpoint to fetch the status of these different objects:
- accounts
- batches
- connections
- connectors
- dataSets
- dataSetFiles
- dataSetViews
You can find more information on the official documentation of the catalog
Before importing the module, you would need to import the configuration file, or alternatively provide the information required for the API connection through the configure method. see getting starting.
Note that I am using the connectInstance
parameter to save the configuration imported.
To import the module you can use the import statement with the catalog
keyword.
import aepp
prod = aepp.importConfigFile('myConfig_file.json',connectInstance=True,sandbox='prod')
from aepp import catalog
The catalog module provides a class that you can use for generating and retrieving the different catalog objects.
The following documentation will provide you with more information on its capabilities.
The Catalog
class is the default API connector that you would encounter for any other submodules on this python module.
This class can be instantiated by calling the Catalog()
from the catalog
module.
Following the previous method described above, you can realize this:
import aepp
from aepp import catalog
mySandbox = aepp.importConfigFile('config.json',connectInstance=True,sandbox='prod')
myCat = catalog.Catalog(config=mySandbox)
3 parameters are possible for the instantiation of the class:
- config : OPTIONAL : the connect object instance created when you use
importConfigFile
with connectInstance parameter. Default to latest loaded configuration. - header : OPTIONAL : header object in the config module. (example: aepp.config.header)
- loggingObject : OPTIONAL : A logging object that can be passed for debuging or logging elements, see logging documentation
You can access some attributes once you have instantiated the class.
These following elements would be available:
- sandbox : provide which sandbox is currently being used
- header : provide the default header which is used for the requests.
- loggingEnabled : if the logging capability has been used
- endpoint : the default endpoint used for all methods.
Once you have run the getDatasets
method via the Catalog
instance. You will automatically populate the data
attributes.
This data attribute contains 3 dictionaries:
- table_names : represent the following mapping
{ 'datasetName' : 'QueryServiceTableName' }
- schema_ref : represent the following mapping
{ 'datasetName' : 'schemaRefId' }
- ids : represent the following mapping
{ 'datasetName' : 'datasetId' }
To enable it, just do the following:
import aepp
from aepp import catalog
mySandbox = aepp.importConfigFile('config.json',connectInstance=True,sandbox='prod')
myCat = catalog.Catalog(config=mySandbox)
mydatasets = myCat.getDatasets()
SchemaRef:dict = myCat.data.schema_ref
On the elements below, you can find all methods existing and that you can query when instantiating the Catalog
class.
Template for requesting data with a GET method.
Arguments:
- endpoint : REQUIRED : The URL to GET
- params: OPTIONAL : dictionary of the params to fetch
- format : OPTIONAL : Type of response returned. Possible values:
- json : default
- txt : text file
- raw : a response object from the requests module
Decode the full txt batch via the codecs module.
Usually the full batch is returned by the getResource method with format == "txt".
Arguments:
- message: REQUIRED : the text file return from the failed batch message.
return None when issue is raised
Try to create a list of dictionary messages from the decoded stream batch extracted from the decodeStreamBatch method.
Arguments:
- message : REQUIRED : a decoded text file, usually returned from the decodeStreamBatch method
- verbose : OPTIONAL : print errors and information on the decoding.
return None when issue is raised
Returns the last batch from a specific datasetId.
Arguments:
- dataSetId : OPTIONAL : the datasetId to be retrieved the batch about
- limit : OPTIONAL : number of batch per request
Possible kwargs: - created : Filter by the Unix timestamp (in milliseconds) when this object was persisted.
- lastBatchStatus : Filter by the status of the last related batch of the dataset [success, inactive, replay]
- properties : A comma separated allowlist of top-level object properties to be returned in the response. Used to cut down the number of properties and amount of data returned in the response bodies.
Retrieve a list of batches.
Arguments:
- limit : Limit response to a specified positive number of objects. Ex. limit=10 (max = 100)
- n_results : OPTIONAL : number of result you want to get in total. (will loop - "inf" to get as many as possible)
- output : OPTIONAL : Can be "raw" response (dict) or "dataframe".
Possible kwargs: - created : Filter by the Unix timestamp (in milliseconds) when this object was persisted.
- createdAfter : Exclusively filter records created after this timestamp.
- createdBefore : Exclusively filter records created before this timestamp.
- start : Returns results from a specific offset of objects. This was previously called offset. (see next line)
offset : Will offset to the next limit (sort of pagination) - updated : Filter by the Unix timestamp (in milliseconds) for the time of last modification.
- createdUser : Filter by the ID of the user who created this object.
- dataSet : Used to filter on the related object: &dataSet=dataSetId.
- version : Filter by Semantic version of the account. Updated when the object is modified.
- status : Filter by the current (mutable) status of the batch.
possible values:"processing","success","failure","queued","retrying","stalled","aborted","abandoned","inactive","failed","loading","loaded","staged","active","staging","deleted" - orderBy : Sort parameter and direction for sorting the response.\ Ex. orderBy=asc:created,updated. This was previously called sort.
- properties : A comma separated whitelist of top-level object properties to be returned in the response.
Used to cut down the number of properties and amount of data returned in the response bodies. - size : The number of bytes processed in the batch.
/Batches/get_batch, more details : https://www.adobe.io/apis/experienceplatform/home/api-reference.html
Abstraction of getBatches method that focus on failed batches and return a dataframe with the batchId and errors.
Also adding some meta data information from the batch information provided.
Arguments:
- limit : Limit response to a specified positive number of objects. Ex. limit=10 (max = 100)
- n_results : OPTIONAL : number of result you want to get in total. (will loop)
- orderBy : OPTIONAL : The order of the batch. Default "desc:created"
Possible kwargs: Any additional parameter for filtering the requests
Get a specific batch id.
Arguments:
- batch_id : REQUIRED : batch ID to be retrieved.
Create a new batch.
Arguments:
- object : REQUIRED : Object that define the data to be onboarded.
see reference here: https://www.adobe.io/apis/experienceplatform/home/api-reference.html#/Batches/postBatch
Retrieve a list of resource links for the Catalog Service.
Possible kwargs:
- limit : Limit response to a specified positive number of objects. Ex. limit=10
- orderBy : Sort parameter and direction for sorting the response.
Ex. orderBy=asc:created,updated. This was previously called sort. - property : A comma separated whitelist of top-level object properties to be returned in the response.
Used to cut down the number of properties and amount of data returned in the response bodies.
Return a list of a datasets.
Arguments:
- limit : REQUIRED : amount of dataset to be retrieved per call.
- output : OPTIONAL : Default is "raw", other option is "df" for dataframe output
Possible kwargs: - state : The state related to a dataset.
- created : Filter by the Unix timestamp (in milliseconds) when this object was persisted.
- updated : Filter by the Unix timestamp (in milliseconds) for the time of last modification.
- name : Filter by the a descriptive, human-readable name for this DataSet.
- namespace : One of the registered platform acronyms that identify the platform.
- version : Filter by Semantic version of the account. Updated when the object is modified.
- property : Regex used to filter objects in the response. Ex. property=name~^test.
/Datasets/get_data_sets, more possibilities : https://www.adobe.io/apis/experienceplatform/home/api-reference.html
Return a dictionary of Profile Snapshot datasetId, containing the information related to them.
Arguments:
- explicitMergePolicy : OPTIONAL : Provide a mergePolicyName attribute for the dataset to explain.
Create a new dataSets based either on preconfigured setup or by passing the full dictionary for creation.
Arguments:
- data : REQUIRED : If you want to pass the dataset object directly (not require the name and schemaId then)
more info: https://www.adobe.io/apis/experienceplatform/home/api-reference.html#/Datasets/postDataset - name : REQUIRED : if you wish to create a dataset via autocompletion. Provide a name.
- schemaId : REQUIRED : The schema $id reference for creating your dataSet.
- profileEnabled : OPTIONAL : If the dataset to be created with profile enbaled
- identityEnabled : OPTIONAL : If the dataset should create new identities
- upsert : OPTIONAL : If the dataset to be created with profile enbaled and Upsert capability.
- tags : OPTIONAL : set of attribute to add as tags.
- systemLabels : OPTIONAL : A list of string to attribute system based label on creation.
possible kwargs: - requestDataSource : Set to true if you want Catalog to create a dataSource on your behalf; otherwise, pass a dataSourceId in the body.
Return a single dataset.
Arguments:
- datasetId : REQUIRED : Id of the dataset to be retrieved.
Return a single dataset observable schema.
Which means that the fields that has been used in that dataset.
Arguments:
- datasetId : REQUIRED : Id of the dataset for which the observable schema should be retrieved.
- appendDatasetInfo : OPTIONAL : If set to True, it will append the "datasetId" into the dictionary return
Delete a dataset by its id.
Arguments:
- datasetId : REQUIRED : Id of the dataset to be deleted.
Get views of the datasets.
Arguments:
- datasetId : REQUIRED : Id of the dataset to be looked down.
Possible kwargs: - limit : Limit response to a specified positive number of objects. Ex. limit=10
- orderBy : Sort parameter and direction for sorting the response. Ex. orderBy=asc:created,updated.
- start : Returns results from a specific offset of objects. This was previously called offset. Ex. start=3.
- property : Regex used to filter objects in the response. Ex. property=name~^test.
Get a specific view on a specific dataset.
Arguments:
- datasetId : REQUIRED : ID of the dataset to be looked down.
- viewId : REQUIRED : ID of the view to be look upon.
Returns the list of files attached to a view in a Dataset.
Arguments:
- datasetId : REQUIRED : ID of the dataset to be looked down.
- viewId : REQUIRED : ID of the view to be look upon.
Enable a dataset for profile with upsert.
Arguments:
- datasetId : REQUIRED : Dataset ID to be enabled for profile
- upsert : OPTIONAL : If you wish to enabled the dataset for upsert.
Enable a dataset for profile with upsert.
Arguments:
- datasetId : REQUIRED : Dataset ID to be enabled for Identity
Enable a dataset for upsert on profile. The dataset is automatically enabled for profile as well. Arguments: datasetId : REQUIRED : Dataset ID to be enabled for upsert
Disable the dataset for Profile ingestion.
Arguments:
- datasetId : REQUIRED : Dataset ID to be disabled for profile
Disable a dataset for identity ingestion
Arguments:
- datasetId : REQUIRED : Dataset ID to be disabled for Identity
Create a dataset with an union Profile schema.
Get failed batches for Mapper errors, based on error code containing "MAPPER".
Arguments:
- limit : OPTIONAL : Number of results per requests
- n_results : OPTIONAL : Total number of results wanted.
Possible kwargs: - created : Filter by the Unix timestamp (in milliseconds) when this object was persisted.
- createdAfter : Exclusively filter records created after this timestamp.
- createdBefore : Exclusively filter records created before this timestamp.
- start : Returns results from a specific offset of objects. This was previously called offset. (see next line)
offset : Will offset to the next limit (sort of pagination) - updated : Filter by the Unix timestamp (in milliseconds) for the time of last modification.
- createdUser : Filter by the ID of the user who created this object.
- dataSet : Used to filter on the related object: &dataSet=dataSetId.
- version : Filter by Semantic version of the account. Updated when the object is modified.
- status : Filter by the current (mutable) status of the batch.
- orderBy : Sort parameter and direction for sorting the response.
Ex. orderBy=asc:created,updated. This was previously called sort. *properties : A comma separated whitelist of top-level object properties to be returned in the response.
Used to cut down the number of properties and amount of data returned in the response bodies. - size : The number of bytes processed in the batch.
Recursive function to find the active batch from any batch.
In case the active batch is part of a consolidation job, it returns the batch before consolidation.
Argument:
- batchId : REQUIRED : The original batch you want to look.
- predecessor : OPTIONAL : The predecessor
The ObservableSchemaManager
is a class that will allow you to take the observable schema returned by the getDataSetObservableSchema
method and construct a similar element that is the Schema Manager.
This would allow you to use the to_dataframe
or to_dict
in order to build visual representation of the data being ingested in that dataset.
You can then compare the defined schema with the observable schema in order to see which fields is used or not.
In order to instantiate the ObservableSchemaManager
, you would need to provide the response of the getDataSetObservableSchema
method.
Argument:
- observableSchema : dictionary of the data stored in the "observableSchema" key
Search a field in the observable schema.
Arguments:
- string : REQUIRED : the string to look for for one of the field
- partialMatch : OPTIONAL : if you want to look for complete string or not. (default True)
- caseSensitive : OPTIONAL : if you want to compare with case sensitivity or not. (default False)
Generate a dataframe with the row representing each possible path.
Arguments:
- save : OPTIONAL : If you wish to save it with the title used by the field group.
save as csv with the title used. Not title, used "unknown_fieldGroup_" + timestamp. - queryPath : OPTIONAL : If you want to have the query path to be used.
- description : OPTIONAL : If you want to have the description used
Generate a dictionary representing the field group constitution
Arguments:
- typed : OPTIONAL : If you want the type associated with the field group to be given.
- save : OPTIONAL : If you wish to save the dictionary in a JSON file
A method to compare the existing schema with the observable schema and find out the difference in them.
It output a dataframe with all of the path, the field group, the type (if provided) and the part availability (in that dataset)
Arguments:
- SchemaManager : REQUIRED : the SchemaManager class instance for that schema.
Important amount of use-cases are covered in the Official documentation on catalog endpoints.
However, we can cover some with our module so you can see how it translates within python.
You have the possibility to list all of the datasets that are existing within your AEP instance or sandbox.
In order to realize that, you can getDatasets()
method.
This will return the list of the dataset with all information attached to them.
The data returned could be overwhelming if you just want to have the basic information.
For that reason, the realization of that method will feed 3 data attributes on your instance.
You can access them by using the data
attribute on that instance:
catalogInstance.data.table_names
: gives you the dataSet names & Query Service namescatalogInstance.data.schema_ref
: gives you the dataSet names & their schema referencecatalogInstance.data.ids
: gives you the dataSet name and their Ids
import aepp
from aepp import catalog
mySandbox = aepp.importConfigFile('config.json',connectInstance=True,sandbox='prod')
myCat = catalog.Catalog(config=mySandbox)
DataSets = myCat.getDatasets()
## table names
myCat.data.table_names
## Schema ref
myCat.data.schema_ref
## DataSet Ids
myCat.data.ids
The catalog module can help you identify the batches that help build the dataset.
This can help you identify or monitor a specific dataSet for batch failure.
Several options exist to filter for specific dataSet or specific status.
The options are kwargs parameters and are available in the docstring.
Monitoring batch for a specific dataSet:
batches = myCat.getBatches(dataSet="60103d1b49bb4f194a5be4c9")
Monitoring failed batch:
batches = myCat.getBatches(status='failed')
Monitoring failed batch for a specific dataSet:
batches = myCat.getBatches(status='failed',dataSet="60103d1b49bb4f194a5be4c9")
It is common to monitor only the batches that have failed during ingestion.
In order to realize that, you can use one of the methods that I presented above in the documentation for batch monitoring.
I have also created a specific failed batches monitoring method that will return you some data required for investigation.
Name of that method : getFailedBatchesDF
More data can be found by the getBatch
method but this simplified method can be useful as reporting.
It will provide you the following information:
- batchId
- timestamp
- recordsSize
- invalidRecordsProfile
- invalidRecordsIdentity
- invalidRecordCount
- invalidRecordsStreamingValidation
- invalidRecordsMapper
- invalidRecordsUnknown
- errorCode
- errorMessage
- flowId (if available)
- dataSetId
- sandboxId
This method is useful as it returns a dataframe that you can save for reporting or passing along to your adobe representative.
Arguments:
- limit : Limit response to a specified positive number of objects. Ex. limit=10 (max = 100)
- n_results : OPTIONAL : number of result you want to get in total. (will loop)
failedBatches = myCat.getFailedBatchesDF(limit=50,n_results=200)
The Catalog endpoints allows you to create DataSet programatically.
You will need to pass the schemaRef $id to be used and set some options required (format, persistence,containerFormat).
Example of a request type (with createDataSource set to True
):\
{
"name":"Example Dataset",
"schemaRef": {
"id": "https://ns.adobe.com/{TENANT_ID}/schemas/{SCHEMA_ID}",
"contentType": "application/vnd.adobe.xed+json;version=1"
},
"fileDescription": {
"persisted": true,
"containerFormat": "parquet",
"format": "parquet"
}
}
Note that you can create dataSet for parquet or JSON format data type.