components llm_ingest_dataset_to_acs_basic

LLM - Dataset to ACS Pipeline

Single-job pipeline that chunks data from an AzureML data asset and creates an ACS embeddings index.

Version: 0.0.83

llm_model config

Name	Description	Type	Default	Optional
llm_config	JSON describing the LLM provider and model details to use for prompt generation.	string	{"type": "azure_open_ai", "model_name": "gpt-35-turbo", "deployment_name": "gpt-35-turbo", "temperature": 0, "max_tokens": 2000}
llm_connection	Azure OpenAI workspace connection ARM ID	string		True
acs_config	JSON describing the ACS index to create or update.	string
acs_connection	Azure Cognitive Search workspace connection ARM ID	string		True
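
Because llm_config is typed as a string, the JSON has to be serialized before it is passed in. A minimal sketch, assuming the value is built in Python first; the keys and values are copied from the documented default, and the variable names are illustrative only:

```python
import json

# Values copied from the documented default for llm_config.
llm_config = {
    "type": "azure_open_ai",
    "model_name": "gpt-35-turbo",
    "deployment_name": "gpt-35-turbo",
    "temperature": 0,
    "max_tokens": 2000,
}

# The input is a string, so serialize the dict rather than passing it directly.
llm_config_str = json.dumps(llm_config)

# Round-trip check: the string parses back to the same settings.
assert json.loads(llm_config_str) == llm_config
```

Building the dict first and serializing it once avoids hand-written quoting mistakes in the JSON string.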

register settings

Name	Description	Type	Default	Optional	Enum
embeddings_dataset_name	Name of the vector index	string	EmbeddingsOutput	True

compute settings

Name	Description	Type	Default	Optional	Enum
serverless_instance_count	Instance count to use for the serverless compute	integer	1	True
serverless_instance_type	The Instance Type to be used for the serverless compute	string	Standard_E8s_v3	True

data to import

Name	Description	Type	Default	Optional	Enum
input_data	Input AzureML data asset UriFolder to bring in data from.	uri_folder

Data Chunker

Name	Description	Type	Default	Optional
chunk_size	Chunk size (by token) to pass into the text splitter before performing embeddings	integer	1024
chunk_overlap	Overlap of content (by token) between the chunks	integer	0
input_glob	Glob pattern to filter files from the input folder, e.g. 'articles/**/*'	string		True
max_sample_files	Number of files read in during QA test data generation	integer	-1	True
data_source_url	The url which can be appended to file names to form citation links for documents	string
document_path_replacement_regex	A JSON string with two fields, 'match_pattern' and 'replacement_pattern', to be used with re.sub on the source url. e.g. '{"match_pattern": "(.*)/articles/(.*)(\\.[^.]+)$", "replacement_pattern": "\\1/\\2"}' would remove '/articles' from the middle of the url and drop the file extension.	string		True
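
The effect of document_path_replacement_regex can be checked locally before running the pipeline. The sketch below assumes the capture groups in the example are meant to be `(.*)` and that the component parses the JSON and calls re.sub with the two fields; the helper name `rewrite_source_url` and the URL are hypothetical.

```python
import json
import re

# JSON string in the shape described for document_path_replacement_regex.
# Backslashes are doubled so that the JSON parser yields \. and \1/\2.
replacement_json = (
    r'{"match_pattern": "(.*)/articles/(.*)(\\.[^.]+)$", '
    r'"replacement_pattern": "\\1/\\2"}'
)


def rewrite_source_url(url: str, config_json: str) -> str:
    """Apply match_pattern / replacement_pattern to a URL with re.sub."""
    config = json.loads(config_json)
    return re.sub(config["match_pattern"], config["replacement_pattern"], url)


# Hypothetical document URL, used only for illustration.
url = "https://example.com/docs/articles/guide/intro.md"
print(rewrite_source_url(url, replacement_json))
# prints "https://example.com/docs/guide/intro"
```

Note that the third group, the file extension, is matched but not referenced in the replacement, so it is removed along with '/articles'.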

Embeddings components

Name	Description	Type	Default	Optional
embeddings_container	Folder to contain generated embeddings. Should be the parent folder of the 'embeddings' output path used for this component. Input data is compared to existing embeddings so that only changed or new data is embedded and existing chunks are reused.	uri_folder		True
embeddings_model	The model to use to embed data. E.g. 'hugging_face://model/sentence-transformers/all-mpnet-base-v2' or 'azure_open_ai://deployment/{deployment_name}/model/{model_name}'	string	azure_open_ai://deployment/text-embedding-ada-002/model/text-embedding-ada-002
embedding_connection	Azure OpenAI workspace connection ARM ID for embeddings	string		True
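
The embeddings_model value follows one of the two URI shapes shown above. As a rough sketch of how such a string breaks down (this is not the component's own parser, and `parse_embeddings_model` is a hypothetical helper), the parts can be pulled apart like this:

```python
def parse_embeddings_model(uri: str) -> dict:
    """Split an embeddings_model URI into its provider and model parts."""
    scheme, sep, rest = uri.partition("://")
    if not sep:
        raise ValueError(f"missing '://' in {uri!r}")
    parts = rest.split("/")

    if scheme == "azure_open_ai":
        # Expected shape: deployment/{deployment_name}/model/{model_name}
        if len(parts) != 4 or parts[0] != "deployment" or parts[2] != "model":
            raise ValueError(f"unexpected azure_open_ai path: {rest!r}")
        return {
            "kind": scheme,
            "deployment_name": parts[1],
            "model_name": parts[3],
        }

    if scheme == "hugging_face":
        # Expected shape: model/{model_name}, where the name may contain '/'
        if len(parts) < 2 or parts[0] != "model":
            raise ValueError(f"unexpected hugging_face path: {rest!r}")
        return {"kind": scheme, "model_name": "/".join(parts[1:])}

    raise ValueError(f"unsupported scheme: {scheme!r}")


# The documented default value.
default_model = parse_embeddings_model(
    "azure_open_ai://deployment/text-embedding-ada-002/model/text-embedding-ada-002"
)
print(default_model["deployment_name"])  # prints "text-embedding-ada-002"
```

The sketch uses str.partition rather than urllib.parse because the underscore in these scheme names is not a valid URI scheme character.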

Outputs

Name	Description	Type
acs_index	Folder containing the ACS MLIndex. Deserialized using azureml.rag.mlindex.MLIndex(uri).	uri_folder

defaults: compute: azureml:cpu-cluster
