diff --git a/TOOLS.md b/TOOLS.md index 1ce857bf..e866239d 100644 --- a/TOOLS.md +++ b/TOOLS.md @@ -53,7 +53,7 @@ including essential context and base instructions for working with it ### Search Tools - [find_component_id](#find_component_id): Returns list of component IDs that match the given query. -- [search](#search): Searches for Keboola items (tables, buckets, configurations, transformations, flows, etc. +- [search](#search): Searches for Keboola items (tables, buckets, components, configurations, transformations, flows, data-apps, etc. ### Storage Tools - [get_buckets](#get_buckets): Lists buckets or retrieves full details of specific buckets, including descriptions, @@ -2619,57 +2619,123 @@ USAGE EXAMPLES: **Description**: -Searches for Keboola items (tables, buckets, configurations, transformations, flows, etc.) in the current project -by matching patterns against item ID, name, display name, or description. Returns matching items grouped by type -with their IDs and metadata. +Searches for Keboola items (tables, buckets, components, configurations, transformations, flows, data-apps, etc.) +in the current project and returns matching items with their IDs and metadata. + +This tool supports two complementary search types: + +1) textual +- Searches item metadata fields by matching patterns against id, name, displayName, and description. +- For tables, also searches column names and column descriptions. + +2) config-based +- Searches item configurations (JSON objects) by matching patterns against the configuration values converted +to a string, optionally narrowed by JSONPath `scopes`. +- Also returns `match_scopes` with the JSONPaths and matched patterns per scope. THIS IS THE PRIMARY DISCOVERY TOOL. Always use it BEFORE any get_* tool when you need to find items -by name. Do NOT enumerate items with get_buckets, get_tables, get_configs, get_flows, or get_data_apps -just to locate a specific item — use this tool instead. +by name or specific configuration content. 
Do NOT enumerate items with get_buckets, get_tables, get_configs, +get_flows, or get_data_apps just to locate a specific item — use this tool instead. WHEN TO USE: -- User asks to "find", "locate", or "search for" something by name, keyword, or text pattern +- User asks to "find", "locate", or "search for" something by name, keyword, text pattern, configuration content, or +value - User mentions a partial name and you need to find the full item (e.g., "find the customer table") - User asks "what tables/configs/flows do I have with X in the name?" - You need to discover items before performing operations on them -- User asks to "list all items with [name] in it" +- User asks to "list all items with [name] or [configuration value/part] in it" +- User asks where a value, table, component, specific configuration ID, or specific setting is used in components, +data-apps, flows, or transformations +- You need to trace lineage by searching for IDs referenced in configurations, find flows that use a + specific component, find usage of a bucket/table in transformations or components, or find items with + specific parameters. +- User asks "what is the genesis of this item?" or "explain the business logic of this item?" HOW IT WORKS: -- Searches by regex pattern matching against id, name, displayName, and description fields -- For tables, also searches column names and column descriptions -- Case-insensitive search +- Supports two types: + - search_type="textual": matches against id, name, displayName, and description; for tables, also column names + and column descriptions + - search_type="config-based": matches inside configuration JSON objects, optionally narrowed by JSONPath `scopes` +- Case-insensitive search +- Pattern mode: `literal` (default) or `regex` - Multiple patterns work as OR condition - matches items containing ANY of the patterns -- Returns grouped results by item type (tables, buckets, configurations, flows, etc.) 
- Each result includes the item's ID, name, creation date, and relevant metadata +- scopes (config-based) narrow matching to specific JSONPath areas within configurations; matching is performed +against the stringified JSON node content in those areas. +- Config-based search always returns all matched paths per item in `match_scopes` (including the matched patterns) IMPORTANT: - Always use this tool when the user mentions a name but you don't have the exact ID - The search returns IDs that you can use with other tools (e.g., get_tables, get_configs, get_flows) - Results are ordered by update time. The most recently updated items are returned first. +- Provide `item_types` to make the search more efficient when you know the item type; scanning buckets and tables can +be expensive - For exact ID lookups, use specific tools like get_tables, get_configs, get_flows instead +- Use specific `scopes` only when you know the config structure (from a schema or a real example); otherwise run +config-based search without scopes. - Use find_component_id and get_configs tools to find configurations related to a specific component - If results are too numerous or empty, ask the user to refine their query rather than enumerating all items. 
USAGE EXAMPLES: +1) textual search examples: - user_input: "Find all tables with 'customer' in the name" - → patterns=["customer"], item_types=["table"] - → Returns all tables whose id, name, displayName, or description contains "customer" + → patterns=["customer"], item_types=["table"] + → Returns all tables whose id, name, displayName, or description contains "customer" - user_input: "Find tables with 'email' column" - → patterns=["email"], item_types=["table"] - → Returns all tables that have a column named "email" or with "email" in column description + → patterns=["email"], item_types=["table"] + → Returns all tables that have a column named "email" or with "email" in column description - user_input: "Search for the sales transformation" - → patterns=["sales"], item_types=["transformation"] - → Returns transformations with "sales" in any searchable field + → patterns=["sales"], item_types=["transformation"] + → Returns transformations with "sales" in any searchable field - user_input: "Find items named 'daily report' or 'weekly summary'" - → patterns=["daily.*report", "weekly.*summary"], item_types=[] - → Returns all items matching any of these patterns + → patterns=["daily.*report", "weekly.*summary"], item_types=[], mode="regex" + → Returns all items matching any of these patterns - user_input: "Show me all configurations related to Google Analytics" - → patterns=["google.*analytics"], item_types=["configuration"] - → Returns configurations with matching patterns + → patterns=["google.*analytics"], item_types=["configuration"], mode="regex" + → Returns configurations with matching patterns + +2) config-based search examples: +- user_input: "Find transformations/configs/components referencing table in.c-prod.customers" + -> patterns=["in.c-prod.customers"], item_types=["transformation", "configuration"], + search_type="config-based" + -> No scopes = search whole stringified config; result includes `match_scopes` with exact paths + patterns + +- user_input: 
"Find configurations/transformations (etc.) using a specific setting or ID anywhere" -> patterns=["setting", "id"], item_types=["configuration", "transformation"], search_type="config-based" + +- user_input: "Find configurations/transformations (etc.) using a specific setting or ID in parameters" +-> patterns=["setting", "id"], item_types=["configuration", "transformation"], search_type="config-based", +scopes=["parameters"] + +- user_input: "Find configurations/transformations (etc.) using a specific setting or ID in storage" +-> patterns=["setting", "id"], item_types=["configuration", "transformation"], search_type="config-based", +scopes=["storage"] + +- user_input: "Find configurations/transformations (etc.) using a specific setting or ID in authorization" + -> patterns=["setting", "id"], item_types=["configuration", "transformation"], search_type="config-based", + scopes=["parameters.authorization", "authorization"] + +- user_input: "Find components/transformations using my_bucket in input or output mappings" + -> patterns=["my_bucket"], item_types=["configuration", "transformation"], search_type="config-based", + scopes=["storage.input", "storage.output"] + -> Returns matches with paths like `storage.input.tables[0].source`, `storage.input.files[0].source`, + or `storage.output.tables[0].destination` + +- user_input: "Find flows using configuration ID 01k9cz233cvd1rga3zzx40g8qj" + -> patterns=["01k9cz233cvd1rga3zzx40g8qj"], item_types=["flow"], search_type="config-based", + scopes=["tasks", "phases"] + +- user_input: "Find transformations using this table / column / specific code in their scripts" + -> patterns=["element"], item_types=["transformation"], search_type="config-based", + scopes=["parameters", "storage"] + +- user_input: "Find data apps using something in their config / Python code / settings" + -> patterns=["something"], item_types=["data-app"], search_type="config-based" + -> Returns data apps where script/config sections contain the keyword and includes 
`match_scopes` **Input JSON Schema**: @@ -2677,7 +2743,7 @@ USAGE EXAMPLES: { "properties": { "patterns": { - "description": "One or more search patterns to match against item ID, name, display name, or description. Supports regex patterns. Case-insensitive. Examples: [\"customer\"], [\"sales\", \"revenue\"], [\"test.*table\"]. Do not use empty strings or empty lists.", + "description": "One or more search patterns to match against item ID, name, display name, description, or configuration JSON objects. Case-insensitive by default. Examples: [\"customer\"], [\"sales\", \"revenue\"], [\"my_bucket\"]. Do not use empty strings or empty lists.", "items": { "type": "string" }, @@ -2685,13 +2751,15 @@ USAGE EXAMPLES: }, "item_types": { "default": [], - "description": "Optional filter for specific Keboola item types. Leave empty to search all types. Common values: \"table\" (data tables), \"bucket\" (table containers), \"transformation\" (SQL/Python transformations), \"configuration\" (extractor/writer configs), \"flow\" (orchestration flows). Use when you know what type of item you're looking for.", + "description": "Filter for specific Keboola item types. Common values: \"table\" (data tables), \"bucket\" (table containers), \"transformation\" (SQL/Python transformations), \"component\" (extractor/writer/application components), \"data-app\" (data apps), \"flow\" (orchestration flows). Use when you know what type of item you're looking for or leave empty to search all types.", "items": { "enum": [ - "flow", "bucket", "table", + "data-app", + "flow", "transformation", + "component", "configuration", "configuration-row", "workspace", @@ -2703,6 +2771,32 @@ USAGE EXAMPLES: }, "type": "array" }, + "search_type": { + "default": "textual", + "description": "Search mode: \"textual\" (name/id/description) or \"config-based\" (stringified configuration payloads). 
(default: \"textual\")", + "enum": [ + "textual", + "config-based" + ], + "type": "string" + }, + "scopes": { + "default": [], + "description": "JSONPath expressions to narrow config-based search to specific parts of the configuration. Simple dot-notation (e.g. \"parameters.host\", \"storage.input[0].source\") and full JSONPath (e.g. \"$.tasks[*]\") are both supported. Leave empty to search the whole configuration.", + "items": { + "type": "string" + }, + "type": "array" + }, + "mode": { + "default": "literal", + "description": "How to interpret patterns: \"regex\" for regular expressions or \"literal\" for exact text (default: \"literal\").", + "enum": [ + "regex", + "literal" + ], + "type": "string" + }, "limit": { "default": 50, "description": "Maximum number of items to return (default: 50, max: 100).", diff --git a/integtests/tools/test_search.py b/integtests/tools/test_search.py index a175e8f2..6a283673 100644 --- a/integtests/tools/test_search.py +++ b/integtests/tools/test_search.py @@ -103,3 +103,32 @@ async def test_find_component_id(mcp_client: Client): assert full_result.content[0].type == 'text' decoded_toon = toon_format.decode(full_result.content[0].text) assert decoded_toon == result + + +@pytest.mark.asyncio +async def test_search_config_based_simple_query( + mcp_client: Client, + configs: list[ConfigDef], +) -> None: + """ + Test config-based search with a simple scoped query. 
+ """ + config = next(cfg for cfg in configs if cfg.component_id == 'ex-generic-v2') + full_result = await mcp_client.call_tool( + 'search', + { + 'patterns': ['wttr.in'], + 'item_types': ['configuration'], + 'search_type': 'config-based', + 'scopes': ['parameters.api.baseUrl'], + 'limit': 20, + 'offset': 0, + }, + ) + + assert full_result.structured_content is not None + result = [SearchHit.model_validate(hit) for hit in full_result.structured_content['result']] + + assert any( + hit.component_id == 'ex-generic-v2' and hit.configuration_id == config.configuration_id for hit in result + ), f'Expected config {config.configuration_id} to be returned. Found: {result}' diff --git a/pyproject.toml b/pyproject.toml index aaca9978..01ce055b 100644 --- a/pyproject.toml +++ b/pyproject.toml @@ -4,7 +4,7 @@ build-backend = "hatchling.build" [project] name = "keboola-mcp-server" -version = "1.44.8" +version = "1.45.0" description = "MCP server for interacting with Keboola Connection" readme = "README.md" requires-python = ">=3.10" diff --git a/src/keboola_mcp_server/resources/prompts/project_system_prompt.md b/src/keboola_mcp_server/resources/prompts/project_system_prompt.md index 355f2cdd..e4f394db 100644 --- a/src/keboola_mcp_server/resources/prompts/project_system_prompt.md +++ b/src/keboola_mcp_server/resources/prompts/project_system_prompt.md @@ -1,13 +1,18 @@ -### Finding Items by Name +### Finding Items When looking for specific items (tables, buckets, configurations, flows, data apps) by name, description, -or partial match, **always use the `search` tool first** rather than listing all items with `get_*` tools. +partial match, or configuration content/reference, **always use the `search` tool first** rather than listing all +items with `get_*` tools. -- `search` matches by regex against names, IDs, descriptions, and (for tables) column names. 
+- `search` supports: + - textual search over names, IDs, descriptions, and (for tables) column names + - config-based search over item configuration JSON contents, including scoped JSONPath search when useful - Listing all items with empty IDs (e.g., `get_buckets(bucket_ids=[])`, `get_configs()`, `get_flows(flow_ids=[])`) is wasteful on large projects and should only be used when you genuinely need a complete inventory. - If the user mentions a name but you do not have the exact ID, call `search` with an appropriate pattern and `item_types` filter. +- If the user asks where a table/component/config ID/value is used, call `search` with + `search_type="config-based"` (and use `scopes` when you know the config structure). - If `search` returns too many results or zero results, ask the user to be more specific rather than falling back to enumerating all items. diff --git a/src/keboola_mcp_server/tools/search.py b/src/keboola_mcp_server/tools/search.py index 3085f233..ab7af0b1 100644 --- a/src/keboola_mcp_server/tools/search.py +++ b/src/keboola_mcp_server/tools/search.py @@ -2,26 +2,29 @@ import json import logging import re -from typing import Annotated, Any, AsyncGenerator, Literal, Mapping, Sequence +from collections import defaultdict +from typing import Annotated, Any, AsyncGenerator, Iterable, Literal, Mapping, Sequence +import jsonpath_ng from fastmcp import Context, FastMCP from fastmcp.tools import FunctionTool +from jsonpath_ng.jsonpath import JSONPath from mcp.types import ToolAnnotations from pydantic import BaseModel, Field, PrivateAttr, model_validator from keboola_mcp_server.clients.base import JsonDict from keboola_mcp_server.clients.client import ( CONDITIONAL_FLOW_COMPONENT_ID, + DATA_APP_COMPONENT_ID, ORCHESTRATOR_COMPONENT_ID, KeboolaClient, get_metadata_property, ) -from keboola_mcp_server.clients.storage import ItemType from keboola_mcp_server.config import MetadataField from keboola_mcp_server.errors import tool_errors from 
keboola_mcp_server.links import Link, ProjectLinksManager from keboola_mcp_server.mcp import toon_serializer_compact -from keboola_mcp_server.tools.components.utils import get_nested +from keboola_mcp_server.tools.components.utils import _normalize_jsonpath, get_nested LOG = logging.getLogger(__name__) @@ -36,38 +39,33 @@ 'data-app', 'flow', 'transformation', + 'component', 'configuration', 'configuration-row', - 'component', 'workspace', 'shared-code', 'rows', 'state', ] + SearchComponentItemType = Literal[ 'flow', 'transformation', + 'component', 'configuration', 'configuration-row', 'workspace', ] -ITEM_TYPE_TO_COMPONENT_TYPES: Mapping[ItemType, Sequence[str]] = { - 'flow': ['other'], - 'transformation': ['transformation'], - 'configuration': ['extractor', 'writer'], - 'configuration-row': ['extractor', 'writer'], - 'workspace': ['other'], -} SEARCH_ITEM_TYPE_TO_COMPONENT_TYPES: Mapping[SearchItemType, Sequence[str]] = { 'data-app': ['other'], 'flow': ['other'], 'transformation': ['transformation'], + 'configuration': ['extractor', 'writer', 'application'], + 'configuration-row': ['extractor', 'writer', 'application'], 'component': ['extractor', 'writer', 'application'], - 'configuration': ['extractor', 'writer'], - 'configuration-row': ['extractor', 'writer'], 'workspace': ['other'], } @@ -113,14 +111,22 @@ class SearchHit(BaseModel): configuration_id: str | None = Field(default=None, description='The ID of the configuration.') configuration_row_id: str | None = Field(default=None, description='The ID of the configuration row.') - item_type: ItemType = Field(description='The type of the item (e.g. table, bucket, configuration, etc.).') + item_type: SearchItemType = Field(description='The type of the item (e.g. 
table, bucket, configuration, etc.).') updated: str = Field(description='The date and time the item was created in ISO 8601 format.') name: str | None = Field(default=None, description='Name of the item.') display_name: str | None = Field(default=None, description='Display name of the item.') description: str | None = Field(default=None, description='Description of the item.') + matches: list[PatternMatch] = Field( + default_factory=list, + description='Most specific JSONPath scopes with grouped matched patterns (config-based search only).', + ) links: list[Link] = Field(default_factory=list, description='Links to the item.') - _matches: list[PatternMatch] = PrivateAttr(default_factory=list) + + def __eq__(self, other: object) -> bool: + if isinstance(other, SearchHit): + return self.model_dump() == other.model_dump() + return False @model_validator(mode='after') def check_id_fields(self) -> 'SearchHit': @@ -145,9 +151,26 @@ def check_id_fields(self) -> 'SearchHit': return self - def with_matches(self, matches: list['PatternMatch']) -> 'SearchHit': + def set_matches(self, matches: list['PatternMatch']) -> 'SearchHit': """Assign pattern matches to this search hit and return self for chaining.""" - self._matches = matches + patterns_by_scope: dict[str, set[str]] = defaultdict(set) + for match in matches: + if not match.scope: + continue + patterns_by_scope[match.scope].update(match.patterns) + + unique_scopes = list(patterns_by_scope) + most_specific_scopes = [ + scope + for scope in unique_scopes + if not any( + other.startswith(scope) and len(other) > len(scope) and other[len(scope)] in ('.', '[') + for other in unique_scopes + ) + ] + self.matches = [ + PatternMatch(scope=scope, patterns=list(patterns_by_scope[scope])) for scope in most_specific_scopes + ] return self @@ -163,6 +186,9 @@ class SearchSpec(BaseModel): _component_types: Sequence[str] = PrivateAttr(default_factory=tuple) _compiled_patterns: list[re.Pattern] = PrivateAttr(default_factory=list) 
_clean_patterns: list[str] = PrivateAttr(default_factory=list) + _all_nodes_expr: JSONPath | None = PrivateAttr(default=None) + # Tuple fields: (original_scope, parsed_scope_expr, parsed_descendants_expr) + _scope_exprs: list[tuple[str, JSONPath, JSONPath]] = PrivateAttr(default_factory=list) @model_validator(mode='after') def _compile_patterns(self) -> 'SearchSpec': @@ -192,8 +218,27 @@ def _validate_component_args(self) -> 'SearchSpec': ) return self + @model_validator(mode='after') + def _validate_item_types(self) -> 'SearchSpec': + if 'component' in self.item_types: + self.item_types = list({*self.item_types, 'configuration', 'configuration-row'}) + return self + + @model_validator(mode='after') + def _compile_jsonpath_exprs(self) -> 'SearchSpec': + # Compile commonly used expressions once per SearchSpec instance. + self._all_nodes_expr = jsonpath_ng.parse('$..*') + self._scope_exprs = [] + for scope in self.search_scopes: + normalized = _normalize_jsonpath(scope if scope.startswith('$') else f'$.{scope}') + try: + self._scope_exprs.append((scope, jsonpath_ng.parse(normalized), jsonpath_ng.parse(f'{normalized}..*'))) + except Exception as e: + LOG.warning(f'Invalid JSONPath scope "{scope}": {e}') + return self + @staticmethod - def _stringify(value: JsonDict) -> str: + def _stringify(value: Any) -> str: try: return json.dumps(value, sort_keys=True, default=str, ensure_ascii=False) except (TypeError, ValueError): @@ -221,27 +266,70 @@ def match_patterns(self, value: str | JsonDict | None) -> list[str]: return matches + def _find_matches_for_expr( + self, configuration: JsonDict, parsed_expr: JSONPath, scalar_only: bool = False + ) -> list[PatternMatch]: + """Find pattern matches on JSON nodes matched by a JSONPath expression. 
If scalar_only is True, only scalar + nodes are matched.""" + matches: list[PatternMatch] = [] + for jpath_match in parsed_expr.find(configuration): + value = jpath_match.value + if scalar_only and isinstance(value, (dict, list)): + continue + if matched := self.match_patterns(value): + matches.append( + PatternMatch( + scope=re.sub(r'\.\[', '[', str(jpath_match.full_path)), + patterns=matched, + ) + ) + if not self.return_all_matched_patterns: + return matches + return matches + def match_configuration_scopes(self, configuration: JsonDict | None) -> list[PatternMatch]: """ - Checks configuration fields within specified scopes for pattern matches. + Checks configuration fields within specified JSONPath scopes for pattern matches. + Walks matching nodes within each scope and returns the exact path where the match + was found. When no scopes are specified, walks the entire configuration. :param configuration: The configuration to match against the patterns. - :return: A tuple of scopes and patterns that matched the configuration; empty patterns if no matches. + :return: List of PatternMatch with matching JSONPath scopes; empty list if no matches. """ + if configuration is None: + return [] + if self.search_scopes: - matches: list[PatternMatch] = [] - for scope in self.search_scopes: - if matched := self.match_patterns(get_nested(configuration, scope, default=None)): - matches.append(PatternMatch(scope=scope, patterns=matched)) + all_matches: list[PatternMatch] = [] + # Deduplicate hits when scopes overlap (e.g. "parameters" + "parameters.query") + # or the same logical scope is provided multiple times. 
+ seen: set[str | None] = set() + for _scope, self_expr, desc_expr in self._scope_exprs: + # Search in self expression node for scalar matches first + self_matches = self._find_matches_for_expr(configuration, self_expr, scalar_only=True) + # If no scalar matches, search in descendants nodes + desc_matches: list[PatternMatch] = [] + if not self_matches: + desc_matches = self._find_matches_for_expr(configuration, desc_expr) + for match in self_matches or desc_matches: + if match.scope in seen: + continue + seen.add(match.scope) + all_matches.append(match) if not self.return_all_matched_patterns: - break - return matches + return all_matches + return all_matches + else: + # No scope provided – search all descendants and return exact match paths. + return self._find_matches_for_expr(configuration, self._all_nodes_expr) - if matched := self.match_patterns(configuration): - return [PatternMatch(scope=None, patterns=matched)] - return [] + def match_texts(self, texts: Iterable[str]) -> list[PatternMatch]: + """ + Matches a sequence of strings against the patterns. - def match_texts(self, texts: Sequence[str]) -> list[PatternMatch]: + :param texts: The sequence of strings to match against the patterns. + :return: A list of PatternMatch objects. 
+ """ matches: list[PatternMatch] = [] for text in texts: if matched := self.match_patterns(text): @@ -258,21 +346,18 @@ def _get_field_value(item: JsonDict, fields: Sequence[str]) -> Any | None: return None -def _check_column_match(table: JsonDict, spec: SearchSpec) -> bool: +def _check_column_match(table: JsonDict, cfg: SearchSpec) -> list[PatternMatch]: """Check if any column name or description matches the patterns.""" # Check column names (list of strings) - for col_name in table.get('columns', []): - if spec.match_patterns(col_name): - return True + if col_names := table.get('columns', []): + if matched := cfg.match_texts(col_names): + return matched - # Check column descriptions (from columnMetadata) - column_metadata = table.get('columnMetadata', {}) - for col_meta in column_metadata.values(): - col_description = get_metadata_property(col_meta, MetadataField.DESCRIPTION) - if spec.match_patterns(col_description): - return True - - return False + if col_metadata := table.get('columnMetadata', {}): + col_descs = (get_metadata_property(col_meta, MetadataField.DESCRIPTION) for col_meta in col_metadata.values()) + if matched := cfg.match_texts(filter(None, col_descs)): + return matched + return [] async def _fetch_buckets(client: KeboolaClient, spec: SearchSpec) -> list[SearchHit]: @@ -286,12 +371,7 @@ async def _fetch_buckets(client: KeboolaClient, spec: SearchSpec) -> list[Search bucket_display_name = bucket.get('displayName') bucket_description = get_metadata_property(bucket.get('metadata', []), MetadataField.DESCRIPTION) - if ( - spec.match_patterns(bucket_id) - or spec.match_patterns(bucket_name) - or spec.match_patterns(bucket_display_name) - or spec.match_patterns(bucket_description) - ): + if matches := spec.match_texts([bucket_id, bucket_name, bucket_display_name, bucket_description]): hits.append( SearchHit( bucket_id=bucket_id, @@ -300,7 +380,7 @@ async def _fetch_buckets(client: KeboolaClient, spec: SearchSpec) -> list[Search name=bucket_name, 
display_name=bucket_display_name, description=bucket_description, - ) + ).set_matches(matches) ) return hits @@ -321,13 +401,9 @@ async def _fetch_tables(client: KeboolaClient, spec: SearchSpec) -> list[SearchH table_display_name = table.get('displayName') table_description = get_metadata_property(table.get('metadata', []), MetadataField.DESCRIPTION) - if ( - spec.match_patterns(table_id) - or spec.match_patterns(table_name) - or spec.match_patterns(table_display_name) - or spec.match_patterns(table_description) - or _check_column_match(table, spec) - ): + matches = spec.match_texts([table_id, table_name, table_display_name, table_description]) + matches.extend(_check_column_match(table, spec)) + if matches: hits.append( SearchHit( table_id=table_id, @@ -336,7 +412,7 @@ async def _fetch_tables(client: KeboolaClient, spec: SearchSpec) -> list[SearchH name=table_name, display_name=table_display_name, description=table_description, - ) + ).set_matches(matches) ) return hits @@ -361,19 +437,42 @@ async def _fetch_configs( client: KeboolaClient, spec: SearchSpec, component_type: str | None = None ) -> AsyncGenerator[SearchHit, None]: components = await client.storage_client.component_list(component_type, include=['configuration', 'rows']) + + allowed_transformations = 'transformation' in spec.item_types or component_type is None + allowed_components = ( + 'configuration' in spec.item_types or 'configuration-row' in spec.item_types or component_type is None + ) + allowed_flows = 'flow' in spec.item_types or component_type is None + allowed_workspaces = 'workspace' in spec.item_types or component_type is None + allowed_data_apps = 'data-app' in spec.item_types or component_type is None + for component in components: if not (component_id := component.get('id')): continue current_component_type = component.get('type') if component_id in [ORCHESTRATOR_COMPONENT_ID, CONDITIONAL_FLOW_COMPONENT_ID]: - item_type = 'flow' + item_type: SearchItemType = 'flow' + if not 
allowed_flows: + continue elif current_component_type == 'transformation': - item_type = 'transformation' + item_type: SearchItemType = 'transformation' + if not allowed_transformations: + continue elif component_id == 'keboola.sandboxes': - item_type = 'workspace' + item_type: SearchItemType = 'workspace' + if not allowed_workspaces: + continue + elif component_id == DATA_APP_COMPONENT_ID: + item_type: SearchItemType = 'data-app' + if not allowed_data_apps: + continue + elif current_component_type in ['extractor', 'writer', 'application']: + item_type: SearchItemType = 'configuration' + if not allowed_components: + continue else: - item_type = 'configuration' + item_type: SearchItemType = 'configuration' for config in component.get('configurations', []): if not (config_id := config.get('id')): @@ -384,11 +483,7 @@ async def _fetch_configs( config_updated = _get_field_value(config, ['currentVersion.created', 'created']) or '' if spec.search_type == 'textual': - if ( - spec.match_patterns(config_id) - or spec.match_patterns(config_name) - or spec.match_patterns(config_description) - ): + if matches := spec.match_texts([config_id, config_name, config_description]): yield SearchHit( component_id=component_id, configuration_id=config_id, @@ -396,7 +491,7 @@ async def _fetch_configs( updated=config_updated, name=config_name, description=config_description, - ) + ).set_matches(matches) elif spec.search_type == 'config-based': if matches := spec.match_configuration_scopes(config.get('configuration')): yield SearchHit( @@ -406,7 +501,7 @@ async def _fetch_configs( updated=config_updated, name=config_name, description=config_description, - ).with_matches(matches) + ).set_matches(matches) for row in config.get('rows', []): if not (row_id := row.get('id')): @@ -416,11 +511,7 @@ async def _fetch_configs( row_description = row.get('description') if spec.search_type == 'textual': - if ( - spec.match_patterns(row_id) - or spec.match_patterns(row_name) - or 
spec.match_patterns(row_description) - ): + if matches := spec.match_texts([row_id, row_name, row_description]): yield SearchHit( component_id=component_id, configuration_id=config_id, @@ -429,7 +520,7 @@ async def _fetch_configs( updated=config_updated or _get_field_value(row, ['created']), name=row_name, description=row_description, - ) + ).set_matches(matches) elif spec.search_type == 'config-based': if matches := spec.match_configuration_scopes(row.get('configuration')): @@ -441,7 +532,7 @@ async def _fetch_configs( updated=config_updated or _get_field_value(row, ['created']), name=row_name, description=row_description, - ).with_matches(matches) + ).set_matches(matches) @tool_errors() @@ -450,20 +541,45 @@ async def search( patterns: Annotated[ list[str], Field( - description='One or more search patterns to match against item ID, name, display name, or description. ' - 'Supports regex patterns. Case-insensitive. Examples: ["customer"], ["sales", "revenue"], ' - '["test.*table"]. Do not use empty strings or empty lists.' + description='One or more search patterns to match against item ID, name, display name, description, ' + 'or configuration JSON objects. Case-insensitive by default. ' + 'Examples: ["customer"], ["sales", "revenue"], ["my_bucket"]. ' + 'Do not use empty strings or empty lists.' ), ], item_types: Annotated[ - Sequence[ItemType], + Sequence[SearchItemType], Field( - description='Optional filter for specific Keboola item types. Leave empty to search all types. ' + description='Filter for specific Keboola item types. ' 'Common values: "table" (data tables), "bucket" (table containers), "transformation" ' - '(SQL/Python transformations), "configuration" (extractor/writer configs), "flow" (orchestration flows). ' - "Use when you know what type of item you're looking for." + '(SQL/Python transformations), "component" (extractor/writer/application components), ' + '"data-app" (data apps), "flow" (orchestration flows). 
' + "Use when you know what type of item you're looking for or leave empty to search all types." + ), + ] = tuple(), + search_type: Annotated[ + SearchType, + Field( + description='Search mode: "textual" (name/id/description) or "config-based" (stringified configuration ' + 'payloads). (default: "textual")' + ), + ] = 'textual', + scopes: Annotated[ + Sequence[str], + Field( + description='JSONPath expressions to narrow config-based search to specific parts of the configuration. ' + 'Simple dot-notation (e.g. "parameters", "storage.input") and full JSONPath (e.g. "$.tasks[*]") are both ' + 'supported (e.g. "parameters.host", "storage.input[0].source"). ' + 'Leave empty to search the whole configuration.' ), ] = tuple(), + mode: Annotated[ + SearchPatternMode, + Field( + description='How to interpret patterns: "regex" for regular expressions or "literal" for exact text ' + '(default: "literal").' + ), + ] = 'literal', limit: Annotated[ int, Field( @@ -474,63 +590,132 @@ async def search( offset: Annotated[int, Field(description='Number of matching items to skip for pagination (default: 0).')] = 0, ) -> list[SearchHit]: """ - Searches for Keboola items (tables, buckets, configurations, transformations, flows, etc.) in the current project - by matching patterns against item ID, name, display name, or description. Returns matching items grouped by type - with their IDs and metadata. + Searches for Keboola items (tables, buckets, components, configurations, transformations, flows, data-apps, etc.) + in the current project and returns matching ID + metadata. + + This tool supports two complementary search types: + + 1) textual + - Searches item metadata fields by matching patterns against id, name, displayName, and description. + - For tables, also searches column names and column descriptions. 
+ + 2) config-based + - Searches item configurations (JSON objects) by matching patterns against the configuration values converted + to a string, optionally narrowed by JSON path `scopes`. + - Also returns `match_scopes` with JSON paths and matched patterns per scope. THIS IS THE PRIMARY DISCOVERY TOOL. Always use it BEFORE any get_* tool when you need to find items - by name. Do NOT enumerate items with get_buckets, get_tables, get_configs, get_flows, or get_data_apps - just to locate a specific item — use this tool instead. + by name or specific configuration content. Do NOT enumerate items with get_buckets, get_tables, get_configs, + get_flows, or get_data_apps just to locate a specific item — use this tool instead. WHEN TO USE: - - User asks to "find", "locate", or "search for" something by name, keyword, or text pattern + - User asks to "find", "locate", or "search for" something by name, keyword, text pattern, configuration content, or + value - User mentions a partial name and you need to find the full item (e.g., "find the customer table") - User asks "what tables/configs/flows do I have with X in the name?" - You need to discover items before performing operations on them - - User asks to "list all items with [name] in it" + - User asks to "list all items with [name] or [configuration value/part] in it" + - User asks where a value, table, component, specific configuration ID, or specific setting is used in components, + data-apps, flows, or transformations + - You need to trace lineage by searching for IDs referenced in configurations, or to find flows using a + specific component, or find usage of a bucket/table in transformations or components, or to find items with + specific parameters. + - User asks "what is the genesis of this item?" or "explain the business logic of this item"
HOW IT WORKS: - Searches by regex pattern matching against id, name, displayName, and description fields - For tables, also searches column names and column descriptions - Case-insensitive search + - Supports two search types: + - search_type="textual": matches against id, name, displayName, and description; for tables, also column names + and column descriptions + - search_type="config-based": matches inside configuration JSON objects, optionally narrowed by JSON path `scopes` + - Case-insensitive search + - Pattern mode: `literal` (default) or `regex` - Multiple patterns work as OR condition - matches items containing ANY of the patterns - - Returns grouped results by item type (tables, buckets, configurations, flows, etc.) - Each result includes the item's ID, name, creation date, and relevant metadata + - Scopes (config-based) narrow matching to specific JSONPath areas within configurations; matching is performed + against the stringified JSON node content in those areas. + - Config-based search always returns all matched paths per item in `match_scopes` (including matched patterns) IMPORTANT: - Always use this tool when the user mentions a name but you don't have the exact ID - The search returns IDs that you can use with other tools (e.g., get_tables, get_configs, get_flows) - Results are ordered by update time. The most recently updated items are returned first. + - Specify `item_types` to make the search more efficient when you know the item type; scanning buckets and tables can + be expensive - For exact ID lookups, use specific tools like get_tables, get_configs, get_flows instead + - Use specific `scopes` only when you know the config structure (schema or real example); otherwise run config-based + search without scopes. - Use find_component_id and get_configs tools to find configurations related to a specific component - If results are too numerous or empty, ask the user to refine their query rather than enumerating all items. 
USAGE EXAMPLES: + 1) textual search examples: - user_input: "Find all tables with 'customer' in the name" - → patterns=["customer"], item_types=["table"] - → Returns all tables whose id, name, displayName, or description contains "customer" + → patterns=["customer"], item_types=["table"] + → Returns all tables whose id, name, displayName, or description contains "customer" - user_input: "Find tables with 'email' column" - → patterns=["email"], item_types=["table"] - → Returns all tables that have a column named "email" or with "email" in column description + → patterns=["email"], item_types=["table"] + → Returns all tables that have a column named "email" or with "email" in column description - user_input: "Search for the sales transformation" - → patterns=["sales"], item_types=["transformation"] - → Returns transformations with "sales" in any searchable field + → patterns=["sales"], item_types=["transformation"] + → Returns transformations with "sales" in any searchable field - user_input: "Find items named 'daily report' or 'weekly summary'" - → patterns=["daily.*report", "weekly.*summary"], item_types=[] - → Returns all items matching any of these patterns + → patterns=["daily.*report", "weekly.*summary"], item_types=[], mode="regex" + → Returns all items matching any of these patterns - user_input: "Show me all configurations related to Google Analytics" - → patterns=["google.*analytics"], item_types=["configuration"] - → Returns configurations with matching patterns + → patterns=["google.*analytics"], item_types=["configuration"], mode="regex" + → Returns configurations with matching patterns + + 2) config-based search examples: + - user_input: "Find transformations/configs/components referencing table in.c-prod.customers" + -> patterns=["in.c-prod.customers"], item_types=["transformation", "configuration"], + search_type="config-based" + -> No scopes = search whole stringified config; result includes `match_scopes` with exact paths + patterns + + - 
user_input: "Find configurations/transformations (etc.) using specific setting / id anywhere" + -> patterns=["setting", "id"], item_types=["configuration", "transformation"], search_type="config-based" + + - user_input: "Find configurations/transformations (etc.) using specific setting / id in parameters" + -> patterns=["setting", "id"], item_types=["configuration", "transformation"], search_type="config-based", + scopes=["parameters"] + + - user_input: "Find configurations/transformations (etc.) using specific setting / id in storage" + -> patterns=["setting", "id"], item_types=["configuration", "transformation"], search_type="config-based", + scopes=["storage"] + + - user_input: "Find configurations/transformations (etc.) using specific setting / id in authorization" + -> patterns=["setting", "id"], item_types=["configuration", "transformation"], search_type="config-based", + scopes=["parameters.authorization", "authorization"] + + - user_input: "Find components/transformations using my_bucket in input or output mappings" + -> patterns=["my_bucket"], item_types=["configuration", "transformation"], search_type="config-based", + scopes=["storage.input", "storage.output"] + -> Returns matches with paths like `storage.input.tables[0].source`, `storage.input.files[0].source`, + or `storage.output.tables[0].destination` + + - user_input: "Find flows using configuration ID 01k9cz233cvd1rga3zzx40g8qj" + -> patterns=["01k9cz233cvd1rga3zzx40g8qj"], item_types=["flow"], search_type="config-based", + scopes=["tasks", "phases"] + + - user_input: "Find transformations using this table / column / specific code in their scripts" + -> patterns=["element"], item_types=["transformation"], search_type="config-based", + scopes=["parameters", "storage"] + + - user_input: "Find data apps using something in their config / Python code / settings" + -> patterns=["something"], item_types=["data-app"], search_type="config-based" + -> Returns data apps where script/config sections contain the
keyword and includes `match_scopes` """ spec = SearchSpec( patterns=patterns, item_types=item_types, - search_type='textual', + pattern_mode=mode, + search_type=search_type, + search_scopes=scopes, + return_all_matched_patterns=(search_type == 'config-based'), ) offset = max(0, offset) @@ -563,6 +748,7 @@ async def search( 'flow', 'configuration-row', 'workspace', + 'data-app', }: tasks.append(fetch_configurations(client, spec)) @@ -578,10 +764,6 @@ async def search( else: all_hits.extend(result) - # Filter by item_types if specified - if types_to_fetch: - all_hits = [item for item in all_hits if item.item_type in types_to_fetch] - # TODO: Should we sort by the item type too? all_hits.sort( key=lambda x: ( diff --git a/src/keboola_mcp_server/tools/storage/tools.py b/src/keboola_mcp_server/tools/storage/tools.py index 94bd5179..31e4e421 100644 --- a/src/keboola_mcp_server/tools/storage/tools.py +++ b/src/keboola_mcp_server/tools/storage/tools.py @@ -639,7 +639,7 @@ async def _fetch_table_detail(_table_id: str) -> TableDetail | str: usage_by_ids = await find_id_usage( client, list(prod_ids_to_ids.keys()), - ['component', 'transformation'], + ['configuration', 'configuration-row', 'transformation'], ['storage.input', 'storage.output'], ) # Initialize the used_by list for all tables to avoid None values which could confuse the model. 
diff --git a/src/keboola_mcp_server/tools/storage/usage.py b/src/keboola_mcp_server/tools/storage/usage.py index 83e2e201..01fb399c 100644 --- a/src/keboola_mcp_server/tools/storage/usage.py +++ b/src/keboola_mcp_server/tools/storage/usage.py @@ -67,7 +67,7 @@ async def find_id_usage( # group usage references by pattern = target_id output: dict[str, list[ComponentUsageReference]] = defaultdict(list) for search_hit in search_hits: - for match in search_hit._matches: + for match in search_hit.matches: for target_id in match.patterns: output[target_id].append( # TODO: Consider whether adding configuration description is useful, it could overload context. diff --git a/tests/tools/storage/test_usage.py b/tests/tools/storage/test_usage.py index dbe7c173..fe7d883e 100644 --- a/tests/tools/storage/test_usage.py +++ b/tests/tools/storage/test_usage.py @@ -24,14 +24,14 @@ def _sorted_usage(output: Sequence[storage_usage.UsageById]) -> list[storage_usa item_type='configuration', updated='2024-01-01T00:00:00Z', name='Config 1', - ).with_matches([PatternMatch(scope='storage.input', patterns=['id-1', 'id-2'])]), + ).set_matches([PatternMatch(scope='storage.input', patterns=['id-1', 'id-2'])]), SearchHit( component_id='keboola.ex-db', configuration_id='cfg-2', item_type='configuration', updated='2024-01-02T00:00:00Z', name='Config 2', - ).with_matches([PatternMatch(scope='storage.output', patterns=['id-1'])]), + ).set_matches([PatternMatch(scope='storage.output', patterns=['id-1'])]), ], { 'id-1': [ diff --git a/tests/tools/test_search.py b/tests/tools/test_search.py index 3369aa1c..91d6b52c 100644 --- a/tests/tools/test_search.py +++ b/tests/tools/test_search.py @@ -9,11 +9,11 @@ from keboola_mcp_server.clients.ai_service import ComponentSuggestionResponse, SuggestedComponent from keboola_mcp_server.clients.base import JsonDict from keboola_mcp_server.clients.client import KeboolaClient -from keboola_mcp_server.clients.storage import ItemType from keboola_mcp_server.config 
import MetadataField from keboola_mcp_server.links import Link from keboola_mcp_server.tools.search import ( SearchHit, + SearchItemType, SearchSpec, SuggestedComponentOutput, find_component_id, @@ -83,7 +83,7 @@ def component_list_side_effect(component_type, include=None): result = await search( ctx=mcp_context_client, patterns=['test'], - item_types=(cast(ItemType, 'table'), cast(ItemType, 'configuration')), + item_types=(cast(SearchItemType, 'table'), cast(SearchItemType, 'configuration')), limit=20, offset=0, ) @@ -144,7 +144,12 @@ async def test_search_with_regex_pattern(self, mocker: MockerFixture, mcp_contex keboola_client.storage_client.component_list = mocker.AsyncMock(return_value=[]) keboola_client.storage_client.workspace_list = mocker.AsyncMock(return_value=[]) - result = await search(ctx=mcp_context_client, patterns=['customer.*'], item_types=(cast(ItemType, 'bucket'),)) + result = await search( + ctx=mcp_context_client, + patterns=['customer.*'], + item_types=(cast(SearchItemType, 'bucket'),), + mode='regex', + ) assert isinstance(result, list) assert result == [ @@ -361,7 +366,7 @@ async def test_search_matches_description(self, mocker: MockerFixture, mcp_conte keboola_client.storage_client.component_list = mocker.AsyncMock(return_value=[]) keboola_client.storage_client.workspace_list = mocker.AsyncMock(return_value=[]) - result = await search(ctx=mcp_context_client, patterns=['test'], item_types=(cast(ItemType, 'bucket'),)) + result = await search(ctx=mcp_context_client, patterns=['test'], item_types=(cast(SearchItemType, 'bucket'),)) assert isinstance(result, list) assert result == [ @@ -680,13 +685,241 @@ async def test_search_table_by_columns( # Mock bucket_table_list with provided test data keboola_client.storage_client.bucket_table_list = mocker.AsyncMock(return_value=tables_data) - result = await search(ctx=mcp_context_client, patterns=[search_pattern], item_types=(cast(ItemType, 'table'),)) + result = await search( + ctx=mcp_context_client, 
patterns=[search_pattern], item_types=(cast(SearchItemType, 'table'),) + ) assert isinstance(result, list) assert len(result) == expected_count if expected_count > 0: assert result[0].table_id == expected_first_table_id + @pytest.mark.asyncio + @pytest.mark.parametrize( + ( + 'patterns', + 'scopes', + 'component_configurations', + 'expected_hits', + ), + [ + ( + ['alpha', 'beta'], + ('parameters', 'storage.input'), + [ + { + 'id': 'test-config', + 'name': 'Test Config', + 'created': '2024-01-02T00:00:00Z', + 'configuration': { + 'parameters': {'query': 'alpha'}, + 'storage': {'input': [{'source': 'beta'}]}, + }, + 'rows': [], + } + ], + [ + ( + 'test-config', + [ + {'scope': 'parameters.query', 'patterns': ['alpha']}, + {'scope': 'storage.input[0].source', 'patterns': ['beta']}, + ], + ) + ], + ), + ( + ['gamma'], + tuple(), + [ + { + 'id': 'test-config', + 'name': 'Test Config', + 'created': '2024-01-02T00:00:00Z', + 'configuration': { + 'parameters': {'query': 'alpha'}, + 'storage': { + 'input': [{'source': 'beta'}, {'source': 'gamma'}], + 'output': [{'destination': 'gamma'}], + }, + }, + 'rows': [], + } + ], + [ + ( + 'test-config', + [ + {'scope': 'storage.input[1].source', 'patterns': ['gamma']}, + {'scope': 'storage.output[0].destination', 'patterns': ['gamma']}, + ], + ) + ], + ), + ( + ['alpha'], + ('parameters',), + [ + { + 'id': 'test-config', + 'name': 'Test Config', + 'created': '2024-01-02T00:00:00Z', + 'configuration': { + 'parameters': {'query': 'alpha'}, + 'storage': {'input': [{'source': 'alpha'}]}, + }, + 'rows': [], + } + ], + [('test-config', [{'scope': 'parameters.query', 'patterns': ['alpha']}])], + ), + ( + ['alpha'], + ('authorization.#apiKey',), + [ + { + 'id': 'test-config', + 'name': 'Test Config', + 'created': '2024-01-02T00:00:00Z', + 'configuration': { + 'authorization': {'#apiKey': 'alpha'}, + 'parameters': {'query': 'nomatch'}, + }, + 'rows': [], + } + ], + [('test-config', [{'scope': 'authorization.#apiKey', 'patterns': 
['alpha']}])], + ), + ( + ['alpha', 'beta'], + ('parameters',), + [ + { + 'id': 'test-config', + 'name': 'Test Config', + 'created': '2024-01-02T00:00:00Z', + 'configuration': { + 'parameters': {'query': 'alpha beta', 'query2': 'beta'}, + }, + 'rows': [], + } + ], + [ + ( + 'test-config', + [ + {'scope': 'parameters.query', 'patterns': ['alpha', 'beta']}, + {'scope': 'parameters.query2', 'patterns': ['beta']}, + ], + ) + ], + ), + ( + ['alpha', 'gamma'], + tuple(), + [ + { + 'id': 'test-config-a', + 'name': 'Test Config A', + 'created': '2024-01-02T00:00:00Z', + 'configuration': { + 'parameters': {'query': 'alpha'}, + 'storage': {'input': [{'source': 'beta'}]}, + }, + 'rows': [], + }, + { + 'id': 'test-config-b', + 'name': 'Test Config B', + 'created': '2024-01-03T00:00:00Z', + 'configuration': { + 'storage': {'output': [{'destination': 'gamma'}]}, + }, + 'rows': [], + }, + { + 'id': 'test-config-c', + 'name': 'Test Config C', + 'created': '2024-01-01T00:00:00Z', + 'configuration': { + 'parameters': {'query': 'nomatch'}, + }, + 'rows': [], + }, + ], + [ + ('test-config-b', [{'scope': 'storage.output[0].destination', 'patterns': ['gamma']}]), + ('test-config-a', [{'scope': 'parameters.query', 'patterns': ['alpha']}]), + ], + ), + ], + ids=[ + 'all_matches_in_scopes', + 'most_specific_scope_only', + 'scope_constrains_same_value_in_other_path', + 'hash_prefixed_scope_key_in_search_tool', + 'group_two_patterns_in_one_scope', + 'multiple_configurations_returned', + ], + ) + async def test_search_config_based_match_scopes( + self, + mocker: MockerFixture, + mcp_context_client: Context, + patterns: list[str], + scopes: tuple[str, ...], + component_configurations: list[dict[str, Any]], + expected_hits: list[tuple[str, list[dict[str, Any]]]], + ): + keboola_client = KeboolaClient.from_state(mcp_context_client.session.state) + + keboola_client.storage_client.bucket_list = mocker.AsyncMock(return_value=[]) + keboola_client.storage_client.bucket_table_list = 
mocker.AsyncMock(return_value=[]) + keboola_client.storage_client.component_list = mocker.AsyncMock( + side_effect=lambda component_type, include=None: ( + [ + { + 'id': 'keboola.ex-db-mysql', + 'type': 'extractor', + 'configurations': component_configurations, + } + ] + if component_type == 'extractor' + else [] + ) + ) + keboola_client.storage_client.workspace_list = mocker.AsyncMock(return_value=[]) + + result = await search( + ctx=mcp_context_client, + patterns=patterns, + item_types=(cast(SearchItemType, 'configuration'),), + search_type='config-based', + scopes=scopes, + ) + + normalized_actual = [ + ( + hit.configuration_id, + sorted( + ({'scope': m.scope, 'patterns': sorted(m.patterns)} for m in hit.matches), + key=lambda x: x['scope'] or '', + ), + ) + for hit in result + ] + normalized_expected = [ + ( + config_id, + sorted( + ({'scope': m['scope'], 'patterns': sorted(m['patterns'])} for m in matches), + key=lambda x: x['scope'] or '', + ), + ) + for config_id, matches in expected_hits + ] + assert normalized_actual == normalized_expected + @pytest.mark.parametrize( ('spec_kwargs', 'texts', 'expected'), @@ -767,6 +1000,7 @@ def test_match_texts(spec_kwargs: dict[str, Any], texts: list[str], expected: li ('spec_kwargs', 'configuration', 'expected'), [ ( + # Scopes provided; each scope has one matching leaf – returns the exact leaf path. { 'patterns': ['alpha', 'beta'], 'item_types': ('configuration',), @@ -778,11 +1012,12 @@ def test_match_texts(spec_kwargs: dict[str, Any], texts: list[str], expected: li 'storage': {'input': [{'source': 'beta'}], 'output': [{'destination': 'gamma'}]}, }, [ - {'scope': 'parameters', 'patterns': ['alpha']}, - {'scope': 'storage.input', 'patterns': ['beta']}, + {'scope': 'parameters.query', 'patterns': ['alpha']}, + {'scope': 'storage.input[0].source', 'patterns': ['beta']}, ], ), ( + # Both patterns match across two leaves inside the same scope; each leaf gets its own entry. 
{ 'patterns': ['alpha', 'beta'], 'item_types': ('configuration',), @@ -794,11 +1029,13 @@ def test_match_texts(spec_kwargs: dict[str, Any], texts: list[str], expected: li 'storage': {'input': [{'source': 'beta'}, {'source': 'alpha'}], 'output': [{'destination': 'gamma'}]}, }, [ - {'scope': 'parameters', 'patterns': ['alpha']}, - {'scope': 'storage.input', 'patterns': ['alpha', 'beta']}, + {'scope': 'parameters.query', 'patterns': ['alpha']}, + {'scope': 'storage.input[0].source', 'patterns': ['beta']}, + {'scope': 'storage.input[1].source', 'patterns': ['alpha']}, ], ), ( + # Pattern not present in any of the specified scopes → empty result. { 'patterns': ['gamma'], 'item_types': ('configuration',), @@ -812,6 +1049,7 @@ def test_match_texts(spec_kwargs: dict[str, Any], texts: list[str], expected: li [], ), ( + # No scopes → walk the whole config; can match parent nodes containing the searched fragment. { 'patterns': ['gamma'], 'item_types': ('configuration',), @@ -821,9 +1059,14 @@ def test_match_texts(spec_kwargs: dict[str, Any], texts: list[str], expected: li 'parameters': {'query': 'alpha'}, 'storage': {'input': [{'source': 'beta'}], 'output': [{'destination': 'gamma'}]}, }, - [{'scope': None, 'patterns': ['gamma']}], + [ + {'scope': 'storage', 'patterns': ['gamma']}, + {'scope': 'storage.output', 'patterns': ['gamma']}, + {'scope': 'storage.output[0].destination', 'patterns': ['gamma']}, + ], ), ( + # return_all_matched_patterns=False → stop after first matching leaf. { 'patterns': ['alpha', 'beta'], 'item_types': ('configuration',), @@ -834,7 +1077,40 @@ def test_match_texts(spec_kwargs: dict[str, Any], texts: list[str], expected: li 'parameters': {'query': 'alpha'}, 'storage': {'input': [{'source': 'beta'}], 'output': [{'destination': 'gamma'}]}, }, - [{'scope': 'parameters', 'patterns': ['alpha']}], + [{'scope': 'parameters.query', 'patterns': ['alpha']}], + ), + ( + # Overlapping scopes should not return duplicate leaf hits. 
+ { + 'patterns': ['alpha'], + 'item_types': ('configuration',), + 'search_scopes': ('parameters', 'parameters.query'), + 'return_all_matched_patterns': True, + }, + {'parameters': {'query': 'alpha'}}, + [{'scope': 'parameters.query', 'patterns': ['alpha']}], + ), + ( + # Scope pointing directly to scalar should still match (self-scope fallback). + { + 'patterns': ['wttr.in'], + 'item_types': ('configuration',), + 'search_scopes': ('parameters.api.baseUrl',), + 'return_all_matched_patterns': True, + }, + {'parameters': {'api': {'baseUrl': 'https://wttr.in'}}}, + [{'scope': 'parameters.api.baseUrl', 'patterns': ['wttr.in']}], + ), + ( + # Scope with #-prefixed key should be normalized and parsed correctly. + { + 'patterns': ['alpha'], + 'item_types': ('configuration',), + 'search_scopes': ('authorization.#apiKey',), + 'return_all_matched_patterns': True, + }, + {'authorization': {'#apiKey': 'alpha'}}, + [{'scope': 'authorization.#apiKey', 'patterns': ['alpha']}], ), ], ids=[ @@ -843,6 +1119,9 @@ def test_match_texts(spec_kwargs: dict[str, Any], texts: list[str], expected: li 'no_patterns_in_scope', 'all_patterns_no_scope', 'any_patterns_return_first_match', + 'overlapping_scopes_deduplicated', + 'scalar_scope_matches_self', + 'hash_prefixed_scope_key_matches', ], ) def test_match_configuration_scopes(spec_kwargs: dict[str, Any], configuration: dict[str, Any], expected: list[dict]): diff --git a/uv.lock b/uv.lock index a763c8f2..00d22dec 100644 --- a/uv.lock +++ b/uv.lock @@ -1223,7 +1223,7 @@ wheels = [ [[package]] name = "keboola-mcp-server" -version = "1.44.8" +version = "1.45.0" source = { editable = "." } dependencies = [ { name = "cryptography" },
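The leaf-path semantics that the new `test_match_configuration_scopes` cases exercise — reporting the most specific JSON path for every case-insensitive literal hit, such as `parameters.query` or `storage.input[0].source` — can be sketched roughly as follows. This is an illustrative re-implementation under assumed behavior, not the actual `SearchSpec.match_configuration_scopes` code; `match_leaves` is a hypothetical helper:

```python
# Illustrative sketch (hypothetical helper, not the real SearchSpec code):
# walk a configuration JSON object and yield (scope, matched_patterns) for
# every scalar leaf whose string form contains any pattern, case-insensitively.

def match_leaves(config, patterns, prefix=''):
    """Yield (leaf_path, matched_patterns) pairs for matching scalar leaves."""
    if isinstance(config, dict):
        for key, value in config.items():
            path = f'{prefix}.{key}' if prefix else key
            yield from match_leaves(value, patterns, path)
    elif isinstance(config, (list, tuple)):
        for i, value in enumerate(config):
            # List elements get bracketed indices, e.g. storage.input[0]
            yield from match_leaves(value, patterns, f'{prefix}[{i}]')
    else:
        text = str(config).lower()
        hits = [p for p in patterns if p.lower() in text]
        if hits:
            yield prefix, hits


config = {
    'parameters': {'query': 'alpha'},
    'storage': {'input': [{'source': 'beta'}, {'source': 'alpha'}]},
}
print(list(match_leaves(config, ['alpha', 'beta'])))
# -> [('parameters.query', ['alpha']),
#     ('storage.input[0].source', ['beta']),
#     ('storage.input[1].source', ['alpha'])]
```

Grouping all matched patterns per leaf path, as above, mirrors the shape of the `match_scopes` entries asserted in the tests (one entry per leaf, each listing every pattern that hit it).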