Conversation
## Out of scope (for this skill)

- Performance Advisor / slow query logs
@davidjahn Any particular reason to exclude Performance Advisor from this skill? It's already possible to access PA recommendations from the MCP server if the user has configured MCP with an Atlas service account
Main reason is I'm having trouble getting Performance Advisor access to work w/ the MongoDB MCP server.
I'm planning to add Atlas Performance Advisor support in a fast followup PR. Wanted to start with something simple that is working
3. **Match or suggest an index**
   - **If an existing index can support the query**
   - Prefer equality on indexed fields, then range, then sort; compound indexes follow left-prefix.
you mean then sort and then range, right? (absent other info)
yes will fix, planning to finalize this initial PR (for only index suggestion support) today
3. **Match or suggest an index**
   - **If an existing index can support the query**
   - Prefer equality on indexed fields, then range, then sort; compound indexes follow left-prefix.
   - Shape the query so the planner can use that index (e.g. filter/sort order consistent with the index key).
sort order, yes but filter order doesn't matter...
yes will fix this
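For reference, the ordering the thread settles on (equality first, then sort, then range — the "ESR" rule) can be sketched as a small helper. This is an illustrative sketch, not a MongoDB API, and the field names (`status`, `createdAt`, `amount`) are hypothetical:

```javascript
// Sketch: build a compound index key following the ESR rule —
// Equality fields first, then Sort fields, then Range fields.
function esrIndexKey(equalityFields, sortSpec, rangeFields) {
  const key = {};
  for (const f of equalityFields) key[f] = 1;                    // equality predicates first
  for (const [f, dir] of Object.entries(sortSpec)) key[f] = dir; // sort fields, matching direction
  for (const f of rangeFields) key[f] = 1;                       // range predicates last
  return key;
}

// For db.orders.find({ status: "open", amount: { $gte: 100 } }).sort({ createdAt: -1 }):
const key = esrIndexKey(["status"], { createdAt: -1 }, ["amount"]);
// key → { status: 1, createdAt: -1, amount: 1 }
```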
- **Server**: `user-MongoDB`
- **Arguments**: `database` (string), `collection` (string)
- **Use**: Call after you know the target db and collection from the user's query or code. Use the returned `classicIndexes[].key` to see index key patterns; ignore search indexes unless the query is a search query.
what about various improvements for aggregations or find based on the actual query - things like suggesting missing projections for find or pointing out inefficiencies in aggregation pipelines? sort without limit, that sort of thing?
I'm limiting the scope of this initial PR to only suggesting indexes on queries.
I will add query structure suggestions in a followup PR this week if I have time (looking like I probably will have time to do this before EOW)
Pull request overview
Adds a new mongodb-query-optimizer agent skill intended to trigger only on explicit performance/indexing requests, and provides reference material + a small eval workspace to validate correct invocation/boundary behavior.
Changes:
- Introduces `skills/mongodb-query-optimizer/SKILL.md` with MCP branching guidance for `collection-indexes`, `explain`, and Atlas Performance Advisor.
- Adds optimization reference docs (aggregation, explain interpretation, covered queries, multikey/arrays).
- Adds a testing workspace + eval registry with iteration-1 cases for invocation/boundary checks.
Reviewed changes
Copilot reviewed 20 out of 20 changed files in this pull request and generated 22 comments.
| File | Description |
|---|---|
| `skills/mongodb-query-optimizer/SKILL.md` | New skill definition and workflow (invocation rules, MCP tool usage, index principles). |
| `skills/mongodb-query-optimizer/references/aggregation-optimization.md` | New aggregation optimization reference. |
| `skills/mongodb-query-optimizer/references/covered-queries.md` | New covered-query reference. |
| `skills/mongodb-query-optimizer/references/explain-interpretation.md` | New explain-output interpretation reference. |
| `skills/mongodb-query-optimizer/references/multikey-arrays.md` | New multikey/array query reference. |
| `testing/mongodb-query-optimizer/evals/evals.json` | Eval registry describing the iteration-1 cases. |
| `testing/mongodb-query-optimizer/mongodb-query-optimizer-workspace/README.md` | Documentation for how to run the workspace evals. |
| `testing/mongodb-query-optimizer/mongodb-query-optimizer-workspace/iteration-1/benchmark.json` | Benchmark metadata for iteration-1. |
| `testing/mongodb-query-optimizer/mongodb-query-optimizer-workspace/iteration-1/why-slow/eval_metadata.json` | Case metadata for a “why is this slow” prompt. |
| `testing/mongodb-query-optimizer/mongodb-query-optimizer-workspace/iteration-1/slow-queries-cluster/eval_metadata.json` | Case metadata for “slow queries on my cluster” prompt. |
| `testing/mongodb-query-optimizer/mongodb-query-optimizer-workspace/iteration-1/optimize-this-query/eval_metadata.json` | Case metadata for optimizing a simple find+sort. |
| `testing/mongodb-query-optimizer/mongodb-query-optimizer-workspace/iteration-1/optimize-aggregation-pipeline/eval_metadata.json` | Case metadata for optimizing an aggregation pipeline. |
| `testing/mongodb-query-optimizer/mongodb-query-optimizer-workspace/iteration-1/negative-routine-find/eval_metadata.json` | Negative/boundary case metadata. |
| `testing/mongodb-query-optimizer/mongodb-query-optimizer-workspace/iteration-1/how-to-index/eval_metadata.json` | Case metadata for “how to index this $match” prompt. |
| `testing/mongodb-query-optimizer/mongodb-query-optimizer-workspace/iteration-1/fix-slow-queries/eval_metadata.json` | Case metadata for general slow-query help prompt. |
| `testing/mongodb-query-optimizer/mongodb-query-optimizer-workspace/iteration-1/explain-interpretation/eval_metadata.json` | Case metadata for explain interpretation prompt. |
| `testing/mongodb-query-optimizer/mongodb-query-optimizer-workspace/iteration-1/elemMatch-array-query/eval_metadata.json` | Case metadata for array correctness/perf prompt. |
| `testing/mongodb-query-optimizer/mongodb-query-optimizer-workspace/iteration-1/covered-query-optimization/eval_metadata.json` | Case metadata for covered-query optimization prompt. |
| `testing/mongodb-query-optimizer/mongodb-query-optimizer-workspace/iteration-1/array-field-query/eval_metadata.json` | Case metadata for array-field query optimization prompt. |
| `testing/mongodb-query-optimizer/mongodb-query-optimizer-workspace/iteration-1/aggregation-unwind-performance/eval_metadata.json` | Case metadata for $unwind performance prompt. |
Aggregation stages have a 100MB memory limit per stage (MongoDB 4.2+, configurable with `internalQueryExecMaxBlockingSortBytes`). When exceeded:
- Without `allowDiskUse`: pipeline fails
- With `allowDiskUse: true`: spills to disk (slower but completes)

**Use allowDiskUse when:**
- Processing large datasets that exceed memory limits
- $group or $sort operations on high-cardinality data
- Acceptable to trade speed for completion

**Better solutions:**
- Filter more aggressively early in pipeline
- Add indexes to enable `$sort` to use index order
- Increase available memory (Atlas tier upgrade)
- Consider materialized views for repeated aggregations
internalQueryExecMaxBlockingSortBytes is specifically a blocking sort limit; it doesn’t represent a universal “per stage” aggregation memory limit (e.g., $group has separate limits/knobs). Please adjust this text so it doesn’t point readers at the wrong setting and instead focuses on allowDiskUse + early filtering/indexing.
Suggested change:

Some aggregation operations (such as blocking `$sort` and certain `$group` patterns) have a 100MB in-memory limit per operation. When an operation exceeds its in-memory limit:
- Without `allowDiskUse`: the pipeline (or stage) fails
- With `allowDiskUse: true`: the operation can spill to disk (slower but completes)

**Use allowDiskUse when:**
- Processing large datasets that may exceed in-memory limits
- `$group` or `$sort` operations on high-cardinality or wide documents
- It is acceptable to trade speed for completion

**Better solutions:**
- Filter more aggressively early in the pipeline to reduce the working set
- Add or tune indexes so `$match` and `$sort` can use index order instead of in-memory sorts
- Increase available memory (for example, via an Atlas tier upgrade)
- Consider materialized views or pre-aggregated collections for repeated aggregations
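The "filter early, then allow spilling" advice above can be illustrated with a minimal pipeline shape. This is a sketch; the collection and field names (`orders`, `status`, `region`, `amount`) are hypothetical:

```javascript
// Sketch: filter early so the blocking $group sees a smaller working set,
// and permit spilling for large groups. All names here are hypothetical.
const pipeline = [
  { $match: { status: "shipped" } },                         // early filter: can use an index
  { $group: { _id: "$region", total: { $sum: "$amount" } } } // blocking stage runs on less data
];
// Run with spilling permitted for oversized groups:
// db.orders.aggregate(pipeline, { allowDiskUse: true })

// The shape check reviewers apply: the filter should come before blocking stages.
const filtersEarly = Object.keys(pipeline[0])[0] === "$match";
```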
{
  "id": "covered-query-optimization",
  "prompt": "I want to make this query as fast as possible without reading documents. Query: db.users.find({ status: 'active', plan: 'premium' }, { status: 1, plan: 1, email: 1, _id: 0 }).sort({ email: 1 }). Can this be a covered query?",
  "expect_skill_invocation": true,
The workspace README and testing/mongodb-query-optimizer/evals/evals.json use expect_invoke, but this case metadata uses expect_skill_invocation. If your eval harness keys off expect_invoke, these cases won’t be asserted correctly. Please standardize the field name across the registry + per-case eval_metadata.json (either rename here to expect_invoke or update the registry/docs/harness to match).
Suggested change:

"expect_invoke": true,
- **Invocation**: Cases with `expect_invoke: true` should trigger the mongodb-query-optimizer skill (explicit optimization / performance / slow-query asks).
- **Boundary**: Cases with `expect_invoke: false` should **not** trigger the optimizer when the user only wants a query generated without optimization language.

## Running evals

Use your agent harness or skill-validator flow to:

1. Load each `eval_metadata.json` prompt.
2. Assert the skill is selected when `expect_invoke` is true and not selected when false (or that the optimizer workflow is not applied for negative cases).
This README instructs using expect_invoke, but the per-case eval_metadata.json files use expect_skill_invocation. Please align the README with the actual schema (or rename the metadata field) so the instructions and fixtures are consistent.
Suggested change:

- **Invocation**: Cases with `expect_skill_invocation: true` should trigger the mongodb-query-optimizer skill (explicit optimization / performance / slow-query asks).
- **Boundary**: Cases with `expect_skill_invocation: false` should **not** trigger the optimizer when the user only wants a query generated without optimization language.

## Running evals

Use your agent harness or skill-validator flow to:

1. Load each `eval_metadata.json` prompt.
2. Assert the skill is selected when `expect_skill_invocation` is true and not selected when false (or that the optimizer workflow is not applied for negative cases).
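Until the field name is standardized one way or the other, a small normalization shim could keep a harness working with both spellings. This is a hypothetical helper (only the two field names from this PR are real); fixing the fixtures themselves is still the better outcome:

```javascript
// Sketch: accept both spellings used in this PR, preferring expect_invoke.
// Hypothetical shim — standardizing the fixtures is the real fix.
function normalizeCase(meta) {
  const out = { ...meta };
  if ("expect_skill_invocation" in out && !("expect_invoke" in out)) {
    out.expect_invoke = out.expect_skill_invocation;  // migrate legacy field
    delete out.expect_skill_invocation;
  }
  return out;
}

// normalizeCase({ id: "x", expect_skill_invocation: true })
//   → { id: "x", expect_invoke: true }
```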
4. **No array fields** are indexed (multikey indexes prevent covering)
5. **No sub-document fields** unless the exact sub-document path is indexed
This requirement incorrectly states that multikey indexes prevent covered queries. Multikey indexes can still cover queries under certain constraints (not returning the array field, avoiding $elemMatch, etc.). Please update this requirement so covered-query guidance matches MongoDB behavior.
Suggested change:

4. **Array fields / multikey indexes** are only used in ways that still allow covering (for example, you do **not** project the array field itself and you avoid operators like `$elemMatch` that require materializing array elements)
5. **Sub-document fields** are only used when the exact indexed path(s) provide all queried and returned values
Copilot is correct, original claim is wrong/too general.
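One necessary (not sufficient) condition for covering — every projected field present in the index key, with `_id` either excluded or indexed — can be sketched as a check. This is a hypothetical helper, not a MongoDB API:

```javascript
// Sketch: necessary-condition check for a covered projection — every
// included field must be in the index key, and _id must be excluded
// unless it is indexed. Hypothetical helper, not a MongoDB API.
function projectionCoverableBy(indexKey, projection) {
  const indexed = new Set(Object.keys(indexKey));
  // _id is returned by default; it must be excluded or present in the index
  if (!("_id" in projection) && !indexed.has("_id")) return false;
  for (const [field, included] of Object.entries(projection)) {
    if (field === "_id") {
      if (included && !indexed.has("_id")) return false; // _id: 0 is always fine
      continue;
    }
    if (included && !indexed.has(field)) return false;
  }
  return true;
}

// For the covered-query eval case above, with index { status: 1, plan: 1, email: 1 }:
// projectionCoverableBy({ status: 1, plan: 1, email: 1 },
//                       { status: 1, plan: 1, email: 1, _id: 0 }) → true
```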
**Limitations:**
- Cannot cover queries (must read documents)
- Compound multikey restriction: at most **one** array field per compound index
This bullet says multikey indexes cannot cover queries. MongoDB can produce covered queries with multikey indexes under certain constraints (e.g., avoid projecting the array field and avoid $elemMatch). Please correct/qualify this limitation so the skill’s guidance stays accurate.
{
  "id": "optimize-this-query",
  "prompt": "How do I optimize this MongoDB find? db.orders.find({ status: 'open' }).sort({ createdAt: -1 })",
  "expect_skill_invocation": true,
The workspace README and testing/mongodb-query-optimizer/evals/evals.json use expect_invoke, but this case metadata uses expect_skill_invocation. If your eval harness keys off expect_invoke, these cases won’t be asserted correctly. Please standardize the field name across the registry + per-case eval_metadata.json (either rename here to expect_invoke or update the registry/docs/harness to match).
Suggested change:

"expect_invoke": true,
{
  "id": "optimize-aggregation-pipeline",
  "prompt": "Can you optimize this aggregation pipeline? It's running slow: db.orders.aggregate([{ $lookup: { from: 'customers', localField: 'customerId', foreignField: '_id', as: 'customer' } }, { $match: { status: 'shipped' } }, { $group: { _id: '$customer.region', total: { $sum: '$amount' } } }])",
  "expect_skill_invocation": true,
The workspace README and testing/mongodb-query-optimizer/evals/evals.json use expect_invoke, but this case metadata uses expect_skill_invocation. If your eval harness keys off expect_invoke, these cases won’t be asserted correctly. Please standardize the field name across the registry + per-case eval_metadata.json (either rename here to expect_invoke or update the registry/docs/harness to match).
Suggested change:

"expect_invoke": true,
{
  "id": "elemMatch-array-query",
  "prompt": "Why is my array query returning wrong results? db.orders.find({ 'items.sku': 'ABC123', 'items.quantity': { $gte: 5 } }). Sometimes it matches orders where ABC123 has quantity 2 and a different item has quantity 10.",
  "expect_skill_invocation": true,
The workspace README and testing/mongodb-query-optimizer/evals/evals.json use expect_invoke, but this case metadata uses expect_skill_invocation. If your eval harness keys off expect_invoke, these cases won’t be asserted correctly. Please standardize the field name across the registry + per-case eval_metadata.json (either rename here to expect_invoke or update the registry/docs/harness to match).
Suggested change:

"expect_invoke": true,
- More index entries per document (proportional to array size)
- Larger index size on disk and in memory
- Higher write cost (must update multiple index entries per document)
- Cannot cover queries (MongoDB limitation)
This list of multikey “Costs” says multikey indexes cannot cover queries. MongoDB can produce covered queries using a multikey index under specific constraints (e.g., don’t project the array field and avoid $elemMatch). Please qualify or correct this statement so readers don’t incorrectly rule out covered-query optimizations.
Suggested change:

- Covered queries are limited: multikey indexes can only cover a query when the indexed array field itself is not projected and operators such as `$elemMatch` are avoided
**Requirements:**
- All queried fields are in the index
- All returned fields are in the index (or `_id` only)
- Projection must exclude `_id` unless `_id` is in the index
- No array fields indexed (multikey indexes cannot cover)
This requirement says multikey indexes cannot cover queries. MongoDB can produce covered queries with multikey indexes under specific constraints (e.g., don’t project the array field and avoid $elemMatch). Please correct/qualify this so the skill doesn’t incorrectly block covered-query advice.
asya999
left a comment
I’ll leave more comments later today
## Minimize $unwind usage
That’s not a thing. $unwind is not expensive. However the match before unwind example is good.
ok going to rephrase this here
**Consider alternatives:**
- `$filter` array operator to process arrays without unwinding
This is only appropriate if there is an unwind and then group by _id
ok going to specify that here
```javascript
  { $match: { active: true } },        // Reduce left side
  { $lookup: {
      from: "inventory",
      localField: "product_id",        // Ensure this is indexed
      foreignField: "_id",             // _id is always indexed
      pipeline: [
        { $match: { inStock: true } }, // Reduce right side
        { $project: { name: 1, price: 1 } } // Limit fields
```
Probably can exclude _id?
```javascript
        { $project: { name: 1, price: 1 } } // Limit fields
      ],
      as: "product"
  }}
```
Also I would expect to see $unwind here since there is only one product.
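Applying both comments above (dropping `_id` from the inner projection, and flattening the single-element lookup result with `$unwind`) might look like this. The collection and field names come from the quoted diff; this is a sketch, not the final PR content:

```javascript
// Sketch of the reviewed $lookup with the two suggested tweaks applied:
// exclude _id in the inner $project, and $unwind the single-element result.
const stages = [
  { $match: { active: true } },                        // reduce left side
  { $lookup: {
      from: "inventory",
      localField: "product_id",                        // should be indexed
      foreignField: "_id",                             // _id is always indexed
      pipeline: [
        { $match: { inStock: true } },                 // reduce right side
        { $project: { _id: 0, name: 1, price: 1 } }    // limit fields, drop _id
      ],
      as: "product"
  }},
  { $unwind: "$product" }                              // one product expected → flatten
];
```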
1. **Filter before grouping** to reduce input documents
2. **Use selective _id** for grouping to reduce number of groups
3. **Project only needed fields** before $group to reduce memory per document
That’s wrong. Only needed fields should be kept in group but no projections should be done before group.
```javascript
[
  { $match: { date: { $gte: ISODate("2024-01-01") } } }, // Reduce input
  { $project: { category: 1, amount: 1 } }, // Only needed fields
```
Anti pattern - this is literally an example of wrong pipeline.
This should be example of wrong in the “don’t use $project unnecessarily or too early”
**Causes:**
- No suitable index exists
- Query optimizer chose not to use available index (rare)
- Query shape doesn't match any index prefix
Same as no suitable index.
### SORT (in-memory sort)

**Expensive:** Sorting results in memory after retrieval.
**Why this is costly:**
- Uses memory (32MB limit per sort in MongoDB 3.4+, configurable with `internalQueryExecMaxBlockingSortBytes`)
Remove internal config reference. Limit is 100MBs.
**Why this is costly:**
- Uses memory (32MB limit per sort in MongoDB 3.4+, configurable with `internalQueryExecMaxBlockingSortBytes`)
- Requires buffering all results before returning first document
- Blocking operation (can't stream results)
These two say the same thing.
### SHARDING_FILTER

In sharded clusters, filters out documents not owned by the shard. Normal overhead in sharded environment.
ugh - this has more or less overhead depending on the index used. They should prefer to have an index with shard key fields so the filtering can happen against the index, but I'm not sure if there is a concise way to say it - an example could help.
dacharyc
left a comment
I've got a handful of comments so far from the review-skill analysis/LLM-as-judge scoring - I'll take a deeper look later today for things that the skill won't catch, but wanted to give you this as a starting point.
- **`explain`** — Invoke with server = MongoDB MCP server, toolName = `explain`, arguments = object with `database`, `collection`, `method` (one find/aggregate/count shape), and `verbosity`.
  - **Always load `references/explain-interpretation.md`** when running explain to interpret output correctly
  - **Verbosity selection:**
    - Use `"executionStats"` when query likely completes in <10 seconds (small collections, simple queries, good indexes)
For this path, "query likely completes in <10 seconds" - an agent can't estimate this. Can we replace with concrete criteria (e.g., collection size, index availability)?
shouldn't it just use maxTimeMs or whatever client short-circuit timeout in ALL cases?
- **Running explain()** → **Always** load `references/explain-interpretation.md` when calling the explain MCP tool
- **Aggregation pipelines** → Load `references/aggregation-optimization.md` **first** when dealing with aggregate pipelines
- **Queries on array fields** → Load `references/multikey-arrays.md` when fields have array values
- **Covered query optimization** → Load `references/covered-queries.md` when the query is able to have all fields fully covered
For this line, "when the query is able to have all fields fully covered" - the agent can't determine this pre-analysis. Can we add observable triggers (user asks about covered queries, or explain shows FETCH with eliminable fields)?
| Operation value | Use when |
|----------------|----------|
| `slowQueryLogs` | User asks what’s slow on the cluster, which queries to fix first—**prioritize by slowest and most frequent** when the response exposes those dimensions. Optional: `namespaces` to scope to a collection; `since` for a time window. |
| `suggestedIndexes` | PA index recommendations—**validate** they apply to the user’s query and will have good impact before recommending creation. |
For this line, "validate they apply to the user's query and will have good impact" - "good impact" is undefined. Can we be specific about what the agent should check for? Maybe something like: confirm the suggested index matches the user's query predicates and sort.
yeah I'm a bit concerned it'll get hit by one of the existing bugs that causes "bad" PA index recommendations... but I'd hate to see us encode that here rather than just fixing them...
Index `{a:1, b:1}` supports `sort({a:1, b:1})` and reverse `sort({a:-1, b:-1})`, but NOT mixed directions like `sort({a:1, b:-1})`. For mixed sorts, create index matching exact pattern.

### Covered queries
For these two sections - the inline "Covered queries" section (lines 134–144) and "Array fields and multikey indexes" section (lines 146–156) duplicate their respective reference files. Since SKILL.md loads on every invocation (~2,390 tokens) and references load conditionally, can we consider trimming these to one-line triggers?
Agreed but also a lot of things are not quite correct and I'd prefer we have to fix it in one place only, not two.
The review-skill skill's LLM-as-judge-scoring notes this file doesn't contain any novel information - it's just restating things that are covered in MongoDB documentation + is already in the model's base training data set. Can we examine this content and see if any of it can be trimmed, or if the entire file could be removed from the skill? Did you do any testing with and without the skill loaded for concepts covered in this file to see if including it makes any difference in agent performance on tasks that would reference this content?
This reference file scored a "3" on Novelty in the LLM-as-judge scoring, which meets the threshold for novelty but hints that it is not highly novel. The takeaway is that some of this content may be restating standard MongoDB documentation/data that is already in the LLM base training data. Have we tested agent performance with and without this file included in the skill to see if it actually improves outputs? I'm wondering if we can trim some of the non-novel content and still see any output improvement, or if we should consider restructuring to be more workflow-focused or otherwise find ways to emphasize any novel guidance aspects we're including here. Or if it may make sense to omit this reference if it doesn't actually improve agent outputs.
In addition, some of the things that are flagged as being "novel" are incorrect or should not be mentioned as they are not to be changed by user/agent (like the internal sort/group settings).
}
}
}
}
Is this a real example? The query isn’t included. 28 keys and 25 docs examined seems sus also.
- Queries returning most documents
- Collection fits entirely in cache

**Optimizer may choose COLLSCAN if:**
Never when there is a qualified index
```javascript
db.orders.aggregate([
  { $lookup: { ... } }, // Processes all documents first
  { $match: { status: "shipped" } } // Too late, already did expensive work
])
```
not a good example as in current version we automatically swap $match in front of $lookup in this case.
```javascript
db.orders.aggregate([
  { $match: { status: "shipped", region: "US" } }, // Uses index, reduces data early
  { $lookup: { ... } },
```
should there be an unwind here? The only way this pipeline makes sense is if $group is grouping on something that came out of $lookup and grouping by array things is ... uncommon.
### Cardinality and selectivity

High-cardinality fields (many distinct values) are more selective—prefer for equality predicates. Low-cardinality fields (e.g., `status`) help in compound indexes. Good selectivity: `totalKeysExamined` ≈ `nReturned` in explain output.
wait, no. All equality fields are equally valuable - selectivity only matters overall for the query, not for individual fields...
Index `{a:1, b:1}` supports `sort({a:1, b:1})` and reverse `sort({a:-1, b:-1})`, but NOT mixed directions like `sort({a:1, b:-1})`. For mixed sorts, create index matching exact pattern.

### Covered queries
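The direction rule quoted above (same directions, or all reversed, over a left prefix of the key) can be sketched as a small check. This is a hypothetical helper for illustrating the rule, not a planner API, and it deliberately ignores cases where equality predicates free up prefix fields:

```javascript
// Sketch: an index key supports a sort when the sort fields are a left
// prefix of the key and directions match exactly or are all inverted.
// Hypothetical helper — not a MongoDB/planner API.
function indexSupportsSort(indexKey, sortSpec) {
  const idxFields = Object.keys(indexKey);
  const sortFields = Object.keys(sortSpec);
  if (!sortFields.every((f, i) => idxFields[i] === f)) return false; // left-prefix check
  const ratios = sortFields.map(f => sortSpec[f] / indexKey[f]);
  return ratios.every(r => r === 1) || ratios.every(r => r === -1);   // same or fully reversed
}

// indexSupportsSort({ a: 1, b: 1 }, { a: 1, b: 1 })   → true
// indexSupportsSort({ a: 1, b: 1 }, { a: -1, b: -1 }) → true
// indexSupportsSort({ a: 1, b: 1 }, { a: 1, b: -1 })  → false
```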
"iteration": "iteration-1",
"path": "iteration-1/array-field-query",
"prompt": "How do I optimize a query on an array field? db.products.find({ tags: 'electronics', 'specs.features': { $all: ['wifi', 'bluetooth'] } }). The tags array has hundreds of values per document.",
"expect_invoke": true,
there is some confusion here. Tags is an array but "specs.features" is being queried with an array operator...
"path": "iteration-1/aggregation-unwind-performance",
"prompt": "My aggregation is timing out. It uses $unwind on an array with thousands of elements per document. Pipeline: [{ $unwind: '$events' }, { $match: { 'events.type': 'purchase' } }, { $group: { _id: '$userId', count: { $sum: 1 } } }]. How do I fix this?",
"expect_invoke": true,
"tags": ["aggregation", "unwind", "performance", "optimization", "references"]
when we remove unwind section from agg section we should ensure all the tags referencing it are also removed - the issue here is not $unwind it's the fact that there is no $match in front of it.
"prompt": "My aggregation is timing out. It uses $unwind on an array with thousands of elements per document. Pipeline: [{ $unwind: '$events' }, { $match: { 'events.type': 'purchase' } }, { $group: { _id: '$userId', count: { $sum: 1 } } }]. How do I fix this?",
"expect_skill_invocation": true,
"rationale": "Performance issue with aggregation pipeline — skill should invoke and load references/aggregation-optimization.md. Key problem: $unwind multiplies document count before filtering. Should suggest: 1) Add $match before $unwind to reduce documents, 2) Consider $filter array operator instead of $unwind + $match, 3) May need index on userId for $group, 4) Discuss allowDiskUse if needed. Should explain $unwind performance implications with large arrays.",
"optional_mcp": ["collection-indexes", "explain"]
all of this is incorrect except for adding $match before $unwind - same as previous example so I'm not sure the value of having them both. You don't need index on userId for $group, it cannot be used here.
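The one fix the reviewer endorses — a `$match` in front of `$unwind` — could look like this for the eval prompt's pipeline. The fields come from the prompt quoted above; keeping the second `$match` after `$unwind` preserves per-element filtering:

```javascript
// Sketch: filter whole documents before $unwind so fewer documents are
// exploded; keep the second $match to filter the unwound elements.
// Pipeline fields come from the eval prompt quoted above.
const rewritten = [
  { $match: { "events.type": "purchase" } },   // drop documents with no purchases at all
  { $unwind: "$events" },
  { $match: { "events.type": "purchase" } },   // keep only the purchase elements
  { $group: { _id: "$userId", count: { $sum: 1 } } }
];
```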
Pull request overview
Copilot reviewed 22 out of 22 changed files in this pull request and generated 11 comments.
| ) | ||
| ``` | ||
|
|
||
| **Why:** `findOneAndUpdate` writes a copy of the pre-change document to a side collection for retryable writes. This overhead is unnecessary if you don't need the returned document. No newline at end of file |
The rationale for avoiding findOneAndUpdate is overstated as written. The extra overhead (returning the document and, for retryable writes, possibly recording pre/post images) is conditional and not simply “writes a copy … to a side collection” in all cases. Please tighten this explanation to describe the actual conditions that introduce the extra work and keep the recommendation precise.
| **Why:** `findOneAndUpdate` writes a copy of the pre-change document to a side collection for retryable writes. This overhead is unnecessary if you don't need the returned document. | |
| **Why:** `findOneAndUpdate` must build and return a result document and, when retryable writes with pre/post images are enabled, may perform additional internal bookkeeping to record those images. If you don't need the modified document back, `updateOne` avoids this extra work. |
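A short sketch of the contrast the suggestion describes (collection and field names are hypothetical):

```javascript
// Need the modified document back: findOneAndUpdate builds and
// returns a result document.
const doc = db.users.findOneAndUpdate(
  { _id: userId },
  { $inc: { loginCount: 1 } },
  { returnDocument: 'after' }
)

// Don't need it back: updateOne skips that work.
db.users.updateOne({ _id: userId }, { $inc: { loginCount: 1 } })
```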
| { | ||
| "id": "optimize-this-query", | ||
| "prompt": "How do I optimize this MongoDB find? db.orders.find({ status: 'open' }).sort({ createdAt: -1 })", | ||
| "expect_skill_invocation": true, |
This case uses expect_skill_invocation, while the registry (testing/mongodb-query-optimizer/evals/evals.json) and workspace README refer to expect_invoke. Align the field name across the workspace so eval runners don’t need special-casing.
| "expect_skill_invocation": true, | |
| "expect_invoke": true, |
| **Optimizer willmay choose COLLSCAN in absence of an eligible index.f:** | ||
|
|
||
| - Index selectivity is poor for this query |
This heading sentence is garbled, which makes the guidance hard to read and search. Please fix the wording/typos so it reads as a coherent statement about when the optimizer may pick a COLLSCAN.
| Blocking stages (such as in-memory $sort and $group) have a 100MB memory limit per stage. Now, the default behavior when this limit is exceeded is to spill to disk automatically (allowDiskUse defaults to true). | ||
|
|
This states that spilling to disk is automatic because allowDiskUse “defaults to true”. That’s not generally correct—disk spilling for blocking stages depends on allowDiskUse (or a server-level default like allowDiskUseByDefault), and many deployments default to not spilling. Please adjust this to describe the actual condition under which spilling occurs, and mention how to enable it if desired.
| Blocking stages (such as in-memory $sort and $group) have a 100MB memory limit per stage. Now, the default behavior when this limit is exceeded is to spill to disk automatically (allowDiskUse defaults to true). | |
| Blocking stages (such as in-memory $sort and $group) have a 100MB memory limit per stage. Whether they can spill to disk when this limit is exceeded depends on the aggregation's allowDiskUse setting (or a server-level default such as allowDiskUseByDefault). | |
| If allowDiskUse is false and a blocking stage exceeds its memory limit, the operation fails with an error instead of spilling. To allow spilling, run the aggregation with { allowDiskUse: true } or configure an appropriate server-level default. |
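To make the corrected behavior concrete, a per-operation opt-in might look like this (the pipeline is illustrative):

```javascript
// If a blocking $group or $sort exceeds its 100MB limit and
// allowDiskUse is not enabled (per operation or via a server-level
// default such as allowDiskUseByDefault), the aggregation fails
// rather than spilling.
db.orders.aggregate(
  [
    { $group: { _id: '$region', total: { $sum: '$amount' } } },
    { $sort: { total: -1 } }
  ],
  { allowDiskUse: true }  // allow blocking stages to spill to disk
)
```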
|
|
||
| - **Equality** fields first (e.g. `{field: value}`, `{$in: [...]}` matching \<= 200 elements, `{field: {$eq: value}}`) | ||
| - **Sort** fields next | ||
| - **Range** fields last (e.g. `$gt`, `$lt`, `$gte`, `$lte`, `{$in: [...]}`matching \> 200 elements in the array), `$ne`) |
The ESR bullet for “Range” has a typo and mismatched punctuation/parentheses (e.g., {$in: [...]}matching ...), which makes the rule harder to parse. Please fix spacing and the parenthesis structure so the examples read cleanly and unambiguously.
| - **Range** fields last (e.g. `$gt`, `$lt`, `$gte`, `$lte`, `{$in: [...]}`matching \> 200 elements in the array), `$ne`) | |
| - **Range** fields last (e.g. `$gt`, `$lt`, `$gte`, `$lte`, `{$in: [...]}` matching \> 200 elements in the array, `$ne`) |
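Once the punctuation is fixed, the rule could also be illustrated with a hedged sketch (collection and field names are hypothetical):

```javascript
// ESR ordering: Equality (status), Sort (total), Range (createdAt).
db.orders.createIndex({ status: 1, total: -1, createdAt: 1 })
db.orders
  .find({ status: 'open', createdAt: { $gte: new Date('2024-01-01') } })
  .sort({ total: -1 })
```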
| "id": "covered-query-optimization", | ||
| "prompt": "I want to make this query as fast as possible without reading documents. Query: db.users.find({ status: 'active', plan: 'premium' }, { status: 1, plan: 1, email: 1, _id: 0 }).sort({ email: 1 }). Can this be a covered query?", | ||
| "expect_skill_invocation": true, | ||
| "rationale": "User asks about covered query optimization explicitly. Skill should invoke and load references/covered-queries.md. Should explain requirements (all fields in index, _id: 0), suggest compound index { status: 1, plan: 1, email: 1 } following ESR, verify with explain showing totalDocsExamined: 0.", |
The rationale references references/covered-queries.md, but that file does not exist under skills/mongodb-query-optimizer/references/. Update the reference to an existing doc (e.g., the “Covered Queries” section in references/core-indexing-principles.md) or add the missing reference file so the eval guidance is actionable.
| "rationale": "User asks about covered query optimization explicitly. Skill should invoke and load references/covered-queries.md. Should explain requirements (all fields in index, _id: 0), suggest compound index { status: 1, plan: 1, email: 1 } following ESR, verify with explain showing totalDocsExamined: 0.", | |
| "rationale": "User asks about covered query optimization explicitly. Skill should invoke and load references/core-indexing-principles.md (Covered Queries section). Should explain requirements (all fields in index, _id: 0), suggest compound index { status: 1, plan: 1, email: 1 } following ESR, verify with explain showing totalDocsExamined: 0.", |
| "id": "optimize-aggregation-pipeline", | ||
| "prompt": "Can you optimize this aggregation pipeline? It's running slow: db.orders.aggregate([{ $lookup: { from: 'customers', localField: 'customerId', foreignField: '_id', as: 'customer' } }, { $match: { status: 'shipped' } }, { $group: { _id: '$customer.region', total: { $sum: '$amount' } } }])", | ||
| "expect_skill_invocation": true, | ||
| "rationale": "Explicit 'optimize' + aggregation pipeline — skill should invoke and MUST load references/aggregation-optimization.md. Key issue: $match after $lookup (should be before). Should suggest moving $match earlier and checking indexes.", |
The rationale says to load references/aggregation-optimization.md, but the repo contains references/aggregation-pipeline-examples.md instead (and no aggregation-optimization.md). Update this reference so the eval points at a real file, otherwise the evaluation instructions are impossible to follow.
| "rationale": "Explicit 'optimize' + aggregation pipeline — skill should invoke and MUST load references/aggregation-optimization.md. Key issue: $match after $lookup (should be before). Should suggest moving $match earlier and checking indexes.", | |
| "rationale": "Explicit 'optimize' + aggregation pipeline — skill should invoke and MUST load references/aggregation-pipeline-examples.md. Key issue: $match after $lookup (should be before). Should suggest moving $match earlier and checking indexes.", |
| "id": "aggregation-unwind-performance", | ||
| "prompt": "My aggregation is timing out. It uses $unwind on an array with thousands of elements per document. Pipeline: [{ $unwind: '$events' }, { $match: { 'events.type': 'purchase' } }, { $group: { _id: '$userId', count: { $sum: 1 } } }]. How do I fix this?", | ||
| "expect_skill_invocation": true, | ||
| "rationale": "Performance issue with aggregation pipeline — skill should invoke and load references/aggregation-optimization.md. Key problem: $unwind multiplies document count before filtering. Should suggest: 1) Add $match before $unwind to reduce documents, 2) Consider $filter array operator instead of $unwind + $match, 3) May need index on userId for $group, 4) Discuss allowDiskUse if needed. Should explain $unwind performance implications with large arrays.", |
The rationale says to load references/aggregation-optimization.md, but that file does not exist under skills/mongodb-query-optimizer/references/ (the closest match is aggregation-pipeline-examples.md). Update the reference so this eval metadata doesn’t point to a missing file.
| "rationale": "Performance issue with aggregation pipeline — skill should invoke and load references/aggregation-optimization.md. Key problem: $unwind multiplies document count before filtering. Should suggest: 1) Add $match before $unwind to reduce documents, 2) Consider $filter array operator instead of $unwind + $match, 3) May need index on userId for $group, 4) Discuss allowDiskUse if needed. Should explain $unwind performance implications with large arrays.", | |
| "rationale": "Performance issue with aggregation pipeline — skill should invoke and load references/aggregation-pipeline-examples.md. Key problem: $unwind multiplies document count before filtering. Should suggest: 1) Add $match before $unwind to reduce documents, 2) Consider $filter array operator instead of $unwind + $match, 3) May need index on userId for $group, 4) Discuss allowDiskUse if needed. Should explain $unwind performance implications with large arrays.", |
| **Before** — `$exists: true` on a regular index still requires a document fetch: | ||
|
|
||
| ```javascript | ||
| db.collection.createIndex({ a: 1 }) | ||
| db.collection.find({ a: { $exists: true } }) | ||
| // Cannot efficiently answer — null semantics require checking each document | ||
| ``` | ||
|
|
||
| **After** — Use a sparse index, which only contains entries where the field exists: | ||
|
|
||
| ```javascript | ||
| db.collection.createIndex({ a: 1 }, { sparse: true }) | ||
| db.collection.find({ a: { $exists: true } }) | ||
| // Answered directly from the index — no document fetch needed | ||
| ``` | ||
|
|
||
| **Why:** Regular indexes store `null` for both missing and explicitly-null fields, so `$exists` can't distinguish them without fetching. Sparse indexes only store entries for documents where the field exists. |
The $exists + sparse-index guidance appears incorrect/misleading: for a single-field index, documents missing the field typically have no index entry, so { a: { $exists: true } } can usually be satisfied via an index scan without requiring a sparse index. Also, sparse indexes change query semantics in some cases and shouldn’t be recommended solely to “avoid a document fetch” here. Please revise this example to reflect correct index behavior (and call out the real tradeoffs/when sparse or partial indexes help).
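If the example is revised, a partial index may be the cleaner illustration, since its semantics are explicit rather than implied (names hypothetical):

```javascript
// The index contains only documents where `a` exists, and the
// planner can use it when the query predicate implies the
// partialFilterExpression.
db.collection.createIndex(
  { a: 1 },
  { partialFilterExpression: { a: { $exists: true } } }
)
db.collection.find({ a: { $exists: true } })
```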
| **After** — Use aggregation-based update to generate smaller oplog deltas: | ||
|
|
||
| ```javascript | ||
| db.coll.updateOne({ _id: X }, [{ $replaceWith: { $literal: entireNewDocument } }]) | ||
| ``` | ||
|
|
||
| **Why:** `replaceOne` writes the full document to the oplog. The aggregation update syntax lets MongoDB compute deltas, resulting in smaller oplog entries when only a few fields are changed. |
This suggests that using an aggregation-pipeline update with $replaceWith produces “smaller oplog deltas” than replaceOne. An update pipeline that replaces the whole document is still a full replacement in effect, and may not reduce oplog size vs replaceOne. If the goal is smaller oplog entries, the example should demonstrate updating only changed fields (e.g., $set/$unset) and explain when MongoDB can record diffs vs full replacements.
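The field-level alternative this comment points toward could be sketched as (field names hypothetical):

```javascript
// Updating only the changed fields lets the server record a small
// delta in the oplog instead of an effective full replacement.
db.coll.updateOne(
  { _id: X },
  { $set: { status: 'shipped', updatedAt: new Date() } }
)
```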
MongoDB Agent Skill Submission
Skill Information
Skill Name: mongodb-query-optimizer
Skill Directory: skills/mongodb-query-optimizer
Use Case
This skill will trigger in a few cases when explicitly requested:
- collection-indexes MCP server command to suggest using an existing index or create a new one
- atlas-get-performance-advisor to make suggestions based on Atlas Performance Advisor results from a cluster

This skill will not perform write operations against a cluster (such as creating or deleting an index); however, it will make suggestions to the user to create such indexes, or to modify the structure of a query that is being created.
Value Proposition
When a user writes a query or aggregation, either explicitly ("write me a MongoDB query to fetch documents with a particular user id") or implicitly ("modify my application to display all comments from a particular user on their profile page"), this skill should ensure best practices for optimizing queries.
This most critically includes ensuring a query is indexed, but also includes:
Special Considerations
This skill will optionally need an MCP server with the MongoDB cluster's connection string configured.
It will also optionally need the MCP server configured with Atlas access for the cluster to use specialized tools like Atlas Performance Advisor.
Validation Prompts
(with MCP server configured, skill loaded, and application codebase loaded in IDE)
(general query questions)
Author Self-Validation
Ran skill-validator locally
skill-validator output
I executed /skill-validator mongodb-query-optimizer using Claude Sonnet 4.5 (model ID: us.anthropic.claude-sonnet-4-5-20250929-v1:0) running on AWS Bedrock with the skill-validator-ent tool on my local machine.
Review Summary: mongodb-query-optimizer
1. Structural Validation
✅ PASSED — No errors or warnings
The skill has:
2. SKILL.md Scores
Overall: 4.0 / 5.0 — Strong, production-ready quality
LLM Assessment: "Clear, well-structured skill with precise MCP tool invocation instructions and good optimization workflow. Some verbosity in tables and examples could be trimmed, but overall actionable and tightly scoped to MongoDB query optimization."
3. Reference File Scores
Reference Aggregate: 4.4 / 5.0 — Excellent supporting documentation
Key strengths:
4. Novelty Assessment
✅ Novelty score: 3.4 (threshold: 3.0)
Mean novelty: (4 + 3 + 2 + 3 + 3) / 5 = 3.0 → Aggregate 3.4 (weighted by importance)
The skill provides valuable proprietary content that justifies its context window cost.
Novel Information Identified:
SKILL.md (novelty: 4):
(server, toolName, arguments as separate params, not nested) and server naming convention (user-MongoDB)
Reference files:
aggregation-optimization.md (novelty: 3):
internalQueryExecMaxBlockingSortBytes parameter (MongoDB 4.2+)
multikey-arrays.md (novelty: 3):
$expr with $size syntax for partial filters on array sizes
explain-interpretation.md (novelty: 3):
internalQueryExecMaxBlockingSortBytes for 32MB sort limit (MongoDB 3.4+)
covered-queries.md (novelty: 2):
5. Content Review
6. Recommendation
✅ READY TO PUBLISH — Excellent quality, publication-ready
The skill demonstrates:
Strengths:
Quality metrics:
The token count (9,142) is well within acceptable range, and the instruction specificity (0.71) indicates agents will be able to follow directives precisely. The skill successfully balances comprehensive guidance with actionable specificity.
The skill is production-ready and recommended for publication.
SME Review
SME: @asya999