
MongoDB Query Optimizer Skill MCP-421 #6

Open
davidjahn wants to merge 8 commits into main from query-optimizing-skill

Conversation

@davidjahn (Collaborator) commented Mar 6, 2026

MongoDB Agent Skill Submission

Skill Information

Skill Name: mongodb-query-optimizer
Skill Directory: skills/mongodb-query-optimizer

Use Case

This skill will trigger in a few cases when explicitly requested:

  • User asks to optimize performance of a query
    • Ensure query is indexed and following best practice
    • Leverage collection-indexes MCP server command to suggest using an existing index or create a new one
  • User asks to debug a slow or badly performing query or cluster
    • Leverage atlas-get-performance-advisor to make suggestions based on Atlas Performance Advisor results from a cluster
    • Ensure queries are indexed and following best practice

This skill will not perform write operations against a cluster (such as creating or deleting an index); however, it will suggest that the user create such indexes, or modify the structure of a query that is being created.

Value Proposition

When a user writes a query or aggregation, either explicitly ("write me a MongoDB query to fetch documents with a particular user id") or implicitly ("modify my application to display all comments from a particular user on their profile page"), this skill should ensure best practices for optimizing queries.

This most critically includes ensuring a query is indexed, but also includes:

  • Debugging existing slow queries
  • Following best practices for aggregation pipelines
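To illustrate the kind of suggestion intended here (the collection and field names below are hypothetical, not taken from the skill), a filter-plus-sort query and the compound index the skill might propose:

```javascript
// Hypothetical query the user is writing: a user's comments, newest first.
//   db.comments.find({ userId: id }).sort({ createdAt: -1 })

// The skill would not create an index itself; it would suggest one like:
const suggestedIndex = { userId: 1, createdAt: -1 };

// Equality field first, then the sort field, so one index satisfies both
// the filter and the sort order without an in-memory sort.
```

The skill only surfaces the `createIndex` suggestion; applying it remains the user's decision.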

Special Considerations

This skill will optionally need an MCP server with the MongoDB cluster's connection string configured.
It will also optionally need the MCP server configured with Atlas access for the cluster to use specialized tools like Atlas Performance Advisor.

Validation Prompts

(with MCP server configured, skill loaded, and application codebase loaded in IDE)

/mongodb-query-optimizer how can i optimize the slow query logs in the cluster for this application?
/mongodb-query-optimizer could you investigate the aggregate pipelines in this application and suggest if there are any places for optimization?

(general query questions)

  I ran this query and it's really slow:
  db.orders.find({ status: "pending", customerId: ObjectId("..."), orderDate: { $gte: ISODate("2024-01-01") } }).sort({ orderDate: -1 })

  Can you help me optimize it? I have indexes on status and orderDate separately.

  My aggregation pipeline is timing out with large datasets:

  db.users.aggregate([
    { $lookup: { from: "orders", localField: "_id", foreignField: "userId", as: "orders" } },
    { $unwind: "$orders" },
    { $match: { "orders.status": "completed", region: "US" } },
    { $group: { _id: "$region", totalRevenue: { $sum: "$orders.amount" } } }
  ])

  How can I optimize this?

  I have a query on an array field that's not performing well:

  db.products.find({
    tags: { $in: ["electronics", "sale", "featured"] },
    "specs.features": "wifi"
  }).sort({ price: 1 })

  The tags array can have 50-100 values per product. What's the best way to index this?

Author Self-Validation

  • Verified structure and format requirements
  • Tested skill with LLMs using the five prompts above
  • Checked content quality and novelty
  • Validated all links and file references
  • (Optional but recommended) Ran skill-validator locally

skill-validator output

I executed /skill-validator mongodb-query-optimizer using Claude Sonnet 4.5 (model ID: us.anthropic.claude-sonnet-4-5-20250929-v1:0) running on AWS Bedrock with the skill-validator-ent tool on my local machine.

Review Summary: mongodb-query-optimizer

1. Structural Validation

✅ PASSED — No errors or warnings

The skill has:

  • Valid frontmatter (name, description, compatibility)
  • Clean markdown structure (no unclosed code fences)
  • All reference files properly linked
  • Valid external link (MongoDB MCP Server docs: HTTP 200)
  • Token count: 9,142 total
    • SKILL.md: 2,390 tokens
    • References: 6,752 tokens (aggregation: 1,252 | covered-queries: 1,416 | explain: 2,270 | multikey: 1,814)
  • Instruction specificity: 0.71 (excellent)
  • Medium contamination level (expected for MongoDB-specific tool)

2. SKILL.md Scores

Overall: 4.0 / 5.0 — Strong, production-ready quality

| Dimension | Score | Meaning |
| --- | --- | --- |
| Clarity | 4 | Good — well-structured with precise MCP instructions |
| Actionability | 4 | Good — concrete workflow with end-to-end example |
| Token Efficiency | 3 | Adequate — some verbosity in tables but justified |
| Scope Discipline | 5 | Excellent — tightly scoped to query optimization |
| Directive Precision | 4 | Good — precise directives with explicit conditional gates |
| Novelty | 4 | Good — proprietary MCP integration details |

LLM Assessment: "Clear, well-structured skill with precise MCP tool invocation instructions and good optimization workflow. Some verbosity in tables and examples could be trimmed, but overall actionable and tightly scoped to MongoDB query optimization."

3. Reference File Scores

Reference Aggregate: 4.4 / 5.0 — Excellent supporting documentation

| File | Overall | Clarity | Instructional | Token Eff | Novelty | Relevance |
| --- | --- | --- | --- | --- | --- | --- |
| aggregation-optimization.md | 4.6 | 5 | 5 | 5 | 3 | 5 |
| multikey-arrays.md | 4.4 | 5 | 5 | 4 | 3 | 5 |
| explain-interpretation.md | 4.4 | 5 | 5 | 4 | 3 | 5 |
| covered-queries.md | 4.2 | 5 | 5 | 4 | 2 | 5 |

Key strengths:

  • Perfect 5/5 on clarity and instructional value across all references
  • Perfect 5/5 on skill relevance — every reference directly supports optimization
  • Concrete, actionable examples throughout
  • Well-structured for agent consumption

4. Novelty Assessment

✅ Novelty score: 3.4 (threshold: 3.0)

Mean novelty: (4 + 3 + 2 + 3 + 3) / 5 = 3.0 → Aggregate 3.4 (weighted by importance)

The skill provides valuable proprietary content that justifies its context window cost.

Novel Information Identified:

SKILL.md (novelty: 4):

  1. Custom MCP tool invocation pattern — Exact parameter structure (server, toolName, arguments as separate params, not nested) and server naming convention (user-MongoDB)
  2. Internal reference file system — Organization-specific knowledge files and their lazy-loading workflow

Reference files:

aggregation-optimization.md (novelty: 3):

  • internalQueryExecMaxBlockingSortBytes parameter (MongoDB 4.2+)
  • Specific stage distribution on sharded clusters (shard-local vs merge)
  • Atlas tier upgrade operational practices

multikey-arrays.md (novelty: 3):

  • $expr with $size syntax for partial filters on array sizes
  • Compound multikey restriction error behavior and workarounds
  • Deduplication post-processing behavior details
  • ESR rule application nuances with multikey indexes

explain-interpretation.md (novelty: 3):

  • internalQueryExecMaxBlockingSortBytes for 32MB sort limit (MongoDB 3.4+)
  • Specific stage names (SORT_KEY_GENERATOR, PROJECTION_* types, SHARDING_FILTER)
  • Query targeting ratio interpretation thresholds

covered-queries.md (novelty: 2):

  • Standard MongoDB covered query concepts
  • Well-organized but public knowledge

5. Content Review

| Check | Status | Details |
| --- | --- | --- |
| Examples | ✅ Excellent | Complete end-to-end workflow (lines 34-51), trigger phrases, MCP structures |
| Edge cases | ✅ Good | MCP availability branching, verbosity selection (<10s guideline), multikey/sort limitations |
| Scope-gating | ✅ Excellent | Clear invocation criteria, "Determine query type first" with "Always" directives, explicit boundaries |
| MongoDB data access | ✅ Excellent | Comprehensive MCP server instructions with exact tool names and formats |

6. Recommendation

✅ READY TO PUBLISH — Excellent quality, publication-ready

The skill demonstrates:

Strengths:

  • ✅ Perfect structural validation (no errors/warnings)
  • ✅ Excellent scope discipline (5.0/5.0)
  • ✅ High instruction specificity (0.71)
  • ✅ Complete end-to-end example workflow
  • ✅ Explicit conditional gates with "Always" directives
  • ✅ Comprehensive MCP integration instructions
  • ✅ Four high-quality reference files (4.4 aggregate score)
  • ✅ Novel proprietary content (3.4 novelty score)

Quality metrics:

  • SKILL.md overall: 4.0/5.0 (strong)
  • Reference aggregate: 4.4/5.0 (excellent)
  • Novelty: 3.4/5.0 (above threshold)
  • All content checks: Pass

The token count (9,142) is well within acceptable range, and the instruction specificity (0.71) indicates agents will be able to follow directives precisely. The skill successfully balances comprehensive guidance with actionable specificity.

The skill is production-ready and recommended for publication.

SME Review

SME: @asya999


## Out of scope (for this skill)

- Performance Advisor / slow query logs
Collaborator

@davidjahn Any particular reason to exclude Performance Advisor from this skill? It's already possible to access PA recommendations from the MCP server if the user has configured MCP with an Atlas service account

Collaborator Author

Main reason is I'm having trouble getting Performance Advisor access to work w/ the MongoDB MCP server.
I'm planning to add Atlas Performance Advisor support in a fast followup PR. Wanted to start with something simple that is working

Collaborator

Awesome!


3. **Match or suggest an index**
- **If an existing index can support the query**
- Prefer equality on indexed fields, then range, then sort; compound indexes follow left-prefix.
Collaborator

you mean then sort and then range, right? (absent other info)

Collaborator Author

yes will fix, planning to finalize this initial PR (for only index suggestion support) today

3. **Match or suggest an index**
- **If an existing index can support the query**
- Prefer equality on indexed fields, then range, then sort; compound indexes follow left-prefix.
- Shape the query so the planner can use that index (e.g. filter/sort order consistent with the index key).
Collaborator

sort order, yes but filter order doesn't matter...

Collaborator Author

yes will fix this
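For reference, the corrected ordering the reviewer describes — equality, then sort, then range (the ESR rule) — sketched with hypothetical fields:

```javascript
// Hypothetical query:
//   db.orders.find({ status: "open", total: { $gte: 100 } }).sort({ createdAt: -1 })

// Equality (status), then Sort (createdAt), then Range (total):
const esrIndex = { status: 1, createdAt: -1, total: 1 };

// Placing the range field last lets the index return documents already in
// sort order; indexing `total` before `createdAt` would force an in-memory sort.
```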

- **Server**: `user-MongoDB`
- **Arguments**: `database` (string), `collection` (string)
- **Use**: Call after you know the target db and collection from the user's query or code. Use the returned `classicIndexes[].key` to see index key patterns; ignore search indexes unless the query is a search query.

Collaborator

what about various improvements for aggregations or find based on the actual query - things like suggesting missing projections for find or pointing out inefficiencies in aggregation pipelines? sort without limit, that sort of thing?

Collaborator Author

i'm limiting scope of this initial PR to only suggesting indexes on query

i will add query structure suggestions in a followup PR this week if i have time (looking like probably i will have time to do this before EOW)

@asya999 (Collaborator) left a comment

left a few comments

@davidjahn marked this pull request as ready for review March 12, 2026 20:55
@davidjahn requested review from a team as code owners March 12, 2026 20:55
Copilot AI (Contributor) left a comment

Pull request overview

Adds a new mongodb-query-optimizer agent skill intended to trigger only on explicit performance/indexing requests, and provides reference material + a small eval workspace to validate correct invocation/boundary behavior.

Changes:

  • Introduces skills/mongodb-query-optimizer/SKILL.md with MCP branching guidance for collection-indexes, explain, and Atlas Performance Advisor.
  • Adds optimization reference docs (aggregation, explain interpretation, covered queries, multikey/arrays).
  • Adds a testing workspace + eval registry with iteration-1 cases for invocation/boundary checks.

Reviewed changes

Copilot reviewed 20 out of 20 changed files in this pull request and generated 22 comments.

| File | Description |
| --- | --- |
| skills/mongodb-query-optimizer/SKILL.md | New skill definition and workflow (invocation rules, MCP tool usage, index principles). |
| skills/mongodb-query-optimizer/references/aggregation-optimization.md | New aggregation optimization reference. |
| skills/mongodb-query-optimizer/references/covered-queries.md | New covered-query reference. |
| skills/mongodb-query-optimizer/references/explain-interpretation.md | New explain-output interpretation reference. |
| skills/mongodb-query-optimizer/references/multikey-arrays.md | New multikey/array query reference. |
| testing/mongodb-query-optimizer/evals/evals.json | Eval registry describing the iteration-1 cases. |
| testing/mongodb-query-optimizer/mongodb-query-optimizer-workspace/README.md | Documentation for how to run the workspace evals. |
| testing/mongodb-query-optimizer/mongodb-query-optimizer-workspace/iteration-1/benchmark.json | Benchmark metadata for iteration-1. |
| testing/mongodb-query-optimizer/mongodb-query-optimizer-workspace/iteration-1/why-slow/eval_metadata.json | Case metadata for a "why is this slow" prompt. |
| testing/mongodb-query-optimizer/mongodb-query-optimizer-workspace/iteration-1/slow-queries-cluster/eval_metadata.json | Case metadata for "slow queries on my cluster" prompt. |
| testing/mongodb-query-optimizer/mongodb-query-optimizer-workspace/iteration-1/optimize-this-query/eval_metadata.json | Case metadata for optimizing a simple find+sort. |
| testing/mongodb-query-optimizer/mongodb-query-optimizer-workspace/iteration-1/optimize-aggregation-pipeline/eval_metadata.json | Case metadata for optimizing an aggregation pipeline. |
| testing/mongodb-query-optimizer/mongodb-query-optimizer-workspace/iteration-1/negative-routine-find/eval_metadata.json | Negative/boundary case metadata. |
| testing/mongodb-query-optimizer/mongodb-query-optimizer-workspace/iteration-1/how-to-index/eval_metadata.json | Case metadata for "how to index this $match" prompt. |
| testing/mongodb-query-optimizer/mongodb-query-optimizer-workspace/iteration-1/fix-slow-queries/eval_metadata.json | Case metadata for general slow-query help prompt. |
| testing/mongodb-query-optimizer/mongodb-query-optimizer-workspace/iteration-1/explain-interpretation/eval_metadata.json | Case metadata for explain interpretation prompt. |
| testing/mongodb-query-optimizer/mongodb-query-optimizer-workspace/iteration-1/elemMatch-array-query/eval_metadata.json | Case metadata for array correctness/perf prompt. |
| testing/mongodb-query-optimizer/mongodb-query-optimizer-workspace/iteration-1/covered-query-optimization/eval_metadata.json | Case metadata for covered-query optimization prompt. |
| testing/mongodb-query-optimizer/mongodb-query-optimizer-workspace/iteration-1/array-field-query/eval_metadata.json | Case metadata for array-field query optimization prompt. |
| testing/mongodb-query-optimizer/mongodb-query-optimizer-workspace/iteration-1/aggregation-unwind-performance/eval_metadata.json | Case metadata for $unwind performance prompt. |


Comment on lines +101 to +114
Aggregation stages have a 100MB memory limit per stage (MongoDB 4.2+, configurable with `internalQueryExecMaxBlockingSortBytes`). When exceeded:
- Without `allowDiskUse`: pipeline fails
- With `allowDiskUse: true`: spills to disk (slower but completes)

**Use allowDiskUse when:**
- Processing large datasets that exceed memory limits
- $group or $sort operations on high-cardinality data
- Acceptable to trade speed for completion

**Better solutions:**
- Filter more aggressively early in pipeline
- Add indexes to enable `$sort` to use index order
- Increase available memory (Atlas tier upgrade)
- Consider materialized views for repeated aggregations
Copilot AI commented Mar 12, 2026

internalQueryExecMaxBlockingSortBytes is specifically a blocking sort limit; it doesn’t represent a universal “per stage” aggregation memory limit (e.g., $group has separate limits/knobs). Please adjust this text so it doesn’t point readers at the wrong setting and instead focuses on allowDiskUse + early filtering/indexing.

Suggested change
Aggregation stages have a 100MB memory limit per stage (MongoDB 4.2+, configurable with `internalQueryExecMaxBlockingSortBytes`). When exceeded:
- Without `allowDiskUse`: pipeline fails
- With `allowDiskUse: true`: spills to disk (slower but completes)
**Use allowDiskUse when:**
- Processing large datasets that exceed memory limits
- $group or $sort operations on high-cardinality data
- Acceptable to trade speed for completion
**Better solutions:**
- Filter more aggressively early in pipeline
- Add indexes to enable `$sort` to use index order
- Increase available memory (Atlas tier upgrade)
- Consider materialized views for repeated aggregations
Some aggregation operations (such as blocking `$sort` and certain `$group` patterns) have a 100MB in-memory limit per operation. When an operation exceeds its in-memory limit:
- Without `allowDiskUse`: the pipeline (or stage) fails
- With `allowDiskUse: true`: the operation can spill to disk (slower but completes)
**Use allowDiskUse when:**
- Processing large datasets that may exceed in-memory limits
- `$group` or `$sort` operations on high-cardinality or wide documents
- It is acceptable to trade speed for completion
**Better solutions:**
- Filter more aggressively early in the pipeline to reduce the working set
- Add or tune indexes so `$match` and `$sort` can use index order instead of in-memory sorts
- Increase available memory (for example, via an Atlas tier upgrade)
- Consider materialized views or pre-aggregated collections for repeated aggregations

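As a concrete sketch of the "filter early" advice in the suggested text (collection and field names hypothetical), note that `allowDiskUse` is passed as an aggregate option, not a pipeline stage:

```javascript
// Filter first so later memory-heavy stages see fewer documents:
const pipeline = [
  { $match: { date: { $gte: new Date("2024-01-01") } } }, // cut the working set early
  { $group: { _id: "$category", total: { $sum: "$amount" } } },
  { $sort: { total: -1 } },
];

// If a blocking stage still exceeds its in-memory limit, allow spilling to disk:
const options = { allowDiskUse: true };
//   db.sales.aggregate(pipeline, options)
```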
{
"id": "covered-query-optimization",
"prompt": "I want to make this query as fast as possible without reading documents. Query: db.users.find({ status: 'active', plan: 'premium' }, { status: 1, plan: 1, email: 1, _id: 0 }).sort({ email: 1 }). Can this be a covered query?",
"expect_skill_invocation": true,
Copilot AI commented Mar 12, 2026

The workspace README and testing/mongodb-query-optimizer/evals/evals.json use expect_invoke, but this case metadata uses expect_skill_invocation. If your eval harness keys off expect_invoke, these cases won’t be asserted correctly. Please standardize the field name across the registry + per-case eval_metadata.json (either rename here to expect_invoke or update the registry/docs/harness to match).

Suggested change
"expect_skill_invocation": true,
"expect_invoke": true,

Comment on lines +7 to +15
- **Invocation**: Cases with `expect_invoke: true` should trigger the mongodb-query-optimizer skill (explicit optimization / performance / slow-query asks).
- **Boundary**: Cases with `expect_invoke: false` should **not** trigger the optimizer when the user only wants a query generated without optimization language.

## Running evals

Use your agent harness or skill-validator flow to:

1. Load each `eval_metadata.json` prompt.
2. Assert the skill is selected when `expect_invoke` is true and not selected when false (or that the optimizer workflow is not applied for negative cases).
Copilot AI commented Mar 12, 2026

This README instructs using expect_invoke, but the per-case eval_metadata.json files use expect_skill_invocation. Please align the README with the actual schema (or rename the metadata field) so the instructions and fixtures are consistent.

Suggested change
- **Invocation**: Cases with `expect_invoke: true` should trigger the mongodb-query-optimizer skill (explicit optimization / performance / slow-query asks).
- **Boundary**: Cases with `expect_invoke: false` should **not** trigger the optimizer when the user only wants a query generated without optimization language.
## Running evals
Use your agent harness or skill-validator flow to:
1. Load each `eval_metadata.json` prompt.
2. Assert the skill is selected when `expect_invoke` is true and not selected when false (or that the optimizer workflow is not applied for negative cases).
- **Invocation**: Cases with `expect_skill_invocation: true` should trigger the mongodb-query-optimizer skill (explicit optimization / performance / slow-query asks).
- **Boundary**: Cases with `expect_skill_invocation: false` should **not** trigger the optimizer when the user only wants a query generated without optimization language.
## Running evals
Use your agent harness or skill-validator flow to:
1. Load each `eval_metadata.json` prompt.
2. Assert the skill is selected when `expect_skill_invocation` is true and not selected when false (or that the optimizer workflow is not applied for negative cases).

Comment on lines +19 to +20
4. **No array fields** are indexed (multikey indexes prevent covering)
5. **No sub-document fields** unless the exact sub-document path is indexed
Copilot AI commented Mar 12, 2026

This requirement incorrectly states that multikey indexes prevent covered queries. Multikey indexes can still cover queries under certain constraints (not returning the array field, avoiding $elemMatch, etc.). Please update this requirement so covered-query guidance matches MongoDB behavior.

Suggested change
4. **No array fields** are indexed (multikey indexes prevent covering)
5. **No sub-document fields** unless the exact sub-document path is indexed
4. **Array fields / multikey indexes** are only used in ways that still allow covering (for example, you do **not** project the array field itself and you avoid operators like `$elemMatch` that require materializing array elements)
5. **Sub-document fields** are only used when the exact indexed path(s) provide all queried and returned values

Collaborator

Copilot is correct, original claim is wrong/too general.
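A minimal sketch of the corrected conditions, using the covered-query eval case's fields: every filtered and returned field is an index key, `_id` is excluded from the projection, and none of the fields is an array:

```javascript
// Index that could cover the eval-case query:
const coveringIndex = { status: 1, plan: 1, email: 1 };

// Covered query: filter and projection both stay inside the index keys.
const filter = { status: "active", plan: "premium" };
const projection = { status: 1, plan: 1, email: 1, _id: 0 };

// Every filtered and projected field is an index key and _id is excluded,
// so the planner can answer from the index alone (explain output would show
// totalDocsExamined: 0). None of these fields is an array, so the multikey
// covering caveats don't apply here.
```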

Comment on lines +150 to +152
**Limitations:**
- Cannot cover queries (must read documents)
- Compound multikey restriction: at most **one** array field per compound index
Copilot AI commented Mar 12, 2026

This bullet says multikey indexes cannot cover queries. MongoDB can produce covered queries with multikey indexes under certain constraints (e.g., avoid projecting the array field and avoid $elemMatch). Please correct/qualify this limitation so the skill’s guidance stays accurate.

{
"id": "optimize-this-query",
"prompt": "How do I optimize this MongoDB find? db.orders.find({ status: 'open' }).sort({ createdAt: -1 })",
"expect_skill_invocation": true,
Copilot AI commented Mar 12, 2026

The workspace README and testing/mongodb-query-optimizer/evals/evals.json use expect_invoke, but this case metadata uses expect_skill_invocation. If your eval harness keys off expect_invoke, these cases won’t be asserted correctly. Please standardize the field name across the registry + per-case eval_metadata.json (either rename here to expect_invoke or update the registry/docs/harness to match).

Suggested change
"expect_skill_invocation": true,
"expect_invoke": true,

{
"id": "optimize-aggregation-pipeline",
"prompt": "Can you optimize this aggregation pipeline? It's running slow: db.orders.aggregate([{ $lookup: { from: 'customers', localField: 'customerId', foreignField: '_id', as: 'customer' } }, { $match: { status: 'shipped' } }, { $group: { _id: '$customer.region', total: { $sum: '$amount' } } }])",
"expect_skill_invocation": true,
Copilot AI commented Mar 12, 2026

The workspace README and testing/mongodb-query-optimizer/evals/evals.json use expect_invoke, but this case metadata uses expect_skill_invocation. If your eval harness keys off expect_invoke, these cases won’t be asserted correctly. Please standardize the field name across the registry + per-case eval_metadata.json (either rename here to expect_invoke or update the registry/docs/harness to match).

Suggested change
"expect_skill_invocation": true,
"expect_invoke": true,

{
"id": "elemMatch-array-query",
"prompt": "Why is my array query returning wrong results? db.orders.find({ 'items.sku': 'ABC123', 'items.quantity': { $gte: 5 } }). Sometimes it matches orders where ABC123 has quantity 2 and a different item has quantity 10.",
"expect_skill_invocation": true,
Copilot AI commented Mar 12, 2026

The workspace README and testing/mongodb-query-optimizer/evals/evals.json use expect_invoke, but this case metadata uses expect_skill_invocation. If your eval harness keys off expect_invoke, these cases won’t be asserted correctly. Please standardize the field name across the registry + per-case eval_metadata.json (either rename here to expect_invoke or update the registry/docs/harness to match).

Suggested change
"expect_skill_invocation": true,
"expect_invoke": true,

- More index entries per document (proportional to array size)
- Larger index size on disk and in memory
- Higher write cost (must update multiple index entries per document)
- Cannot cover queries (MongoDB limitation)
Copilot AI commented Mar 12, 2026

This list of multikey “Costs” says multikey indexes cannot cover queries. MongoDB can produce covered queries using a multikey index under specific constraints (e.g., don’t project the array field and avoid $elemMatch). Please qualify or correct this statement so readers don’t incorrectly rule out covered-query optimizations.

Suggested change
- Cannot cover queries (MongoDB limitation)
- Covered queries are limited: multikey indexes can only cover a query when the indexed array field itself is not projected and operators such as `$elemMatch` are avoided

Comment on lines +138 to +142
**Requirements:**
- All queried fields are in the index
- All returned fields are in the index (or `_id` only)
- Projection must exclude `_id` unless `_id` is in the index
- No array fields indexed (multikey indexes cannot cover)
Copilot AI commented Mar 12, 2026

This requirement says multikey indexes cannot cover queries. MongoDB can produce covered queries with multikey indexes under specific constraints (e.g., don’t project the array field and avoid $elemMatch). Please correct/qualify this so the skill doesn’t incorrectly block covered-query advice.

@asya999 (Collaborator) left a comment

I’ll leave more comments later today

])
```

## Minimize $unwind usage
Collaborator

That’s not a thing. $unwind is not expensive. However the match before unwind example is good.

Collaborator Author

ok going to rephrase this here

```

**Consider alternatives:**
- `$filter` array operator to process arrays without unwinding
Collaborator

This is only appropriate if there is an unwind and then group by _id

Collaborator Author

ok going to specify that here
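The constraint the reviewer describes — `$filter` is only a drop-in replacement when the pipeline unwinds and then regroups by `_id` — sketched with hypothetical fields:

```javascript
// Instead of re-assembling the array after taking it apart:
//   { $unwind: "$items" } -> { $match: { "items.qty": { $gt: 1 } } } -> { $group: { _id: "$_id", ... } }
// a single $filter keeps the matching elements without unwinding:
const stage = {
  $project: {
    items: {
      $filter: {
        input: "$items",
        as: "item",
        cond: { $gt: ["$$item.qty", 1] },
      },
    },
  },
};
```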

@nirinchev changed the title from "MongoDB Query Optimizer Skill" to "MongoDB Query Optimizer Skill MCP-421" on Mar 12, 2026
@asya999 (Collaborator) left a comment

About half way through

{ $match: { active: true } }, // Reduce left side
{ $lookup: {
from: "inventory",
localField: "product_id", // Ensure this is indexed
Collaborator

Wait, why?

foreignField: "_id", // _id is always indexed
pipeline: [
{ $match: { inStock: true } }, // Reduce right side
{ $project: { name: 1, price: 1 } } // Limit fields
Collaborator

Probably can exclude _id?

{ $project: { name: 1, price: 1 } } // Limit fields
],
as: "product"
}}
Collaborator

Also I would expect to see $unwind here since there is only one product.
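Taking the three comments on this excerpt together, a sketch of how it might read (not the final skill text; names are from the excerpt):

```javascript
const pipeline = [
  { $match: { active: true } }, // reduce the left side first
  {
    $lookup: {
      from: "inventory",
      localField: "product_id",
      foreignField: "_id", // _id is always indexed
      pipeline: [
        { $match: { inStock: true } }, // reduce the right side
        { $project: { _id: 0, name: 1, price: 1 } }, // exclude _id, keep only needed fields
      ],
      as: "product",
    },
  },
  { $unwind: "$product" }, // at most one matching product per document
];
```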


1. **Filter before grouping** to reduce input documents
2. **Use selective _id** for grouping to reduce number of groups
3. **Project only needed fields** before $group to reduce memory per document
Collaborator

That’s wrong. Only needed fields should be kept in group but no projections should be done before group.

```javascript
[
{ $match: { date: { $gte: ISODate("2024-01-01") } } }, // Reduce input
{ $project: { category: 1, amount: 1 } }, // Only needed fields
Collaborator

Anti-pattern: this is literally an example of a wrong pipeline.


This should be an example of what's wrong under "don't use $project unnecessarily or too early".

**Causes:**
- No suitable index exists
- Query optimizer chose not to use available index (rare)
- Query shape doesn't match any index prefix

Same as no suitable index.


### SORT (in-memory sort)

**Expensive:** Sorting results in memory after retrieval.

Can be (without limit)

```

**Why this is costly:**
- Uses memory (32MB limit per sort in MongoDB 3.4+, configurable with `internalQueryExecMaxBlockingSortBytes`)

Remove the internal config reference. The limit is 100MB.

**Why this is costly:**
- Uses memory (32MB limit per sort in MongoDB 3.4+, configurable with `internalQueryExecMaxBlockingSortBytes`)
- Requires buffering all results before returning first document
- Blocking operation (can't stream results)

These two say the same thing.


### SHARDING_FILTER

In sharded clusters, filters out documents not owned by the shard. Normal overhead in sharded environment.

@asya999 asya999 Mar 13, 2026


ugh - this has more or less overhead depending on the index used. They should prefer to have an index with shard key fields so the filtering can happen against the index, but I'm not sure if there is a concise way to say it - an example could help.
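
A sketch of the kind of example that could help here (the collection, shard key, and field names are all hypothetical):

```javascript
// Hypothetical collection sharded on { region: 1 }
sh.shardCollection("shop.orders", { region: 1 })

// Index without the shard key fields: SHARDING_FILTER must fetch each
// document to decide whether this shard owns it.
db.orders.createIndex({ status: 1 })

// Index that includes the shard key fields: ownership filtering can
// happen against the index entries instead of fetched documents.
db.orders.createIndex({ status: 1, region: 1 })
```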


@dacharyc dacharyc left a comment


I've got a handful of comments so far from the review-skill analysis/LLM-as-judge scoring - I'll take a deeper look later today for things that the skill won't catch, but wanted to give you this as a starting point.

- **`explain`** — Invoke with server = MongoDB MCP server, toolName = `explain`, arguments = object with `database`, `collection`, `method` (one find/aggregate/count shape), and `verbosity`.
- **Always load `references/explain-interpretation.md`** when running explain to interpret output correctly
- **Verbosity selection:**
- Use `"executionStats"` when query likely completes in <10 seconds (small collections, simple queries, good indexes)

For this path, "query likely completes in <10 seconds" - an agent can't estimate this. Can we replace with concrete criteria (e.g., collection size, index availability)?


shouldn't it just use maxTimeMs or whatever client short-circuit timeout in ALL cases?

- **Running explain()** → **Always** load `references/explain-interpretation.md` when calling the explain MCP tool
- **Aggregation pipelines** → Load `references/aggregation-optimization.md` **first** when dealing with aggregate pipelines
- **Queries on array fields** → Load `references/multikey-arrays.md` when fields have array values
- **Covered query optimization** → Load `references/covered-queries.md` when the query is able to have all fields fully covered

For this line, "when the query is able to have all fields fully covered" - the agent can't determine this pre-analysis. Can we add observable triggers (user asks about covered queries, or explain shows FETCH with eliminable fields)?

| Operation value | Use when |
|----------------|----------|
| `slowQueryLogs` | User asks what’s slow on the cluster, which queries to fix first—**prioritize by slowest and most frequent** when the response exposes those dimensions. Optional: `namespaces` to scope to a collection; `since` for a time window. |
| `suggestedIndexes` | PA index recommendations—**validate** they apply to the user’s query and will have good impact before recommending creation. |

For this line, "validate they apply to the user's query and will have good impact" - "good impact" is undefined. Can we be specific about what the agent should check for? Maybe something like: confirm the suggested index matches the user's query predicates and sort.


yeah I'm a bit concerned it'll get hit by one of the existing bugs that causes "bad" PA index recommendations... but I'd hate to see us encode that here rather than just fixing them...


Index `{a:1, b:1}` supports `sort({a:1, b:1})` and reverse `sort({a:-1, b:-1})`, but NOT mixed directions like `sort({a:1, b:-1})`. For mixed sorts, create index matching exact pattern.
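
For the mixed-direction case, a sketch (field names illustrative):

```javascript
// sort({a: 1, b: -1}) cannot use { a: 1, b: 1 }; it needs an index
// with the same mixed directions:
db.coll.createIndex({ a: 1, b: -1 })
// This index also supports the exact reverse, sort({a: -1, b: 1}).
```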

### Covered queries

For these two sections - the inline "Covered queries" section (lines 134–144) and "Array fields and multikey indexes" section (lines 146–156) duplicate their respective reference files. Since SKILL.md loads on every invocation (~2,390 tokens) and references load conditionally, can we consider trimming these to one-line triggers?


Agreed but also a lot of things are not quite correct and I'd prefer we have to fix it in one place only, not two.


The review-skill skill's LLM-as-judge-scoring notes this file doesn't contain any novel information - it's just restating things that are covered in MongoDB documentation + is already in the model's base training data set. Can we examine this content and see if any of it can be trimmed, or if the entire file could be removed from the skill? Did you do any testing with and without the skill loaded for concepts covered in this file to see if including it makes any difference in agent performance on tasks that would reference this content?


This reference file scored a "3" on Novelty in the LLM-as-judge scoring, which meets the threshold for novelty but hints that it is not highly novel. The takeaway is that some of this content may be restating standard MongoDB documentation/data that is already in the LLM base training data. Have we tested agent performance with and without this file included in the skill to see if it actually improves outputs? I'm wondering if we can trim some of the non-novel content and still see any output improvement, or if we should consider restructuring to be more workflow-focused or otherwise find ways to emphasize any novel guidance aspects we're including here. Or if it may make sense to omit this reference if it doesn't actually improve agent outputs.


In addition, some of the things that are flagged as being "novel" are incorrect or should not be mentioned as they are not to be changed by user/agent (like the internal sort/group settings).


This reference file scored a "3" on Novelty in the LLM-as-judge scoring, which meets the threshold for novelty but hints that it is not highly novel. The takeaway is that some of this content may be restating standard MongoDB documentation/data that is already in the LLM base training data. Have we tested agent performance with and without this file included in the skill to see if it actually improves outputs? I'm wondering if we can trim some of the non-novel content and still see any output improvement, or if we should consider restructuring to be more workflow-focused or otherwise find ways to emphasize any novel guidance aspects we're including here. Or if it may make sense to omit this reference if it doesn't actually improve agent outputs.


@asya999 asya999 left a comment


additional comments

}
}
}
}

Is this a real example? The query isn’t included. 28 keys and 25 docs examined seems sus also.

- Queries returning most documents
- Collection fits entirely in cache

**Optimizer may choose COLLSCAN if:**

Never when there is a qualified index


In addition, some of the things that are flagged as being "novel" are incorrect or should not be mentioned as they are not to be changed by user/agent (like the internal sort/group settings).

db.orders.aggregate([
{ $lookup: { ... } }, // Processes all documents first
{ $match: { status: "shipped" } } // Too late, already did expensive work
])

Not a good example, as in the current version we automatically swap $match in front of $lookup in this case.

```javascript
db.orders.aggregate([
{ $match: { status: "shipped", region: "US" } }, // Uses index, reduces data early
{ $lookup: { ... } },

should there be an unwind here? The only way this pipeline makes sense is if $group is grouping on something that came out of $lookup and grouping by array things is ... uncommon.


### Cardinality and selectivity

High-cardinality fields (many distinct values) are more selective—prefer for equality predicates. Low-cardinality fields (e.g., `status`) help in compound indexes. Good selectivity: `totalKeysExamined` ≈ `nReturned` in explain output.

wait, no. All equality fields are equally valuable - selectivity only matters overall for the query, not for individual fields...



"iteration": "iteration-1",
"path": "iteration-1/array-field-query",
"prompt": "How do I optimize a query on an array field? db.products.find({ tags: 'electronics', 'specs.features': { $all: ['wifi', 'bluetooth'] } }). The tags array has hundreds of values per document.",
"expect_invoke": true,

there is some confusion here. Tags is an array but "specs.features" is being queried with an array operator...

"path": "iteration-1/aggregation-unwind-performance",
"prompt": "My aggregation is timing out. It uses $unwind on an array with thousands of elements per document. Pipeline: [{ $unwind: '$events' }, { $match: { 'events.type': 'purchase' } }, { $group: { _id: '$userId', count: { $sum: 1 } } }]. How do I fix this?",
"expect_invoke": true,
"tags": ["aggregation", "unwind", "performance", "optimization", "references"]

when we remove unwind section from agg section we should ensure all the tags referencing it are also removed - the issue here is not $unwind it's the fact that there is no $match in front of it.
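
A sketch of the fix being described, reusing the hypothetical pipeline from the prompt:

```javascript
db.coll.aggregate([
  // Filter whole documents first so $unwind has far less to multiply.
  { $match: { "events.type": "purchase" } },
  { $unwind: "$events" },
  // Re-apply the predicate per unwound element.
  { $match: { "events.type": "purchase" } },
  { $group: { _id: "$userId", count: { $sum: 1 } } }
])
```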

"prompt": "My aggregation is timing out. It uses $unwind on an array with thousands of elements per document. Pipeline: [{ $unwind: '$events' }, { $match: { 'events.type': 'purchase' } }, { $group: { _id: '$userId', count: { $sum: 1 } } }]. How do I fix this?",
"expect_skill_invocation": true,
"rationale": "Performance issue with aggregation pipeline — skill should invoke and load references/aggregation-optimization.md. Key problem: $unwind multiplies document count before filtering. Should suggest: 1) Add $match before $unwind to reduce documents, 2) Consider $filter array operator instead of $unwind + $match, 3) May need index on userId for $group, 4) Discuss allowDiskUse if needed. Should explain $unwind performance implications with large arrays.",
"optional_mcp": ["collection-indexes", "explain"]

All of this is incorrect except for adding $match before $unwind - same as the previous example, so I'm not sure of the value of having them both. You don't need an index on userId for $group; it cannot be used here.

@codeowners-service-app

Assigned misbamongo for team dbx-devtools because betsybutton is out of office.

Copilot AI review requested due to automatic review settings March 20, 2026 12:59

Copilot AI left a comment


Pull request overview

Copilot reviewed 22 out of 22 changed files in this pull request and generated 11 comments.



)
```

**Why:** `findOneAndUpdate` writes a copy of the pre-change document to a side collection for retryable writes. This overhead is unnecessary if you don't need the returned document.

Copilot AI Mar 20, 2026


The rationale for avoiding findOneAndUpdate is overstated as written. The extra overhead (returning the document and, for retryable writes, possibly recording pre/post images) is conditional and not simply “writes a copy … to a side collection” in all cases. Please tighten this explanation to describe the actual conditions that introduce the extra work and keep the recommendation precise.

Suggested change
**Why:** `findOneAndUpdate` writes a copy of the pre-change document to a side collection for retryable writes. This overhead is unnecessary if you don't need the returned document.
**Why:** `findOneAndUpdate` must build and return a result document and, when retryable writes with pre/post images are enabled, may perform additional internal bookkeeping to record those images. If you don't need the modified document back, `updateOne` avoids this extra work.

{
"id": "optimize-this-query",
"prompt": "How do I optimize this MongoDB find? db.orders.find({ status: 'open' }).sort({ createdAt: -1 })",
"expect_skill_invocation": true,

Copilot AI Mar 20, 2026


This case uses expect_skill_invocation, while the registry (testing/mongodb-query-optimizer/evals/evals.json) and workspace README refer to expect_invoke. Align the field name across the workspace so eval runners don’t need special-casing.

Suggested change
"expect_skill_invocation": true,
"expect_invoke": true,

Comment on lines +294 to +296
**Optimizer willmay choose COLLSCAN in absence of an eligible index.f:**

- Index selectivity is poor for this query

Copilot AI Mar 20, 2026


This heading sentence is garbled, which makes the guidance hard to read and search. Please fix the wording/typos so it reads as a coherent statement about when the optimizer may pick a COLLSCAN.

Comment on lines +186 to +187
Blocking stages (such as in-memory $sort and $group) have a 100MB memory limit per stage. Now, the default behavior when this limit is exceeded is to spill to disk automatically (allowDiskUse defaults to true).


Copilot AI Mar 20, 2026


This states that spilling to disk is automatic because allowDiskUse “defaults to true”. That’s not generally correct—disk spilling for blocking stages depends on allowDiskUse (or a server-level default like allowDiskUseByDefault), and many deployments default to not spilling. Please adjust this to describe the actual condition under which spilling occurs, and mention how to enable it if desired.

Suggested change
Blocking stages (such as in-memory $sort and $group) have a 100MB memory limit per stage. Now, the default behavior when this limit is exceeded is to spill to disk automatically (allowDiskUse defaults to true).
Blocking stages (such as in-memory $sort and $group) have a 100MB memory limit per stage. Whether they can spill to disk when this limit is exceeded depends on the aggregation's allowDiskUse setting (or a server-level default such as allowDiskUseByDefault).
If allowDiskUse is false and a blocking stage exceeds its memory limit, the operation fails with an error instead of spilling. To allow spilling, run the aggregation with { allowDiskUse: true } or configure an appropriate server-level default.
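
If the suggested wording is adopted, opting in explicitly looks like this (a sketch; the pipeline contents are placeholders):

```javascript
db.orders.aggregate(
  [
    { $sort: { amount: -1 } },
    { $group: { _id: "$region", total: { $sum: "$amount" } } }
  ],
  // Permit blocking stages to spill to disk past the 100MB per-stage limit.
  { allowDiskUse: true }
)
```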


- **Equality** fields first (e.g. `{field: value}`, `{$in: [...]}` matching \<= 200 elements, `{field: {$eq: value}}`)
- **Sort** fields next
- **Range** fields last (e.g. `$gt`, `$lt`, `$gte`, `$lte`, `{$in: [...]}`matching \> 200 elements in the array), `$ne`)

Copilot AI Mar 20, 2026


The ESR bullet for “Range” has a typo and mismatched punctuation/parentheses (e.g., {$in: [...]}matching ...), which makes the rule harder to parse. Please fix spacing and the parenthesis structure so the examples read cleanly and unambiguously.

Suggested change
- **Range** fields last (e.g. `$gt`, `$lt`, `$gte`, `$lte`, `{$in: [...]}`matching \> 200 elements in the array), `$ne`)
- **Range** fields last (e.g. `$gt`, `$lt`, `$gte`, `$lte`, `{$in: [...]}` matching \> 200 elements in the array, `$ne`)
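
The E→S→R ordering can be sketched as a tiny helper (everything here, including `esrOrder` and its inputs, is made up for illustration):

```javascript
// Order compound-index fields by the ESR rule: Equality, then Sort,
// then Range. Relies on Array.prototype.sort being stable.
function esrOrder(fields) {
  const rank = { E: 0, S: 1, R: 2 };
  return [...fields].sort((a, b) => rank[a.kind] - rank[b.kind]).map(f => f.name);
}

// For find({ status: 'open', qty: { $gt: 5 } }).sort({ createdAt: -1 }):
const order = esrOrder([
  { name: "qty", kind: "R" },       // range predicate
  { name: "createdAt", kind: "S" }, // sort key
  { name: "status", kind: "E" },    // equality predicate
]);
// order: ["status", "createdAt", "qty"],
// i.e. createIndex({ status: 1, createdAt: -1, qty: 1 })
```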

"id": "covered-query-optimization",
"prompt": "I want to make this query as fast as possible without reading documents. Query: db.users.find({ status: 'active', plan: 'premium' }, { status: 1, plan: 1, email: 1, _id: 0 }).sort({ email: 1 }). Can this be a covered query?",
"expect_skill_invocation": true,
"rationale": "User asks about covered query optimization explicitly. Skill should invoke and load references/covered-queries.md. Should explain requirements (all fields in index, _id: 0), suggest compound index { status: 1, plan: 1, email: 1 } following ESR, verify with explain showing totalDocsExamined: 0.",

Copilot AI Mar 20, 2026


The rationale references references/covered-queries.md, but that file does not exist under skills/mongodb-query-optimizer/references/. Update the reference to an existing doc (e.g., the “Covered Queries” section in references/core-indexing-principles.md) or add the missing reference file so the eval guidance is actionable.

Suggested change
"rationale": "User asks about covered query optimization explicitly. Skill should invoke and load references/covered-queries.md. Should explain requirements (all fields in index, _id: 0), suggest compound index { status: 1, plan: 1, email: 1 } following ESR, verify with explain showing totalDocsExamined: 0.",
"rationale": "User asks about covered query optimization explicitly. Skill should invoke and load references/core-indexing-principles.md (Covered Queries section). Should explain requirements (all fields in index, _id: 0), suggest compound index { status: 1, plan: 1, email: 1 } following ESR, verify with explain showing totalDocsExamined: 0.",

"id": "optimize-aggregation-pipeline",
"prompt": "Can you optimize this aggregation pipeline? It's running slow: db.orders.aggregate([{ $lookup: { from: 'customers', localField: 'customerId', foreignField: '_id', as: 'customer' } }, { $match: { status: 'shipped' } }, { $group: { _id: '$customer.region', total: { $sum: '$amount' } } }])",
"expect_skill_invocation": true,
"rationale": "Explicit 'optimize' + aggregation pipeline — skill should invoke and MUST load references/aggregation-optimization.md. Key issue: $match after $lookup (should be before). Should suggest moving $match earlier and checking indexes.",

Copilot AI Mar 20, 2026


The rationale says to load references/aggregation-optimization.md, but the repo contains references/aggregation-pipeline-examples.md instead (and no aggregation-optimization.md). Update this reference so the eval points at a real file, otherwise the evaluation instructions are impossible to follow.

Suggested change
"rationale": "Explicit 'optimize' + aggregation pipeline — skill should invoke and MUST load references/aggregation-optimization.md. Key issue: $match after $lookup (should be before). Should suggest moving $match earlier and checking indexes.",
"rationale": "Explicit 'optimize' + aggregation pipeline — skill should invoke and MUST load references/aggregation-pipeline-examples.md. Key issue: $match after $lookup (should be before). Should suggest moving $match earlier and checking indexes.",

"id": "aggregation-unwind-performance",
"prompt": "My aggregation is timing out. It uses $unwind on an array with thousands of elements per document. Pipeline: [{ $unwind: '$events' }, { $match: { 'events.type': 'purchase' } }, { $group: { _id: '$userId', count: { $sum: 1 } } }]. How do I fix this?",
"expect_skill_invocation": true,
"rationale": "Performance issue with aggregation pipeline — skill should invoke and load references/aggregation-optimization.md. Key problem: $unwind multiplies document count before filtering. Should suggest: 1) Add $match before $unwind to reduce documents, 2) Consider $filter array operator instead of $unwind + $match, 3) May need index on userId for $group, 4) Discuss allowDiskUse if needed. Should explain $unwind performance implications with large arrays.",

Copilot AI Mar 20, 2026


The rationale says to load references/aggregation-optimization.md, but that file does not exist under skills/mongodb-query-optimizer/references/ (the closest match is aggregation-pipeline-examples.md). Update the reference so this eval metadata doesn’t point to a missing file.

Suggested change
"rationale": "Performance issue with aggregation pipeline — skill should invoke and load references/aggregation-optimization.md. Key problem: $unwind multiplies document count before filtering. Should suggest: 1) Add $match before $unwind to reduce documents, 2) Consider $filter array operator instead of $unwind + $match, 3) May need index on userId for $group, 4) Discuss allowDiskUse if needed. Should explain $unwind performance implications with large arrays.",
"rationale": "Performance issue with aggregation pipeline — skill should invoke and load references/aggregation-pipeline-examples.md. Key problem: $unwind multiplies document count before filtering. Should suggest: 1) Add $match before $unwind to reduce documents, 2) Consider $filter array operator instead of $unwind + $match, 3) May need index on userId for $group, 4) Discuss allowDiskUse if needed. Should explain $unwind performance implications with large arrays.",

Comment on lines +3 to +19
**Before** — `$exists: true` on a regular index still requires a document fetch:

```javascript
db.collection.createIndex({ a: 1 })
db.collection.find({ a: { $exists: true } })
// Cannot efficiently answer — null semantics require checking each document
```

**After** — Use a sparse index, which only contains entries where the field exists:

```javascript
db.collection.createIndex({ a: 1 }, { sparse: true })
db.collection.find({ a: { $exists: true } })
// Answered directly from the index — no document fetch needed
```

**Why:** Regular indexes store `null` for both missing and explicitly-null fields, so `$exists` can't distinguish them without fetching. Sparse indexes only store entries for documents where the field exists.

Copilot AI Mar 20, 2026


The $exists + sparse-index guidance appears incorrect/misleading: for a single-field index, documents missing the field typically have no index entry, so { a: { $exists: true } } can usually be satisfied via an index scan without requiring a sparse index. Also, sparse indexes change query semantics in some cases and shouldn’t be recommended solely to “avoid a document fetch” here. Please revise this example to reflect correct index behavior (and call out the real tradeoffs/when sparse or partial indexes help).

Comment on lines +11 to +17
**After** — Use aggregation-based update to generate smaller oplog deltas:

```javascript
db.coll.updateOne({ _id: X }, [{ $replaceWith: { $literal: entireNewDocument } }])
```

**Why:** `replaceOne` writes the full document to the oplog. The aggregation update syntax lets MongoDB compute deltas, resulting in smaller oplog entries when only a few fields are changed.

Copilot AI Mar 20, 2026


This suggests that using an aggregation-pipeline update with $replaceWith produces “smaller oplog deltas” than replaceOne. An update pipeline that replaces the whole document is still a full replacement in effect, and may not reduce oplog size vs replaceOne. If the goal is smaller oplog entries, the example should demonstrate updating only changed fields (e.g., $set/$unset) and explain when MongoDB can record diffs vs full replacements.
