add locomo results for MemMachine v0.2 #41
Open

tomw-mv wants to merge 6 commits into MemMachine:main from tomw-mv:main
Commits (6):

- f2fb10a add locomo results for MemMachine v0.2 (tomw-mv)
- 2f04f3a try to fix table formatting (tomw-mv)
- 67e0ec6 Resolve comments to index.md, add feature image. (tomw-mv)
- 5d6da7f resolve comments in index.md (tomw-mv)
- 30df5aa resolve comments in index.md (tomw-mv)
- 56dce52 update feature image, index.md (tomw-mv)
@@ -0,0 +1,200 @@
---
title: "MemMachine v0.2 release - LoCoMo benchmark results"
date: 2025-12-01T13:22:00-08:00
featured_image: "featured_image.png"
tags: ["AI Agent", "LoCoMo Benchmark", "Generative AI", "LLM", "Agent Memory", "featured"]
author: "The MemMachine Team"
description: "This post presents the results of the Mem0 LoCoMo benchmark evaluation for the MemMachine v0.2 release."
aliases:
  - /blog/2025/12/locomo-results-v0.2/
---
# MemMachine v0.2 release - LoCoMo benchmark results
## Introduction

This post presents the results of the Mem0 LoCoMo benchmark evaluation for the MemMachine v0.2 release. The new release delivers significant improvements over the previous release, v0.1.5, in many aspects of the memory system.

The test environment is set up as follows.
### Eval-LLM

The eval-LLM is the chat LLM used to answer the questions in the LoCoMo benchmark. The choice of eval-LLM can significantly influence the resulting score. The Mem0 evaluation of the LoCoMo benchmark has historically used OpenAI gpt-4o-mini as the eval-LLM. To compare different products, the same eval-LLM is used for all products under test.

Since the original Mem0 evaluation of the LoCoMo benchmark was published, OpenAI has introduced the newer gpt-4.1-mini model. In this post, we also compare the original gpt-4o-mini against the newer gpt-4.1-mini when used as the eval-LLM.
### Vector DB embedder

In a memory system, the embedder is the essential element used to index the memories in a chat history. It enables retrieval of saved memories so that questions can be answered correctly and factually. The choice of embedder can significantly influence the quality of the answers the memory system provides. The Mem0 evaluation of the LoCoMo benchmark has historically used the OpenAI text-embedding-3-small embedder. To compare different products, the same embedder is used for all products under test.
### Judge-LLM

The judge-LLM is the chat LLM used to judge whether the eval-LLM's response correctly answers a question in the LoCoMo benchmark.

The choice of judge-LLM can significantly influence the resulting score: different LLMs may give false positives or false negatives when judging the same response. The Mem0 evaluation of the LoCoMo benchmark has historically used OpenAI gpt-4o-mini as the judge-LLM. To compare different products, the same judge-LLM is used for all products under test.
### Reranker

In a memory system, the reranker re-evaluates the best-matching memories retrieved by the embedder, re-sorting the retrieved results so that the best matches move to the top. This provides a second level of evaluation, yielding the best set of saved memories for answering questions correctly and factually. The choice of reranker can significantly influence the quality of the answers the memory system provides. MemMachine v0.2 uses AWS cohere.rerank-v3-5:0 as the reranker for the LoCoMo benchmark.
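The retrieve-then-rerank flow described above can be sketched as follows. This is a minimal illustration, not MemMachine's implementation: the real embedder and the cohere.rerank-v3-5:0 model are stood in for by cosine similarity over toy vectors and a hypothetical word-overlap scorer.

```python
from math import sqrt

def cosine(a, b):
    """Similarity used by the first-stage (embedder) retrieval."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (sqrt(sum(x * x for x in a)) * sqrt(sum(x * x for x in b)))

def retrieve(query_vec, memories, k=3):
    """Stage 1: embedder-based retrieval of the k nearest memories."""
    ranked = sorted(memories, key=lambda m: cosine(query_vec, m["vec"]), reverse=True)
    return ranked[:k]

def overlap_score(query, text):
    """Hypothetical stand-in for a real reranker model's relevance score."""
    tok = lambda s: {w.strip("?.,!").lower() for w in s.split()}
    return len(tok(query) & tok(text))

def rerank(query, candidates, score=overlap_score):
    """Stage 2: re-sort the retrieved candidates so the best matches rise to the top."""
    return sorted(candidates, key=lambda m: score(query, m["text"]), reverse=True)

# Toy memories with made-up 3-dimensional "embeddings"
memories = [
    {"text": "Alice adopted a cat in June", "vec": [1.0, 0.0, 0.0]},
    {"text": "Alice moved to Boston", "vec": [0.9, 0.1, 0.0]},
    {"text": "Bob likes hiking", "vec": [0.0, 1.0, 0.0]},
]
candidates = retrieve([1.0, 0.0, 0.0], memories, k=2)
best = rerank("When did Alice adopt a cat?", candidates)[0]
```

The point of the second stage is that the reranker sees the query and memory text together, so it can promote a candidate the vector search ranked lower.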
### Question categories

The original LoCoMo benchmark has 5 categories of questions. The Mem0 evaluation of the LoCoMo benchmark uses questions from 4 of the 5 categories:

| Category number | Description | Total questions |
| --------------- | ----------- | --------------- |
| **1** | **Single-Hop:** Questions asking for specific facts directly mentioned in a single session of the conversation. | 282 |
| **2** | **Temporal Reasoning:** Questions that can be answered through temporal reasoning and time-related cues captured within the conversation. | 321 |
| **3** | **Multi-Hop:** Questions that require synthesizing information from multiple sessions. | 96 |
| **4** | **Open-Domain:** Questions that can be answered by integrating a speaker's provided information with external knowledge, such as commonsense or world facts. | 841 |
### LLM-score

For each question in the categories above, the judge-LLM compares the eval-LLM's response to the golden answer and assigns an llm-score of 1 if the answers match and 0 otherwise. The llm-score is tabulated per question category, and the weighted mean across categories gives the overall mean llm-score.
### Memory and agent modes

MemMachine works in either memory mode or agent mode. In memory mode, MemMachine directly provides the context for the question being asked, so there is a single request to the eval-LLM per question. In agent mode, MemMachine is presented to the eval-LLM as an OpenAI agent: when a question is asked, the eval-LLM uses the MemMachine agent as a tool to retrieve context, and it may perform several rounds of requests to the agent to formulate the best response.
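The difference between the two modes can be sketched with stub functions. The function names and shapes here are hypothetical, chosen only to contrast the single-request pattern with the multi-round tool-calling pattern; they are not MemMachine's actual API.

```python
def memory_mode_answer(question, memory_search, llm):
    """Memory mode: one retrieval, then a single eval-LLM request."""
    context = memory_search(question)
    prompt = f"Context:\n{context}\n\nQuestion: {question}"
    return llm(prompt)

def agent_mode_answer(question, memory_tool, agent_llm, max_rounds=3):
    """Agent mode: the eval-LLM drives retrieval via tool calls over several rounds."""
    gathered = []
    for _ in range(max_rounds):
        # The agent LLM either asks for another memory search or produces an answer.
        kind, payload = agent_llm(question, gathered)
        if kind == "answer":
            return payload
        gathered.append(memory_tool(payload))
    raise RuntimeError("agent did not produce an answer within max_rounds")

# Stubs standing in for the real memory system and LLM
def fake_memory(query):
    return "Melanie ran a marathon in 2023."

def fake_agent_llm(question, gathered):
    return ("search", question) if not gathered else ("answer", "2023")
```

Agent mode's extra rounds explain the roughly doubled input-token usage reported later: each tool-call round feeds the accumulated context back to the eval-LLM.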
## LLM-score results

Here are the observed llm-scores for MemMachine v0.2.
### LLM-score using gpt-4o-mini

The eval-LLM is gpt-4o-mini, for comparison against other memory products.
#### Memory mode (gpt-4o-mini)

Mean score per category:

| LoCoMo category | bleu-score | f1-score | llm-score | count |
| --------------- | ---------- | -------- | --------- | ----- |
| 1. single hop | 0.1407 | 0.1993 | 0.8759 | 282 |
| 2. temporal | 0.0977 | 0.1847 | 0.7352 | 321 |
| 3. multi hop | 0.0871 | 0.1191 | 0.7083 | 96 |
| 4. open domain | 0.1436 | 0.2519 | 0.9465 | 841 |

Overall mean score:

| Metric | Score |
| ------ | ----- |
| bleu-score | 0.1300 |
| f1-score | 0.2200 |
| llm-score | 0.8747 |
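As a sanity check, the overall mean llm-score can be reproduced as the question-count-weighted mean of the per-category scores:

```python
# Per-category (mean llm-score, question count) for memory mode with gpt-4o-mini
per_category = {
    "single hop":  (0.8759, 282),
    "temporal":    (0.7352, 321),
    "multi hop":   (0.7083, 96),
    "open domain": (0.9465, 841),
}

total_questions = sum(n for _, n in per_category.values())
overall = sum(s * n for s, n in per_category.values()) / total_questions
print(round(overall, 4))  # matches the reported overall llm-score of 0.8747
```

Note that the open-domain category contributes 841 of the 1540 questions, so it dominates the overall mean.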
#### Agent mode (gpt-4o-mini)

Mean score per category:

| LoCoMo category | bleu-score | f1-score | llm-score | count |
| --------------- | ---------- | -------- | --------- | ----- |
| 1. single hop | 0.1147 | 0.1684 | 0.8404 | 282 |
| 2. temporal | 0.1402 | 0.2242 | 0.8069 | 321 |
| 3. multi hop | 0.0666 | 0.1037 | 0.7396 | 96 |
| 4. open domain | 0.1415 | 0.2508 | 0.9394 | 841 |

Overall mean score:

| Metric | Score |
| ------ | ----- |
| bleu-score | 0.1316 |
| f1-score | 0.2210 |
| llm-score | 0.8812 |

![LLM-score results with gpt-4o-mini](memmachine-v0.2-llm_score-gpt-4o-mini.jpg)
||
| ### LLM-score using gpt-4.1-mini | ||
|
|
||
| The newer gpt-4.1-mini provides better results than the previous LLM. Here are the observed llm-scores for MemMachine v0.2 when the eval-LLM is gpt-4.1-mini. We also re-ran the Mem0 memory system using gpt-4.1-mini for comparison. | ||
|
|
||
|
|
||
#### Memory mode (gpt-4.1-mini)

Mean score per category:

| LoCoMo category | bleu-score | f1-score | llm-score | count |
| --------------- | ---------- | -------- | --------- | ----- |
| 1. single hop | 0.1795 | 0.2497 | 0.8972 | 282 |
| 2. temporal | 0.1521 | 0.2549 | 0.8910 | 321 |
| 3. multi hop | 0.1059 | 0.1429 | 0.7500 | 96 |
| 4. open domain | 0.1868 | 0.3127 | 0.9441 | 841 |

Overall mean score:

| Metric | Score |
| ------ | ----- |
| bleu-score | 0.1732 |
| f1-score | 0.2785 |
| llm-score | 0.9123 |
#### Agent mode (gpt-4.1-mini)

Mean score per category:

| LoCoMo category | bleu-score | f1-score | llm-score | count |
| --------------- | ---------- | -------- | --------- | ----- |
| 1. single hop | 0.1460 | 0.2125 | 0.8830 | 282 |
| 2. temporal | 0.1363 | 0.2366 | 0.9159 | 321 |
| 3. multi hop | 0.0744 | 0.1167 | 0.7188 | 96 |
| 4. open domain | 0.1613 | 0.2836 | 0.9512 | 841 |

Overall mean score:

| Metric | Score |
| ------ | ----- |
| bleu-score | 0.1479 |
| f1-score | 0.2503 |
| llm-score | 0.9169 |

![LLM-score results with gpt-4.1-mini](memmachine-v0.2-llm_score-gpt-4.1-mini.jpg)
## Token usage results

When a memory system is used, the retrieved memories are added to the question presented to the eval-LLM. The context generated by the memory system adds to the input (prompt) token usage, and the final prompt also influences how many output tokens the eval-LLM emits.

Here is the observed token usage for MemMachine in memory mode and in agent mode. Mem0 token usage is shown for comparison.
| memory system | input tokens | output tokens |
| ------------- | ------------ | ------------- |
| memmachine v0.2 gpt-4.1-mini memory mode | 4,199,096 | 43,169 |
| memmachine v0.2 gpt-4.1-mini agent mode | 8,571,936 | 93,210 |
| mem0 main/HEAD gpt-4.1-mini memory mode | 19,206,707 | 14,840 |

![Token usage comparison](memmachine-v0.2-token-usage.jpg)
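The relative input-token cost follows directly from the table above; this is just arithmetic on the reported numbers:

```python
# (input tokens, output tokens) from the token usage table
usage = {
    "memmachine memory mode": (4_199_096, 43_169),
    "memmachine agent mode":  (8_571_936, 93_210),
    "mem0 memory mode":       (19_206_707, 14_840),
}

# Mem0's input-token usage relative to MemMachine memory mode
mem0_vs_memmachine = usage["mem0 memory mode"][0] / usage["memmachine memory mode"][0]
# Agent mode's input-token usage relative to memory mode
agent_vs_memory = usage["memmachine agent mode"][0] / usage["memmachine memory mode"][0]
print(f"{mem0_vs_memmachine:.1f}x, {agent_vs_memory:.1f}x")  # about 4.6x and 2.0x
```

In other words, Mem0 consumed roughly 4.6 times the input tokens of MemMachine's memory mode on this run, and MemMachine's agent mode roughly doubled memory mode's input usage, consistent with its multi-round tool calls.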
## Search time results

The MemMachine v0.2 release includes many new optimizations in the handling of episodic memory. Both add-memory and search-memory times are significantly improved.

Add-memory time is reduced by approximately 75% compared to the previous release.

![Add-memory time](memmachine-v0.2-add-time.jpg)

Search-memory time is reduced by up to 75% compared to the previous release.

![Search-memory time](memmachine-v0.2-search-time.jpg)
## Conclusion

MemMachine v0.2 provides significant improvements over the previous release and is one of the highest-scoring memory systems on the market.
Binary files added:

- content/en/blog/2025/12/locomo-results-v0.2/memmachine-v0.2-add-time.jpg (+41.3 KB)
- content/en/blog/2025/12/locomo-results-v0.2/memmachine-v0.2-i_token-usage.jpg (+41 KB)
- .../en/blog/2025/12/locomo-results-v0.2/memmachine-v0.2-llm_score-gpt-4.1-mini.jpg (+48 KB)
- ...t/en/blog/2025/12/locomo-results-v0.2/memmachine-v0.2-llm_score-gpt-4o-mini.jpg (+60.4 KB)
- content/en/blog/2025/12/locomo-results-v0.2/memmachine-v0.2-o_token-usage.jpg (+47.6 KB)
- content/en/blog/2025/12/locomo-results-v0.2/memmachine-v0.2-search-time.jpg (+44.2 KB)
- content/en/blog/2025/12/locomo-results-v0.2/memmachine-v0.2-token-usage.jpg (+43.5 KB)