MongoDB slow for large databases

I spent some time investigating the slowness of the `li-ion-conductors` optimade database.

Here's the info endpoint: https://dev-optimade.materialscloud.org/archive/li-ion-conductors/v1/info

Accessing the `/structures` endpoint takes over 2 minutes.

I turned on performance profiling (`db.setProfilingLevel(1)`) and here's the part of the log for accessing the `/structures` endpoint (only the slow commands should show up here, meaning everything else was fast, i think):

<details>
  <summary>/structures profiling (click to show)</summary>

```
> db.system.profile.find().pretty()
{
	"op" : "command",
	"ns" : "li-ion-conductors.structures",
	"command" : {
		"aggregate" : "structures",
		"pipeline" : [
			{
				"$match" : {
					
				}
			},
			{
				"$group" : {
					"_id" : 1,
					"n" : {
						"$sum" : 1
					}
				}
			}
		],
		"cursor" : {
			
		},
		"lsid" : {
			"id" : UUID("f822cd93-f176-4405-8c3a-062a5c7e79d8")
		},
		"$db" : "li-ion-conductors"
	},
	"keysExamined" : 0,
	"docsExamined" : 4396695,
	"cursorExhausted" : true,
	"numYield" : 7910,
	"nreturned" : 1,
	"locks" : {
		"FeatureCompatibilityVersion" : {
			"acquireCount" : {
				"r" : NumberLong(7913)
			}
		},
		"ReplicationStateTransition" : {
			"acquireCount" : {
				"w" : NumberLong(7913)
			}
		},
		"Global" : {
			"acquireCount" : {
				"r" : NumberLong(7913)
			}
		},
		"Database" : {
			"acquireCount" : {
				"r" : NumberLong(7912)
			}
		},
		"Collection" : {
			"acquireCount" : {
				"r" : NumberLong(7912)
			}
		},
		"Mutex" : {
			"acquireCount" : {
				"r" : NumberLong(2)
			}
		}
	},
	"flowControl" : {
		
	},
	"storage" : {
		"data" : {
			"bytesRead" : NumberLong("17760670501"),
			"timeReadingMicros" : NumberLong(135074290)
		}
	},
	"responseLength" : 141,
	"protocol" : "op_msg",
	"millis" : 139328,
	"planSummary" : "COLLSCAN",
	"ts" : ISODate("2023-08-22T14:51:08.005Z"),
	"client" : "172.18.0.1",
	"allUsers" : [ ],
	"user" : ""
}
```
</details>

Some key points:
* Basically, what seems to be the slow part, is just the counting of the total number of structures. (`$match` everything in the collection; group by `$_id` and then just sum the number).
* it takes 139328 milliseconds to run
* almost all of the time (135074 milliseconds) is spent on reading 17 GB of data from disk.
* This command has to do a full scan of the documents (`COLLSCAN`). I don't think this command, in it's current state, could be sped up by using indexes.

I am wondering if this functionality could be implemented in a more efficient way. For example, `db.structures.count()` runs instantly. 

Just for additional information, initially also accessing a single structure was as slow (2+ min), e.g. via

https://dev-optimade.materialscloud.org/archive/li-ion-conductors/v1/structures/5b5b4b01-5b7e-48ad-8e17-8077f9b0b5d2

But after I added the `id` index with `db.structures.createIndex({ id: 1 })`, it's fast now.

Pinging @ml-evs @unkcpz @superstar54 for comments/ideas regarding the "counting" speedup. i suspect this is probably something that should be adapted in https://github.com/Materials-Consortia/optimade-python-tools?

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

MongoDB slow for large databases #24

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

MongoDB slow for large databases #24

Description

Metadata

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Issue actions