Skip to content

MongoDB slow for large databases #24

@eimrek

Description

@eimrek

I spent some time investigating the slowness of the li-ion-conductors optimade database.

Here's the info endpoint: https://dev-optimade.materialscloud.org/archive/li-ion-conductors/v1/info

Accessing the /structures endpoint takes over 2 minutes.

I turned on performance profiling (db.setProfilingLevel(1)) and here's the part of the log for accessing the /structures endpoint (only the slow commands should show up here, meaning everything else was fast, i think):

/structures profiling (click to show)
> db.system.profile.find().pretty()
{
	"op" : "command",
	"ns" : "li-ion-conductors.structures",
	"command" : {
		"aggregate" : "structures",
		"pipeline" : [
			{
				"$match" : {
					
				}
			},
			{
				"$group" : {
					"_id" : 1,
					"n" : {
						"$sum" : 1
					}
				}
			}
		],
		"cursor" : {
			
		},
		"lsid" : {
			"id" : UUID("f822cd93-f176-4405-8c3a-062a5c7e79d8")
		},
		"$db" : "li-ion-conductors"
	},
	"keysExamined" : 0,
	"docsExamined" : 4396695,
	"cursorExhausted" : true,
	"numYield" : 7910,
	"nreturned" : 1,
	"locks" : {
		"FeatureCompatibilityVersion" : {
			"acquireCount" : {
				"r" : NumberLong(7913)
			}
		},
		"ReplicationStateTransition" : {
			"acquireCount" : {
				"w" : NumberLong(7913)
			}
		},
		"Global" : {
			"acquireCount" : {
				"r" : NumberLong(7913)
			}
		},
		"Database" : {
			"acquireCount" : {
				"r" : NumberLong(7912)
			}
		},
		"Collection" : {
			"acquireCount" : {
				"r" : NumberLong(7912)
			}
		},
		"Mutex" : {
			"acquireCount" : {
				"r" : NumberLong(2)
			}
		}
	},
	"flowControl" : {
		
	},
	"storage" : {
		"data" : {
			"bytesRead" : NumberLong("17760670501"),
			"timeReadingMicros" : NumberLong(135074290)
		}
	},
	"responseLength" : 141,
	"protocol" : "op_msg",
	"millis" : 139328,
	"planSummary" : "COLLSCAN",
	"ts" : ISODate("2023-08-22T14:51:08.005Z"),
	"client" : "172.18.0.1",
	"allUsers" : [ ],
	"user" : ""
}

Some key points:

  • Basically, what seems to be the slow part, is just the counting of the total number of structures. ($match everything in the collection; group by $_id and then just sum the number).
  • it takes 139328 milliseconds to run
  • almost all of the time (135074 milliseconds) is spent on reading 17 GB of data from disk.
  • This command has to do a full scan of the documents (COLLSCAN). I don't think this command, in it's current state, could be sped up by using indexes.

I am wondering if this functionality could be implemented in a more efficient way. For example, db.structures.count() runs instantly.

Just for additional information, initially also accessing a single structure was as slow (2+ min), e.g. via

https://dev-optimade.materialscloud.org/archive/li-ion-conductors/v1/structures/5b5b4b01-5b7e-48ad-8e17-8077f9b0b5d2

But after I added the id index with db.structures.createIndex({ id: 1 }), it's fast now.

Pinging @ml-evs @unkcpz @superstar54 for comments/ideas regarding the "counting" speedup. i suspect this is probably something that should be adapted in https://github.com/Materials-Consortia/optimade-python-tools?

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions