-
-
Notifications
You must be signed in to change notification settings - Fork 115
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Metadata Provider: Improve startup times by avoiding directory scanning #1089
base: master
Are you sure you want to change the base?
Conversation
…rst before iterating over all files
Thank you very much for the detailed and thorough investigation! Yes, the current implementation is perhaps a bit naive, and focuses more on being cross-platform and avoiding corner cases, than on raw speed. In Qt, as far as I know QDirIterator is the only built-in method for iterating the contents of a directory, and with some useful filtering features. For smaller collections, and the original target of embedded Linux (eg. the Raspberry Pi) it had an acceptable performance. Later, requests for handling large datasets, and also platform changes, like the filesystem update in Android 10+ made the issue more apparent. Based on your analysis, it does seem like QDirIterator provides more information than what we need most of the time. Looking into the Qt code, it seems the iterator creates a file info object for each file, which in turn calls I was interested in how QDirIterator compares to other solutions, and made a little benchmark. Running on desktop Linux on a test set of ~40 000 files, the classic About the first QDirIteratorThe prefixed metadata file names is a feature that was asked for, and we should keep it. There are two important use cases:
So we cannot use the shortcut of just checking the existence of a particular path. It is also unlikely, but possible that the path is a directory. However, these checks can be done after we iterated through the file names, as opposed to querying each file, like QDirIterator. About the second QDirIteratorIf a Collection in the metadata file has an Originally Pegasus showed only the games where the game files actually existed. With large game sets, these file checks also took a considerable time, which is why an option was added to show every game, and assume that the metadata file is correctly filled with valid games. Thus the settings option only disables an additional filtering step, but not the game lookup step. However, note that this scanning only happens if the Collection has the properties mentioned above. If you know that your metadata file contains everything you need as a list of Game entries, then you can remove the In case scanning is necessary, the iteration there happens entirely over file names, so this is another place that would benefit from such optimization. |
@mmatyas appreciate the feedback ❤️ and if it's not clear, I love Pegasus -- great work. I should have mentioned as well, that I was trying Pegasus on several devices (Windows desktop, Retroid Pocket 3+ (Android), Ayn Odin 2 Pro (Android), and others). For some reason, Pegasus startup was especially slow on the RG405M where the other devices loaded a lot faster (<10s), using the same content set. So maybe there's something especially bad with the hardware/cpu/sd reader in the RG405M? I love your research into the other solutions, and yes a straight 1st QDirIterator Totally agree the prefixed metadata files feature should be kept! To be clear, the proposed PR here takes a short-cut to look for just If not found, it proceeds with a full directory scan, as before. The only downside I see is if an existing user has both a But I'm wondering, how common would that be? Someone using 2nd QDirIterator That's a good point. I had also tested just removing my My suggestion is that the option Only show existing games is logically equivalent today to "don't scan my files" when it's un-checked. This PR also applies that "don't scan my files" logic to the But maybe both of those changes are moot if we can convince ourselves Thanks for your feedback! Happy to make any updates if you want to take either of the above changes. |
Thanks for sharing your points! Hm, perhaps the issue on RG405M might not be the hardware, but the software? As far as I know, ever since the file system changes on Android, all file operations must go through a permission checking layer. This is where the RG405M might not excel, though the Retroid Pocket 3+ also supposed to have a recent Android, so I might be wrong here. 1st QDirIterator I certainly agree it would be a good shortcut, but I think making a difference between the file namings would introduce an unnecessary friction. I would like to avoid such points of "gotcha", where something has to be kept in mind because it is not obvious from our design. I believe having both kinds of files wouldn't be rare actually; I can imagine people to start splitting a large metadata file into smaller chunks when the main file gets too big. I think we can also assume some sort of categorization: while it is absolutely possible that someone might store even 100 000 games in a single directory, and a metadata file next to them, I feel most at that point would start making some subdirectories. If we assume humanly manageable directory sizes, as our general target, then a non-recursive, non- 2nd QDirIterator I agree the wording might not be the most clear there; we can probably improve that, or add a description to the option. Perhaps the documentation too could be extended with a page for the settings options. There are people who add details to every game, but also people who rely on auto-collecting all game files and assets, and specify details to none or only but a few of them in the metadata file. This option is also used by the optional third-party data sources, when a file is referenced in their data. A possible rewording should account for all the diverse use cases. Regarding Yes, now that we have a good candidate for optimizing the whole directory iteration, and possibly through the whole codebase, it is possible that the situation changes significantly. It's probably not a magic solution, but perhaps we can actually avoid any user-side changes, and keep things as internal optimizations. We can always do new measurements if our performance gains turn out to be less than expected. |
Sounds good to me! I'll take a stab at a lower-level |
News about this pull? |
I might actually have a little time today/next week for some experimenting. |
So I've replaced the QDirIterators in the metadata provider code, and while the result is measurably faster, by about 15–20% at best, it's not as fast as I was expecting it. Looking into some perf analysis it seems the string operations, UTF8 <-> UTF16 conversions, and calculating things like absolute paths do have a cost. The code could be still optimized a bit, and I've measured on a RAM-backed storage, so the results might not be representative. Some important notes: the dirent method will likely not work out of the box on Windows with MSVC, so different filesystem backends would need to be implemented. The fast path to use dirent itself too would rely on a non-POSIX standard field to tell the file's type, and when it comes to symlinks, stat is still needed. I'm also concerned about Unicode support, which I haven't tested yet. I've also run into issues with subdirectory iteration in the test files, but that might be on me. Let me know if you also made an implementation and got different results! |
What's the cause of this UTF8-UTF16 conversions, is it related to windows? The step further can be to have some sort of cache. All mtime of directories could be stored, and their files they have. |
Qt uses UTF16 internally, while different platforms use their platform-specific encoding. Some conversion will be always necessary, which is handled internally by Qt. Conversion has a cost, but this shouldn't be the bottleneck. I haven't tested my experiment on Windows yet, so it's entirely possible that the results would validate the implementation there. It is possible to change directory contents without updating the directory's modification date, so trusting purely that value would likely be inaccurate (but of course you mentioned this too in a different issue). It could be experimented with though. There are also other low-level directory iteration solutions we could try for more raw speed. |
There are a few open Issues sharing problems with Android startup times being slow, especially if there are a lot of files and directories to parse.
For example: e.g. #937 and #839
On one of my devices (Anbernic RG405M), I am using Pegasus with 30+ directories and 40k+ files, each with their own screenshot. Each directory has a complete
metadata.pegasus.txt
covering all of the files. Pegasus is regularly taking over a minute to start up. This is frustratingly slow, so I looked for opportunities to make it faster.In profiling the application startup, one thing I found is that the
pegasus_metadata
provider uses aQDirIterator
to go over each directory's contents, when it's looking for matchingmetadata.pegasus.txt
,metadata.txt
or*.metadata.pegasus.txt
files. I don't know a ton about QT, but from a few threads I've read,QDirIterator
may run astat()
on each file it finds (requesting more details from the file system). This is a lot more expensive than just getting a list of file names (with no details), which likely contributes to a slow directory processing time when reading directories with a lot of files.It also appears we do a full
QDirIterator
on each directory at least twice:*metadata*.txt
filesI first attempted a couple ways of making the
QDirIterators
themselves faster. Here are some startup time results for my tests:nameFilter
to the iterator and specifying onlymetadata.pegasus.txt
,metadata.txt
,*.metadata.pegasus.txt
and*.metadata.txt
: 50s (same as above)nameFilter
must still get andstat()
all files before returning, so this didn't save any time)QDir::entryList()
instead of aQDirIterator
- 50s (same as above)QDirIterator
)So, none of the above changes helped.
I believe each of the two
QDirIterators
phases was contributing to about 20-25s of startup time, so next I looked to find a way to avoidQDirIterator
entirely.Avoiding the first
QDirIterator
One fix we could apply to save time: when looking for matching metadata files, run a small
QDir::exists()
check on justmetadata.pegasus.txt
andmetadata.txt
first and return immediately without a full directory scan if so (avoiding the slowstat()
calls to the other files).In the "best case" scenario (where
metadata.pegasus.txt
exists), this eliminates aQDirIterator
pass entirely, reducing startup time in half (to about 25s).One downside of doing this: It could cause a compatibility concern if people have both a
metadata.pegasus.txt
and then additional*.metadata.pegasus.txt
files. Do we expect that to be common? I think we could explain this limitation in the release notes and wiki to avoid confusion: use one or the other file naming scheme, not both.Avoiding the second
QDirIterator
I reviewed the remaining 25 seconds of startup time, and most of it was spent doing the second
QDirIterator
. This happens after the metadata files have been parsed, inPegasusFilter.cpp
, when applying the include/exclude list to the list of file extensions. It's looking for additional files to include dynamically, if I understand it correctly?As far as I can tell, this step isn't necessary if you have a "full" metadata file (with all of the files you want defined already), and have Only show existing games unchecked. I think, in the spirit of #899, we should avoid doing this inclusion/exclusion scan if that option is un-checked. In that case, we "trust" the metadata file to be the complete view of the world and shouldn't be looking at the disk.
This removes the second
QDirIterator
entirely.Results
On my Android device that was taking 50+ seconds to show the theme, it now loads in under 5 seconds. The rest of the startup time is mostly spent parsing the
metadata.pegasus.txt
file format. On my faster devices, the theme loads near instantly (except for Media processing, which I have a separate PR for).