Description
Currently, the database is 378MB. This seems huge considering that it theoretically only contains hashes, versions and filenames.
A quick investigation revealed that:
- The database contains hashes for useless filenames. Some files are inside the IntelliJ `.idea` folder, and knowing their hash is not useful. The same goes for all files in test-related folders: a query on the database showed that these test files account for at least 40% of all files. Some files also end with the `.php` extension, and there is no chance that these files will be useful for detecting a version. PHP files represent 50% of all files.
- The versions are stored as JSON, which is quite verbose. All hashes seem to correspond to a continuous range of versions. Replacing the JSON field with two fields, `initial_version` and `last_version`, could greatly reduce the size of this data. The range could then be computed using the `version` table. Ideally, the version table would hold ordered versions. By limiting it to numerical versions (the others are useless because Cyberwatch cannot find the associated CVEs), the ordering can be alphanumeric.
- The hashes are stored as strings, using 64 bytes instead of 8 bytes. As there are 600 000 hashes, converting these strings to binary could save up to 33MB.
- The `versions` table seems useless, but I may be wrong. There are only 1600 entries, so this is not very important.
- The name of the technology is stored in each row of the `hash` and `file` tables. Each of these entries uses 8 bytes. Adding a `technology` table and using small foreign keys (u32, for example) can save some space.
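To make the proposed layout concrete, here is a minimal sketch of what it could look like, using Python's built-in `sqlite3` module. The table and column names (`seq`, `label`, `digest`), the `version_key` helper and the `example-cms` technology are assumptions made for illustration, not the project's actual schema. The sketch orders versions by their numeric components when filling the version table, because a plain string comparison would place `10.0.0` before `2.9.0`.

```python
import sqlite3

# Hypothetical schema: names are illustrative, not the project's real ones.
SCHEMA = """
CREATE TABLE technology (
    id   INTEGER PRIMARY KEY,
    name TEXT NOT NULL UNIQUE
);
CREATE TABLE version (
    id            INTEGER PRIMARY KEY,
    technology_id INTEGER NOT NULL REFERENCES technology(id),
    seq           INTEGER NOT NULL,  -- position in release order
    label         TEXT NOT NULL,
    UNIQUE (technology_id, seq)
);
CREATE TABLE hash (
    digest          BLOB NOT NULL,
    technology_id   INTEGER NOT NULL REFERENCES technology(id),
    initial_version TEXT NOT NULL,   -- first version containing this hash
    last_version    TEXT NOT NULL    -- last version containing this hash
);
"""


def version_key(label):
    """Sort key for purely numerical versions such as '2.10.0'."""
    return tuple(int(part) for part in label.split("."))


def expand_range(conn, technology_id, initial_version, last_version):
    """Return every version label between the two bounds, inclusive."""
    rows = conn.execute(
        """
        SELECT v.label
        FROM version AS v
        JOIN version AS lo
          ON lo.technology_id = v.technology_id AND lo.label = ?
        JOIN version AS hi
          ON hi.technology_id = v.technology_id AND hi.label = ?
        WHERE v.technology_id = ? AND v.seq BETWEEN lo.seq AND hi.seq
        ORDER BY v.seq
        """,
        (initial_version, last_version, technology_id),
    ).fetchall()
    return [row[0] for row in rows]


conn = sqlite3.connect(":memory:")
conn.executescript(SCHEMA)
conn.execute("INSERT INTO technology (id, name) VALUES (1, 'example-cms')")

labels = ["2.10.0", "2.9.0", "2.9.1", "10.0.0"]
for seq, label in enumerate(sorted(labels, key=version_key)):
    conn.execute(
        "INSERT INTO version (technology_id, seq, label) VALUES (1, ?, ?)",
        (seq, label),
    )

versions = expand_range(conn, 1, "2.9.1", "10.0.0")
print(versions)  # ['2.9.1', '2.10.0', '10.0.0']
```

With this layout, each hash row carries two short strings and a small integer key instead of a JSON list and a repeated technology name.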
Action required
- Filter filenames added to the database with some heuristic, removing useless files.
- Store only numerical versions (or versions like '2.5.0-beta'), beginning with a number.
- Replace the JSON with two small strings representing the range of versions in which this hash was present.
- Convert hashes to binary. (SQLite's BLOB type stores arbitrary bytes, including NUL bytes, so this should be feasible.)
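The first and last actions can be sketched as follows. The exclusion lists are an assumed heuristic derived from the folders and extension named in this issue, not the project's actual rules, and `is_useful` is a hypothetical helper. SHA-256 is used only as a stand-in, since the issue does not say which hash function is used. Note that a 64-character hex string decodes to 32 raw bytes, so reaching 8 bytes per hash would additionally require truncating the digest.

```python
import hashlib
import sqlite3
from pathlib import PurePosixPath

# Assumed heuristic: directories and extensions considered useless.
EXCLUDED_DIRS = {".idea", "test", "tests"}
EXCLUDED_SUFFIXES = {".php"}


def is_useful(path: str) -> bool:
    """Return False for files unlikely to help identify a version."""
    p = PurePosixPath(path)
    if p.suffix.lower() in EXCLUDED_SUFFIXES:
        return False
    # Only directory components are checked, not the filename itself.
    return not any(part.lower() in EXCLUDED_DIRS for part in p.parts[:-1])


# Hex digest as currently stored (text), and its raw binary form.
hex_digest = hashlib.sha256(b"example file content").hexdigest()
raw = bytes.fromhex(hex_digest)

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE hash (digest BLOB NOT NULL)")
conn.execute("INSERT INTO hash (digest) VALUES (?)", (raw,))
# A value containing NUL bytes, to check that they survive a round trip.
conn.execute("INSERT INTO hash (digest) VALUES (?)", (b"\x00\x01\x00",))

stored = [row[0] for row in conn.execute("SELECT digest FROM hash ORDER BY rowid")]
print(len(hex_digest), len(stored[0]))  # 64 32
```

Binding a `bytes` value as a query parameter stores it as a BLOB, and reading it back returns the same bytes, so the hex form can always be recovered with `.hex()` when needed.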