Reduce the size of the database #28

@polyedre

Description

Currently, the database is 378MB. This seems huge considering that it theoretically only contains hashes, versions and filenames.

A quick investigation revealed that:

  • The database contains hashes for useless filenames. Some files are inside the IntelliJ .idea folder, and knowing their hash is not useful. The same goes for all files in test-related folders. A query on the database revealed that these test files account for at least 40% of all files. Some files also end with the .php extension. There is no chance that these files will be useful to detect a version. PHP files represent 50% of all files.
  • The versions are stored in JSON, which is quite verbose. All hashes seem to correspond to a continuous range of versions. Replacing the JSON field with two fields, 'initial_version' and 'last_version', could greatly reduce the size of this data. The range could then be computed using the version table. The best would be to have a version table with ordered versions. By limiting to numerical versions (others are useless because Cyberwatch cannot find an associated CVE), the ordering can be alphanumeric.
  • The hashes are stored as strings, using 64 bytes each. Assuming these are hex-encoded digests, the raw binary form takes 32 bytes. As there are 600,000 hashes, converting these strings to binary could save about 19 MB.
  • The versions table seems useless, but I may be wrong. There are only 1,600 entries, so this is not very important.
  • The name of the technology is stored in each row of the hash and file tables. Each of these entries uses 8 bytes. Adding a technology table and using small foreign keys (u32, for example) can save some space.
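
To make the points above concrete, here is a minimal sketch of what a more compact layout could look like. It is not the project's actual schema: the table and column names (`technology`, `version.position`, `hash.digest`) and the sample data are assumptions made for illustration. Only 'initial_version' and 'last_version' come from the description above, and here they hold the rank of a version in an ordered version table.

```python
import hashlib
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE technology (
    id   INTEGER PRIMARY KEY,
    name TEXT NOT NULL UNIQUE
);
CREATE TABLE version (
    technology_id INTEGER NOT NULL REFERENCES technology(id),
    position      INTEGER NOT NULL,  -- rank of the version, oldest first
    name          TEXT NOT NULL,
    PRIMARY KEY (technology_id, position)
);
CREATE TABLE hash (
    digest          BLOB NOT NULL,     -- raw bytes instead of a hex string
    technology_id   INTEGER NOT NULL REFERENCES technology(id),
    initial_version INTEGER NOT NULL,  -- rank of the first version with this file
    last_version    INTEGER NOT NULL   -- rank of the last version with this file
);
""")

# Hypothetical sample data.
conn.execute("INSERT INTO technology (id, name) VALUES (1, 'example')")
conn.executemany(
    "INSERT INTO version (technology_id, position, name) VALUES (1, ?, ?)",
    enumerate(["1.0.0", "1.1.0", "1.2.0", "2.0.0"]),
)

digest = hashlib.sha256(b"file contents").digest()  # 32 bytes, 64 as hex
conn.execute(
    "INSERT INTO hash (digest, technology_id, initial_version, last_version) "
    "VALUES (?, 1, 1, 2)",
    (digest,),
)


def versions_for(digest: bytes) -> list[str]:
    """Expand the stored range back into version names via the version table."""
    rows = conn.execute(
        """
        SELECT v.name
        FROM hash h
        JOIN version v
          ON v.technology_id = h.technology_id
         AND v.position BETWEEN h.initial_version AND h.last_version
        WHERE h.digest = ?
        ORDER BY v.position
        """,
        (digest,),
    )
    return [name for (name,) in rows]


print(versions_for(digest))
```

The range columns only work if each hash really covers a continuous run of versions, as the investigation suggests. A hash that disappears and reappears would need one row per run.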

Action required

  • Filter filenames added to the database with some heuristic, removing useless files.
  • Store only numerical versions (or versions like '2.5.0-beta'), beginning with a number.
  • Replace the JSON with two small fields representing the range of versions in which this hash was present.
  • Convert hashes to binary. SQLite supports this through its BLOB type, which stores arbitrary bytes, including NUL bytes.
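
The actions above could be prototyped roughly as follows. This is a sketch under stated assumptions, not the project's code: the excluded directory names, the excluded extension set, the version regular expression and all function names are hypothetical choices. The sort key compares numeric components because a plain string sort would place 2.10 before 2.9. The last part checks that a digest containing NUL bytes survives a round trip through a SQLite BLOB column.

```python
import re
import sqlite3
from pathlib import PurePosixPath

# Hypothetical heuristics; adjust the sets to what the data actually contains.
EXCLUDED_DIRS = {".idea", "test", "tests"}
EXCLUDED_SUFFIXES = {".php"}


def is_useful_file(path: str) -> bool:
    """Return False for files that are unlikely to help detect a version."""
    p = PurePosixPath(path)
    if p.suffix.lower() in EXCLUDED_SUFFIXES:
        return False
    # Only directory components are checked, not the file name itself.
    return not any(part.lower() in EXCLUDED_DIRS for part in p.parts[:-1])


# Versions must begin with a number; an optional suffix such as "-beta" is allowed.
VERSION_RE = re.compile(r"^\d+(\.\d+)*(-[0-9A-Za-z.]+)?$")


def is_numeric_version(version: str) -> bool:
    return VERSION_RE.match(version) is not None


def version_key(version: str) -> tuple:
    """Sort key comparing numeric components as integers."""
    numeric, _, suffix = version.partition("-")
    # A release sorts after its own pre-releases (2.5.0-beta < 2.5.0).
    return (tuple(int(n) for n in numeric.split(".")), suffix == "", suffix)


# Round trip of a binary digest through a BLOB column.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE hash (digest BLOB NOT NULL)")
digest = bytes.fromhex("00ff" * 16)  # 32 bytes, half of them NUL
conn.execute("INSERT INTO hash (digest) VALUES (?)", (digest,))
stored = conn.execute("SELECT digest FROM hash").fetchone()[0]

candidates = ["master", "2.10.0", "2.9.0", "2.5.0", "2.5.0-beta", "v3"]
ordered = sorted(filter(is_numeric_version, candidates), key=version_key)
print(ordered)
```

Filtering on directory components rather than substrings avoids dropping a file merely because its name contains "test".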
