Embedding Model Classifier #474
Description
- I have checked the existing issues to avoid duplicates
- I have redacted any info hashes and content metadata from any logs or screenshots attached to this issue
Is your feature request related to a problem? Please describe
The classifier works fairly well, but it still misses a lot of content: the Unknown category makes up a significant portion of the total.
Describe the solution you'd like
Embedding models have recently become very lightweight and powerful. Could a locally run, open-source model take the torrent's name plus other metadata and output a similarity score against queries like "tv series", "porn", etc., with the highest-scoring category being chosen?
For example, a lot of software is missed because the classifier only looks for .exe files and the like; if the files are zipped, it misses them, even though an embedding model would easily figure out that it's software from a title like "Adobe Photoshop".
This wouldn't replace the existing classifier, only the portions that rely on matching a list of strings.
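To make the idea concrete, here is a rough sketch of the scoring step: embed the title, embed each category query, and pick the category with the highest cosine similarity, falling back to "unknown" below a threshold. Everything here is hypothetical and not taken from the existing classifier: `toy_embed` is only a character-trigram stand-in for a real embedding model, and the names `classify`, `queries` and `threshold` are made up for illustration.

```python
import math
import re
from collections import Counter
from typing import Callable, Dict

# Sparse vector type used by the toy stand-in (assumption: a real model
# would return a dense list of floats instead).
Vector = Dict[str, float]


def toy_embed(text: str) -> Vector:
    """Stand-in for a real embedding model: a bag of character trigrams."""
    cleaned = re.sub(r"[^a-z0-9]+", " ", text.lower()).strip()
    padded = f" {cleaned} "
    return dict(Counter(padded[i:i + 3] for i in range(len(padded) - 2)))


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity between two sparse vectors."""
    dot = sum(value * b.get(key, 0.0) for key, value in a.items())
    norm_a = math.sqrt(sum(value * value for value in a.values()))
    norm_b = math.sqrt(sum(value * value for value in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def classify(
    title: str,
    queries: Dict[str, str],
    embed: Callable[[str], Vector] = toy_embed,
    threshold: float = 0.1,
) -> str:
    """Return the label whose query is most similar to the title.

    Falls back to "unknown" when no query scores at least `threshold`,
    so the result can be ignored in favour of the existing rules.
    """
    title_vec = embed(title)
    scores = {label: cosine(title_vec, embed(q)) for label, q in queries.items()}
    best = max(scores, key=lambda label: scores[label])
    return best if scores[best] >= threshold else "unknown"


# Hypothetical category queries; the labels and wording are illustrative only.
queries = {
    "tv_show": "tv series season episode",
    "software": "software application program",
}

print(classify("Some Series Season 2 Episode 5", queries))
print(classify("xyzzy", queries))
```

The trigram stand-in only captures surface overlap, so it would not recognise "Adobe Photoshop" as software; that kind of semantic match depends on swapping a real, locally run embedding model in through the `embed` parameter (with `cosine` adapted to whatever vector type that model returns). The threshold fallback is what would let this sit alongside the existing classifier instead of replacing it.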
Describe alternatives you've considered
The classifier could always be improved over time, for example by adding more keywords, but that would result in more false positives too.