[alert: typical newbie issue: :-) ]
[short: define the content by data and not by code]
I think there are several issues with defining the content ID by code.
- The content ID may change depending on the installed software and even on the hardware.
  For example, the value of a pixel in a decoded JPEG is not guaranteed to be exact. Hardware functions like sin/cos do not have to be exact; they can be approximated. Also, a JPEG lib can be "improved" so that the color of a pixel changes by 1/256. In most cases nothing will happen, but there is no guarantee that this cannot affect the hash. If NNs are used, everything gets worse.
- Further improvements may make the content incompatible.
- Improvements make the reference code obsolete.
  If someone builds better or optimized code, the reference code also needs to do the same.
- The specification will never be exact. (Otherwise, JPEG would need to be included, or MP4, ...)
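
To make the first point concrete, here is a minimal sketch of how a 1/256 difference in one decoded pixel can flip a hash bit. The `toy_hash` function and the pixel values are made up for illustration (a simple "above the mean" threshold hash); this is not the project's actual algorithm.

```python
def toy_hash(pixels):
    """Toy threshold hash: one bit per pixel, set when the pixel is above the mean."""
    mean = sum(pixels) / len(pixels)
    return [1 if p > mean else 0 for p in pixels]


# Hypothetical 8-bit grayscale values as produced by one decoder build.
decoded_a = [10, 200, 100, 90]
# The same image from another decoder build: one pixel differs by 1/256.
decoded_b = [10, 200, 101, 90]

hash_a = toy_hash(decoded_a)
hash_b = toy_hash(decoded_b)

# Count the bit positions where the two hashes disagree.
changed_bits = sum(a != b for a, b in zip(hash_a, hash_b))
print(hash_a, hash_b, changed_bits)
```

The third pixel sits right at the threshold, so the tiny decoder difference is enough to change the resulting bits.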
Possible solution:
Each bit of the content ID is defined by a named attribute.
Examples might be [scientific, funny, violent, animal, social, ...].
For each name/category/attribute, example data is provided.
If tests are written, the only requirement is that a minimum of X bits are correct.
The detection or training on the data should also take care that each bit takes each of its two values about 50% of the time, to get a uniform hash distribution.
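
A rough sketch of what this could look like, under my own assumptions: the attribute list, the function names, and the label sets below are all hypothetical, and a real detector (classifier, NN, ...) would produce the labels. It only shows the three pieces from the proposal: one named attribute per bit, a "at least X bits correct" test, and a per-bit balance check.

```python
# Hypothetical attribute names; bit i of the ID belongs to ATTRIBUTES[i].
ATTRIBUTES = ["scientific", "funny", "violent", "animal", "social"]


def make_content_id(labels):
    """Build the ID bits from the set of attributes a detector assigned."""
    return [1 if name in labels else 0 for name in ATTRIBUTES]


def matching_bits(id_a, id_b):
    """Number of bit positions where two IDs agree."""
    return sum(a == b for a, b in zip(id_a, id_b))


def conforms(candidate, reference, min_correct):
    """Conformance test: at least `min_correct` (the X above) bits must match."""
    return matching_bits(candidate, reference) >= min_correct


def bit_balance(ids):
    """Fraction of IDs in which each bit is set; ideally close to 0.5."""
    return [sum(bits[i] for bits in ids) / len(ids) for i in range(len(ATTRIBUTES))]


# Reference labels from the example data vs. labels from another implementation.
reference = make_content_id({"scientific", "animal"})
candidate = make_content_id({"scientific", "animal", "social"})

sample_ids = [
    reference,
    candidate,
    make_content_id({"funny"}),
    make_content_id({"funny", "violent", "social"}),
]
balance = bit_balance(sample_ids)
print(conforms(candidate, reference, 4), balance)
```

With X = 4 the second implementation still passes even though it disagrees on one attribute, and the balance list shows which bits (here "violent") are far from the 50% target on this sample.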
Advantages:
- More freedom in writing an individual content similarity match.
- Flexible for updates.
- Different content types can be matched.
- Easier to specify.