Describe the bug
As more elements are inserted into a vector set, the VSIM score of elements that were already present becomes lower.
To reproduce
Get the FB data wiki-news-300d-1M.vec.
Start inserting the vectors in a loop. I use a PHP script for this.
While it loads, try searching with VSIM. At the beginning there will be no results, because "frog" is not loaded yet. Once about 20,000 vectors are loaded, you get:
$ redis-cli VSIM v_n ele frog COUNT 100 withscores | grep frog -A 1
frog 0.9999999403953552 frogs 0.9300006031990051
Then wait until about 23,000 vectors are loaded and run the same command:
$ redis-cli VSIM v_n ele frog COUNT 100 withscores | grep frog -A 1
frog 0.9999998211860657 frogs 0.8306085467338562
This happens with both int8 and float vector sets.
Expected behavior
The score between two existing elements should not change when more elements are inserted.
Additional information
The file wiki-news-300d-1M.vec can be downloaded from here: https://dl.fbaipublicfiles.com/fasttext/vectors-english/wiki-news-300d-1M.vec.zip
Comment From: nmmmnu
If I later run VSIM for "frog" using its original vector, I get the following:
VSIM v_n values 300 -0.1902 0.3798 0.1009 -0.0387 -0.0758 -0.0083 -0.0983 0.0157 0.1076 0.0945 -0.0045 -0.1646 -0.1718 -0.0254 0.0652 -0.2033 -0.0933 -0.1140 0.0158 -0.4690 -0.4105 0.0553 0.0483 -0.0731 0.0221 -0.0065 0.1272 0.1784 0.2382 -0.0391 0.0855 0.0975 0.0023 -0.0226 0.0448 -0.0246 0.2040 0.0560 -0.1299 -0.2650 0.1239 -0.0565 0.2559 0.1399 -0.0408 -0.0574 -0.0189 0.0451 -0.0971 -0.0873 -0.2086 -0.1841 -0.7157 0.0549 0.1104 -0.0786 0.3531 0.0802 0.0915 -0.0632 0.0048 -0.1406 -0.1375 -0.2009 -0.0865 -0.1452 -0.1567 0.0103 -0.0781 0.2486 -0.0849 -0.0025 -0.1247 0.0649 0.1731 0.2748 0.1027 0.1306 0.1383 0.0945 -0.0297 0.1219 0.0203 -0.2261 -0.0052 -0.1194 0.1408 -0.1817 0.2201 -0.0493 0.0928 -0.0630 -0.0640 -0.0774 -0.0314 -0.0140 -0.1904 -0.1495 -0.1196 0.0900 -0.1444 0.3274 -0.0519 -0.1401 0.0427 -0.0427 0.0043 0.2384 0.0163 -0.0806 -0.1732 0.1635 0.1448 0.0265 0.0731 0.1675 0.1085 -0.0215 -0.1056 -0.3031 -0.1410 -0.1582 0.1524 -0.0719 0.1431 0.1893 -0.2007 0.0628 -0.0555 0.0039 0.0340 -0.0909 -0.0475 -0.0144 0.0654 0.0216 0.0971 -0.0067 -0.1867 -0.0466 0.1269 -0.0413 -0.0494 0.2005 0.0310 -0.0332 0.0601 0.0167 -0.0330 0.1303 -0.1984 -0.0101 -0.0757 -0.2718 0.0508 -0.0505 0.0712 0.0956 -0.0310 0.0771 -0.0157 -0.0846 0.3409 0.0030 -0.0116 0.0809 0.0942 -0.0298 0.0639 0.0099 -0.0469 0.1457 -0.1529 0.0494 0.2185 -0.0829 0.3576 -0.0548 0.1188 -0.0513 0.0328 0.0780 -0.0858 -0.0645 0.1066 -0.1231 0.0456 0.1913 -0.2681 0.1878 -0.0077 0.2623 0.1559 0.0045 0.1368 -0.0720 0.2520 0.0336 0.0911 -0.1273 0.0322 -0.0644 -0.0638 -0.2571 0.0476 0.1604 0.0283 -0.1435 0.0866 0.2220 -0.1408 -0.2078 0.2310 0.0249 0.1760 0.2103 -0.1009 0.1441 -0.1066 0.1381 -0.0212 0.0370 -0.1432 0.1495 0.0518 -0.2653 0.1265 0.1299 -0.0955 -0.1064 0.0387 -0.0711 0.3423 -0.0793 0.0475 -0.1686 -0.1235 -0.0852 -0.2434 -0.0744 -0.1475 -0.2739 0.1033 0.2221 -0.1234 -0.1000 -0.2275 0.1243 0.1891 0.4219 0.0725 -0.0116 0.2565 -0.1183 -0.0118 -0.1402 -0.0954 0.1974 -0.0856 -0.2874 -0.0827 -0.0962 0.2671 
-0.0094 -0.4570 0.1190 0.1573 0.1139 -0.2207 -0.1293 0.0580 -0.0379 0.1044 0.0096 -0.1357 0.0772 0.2140 -0.0008 -0.0947 0.0129 0.0675 -0.1183 0.0064 0.1442 -0.0347 -0.4064 0.0170 -0.1628 -0.0768 0.0234 0.0735 -0.1028 -0.0610 -0.0909 -0.1340 0.0791 0.0064 -0.0903 0.0998 0.0375 withscores
1) "frogs"
2) "0.9300006031990051"
3) "frog"
4) "0.8733182549476624"
This means the stored "frog" vector has degraded so much that the original vector is now closer to "frogs" than to "frog".
Note that this was done with float values, so the problem is not in Q8.
Comment From: antirez
Thank you for reporting! I'll look into it ASAP.
Comment From: antirez
Dear @nmmmnu, there is no bug here. Your script inserts repeated elements because it performs a case conversion: words that differ only in case end up as the same element, so each later insert updates the vector of an element that already exists. That leads to the behavior you reported.
Try adding something like this:
if ($word == "frogs" || $word == "frog") printf("%s\n", $word);
You will see repeated entries.
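The same check can be run over the whole input file before loading it. The following is a minimal sketch in Python, not the reporter's PHP script (which is not shown in this thread). It assumes the usual fastText .vec text layout (a "count dim" header line, then one "word v1 v2 ..." line per entry) and assumes the case conversion is a lowercasing; the helper name find_case_collisions and the sample lines are made up for illustration.

```python
from collections import defaultdict
from typing import Dict, Iterable, List


def find_case_collisions(lines: Iterable[str]) -> Dict[str, List[str]]:
    """Group words that become identical after lowercasing.

    Returns only the groups with more than one original spelling,
    i.e. the elements that would be inserted (and overwritten) repeatedly.
    """
    groups: Dict[str, List[str]] = defaultdict(list)
    for i, line in enumerate(lines):
        parts = line.split()
        if not parts:
            continue  # blank line
        if i == 0 and len(parts) == 2:
            continue  # assumed "count dim" header of a .vec file
        word = parts[0]
        groups[word.lower()].append(word)
    return {key: words for key, words in groups.items() if len(words) > 1}


if __name__ == "__main__":
    # Made-up sample in the assumed .vec layout, not real fastText data.
    sample = [
        "4 3",
        "frog 0.1 0.2 0.3",
        "Frog 0.3 0.2 0.1",
        "frogs 0.1 0.2 0.4",
        "toad 0.5 0.5 0.5",
    ]
    for key, words in find_case_collisions(sample).items():
        print(key, words)
```

To scan the real file, pass an open file handle instead of the sample list, for example find_case_collisions(open("wiki-news-300d-1M.vec", encoding="utf-8")). Any key that is reported corresponds to an element whose vector would be replaced at least once during loading.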
Comment From: antirez
P.S. Packing the vector to FP32 as you do in the script is the way to go: it is faster on the Redis side. However, if you ever profile your insertion speed, make sure to also check the cost of the packing itself. Dynamic languages can make a mess of it, especially when converting numbers to float strings, but also in the reverse direction ;D
Also, despite this being a false positive, thank you for posting the issue and for trying Vector Sets. I have a zero-bug policy for vector sets, and I'm happier checking a potential false positive than leaving any real bug in.
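The FP32 packing mentioned in the P.S. can be sketched as follows, together with a rough timing of the packing step. This is an illustration in Python rather than the PHP used by the reporter. The little-endian byte order is an assumption (it matches the native order on x86 and most ARM machines), and the helper names pack_fp32 and unpack_fp32 are hypothetical.

```python
import struct
import timeit
from typing import List, Sequence


def pack_fp32(values: Sequence[float]) -> bytes:
    """Pack floats into a blob of 32-bit floats (little-endian assumed)."""
    return struct.pack(f"<{len(values)}f", *values)


def unpack_fp32(blob: bytes) -> List[float]:
    """Inverse of pack_fp32, useful for checking a round trip."""
    if len(blob) % 4 != 0:
        raise ValueError("blob length must be a multiple of 4 bytes")
    return list(struct.unpack(f"<{len(blob) // 4}f", blob))


if __name__ == "__main__":
    # First three components of the "frog" vector quoted earlier,
    # repeated to reach 300 dimensions purely for timing purposes.
    vec = [-0.1902, 0.3798, 0.1009] * 100
    blob = pack_fp32(vec)
    print(len(blob), "bytes for", len(vec), "components")

    # Compare binary packing with building the textual form of the values.
    t_pack = timeit.timeit(lambda: pack_fp32(vec), number=10_000)
    t_text = timeit.timeit(lambda: " ".join(map(str, vec)), number=10_000)
    print(f"pack: {t_pack:.3f}s  text: {t_text:.3f}s")
```

The resulting blob is what would be sent after the FP32 keyword of VADD in place of the VALUES form. Note that the round trip is only exact for values representable in 32 bits, so a value such as 0.1902 comes back slightly rounded.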
Comment From: nmmmnu
I noticed this too and can confirm it. The input was supposed to be "cleaned" of this kind of difference, but it seems it was not.