-
Notifications
You must be signed in to change notification settings - Fork 324
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
get returns random values when lower than default OPT_POLL_TIMEOUT is set #531
Comments
Are the many thumbs up coming from people who are able to reproduce this problem? If so, please post your php-memcached and libmemcached and memcached versions |
@sodabrew TBH, I think most of the thumbs-up comes from the same project. That volume just means how important the issue is for us. |
Hi, Thanks God I'm not alone.
and
both had that weird behavior.
Commenting the first two options out makes is not reproducible on the test server, but it still happens on the prod.
What is interesting there:
I guess something inside libraries retried the first key request and considers its second response as the response for the second key. Most probably that's a problem is |
This is awesome debugging info! Tag @m6w6, and maybe just open a tracking issue on awesomized/libmemcahed that links back here. I'm thinking it may be useful to construct a C test harness that isolates to libmemcahed without PHP involved. |
Thanks for your effort! Any chance for some code that would reproduce this issue? It's unfortunate, that it has always been tried to upgrade both at the same time, so neither pecl-memcached nor libmemcached-awesome can be ruled out... That's suspicious, at least:
I'll have a look at |
I can confirm setting I guess it's some sort of "race condition" bug. I tried to run a simple script with a lot of gets, but the bug wasn't reproduced on it. On my test server it is randomly shown once per 5-10 requests. |
It could be, that I found a related commit. Hang on. |
I'm still debugging it. I already found that it works perfectly until timeout occurs. it returns an error like this And after that it starts sending wrong responses quite often until process is restarted. |
I guess there are two bad things:
I added logs to php to print everything that is returned from
We have 4 requests there. The 2nd and 3rd request requests the same key. The return value on the 2nd request is not right. It returned a value from another key (that was in the end of the previous request). The right value is returned in the 3rd response. I set 2 breakpoints in the
Here is what was printed:
As you can see, after the 2nd mget it requests old key first ( In both logs I skipped everything after the 4th request. There was 19 requests to memcached on the page. |
Let me know if you need any other internal information to debug it. |
This might be an even better candidate: awesomized/libmemcached@d7a0084 |
I'd need someone to confirm/decline any offending commit to proceed; or/and (even better) a reliable reproduce case. |
Hey! Did anyone have a chance to test whether their issue was solved with reverting either commit? |
Hi, Sorry for long response. I set the test environment. Reverted that commit (k0ka/libmemcached@6197650). But when I'm building rpm with this patch (from Remi's srpm) it hangs on test 31: Should I try to revert only changes in |
I remember fiddling around with this test when looking for offending commits. It's probably a bad test in itself. IIRC, just make https://github.com/k0ka/libmemcached/blob/v1.x/test/tests/memcached/noblock.cpp#L13 read: auto num = 10000; |
Yep, this fixes the problem with the test. However, I haven't run it yet. I'm trying to reproduce the bug as a test. I ended up with this code: #include "test/lib/common.hpp"
#include "test/lib/MemcachedCluster.hpp"
TEST_CASE("memcached_poll_timeout") {
auto test = MemcachedCluster::network();
auto memc = &test.memc;
REQUIRE_SUCCESS(memcached_behavior_set(memc, MEMCACHED_BEHAVIOR_NO_BLOCK, 1));
REQUIRE_SUCCESS(memcached_behavior_set(memc, MEMCACHED_BEHAVIOR_BINARY_PROTOCOL, 1));
REQUIRE_SUCCESS(memcached_behavior_set(memc, MEMCACHED_BEHAVIOR_TCP_NODELAY, 0));
REQUIRE_SUCCESS(memcached_behavior_set(memc, MEMCACHED_BEHAVIOR_POLL_TIMEOUT, 20));
REQUIRE_SUCCESS(memcached_behavior_set(memc, MEMCACHED_BEHAVIOR_RCV_TIMEOUT, 20*1000));
REQUIRE_SUCCESS(memcached_behavior_set(memc, MEMCACHED_BEHAVIOR_SND_TIMEOUT, 20*1000));
REQUIRE_SUCCESS(memcached_behavior_set(memc, MEMCACHED_BEHAVIOR_RETRY_TIMEOUT, 60));
REQUIRE_SUCCESS(memcached_behavior_set(memc, MEMCACHED_BEHAVIOR_CONNECT_TIMEOUT, 20));
memcached_return_t rc;
size_t len;
auto num = 500;
auto hit = 0u;
auto rnd = random_ascii_string(128);
REQUIRE_SUCCESS(memcached_flush(memc, 0));
for (auto i = 0; i < num; ++i) {
auto setKey = rnd + to_string(i);
REQUIRE_SUCCESS(memcached_set(memc, setKey.c_str(), setKey.length(), setKey.c_str(),
setKey.length(), 1, 0));
}
memcached_quit(memc);
auto timeoutAchieved = false;
for (auto i = 0; i < num; ++i) {
for (auto j = 0; j < i; ++j) {
auto getKey = rnd + to_string(j);
auto len = getKey.length();
char key[MEMCACHED_MAX_KEY];
memcpy(key, getKey.c_str(), len + 1);
const char *keys[1] = {key};
REQUIRE_SUCCESS(memcached_mget(memc, keys, &len, 1));
auto numResults = 0;
while (auto rv = memcached_fetch(memc, key, nullptr, &len, nullptr, &rc)) {
REQUIRE(!timeoutAchieved);
REQUIRE_SUCCESS(rc);
free(rv);
numResults++;
REQUIRE(numResults == 1);
}
if (MEMCACHED_END == rc || MEMCACHED_NOTFOUND == rc ){
continue;
}
if (MEMCACHED_TIMEOUT == rc){
timeoutAchieved = true;
continue;
}
INFO(rc);
REQUIRE(false);
}
}
} However it just hangs after |
That's interesting. It doesn't hang, though; it just takes a very long time on linux. |
Thanks, for the test case, though. I'll try to debug it on Linux later. |
I also don't understand why it returns |
In your test runs I tested it with MEMCACHED_BEHAVIOR_POLL_TIMEOUT=1000 and 20. It never happens on 1000 and always happens on 20. |
So, I'm using
|
Thanks for the extra info. Can you also share some words about your hardware? |
|
I finally created a docker container with this patch. It looks like this bug is gone - I wasn't able to reproduce it with that commit reverted. However I still see quite a lot of "TIMEOUT OCCURRED" errors. About 5%. The memcached is on the same server.
|
Just realised, your test is doing Anyway, I think, in normal circumstances, with this low of a timeout, there's still the previous response floating around, since, well, a timeout occurred while trying to fetch it, and the next request will happily fetch the previous response the next time. I also think, these timeouts are really low, even computers sometimes need some time to think 🤔 Please have a try with awesomized/libmemcached@48dcc61a and see if it fixes your issues; it basically reverts 90% of awesomized/libmemcached@d7a0084 Thank you! |
it's not 500! iterations, it's only 500*250 iterations. And 100ms is way too much for a bare metal server. The full page is generated in 200-300ms including several sql requests and 15-20 memcached requests. It took less than 1ms to process a memcached request - you may check my tcpdump log. It has timestamps. |
Uhm, I may have been sidetracked, please enlighten me!
I unterstand your sentiment, but i have been trying to understand Linux clock granularity for the past couple hours. If a server fails, it doesn't matter Most probably I've been looking in the wrong corners... |
I checked it, it works the same way - it fixes the most nasty bug with "wrong key return". However the timeout still occurs. I'm pretty sure it's not right. I can get tcpdump of interaction with memcached server to see timings. |
The current test is:
so the first iteration goes from 0 to 0 and asks for 1 keys, the 2nd - from 0 to 1 and asks for 2 keys, 3rd - from 0 to 2 and asks 3 keys, nth goes from 0 to n and asks for n keys. (actually it goes from 0 to i-1, so it asks even less keys) |
Yep, but for |
why num times? it executes Anyway, I narrowed my test to this: #include "test/lib/common.hpp"
#include "test/lib/MemcachedCluster.hpp"
TEST_CASE("memcached_poll_timeout") {
auto test = MemcachedCluster{Cluster{Server{MEMCACHED_BINARY, {"-p", "11211"}}, 1}};
auto memc = &test.memc;
REQUIRE_SUCCESS(memcached_behavior_set(memc, MEMCACHED_BEHAVIOR_NO_BLOCK, 1));
REQUIRE_SUCCESS(memcached_behavior_set(memc, MEMCACHED_BEHAVIOR_BINARY_PROTOCOL, 0));
REQUIRE_SUCCESS(memcached_behavior_set(memc, MEMCACHED_BEHAVIOR_TCP_NODELAY, 0));
REQUIRE_SUCCESS(memcached_behavior_set(memc, MEMCACHED_BEHAVIOR_POLL_TIMEOUT, 20));
REQUIRE_SUCCESS(memcached_behavior_set(memc, MEMCACHED_BEHAVIOR_RCV_TIMEOUT, 20 * 1000));
REQUIRE_SUCCESS(memcached_behavior_set(memc, MEMCACHED_BEHAVIOR_SND_TIMEOUT, 20 * 1000));
REQUIRE_SUCCESS(memcached_behavior_set(memc, MEMCACHED_BEHAVIOR_RETRY_TIMEOUT, 60));
// REQUIRE_SUCCESS(memcached_behavior_set(memc, MEMCACHED_BEHAVIOR_CONNECT_TIMEOUT, 20));
memcached_return_t rc;
auto num = 100'000;
REQUIRE_SUCCESS(memcached_flush(memc, 0));
const char* key = "testkey1";
auto len = strlen(key);
REQUIRE_SUCCESS(memcached_set(memc, key, len, key, len, 1, 0));
memcached_quit(memc);
const char *keys[1] = {key};
for (auto i = 0; i < num; ++i) {
REQUIRE_SUCCESS(memcached_mget(memc, keys, &len, 1));
auto numResults = 0;
while (auto rv = memcached_fetch(memc, nullptr, nullptr, nullptr, nullptr, &rc)) {
REQUIRE_SUCCESS(rc);
free(rv);
numResults++;
REQUIRE(numResults == 1);
}
if (MEMCACHED_END == rc) {
continue;
}
INFO(rc << " on iteration " << i);
REQUIRE(false);
}
} So it only sets one key It constantly fails for me on about 60-70 000 iteration. But it is the
I guess we might accept awesomized/libmemcached@48dcc61a as the fix for this problem. |
wait, I'm dumb. I set key to 1 sec lifetime. That's why it is expired. Making more testing... |
Or I'm a genius (accidentally)! It looks like problems starts when a key is expired. It only happens when #include "test/lib/common.hpp"
#include "test/lib/MemcachedCluster.hpp"
TEST_CASE("memcached_poll_timeout") {
auto test = MemcachedCluster{Cluster{Server{MEMCACHED_BINARY, {"-p", "11211"}}, 1}};
auto memc = &test.memc;
REQUIRE_SUCCESS(memcached_behavior_set(memc, MEMCACHED_BEHAVIOR_NO_BLOCK, 1));
REQUIRE_SUCCESS(memcached_behavior_set(memc, MEMCACHED_BEHAVIOR_TCP_NODELAY, 0));
REQUIRE_SUCCESS(memcached_behavior_set(memc, MEMCACHED_BEHAVIOR_BINARY_PROTOCOL, 1));
REQUIRE_SUCCESS(memcached_behavior_set(memc, MEMCACHED_BEHAVIOR_POLL_TIMEOUT, 20));
memcached_return_t rc;
REQUIRE_SUCCESS(memcached_flush(memc, 0));
const char* key = "testkey1";
auto len = strlen(key);
const char *keys[1] = {key};
REQUIRE_SUCCESS(memcached_mget(memc, keys, &len, 1));
auto numResults = 0;
while (auto rv = memcached_fetch(memc, nullptr, nullptr, nullptr, nullptr, &rc)) {
REQUIRE_SUCCESS(rc);
free(rv);
numResults++;
REQUIRE(numResults == 1);
}
if (MEMCACHED_END != rc && MEMCACHED_NOTFOUND != rc) {
INFO(rc);
REQUIRE(false);
}
} This test always fails with rc=31 (MEMCACHED_TIMEOUT) for me. Setting TL;DR:
|
And here is a test that does it: #include "test/lib/common.hpp"
#include "test/lib/MemcachedCluster.hpp"
TEST_CASE("memcached_poll_timeout") {
auto test = MemcachedCluster{Cluster{Server{MEMCACHED_BINARY, {"-p", "11211"}}, 1}};
auto memc = &test.memc;
REQUIRE_SUCCESS(memcached_behavior_set(memc, MEMCACHED_BEHAVIOR_NO_BLOCK, 1));
REQUIRE_SUCCESS(memcached_behavior_set(memc, MEMCACHED_BEHAVIOR_TCP_NODELAY, 0));
REQUIRE_SUCCESS(memcached_behavior_set(memc, MEMCACHED_BEHAVIOR_BINARY_PROTOCOL, 0));
REQUIRE_SUCCESS(memcached_behavior_set(memc, MEMCACHED_BEHAVIOR_POLL_TIMEOUT, 1000));
memcached_return_t rc;
size_t len;
auto num = 100;
auto rnd = random_ascii_string(128);
for (auto i = 0; i < num; ++i) {
auto setKey = rnd + to_string(i);
REQUIRE_SUCCESS(memcached_set(memc, setKey.c_str(), setKey.length(), setKey.c_str(),
setKey.length(), 0, 0));
}
memcached_quit(memc);
for (auto i = 0; i < num; ++i) {
for (auto j = 0; j <= i; ++j) {
auto getKey = rnd + to_string(j);
auto len = getKey.length();
char key[MEMCACHED_MAX_KEY];
memcpy(key, getKey.c_str(), len + 1);
const char *keys[1] = {key};
if (0 == j){
REQUIRE_SUCCESS(memcached_behavior_set(memc, MEMCACHED_BEHAVIOR_POLL_TIMEOUT, 0));
}
REQUIRE_SUCCESS(memcached_mget(memc, keys, &len, 1));
auto numResults = 0;
while (auto rv = memcached_fetch(memc, key, nullptr, &len, nullptr, &rc)) {
REQUIRE_SUCCESS(rc);
free(rv);
numResults++;
REQUIRE(numResults == 1);
}
if (0 == j){
REQUIRE(MEMCACHED_TIMEOUT == rc);
REQUIRE_SUCCESS(memcached_behavior_set(memc, MEMCACHED_BEHAVIOR_POLL_TIMEOUT, 1000));
continue;
}
if (MEMCACHED_END == rc){
continue;
}
INFO(rc);
REQUIRE(false);
}
}
} On my server it always fails on the current version of libmemcached, but always passes after awesomized/libmemcached@48dcc61a is applied. Here is what it does:
Upd: made the test more reliable. |
Hey, I pushed a much simpler test case; in case you didn't notice. Unfortunately the fix for this problem is simple & hard (like in the branch) or complicated & considerate (I'll try to write something up until tomorrow). If your observation of "memcached is choking on missing-key+binary-proto", we should file a decent issue on https://github.com/memcached/memcached, IMO. I'll also try to have a look at this. Finally, I'm definitely not understanding why Unfortunately, the libmemcached legacy is a over-engineered mess; someone should think about making a new meta-protocol-only client library. Maybe @dormando's proxy code would be a good fit for a new client library? |
Hi. Sorry I've only skimmed this; that sounds like a nasty bug. RE client replacement, this issue on php-memcached is probably the wrong place to discuss, so I created an issue topic over here: dormando/mcmc#2 TL:DR; I wrote a core of a client parser (it's messy because I didn't know what direction to go in at the time) which is in use for the proxy code + some other utilities. I'm trying to find time to expand this into a full client soon but am not sure what a useful replacement looks like. I wrote up my goals and would invite anyone on this thread who cares (certainly @m6w6 as you seem to have taken ownership of libmemcached) to come help with at least feedback if not code. |
My guess is: in most cases it will return timeout. As we asked something from the server and it took some time to process our request. So they decided to make poll_timeout=0 a convenient way to get TIMEOUT error. It's definitely usable in tests. Maybe it can be used in some weird real world examples. |
* Fix [gh #107](#107): macOS: deprecated sasl API (improve detection of `libsasl2`). * Fix [gh #131](#131): Consider renaming tools (add `CLIENT_PREFIX` build option; default: `mem`) * Fix [gh #132](#132): Add build of static library (add `BUILD_SHARED_LIBS` build option; default: `ON`). * Fix [gh #134](#134): Update client option documentation. * Fix [gh #136](#136): `libmemcachedutil` is underlinked (link against libmemcached). * Fix [gh php-memcached#531](php-memcached-dev/php-memcached#531): `get` returns random values when lower than default `OPT_POLL_TIMEOUT` is set.
Wanted to mention (for others that may come looking) that I've seen this occurring even without
Notably this issue only occurs when the memcached server nodes are experiencing some sort of stress, like kernel blocking under memory pressure and page eviction. I've had a hard time replicating without Update: Can replicate without |
yep, it was fixed |
Thank you! |
Thank you all for the report, reproduction case, and follow-up that this was fixed upstream! |
After upgrading to libmemcached to 3.1.5-6, I'm sometimes (!) getting random results when calling the get method. There's no error, but the returned value is from a different key. This does not happen on every request. I've narrowed this error down to
OPT_POLL_TIMEOUT
option. When set to lower than 500ms this starts to happen. It seems that ifOPT_POLL_TIMEOUT
happens, the library returns a seemingly random value without any error.It does not happen when
OPT_POLL_TIMEOUT
is set to default 1000ms value or omitted.Not sure if it's an issue of libmemcached or php-memcached though.
PHP version:
memcache* libraries:
some memcache stats
I will be more than happy to provide more debug info.
The text was updated successfully, but these errors were encountered: