(Since this was originally posted to the memcached mailing list, it is written in my clumsy English.)

(You should read the discussion of issue #260 first: https://github.com/memcached/memcached/pull/67)

Thanks to dormando for pulling me back to the do_item_alloc function; I have also carefully re-read how issue #260 is reproduced and fixed. Finding the exact path that reproduces such a bug is awful work. But I am afraid there MAY be other problems, even in version 1.4.20, and with even higher probability.

Steps to reproduce in v1.4.20:

1) The hash table is expanding, so item locking has been switched from item_locks[] to the global lock;
2) In thread A, do_item_alloc tries to allocate a new item and picks a candidate ITEM from the LRU tail. It does not actually lock the ITEM, because item_trylock still grabs a mutex from item_locks[], which no other thread is using while in global-lock mode;
3) After do_item_alloc checks the refcount (after items.c:139), another thread B grabs the ITEM, increments the refcount, and believes it now holds a valid reference;
4) In thread A, do_item_alloc evicts the ITEM and initializes it (resetting the refcount to 1) as a new item returned to the caller;
5) In thread B, item_remove decrements the refcount to 0, so the ITEM is freed, while thread A is still holding it and it is still in the hash table;
6) From there any terrible thing can happen, including crashes and dead loops. A standalone sketch of this interleaving follows.
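To make the interleaving concrete, here is a minimal standalone sketch. This is NOT memcached code: two threads guard the same refcount with two different mutexes (stand-ins for item_locks[] versus the global lock), so the refcount == 2 check excludes nobody. All names and the sleeps are mine, chosen only to force this exact schedule.

    #include <pthread.h>
    #include <stdio.h>
    #include <unistd.h>

    static pthread_mutex_t bucket_lock = PTHREAD_MUTEX_INITIALIZER; /* item_locks[] stand-in */
    static pthread_mutex_t global_lock = PTHREAD_MUTEX_INITIALIZER; /* global lock stand-in */
    static unsigned short refcount = 2;  /* 1 for the LRU link + 1 taken by "do_item_alloc" */
    static int freed = 0;

    /* The unsynchronized access to refcount from both threads is the
     * whole point of the sketch: the two mutexes do not exclude each other. */
    static void *thread_a(void *arg) {            /* plays do_item_alloc */
        (void)arg;
        pthread_mutex_lock(&bucket_lock);         /* succeeds, but excludes nobody */
        if (refcount == 2) {                      /* "only the LRU references it" */
            usleep(100 * 1000);                   /* the window after the check */
            refcount = 1;                         /* evict + reinit as a new item */
        }
        pthread_mutex_unlock(&bucket_lock);
        return NULL;
    }

    static void *thread_b(void *arg) {            /* plays item_get + item_remove */
        (void)arg;
        usleep(50 * 1000);                        /* land inside A's window */
        pthread_mutex_lock(&global_lock);         /* the OTHER lock: no exclusion */
        refcount++;                               /* B believes it holds a reference */
        pthread_mutex_unlock(&global_lock);
        usleep(100 * 1000);
        pthread_mutex_lock(&global_lock);
        if (--refcount == 0)                      /* 1 -> 0: the "new" item is freed */
            freed = 1;
        pthread_mutex_unlock(&global_lock);
        return NULL;
    }

    int main(void) {
        pthread_t a, b;
        pthread_create(&a, NULL, thread_a, NULL);
        pthread_create(&b, NULL, thread_b, NULL);
        pthread_join(a, NULL);
        pthread_join(b, NULL);
        printf("refcount=%u freed=%d (freed=1 means use-after-free)\n",
               (unsigned)refcount, freed);
        return 0;
    }

Compiled with gcc -pthread, a typical run prints refcount=0 freed=1, i.e. the "new" item gets freed while thread A is still using it.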

In versions before 1.4.19 this is even easier to trigger, with no hash expansion needed, because of the following code in do_item_alloc:

1) if hv == cur_hv (which is possible during a replace operation), the candidate item is examined without taking any lock at all;
or
2) if in the first loop iteration hv != cur_hv and a lock is taken, and in a later iteration hv == cur_hv (no lock is taken, but hold_lock was never reset after the earlier iteration), the stale lock is released by mistake. A sketch of the repair follows the listing.

    for (; tries > 0 && search != NULL; tries--, search=search->prev) {
        uint32_t hv = hash(ITEM_key(search), search->nkey, 0);
        /* Attempt to hash item lock the "search" item. If locked, no
         * other callers can incr the refcount
         */
        /* FIXME: I think we need to mask the hv here for comparison? */
        if (hv != cur_hv && (hold_lock = item_trylock(hv)) == NULL)  /* ------> cases 1) and 2) start here */
            continue;
        /* Now see if the item is refcount locked */
        if (refcount_incr(&search->refcount) != 2) {
            refcount_decr(&search->refcount);
            /* Old rare bug could cause a refcount leak. We haven't seen
             * it in years, but we leave this code in to prevent failures
             * just in case */
            if (search->time + TAIL_REPAIR_TIME < current_time) {
                itemstats[id].tailrepairs++;
                search->refcount = 1;
                do_item_unlink_nolock(search, hv);
            }
            if (hold_lock)
                item_trylock_unlock(hold_lock);   /* may unlock a stale pointer */
            continue;
        }
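A minimal sketch of one repair for case 2): reset hold_lock at the top of every iteration, so a pointer left over from a previous pass can never reach item_trylock_unlock. This is an illustration, not the upstream patch; upstream 1.4.19+ additionally skips the item entirely when hv == cur_hv, as the 1.4.20 listing further below shows.

    for (; tries > 0 && search != NULL; tries--, search=search->prev) {
        hold_lock = NULL;   /* reset every pass: a stale pointer from the
                             * previous iteration must never be unlocked */
        uint32_t hv = hash(ITEM_key(search), search->nkey, 0);
        if (hv != cur_hv && (hold_lock = item_trylock(hv)) == NULL)
            continue;
        /* ... rest of the loop body unchanged ... */
    }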

I think that is clear enough, but I will show more detail below and try to reproduce the phenomenon in gdb.

I don't know why the function "assoc_maintenance_thread" is written as follows, but other threads can acquire the global_lock before "assoc_maintenance_thread" acquires it again.

    static void *assoc_maintenance_thread(void *arg) {

        while (do_run_maintenance_thread) {
            int ii = 0;

            /* Lock the cache, and bulk move multiple buckets to the new
             * hash table. */
            item_lock_global();     /* ------> step 5) try to get the global_lock;
                                     * it has to wait for the other threads to
                                     * release the lock. */
            mutex_lock(&cache_lock);

            for (ii = 0; ii < hash_bulk_move && expanding; ++ii) {
                .....
            }

            mutex_unlock(&cache_lock);
            item_unlock_global();

            if (!expanding) {
                /* finished expanding. tell all threads to use fine-grained locks */
                switch_item_lock_type(ITEM_LOCK_GRANULAR);
                slabs_rebalancer_resume();
                /* We are done expanding.. just wait for next invocation */
                mutex_lock(&cache_lock);
                started_expanding = false;
                pthread_cond_wait(&maintenance_cond, &cache_lock); /* ------> step 1) wait here for the expand notify. */
                /* Before doing anything, tell threads to use a global lock */
                mutex_unlock(&cache_lock);
                slabs_rebalancer_pause();
                switch_item_lock_type(ITEM_LOCK_GLOBAL);  /* ------> step 2) switch to the global_lock WITHOUT
                                                           * holding it; other threads can grab the
                                                           * global_lock first. Not thread-safe. */
                mutex_lock(&cache_lock);
                assoc_expand();            /* ------> step 3) grow the hash table; the items are
                                            * not moved to the new buckets yet. */
                mutex_unlock(&cache_lock); /* ------> step 4) release the lock; MAY not be thread-safe. */
            }
        }
        return NULL;
    }
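For comparison, one conceivable reordering is sketched below. It is NOT the upstream fix, and it assumes that switch_item_lock_type() returns only after every worker thread has acknowledged the switch. It would close the window between step 2) and step 3), but note it still would not help against item_trylock bypassing the global lock, which is the point made next.

    /* Sketch only, under the assumption stated above. */
    slabs_rebalancer_pause();
    switch_item_lock_type(ITEM_LOCK_GLOBAL);
    item_lock_global();            /* take the lock the workers now use... */
    mutex_lock(&cache_lock);
    assoc_expand();                /* ...so nobody can see the half-grown table */
    mutex_unlock(&cache_lock);
    item_unlock_global();          /* the bulk-move loop re-locks at its top */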

In the function "do_item_alloc", the function "item_trylock" is used to take the item's lock. Look carefully at "item_trylock": it directly locks item_locks[], hoping for the "no-op" described in its comment. So while every other thread is on the global lock, the item (search) is not really locked at all.

So, after the check "if (refcount_incr(&search->refcount) != 2)" has passed, other threads may still grab the item and increment its refcount, which makes the bookkeeping wrong, as annotated in the listing below:

    item *do_item_alloc(char *key, const size_t nkey, const int flags,
                        const rel_time_t exptime, const int nbytes,
                        const uint32_t cur_hv) {
        uint8_t nsuffix;
        item *it = NULL;
        char suffix[40];
        size_t ntotal = item_make_header(nkey + 1, flags, nbytes, suffix, &nsuffix);
        if (settings.use_cas) {
            ntotal += sizeof(uint64_t);
        }

        unsigned int id = slabs_clsid(ntotal);
        if (id == 0)
            return 0;

        mutex_lock(&cache_lock);
        /* do a quick check if we have any expired items in the tail.. */
        int tries = 5;
        int tried_alloc = 0;
        item *search;
        void *hold_lock = NULL;
        rel_time_t oldest_live = settings.oldest_live;

        search = tails[id];
        /* We walk up *only* for locked items. Never searching for expired.
         * Waste of CPU for almost all deployments */
        for (; tries > 0 && search != NULL; tries--, search=search->prev) {
            if (search->nbytes == 0 && search->nkey == 0 && search->it_flags == 1) {
                /* We are a crawler, ignore it. */
                tries++;
                continue;
            }
            uint32_t hv = hash(ITEM_key(search), search->nkey);
            /* Attempt to hash item lock the "search" item. If locked, no
             * other callers can incr the refcount
             */
            /* Don't accidentally grab ourselves, or bail if we can't quicklock */
            if (hv == cur_hv || (hold_lock = item_trylock(hv)) == NULL)  /* ------> 1) item_trylock always takes a lock
                                                                          * from item_locks[], never the global_lock */
                continue;
            /* Now see if the item is refcount locked */
            if (refcount_incr(&search->refcount) != 2) {  /* ------> 2) search->refcount == 2 means only the
                                                           * LRU link references the item. */
                refcount_decr(&search->refcount);
                /* Old rare bug could cause a refcount leak. We haven't seen
                 * it in years, but we leave this code in to prevent failures
                 * just in case */
                if (settings.tail_repair_time &&
                    search->time + settings.tail_repair_time < current_time) {
                    itemstats[id].tailrepairs++;
                    search->refcount = 1;
                    do_item_unlink_nolock(search, hv);
                }
                if (hold_lock)
                    item_trylock_unlock(hold_lock);
                continue;
            }

            /* Expired or flushed */
            if ((search->exptime != 0 && search->exptime < current_time)  /* ------> 3) after this line, another thread may
                 * grab the item and increment the refcount. If the item is then handed
                 * out as a new item (refcount reset to 1), the other thread's
                 * do_item_remove will free the "new" item, because the refcount drops
                 * to 0. This is most likely in the eviction case. */
                || (search->time <= oldest_live && oldest_live <= current_time)) {
                itemstats[id].reclaimed++;
                if ((search->it_flags & ITEM_FETCHED) == 0) {
                    itemstats[id].expired_unfetched++;
                }
                it = search;
                slabs_adjust_mem_requested(it->slabs_clsid, ITEM_ntotal(it), ntotal);
                do_item_unlink_nolock(it, hv);
                /* Initialize the item block: */
                it->slabs_clsid = 0;
            } else if ((it = slabs_alloc(ntotal, id)) == NULL) {
                tried_alloc = 1;
                if (settings.evict_to_free == 0) {
                    itemstats[id].outofmemory++;
                } else {
                    itemstats[id].evicted++;
                    itemstats[id].evicted_time = current_time - search->time;
                    if (search->exptime != 0)
                        itemstats[id].evicted_nonzero++;
                    if ((search->it_flags & ITEM_FETCHED) == 0) {
                        itemstats[id].evicted_unfetched++;
                    }
                    /* ... (snippet truncated) ... */

And here is item_trylock, from thread.c:

    /* Special case. When ITEM_LOCK_GLOBAL mode is enabled, this should become a
     * no-op, as it's only called from within the item lock if necessary.
     * However, we can't mix a no-op and threads which are still synchronizing to
     * GLOBAL. So instead we just always try to lock. When in GLOBAL mode this
     * turns into an effective no-op. Threads re-synchronize after the power level
     * switch so it should stay safe.
     */
    void *item_trylock(uint32_t hv) {
        pthread_mutex_t *lock = &item_locks[hv & hashmask(item_lock_hashpower)];
        if (pthread_mutex_trylock(lock) == 0) {
            return lock;
        }
        return NULL;
    }
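For illustration only, here is a hypothetical mode-aware trylock. The stand-in declarations are mine so the sketch is self-contained: memcached actually tracks the lock type per worker thread, not in one global flag, and reading such a flag here would itself need synchronization, which may be exactly why upstream chose to always take the granular lock. This is a sketch of the idea, not a proposed patch.

    #include <pthread.h>
    #include <stdint.h>

    /* Stand-in declarations (assumptions, not memcached's real state).
     * Proper initialization of the mutex array is elided. */
    enum lock_mode { ITEM_LOCK_GRANULAR, ITEM_LOCK_GLOBAL };
    static enum lock_mode current_lock_mode = ITEM_LOCK_GRANULAR;
    static pthread_mutex_t item_global_lock = PTHREAD_MUTEX_INITIALIZER;
    #define ITEM_LOCK_HASHPOWER 13
    static pthread_mutex_t item_locks[1 << ITEM_LOCK_HASHPOWER];

    /* Refuse to hand out a granular lock while the process is in GLOBAL
     * mode, so do_item_alloc competes on the same mutex as everyone else. */
    void *item_trylock_aware(uint32_t hv) {
        if (current_lock_mode == ITEM_LOCK_GLOBAL) {
            if (pthread_mutex_trylock(&item_global_lock) == 0)
                return &item_global_lock;
            return NULL;
        }
        pthread_mutex_t *lock = &item_locks[hv & ((1u << ITEM_LOCK_HASHPOWER) - 1)];
        if (pthread_mutex_trylock(lock) == 0)
            return lock;
        return NULL;
    }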

gdb reproduction (partial):

    b assoc.c:211
    b item_get
    b items.c:158

    (gdb) info thread
      Id   Target Id         Frame
      6    Thread 0x7ffff4d71700 (LWP 12647) "memcached" (running)
      5    Thread 0x7ffff5572700 (LWP 12646) "memcached" (running)
      4    Thread 0x7ffff5d73700 (LWP 12645) "memcached" (running)
    * 3    Thread 0x7ffff6574700 (LWP 12644) "memcached" (running)
      2    Thread 0x7ffff6d75700 (LWP 12643) "memcached" (running)
      1    Thread 0x7ffff7fe0740 (LWP 12642) "memcached" 0x00007ffff76ad9a3 in epoll_wait () at ../sysdeps/unix/syscall-template.S:81

    //// step 1 //// add a new item; modify the value of hash_items to 100000 in gdb, to force a hash expansion.
    (gdb) thread
    [Switching to thread (Thread 0x7ffff6574700 (LWP ))]
    # do_item_alloc (key=0x7ffff0026414 "e", nkey=, flags=, exptime=, nbytes=, cur_hv=) at items.c:
    tried_alloc = ;
    (gdb) c
    Continuing.

    Breakpoint , assoc_insert (it=0x7ffff7f35f70, hv=) at assoc.c:
    if (! expanding && hash_items > (hashsize(hashpower) * 3) / 2) {
    (gdb) p hash_items
    $ =
    Breakpoint , assoc_maintenance_thread (arg=0x0) at assoc.c:
    item_lock_global();
    (gdb) info thread
      Id   Target Id         Frame
      Thread 0x7ffff4d71700 (LWP ) "memcached" assoc_maintenance_thread (arg=0x0) at assoc.c:
      Thread 0x7ffff5572700 (LWP ) "memcached" (running)
      Thread 0x7ffff5d73700 (LWP ) "memcached" (running)
    * Thread 0x7ffff6574700 (LWP ) "memcached" (running)
      Thread 0x7ffff6d75700 (LWP ) "memcached" (running)
      Thread 0x7ffff7fe0740 (LWP ) "memcached" (running)

    ///// delete all keys in the cache.
    ///// step 2 ///// add an item, key="71912_yhd.serial.product.get_1.0_0", into slot 2, as the first item there.
    //// step 3 //// get the key "71912_yhd.serial.product.get_1.0_0",
    /// step 4 /// and try to add a new key "Y_ORDER_225426358_02" to slot 2.
    Breakpoint , do_item_alloc (key=0x7ffff0026414 "71912_yhd.serial.product.get_1.0_0", nkey=, flags=, exptime=, nbytes=, cur_hv=) at items.c:
    const uint32_t cur_hv) {
    (gdb) n

    ....

    Breakpoint , item_get (key=0x7fffe0026414 "71912_yhd.serial.product.get_1.0_0", nkey=) at thread.c:
    hv = hash(key, nkey);
    (gdb) info thread
      Id   Target Id         Frame
      Thread 0x7ffff4d71700 (LWP ) "memcached" assoc_maintenance_thread (arg=0x0) at assoc.c:
      Thread 0x7ffff5572700 (LWP ) "memcached" (running)
      Thread 0x7ffff5d73700 (LWP ) "memcached" (running)
      Thread 0x7ffff6574700 (LWP ) "memcached" do_item_alloc (key=0x7ffff0026414 "Y_ORDER_225426358_02", nkey=, flags=, exptime=, nbytes=, cur_hv=) at items.c:
    * Thread 0x7ffff6d75700 (LWP ) "memcached" item_get (key=0x7fffe0026414 "71912_yhd.serial.product.get_1.0_0", nkey=) at thread.c:
      Thread 0x7ffff7fe0740 (LWP ) "memcached" (running)
    (gdb) n
    item_lock(hv);
    (gdb)
    it = do_item_get(key, nkey, hv);
    (gdb)
    item_unlock(hv);


    (gdb) thread
    [Switching to thread (Thread 0x7ffff6574700 (LWP ))]
    # do_item_alloc (key=0x7ffff0026414 "Y_ORDER_225426358_02", nkey=, flags=, exptime=, nbytes=, cur_hv=) at items.c:
    if ((search->exptime != 0 && search->exptime < current_time)
    (gdb) p search->refcount
    $ = 3   ------> now the refcount is wrong (the eviction path would show it even better).
    (gdb) bt
    # do_item_alloc (key=0x7ffff0026414 "Y_ORDER_225426358_02", nkey=, flags=, exptime=, nbytes=, cur_hv=) at items.c:
    # 0x0000000000417c26 in item_alloc (key=0x7ffff0026414 "Y_ORDER_225426358_02", nkey=, flags=, exptime=, nbytes=) at thread.c:
    # 0x000000000040ace9 in process_update_command (c=0x7ffff0026200, tokens=0x7ffff6573be0, ntokens=, comm=, handle_cas=false) at memcached.c:
    # 0x000000000040bde2 in process_command (c=0x7ffff0026200, command=0x7ffff0026410 "set") at memcached.c:
    # 0x000000000040cc47 in try_read_command (c=0x7ffff0026200) at memcached.c:
    # 0x000000000040d958 in drive_machine (c=0x7ffff0026200) at memcached.c:
    # 0x000000000040e4f7 in event_handler (fd=, which=, arg=0x7ffff0026200) at memcached.c:
    # 0x00007ffff7ba3f24 in event_base_loop () from /usr/lib/x86_64-linux-gnu/libevent-2.0.so.
    # 0x0000000000417926 in worker_libevent (arg=0x635ea0) at thread.c:
    # 0x00007ffff7980182 in start_thread (arg=0x7ffff6574700) at pthread_create.c:
    # 0x00007ffff76ad30d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:
    (gdb)
    Connected to 127.0.0.1.
    Escape character is '^]'.
    set a

    STORED
    ^[[A^[[B
    ERROR
    set b

    STORED
    get 71912_yhd.serial.product.get_1.0_0

    jason@gy:~$ telnet 127.0.0.1
    Trying 127.0.0.1...
    Connected to 127.0.0.1.
    Escape character is '^]'.
    set c

    STORED
    set 71912_yhd.serial.product.get_1.0_0

    STORED
    delete 71912_yhd.serial.product.get_1.0_0
    DELETED
    set 71912_yhd.serial.product.get_1.0_0

    STORED
    delete Y_ORDER_225426358_02
    delete a
    delete b
    delete c
    NOT_FOUND
    DELETED
    NOT_FOUND
    NOT_FOUND
    set Y_ORDER_225426358_02
    aaaaaaaaaaaaaaaaaaaa

I will try to write a test script that reproduces it, and try to submit a patch that fixes it.

------------------------------------------------

I hit a dead loop in memcached (v1.4.15, on CentOS 6.3 x86_64) which I cannot reproduce. In our production environment there are hundreds of memcached instances running, and this bug has happened 3 times this year. When it occurs, thousands of TCP connections stay in CLOSE_WAIT status until the maximum connection number is reached, and then clients can no longer connect to the cache servers. The SAs had to restart the memcached instances to recover our business, but the most recent time I got the chance to capture a core file.

FYI: before starting, you can get the memcached package (with a debug-info package) at http://mirrors.htbindustries.org/CentOS/6/x86_64/, and the core file at http://pan.baidu.com/s/1kTFTRQf

Backtrace in gdb:

    Thread (Thread 0x7fa8a1ee1700 (LWP )):
    # 0x0000003c9e2e7c73 in epoll_wait () from /lib64/libc.so.
    # 0x000000323cc12e4b in ?? () from /usr/lib64/libevent-1.4.so.
    # 0x000000323cc068c3 in event_base_loop () from /usr/lib64/libevent-1.4.so.
    # 0x0000000000406447 in main (argc=<value optimized out>,
        argv=<value optimized out>) at memcached.c:

    Thread (Thread 0x7fa89e021700 (LWP )):
    # 0x0000003c9e60b43c in pthread_cond_wait@@GLIBC_2.3.2 ()
        from /lib64/libpthread.so.
    # 0x000000000040cb33 in slab_rebalance_thread (arg=<value optimized out>)
        at slabs.c:
    # 0x0000003c9e607851 in start_thread () from /lib64/libpthread.so.
    # 0x0000003c9e2e767d in clone () from /lib64/libc.so.

    Thread (Thread 0x7fa89ea22700 (LWP )):
    # 0x0000003c9e2ab91d in nanosleep () from /lib64/libc.so.
    # 0x0000003c9e2ab790 in sleep () from /lib64/libc.so.
    # 0x000000000040d0ad in slab_maintenance_thread (arg=<value optimized out>)
        at slabs.c:
    # 0x0000003c9e607851 in start_thread () from /lib64/libpthread.so.
    # 0x0000003c9e2e767d in clone () from /lib64/libc.so.

    Thread (Thread 0x7fa89f423700 (LWP )):
    # 0x0000003c9e60b43c in pthread_cond_wait@@GLIBC_2.3.2 ()
        from /lib64/libpthread.so.
    # 0x000000000040fa8d in assoc_maintenance_thread (arg=<value optimized out>)
        at assoc.c:
    # 0x0000003c9e607851 in start_thread () from /lib64/libpthread.so.
    # 0x0000003c9e2e767d in clone () from /lib64/libc.so.

    Thread (Thread 0x7fa89fe24700 (LWP )):
    # 0x000000000040fd34 in assoc_find (key=<value optimized out>,
        nkey=<value optimized out>, hv=<value optimized out>) at assoc.c:
    # 0x000000000040ef2e in do_item_get (key=0x7fa88008bd34 "B920818_0",
        nkey=<value optimized out>, hv=) at items.c:
    # 0x0000000000411076 in item_get (key=0x7fa88008bd34 "B920818_0", nkey=)
        at thread.c:
    # 0x000000000040731e in process_get_command (c=0x7fa88008bb30,
        tokens=0x7fa89fe23bf0, ntokens=<value optimized out>, return_cas=false)
        at memcached.c:
    # 0x0000000000409b63 in process_command (c=0x7fa88008bb30,
        command=<value optimized out>) at memcached.c:
    # 0x000000000040a7e2 in try_read_command (c=0x7fa88008bb30)
        at memcached.c:
    # 0x000000000040b478 in drive_machine (fd=<value optimized out>,
        which=<value optimized out>, arg=0x7fa88008bb30) at memcached.c:
    # event_handler (fd=<value optimized out>, which=<value optimized out>,
        arg=0x7fa88008bb30) at memcached.c:
    # 0x000000323cc06b44 in event_base_loop () from /usr/lib64/libevent-1.4.so.
    # 0x000000000041070d in worker_libevent (arg=0x1ad5d88) at thread.c:
    # 0x0000003c9e607851 in start_thread () from /lib64/libpthread.so.
    # 0x0000003c9e2e767d in clone () from /lib64/libc.so.

    Thread (Thread 0x7fa8a0825700 (LWP )):
    # 0x0000003c9e6094b3 in pthread_mutex_trylock () from /lib64/libpthread.so.
    # 0x0000000000410d10 in mutex_lock (hv=<value optimized out>)
        at memcached.h:
    # item_lock (hv=<value optimized out>) at thread.c:
    # 0x0000000000411069 in item_get (
        key=0x7fa8985d8944 "1368_yhd.orders.get_1.0_visitDateList_0", nkey=)
        at thread.c:
    # 0x000000000040731e in process_get_command (c=0x7fa89829b7e0,
        tokens=0x7fa8a0824bf0, ntokens=<value optimized out>, return_cas=false)
        at memcached.c:
    # 0x0000000000409b63 in process_command (c=0x7fa89829b7e0,
        command=<value optimized out>) at memcached.c:
    # 0x000000000040a7e2 in try_read_command (c=0x7fa89829b7e0)
        at memcached.c:
    # 0x000000000040b478 in drive_machine (fd=<value optimized out>,
        which=<value optimized out>, arg=0x7fa89829b7e0) at memcached.c:
    # event_handler (fd=<value optimized out>, which=<value optimized out>,
        arg=0x7fa89829b7e0) at memcached.c:
    # 0x000000323cc06b44 in event_base_loop () from /usr/lib64/libevent-1.4.so.
    # 0x000000000041070d in worker_libevent (arg=0x1ad2a00) at thread.c:
    # 0x0000003c9e607851 in start_thread () from /lib64/libpthread.so.
    # 0x0000003c9e2e767d in clone () from /lib64/libc.so.

    Thread (Thread 0x7fa8a1226700 (LWP )):
    # 0x0000003c9e6094b3 in pthread_mutex_trylock () from /lib64/libpthread.so.
    # 0x0000000000410d10 in mutex_lock (hv=<value optimized out>)
        at memcached.h:
    # item_lock (hv=<value optimized out>) at thread.c:
    # 0x0000000000410d7d in store_item (item=0x7fa89d0dddb8, comm=,
        c=0x7fa880118a00) at thread.c:
    # 0x000000000040bed4 in complete_nread_ascii (fd=<value optimized out>,
        which=<value optimized out>, arg=0x7fa880118a00) at memcached.c:
    # complete_nread (fd=<value optimized out>, which=<value optimized out>,
        arg=0x7fa880118a00) at memcached.c:
    # drive_machine (fd=<value optimized out>, which=<value optimized out>,
        arg=0x7fa880118a00) at memcached.c:
    # event_handler (fd=<value optimized out>, which=<value optimized out>,
        arg=0x7fa880118a00) at memcached.c:
    # 0x000000323cc06b44 in event_base_loop () from /usr/lib64/libevent-1.4.so.
    # 0x000000000041070d in worker_libevent (arg=0x1acf678) at thread.c:
    # 0x0000003c9e607851 in start_thread () from /lib64/libpthread.so.
    # 0x0000003c9e2e767d in clone () from /lib64/libc.so.

    Thread (Thread 0x7fa8a1c27700 (LWP )):
    # 0x0000003c9e6094b3 in pthread_mutex_trylock () from /lib64/libpthread.so.
    # 0x0000000000410d10 in mutex_lock (hv=<value optimized out>)
        at memcached.h:
    # item_lock (hv=<value optimized out>) at thread.c:
    # 0x0000000000411069 in item_get (
        key=0x7fa881755f64 "4872_yhd.orders.get_1.0_0", nkey=) at thread.c:
    # 0x000000000040731e in process_get_command (c=0x7fa88049e180,
        tokens=0x7fa8a1c26bf0, ntokens=<value optimized out>, return_cas=false)
        at memcached.c:
    # 0x0000000000409b63 in process_command (c=0x7fa88049e180,
        command=<value optimized out>) at memcached.c:
    # 0x000000000040a7e2 in try_read_command (c=0x7fa88049e180)
        at memcached.c:
    # 0x000000000040b478 in drive_machine (fd=<value optimized out>,
        which=<value optimized out>, arg=0x7fa88049e180) at memcached.c:
    # event_handler (fd=<value optimized out>, which=<value optimized out>,
        arg=0x7fa88049e180) at memcached.c:
    # 0x000000323cc06b44 in event_base_loop () from /usr/lib64/libevent-1.4.so.
    # 0x000000000041070d in worker_libevent (arg=0x1acc2f0) at thread.c:
    # 0x0000003c9e607851 in start_thread () from /lib64/libpthread.so.
    # 0x0000003c9e2e767d in clone () from /lib64/libc.so.
    (gdb)

It is easy to see that something is wrong in thread 4 (it is stuck inside assoc_find), and that thread 4 holds an item lock on which the other worker threads are blocked.

    (gdb) thread
    [Switching to thread (Thread 0x7fa89fe24700 (LWP ))]
    # 0x000000000040fd34 in assoc_find (key=<value optimized out>, nkey=<value optimized out>,
        hv=<value optimized out>) at assoc.c:
    if ((nkey == it->nkey) && (memcmp(key, ITEM_key(it), nkey) == 0)) {
    (gdb) bt
    # 0x000000000040fd34 in assoc_find (key=<value optimized out>,
        nkey=<value optimized out>, hv=<value optimized out>) at assoc.c:
    # 0x000000000040ef2e in do_item_get (key=0x7fa88008bd34 "B920818_0",
        nkey=<value optimized out>, hv=) at items.c:
    # 0x0000000000411076 in item_get (key=0x7fa88008bd34 "B920818_0", nkey=)
        at thread.c:
    # 0x000000000040731e in process_get_command (c=0x7fa88008bb30,
        tokens=0x7fa89fe23bf0, ntokens=<value optimized out>, return_cas=false)
        at memcached.c:
    # 0x0000000000409b63 in process_command (c=0x7fa88008bb30,
        command=<value optimized out>) at memcached.c:
    # 0x000000000040a7e2 in try_read_command (c=0x7fa88008bb30)
        at memcached.c:
    # 0x000000000040b478 in drive_machine (fd=<value optimized out>,
        which=<value optimized out>, arg=0x7fa88008bb30) at memcached.c:
    # event_handler (fd=<value optimized out>, which=<value optimized out>,
        arg=0x7fa88008bb30) at memcached.c:
    # 0x000000323cc06b44 in event_base_loop () from /usr/lib64/libevent-1.4.so.2

    (gdb) p it
    $ = (item *) 0x7fa89799e5b0
    (gdb) p *it
    $ = {next = 0x7fa89d0646c0, prev = 0x7fa8979b0760, h_next = 0x7fa89799e5b0, time = , exptime = , nbytes = , refcount = , nsuffix = '\n',
    it_flags = '\003', slabs_clsid = '\002', nkey = '"', data = 0x7fa89799e5b0}
    (gdb)

So it->h_next points to itself, and that is the dead loop.
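For reference, the probe loop in assoc_find (paraphrased from assoc.c, omitting the expansion branch) shows why a self-pointing h_next spins forever:

    /* Paraphrased from assoc_find(): walk the bucket chain via h_next.
     * With it->h_next == it, neither exit condition can ever fire. */
    item *it = primary_hashtable[hv & hashmask(hashpower)];
    item *ret = NULL;
    while (it) {
        if ((nkey == it->nkey) && (memcmp(key, ITEM_key(it), nkey) == 0)) {
            ret = it;
            break;
        }
        it = it->h_next;   /* here: h_next == it, so the loop never ends */
    }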

I have checked all the code that modifies it->h_next (the assoc_insert/assoc_delete/assoc_maintenance_thread functions); wherever it is modified, the "cache_lock" must be held. So I cannot find the reason why an item ends up pointing to itself.

More information:

In thread 4, the argument values can be recovered from the registers (they are optimized out in the backtrace):

    si/r8 --> nkey = 9
    di/r9 --> key = "B920818_0"
    hv = 0xe1db11c7, so the offset into primary_hashtable is 0x111c7

    (gdb) p primary_hashtable
    $22 = (item **) 0x7fa8840008c0

So I get the first item in the hash bucket; the arithmetic can be double-checked with the small standalone program below.
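A standalone check of that arithmetic; the base address and hv are the values recovered above:

    #include <stdio.h>
    #include <stdint.h>

    int main(void) {
        uint32_t hv = 0xe1db11c7;
        unsigned int hashpower = 17;            /* value found in the core */
        uint32_t mask = (1u << hashpower) - 1;  /* hashmask(17) = 0x1ffff */
        uint32_t bucket = hv & mask;
        printf("bucket = 0x%x\n", bucket);      /* prints 0x111c7 */
        printf("slot   = 0x%lx\n",              /* prints 0x7fa8840896f8 */
               0x7fa8840008c0UL + 8UL * bucket);
        return 0;
    }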

    (gdb) x /10g 0x00007fa8840008c0+0x8*0x111c7
    0x7fa8840896f8: 0x00007fa890b46ef0

The dead-looping item's address is then the second item in that hash bucket's chain; yes, there is a hash collision.

    (gdb) x /10xg 0x00007fa890b46ef0
    0x7fa890b46ef0: 0x00007fa89d1884f0 0x00007fa89d1ee2b0
    0x7fa890b46f00: 0x00007fa89799e5b0 0x01f3710601f362f6
    0x7fa890b46f10: 0x0307000100000003 0x0000000000001501
    0x7fa890b46f20: 0x000000030ed02eb6 0x4544524f5f594247
    0x7fa890b46f30: 0x3632343532325f52 0x332032305f383533
    (gdb)

That second item in the hash bucket list is:

    (gdb) p it
    $ = (item *) 0x7fa89799e5b0
    (gdb) p *it
    $ = {next = 0x7fa89d0646c0, prev = 0x7fa8979b0760, h_next = 0x7fa89799e5b0, time = , exptime = , nbytes = , refcount = , nsuffix = '\n',
    it_flags = '\003', slabs_clsid = '\002', nkey = '"', data = 0x7fa89799e5b0}
    (gdb)
    (gdb) p *(item *)0x00007fa89d1884f0
    $ = {next = 0x7fa890b4f110, prev = 0x7fa890b46ef0, h_next = 0x0, time = , exptime = , nbytes = , refcount = , nsuffix = '\a',
    it_flags = '\003', slabs_clsid = '\001', nkey = '\025', data = 0x7fa89d1884f0}
    (gdb)

item[1]:"71912_yhd.serial.product.get_1.0_0 16384 8\r\n" nsuffix=10; slab 2; nkey=34; nbytes=10

(little endian)
(gdb) x /100xb 0x00007fa89799e5b0
0x7fa89799e5b0: 0xc0 0x46 0x06 0x9d 0xa8 0x7f 0x00 0x00
0x7fa89799e5b8: 0x60 0x07 0x9b 0x97 0xa8 0x7f 0x00 0x00
0x7fa89799e5c0: 0xb0 0xe5 0x99 0x97 0xa8 0x7f 0x00 0x00
0x7fa89799e5c8: 0x79 0x48 0xe3 0x01 0xf9 0x99 0xe4 0x01
0x7fa89799e5d0: 0x0a 0x00 0x00 0x00 0x01 0x00 0x0a 0x03
0x7fa89799e5d8: 0x02 0x22 0x00 0x00 0x00 0x00 0x00 0x00
0x7fa89799e5e0: 0x6d 0x33 0x4f 0xd6 0x02 0x00 0x00 0x00
0x7fa89799e5e8: 0x37 0x31 0x39 0x31 0x32 0x5f 0x79 0x68
0x7fa89799e5f0: 0x64 0x2e 0x73 0x65 0x72 0x69 0x61 0x6c
0x7fa89799e5f8: 0x2e 0x70 0x72 0x6f 0x64 0x75 0x63 0x74
0x7fa89799e600: 0x2e 0x67 0x65 0x74 0x5f 0x31 0x2e 0x30
0x7fa89799e608: 0x5f 0x30 0x20 0x20 0x31 0x36 0x33 0x38
0x7fa89799e610: 0x34 0x20 0x38 0x0d 0x0a 0x00 [0x00 0x00   8\r\n"
0x7fa89799e618: 0x00 0x00 0x00 0x00 0x2d 0x0d] 0x0a 0x0a
0x7fa89799e620: 0x08 0xe1 0x0d 0x0a 0x0d 0x70 0x0d 0x0a
(key


item[0]:"Y_ORDER_225426358_02 32 1\r\n1\r\n513\r\n" nssuffix=7,slab=1,nkey=21,nbytes=3
item[1]:"71912_yhd.serial.product.get_1.0_0  16384 8\r\n"   nsuffix=10;  slab 2; nkey=34; nbytes=10
when the deadloop happened, it is looking for key "B920818_0". and the key is not in the list(may be it is).

it is odd that the item[1] have no data, although the nbyte is 10!
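For reading the dump: the 1.4.x item payload layout (macros from memcached.h of that era). Here it_flags = '\003' = ITEM_LINKED|ITEM_CAS, so the key starts after an 8-byte CAS id; that is why the key bytes "71912..." begin at 0x7fa89799e5e8, eight bytes into the data field.

    /* From memcached.h (1.4.x): payload is [cas][key]\0[suffix][data] */
    #define ITEM_key(item) (((char*)&((item)->data)) \
             + (((item)->it_flags & ITEM_CAS) ? sizeof(uint64_t) : 0))

    #define ITEM_suffix(item) ((char*) &((item)->data) + (item)->nkey + 1 \
             + (((item)->it_flags & ITEM_CAS) ? sizeof(uint64_t) : 0))

    #define ITEM_data(item) ((char*) &((item)->data) + (item)->nkey + 1 \
             + (item)->nsuffix \
             + (((item)->it_flags & ITEM_CAS) ? sizeof(uint64_t) : 0))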

----------------

I guess the bug occurred during a hash expansion: hashpower=17 (the default is 16) and hash_items=101025, and the table expands when hash_items > 98304. It is very likely that the dead loop happened right after an expansion, and then all the threads hung.
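The arithmetic checks out (a standalone sketch; the condition is the one visible at the assoc_insert breakpoint earlier):

    #include <stdio.h>

    int main(void) {
        unsigned int hashpower = 16;                           /* the default */
        unsigned long threshold = ((1UL << hashpower) * 3) / 2;
        printf("expand when hash_items > %lu\n", threshold);   /* 98304 */
        /* hash_items reached 101025 > 98304, so assoc_expand() ran and
         * hashpower became 17, matching the values in the core file. */
        return 0;
    }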

    (gdb) p expand_bucket
    $ =
    (gdb) p stats.hash_is_expanding
    $ = false
    (gdb) p hash_items
    $ = 101025
    (gdb) p **/
    $ = 98304

    (gdb) p expanding
    $20 = false
    (gdb)

    (gdb) p hashpower
    $21 = 17
    (gdb)


I cannot find the bug in the code, so if anyone has any suggestions, please tell me.

Thanks a lot.
