Friday, October 14, 2005

Dispatcher locks and Bug 5017148


As part of the OpenSolaris release, I'm going to describe dispatcher locks, thread locks, and a bug which I root-caused last year. The investigation didn't take much time, but it was an interesting one, because doors do magic in the kernel at the time of handoff to the other thread (client to server or server to client). So let me begin with what a dispatcher lock is:

1. What's a dispatcher lock

A dispatcher lock is a one-byte spin lock (disp_lock_t) which is acquired at high PIL (DISP_LEVEL), the interrupt level at which dispatcher operations must be performed. There are other symbolic interrupt levels, viz. CLOCK_LEVEL and LOCK_LEVEL, defined in machlock.h.

Following are the dispatcher lock interfaces, which are implemented in disp_lock.c:

disp_lock_init() initializes dispatcher lock.
disp_lock_destroy() destroys dispatcher lock.
disp_lock_enter() acquires dispatcher lock.
disp_lock_exit() releases dispatcher lock and checks for kernel preemption.
disp_lock_exit_nopreempt() releases dispatcher lock without checking for kernel preemption.
disp_lock_enter_high() acquires another dispatcher lock when the thread is already holding a dispatcher lock.
disp_lock_exit_high() releases the top level dispatcher lock.

Here are the facts about dispatcher locks :-

(a) Being spin locks acquired at high interrupt level, dispatcher locks should be held only for a short duration, and the holder must not make blocking calls.
(b) While releasing a dispatcher lock, you can be preempted if cpu_kprunrun (kernel preemption) is set. Use disp_lock_exit_nopreempt() if you don't want to be preempted.
(c) While holding a dispatcher lock, you are not preemptible.
(d) Since acquiring a dispatcher lock raises the PIL to DISP_LEVEL, the old PIL is saved in t_oldspl of the thread structure (kthread_t).
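
To make facts (a) through (d) concrete, here is a minimal sketch of disp_lock_enter() and disp_lock_exit(), reconstructed from the description above rather than copied from disp_lock.c; I'm assuming the low-level primitives lock_set_spl(), lock_clear_splx(), ipltospl() and kpreempt():

/*
 * Hedged sketch, not verbatim source.  lock_set_spl() raises the PIL
 * to DISP_LEVEL, spins for the byte lock, and saves the old PIL;
 * lock_clear_splx() drops the lock and restores the PIL.
 */
void
disp_lock_enter(disp_lock_t *lp)
{
    /* fact (d): the old PIL is saved in curthread->t_oldspl */
    lock_set_spl(lp, ipltospl(DISP_LEVEL), &curthread->t_oldspl);
}

void
disp_lock_exit(disp_lock_t *lp)
{
    if (CPU_ON_INTR(CPU) == 0 && CPU->cpu_kprunrun) {
        /* fact (b): releasing may preempt us if kprunrun is set */
        CPU->cpu_kprunrun = 0;
        lock_clear_splx(lp, curthread->t_oldspl);
        kpreempt(KPREEMPT_SYNC);
    } else {
        lock_clear_splx(lp, curthread->t_oldspl);
    }
}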

2. What's a thread lock


A thread lock is a per-thread entity which protects t_state and the state-related flags of a kernel thread. It hangs off kthread_t as t_lockp, a pointer to the thread's current dispatcher lock; the pointer is changed whenever the state of the kernel thread changes. One acquires the thread lock using the thread_lock() routine, passing the kernel thread pointer; thread_lock() is responsible for finding the correct dispatcher lock for the thread. The dance done by thread_lock() is interesting, because t_lockp is a pointer and can change while we are spinning for the dispatcher lock. Hence thread_lock() saves the t_lockp pointer and, after acquiring the lock, verifies that it got the right one.
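
Here's a minimal sketch of that dance (simplified; the real thread_lock() in disp_lock.c also backs off politely while spinning and keeps lock statistics):

/*
 * A hedged sketch of thread_lock(), reconstructed from the description
 * above (not verbatim disp_lock.c): snapshot t_lockp, try to take that
 * lock, then re-check that t_lockp still points to the same lock.  If
 * the thread changed state (and hence lock) while we were spinning,
 * drop the stale lock and retry.
 */
void
thread_lock(kthread_t *t)
{
    int s = splhigh();          /* raise PIL for dispatcher work */

    for (;;) {
        disp_lock_t *lp = t->t_lockp;   /* snapshot the pointer */

        if (lock_try(lp)) {
            if (lp == t->t_lockp) {     /* still the right lock? */
                curthread->t_oldspl = s;
                return;
            }
            lock_clear(lp);     /* state changed under us; retry */
        }
    }
}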

Now let's take a look at the interfaces, which live in disp_lock.c and thread.h:

thread_lock() is called to acquire the thread lock.
thread_unlock() is called to release the thread lock; it also checks for kernel preemption.
thread_lock_high() is called to acquire another thread lock while already holding one.
thread_unlock_high() is called to release a thread lock while still holding another.
thread_unlock_nopreempt() is called to release the thread lock without checking for kernel preemption.

3. Various types of thread locks in Solaris Kernel


Now that I've described the thread lock, it's important to understand which dispatcher lock is acquired depending upon the state of the thread. To see this, you first need to understand the mapping between the state of a thread and its corresponding dispatcher lock:

TS_RUN (runnable) ---> disp_lock of the dispatch queue in a CPU (cpu_t) or global preemption queue of a CPU partition
TS_ONPROC (running) ---> cpu_thread_lock in a CPU (cpu_t)
TS_SLEEP (sleep) ---> sleepq bucket lock or turnstile chain lock
TS_STOPPED (stopped) ---> stop_lock (a global dispatcher lock) for stopped threads.

There are two global dispatcher locks in the Solaris kernel: shuttle_lock and transition_lock. When a thread's thread lock points to shuttle_lock, the thread is sleeping on a door; when it points to transition_lock, the thread is in transition to another state (for instance, when the state of a thread sleeping on a semaphore changes from TS_SLEEP to TS_RUN, or during yield()). transition_lock is always held and is never released.
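
A sketch of how a thread is parked on transition_lock, modeled on what THREAD_TRANSITION() is described to do above (a reconstruction, not verbatim source):

/*
 * Point the thread at the always-held transition_lock, then release
 * the old dispatcher lock.  Anyone doing thread_lock() on this thread
 * will now spin on transition_lock until the thread reaches its next
 * state and t_lockp is retargeted again.
 */
void
thread_transition(kthread_t *t)
{
    disp_lock_t *lp;

    ASSERT(THREAD_LOCK_HELD(t));

    lp = t->t_lockp;
    t->t_lockp = &transition_lock;  /* transition_lock is never released */
    disp_lock_exit_high(lp);        /* now safe to drop the old lock */
}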

4. Examples of thread lock

Now let's understand which thread locks are involved from wakeup (or unsleep) to onproc (running) of a thread. Let's assume that T1 (thread 1) is blocked on a condition variable CV1 and T2 (thread 2) signals T1 as part of a wakeup. First, cv_signal() grabs the sleepq bucket lock and decrements the waiters count on CV1. It then calls sleepq_wakeone_chan() to wake up T1. sleepq_wakeone_chan()'s responsibility is to unlink T1 from the sleepq list (using t_link of kthread_t) and call CL_WAKEUP (the scheduling-class-specific wakeup routine). Assuming T1 is in the time sharing class (TS), ts_wakeup() gets called. ts_wakeup() in turn calls a dispatcher enqueue routine (setfrontdq() or setbackdq()), which changes the state of T1 to TS_RUN and points t_lockp at the disp_lock of the chosen CPU. Then sleepq_wakeone_chan() drops the disp_lock of the dispatch queue, and finally the sleepq dispatcher lock is released in cv_signal(). Once T1 is chosen to run, disp() removes T1 from the dispatch queue of the CPU, changes the state to TS_ONPROC, and points t_lockp at the cpu_thread_lock of the CPU.

void
cv_signal(kcondvar_t *cvp)
{
    condvar_impl_t *cp = (condvar_impl_t *)cvp;

    /* make sure the cv_waiters field looks sane */
    ASSERT(cp->cv_waiters <= CV_MAX_WAITERS);
    if (cp->cv_waiters > 0) {
        sleepq_head_t *sqh = SQHASH(cp);
        disp_lock_enter(&sqh->sq_lock);
        ASSERT(CPU_ON_INTR(CPU) == 0);
        if (cp->cv_waiters & CV_WAITERS_MASK) {
            kthread_t *t;
            cp->cv_waiters--;
            t = sleepq_wakeone_chan(&sqh->sq_queue, cp);
            /*
             * If cv_waiters is non-zero (and less than
             * CV_MAX_WAITERS) there should be a thread
             * in the queue.
             */
            ASSERT(t != NULL);
        } else if (sleepq_wakeone_chan(&sqh->sq_queue, cp) == NULL) {
            cp->cv_waiters = 0;
        }
        disp_lock_exit(&sqh->sq_lock);
    }
}
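
For completeness, here is a simplified sketch of sleepq_wakeone_chan() as described in the walk-through above (a reconstruction, not verbatim sleepq.c; the real routine does more bookkeeping):

/*
 * Find the first thread sleeping on 'chan', unlink it from the sleepq
 * via t_link, and hand it to the class wakeup routine, which sets
 * TS_RUN and retargets t_lockp to a dispatch queue's disp_lock.  That
 * disp_lock is then dropped; the caller still holds the bucket lock.
 */
kthread_t *
sleepq_wakeone_chan(sleepq_t *spq, void *chan)
{
    kthread_t **tpp = &spq->sq_first;
    kthread_t *tp;

    while ((tp = *tpp) != NULL) {
        if (tp->t_wchan == chan) {
            *tpp = tp->t_link;          /* unlink from the sleepq */
            tp->t_link = NULL;
            CL_WAKEUP(tp);              /* TS_RUN; t_lockp -> disp_lock */
            tp->t_wchan = NULL;
            thread_unlock_high(tp);     /* drop the dispatch queue lock */
            return (tp);
        }
        tpp = &tp->t_link;
    }
    return (NULL);
}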


The second example is from the preemption path. There are two types of preemption in the Solaris kernel, viz. user preemption (cpu_runrun) and kernel preemption (cpu_kprunrun). Assume that T1 is being preempted in favour of a higher-priority thread. T1 will call preempt() once it realizes that it has to give up the CPU (there are hooks in the Solaris kernel to determine this). preempt() first grabs the thread lock on itself (effectively cpu_thread_lock) and calls THREAD_TRANSITION() to point t_lockp at transition_lock. Note that the state of T1 is still TS_ONPROC while t_lockp points to transition_lock, because T1 is in a transition phase (TS_ONPROC -> TS_RUN). THREAD_TRANSITION() also releases the previous dispatcher lock, because transition_lock is always held. preempt() then calls CL_PREEMPT(), the scheduling-class-specific preemption routine, to enqueue T1 on a particular CPU. From here on it's the same as described in the first example.

void
preempt()
{
    kthread_t *t = curthread;
    klwp_t *lwp = ttolwp(curthread);

    if (panicstr)
        return;

    TRACE_0(TR_FAC_DISP, TR_PREEMPT_START, "preempt_start");

    thread_lock(t);

    if (t->t_state != TS_ONPROC || t->t_disp_queue != CPU->cpu_disp) {
        /*
         * this thread has already been chosen to be run on
         * another CPU. Clear kprunrun on this CPU since we're
         * already headed for swtch().
         */
        CPU->cpu_kprunrun = 0;
        thread_unlock_nopreempt(t);
        TRACE_0(TR_FAC_DISP, TR_PREEMPT_END, "preempt_end");
    } else {
        if (lwp != NULL)
            lwp->lwp_ru.nivcsw++;
        CPU_STATS_ADDQ(CPU, sys, inv_swtch, 1);
        THREAD_TRANSITION(t);
        CL_PREEMPT(t);
        DTRACE_SCHED(preempt);
        thread_unlock_nopreempt(t);

        TRACE_0(TR_FAC_DISP, TR_PREEMPT_END, "preempt_end");

        swtch();        /* clears CPU->cpu_runrun via disp() */
    }
}

5. An example of a dispatcher lock and Bug 5017148.


Apart from illustrating dispatcher locks, I'll also describe a problem which I found a while back. It involves the kernel door implementation too.

I usually begin by looking at what the CPUs are doing whenever I examine a crash dump from a system hang:

> ::cpuinfo
ID ADDR FLG NRUN BSPL PRI RNRN KRNRN SWITCH THREAD PROC
0 0001041d2b0 1b 1 0 60 no no t-0 3001ba04900 cluster
1 30019fe4030 1d 2 0 101 no no t-0 3003d873a40 rgmd
2 3001a38aab8 1d 1 0 165 yes yes t-0 2a1003ebd20 sched
3 0001041b778 1d 2 0 60 yes yes t-0 3004fac3c80 cluster


CPU 0 is spinning on mutex 0x30001d7cae0, which is held by thread 0x3004fac3c80 running on CPU 3. Note that a thread will spin on a mutex only while the owner is running, and in this case the owner of the mutex happens to be onproc on CPU 3.

> 0x30001d7cae0$<mutex>
0x30001d7cae0:  owner/waiters
                3004fac3c80
>

CPU 3 is our clock interrupt CPU (run ::cycinfo -v and figure out where the clock handler is registered), and thread 0x3004fac3c80 on CPU 3 seems to be spinning in cv_block() for a sleepq bucket lock (sleepq_head[]). To find out which sleepq bucket this thread is after, we can look at the wait channel t_wchan (t_lwpchan.lc_wchan) and run it through the hash function SQHASH() to get the right bucket. Since the thread is already holding its thread lock (effectively cpu_thread_lock of CPU 3) while spinning for the sleepq bucket lock, it would have blocked clock interrupts too. This can be verified from the pending clock interrupts in ::cycinfo -v.

Let's disassemble cv_block(), where thread 3004fac3c80 is stuck:

cv_block+0x9c: add %i2, 8, %i0
cv_block+0xa0: call -0x460e0
cv_block+0xa4: mov %i0, %o0

> 0x3004fac3c80::print kthread_t t_lockp
t_lockp = cpu0+0xb8
> cpu0=J
1041b778 // CPU 3
> 0x3004fac3c80::print kthread_t ! grep wchan
lc_wchan = 0x3006fc52d20

And the sleepq bucket happens to be :-


> 0x10471d88::print sleepq_head_t
{
sq_queue = {
sq_first = 0x3001b476ee0
}
sq_lock = 0xff <----- dispatcher lock is held
}

Thread 3003d873a40 running on CPU 1 is spinning in thread_lock_high().

> 3003d873a40::findstack
stack pointer for thread 3003d873a40: 2a1025964a1
[ 000002a1025964a1 panic_idle+0x1c() ]
000002a102596551 prom_rtt()
000002a1025966a1 thread_lock_high+0xc()
000002a102596751 sema_p+0x60()
000002a102596801 kobj_open+0x84()
000002a1025968d1 kobj_open_file+0x44()
[.]
000002a102597011 xdoor_proxy+0x20c()
000002a1025971f1 door_call+0x204()
000002a1025972f1 syscall_trap32+0xa8()
>

Now this is an interesting stack. Looking at the sema_p() code, we see that it first grabs the sleepq bucket lock and then tries to grab the thread lock.

Since the hash function SQHASH() returns the same index for 0x3006fc52d20 and 0x300819f3118, we have a deadlock: sema_p() on CPU 1 is stuck on the thread lock held by the thread running on CPU 3, while the thread on CPU 3 is stuck because the sleepq bucket lock is held by the thread running on CPU 1.


> 0x3003d873a40::print kthread_t t_lockp
t_lockp = cpu0+0xb8
> cpu0+0xb8/x
cpu0+0xb8: ff00
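
To see why two unrelated synchronization objects can end up behind one bucket lock, here is a small userland demonstration. The hash below is an assumption for illustration only; the real SQHASH() lives in sys/sleepq.h and its exact shifts and bucket count may differ. In this dump the colliding pair was 0x3006fc52d20 (the condition variable) and 0x300819f3118 (the semaphore).

#include <stdio.h>
#include <stdint.h>

#define NSLEEPQ 2048    /* assumed bucket count, for illustration */

static unsigned int
sqhash(uintptr_t chan)
{
    /* assumed shift-and-add hash; not the exact kernel formula */
    return (((chan >> 2) + (chan >> 9)) & (NSLEEPQ - 1));
}

int
main(void)
{
    /* two hypothetical wait channels that collide under this hash */
    uintptr_t cv_chan = 0x30000001000UL;
    uintptr_t sema_chan = 0x30000101000UL;

    printf("cv   %lx -> bucket %u\n", (unsigned long)cv_chan,
        sqhash(cv_chan));
    printf("sema %lx -> bucket %u\n", (unsigned long)sema_chan,
        sqhash(sema_chan));
    return (0);
}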

Now let's find out the root cause of this deadlock. Looking at t_cpu of thread 0x3003d873a40, we see that thread 0x3003d873a40, running on CPU 1, has t_lockp pointing to CPU 3's cpu_thread_lock. This is really nasty, as we would expect it to point to CPU 1's cpu_thread_lock.

> 0x3003d873a40::print kthread_t ! grep cpu
t_bound_cpu = 0
t_cpu = 0x30019fe4030
t_lockp = cpu0+0xb8 // CPU 3's cpu_thread_lock
t_disp_queue = cpu0+0x78

The cause of this problem is that door_get_server(), while doing the handoff to the server thread, gets preempted, because disp_lock_exit() checks for kernel preemption:

static kthread_t *
door_get_server(door_node_t *dp)
{
    [.]
    /*
     * Mark the thread as ONPROC and take it off the list
     * of available server threads. We are committed to
     * resuming this thread now.
     */
    disp_lock_t *tlp = server_t->t_lockp;
    cpu_t *cp = CPU;

    pool->dp_threads = server_t->t_door->d_servers;
    server_t->t_door->d_servers = NULL;
    /*
     * Setting t_disp_queue prevents erroneous preemptions
     * if this thread is still in execution on another processor
     */
    server_t->t_disp_queue = cp->cpu_disp;
    CL_ACTIVE(server_t);
    /*
     * We are calling thread_onproc() instead of
     * THREAD_ONPROC() because compiler can reorder
     * the two stores of t_state and t_lockp in
     * THREAD_ONPROC().
     */
    thread_onproc(server_t, cp);
    disp_lock_exit(tlp);
    return (server_t);
    [.]


As a result, the server thread's t_lockp points to the wrong cpu_thread_lock, because the client thread was running on a different CPU by the time it did shuttle_resume() to the server thread. Note that door_return() (which returns the results to the caller) releases its dispatcher lock without getting preempted, so we didn't notice this problem in door_return().
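
Given the interfaces from section 1, one natural way to close the window is to release the run queue lock without the kernel preemption check, so the client thread cannot migrate to another CPU between thread_onproc() and the handoff. This is only a hedged sketch of the idea, not the actual putback for 5017148:

    /*
     * Hedged sketch, not necessarily the real fix: with the
     * nopreempt variant we cannot be preempted here, so
     * server_t->t_lockp keeps pointing at the cpu_thread_lock
     * of the CPU that actually resumes the server thread.
     */
    thread_onproc(server_t, cp);
    disp_lock_exit_nopreempt(tlp);
    return (server_t);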

On the move to crack another problem now... In fact, we don't get sleep if we don't take a look at a crash dump :-)



Compiler reordering problem

I'm going to write about a compiler reordering problem in the door_return() function which was observed in July 2002. The customer was able to reproduce the problem for us, and it took me a while to figure out that it was a compiler reordering problem. I must thank our customers for being so co-operative when we get such issues. I must have handed out instrumented kernels at least five times before I found the problem. It's bug 4699850.

The symptom was very clear. The system used to panic in Solaris kernel dispatcher routines; one instance was a panic in dispdeq() while removing a kernel thread from the dispatch queue of a CPU.

We know that the compiler can reorder C statements if they are independent. Consider this piece of C code:

#define THREAD_SET_STATE(tp, state, lp) \
((tp)->t_state = state, (tp)->t_lockp = lp)

t_lockp is a pointer to a dispatcher lock, and we don't know whether lp is held or not. When a thread is made TS_ONPROC, its t_lockp points to the cpu_thread_lock of a CPU (cpu_t). In the C code above, the two stores can be reordered by the compiler, so lp should be held while setting the thread's state. When lp is not held, a helper function that pins the store order, sketched below, avoids the problem.
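
One way such a helper can pin the order is to perform both stores through volatile-qualified lvalues, which the compiler may not reorder with respect to each other. This is a sketch of the idea behind thread_onproc(), not the actual source; on SPARC, the TSO memory model then keeps the two stores ordered in hardware as well.

/*
 * Hedged sketch: both stores go through a volatile pointer, so the
 * compiler must emit them in program order, meaning t_state becomes
 * TS_ONPROC before t_lockp is retargeted to the CPU's thread lock.
 */
void
thread_onproc(kthread_t *t, cpu_t *cp)
{
    volatile kthread_t *vt = t;

    vt->t_state = TS_ONPROC;                /* store 1: state first */
    vt->t_lockp = &cp->cpu_thread_lock;     /* store 2: then the lock */
}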

In door_return(), when the server thread is about to hand off to the client thread to return the results, it makes the client thread TS_ONPROC and calls shuttle_resume() on the client thread. The responsibility of shuttle_resume() is to make the client/server thread TS_ONPROC while the caller goes to sleep on the shuttle_lock synchronization object.

While putting a thread onproc, dispatcher routines need not hold cpu_thread_lock; hence in door_return(), if we call THREAD_ONPROC(), we can effectively lose the thread lock on the client thread.

Now let's look at the two stores again. If t_lockp reaches global visibility before t_state, we effectively lose the thread lock on the thread. Assume another thread on a different CPU is sending a signal to the client door thread. Once the thread lock on the client thread is lost, the signalling thread could see the old state of the client thread (in this case, TS_SLEEP). Since the state is TS_SLEEP, eat_signal() will do setrun() on the client thread, which enqueues the client thread on the dispatch queue of a CPU. As a result, we can see some very strange things happening, including the dispdeq() panic.

The faulty code path in door_return() is shown below; the listing reflects the code after the fix, which calls thread_onproc() instead of the THREAD_ONPROC() macro:

int
door_return(caddr_t data_ptr, size_t data_size,
    door_desc_t *desc_ptr, uint_t desc_num, caddr_t sp)
{
    [.]
    tlp = caller->t_lockp;
    /*
     * Setting t_disp_queue prevents erroneous preemptions
     * if this thread is still in execution on another
     * processor
     */
    caller->t_disp_queue = cp->cpu_disp;
    CL_ACTIVE(caller);
    /*
     * We are calling thread_onproc() instead of
     * THREAD_ONPROC() because compiler can reorder
     * the two stores of t_state and t_lockp in
     * THREAD_ONPROC().
     */
    thread_onproc(caller, cp);
    disp_lock_exit_high(tlp);
    shuttle_resume(caller, &door_knob);
    [.]
}


I used TNF (Trace Normal Form) to find this problem. But now we have a powerful tool to trace from userland to kernel, and of course that's DTrace.

