Friday, October 14, 2005

Compiler reordering problem

I'm going to write about a compiler reordering problem in door_return() function which was observed in July 2002. The customer was able to reproduce the problem for us and it took me a while to figure out that it was a compiler reordering problem. I must thank our customers for being so co-operative when we get such issues. I must have given instrumented kernels for at least five times before I found out the problem. It's bug 4699850.

The symptom was very clear. System used to panic in Solaris Kernel Dispatcher routines and one of the symptom was system panicing in dispdeq() while removing a kernel thread from the dispatch queue of a CPU.

We know that compiler can reorder C statments if they are independent. Assume this piece of C code:

#define THREAD_SET_STATE(tp, state, lp) \
((tp)->t_state = state, (tp)->t_lockp = lp)

t_lockp is a pointer to a dispatcher lock and we don't know whether lp is held or not. When a thread is made TS_ONPROC, the t_lockp of the corresponding thread points to cpu_thread_lock of CPU (cpu_t). In the above mentioned C code, these stores can be reordered can be re-ordered by compiler, so the lp should be held while calling setting the threads state.

In door_return(), when server thread is about to handoff to client thread to return the results, it makes the client thread TS_ONPROC and calls shuttle_resume() on client thread. The responsibility of shuttle_resume() is to make client/server thread TS_ONPROC and the caller sleeps on shuttle_lock sync obj.

While putting a thread onproc, dispatcher routines need not hold cpu_thread_lock and hence in door_return() if we call THREAD_ONPROC(), we effectively lost thread lock on the client thread.

Now lets look at the two stores again. It t_lockp reaches global visibility before t_state, we can effectively lose thread lock on the thread. Assume another thread on different CPU is sending a signal to client door thread. Once the thread lock is lost on the client thread, the thread which is sending signal to client thread could see the old state of client thread (in this case it happens to be TS_SLEEP). Since the state is TS_SLEEP, eat_signal() will do setrun() on the client thread which enqueues client thread in the dispatch queue of the CPU. As a result, we can see some very strange things happening which also included dispdeq() panic.

The following code in door_return() was faulty:

int
door_return(caddr_t data_ptr, size_t data_size,
door_desc_t *desc_ptr, uint_t desc_num, caddr_t sp)
{
[.]
tlp = caller->t_lockp;
/*
* Setting t_disp_queue prevents erroneous preemptions
* if this thread is still in execution on another
* processor
*/
caller->t_disp_queue = cp->cpu_disp;
CL_ACTIVE(caller);
/*
* We are calling thread_onproc() instead of
* THREAD_ONPROC() because compiler can reorder
* the two stores of t_state and t_lockp in
* THREAD_ONPROC().
*/
thread_onproc(caller, cp);
disp_lock_exit_high(tlp);
shuttle_resume(caller, &door_knob);
[.]
}


I had used TNF (trace normal form) for finding out this problem. But now we have a powerful tool to trace from userland to kernel and of course it's Dtrace.


Technorati Tag:
Technorati Tag:
Technorati Tag:

No comments: