Friday, October 14, 2005

Dispatcher locks and Bug 5017148


As part of the opensolaris release, I'm going to describe dispatcher locks, thread locks, and a bug which I root-caused last year. The investigation didn't take much time, but it was an interesting one because doors do some magic in the kernel at the time of handoff to the other thread (client to server or server to client). So let me begin with what a dispatcher lock is:

1. What's a dispatcher lock

A dispatcher lock is a one-byte lock (disp_lock_t) which is acquired at high PIL (DISP_LEVEL), the interrupt level at which dispatcher operations must be performed. There are other symbolic interrupt levels, viz. CLOCK_LEVEL and LOCK_LEVEL, in machlock.h.

Following are the interfaces for dispatcher locks, which are implemented in disp_lock.c:

disp_lock_init() initializes dispatcher lock.
disp_lock_destroy() destroys dispatcher lock.
disp_lock_enter() acquires dispatcher lock.
disp_lock_exit() releases dispatcher lock and checks for kernel preemption.
disp_lock_exit_nopreempt() releases dispatcher lock without checking for kernel preemption.
disp_lock_enter_high() acquires another dispatcher lock when the thread is already holding a dispatcher lock.
disp_lock_exit_high() releases the top level dispatcher lock.

Here are the facts about dispatcher locks :-

(a) Being spin locks acquired at high PIL, dispatcher locks should be held only for a short duration, and holders shouldn't make blocking calls.
(b) While releasing a dispatcher lock, you can be preempted if cpu_kprunrun (kernel preemption) is set. Use disp_lock_exit_nopreempt() if you don't want to be preempted.
(c) While holding a dispatcher lock, you are not preemptible.
(d) Since acquiring a dispatcher lock raises the PIL to DISP_LEVEL, the old PIL is saved in t_oldspl of the thread structure (kthread_t).
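To make the mechanics concrete, here is a minimal userland sketch of a one-byte spin lock in the style of disp_lock_t. This is an illustration, not the Solaris implementation: the splr_disp()/splx() stand-ins only mark where the kernel would raise the PIL to DISP_LEVEL and later restore it from t_oldspl.

```c
#include <stdatomic.h>

/* One-byte lock, as disp_lock_t is in the kernel. */
typedef _Atomic unsigned char my_disp_lock_t;

/* Placeholders for PIL manipulation; the real kernel raises the
 * interrupt priority level to DISP_LEVEL here. */
static int  splr_disp(void) { return 0; }
static void splx(int s)     { (void)s; }

/* Acquire: raise "PIL", then spin until the byte goes 0 -> 0xff. */
static int my_disp_lock_enter(my_disp_lock_t *lp)
{
    int oldspl = splr_disp();
    while (atomic_exchange(lp, 0xff) != 0)
        ;                   /* spin: lock byte already set */
    return oldspl;          /* the kernel stashes this in t_oldspl */
}

/* Release: clear the byte, then restore the old "PIL".  The real
 * disp_lock_exit() additionally checks cpu_kprunrun here, which is
 * why fact (b) above matters. */
static void my_disp_lock_exit(my_disp_lock_t *lp, int oldspl)
{
    atomic_store(lp, 0);
    splx(oldspl);
}
```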

2. What's a thread lock


A thread lock is a per-thread entity which protects t_state and state-related flags of a kernel thread. The thread lock hangs off kthread_t as t_lockp, a pointer to the thread's current dispatcher lock; the pointer is changed whenever the state of the kernel thread changes. One acquires the thread lock using the thread_lock() routine, passing the kernel thread pointer; thread_lock() is responsible for finding the correct dispatcher lock for the thread. The dance done by thread_lock() is interesting because t_lockp is a pointer and can change while we are spinning for the dispatcher lock. Hence thread_lock() saves the t_lockp pointer and, after acquiring the lock, re-checks it to ensure we acquired the right thread lock.
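A sketch of that dance (assumed shape with simplified types, not the actual Solaris source): snapshot the pointer, spin on the lock you saw, and retry if the pointer moved underneath you.

```c
#include <stdatomic.h>

/* Minimal stand-ins; names follow the text but are illustrative. */
typedef _Atomic unsigned char disp_lock_t;
typedef struct { disp_lock_t *t_lockp; } kthread_t;

static void lock_set(disp_lock_t *lp)
{
    while (atomic_exchange(lp, 0xff) != 0)
        ;                           /* spin until we own the byte */
}

static void lock_clear(disp_lock_t *lp)
{
    atomic_store(lp, 0);
}

/* The thread_lock() dance: t_lockp may change while we spin, so after
 * acquiring the lock we snapshotted, re-check that it is still the
 * thread's lock; otherwise drop it and retry. */
static disp_lock_t *my_thread_lock(kthread_t *t)
{
    for (;;) {
        disp_lock_t *lp = t->t_lockp;   /* snapshot the pointer */
        lock_set(lp);                   /* spin on that lock */
        if (lp == t->t_lockp)
            return lp;                  /* still current: we hold it */
        lock_clear(lp);                 /* state changed: retry */
    }
}
```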

Now let's take a look at the interfaces, which are implemented in disp_lock.c and thread.h:

thread_lock() is called to acquire the thread lock.
thread_unlock() is called to release thread lock and it checks for kernel preemption.
thread_lock_high() is called to acquire another thread lock while holding one.
thread_unlock_high() is called to release thread lock while holding one.
thread_unlock_nopreempt() is called to release thread lock without checking for kernel preemption.

3. Various types of thread locks in Solaris Kernel


Now that I've described the thread lock, it's important to understand which dispatcher locks are acquired depending upon the state of the thread. To find this out, you first need to understand the one-to-one mapping between the state of a thread and its corresponding dispatcher lock:

TS_RUN (runnable) ---> disp_lock of the dispatch queue in a CPU (cpu_t) or global preemption queue of a CPU partition
TS_ONPROC (running ) ---> cpu_thread_lock in a CPU (cpu_t)
TS_SLEEP (sleep) ---> sleepq bucket lock or turnstile chain lock
TS_STOPPED (stopped) ---> stop_lock (a global dispatcher lock) for stopped threads.

There are two global dispatcher locks in the Solaris kernel: shuttle_lock and transition_lock. When a thread's thread lock points to shuttle_lock, the thread is sleeping on a door; when it points to transition_lock, the thread is in transition to another state (for instance, when the state of a thread sleeping on a semaphore is changed from TS_SLEEP to TS_RUN, or during yield()). transition_lock is always held and is never released.
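The mapping plus the always-held transition lock can be summarized in a small sketch. The names mirror the text, but the definitions are illustrative; in particular, transition_lock is modeled as a byte that is permanently set (held), which is why repointing t_lockp at it never requires a later release.

```c
/* Illustrative one-byte locks, one per thread state from the table. */
typedef unsigned char disp_lock_t;

static disp_lock_t disp_queue_lock;        /* TS_RUN: dispatch queue */
static disp_lock_t cpu_thread_lock;        /* TS_ONPROC: per-CPU lock */
static disp_lock_t sleepq_bucket_lock;     /* TS_SLEEP: sleepq/turnstile */
static disp_lock_t stop_lock;              /* TS_STOPPED: global lock */
static disp_lock_t transition_lock = 0xff; /* global, permanently held */

enum tstate { TS_RUN, TS_ONPROC, TS_SLEEP, TS_STOPPED };

/* Which dispatcher lock serves as the thread lock for each state. */
static disp_lock_t *lock_for_state(enum tstate s)
{
    switch (s) {
    case TS_RUN:     return &disp_queue_lock;
    case TS_ONPROC:  return &cpu_thread_lock;
    case TS_SLEEP:   return &sleepq_bucket_lock;
    case TS_STOPPED: return &stop_lock;
    }
    return &transition_lock;   /* in-between states point here */
}
```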

4. Examples of thread lock

Now let's understand which thread locks are involved from wakeup (or unsleep) to onproc (running). Let's assume T1 (thread 1) is blocked on a condition variable CV1 and T2 (thread 2) signals T1 as part of a wakeup. First, cv_signal() grabs the sleepq bucket lock and decrements the waiter count on CV1. It then calls sleepq_wakeone_chan() to wake up T1. sleepq_wakeone_chan()'s responsibility is to unlink T1 from the sleepq list (using t_link of kthread_t) and call CL_WAKEUP (the scheduling-class-specific wakeup routine). Assuming T1 is in the time-sharing (TS) class, ts_wakeup() gets called. ts_wakeup() in turn calls a dispatcher enqueue routine (setfrontdq() or setbackdq()), which changes the state of T1 to TS_RUN and changes t_lockp to point to the disp_lock of the chosen CPU. sleepq_wakeone_chan() then drops the disp_lock of the dispatch queue, and finally the sleepq dispatcher lock is released in cv_signal(). Once T1 is chosen to run, disp() removes T1 from the dispatch queue of the CPU, changes its state to TS_ONPROC, and points t_lockp to the cpu_thread_lock of the CPU.

void
cv_signal(kcondvar_t *cvp)
{
    condvar_impl_t *cp = (condvar_impl_t *)cvp;

    /* make sure the cv_waiters field looks sane */
    ASSERT(cp->cv_waiters <= CV_MAX_WAITERS);
    if (cp->cv_waiters > 0) {
        sleepq_head_t *sqh = SQHASH(cp);
        disp_lock_enter(&sqh->sq_lock);
        ASSERT(CPU_ON_INTR(CPU) == 0);
        if (cp->cv_waiters & CV_WAITERS_MASK) {
            kthread_t *t;
            cp->cv_waiters--;
            t = sleepq_wakeone_chan(&sqh->sq_queue, cp);
            /*
             * If cv_waiters is non-zero (and less than
             * CV_MAX_WAITERS) there should be a thread
             * in the queue.
             */
            ASSERT(t != NULL);
        } else if (sleepq_wakeone_chan(&sqh->sq_queue, cp) == NULL) {
            cp->cv_waiters = 0;
        }
        disp_lock_exit(&sqh->sq_lock);
    }
}


The second example is from the preemption path. There are two types of preemption in the Solaris kernel, viz. user preemption (cpu_runrun) and kernel preemption (cpu_kprunrun). Assume T1 is being preempted in favour of a higher-priority thread. T1 will call preempt() once it realizes that it has to give up the CPU (there are hooks in the Solaris kernel to determine this). preempt() first grabs the thread lock (effectively cpu_thread_lock) on itself and calls THREAD_TRANSITION() to change t_lockp to transition_lock. Note that the state of T1 is still TS_ONPROC while t_lockp points to transition_lock, because T1 is in a transition phase (TS_ONPROC -> TS_RUN). THREAD_TRANSITION() also releases the previous dispatcher lock, because transition_lock is always held. preempt() then calls CL_PREEMPT(), the scheduling-class-specific preemption routine, to enqueue T1 on a particular CPU. From here on it's the same as in the first example.

void
preempt()
{
    kthread_t *t = curthread;
    klwp_t *lwp = ttolwp(curthread);

    if (panicstr)
        return;

    TRACE_0(TR_FAC_DISP, TR_PREEMPT_START, "preempt_start");

    thread_lock(t);

    if (t->t_state != TS_ONPROC || t->t_disp_queue != CPU->cpu_disp) {
        /*
         * this thread has already been chosen to be run on
         * another CPU. Clear kprunrun on this CPU since we're
         * already headed for swtch().
         */
        CPU->cpu_kprunrun = 0;
        thread_unlock_nopreempt(t);
        TRACE_0(TR_FAC_DISP, TR_PREEMPT_END, "preempt_end");
    } else {
        if (lwp != NULL)
            lwp->lwp_ru.nivcsw++;
        CPU_STATS_ADDQ(CPU, sys, inv_swtch, 1);
        THREAD_TRANSITION(t);
        CL_PREEMPT(t);
        DTRACE_SCHED(preempt);
        thread_unlock_nopreempt(t);

        TRACE_0(TR_FAC_DISP, TR_PREEMPT_END, "preempt_end");

        swtch();    /* clears CPU->cpu_runrun via disp() */
    }
}

5. An example of a dispatcher lock and Bug 5017148


Apart from illustrating dispatcher locks, I'll also describe a problem which I found a while back. It involves the kernel door implementation too.

I usually begin with looking at what CPUs are doing whenever I take a look at a crash dump from a system hang:

> ::cpuinfo
ID ADDR FLG NRUN BSPL PRI RNRN KRNRN SWITCH THREAD PROC
0 0001041d2b0 1b 1 0 60 no no t-0 3001ba04900 cluster
1 30019fe4030 1d 2 0 101 no no t-0 3003d873a40 rgmd
2 3001a38aab8 1d 1 0 165 yes yes t-0 2a1003ebd20 sched
3 0001041b778 1d 2 0 60 yes yes t-0 3004fac3c80 cluster


CPU 0 is spinning for mutex 0x30001d7cae0, which is held by thread 0x3004fac3c80 running on CPU 3. Please note that a thread will spin for a mutex only when the owner is running, and in this case the owner of the mutex happens to be onproc on CPU 3.

> 0x30001d7cae0$
0x30001d7cae0:  owner/waiters
                3004fac3c80
>

CPU 3 is our clock interrupt CPU (run ::cycinfo -v and figure out where the clock handler is registered), and thread 0x3004fac3c80 on CPU 3 is spinning in cv_block() for a sleepq bucket lock (sleepq_head[]). To find out which sleepq bucket this thread wants, we can look at the wait channel t_wchan (t_lwpchan.lc_wchan) and, using the hash function SQHASH(), find the right bucket. Since we're already holding the thread lock (effectively cpu_thread_lock of CPU 3) while spinning for the sleepq bucket lock, clock interrupts are blocked too. This can be verified from the pending clock interrupts in ::cycinfo -v.

Let's disassemble cv_block(), where thread 3004fac3c80 is stuck:

cv_block+0x9c: add %i2, 8, %i0
cv_block+0xa0: call -0x460e0
cv_block+0xa4: mov %i0, %o0

> 0x3004fac3c80::print kthread_t t_lockp
t_lockp = cpu0+0xb8
> cpu0=J
1041b778 // CPU 3
> 0x3004fac3c80::print kthread_t ! grep wchan
lc_wchan = 0x3006fc52d20

And the sleepq bucket happens to be :-


> 0x10471d88::print sleepq_head_t
{
sq_queue = {
sq_first = 0x3001b476ee0
}
sq_lock = 0xff <----- dispatcher lock is held
}

Thread 3003d873a40 running on CPU 1 is spinning in thread_lock_high().

> 3003d873a40::findstack
stack pointer for thread 3003d873a40: 2a1025964a1
[ 000002a1025964a1 panic_idle+0x1c() ]
000002a102596551 prom_rtt()
000002a1025966a1 thread_lock_high+0xc()
000002a102596751 sema_p+0x60()
000002a102596801 kobj_open+0x84()
000002a1025968d1 kobj_open_file+0x44()
[.]
000002a102597011 xdoor_proxy+0x20c()
000002a1025971f1 door_call+0x204()
000002a1025972f1 syscall_trap32+0xa8()
>

Now this is an interesting stack. Looking at the sema_p() code, we see that it first grabs the sleepq bucket lock and then tries to grab the thread lock.

Since the hashing function SQHASH() returns the same index for 0x3006fc52d20 and 0x300819f3118, sema_p() gets stuck on the thread lock held by the thread running on CPU 3, while the thread running on CPU 3 is stuck because the sleepq bucket lock is held by the thread running on CPU 1: a classic deadlock.
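For illustration, here is a hypothetical SQHASH()-style hash (the real one lives in sleepq.h and may differ in its constants). The point is only that the hash is many-to-one, so unrelated wait channels such as the two above can land on the same sleepq bucket and hence contend on the same dispatcher lock.

```c
#include <stdint.h>

#define MY_NSLEEPQ 2048   /* assumed bucket count, for illustration */

/* Hypothetical sleepq hash: fold two shifted copies of the wait-channel
 * address and mask to the bucket range.  Many addresses map to one
 * bucket; under this particular hash, any two channels exactly 1MB
 * (1 << 20 bytes) apart collide, since both shifted terms change by a
 * multiple of the bucket count. */
static unsigned my_sqhash(uint64_t chan)
{
    return (unsigned)(((chan >> 2) + (chan >> 9)) & (MY_NSLEEPQ - 1));
}
```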


> 0x3003d873a40::print kthread_t t_lockp
t_lockp = cpu0+0xb8
> cpu0+0xb8/x
cpu0+0xb8: ff00

Now let's find the real cause of this deadlock. Looking at t_cpu of thread 0x3003d873a40, we see that the thread is running on CPU 1, yet its t_lockp points to CPU 3's cpu_thread_lock. This is really nasty, as we would expect it to point to CPU 1's cpu_thread_lock.

> 0x3003d873a40::print kthread_t ! grep cpu
t_bound_cpu = 0
t_cpu = 0x30019fe4030
t_lockp = cpu0+0xb8 // CPU 3's cpu_thread_lock
t_disp_queue = cpu0+0x78

The cause of this problem is that door_get_server(), while doing the handoff to the server thread, is getting preempted, because disp_lock_exit() checks for kernel preemption.

static kthread_t *
door_get_server(door_node_t *dp)
{
    [.]
    /*
     * Mark the thread as ONPROC and take it off the list
     * of available server threads. We are committed to
     * resuming this thread now.
     */
    disp_lock_t *tlp = server_t->t_lockp;
    cpu_t *cp = CPU;

    pool->dp_threads = server_t->t_door->d_servers;
    server_t->t_door->d_servers = NULL;
    /*
     * Setting t_disp_queue prevents erroneous preemptions
     * if this thread is still in execution on another processor
     */
    server_t->t_disp_queue = cp->cpu_disp;
    CL_ACTIVE(server_t);
    /*
     * We are calling thread_onproc() instead of
     * THREAD_ONPROC() because compiler can reorder
     * the two stores of t_state and t_lockp in
     * THREAD_ONPROC().
     */
    thread_onproc(server_t, cp);
    disp_lock_exit(tlp);
    return (server_t);
    [.]


As a result, the server thread's t_lockp points to the wrong cpu_thread_lock, because the client thread started running on a different CPU by the time it did shuttle_resume() to the server thread. Note that door_return() (which returns the results to the caller) releases its dispatcher lock without getting preempted, which is why we never noticed this problem in door_return().

On the move to crack another problem now... In fact, we can't sleep unless we take a look at a crash dump :-)



Compiler reordering problem

I'm going to write about a compiler reordering problem in the door_return() function which was observed in July 2002. The customer was able to reproduce the problem for us, and it took me a while to figure out that it was a compiler reordering problem. I must thank our customers for being so co-operative when we get such issues. I must have given out instrumented kernels at least five times before I found the problem. It's bug 4699850.

The symptom was very clear. The system used to panic in Solaris kernel dispatcher routines; one of the symptoms was panicking in dispdeq() while removing a kernel thread from the dispatch queue of a CPU.

We know that the compiler can reorder C statements if they are independent. Consider this piece of C code:

#define THREAD_SET_STATE(tp, state, lp) \
((tp)->t_state = state, (tp)->t_lockp = lp)

t_lockp is a pointer to a dispatcher lock, and we don't know whether lp is held or not. When a thread is made TS_ONPROC, the t_lockp of the corresponding thread points to the cpu_thread_lock of a CPU (cpu_t). In the C code above, the two stores can be reordered by the compiler, so lp should be held while setting the thread's state.
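A sketch of the safe ordering using C11 atomics (the kernel's actual fix is the thread_onproc() helper with a memory barrier; the shapes below are illustrative):

```c
#include <stdatomic.h>

typedef unsigned char disp_lock_t;

#define TS_SLEEP  1
#define TS_ONPROC 2

/* Illustrative kthread fields from the text. */
typedef struct {
    _Atomic int            t_state;
    _Atomic(disp_lock_t *) t_lockp;
} kthread_t;

/* Safe ordering: t_state must be globally visible before t_lockp
 * publishes the new lock.  The release store forbids the compiler
 * (and CPU) from reordering the two stores, which is exactly what
 * the plain comma-expression macro could not guarantee. */
static void my_thread_onproc(kthread_t *t, disp_lock_t *cpu_thread_lock)
{
    atomic_store_explicit(&t->t_state, TS_ONPROC, memory_order_relaxed);
    atomic_store_explicit(&t->t_lockp, cpu_thread_lock,
        memory_order_release);
}
```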

In door_return(), when the server thread is about to hand off to the client thread to return the results, it makes the client thread TS_ONPROC and calls shuttle_resume() on it. The responsibility of shuttle_resume() is to put the target thread onproc while the caller goes to sleep on the shuttle_lock sync object.

While putting a thread onproc, dispatcher routines need not hold cpu_thread_lock; hence if door_return() calls THREAD_ONPROC(), we can effectively lose the thread lock on the client thread.

Now let's look at the two stores again. If t_lockp reaches global visibility before t_state, we effectively lose the thread lock on the thread. Assume another thread on a different CPU is sending a signal to the client door thread. Once the thread lock is lost, the signalling thread can see the old state of the client thread (in this case TS_SLEEP). Since the state is TS_SLEEP, eat_signal() will do setrun() on the client thread, which enqueues it on the dispatch queue of a CPU. As a result, we can see some very strange things happening, including the dispdeq() panic.

The following code in door_return() was faulty: the original version called THREAD_ONPROC() directly, where the fixed version shown here calls thread_onproc():

int
door_return(caddr_t data_ptr, size_t data_size,
    door_desc_t *desc_ptr, uint_t desc_num, caddr_t sp)
{
    [.]
    tlp = caller->t_lockp;
    /*
     * Setting t_disp_queue prevents erroneous preemptions
     * if this thread is still in execution on another
     * processor
     */
    caller->t_disp_queue = cp->cpu_disp;
    CL_ACTIVE(caller);
    /*
     * We are calling thread_onproc() instead of
     * THREAD_ONPROC() because compiler can reorder
     * the two stores of t_state and t_lockp in
     * THREAD_ONPROC().
     */
    thread_onproc(caller, cp);
    disp_lock_exit_high(tlp);
    shuttle_resume(caller, &door_knob);
    [.]
}


I had used TNF (Trace Normal Form) to find this problem. But now we have a powerful tool to trace from userland into the kernel, and of course it's DTrace.



Thursday, July 21, 2005

An interesting signal delivery related problem

Recently, we found an interesting performance problem using DTrace. The problem was seen when using a virtual timer created with setitimer(2). The interval passed was 10ms (one clock tick), but the SIGVTALRM signal used to arrive late, sometimes by 6 ticks or more. Now how would you DTrace the code, and where would you start tracing? I start tracing from signal generation to delivery. In the Solaris kernel, sigtoproc() is used to post a signal, and eat_signal() is called on the target thread to get it running, with the action depending upon its state (TS_RUN, TS_SLEEP, TS_STOPPED, TS_ONPROC). psig() is called when the kernel finds a pending signal (for instance, when returning from a trap).

The program spins in userland after setting up the timer. Since the state of the thread would be TS_ONPROC, the kernel needs to poke the target CPU if the thread happens to be running on a different CPU. So I started tracing the following functions: sigtoproc(), eat_signal(), poke_cpu() and psig(). Now let's take a look at the DTrace probe output:


CPU  ID     FUNCTION              TIMESTAMP         DATA
8 11263 eat_signal:entry 1027637980027920 sig : 28
8 2981 poke_cpu:entry 1027637980030560 cpu : 9
8 11263 eat_signal:entry 1027637990025440 sig : 28
8 2981 poke_cpu:entry 1027637990032160 cpu : 9
8 11263 eat_signal:entry 1027638000036320 sig : 28
8 2981 poke_cpu:entry 1027638000043600 cpu : 9
8 11263 eat_signal:entry 1027638010025520 sig : 28
8 2981 poke_cpu:entry 1027638010032240 cpu : 9
8 11263 eat_signal:entry 1027638020023840 sig : 28
8 2981 poke_cpu:entry 1027638020031280 cpu : 9
8 11263 eat_signal:entry 1027638030028720 sig : 28
8 2981 poke_cpu:entry 1027638030035920 cpu : 9
8 11263 eat_signal:entry 1027638040024480 sig : 28
[.]
9 8317 psig:entry 1027638170086480 sig : 28

If you calculate the difference (ie timestamp) between psig() and the first eat_signal(), you will notice that the difference is huge.

1027638170086480 - 1027637980027920 = 190058560 ns = 19 ticks (190 ms)


We also noticed that CPU 8 (where sigtoproc() is being called from clock_tick()) is poking CPU 9; however, CPU 9 is not preempting the currently running thread (the program which is spinning). So why and how does this happen? To understand it, I'll first describe a bit of how preemption works in Solaris. To preempt a running thread, the kernel sets t_astflag (using the aston() macro) and also sets the appropriate CPU preemption flag. There are two CPU preemption flags, viz. cpu_runrun for user-level preemptions and cpu_kprunrun for kernel-level preemptions. RT threads can preempt TS, SYS or IA class threads, since kernel-level preemption typically kicks in when the current running thread's priority is <= 100 (KPQPRI). For a signal we don't set the CPU-level preemption flags; we just need to set t_sig_check and t_astflag, followed by a poke call.

Since we are interested in user-level preemption, we should know what happens when CPU 8 pokes CPU 9 (using a cross call). If the currently running thread on CPU 9 is in userland, then user_rtt() calls trap() if the check for t_astflag succeeds. So let's check whether t_astflag is set when we call eat_signal(). And that's where the problem was: if the target thread in eat_signal() is TS_ONPROC, we should set t_astflag and then poke the CPU. It is clear from the following probes that the running thread on CPU 9 was eventually preempted only because its time quantum finished and clock set t_astflag in cpu_surrender().

9 15055 post_syscall:entry 1027637970269440
8 11263 eat_signal:entry 1027637980027920 sig : 28
8 2981 poke_cpu:entry 1027637980030560 cpu : 9
[.]
8 11263 eat_signal:entry 1027638040024480 sig : 28
8 2981 poke_cpu:entry 1027638040026800 cpu : 9
[.]
8 11263 eat_signal:entry 1027638110024160 sig : 28
8 2981 poke_cpu:entry 1027638110026560 cpu : 9
8 2435 cpu_surrender:entry 1027638170024720 t:3001b7af3e0
8 2981 poke_cpu:entry 1027638170027280 cpu : 9
8 11263 eat_signal:entry 1027638170032720 sig : 28
9 2919 poke_cpu_intr:entry 1027638170033760
8 2981 poke_cpu:entry 1027638170034400 cpu : 9
9 3390 trap:entry 1027638170037840 type :512, pc: 10984, ast:1
8 2981 poke_cpu:entry 1027638170038640 cpu : 9
9 2919 poke_cpu_intr:entry 1027638170045680
9 1497 trap_cleanup:entry 1027638170054880 0
9 8317 psig:entry 1027638170086480 sig : 28
9 2278 trap_rtt:entry 1027638170117440
9 15055 post_syscall:entry 1027638170143360
9 8317 psig:entry 1027638170150880 sig : 2
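The fix implied by the analysis, in sketch form (illustrative shapes, not the actual eat_signal() source): for a TS_ONPROC target on another CPU, set t_sig_check and t_astflag before poking, so the poked CPU's pass through user_rtt()/trap() sees the AST and delivers the signal immediately instead of waiting for cpu_surrender().

```c
#define TS_ONPROC 2

/* Illustrative kthread fields from the text. */
typedef struct {
    int t_state;
    int t_sig_check;    /* "look for signals on trap return" */
    int t_astflag;      /* "take an AST on trap return" */
    int t_cpu;          /* CPU the thread is running on */
} kthread_t;

static int last_poked = -1;
static void poke_cpu(int cpu) { last_poked = cpu; }  /* cross-call stand-in */

static void my_eat_signal(kthread_t *t, int my_cpu)
{
    t->t_sig_check = 1;             /* make trap return look for signals */
    if (t->t_state == TS_ONPROC && t->t_cpu != my_cpu) {
        t->t_astflag = 1;           /* the missing step: set the AST... */
        poke_cpu(t->t_cpu);         /* ...before the cross call */
    }
}
```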

So DTrace did help us find where the problem was. This is just one example. Happy DTracing...

Tuesday, July 19, 2005

Dtrace rocks...

Some time back I had a problem with my desktop: it started crawling whenever a Java ticker kicked in. I think I must share this with the rest of the world. I'd also like to share a kernel performance problem that we cracked. DTrace has helped in solving many problems so far.

My desktop running Solaris 10 started crawling, and I noticed that Xsun was eating up 68% of a CPU. From prstat(1M):

# prstat
PID USERNAME SIZE RSS STATE PRI NICE TIME CPU PROCESS/NLWP
594 ****** 85M 78M run 30 0 14:03:19 68% Xsun/1
796 root 16M 13M sleep 59 0 1:06:25 5.8% stfontserverd/18
[.]

I then started DTrac'ing Xsun and noticed that the lwp_sigmask() syscall was being made too frequently by Xsun. Here is the data :-

# ./syscall.d
^C
Ran for 26 seconds


writev 2832
pollsys 3261
read 5910
doorfs 27199
lwp_sigmask 217592

LWP ID COUNT
1 217592

libc.so.1`__systemcall6+0x20
libc.so.1`pthread_sigmask+0x1b4
libc.so.1`sigprocmask+0x20
libc.so.1`sighold+0x54
libST.so.1`fsexchange+0x78
libST.so.1`FSSessionDisposeFontInstance+0x8c
9063

libc.so.1`__systemcall6+0x20
libc.so.1`pthread_sigmask+0x1b4
libc.so.1`sigprocmask+0x20
libc.so.1`sigrelse+0x54
libST.so.1`fsexchange+0xc0
libST.so.1`FSSessionGetFontRenderingParams+0x8c

...and many more such stack traces from libST.so.1`fsexchange().

In fact the full stack is like this :-

libc.so.1`__systemcall6+0x20
libc.so.1`pthread_sigmask+0x1b4
libc.so.1`sigprocmask+0x20
libc.so.1`sighold+0x54
libST.so.1`fsexchange+0x90
libST.so.1`FSSessionGetFontRenderingParams+0x8c
libST.so.1`GetRenderProps+0x344
libST.so.1`GlyphVectorRepQuery+0xf4
libST.so.1`STGlyphVectorQuery+0xd0
SUNWXst.so.1`_XSTUseCache+0x68

Notice that in these stack traces we are calling sighold() and sigrelse() too frequently, so this process is disabling and enabling signals for some reason. It looks like we are rendering characters, but why do we need to block and unblock signals in this path? Here is the DTrace script which was used :-

#!/usr/sbin/dtrace -s

#pragma D option quiet

BEGIN
{
        start = timestamp;
}

syscall:::entry
/execname == "Xsun"/
{
        @s[probefunc] = count();
}

syscall::lwp_sigmask:entry
/execname == "Xsun"/
{
        @c[curthread->t_tid] = count();
        @st[ustack(6)] = count();
}

END
{
        printf("Ran for %d seconds\n\n", (timestamp - start) / 1000000000);

        trunc(@s, 5);
        printa(@s);

        printf("\n%-10s %-10s\n", "LWP ID", "COUNT");
        printa("%-10d %@d\n", @c);

        printa(@st);
}

In fact DTrace can help us solve much more complex problems. Happy DTrac'ing...

Tuesday, May 24, 2005

::cpupart -v for mdb(1m)

Most of you would have used ::cpupart in mdb(1M) to determine the partitions on your system. For those who don't know, a partition is an object (or kernel entity) consisting of a set of CPUs and a global dispatch queue (or global preemption queue). In fact, processor sets (which are created from userland using psrset(1M)) are an abstraction of CPU partitions.

One of the things I'm currently working on is a new option to ::cpupart which will print all the runnable threads in the global dispatch queue of a CPU partition. It's very similar to what ::cpuinfo -v does. Here is sample output :-

On x86 :-
---------
> ::cpupart -v
ID ADDR NRUN #CPU CPUS
0 fec2a1f8 298 2 0-1
|
+--> PRI THREAD PROC
100 d19b1000 sema
100 d19aca00 sema
100 d19ab200 sema
100 d19a7000 sema
100 d0b90a00 sema
100 d19a5a00 sema
100 d19a2e00 sema
100 d19aea00 sema
100 d19b4200 sema
[.]
>

On SPARC :-
-----------
> ::cpupart -v
ID ADDR NRUN #CPU CPUS
0 18a8c50 25 8 4-11
|
+--> PRI THREAD PROC
100 3000a7b1660 sema
100 3000a7c55e0 sema
100 3000a7b0d00 sema
100 3000a7c4960 sema
100 3000a7b5c80 sema
100 3000a7b4380 sema
100 300084a2c80 sema-1
100 3000826c3a0 sema-1
[.]
>

Wednesday, February 16, 2005

Dtrace : calculating time spent in functions

The other day (10th Feb 2005), I was giving a demo of DTrace to BEA India folks in ITPL Bangalore and thought that a DTrace script demonstrating the following would be nice :-

(a) time spent in functions called
(b) shows both user and kernel level tracing capabilities in the same function call context.
(c) uses $target and -c to demonstrate tracing when a process is launched.

I hope people find it interesting. Here is the output from the script. Please note that the time spent is expressed in nanoseconds.

# which w
/usr/bin/w
# dtrace -c w -Fs ./userfunc.target.d main
[.]
5 -> malloc 0
5 -> assert_no_libc_locks_held 0
5 <- assert_no_libc_locks_held 6880
5 -> lmutex_lock 0
5 <- lmutex_lock 7840
5 -> _malloc_unlocked 0
5 -> cleanfree 0
5 <- cleanfree 7600
5 -> realfree 0
5 <- realfree 8000
5 <- _malloc_unlocked 35360
5 -> lmutex_unlock 0
5 <- lmutex_unlock 8240
5 <- malloc 89200
5 -> sysinfo 0 <-------- getting into kernel
5 -> pre_syscall 0
5 -> syscall_mstate 0
5 <- syscall_mstate 4560
5 <- pre_syscall 13120
5 -> systeminfo 0
5 <- systeminfo 5760
5 -> post_syscall 0
5 -> clear_stale_fd 0
5 <- clear_stale_fd 5200
5 -> syscall_mstate 0
5 <- syscall_mstate 4000
5 <- post_syscall 21680

5 <- sysinfo 60720
[.]

# cat userfunc.target.d
/*
* This script calculates time spent in all the functions (userland & kernel)
* called once a particular function is traced.
*
* Please see /usr/demo/dtrace/userfunc.d also.
*
* Usage:
* # dtrace -c <cmd> -Fs ./userfunc.target.d <function>
*/

BEGIN
{
        self->depth = 0;
}

pid$target::$1:entry
{
        self->trace = 1;
        self->depth = 0;
        self->timestamp[self->depth++] = timestamp;
}

pid$target::$1:return
/self->trace/
{
        self->trace = 0;
        trace(timestamp - self->timestamp[0]);
}

fbt:::entry,
pid$target:::entry
/self->trace && self->timestamp[self->depth - 1]/
{
        self->timestamp[self->depth++] = timestamp;
        trace(0);
}

fbt:::return,
pid$target:::return
/self->trace && self->timestamp[self->depth - 1]/
{
        self->depth--;
        trace(timestamp - self->timestamp[self->depth]);
        self->timestamp[self->depth] = 0;
}

Thursday, February 3, 2005

Small demo of Resource Management, Contracts and Service Management Framework

I thought I should share these small demos with you folks. I've started playing with Resource Management, Zones, CPU shares, the Service Management Framework and Contracts. I hope you will find them a bit interesting.

Projects


I first created two projects, and the following resource controls were added to them. projadd(1M) is used to add a project, and projects(1) lists the projects on the system.

# projadd -G other -c "Project 1" -K "project.cpu-shares=(privileged,1,none)" proj1
# projmod -a -K "task.max-lwps=(privileged,3,deny)" proj1
# projmod -a -K "rcap.max-rss=20971520" proj1
# projadd -G other -c "Project 2" -K "project.cpu-shares=(privileged,3,none)" -K "rcap.max-rss=10485760" proj2

Now let's assign proj1 to user1 and proj2 to user2, so that whenever user1 and/or user2 log in, the enforcement of these resource controls comes into effect.

# projmod -U user1 proj1
# projmod -U user2 proj2

We need to add these two lines into /etc/user_attr :-

# diff /etc/user_attr.org /etc/user_attr
12a13,14
> user1::::project=proj1
> user2::::project=proj2

So this is how our two projects look :-

# projects -l
[.]
proj1
projid : 100
comment: "Project 1"
users : user1
groups : other
attribs: project.cpu-shares=(privileged,1,none)
rcap.max-rss=20971520
task.max-lwps=(privileged,3,deny)
proj2
projid : 101
comment: "Project 2"
users : user2
groups : other
attribs: project.cpu-shares=(privileged,3,none)
rcap.max-rss=10485760
#

proj1 specifies three things:

(a) It assigns a CPU-shares value of 1
(b) The resident set size (RSS) can go up to 20MB
(c) The maximum number of lwps in this project is 3

proj2 specifies two things:

(a) It assigns a CPU-shares value of 3
(b) The resident set size can go up to 10MB

What's the intent of this CPU-shares demo? I'd like to show that 25% utilization for proj1 and 75% for proj2 will be enforced when there is competition from other projects. Since I use the same workload, which simply spins and hogs the CPU, it's a little easier to demonstrate this fact here.

# prstat -J
[.]
PROJID NPROC SIZE RSS MEMORY TIME CPU PROJECT
101 2 2408K 1816K 0.0% 0:01:33 74% proj2
100 2 2408K 1816K 0.0% 0:00:56 25% proj1
1 2 7504K 6352K 0.0% 0:00:00 0.2% user.root
0 67 336M 183M 0.3% 0:00:27 0.1% system


Total: 73 processes, 247 lwps, load averages: 1.75, 0.77, 0.32
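The percentages prstat reports follow directly from the share arithmetic: a busy project's entitlement under FSS is its shares divided by the total shares of all busy projects. A trivial sketch:

```c
/* CPU entitlement (in percent) for a project competing for CPU:
 * shares / (sum of shares of all busy projects).  Integer division
 * truncates, matching the rounded figures prstat shows. */
static int entitlement_pct(int shares, int total_busy_shares)
{
    return 100 * shares / total_busy_shares;
}
```

With 1 and 3 shares this gives the 25%/75% split above; changing proj2 to 2 shares gives roughly a 33%/66% split.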


Now, let's change the CPU-shares of proj2 to 2 and see what happens :-

# prctl -n project.cpu-shares -r -v 2 -i project proj2
# prctl -n project.cpu-shares -i project proj2
project: 101: proj2
NAME PRIVILEGE VALUE FLAG ACTION RECIPIENT
project.cpu-shares
privileged 2 - none -
system 65.5K max none -
# prstat -J
[.]
PROJID NPROC SIZE RSS MEMORY TIME CPU PROJECT
101 2 2408K 1816K 0.0% 0:03:24 65% proj2
100 2 2408K 1816K 0.0% 0:01:44 33% proj1
1 2 7504K 6344K 0.0% 0:00:00 0.2% user.root
0 67 336M 183M 0.3% 0:00:27 0.1% system


Resource Management in Zones

Here I show how to create a simple zone (without a logical interface attached to it); subsequently we will assign shares to the newly created zones.

# zonecfg -z myzone1
myzone1: No such zone configured
Use 'create' to begin configuring a new zone.
zonecfg:myzone1> create
zonecfg:myzone1> add fs
zonecfg:myzone1:fs> set dir=/mnt/local
zonecfg:myzone1:fs> set special=/opt/sfw
zonecfg:myzone1:fs> set type=lofs
zonecfg:myzone1:fs> set options=[ro,nodevices]
zonecfg:myzone1:fs> end
zonecfg:myzone1> add rctl
zonecfg:myzone1:rctl> set name=zone.cpu-shares
zonecfg:myzone1:rctl> add value (priv=privileged,limit=1,action=none)
zonecfg:myzone1:rctl> end
zonecfg:myzone1> add attr
zonecfg:myzone1:attr> set name=comment
zonecfg:myzone1:attr> set type=string
zonecfg:myzone1:attr> set value="first zone"
zonecfg:myzone1:attr> end
zonecfg:myzone1> set autoboot=true
zonecfg:myzone1> set zonepath=/export/home/zone/myzone1
zonecfg:myzone1> verify
zonecfg:myzone1> info
zonepath: /export/home/zone/myzone1
autoboot: true
pool:
inherit-pkg-dir:
dir: /lib
inherit-pkg-dir:
dir: /platform
inherit-pkg-dir:
dir: /sbin
inherit-pkg-dir:
dir: /usr
fs:
dir: /mnt/local
special: /opt/sfw
raw not specified
type: lofs
options: [ro,nodevices]
rctl:
name: zone.cpu-shares
value: (priv=privileged,limit=1,action=none)
attr:
name: comment
type: string
value: "first zone"
zonecfg:myzone1>

Now let's install and boot the zone. Remember you can boot, reboot and halt zones independently.

# zoneadm -z myzone1 install
# zoneadm -z myzone1 boot

Repeat these steps to create myzone2, but assign 3 shares to myzone2.

myzone2 will look like this :-

# zonecfg -z myzone2
zonecfg:myzone2> info
zonepath: /export/home/zone/myzone2
autoboot: true
pool:
inherit-pkg-dir:
dir: /lib
inherit-pkg-dir:
dir: /platform
inherit-pkg-dir:
dir: /sbin
inherit-pkg-dir:
dir: /usr
fs:
dir: /mnt/local
special: /opt/sfw
raw not specified
type: lofs
options: [ro,nodevices]
rctl:
name: zone.cpu-shares
value: (priv=privileged,limit=3,action=none)
attr:
name: comment
type: string
value: "Second Zone"
zonecfg:myzone2>

Now I run spin program from /mnt/local to put load on each zone.

# prstat -Z
ZONEID NPROC SIZE RSS MEMORY TIME CPU ZONE
3 9 27M 22M 0.1% 0:11:57 74% myzone2
2 9 27M 21M 0.1% 0:04:17 25% myzone1
0 50 188M 123M 0.3% 0:01:17 0.3% global
1 6 24M 19M 0.1% 0:00:10 0.0% myzone3

Total: 74 processes, 256 lwps, load averages: 2.02, 2.00, 1.73

# prctl -n zone.cpu-shares -i zone myzone1
zone: 2: myzone1
NAME PRIVILEGE VALUE FLAG ACTION RECIPIENT
zone.cpu-shares
privileged 1 - none -
system 65.5K max none -
# prctl -n zone.cpu-shares -i zone myzone2
zone: 3: myzone2
NAME PRIVILEGE VALUE FLAG ACTION RECIPIENT
zone.cpu-shares
privileged 3 - none -
system 65.5K max none -
#

You can use the following command to see the status of all zones:

# zoneadm list -cv


Contracts


ctrun(1) can be used to restart a command in the event of a signal, core dump, or hardware error. You don't need ps(1M) and other monitoring scripts to see what's happening to your command. See contract(4) for more details. Here is a small demo of ctrun(1) :-

# ctrun -f signal -r 0 /spin
# ps -eaf | grep spin
root 1175 1083 0 13:27:52 pts/1 0:00 ctrun -f signal -r 0 /spin
root 1177 1 20 13:27:52 pts/1 0:05 /spin
# kill -9 1177
# ps -eaf | grep spin
root 1175 1083 0 13:27:52 pts/1 0:00 ctrun -f signal -r 0 /spin
root 1181 1 15 13:28:04 pts/1 0:04 /spin
#

In the case of SIGKILL, ctrun takes care of restarting the command automatically. The -r option specifies the retry count; zero means retry indefinitely.

You can also use the -i option to see what events are taking place. For instance, if a process joins or leaves the contract, the newly created contract posts those events.

Service Management Framework


This is my first service created using the Service Management Framework. It demonstrates how a simple service can be created and what happens in the event of the service dying.

# svccfg
svc:> import /sample.xml
svc:> quit
# svcs -p spin
STATE STIME FMRI
online 13:41:45 svc:/system/spin:default
13:41:45 1257 spin
# svcs -l spin
fmri svc:/system/spin:default
name First SMF service
enabled true
state online
next_state none
state_time Tue Feb 01 13:41:45 2005
logfile /var/svc/log/system-spin:default.log
restarter svc:/system/svc/restarter:default
contract_id 181
dependency require_all/none svc:/system/filesystem/local (online)
# ps -eaf | grep spin
root 1257 1 54 13:41:46 ? 0:17 /spin
# kill -9 1257
# svcs -p spin
STATE STIME FMRI
online 13:42:08 svc:/system/spin:default
13:42:08 1263 spin
# ps -eaf | grep spin
root 1263 1 45 13:42:08 ? 0:13 /spin
# svccfg
svc:> select spin
svc:/system/spin> listprop
general framework
general/entity_stability astring Unstable
general/single_instance boolean true
fs dependency
fs/entities fmri svc:/system/filesystem/local
fs/grouping astring require_all
fs/restart_on astring none
fs/type astring service
start method
start/exec astring /spin
start/project astring :default
start/resource_pool astring :default
start/timeout_seconds count 60
start/type astring method
start/working_directory astring :default
application framework
application/auto_enable boolean true
application/stability astring Evolving
tm_common_name template
tm_common_name/C ustring "First SMF service"
svc:/system/spin>

Isn't that great? No need for fancy scripts to monitor the daemons. Moreover, you can specify dependencies between services. For example, if you want to make sure that the network service is up and running, you would specify that in the dependency property. You can also add methods like stop (run whenever the service is stopped, e.g. via svcadm disable) or refresh (run whenever the service configuration is read again).

These two programs were used for the demo.

spin.c
------
int main(void)
{
	for (;;)
		;
}

spin-service.c
--------------
#include <unistd.h>

int main(void)
{
	if (fork() == 0) {
		/*
		 * Child process: let it spin.
		 */
		for (;;)
			;
	}
	return (0);
}

Here are a few good links that will come in handy:
(a) Resource Management Configuration Example
(b) CPU-shares
(c) Resource Controls
(d) Overview on Projects and Tasks
(e) Resource Capping
(f) Resource Management in Solaris Zones

Thursday, January 6, 2005

inter-process mutex

Recently, Surya Prakki (a colleague of mine in the Kernel Sustaining Group) and I worked on an interesting problem. We initially thought it was random corruption, but later figured out that it was an application problem. The customer also told us that the application worked fine on Solaris 9, but after migrating to Solaris 10 it started dumping core.

The owner field of a mutex_t was getting corrupted with an invalid address, which resulted in the application dumping core (SIGSEGV). The problem was easily reproducible on a single-CPU machine. The application was dying in mutex_trylock_adaptive() when it tried to dereference the owner. The owner field of the mutex held strange data while the application ran, yet in the core dump the owner field was zero. This surprised us a lot, and we believed there was some race.

All sorts of thoughts came to mind, including corruption of registers when the CPU context-switches to another thread, or some other thread overwriting the member in the mutex due to an over/underrun of an array.

We first started debugging the problem with procfs watchpoints. We set a watchpoint on the virtual address of the owner field of the mutex_t structure using mdb's :w command. The watchpoint fired frequently because the application had two threads contending for the lock quite often, so we decided to drive it with a script containing "$c, $r, :c". But whenever the corruption happened, the target process never got the corresponding watchpoint trap. That surprised us a lot, and we started wondering how that could happen.

We then turned to DTrace and truss to figure out what was happening between the point where mutex_unlock() clears the owner field and the point where the process dumps core. Along the way, we ruled out the theories we had formed at the beginning. We were running out of ideas when we looked carefully at the mutex_t members and noticed that the magic number was correct and the type of the mutex was USYNC_THREAD. We then used DTrace probes that fire on context switch, to see whether context switching was playing any role. During this exercise, we noticed that another process was getting on to the CPU right after the process that dumped core released the mutex. This rang a bell. We also noticed that the mutex address (virtual address) was the same when this context switch happened.

We took a look at the pmap(1) output and noticed that the mutex was in a shared memory segment. The other process had used the same key (see shmget(2)), which means the mutex was being used between processes. The corrupted value was in fact a valid address in the other process (a thread address, actually). This surprised us again, because the type of the mutex was USYNC_THREAD and we had *believed* it was only used between threads of the same process. If a mutex is to be used between processes, its type has to be USYNC_PROCESS, because one can't dereference the owner when the mutex is shared between processes (an inter-process mutex).

From the man page of mutex_init(3THR):

USYNC_THREAD
The mutex can synchronize threads only in this pro-
cess. arg is ignored.

USYNC_PROCESS
The mutex can synchronize threads in this process and
other processes. arg is ignored. The object initial-
ized with this attribute must be allocated in memory
shared between processes, either in System V shared
memory (see shmop(2)) or in memory mapped to a file
(see mmap(2)). If the object is not allocated in such
shared memory, it will not be shared between
processes.

We asked the submitter of the bug to make this modification, and it all worked fine. The customer came back confirming that our diagnosis was correct and modified the application accordingly.

Having spent a week or so on this, the bottom line is: don't take things for granted :)