<?xml version='1.0' encoding='UTF-8'?><?xml-stylesheet href="http://www.blogger.com/styles/atom.css" type="text/css"?><feed xmlns='http://www.w3.org/2005/Atom' xmlns:openSearch='http://a9.com/-/spec/opensearchrss/1.0/' xmlns:georss='http://www.georss.org/georss' xmlns:gd='http://schemas.google.com/g/2005' xmlns:thr='http://purl.org/syndication/thread/1.0'><id>tag:blogger.com,1999:blog-3121096342469843106</id><updated>2011-08-26T04:18:03.512-07:00</updated><title type='text'>Saurabh Mishra (Technical Blog)</title><subtitle type='html'></subtitle><link rel='http://schemas.google.com/g/2005#feed' type='application/atom+xml' href='http://mishrasdiary.blogspot.com/feeds/posts/default'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/3121096342469843106/posts/default?max-results=100'/><link rel='alternate' type='text/html' href='http://mishrasdiary.blogspot.com/'/><link rel='hub' href='http://pubsubhubbub.appspot.com/'/><author><name>Saurabh Mishra</name><uri>http://www.blogger.com/profile/02167304494816941944</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='30' height='32' src='http://2.bp.blogspot.com/_6UgirKk5E5c/SX6dSzcCqUI/AAAAAAAAN2U/hjGoSLVtFQ4/S220/saurabh1.jpg'/></author><generator version='7.00' uri='http://www.blogger.com'>Blogger</generator><openSearch:totalResults>19</openSearch:totalResults><openSearch:startIndex>1</openSearch:startIndex><openSearch:itemsPerPage>100</openSearch:itemsPerPage><entry><id>tag:blogger.com,1999:blog-3121096342469843106.post-3152266385826295122</id><published>2009-09-02T15:22:00.000-07:00</published><updated>2009-09-02T15:23:05.961-07:00</updated><title type='text'>Writing a new Ethernet device driver for Solaris</title><content type='html'>&lt;p&gt;This blog entry describes what you should keep in mind while writing a new Ethernet device driver for Solaris. 
This post does not cover LSO, hardware checksum offload or multiple RX rings, as I have not written code for these features.&lt;/p&gt;    &lt;p&gt;Most Ethernet controllers use descriptor-based TX and RX. The starting point for writing a new device driver is getting attach() and detach() working. That is fairly easy, but in attach() we would mostly want to do the following:&lt;/p&gt;    &lt;p&gt;- Read the vendor/device ID and make sure we have the correct chip by checking the revision.&lt;br /&gt;&lt;/p&gt;    &lt;p&gt;- Pre-allocate all DMA buffers for TX. You have to pre-allocate all RX buffers anyway. This is the simplest model you can think of, but it requires a bcopy (an extra copy during TX/RX). But hey, you are just starting...&lt;br /&gt;&lt;/p&gt;    &lt;p&gt;- Allocate interrupts, and register with the MAC and MII layers.&lt;/p&gt;    &lt;p&gt;- Reset the PHY if required, and do it before starting MII (the mii_start() function). Reset the device too.&lt;br /&gt;&lt;/p&gt;    &lt;p&gt;- Enable device interrupts as the very last operation before returning from attach().&lt;/p&gt;   &lt;p&gt;- The MII layer in Solaris takes care of PHY operations and dladm link properties too. You still need getprop and setprop entry points in your MAC callbacks (m_callbacks), which can pass link properties on to MII. MII can also take care of some common statistics and ndd. You need to implement the PHY read/write/reset operations, which are PHY specific.&lt;br /&gt;&lt;/p&gt;    &lt;p&gt;One thing I'd like to point out here: have one DMA alloc function and one free function to allocate and free a DMA handle and its memory. It simplifies the code a lot. The same function can be used to allocate the TX/RX descriptor rings, the TX/RX DMA buffers, and memory for statistics or a control block. You pass it a DMA attribute structure and a flag (DDI_DMA_READ or DDI_DMA_WRITE). 
A typical pair of such functions looks like this:&lt;/p&gt;    &lt;pre&gt;typedef struct xxxx_dma_data {&lt;br /&gt;	ddi_dma_handle_t	hdl;	/* DMA handle */&lt;br /&gt;	ddi_acc_handle_t	acchdl;	/* access handle of the memory */&lt;br /&gt;	ddi_dma_cookie_t	cookie;	/* device-visible address and size */&lt;br /&gt;	caddr_t			addr;	/* kernel virtual address */&lt;br /&gt;	size_t			len;	/* actual length allocated */&lt;br /&gt;	uint_t			count;	/* number of DMA cookies */&lt;br /&gt;} xxxx_dma_t;&lt;br /&gt;&lt;br /&gt;/*&lt;br /&gt; * Allocate, map and bind one DMA-able block. xxxx_mem_attr is assumed&lt;br /&gt; * to be the driver's ddi_device_acc_attr_t; flag is DDI_DMA_READ,&lt;br /&gt; * DDI_DMA_WRITE or DDI_DMA_RDWR.&lt;br /&gt; */&lt;br /&gt;xxxx_dma_t *&lt;br /&gt;xxxx_alloc_a_dma_blk(xxxx_t *xxxxp, ddi_dma_attr_t *attr, size_t size,&lt;br /&gt;    uint_t flag)&lt;br /&gt;{&lt;br /&gt;	int err;&lt;br /&gt;	xxxx_dma_t *dma;&lt;br /&gt;&lt;br /&gt;	dma = kmem_zalloc(sizeof (xxxx_dma_t), KM_SLEEP);&lt;br /&gt;&lt;br /&gt;	err = ddi_dma_alloc_handle(xxxxp-&gt;xxxx_dip, attr,&lt;br /&gt;	    DDI_DMA_SLEEP, NULL, &amp;amp;dma-&gt;hdl);&lt;br /&gt;	if (err != DDI_SUCCESS)&lt;br /&gt;		goto fail;&lt;br /&gt;&lt;br /&gt;	err = ddi_dma_mem_alloc(dma-&gt;hdl, size, &amp;amp;xxxx_mem_attr,&lt;br /&gt;	    DDI_DMA_CONSISTENT, DDI_DMA_SLEEP, NULL,&lt;br /&gt;	    &amp;amp;dma-&gt;addr, &amp;amp;dma-&gt;len, &amp;amp;dma-&gt;acchdl);&lt;br /&gt;	if (err != DDI_SUCCESS) {&lt;br /&gt;		ddi_dma_free_handle(&amp;amp;dma-&gt;hdl);&lt;br /&gt;		goto fail;&lt;br /&gt;	}&lt;br /&gt;&lt;br /&gt;	/* A successful bind returns DDI_DMA_MAPPED, not DDI_SUCCESS. */&lt;br /&gt;	err = ddi_dma_addr_bind_handle(dma-&gt;hdl, NULL, dma-&gt;addr,&lt;br /&gt;	    dma-&gt;len, flag | DDI_DMA_CONSISTENT, DDI_DMA_SLEEP,&lt;br /&gt;	    NULL, &amp;amp;dma-&gt;cookie, &amp;amp;dma-&gt;count);&lt;br /&gt;	if (err != DDI_DMA_MAPPED) {&lt;br /&gt;		ddi_dma_mem_free(&amp;amp;dma-&gt;acchdl);&lt;br /&gt;		ddi_dma_free_handle(&amp;amp;dma-&gt;hdl);&lt;br /&gt;		goto fail;&lt;br /&gt;	}&lt;br /&gt;&lt;br /&gt;	/* Only one cookie is kept, so the block must map to a single cookie. */&lt;br /&gt;	if (dma-&gt;count != 1) {&lt;br /&gt;		(void) ddi_dma_unbind_handle(dma-&gt;hdl);&lt;br /&gt;		ddi_dma_mem_free(&amp;amp;dma-&gt;acchdl);&lt;br /&gt;		ddi_dma_free_handle(&amp;amp;dma-&gt;hdl);&lt;br /&gt;		goto fail;&lt;br /&gt;	}&lt;br /&gt;&lt;br /&gt;	return (dma);&lt;br /&gt;fail:&lt;br /&gt;	kmem_free(dma, sizeof (xxxx_dma_t));&lt;br /&gt;	return (NULL);&lt;br /&gt;}&lt;br /&gt;&lt;br /&gt;void&lt;br /&gt;xxxx_free_a_dma_blk(xxxx_dma_t *dma)&lt;br /&gt;{&lt;br /&gt;	if (dma != NULL) {&lt;br /&gt;		(void) ddi_dma_unbind_handle(dma-&gt;hdl);&lt;br /&gt;		ddi_dma_mem_free(&amp;amp;dma-&gt;acchdl);&lt;br /&gt;		ddi_dma_free_handle(&amp;amp;dma-&gt;hdl);&lt;br /&gt;		kmem_free(dma, sizeof (xxxx_dma_t));&lt;br /&gt;	}&lt;br /&gt;}&lt;/pre&gt;
&lt;p&gt;&lt;u&gt;Some of the corner cases you must take care of&lt;/u&gt;:&lt;/p&gt;    &lt;p&gt;- Test the code path where the driver runs out of TX descriptors while sending a packet. In that case your transmit entry point returns the unsent packets to MAC, and you must call mac_tx_update() once a descriptor has been reclaimed so that MAC resumes transmission. Some drivers start reclaiming only once a threshold is reached.&lt;/p&gt;   &lt;p&gt;- Make sure you handle the RX FIFO overflow interrupt properly. The driver may not have enough RX descriptors to receive further packets, so you must consume the RX descriptors that have already been posted. Some chips require a reset after an RX FIFO overflow.&lt;br /&gt;&lt;/p&gt;    &lt;p&gt;&lt;u&gt;General things that you may want to consider:&lt;/u&gt;&lt;br /&gt;&lt;/p&gt;    &lt;p&gt;- Call mac_tx_update() without holding the driver's locks.&lt;/p&gt;    &lt;p&gt;- Try to raise a software interrupt whenever a hardware interrupt is raised, and don't spend too much time processing packets in hardware interrupt context.&lt;/p&gt;    &lt;p&gt;- Make sure the chip is quiesced when detach is called.&lt;/p&gt;   &lt;p&gt;- Use the DDI's ddi_periodic_add(9F) instead of timeout(9F).&lt;/p&gt;    &lt;p&gt;- Test suspend/resume and quiesce(9E) (needed for fast reboot to work).&lt;/p&gt;    &lt;p&gt;- I think most multicast filters are hash-based, but I have also seen a CAM (Content Addressable Memory) based filter. Supporting multicast can get tricky; in that case, just enable all-multicast mode. 
Hash-based multicast filters are easy to implement: keep a reference count for every bit of the 64-bit hash filter. When the reference count for a bit drops to zero, clear the bit; otherwise it stays set.&lt;br /&gt;&lt;/p&gt;    &lt;p&gt;- Make sure you handle link status changes properly, and reprogram the MAC registers if required when the link speed or duplex changes.&lt;/p&gt;    &lt;p&gt;- Look for memory leaks: set kmem_flags = 0xf in /etc/system, reboot, take a crash dump, and then run ::findleaks in mdb.&lt;br /&gt;&lt;/p&gt;    &lt;p&gt;You can use &lt;a href="http://www.opensolaris.org/os/community/device_drivers/projects/nicdrvtest/;jsessionid=74CDFBA0FD7DE2F915F4D0CC0EAAC720" title="NICDRV Testsuite"&gt;NICDRV&lt;/a&gt; or &lt;a href="http://www.sun.com/bigadmin/hcl/hcts/index.jsp#download" title="HCTS Testsuite"&gt;HCTS&lt;/a&gt; for testing. NICDRV stress-tests most of the components in your driver, including MAXQ, FTP, ping with different payloads, driver load/unload, multicast, dladm(1M) features, VLANs and VNICs.&lt;/p&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/3121096342469843106-3152266385826295122?l=mishrasdiary.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://mishrasdiary.blogspot.com/feeds/3152266385826295122/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=3121096342469843106&amp;postID=3152266385826295122' title='39 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/3121096342469843106/posts/default/3152266385826295122'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/3121096342469843106/posts/default/3152266385826295122'/><link rel='alternate' type='text/html' 
href='http://mishrasdiary.blogspot.com/2009/09/writing-new-ethernet-device-driver-for.html' title='Writing a new Ethernet device driver for Solaris'/><author><name>Saurabh Mishra</name><uri>http://www.blogger.com/profile/02167304494816941944</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='30' height='32' src='http://2.bp.blogspot.com/_6UgirKk5E5c/SX6dSzcCqUI/AAAAAAAAN2U/hjGoSLVtFQ4/S220/saurabh1.jpg'/></author><thr:total>39</thr:total></entry><entry><id>tag:blogger.com,1999:blog-3121096342469843106.post-120454129371456945</id><published>2009-08-04T14:45:00.000-07:00</published><updated>2009-08-24T14:47:05.095-07:00</updated><title type='text'>EOI (End-of-Interrupt) vs Directed-EOI</title><content type='html'>&lt;p&gt;This post is to help distinguish between EOI and Directed-EOI. When software writes to the EOI register of a local APIC, the local APIC does two things:&lt;/p&gt; &lt;p&gt;- It clears the appropriate bit in its ISR (In-Service Register).&lt;/p&gt; &lt;p&gt;- If the interrupt being completed is level-triggered, it broadcasts an EOI message to all the I/O APICs in the system.&lt;/p&gt;&lt;p&gt;In Solaris, we write the EOI register of the local APIC at two different places:&lt;br /&gt;&lt;/p&gt; &lt;p&gt;- For edge-triggered interrupts, while raising the TPR (Task Priority Register), i.e., in apic_intr_enter().&lt;/p&gt; &lt;p&gt;- For level-triggered interrupts, when exiting from the interrupt handler, i.e., in apic_intr_exit().&lt;/p&gt; &lt;p&gt;The notion of Directed-EOI comes from the x2APIC specification. With Directed-EOI, no broadcast EOI message is sent to all the I/O APICs. Instead, we clear the ISR bit in the local APIC (by writing 0 to its EOI register) and then write the vector to the EOI register of the I/O APIC that delivered the interrupt. Some CPUs can suppress the broadcast EOI message, and that is when Directed-EOI comes in handy. Note that Directed-EOI has no meaning for edge-triggered interrupts. 
So for edge-triggered interrupts, we don't send any Directed-EOI.&lt;/p&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/3121096342469843106-120454129371456945?l=mishrasdiary.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://mishrasdiary.blogspot.com/feeds/120454129371456945/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=3121096342469843106&amp;postID=120454129371456945' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/3121096342469843106/posts/default/120454129371456945'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/3121096342469843106/posts/default/120454129371456945'/><link rel='alternate' type='text/html' href='http://mishrasdiary.blogspot.com/2009/08/eoi-end-of-interrupt-vs-directed-eoi.html' title='EOI (End-of-Interrupt) vs Directed-EOI'/><author><name>Saurabh Mishra</name><uri>http://www.blogger.com/profile/02167304494816941944</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='30' height='32' src='http://2.bp.blogspot.com/_6UgirKk5E5c/SX6dSzcCqUI/AAAAAAAAN2U/hjGoSLVtFQ4/S220/saurabh1.jpg'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-3121096342469843106.post-3752268913904257244</id><published>2009-06-16T14:48:00.000-07:00</published><updated>2009-08-24T14:49:24.156-07:00</updated><title type='text'>x2APIC and a new device driver for Broadcom Fast Ethernet chips</title><content type='html'>It's been quite a while since I wrote something technical on my blog. I have been working on quite a few things of late. 
Since integrating x2APIC, a new local APIC mode that uses MSRs (Model Specific Registers) on future-generation Intel processors, I took on a small challenge: working on device drivers, and for an Ethernet controller at that. Having gained no knowledge of networking or device drivers in past years, I thought this was the time to jump in. Better late than never, you know. So this post is really about two major things:&lt;br /&gt;   &lt;p&gt;&lt;br /&gt;&lt;b&gt;&lt;a href="http://src.opensolaris.org/source/xref/onnv/onnv-gate/usr/src/uts/i86pc/io/pcplusmp/apic_regops.c#50" title="x2APIC"&gt;x2APIC&lt;/a&gt; &lt;/b&gt;- A new local APIC (Advanced Programmable Interrupt Controller) mode. It improves performance because the local APIC registers can be written in parallel. With xAPIC (the MMIO model), the local APIC registers are mapped in memory, and hence writes to that I/O space get serialized. x2APIC also brings some improvements to IPIs (Inter-Processor Interrupts). It also extends support to local APIC IDs &gt; 255, but I don't think any BIOS programs a local APIC ID &gt; 255 as of now.&lt;br /&gt;&lt;br /&gt;&lt;a title="Broadcom (bfe) Fast Ethernet Driver" href="http://src.opensolaris.org/source/xref/onnv/onnv-gate/usr/src/uts/common/io/bfe/bfe.c#55"&gt;&lt;b&gt;Broadcom Fast Ethernet (SUNWbfe)&lt;/b&gt;&lt;/a&gt; - This project turned out to be a good experience. I had no prior knowledge of writing device drivers or of Ethernet controllers. Initially, I was quite confused about the ring architecture, descriptors and buffers. I was not able to fit everything into a big picture and convince myself that it worked. I learned about them after spending about two weeks looking for documents on how TX/RX rings are organized. 
So the first thing I did was to document how the TX/RX rings are organized; it's described &lt;a title="bfe" href="http://src.opensolaris.org/source/xref/onnv/onnv-gate/usr/src/uts/common/io/bfe/bfe.c#55"&gt;here&lt;/a&gt;.&lt;br /&gt;&lt;br /&gt;Solaris now supports the Broadcom 100BASE-T Fast Ethernet controller. It is a somewhat old but popular Ethernet controller, and it makes more sense on netbooks than on laptops. This chip has only one TX ring and one RX ring. The number of descriptors is programmable, and it supports multicast through a 64-entry CAM (Content Addressable Memory). It does not support jumbo frames, though, so the MTU is 1500. Having integrated bfe into Solaris Nevada the other day, my next target is to add support for Atheros/Attansic Ethernet controllers. They come in three flavors:&lt;br /&gt;&lt;br /&gt;- &lt;b&gt;Atheros/Attansic L2&lt;/b&gt; Fast Ethernet, device ID 0x2048&lt;br /&gt;&lt;br /&gt;- &lt;b&gt;Atheros/Attansic AR8121/AR8113&lt;/b&gt; PCI-E Ethernet controller, device ID 0x1026&lt;br /&gt;&lt;br /&gt;- &lt;b&gt;Atheros/Attansic L1&lt;/b&gt; Gigabit Ethernet (10/100/1000BASE-T), device ID 0x1048&lt;br /&gt;&lt;br /&gt;The plan is to support all three chips in atge (a new device driver, SUNWatge). 
I have started the work and expect to complete it in two to three months.&lt;/p&gt;    &lt;p&gt;/&lt;a href="http://saurabhslr.blogspot.com/"&gt;Saurabh&lt;/a&gt;&lt;/p&gt;    &lt;a href="http://saurabhslr.blogspot.com/"&gt;http://saurabhslr.blogspot.com&lt;/a&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/3121096342469843106-3752268913904257244?l=mishrasdiary.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://mishrasdiary.blogspot.com/feeds/3752268913904257244/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=3121096342469843106&amp;postID=3752268913904257244' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/3121096342469843106/posts/default/3752268913904257244'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/3121096342469843106/posts/default/3752268913904257244'/><link rel='alternate' type='text/html' href='http://mishrasdiary.blogspot.com/2009/06/x2apic-and-new-device-driver-for.html' title='x2APIC and a new device driver for Broadcom Fast Ethernet chips'/><author><name>Saurabh Mishra</name><uri>http://www.blogger.com/profile/02167304494816941944</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='30' height='32' src='http://2.bp.blogspot.com/_6UgirKk5E5c/SX6dSzcCqUI/AAAAAAAAN2U/hjGoSLVtFQ4/S220/saurabh1.jpg'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-3121096342469843106.post-2056461641676790738</id><published>2009-04-19T14:49:00.000-07:00</published><updated>2009-08-24T14:50:05.164-07:00</updated><title type='text'>Install-Time-Update (ITU) and Driver Binding in Solaris</title><content type='html'>&lt;p&gt;If you have ever wondered how to create install-time 
driver updates for Solaris 10 and Nevada, you may want to read this blog entry, as it involves a few tricks here and there. There are two ways to make your device work with Solaris. The install-time update (aka ITU, DU or ITU diskette) &lt;span style="text-decoration: none;"&gt;&lt;span&gt;is only required when the disk drive will become the Solaris boot drive.&lt;/span&gt;&lt;/span&gt; In all other cases, you should be able to build a package and use the pkgadd(1M) command to install it on a running Solaris system.&lt;br /&gt;&lt;/p&gt;&lt;h2&gt;&lt;u&gt;ITU Method&lt;/u&gt;&lt;br /&gt;&lt;/h2&gt;To install Solaris onto a bootable drive supported by your driver, you can use an Install Time Update (ITU). The ITU must contain your driver (both 32-bit and 64-bit binaries) and the PCI IDs of the devices your driver supports.&lt;br /&gt; &lt;h4&gt;&lt;u&gt;&lt;b&gt;How to construct an ITU&lt;/b&gt;&lt;/u&gt;&lt;/h4&gt;  &lt;ul&gt;&lt;li&gt;&lt;p style="margin-bottom: 0in; text-decoration: none;"&gt;&lt;span&gt;Make sure you have Solaris 10 and Nevada binaries of your driver for both the 32-bit and 64-bit kernels, as well as your_driver.conf (the driver configuration file). 
You can get the pkg_drv(1M) command by installing the SUNWpkgd package from &lt;span style="text-decoration: underline;"&gt;&lt;u&gt;this&lt;/u&gt; &lt;a href="http://developers.sun.com/solaris/developer/support/driver/tools/SUNWpkgd/html/index.html"&gt;link&lt;/a&gt;&lt;/span&gt;&lt;/span&gt;&lt;br /&gt;&lt;/p&gt;  &lt;p style="margin-bottom: 0in; text-decoration: none;"&gt;To create ITUs for both Solaris 10 and Nevada, create two directories and run pkg_drv(1M) in each of them.&lt;/p&gt; &lt;/li&gt;&lt;/ul&gt;  &lt;h4&gt;&lt;u&gt;&lt;b&gt;For Solaris 10&lt;/b&gt;&lt;/u&gt;&lt;/h4&gt;   &lt;p style="margin-bottom: 0in; text-decoration: none;"&gt; # mkdir -p /var/tmp/your_driver.5.10&lt;br /&gt;# cd /var/tmp/your_driver.5.10&lt;/p&gt;  &lt;p style="margin-bottom: 0in; text-decoration: none;"&gt;Copy your driver binaries and your_driver.conf into the current directory tree:&lt;/p&gt;     &lt;p style="margin-bottom: 0in; text-decoration: none;"&gt; # mkdir -p kernel/drv/amd64&lt;br /&gt;# cp &amp;lt;32-bit&amp;gt; .&lt;br /&gt;# cp &amp;lt;32-bit&amp;gt; kernel/drv/your_driver&lt;br /&gt;# cp &amp;lt;64-bit&amp;gt; kernel/drv/amd64/your_driver&lt;br /&gt;# cp your_driver.conf .&lt;br /&gt;# cp your_driver.conf kernel/drv/your_driver.conf&lt;br /&gt;# pkg_drv -i '"pciVVVV,DDDD.SSSS.ssss"' -o `pwd`/PKG -c scsi -r 5.10 your_driver&lt;br /&gt;&lt;/p&gt; &lt;p&gt;VVVV = Vendor-id&lt;br /&gt;DDDD = Device-id&lt;br /&gt;SSSS = Subsystem-vendor-id&lt;br /&gt;ssss = Subsystem-device-id&lt;br /&gt;PKG = your driver's package name.&lt;br /&gt;'-c scsi' specifies the device class; in this example we are discussing a disk drive.&lt;br /&gt;&lt;/p&gt;&lt;p style="margin-bottom: 0in; text-decoration: none;"&gt; The output of pkg_drv(1M) will resemble the following:&lt;br /&gt;&lt;/p&gt;    &lt;p style="margin-bottom: 0in; text-decoration: none;"&gt; input file: drv=your_driver&lt;br /&gt;input file: conf=your_driver.conf&lt;br /&gt;WARNING: pkg_drv: pkg/driver name exists in /etc/driver_aliases&lt;br /&gt;Suggested Package Naming Conventions: 8 characters, with 
the first capitalized characters uniquely specifying the company (e.g. stock market ticker). The remaining characters specify the driver (e.g. SUNWcadd for a CAD driver from Sun Microsystems). The driver name must be unique across all Solaris platforms and releases.  &lt;/p&gt;                     &lt;p style="margin-bottom: 0in; text-decoration: none;"&gt; ## Building pkgmap from package prototype file.&lt;br /&gt;## Processing pkginfo file.&lt;br /&gt;## Attempting to volumize 8 entries in pkgmap.&lt;br /&gt;part  1 -- 276 blocks, 29 entries&lt;br /&gt;## Packaging one part.&lt;br /&gt;/tmp/12546/PKG/pkgmap&lt;br /&gt;/tmp/12546/PKG/pkginfo&lt;br /&gt;/tmp/12546/PKG/reloc/boot/solaris/devicedb/master&lt;br /&gt;/tmp/12546/PKG/install/copyright&lt;br /&gt;/tmp/12546/PKG/install/depend&lt;br /&gt;/tmp/12546/PKG/install/i.master&lt;br /&gt;/tmp/12546/PKG/reloc/kernel/drv/your_driver&lt;br /&gt;/tmp/12546/PKG/reloc/kernel/drv/your_driver.conf&lt;br /&gt;/tmp/12546/PKG/install/postinstall&lt;br /&gt;/tmp/12546/PKG/install/postremove&lt;br /&gt;/tmp/12546/PKG/install/r.master&lt;br /&gt;## Validating control scripts.&lt;br /&gt;## Packaging complete.&lt;br /&gt;output pkg: See package directory PKG in /tmp/12546&lt;br /&gt;pkg_drv: 2 warnings 0 errors&lt;/p&gt;                    &lt;p style="margin-bottom: 0in; text-decoration: none;"&gt;&lt;br /&gt;bash-3.2# find /tmp/12546&lt;br /&gt;/tmp/12546&lt;br /&gt;/tmp/12546/PKG&lt;br /&gt;/tmp/12546/PKG/pkgmap&lt;br /&gt;/tmp/12546/PKG/pkginfo&lt;br /&gt;/tmp/12546/PKG/reloc&lt;br /&gt;/tmp/12546/PKG/reloc/boot&lt;br /&gt;/tmp/12546/PKG/reloc/boot/solaris&lt;br /&gt;/tmp/12546/PKG/reloc/boot/solaris/devicedb&lt;br /&gt;/tmp/12546/PKG/reloc/boot/solaris/devicedb/master&lt;br /&gt;/tmp/12546/PKG/reloc/kernel&lt;br /&gt;/tmp/12546/PKG/reloc/kernel/drv&lt;br /&gt;/tmp/12546/PKG/reloc/kernel/drv/your_driver&lt;br /&gt;/tmp/12546/PKG/reloc/kernel/drv/your_driver.conf&lt;br /&gt;/tmp/12546/PKG/install&lt;br 
/&gt;/tmp/12546/PKG/install/copyright&lt;br /&gt;/tmp/12546/PKG/install/depend&lt;br /&gt;/tmp/12546/PKG/install/i.master&lt;br /&gt;/tmp/12546/PKG/install/postinstall&lt;br /&gt;/tmp/12546/PKG/install/postremove&lt;br /&gt;/tmp/12546/PKG/install/r.master  &lt;/p&gt;   &lt;p style="margin-bottom: 0in; text-decoration: none;"&gt;Copy the following files from '/tmp/12546':&lt;br /&gt;&lt;/p&gt;      &lt;p style="margin-bottom: 0in; text-decoration: none;"&gt; # cd /var/tmp/your_driver.5.10&lt;br /&gt;# cp /tmp/12546/PKG/pkgmap .&lt;br /&gt;# cp /tmp/12546/PKG/pkginfo .&lt;br /&gt;# cp /tmp/12546/PKG/install/postinstall .&lt;br /&gt;# cp /tmp/12546/PKG/install/postremove .&lt;br /&gt;# cp /tmp/12546/PKG/install/copyright .&lt;br /&gt;&lt;/p&gt; &lt;p style="margin-bottom: 0in; text-decoration: none;"&gt;You can run the pkgproto(1) command or write a prototype file manually:&lt;/p&gt;            &lt;p style="margin-bottom: 0in; text-decoration: none;"&gt; bash-3.2# cat &gt; prototype&lt;br /&gt;i copyright&lt;br /&gt;i postremove&lt;br /&gt;i postinstall&lt;br /&gt;i pkginfo&lt;br /&gt;d none kernel 0755 root sys&lt;br /&gt;d none kernel/drv 0755 root sys&lt;br /&gt;d none kernel/drv/amd64 0755 root sys&lt;br /&gt;f none kernel/drv/amd64/your_driver 0644 root sys&lt;br /&gt;f none kernel/drv/your_driver 0644 root sys&lt;br /&gt;f none kernel/drv/your_driver.conf 0644 root sys  &lt;/p&gt;   &lt;p style="margin-bottom: 0in; text-decoration: none;"&gt;Make sure you include both the 32-bit and 64-bit binaries of your driver. Once this is done, build the package again so that it includes the 64-bit binary of the driver:&lt;/p&gt; &lt;p style="margin-bottom: 0in; text-decoration: none;"&gt; # cd /var/tmp/your_driver.5.10&lt;br /&gt;# pkgmk -r . -d /tmp&lt;/p&gt;   &lt;p style="margin-bottom: 0in; text-decoration: none;"&gt;This creates the '/tmp/PKG' directory, which contains the package. 
For example:&lt;br /&gt;&lt;/p&gt;                 &lt;p style="margin-bottom: 0in; text-decoration: none;"&gt; bash-3.2# pkgmk -r . -d /tmp&lt;br /&gt;## Building pkgmap from package prototype file.&lt;br /&gt;## Processing pkginfo file.&lt;br /&gt;## Attempting to volumize 6 entries in pkgmap.&lt;br /&gt;part  1 -- 444 blocks, 23 entries&lt;br /&gt;## Packaging one part.&lt;br /&gt;/tmp/PKG/pkgmap&lt;br /&gt;/tmp/PKG/pkginfo&lt;br /&gt;/tmp/PKG/install/copyright&lt;br /&gt;/tmp/PKG/reloc/kernel/drv/amd64/your_driver&lt;br /&gt;/tmp/PKG/reloc/kernel/drv/your_driver&lt;br /&gt;/tmp/PKG/reloc/kernel/drv/your_driver.conf&lt;br /&gt;/tmp/PKG/install/postinstall&lt;br /&gt;/tmp/PKG/install/postremove&lt;br /&gt;## Validating control scripts.&lt;br /&gt;## Packaging complete.&lt;br /&gt;bash-3.2#&lt;/p&gt;   &lt;p style="margin-bottom: 0in; text-decoration: none;"&gt;Do the following to repack the package into the DU (diskette) layout:&lt;br /&gt;&lt;/p&gt;    &lt;p style="margin-bottom: 0in; text-decoration: none;"&gt;# cd /tmp&lt;br /&gt;# find PKG -print | cpio -o &gt; /tmp/pkg_of_your_driver&lt;br /&gt;# compress /tmp/pkg_of_your_driver&lt;br /&gt;# cd /var/tmp/your_driver.5.10&lt;br /&gt;# cp /tmp/pkg_of_your_driver.Z PKG/DU/sol_210/i86pc/Product/your_driver.Z&lt;/p&gt;  &lt;h4&gt;&lt;u&gt;&lt;b&gt;For Solaris Nevada&lt;/b&gt;&lt;/u&gt;&lt;/h4&gt;   &lt;p style="margin-bottom: 0in; text-decoration: none;"&gt;Repeat the same steps as for Solaris 10, with the following exceptions:&lt;br /&gt;&lt;/p&gt; &lt;ul&gt;&lt;li&gt;&lt;p style="margin-bottom: 0in; text-decoration: none;"&gt;Create a new directory '/var/tmp/your_driver.5.11', since you are working on Solaris Nevada. 
Make sure you run pkg_drv(1M) with '-r 5.11'.&lt;/p&gt;  &lt;/li&gt;&lt;li&gt;&lt;p style="margin-bottom: 0in; text-decoration: none;"&gt;When copying your_driver.Z into the DU, use 'sol_211' instead of 'sol_210' in the path 'PKG/DU/sol_210/i86pc/Product/your_driver.Z'.&lt;br /&gt;&lt;br /&gt;&lt;/p&gt; &lt;/li&gt;&lt;/ul&gt;  &lt;p style="margin-bottom: 0in; text-decoration: none;"&gt;Once you have created ITUs for Solaris 10 and Nevada, bundle them on one DVD/CD (or ISO image). In each of the directories '/var/tmp/your_driver.5.11' and '/var/tmp/your_driver.5.10', you will find a directory called 'PKG'. Copy the contents of both 'PKG' directories into one directory to bundle them together:&lt;br /&gt;&lt;/p&gt;    &lt;p style="margin-bottom: 0in; text-decoration: none;"&gt; # mkdir -p /var/tmp/YOUR_DRIVER-DU&lt;br /&gt;# cd /var/tmp/YOUR_DRIVER-DU&lt;br /&gt;# cp -rf /var/tmp/your_driver.5.11/PKG/* .&lt;br /&gt;# cp -rf /var/tmp/your_driver.5.10/PKG/* .&lt;/p&gt;  &lt;p style="margin-bottom: 0in; text-decoration: none;"&gt;Run the following command to make an ISO image from the directory /var/tmp/YOUR_DRIVER-DU:&lt;/p&gt;   &lt;p style="margin-bottom: 0in; text-decoration: none;"&gt; # mkisofs -o your_driver.iso -r /var/tmp/YOUR_DRIVER-DU&lt;br /&gt;&lt;/p&gt;  &lt;p style="margin-bottom: 0in; text-decoration: none;"&gt;This creates the ISO image 'your_driver.iso'. You can then burn a DVD/CD with the following command:&lt;br /&gt;&lt;/p&gt;  &lt;p style="margin-bottom: 0in; text-decoration: none;"&gt; # cdrw -i /var/tmp/YOUR_DRIVER-DU/your_driver.iso&lt;br /&gt;&lt;/p&gt;  &lt;p&gt;To install Solaris on the boot drive, boot the Solaris installer DVD and choose option '5' (Apply Driver Updates). 
Then follow the instructions when prompted.&lt;br /&gt;&lt;/p&gt;&lt;p style="margin-bottom: 0in; text-decoration: none;"&gt;The other way is to bundle the device driver into the Solaris bootable media itself, or into the image used for network installation. Follow the instructions at this &lt;a href="http://www.sun.com/bigadmin/features/articles/device_driver_install.jsp"&gt;link&lt;/a&gt;, which describes how to unpack and repack the Solaris miniroot in order to modify the Solaris bootable media.&lt;br /&gt;&lt;br /&gt;&lt;/p&gt;&lt;h2&gt;&lt;u&gt;Driver Binding in Solaris&lt;br /&gt;&lt;/u&gt;&lt;/h2&gt;&lt;p&gt;Driver binding in Solaris is not so easy to understand. Solaris binds a driver to a device based on precedence. The precedence list is kept in the 'compatible' property of the device node. The two functions responsible for creating the 'compatible' property and for finding the correct driver binding are &lt;a href="http://src.opensolaris.org/source/xref/onnv/onnv-gate/usr/src/uts/intel/io/pci/pci_boot.c#1585"&gt;add_compatible()&lt;/a&gt; and &lt;a href="http://src.opensolaris.org/source/xref/onnv/onnv-gate/usr/src/uts/common/os/devcfg.c#2635"&gt;ddi_compatible_driver_major()&lt;/a&gt;, respectively.&lt;br /&gt;&lt;br /&gt;add_compatible() creates the 'compatible' property, with its names in the order described below. 
For a PCI card, the precedence is as follows:&lt;br /&gt;&lt;/p&gt;&lt;p&gt; * pciVVVV,DDDD.SSSS.ssss.RR (0)&lt;br /&gt; * pciVVVV,DDDD.SSSS.ssss (1)&lt;br /&gt; * pciSSSS,ssss (2)&lt;br /&gt; * pciVVVV,DDDD.RR (3)&lt;br /&gt; * pciVVVV,DDDD (4)&lt;br /&gt; * pciclass,CCSSPP (5)&lt;br /&gt; * pciclass,CCSS (6)&lt;br /&gt;&lt;br /&gt;For a PCI Express card, the precedence looks like this:&lt;/p&gt;&lt;p&gt; * pciexVVVV,DDDD.SSSS.ssss.RR (0)&lt;br /&gt; * pciexVVVV,DDDD.SSSS.ssss (1)&lt;br /&gt; * pciexVVVV,DDDD.RR (2)&lt;br /&gt; * pciexVVVV,DDDD (3)&lt;br /&gt; * pciexclass,CCSSPP (4)&lt;br /&gt; * pciexclass,CCSS (5)&lt;br /&gt; * pciVVVV,DDDD.SSSS.ssss.RR (6)&lt;br /&gt; * pciVVVV,DDDD.SSSS.ssss (7)&lt;br /&gt; * pciSSSS,ssss (8)&lt;br /&gt; * pciVVVV,DDDD.RR (9)&lt;br /&gt; * pciVVVV,DDDD (10)&lt;br /&gt; * pciclass,CCSSPP (11)&lt;br /&gt; * pciclass,CCSS (12)&lt;br /&gt;&lt;/p&gt;&lt;p&gt;RR = Revision ID&lt;br /&gt;CC = Base class code&lt;br /&gt;SS (in the class names) = Sub-class code&lt;br /&gt;PP = Programming interface&lt;br /&gt;(0) = highest precedence&lt;br /&gt;(12) = lowest precedence.&lt;/p&gt;&lt;p&gt;You can see the 'compatible' property by running 'prtconf -vp'. If Solaris fails to find a binding using the 'compatible' property, it tries the node name, which is constructed from the Subsystem-vendor-id (SSSS) and Subsystem-device-id (ssss) of the device. 
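&lt;/p&gt;&lt;p&gt;To make the PCI precedence list concrete, here is a minimal sketch in C of how such a list could be generated for a conventional PCI function. It is not the add_compatible() source. The names pci_ids_t, put_str(), put_hex(), put_pci_pair() and build_pci_compatible() are hypothetical, and the formatting rules are assumptions: IDs in lowercase hex without zero padding, class bytes as two hex digits each, and entries (0) to (2) skipped when the subsystem vendor id is 0. It makes no library calls, so the same logic would read the same in kernel or user code.&lt;/p&gt;

```c
/*
 * Hypothetical sketch, not the Solaris add_compatible() source: build the
 * 'compatible' names of a conventional PCI function in the precedence order
 * listed above (index 0 = highest). Assumed formatting: IDs in lowercase hex
 * without zero padding; CC, SS and PP as two hex digits each.
 */
#define	COMPAT_MAX	7	/* entries (0) through (6) */
#define	COMPAT_LEN	48	/* longest possible name plus the NUL */

typedef struct pci_ids {
	unsigned int vid, did;		/* VVVV, DDDD */
	unsigned int svid, sdid;	/* SSSS, ssss */
	unsigned int rev;		/* RR */
	unsigned int cc, ss, pp;	/* class, sub-class, prog-if */
} pci_ids_t;

/* Copy s to buf starting at pos; return the index of the new NUL. */
static int
put_str(char *buf, int pos, const char *s)
{
	while (*s != '\0')
		buf[pos++] = *s++;
	buf[pos] = '\0';
	return (pos);
}

/* Append v in lowercase hex, zero-padded to width digits (0: no padding). */
static int
put_hex(char *buf, int pos, unsigned int v, int width)
{
	char tmp[8];
	int n = 0;

	do {
		tmp[n++] = "0123456789abcdef"[v % 16];
		v /= 16;
	} while (v != 0);
	while (width > n)
		tmp[n++] = '0';
	while (n > 0)
		buf[pos++] = tmp[--n];
	buf[pos] = '\0';
	return (pos);
}

/* Write "pciAAAA,BBBB" at the start of buf; return its length. */
static int
put_pci_pair(char *buf, unsigned int a, unsigned int b)
{
	int pos = put_str(buf, 0, "pci");

	pos = put_hex(buf, pos, a, 0);
	pos = put_str(buf, pos, ",");
	return (put_hex(buf, pos, b, 0));
}

/*
 * Fill names[] in precedence order and return the number of entries.
 * Assumption: entries (0)-(2) are generated only when the subsystem
 * vendor id is non-zero.
 */
static int
build_pci_compatible(const pci_ids_t *p, char names[][COMPAT_LEN])
{
	int n = 0, pos;

	if (p->svid != 0) {
		/* (1) pciVVVV,DDDD.SSSS.ssss */
		pos = put_pci_pair(names[1], p->vid, p->did);
		pos = put_str(names[1], pos, ".");
		pos = put_hex(names[1], pos, p->svid, 0);
		pos = put_str(names[1], pos, ".");
		(void) put_hex(names[1], pos, p->sdid, 0);
		/* (0) is (1) with ".RR" appended */
		pos = put_str(names[0], 0, names[1]);
		pos = put_str(names[0], pos, ".");
		(void) put_hex(names[0], pos, p->rev, 0);
		/* (2) pciSSSS,ssss */
		(void) put_pci_pair(names[2], p->svid, p->sdid);
		n = 3;
	}

	/* pciVVVV,DDDD, then the same name with ".RR" one slot earlier */
	(void) put_pci_pair(names[n + 1], p->vid, p->did);
	pos = put_str(names[n], 0, names[n + 1]);
	pos = put_str(names[n], pos, ".");
	(void) put_hex(names[n], pos, p->rev, 0);
	n += 2;

	/* pciclass,CCSS, then the same name with PP one slot earlier */
	pos = put_str(names[n + 1], 0, "pciclass,");
	pos = put_hex(names[n + 1], pos, p->cc, 2);
	(void) put_hex(names[n + 1], pos, p->ss, 2);
	pos = put_str(names[n], 0, names[n + 1]);
	(void) put_hex(names[n], pos, p->pp, 2);
	n += 2;

	return (n);
}
```

&lt;p&gt;For example, take assumed IDs 8086,100e, subsystem 8086,1e, revision 2 and class code 020000. The sketch yields pci8086,100e.8086.1e.2, pci8086,100e.8086.1e, pci8086,1e, pci8086,100e.2, pci8086,100e, pciclass,020000 and pciclass,0200, in that order. Each shorter name is built first and the longer one is derived by appending to a copy, so a shared prefix is never formatted twice.&lt;/p&gt;&lt;p&gt;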
The PCI IDs we have been looking at are stored in the PCI configuration space of the device.&lt;br /&gt;&lt;/p&gt;Device drivers and device firmware must make sure that proper PCI IDs are chosen to avoid conflicts with existing PCI IDs. If your device is a PCI Express card, you should add 'pciex'-prefixed aliases (for example 'pciexVVVV,DDDD' or 'pciexVVVV,DDDD.SSSS.ssss') in /etc/driver_aliases, or via the add_drv(1M) or pkg_drv(1M) command. This matters because the 'pciex' names take precedence over the 'pci' names in the 'compatible' property.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/3121096342469843106-2056461641676790738?l=mishrasdiary.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://mishrasdiary.blogspot.com/feeds/2056461641676790738/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=3121096342469843106&amp;postID=2056461641676790738' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/3121096342469843106/posts/default/2056461641676790738'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/3121096342469843106/posts/default/2056461641676790738'/><link rel='alternate' type='text/html' href='http://mishrasdiary.blogspot.com/2009/04/install-time-update-itu-and-driver.html' title='Install-Time-Update (ITU) and Driver Binding in Solaris'/><author><name>Saurabh Mishra</name><uri>http://www.blogger.com/profile/02167304494816941944</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='30' height='32' src='http://2.bp.blogspot.com/_6UgirKk5E5c/SX6dSzcCqUI/AAAAAAAAN2U/hjGoSLVtFQ4/S220/saurabh1.jpg'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-3121096342469843106.post-1807412175550375296</id><published>2009-03-20T14:51:00.001-07:00</published><updated>2009-08-24T14:52:25.961-07:00</updated><title type='text'>Latency group (lgroup) in Solaris on NUMA 
aware machines</title><content type='html'>You have probably heard of NUMA (Non-Uniform Memory Access) machines. I'm going to describe how memory latency groups (called lgroups in Solaris) are laid out. While working on the Multi-CPU binding project, I had to learn these details. I needed them to choose, for a thread, the lgroup with the least latency from its previous home lgroup.&lt;br /&gt;&lt;br /&gt;The figure below shows how the lgroup structures are laid out on SPARC-based NUMA machines. The root lgroup (0) is the topmost level of the hierarchy and contains all the resource sets in the system. lgroups 1, 2 and 3 each have four CPUs (one system board) and are the leaf nodes in this case. On SPARC, the remote latency from lgroup 2 to lgroup 1 or 3 is the same. In other words, the leaves are equidistant and there are only two latencies: local and remote. Solaris also has the lgroup partition load (lpl_t), which represents an lgroup's CPU and memory resources within a CPU partition. Each cpu_t (CPU structure) has a cpu_lpl pointer. lpls are also used when CPU partitions are created (processor sets being the best example). There's a global table of lgroups called lgrp_table[], and each partition keeps its lpls in cp_lgrploads[] (in cpupart_t). Both tables are indexed by lgroup ID. A thread is homed to an lpl within its CPU partition.&lt;br /&gt;&lt;img src="http://blogs.sun.com/saurabh_mishra/resource/sparc.jpg" /&gt; &lt;br /&gt;&lt;br /&gt;On a 4-way amd64 system, the lgroup representation is more interesting. Besides local memory, remote memory can be one or two hops away. For example, psrinfo(1M) showed:&lt;br /&gt;0       on-line   since 06/09/2006 06:49:25&lt;br /&gt;1       on-line   since 06/09/2006 06:49:31&lt;br /&gt;2       on-line   since 06/09/2006 06:49:33&lt;br /&gt;3       on-line   since 06/09/2006 06:49:35&lt;br /&gt;&lt;br /&gt;Each CPU is in its own leaf lgroup. The diagram below explains this very well. 
In this kind of configuration, there are non-leaf lgroups 5, 6, 7 and 8 representing resource sets that are within one hop. For example, lgroup 5 contains lgroups 1, 2 and 3 (lgroup 1 itself plus the lgroups one hop away from it). The root lgroup (0) contains everything.&lt;br /&gt;&lt;img src="http://blogs.sun.com/saurabh_mishra/resource/4-way-amd64.jpg" /&gt;&lt;br /&gt;On SPARC there are two levels in the memory hierarchy, whereas a 4-way amd64 system has three. An 8-way amd64 system should have four levels. Thread scheduling starts from the thread's home lgroup and moves up the hierarchy. Take a thread whose home (t-&gt;t_lpl) is lgroup 1, whose resource set is CPU 0. We first look at CPU 0. If the thread can't run there, we look at the parent of lgroup 1 (lpl_parent), which is lgroup 5, with lgroups 1, 2 and 3 as its resource set. The same is true when an idle thread steals work from other CPUs, so locality is preserved.&lt;br /&gt;&lt;br /&gt;The lgroup hierarchy is even more interesting when there are three hops (for example, on an 8-way amd64 box). I'll leave that for next time. Thanks to Jonathan Chew for taking the time to explain all this. 
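&lt;br /&gt;&lt;br /&gt;The upward search described above can be sketched in Python. This is a toy model of the 4-way amd64 layout, not the dispatcher code, and it ignores load, priorities and the lpl_t bookkeeping. Only lgroup 5 = {1, 2, 3} comes from this post. The other parents (6, 7 and 8) and which leaves are one hop apart are assumed from a square topology in which lgroups 1 and 4 (and 2 and 3) are two hops apart.&lt;pre&gt;
```python
# Toy model of the 4-way amd64 lgroup hierarchy (assumed topology).
# Leaf lgroups 1..4 each hold one CPU; lgroups 5..8 hold a leaf plus the
# leaves one hop away from it; lgroup 0 (the root) holds everything.
PARENT = {1: 5, 2: 6, 3: 7, 4: 8, 5: 0, 6: 0, 7: 0, 8: 0}
LEAVES = {
    1: {1}, 2: {2}, 3: {3}, 4: {4},
    5: {1, 2, 3}, 6: {1, 2, 4}, 7: {1, 3, 4}, 8: {2, 3, 4},
    0: {1, 2, 3, 4},
}
CPUS_OF_LEAF = {1: [0], 2: [1], 3: [2], 4: [3]}   # lgroup 1 holds CPU 0, etc.


def cpus_in(lgrp):
    """CPUs in an lgroup's resource set."""
    return sorted(cpu for leaf in LEAVES[lgrp] for cpu in CPUS_OF_LEAF[leaf])


def pick_cpu(home, can_run):
    """Walk from the home lgroup toward the root (like following lpl_parent)
    and return (cpu, lgroup) for the first CPU where can_run(cpu) is true."""
    lgrp = home
    while True:
        for cpu in cpus_in(lgrp):
            if can_run(cpu):
                return cpu, lgrp
        if lgrp == 0:
            return None, None
        lgrp = PARENT[lgrp]


print(pick_cpu(1, lambda cpu: True))        # (0, 1): the home lgroup wins
print(pick_cpu(1, lambda cpu: cpu == 2))    # (2, 5): one hop, via lgroup 5
print(pick_cpu(1, lambda cpu: cpu == 3))    # (3, 0): two hops, via the root
```
&lt;/pre&gt;Following a single parent chain is what keeps the search local. A thread homed to lgroup 1 only reaches the two-hop CPU once the one-hop set has been exhausted.&lt;br /&gt;&lt;br /&gt;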
I thought it was worth blogging about since it's a fairly complex design.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/3121096342469843106-1807412175550375296?l=mishrasdiary.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://mishrasdiary.blogspot.com/feeds/1807412175550375296/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=3121096342469843106&amp;postID=1807412175550375296' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/3121096342469843106/posts/default/1807412175550375296'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/3121096342469843106/posts/default/1807412175550375296'/><link rel='alternate' type='text/html' href='http://mishrasdiary.blogspot.com/2009/03/latency-group-lgroup-in-solaris-on-numa.html' title='Latency group (lgroup) in Solaris on NUMA aware machines'/><author><name>Saurabh Mishra</name><uri>http://www.blogger.com/profile/02167304494816941944</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='30' height='32' src='http://2.bp.blogspot.com/_6UgirKk5E5c/SX6dSzcCqUI/AAAAAAAAN2U/hjGoSLVtFQ4/S220/saurabh1.jpg'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-3121096342469843106.post-5404766346168960259</id><published>2009-03-20T14:51:00.000-07:00</published><updated>2009-08-24T14:51:46.346-07:00</updated><title type='text'>Solaris APIC implementation with respect to MSI/MSI-x interrupts</title><content type='html'>Here's some basic information on the APIC before we dive into the Solaris details. For more detail on the APIC, you can refer to this &lt;a title="Intel APIC" href="http://en.wikipedia.org/wiki/Intel_APIC_Architecture"&gt;Wiki&lt;/a&gt;.  
The Solaris details are based on Solaris Nevada build 84.&lt;br /&gt;&lt;h4&gt;&lt;u&gt;What's Local APIC&lt;/u&gt; &lt;/h4&gt;&lt;p&gt;The Local APIC (LAPIC) is part of the CPU chip. It (a) generates and accepts interrupts, (b) contains a timer, (c) manages all external interrupts for the processor, and (d) sends and receives inter-processor interrupts (IPIs). &lt;/p&gt;&lt;h4&gt;&lt;u&gt;What's IOAPIC&lt;/u&gt; &lt;/h4&gt;&lt;p&gt;The I/O APIC is a separate chip connected to the local APICs, so that it can forward external interrupts to the local APIC of the appropriate CPU. &lt;br /&gt;&lt;/p&gt;&lt;h4&gt;&lt;u&gt;What's Local APIC Table&lt;/u&gt; &lt;/h4&gt;&lt;p&gt;Interrupt vectors in the APIC are numbered 0x00 through 0xFF, and 0x00...0x1F are reserved for exceptions. Vectors in the range 0x20...0xFF are available for programming interrupts in the APIC. The local APIC derives an interrupt's priority from its vector number. It uses the upper 4 bits of the vector as the priority and ignores the lower 4 bits. For example, if the vector number is 0x3F, the priority is 0x3. In Solaris, the priority mask is APIC_IPL_MASK (0xF0) and the vector mask is APIC_VECTOR_MASK (0x0F).  &lt;/p&gt;&lt;p&gt;Since the range 0x00...0x1F can't be used, Solaris defines APIC_BASE_VECT (0x20) as the base vector and APIC_MAX_VECTOR (0xFF) as the highest vector number in the local APIC. APIC_AVAIL_VECTOR is calculated with this formula:&lt;/p&gt;&lt;p&gt;APIC_MAX_VECTOR + 1 - APIC_BASE_VECT, which translates to (0xFF + 1 - 0x20), or 224 vectors in decimal.  &lt;/p&gt;&lt;p&gt;Note that the 256 vectors are grouped into 16 priority groups of 0x10 vectors each, and 14 of those groups fall in the usable range 0x20...0xFF. 
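&lt;/p&gt;&lt;p&gt;The arithmetic above can be checked with a few lines of Python. This minimal sketch reuses the Solaris macro names as plain constants. It takes the priority class to be the upper 4 bits of the vector, computed as vector // 16, which is the same as masking with APIC_IPL_MASK and shifting right by 4.&lt;/p&gt;&lt;pre&gt;
```python
APIC_BASE_VECT = 0x20        # first vector usable for interrupts
APIC_MAX_VECTOR = 0xFF       # highest vector number
APIC_VECTOR_PER_IPL = 0x10   # vectors per priority group

APIC_AVAIL_VECTOR = APIC_MAX_VECTOR + 1 - APIC_BASE_VECT


def priority_class(vector):
    """Upper 4 bits of the vector: the APIC priority of that vector."""
    return vector // 16


def slot_in_class(vector):
    """Lower 4 bits (APIC_VECTOR_MASK): ignored for priority purposes."""
    return vector % 16


def class_members(cls):
    """All 16 vectors that share priority class cls."""
    return list(range(cls * 16, cls * 16 + 16))


usable_classes = sorted({priority_class(v)
                         for v in range(APIC_BASE_VECT, APIC_MAX_VECTOR + 1)})

print(APIC_AVAIL_VECTOR)                          # 224
print(hex(priority_class(0x3F)))                  # 0x3
print(len(usable_classes), usable_classes[0])     # 14 2
print(APIC_AVAIL_VECTOR // APIC_VECTOR_PER_IPL)   # 14
```
&lt;/pre&gt;&lt;p&gt;The 14 usable classes (2 through 15) explain the size of apic_vectortoipl[], which Solaris declares with APIC_AVAIL_VECTOR / APIC_VECTOR_PER_IPL = 14 entries.&lt;/p&gt;&lt;p&gt;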
All 16 vectors in a group share the same priority.&lt;br /&gt;&lt;/p&gt;&lt;h4&gt;&lt;u&gt;APIC Data Structures in Solaris&lt;/u&gt;&lt;/h4&gt;&lt;p&gt;Here is the big picture of how the various APIC data structures relate to each other. These data structures are described below:&lt;br /&gt;&lt;/p&gt;&lt;p&gt;&lt;br /&gt;&lt;img src="http://blogs.sun.com/saurabh_mishra/resource/apic.jpg" /&gt; &lt;br /&gt;&lt;br /&gt;   &lt;a href="http://src.opensolaris.org/source/xref/onnv/onnv-gate/usr/src/uts/i86pc/io/mp_platform_common.c"&gt;&lt;b&gt;apic_irq_table[]&lt;/b&gt;&lt;/a&gt; - Holds all IRQ entries. Each entry is a pointer to an apic_irq_t, and the table has APIC_MAX_VECTOR + 1 entries. Note that for MSI/MSI-x an IRQ number has no hardware meaning; it is only a software index into this table.&lt;/p&gt;&lt;p&gt;A typical apic_irq_t entry in apic_irq_table[] looks like this:&lt;/p&gt;&lt;p&gt; &gt; ::interrupts&lt;br /&gt;IRQ  Vect IPL Bus    Trg Type   CPU Share APIC/INT# ISR(s)&lt;br /&gt;22   0x61 6   PCI    Lvl Fixed  1   2     0x0/0x16  bge_intr, ata_intr&lt;br /&gt;&lt;/p&gt;&lt;p&gt;&gt; apic_irq_table+(0t22*8)/J&lt;br /&gt;apic_irq_table+0xb0:            fffffffec10d7f38 &lt;/p&gt;&lt;p&gt;&gt; fffffffec10d7f38::print apic_irq_t&lt;br /&gt;{&lt;br /&gt;    airq_mps_intr_index = 0xfffd&lt;br /&gt;    airq_intin_no = 0x16                 // set because it's a FIXED type interrupt&lt;br /&gt;    airq_ioapicindex = 0&lt;br /&gt;    airq_dip = 0xfffffffec01fd9c0    // dev info&lt;br /&gt;    airq_major = 0xca&lt;br /&gt;    airq_rdt_entry = 0xa061&lt;br /&gt;    airq_cpu = 0x1&lt;br /&gt;    airq_temp_cpu = 0x1&lt;br /&gt;    airq_vector = 0x61    // matches the ::interrupts output&lt;br /&gt;    airq_share = 0x2       // two interrupts share the same IRQ and vector&lt;br /&gt;    airq_share_id = 0&lt;br /&gt;    airq_ipl = 0x6         // IPL&lt;br /&gt;    airq_iflag = {&lt;br /&gt;        intr_po = 0x3&lt;br /&gt;        intr_el = 0x3&lt;br /&gt;        bustype = 0xd&lt;br 
/&gt;    }&lt;br /&gt;    airq_origirq = 0xa&lt;br /&gt;    airq_busy = 0&lt;br /&gt;    airq_next = 0&lt;br /&gt;}&lt;br /&gt;&gt; 0xfffffffec01fd9c0::print 'struct dev_info' ! grep name&lt;br /&gt;    devi_binding_name = 0xfffffffec01fcf88 "pci-ide"&lt;br /&gt;    devi_node_name = 0xfffffffec01fcf88 "pci-ide"&lt;br /&gt;    devi_compat_names = 0xfffffffec0206940 "pci1002,4379.1025.10a.80"&lt;br /&gt;    devi_rebinding_name = 0&lt;br /&gt;&gt;&lt;br /&gt;&lt;br /&gt;&lt;/p&gt;&lt;p&gt;&lt;b&gt;&lt;a href="http://src.opensolaris.org/source/xref/onnv/onnv-gate/usr/src/uts/i86pc/io/pcplusmp/apic.c"&gt;apic_ipltopri[]&lt;/a&gt; -  &lt;/b&gt;This array maps a Solaris IPL to an APIC priority, i.e. the value written to the task priority register. For example:&lt;br /&gt;&lt;/p&gt;&lt;p&gt;&gt; apic_ipltopri::print&lt;br /&gt;[ 0x10, 0x20, 0x20, 0x20, 0x30, 0x50, 0x70, 0x80, 0x80, 0x80, 0x90, 0xa0, 0xb0, 0xc0, 0xd0,&lt;br /&gt;0xf0, 0 ]&lt;br /&gt;&gt;&lt;br /&gt;&lt;/p&gt;&lt;p&gt;Note the order of the priority assignment: higher IPLs get higher vector numbers. Also note that 0x20 appears at indices 1, 2 and 3, which means that IPLs 1, 2 and 3 share the same vector range 0x20...0x2F.&lt;br /&gt;&lt;/p&gt;&lt;p&gt;apic_ipltopri[] is declared as:&lt;/p&gt;&lt;p&gt;uchar_t apic_ipltopri[MAXIPL + 1];      /* unix ipl to apic pri */&lt;br /&gt;&lt;/p&gt;&lt;p&gt;&lt;b&gt;&lt;a href="http://src.opensolaris.org/source/xref/onnv/onnv-gate/usr/src/uts/i86pc/io/pcplusmp/apic.c"&gt;apic_vectortoipl[]&lt;/a&gt; -&lt;/b&gt; This array is a bit complex. 
Its main purpose is to initialize the apic_ipltopri[] array.&lt;br /&gt;&lt;/p&gt;&lt;p&gt;apic_init()&lt;br /&gt;{&lt;br /&gt;        [.]&lt;br /&gt;        apic_ipltopri[0] = APIC_VECTOR_PER_IPL; /* leave 0 for idle */&lt;br /&gt;        for (i = 0; i &lt; (APIC_AVAIL_VECTOR / APIC_VECTOR_PER_IPL); i++) {&lt;br /&gt;                if ((i &lt; ((APIC_AVAIL_VECTOR / APIC_VECTOR_PER_IPL) - 1)) &amp;amp;&amp;amp;&lt;br /&gt;                    (apic_vectortoipl[i + 1] == apic_vectortoipl[i]))&lt;br /&gt;                        /* get to highest vector at the same ipl */&lt;br /&gt;                        continue;&lt;br /&gt;                for (; j &lt;= apic_vectortoipl[i]; j++) {&lt;br /&gt;                        apic_ipltopri[j] = (i &lt;&lt; APIC_IPL_SHIFT) +&lt;br /&gt;                            APIC_BASE_VECT;&lt;br /&gt;                }&lt;br /&gt;        }&lt;/p&gt;&lt;p&gt;        [.]&lt;/p&gt;&lt;p&gt;}&lt;br /&gt;&lt;/p&gt;&lt;p&gt;uchar_t apic_vectortoipl[APIC_AVAIL_VECTOR / APIC_VECTOR_PER_IPL] = {&lt;br /&gt;        3, 4, 5, 5, 6, 6, 9, 10, 11, 12, 13, 14, 15, 15&lt;br /&gt;};&lt;/p&gt;&lt;p&gt;Note that IPL 5 uses the vector range 0x40...0x5F. That becomes 0x20...0x3F once APIC_BASE_VECT is subtracted, an optimization explained below. This is why indices 2 and 3 of apic_vectortoipl[] both hold IPL 5. Similarly, indices 4 and 5 hold IPL 6, whose range is 0x60...0x7F (0x40...0x5F after the subtraction).&lt;br /&gt;&lt;/p&gt;&lt;p&gt; *      IPL             Vector range.           as passed to intr_enter&lt;br /&gt; *      0               none.&lt;br /&gt; *      1,2,3           0x20-0x2f               0x0-0xf&lt;br /&gt; *      4               0x30-0x3f               0x10-0x1f&lt;br /&gt; *      5               0x40-0x5f               0x20-0x3f&lt;br /&gt; *      6               0x60-0x7f               0x40-0x5f&lt;br /&gt; *      7,8,9           0x80-0x8f               0x60-0x6f&lt;br /&gt; *      10              0x90-0x9f               0x70-0x7f&lt;br /&gt; *      11              0xa0-0xaf               0x80-0x8f&lt;br /&gt; *      ...             
...&lt;br /&gt; *      15              0xe0-0xef               0xc0-0xcf&lt;br /&gt; *      15              0xf0-0xff               0xd0-0xdf&lt;br /&gt; */&lt;br /&gt;&lt;br /&gt;&lt;/p&gt;&lt;p&gt;&lt;b&gt;&lt;a href="http://src.opensolaris.org/source/xref/onnv/onnv-gate/usr/src/uts/i86pc/io/mp_platform_common.c"&gt;apic_vector_to_irq[]&lt;/a&gt; - &lt;/b&gt;This array maps a vector number to its IRQ number. If an element of this array contains APIC_RESV_IRQ (0xFE), the vector is free and can be allocated. The apic_navail_vector() function checks this array to work out how many vectors are available.&lt;/p&gt;&lt;p&gt;Here is an example of how an IPL is mapped to a vector priority in Solaris:&lt;/p&gt;&lt;p&gt;Let's say we get a network interrupt (from ath, a WiFi driver) at IPL 6 with vector number 0x60. Solaris now blocks all interrupts at and below IPL 6, which is done by apic_intr_enter(). The caller passes this function the vector with 0x20 (APIC_BASE_VECT) already subtracted, as an optimization. The apic_ipls[] array gives the IPL that will be programmed into the APIC, so we first get nipl:&lt;br /&gt;&lt;/p&gt;&lt;p&gt;         nipl = apic_ipls[vector];      // vector is 0x40 here, not 0x60, and nipl will be 0x6&lt;br /&gt;        *vectorp = irq = apic_vector_to_irq[vector + APIC_BASE_VECT];      // add the base back to get the actual vector and its irq&lt;br /&gt;&lt;br /&gt;Then this statement blocks all interrupts at and below that priority (IPL):&lt;br /&gt;&lt;/p&gt;&lt;p&gt;        apicadr[APIC_TASK_REG] = apic_ipltopri[nipl];&lt;br /&gt;&lt;/p&gt;&lt;p&gt;So we write 0x70 (apic_ipltopri[6]) to the APIC task priority register. Recall that Solaris uses the range 0x60...0x7F for IPL 6:&lt;/p&gt;&lt;p&gt;*      IPL          Vector range.           
as passed to apic_intr_enter()&lt;br /&gt;*      6               0x60-0x7f               0x40-0x5f&lt;/p&gt;&lt;p&gt;It does not matter whether you write 0x70 or 0x7F. The APIC uses only the upper 4 bits of the task priority register as the priority, so both values block interrupts at IPL 6 and below.&lt;br /&gt;&lt;/p&gt;&lt;h4&gt;&lt;u&gt;Solaris x86 Interrupt Handling&lt;/u&gt; &lt;/h4&gt;&lt;p&gt;Now that we have glimpsed the data structures involved, let's look at how Solaris x86 handles interrupts. I describe interrupt handling before interrupt allocation because I find handling easier to understand. &lt;/p&gt;&lt;p&gt;Let's first go through how Solaris x86 is organized in terms of PSM (Platform Specific Module) ops. APIC-based systems use the pcplusmp module, whose psm_ops is &lt;a href="http://src.opensolaris.org/source/xref/onnv/onnv-gate/usr/src/uts/i86pc/io/pcplusmp/apic.c#204"&gt;&lt;b&gt;&lt;i&gt;apic_ops&lt;/i&gt;&lt;/b&gt;&lt;/a&gt;, while uniprocessor systems with the legacy PIC use &lt;a href="http://src.opensolaris.org/source/xref/onnv/onnv-gate/usr/src/uts/i86pc/io/psm/uppc.c#133"&gt;&lt;b&gt;&lt;i&gt;uppc_ops&lt;/i&gt;.&lt;/b&gt;&lt;/a&gt; In fact xVM (the Xen-based hypervisor) has its own psm_ops called &lt;a href="http://src.opensolaris.org/source/xref/onnv/onnv-gate/usr/src/uts/i86xpv/io/psm/xpv_psm.c"&gt;&lt;b&gt;&lt;i&gt;xen_psm_ops.&lt;/i&gt;&lt;/b&gt;&lt;/a&gt; psm_install() is responsible for installing the PSM in the Solaris x86 world.&lt;br /&gt;&lt;/p&gt;&lt;p&gt;apic_probe_common() is called when psm_install() invokes the psm_probe() entry point of each psm_ops. apic_probe_common() does many things. One of them is mapping 'apicadr[]', which we saw earlier when setting the APIC task priority register. The apic_cpus[] array is initialized from ACPI (by acpi_probe()), because the ACPI tables contain all the needed information, such as each CPU's local APIC ID and version.&lt;/p&gt;&lt;p&gt;Now let's see what happens when the local APIC delivers an interrupt. 
The interrupt could come from the IOAPIC or be an MSI/MSI-x interrupt (an in-band message). Solaris calls &lt;a href="http://src.opensolaris.org/source/xref/onnv/onnv-gate/usr/src/uts/i86pc/ml/interrupt.s#75"&gt;cmnint()&lt;/a&gt; or &lt;a href="http://src.opensolaris.org/source/xref/onnv/onnv-gate/usr/src/uts/i86pc/ml/interrupt.s#75"&gt;_interrupt().&lt;/a&gt; These are the same entry point, and they call &lt;a href="http://src.opensolaris.org/source/xref/onnv/onnv-gate/usr/src/uts/i86pc/os/intr.c#917"&gt;do_interrupt()&lt;/a&gt; once the register state has been saved. do_interrupt() first raises the PIL so that the CPU does not take any interrupt at or below that PIL. The CPU priority is raised through the setlvl function pointer, which is set to the psm_intr_enter entry of the active psm_ops. In our case it is apic_intr_enter(). Then comes the dispatch, which is done by calling switch_sp_and_call() once the new stack is set up. Recall that Solaris handles interrupts in thread context if the PIL is at or below LOCK_LEVEL (0xa). High-level interrupts (0xb...0xf) do not get an interrupt thread; they run on the CPU's interrupt stack.&lt;/p&gt;&lt;p&gt;&lt;a href="http://src.opensolaris.org/source/xref/onnv/onnv-gate/usr/src/uts/intel/ia32/ml/i86_subr.s#3920"&gt;switch_sp_and_call()&lt;/a&gt; is used to dispatch three types of interrupts: (a) software interrupts, (b) high-level interrupts and (c) normal device interrupts.&lt;/p&gt;&lt;p&gt;The WiFi interrupt in our example falls under (c), which maps to the &lt;a href="http://src.opensolaris.org/source/xref/onnv/onnv-gate/usr/src/uts/i86pc/os/intr.c#871"&gt;dispatch_hardint()&lt;/a&gt; routine. &lt;a href="http://src.opensolaris.org/source/xref/onnv/onnv-gate/usr/src/uts/i86pc/os/intr.c#871"&gt;dispatch_hardint()&lt;/a&gt; calls &lt;a href="http://src.opensolaris.org/source/xref/onnv/onnv-gate/usr/src/uts/common/io/avintr.c#652"&gt;av_dispatch_autovect()&lt;/a&gt; after enabling interrupts. 
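&lt;/p&gt;&lt;p&gt;The priority raise and the dispatch decision can be modeled in Python. This is a hypothetical sketch, not the kernel code, and it makes three simplifications:&lt;/p&gt;&lt;ul&gt;&lt;li&gt;It rebuilds apic_ipltopri[] from apic_vectortoipl[] with the apic_init() loop quoted earlier. It assumes j starts at 1 and APIC_IPL_SHIFT is 4, because the excerpt shows neither.&lt;/li&gt;&lt;li&gt;It derives the IPL from the vector's 16-vector group instead of looking it up in apic_ipls[].&lt;/li&gt;&lt;li&gt;It uses LOCK_LEVEL = 10 to choose between an interrupt thread and the CPU's high-level interrupt stack.&lt;/li&gt;&lt;/ul&gt;&lt;pre&gt;
```python
APIC_BASE_VECT = 0x20
APIC_VECTOR_PER_IPL = 0x10
LOCK_LEVEL = 10
MAXIPL = 15

# apic_vectortoipl[] as quoted in this post.
apic_vectortoipl = [3, 4, 5, 5, 6, 6, 9, 10, 11, 12, 13, 14, 15, 15]


def build_ipltopri(vectortoipl):
    """Transcription of the apic_init() loop (j assumed to start at 1)."""
    groups = len(vectortoipl)            # APIC_AVAIL_VECTOR / APIC_VECTOR_PER_IPL
    ipltopri = [0] * (MAXIPL + 1)
    ipltopri[0] = APIC_VECTOR_PER_IPL    # leave 0 for idle
    j = 1
    for i in range(groups):
        if i != groups - 1 and vectortoipl[i + 1] == vectortoipl[i]:
            continue                     # get to highest vector at the same ipl
        while vectortoipl[i] >= j:
            ipltopri[j] = i * 16 + APIC_BASE_VECT   # (i shifted by APIC_IPL_SHIFT) + base
            j += 1
    return ipltopri


apic_ipltopri = build_ipltopri(apic_vectortoipl)


def take_interrupt(vector):
    """Return (ipl, task priority value, how the handler would be run)."""
    ipl = apic_vectortoipl[(vector - APIC_BASE_VECT) // APIC_VECTOR_PER_IPL]
    tpr = apic_ipltopri[ipl]             # what apic_intr_enter() writes to APIC_TASK_REG
    if ipl > LOCK_LEVEL:
        how = 'high-level: CPU interrupt stack'
    else:
        how = 'interrupt thread: dispatch_hardint()'
    return ipl, hex(tpr), how


print([hex(p) for p in apic_ipltopri])   # same as the first 16 values printed by mdb
print(take_interrupt(0x60))   # (6, '0x70', 'interrupt thread: dispatch_hardint()')
print(take_interrupt(0xd0))   # (14, '0xd0', 'high-level: CPU interrupt stack')
```
&lt;/pre&gt;&lt;p&gt;IPL 10 (LOCK_LEVEL) is still handled by an interrupt thread; only IPLs 11 to 15 take the high-level path.&lt;/p&gt;&lt;p&gt;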
Now that we have reached the &lt;a href="http://src.opensolaris.org/source/xref/onnv/onnv-gate/usr/src/uts/common/io/avintr.c#652"&gt;av_dispatch_autovect()&lt;/a&gt; routine, I must explain the &lt;a href="http://src.opensolaris.org/source/xref/onnv/onnv-gate/usr/src/uts/common/io/avintr.c#88"&gt;&lt;b&gt;autovect[]&lt;/b&gt;&lt;/a&gt; array. If you are familiar with add_avintr(), which registers a hardware interrupt handler, you can skip this part. &lt;i&gt;&lt;b&gt;&lt;a href="http://src.opensolaris.org/source/xref/onnv/onnv-gate/usr/src/uts/common/io/avintr.c#88"&gt;autovect[]&lt;/a&gt; &lt;/b&gt;&lt;/i&gt;has MAX_VECT (256) elements, each of type 'struct av_head'. The first field of 'struct av_head' points to a 'struct autovec'. That structure holds all the information about an interrupt handler: the handler itself, the arguments passed to it, its priority level and so on. More than one interrupt handler can share the same IRQ (and vector), and such handlers are linked through 'av_link' in 'struct autovec'. 
For example:&lt;/p&gt;&lt;p&gt;&gt; ::interrupts&lt;br /&gt;IRQ  Vect IPL Bus    Trg Type   CPU Share APIC/INT# ISR(s)&lt;br /&gt;22   0x61 6   PCI    Lvl Fixed  1   2     0x0/0x16  bge_intr, ata_intr&lt;/p&gt;&lt;p&gt;&gt; ::sizeof 'struct av_head'&lt;br /&gt;sizeof (struct av_head) = 0x10&lt;br /&gt;&lt;/p&gt;&lt;p&gt;&gt; autovect+(0x10*0t22)=J                 // index autovect[] with the IRQ&lt;br /&gt;                fffffffffbc52ba0&lt;br /&gt;&lt;/p&gt;&lt;p&gt;&gt; fffffffffbc52ba0::print 'struct av_head'&lt;br /&gt;{&lt;br /&gt;    avh_link = 0xfffffffec50d2cc0&lt;br /&gt;    avh_hi_pri = 0x6        // see bge_intr() and its priority below&lt;br /&gt;    avh_lo_pri = 0x5        // see ata_intr() and its priority below&lt;br /&gt;}&lt;br /&gt;&lt;/p&gt;&lt;p&gt;&gt; 0xfffffffec50d2cc0::print 'struct autovec'&lt;br /&gt;{&lt;br /&gt;    av_link = 0xfffffffec10d2f40&lt;br /&gt;    av_vector = bge_intr&lt;br /&gt;    av_intarg1 = 0xfffffffec50d5000&lt;br /&gt;    av_intarg2 = 0&lt;br /&gt;    av_ticksp = 0xfffffffec506ae20&lt;br /&gt;    av_prilevel = 0x6&lt;br /&gt;    av_intr_id = 0xfffffffec537a078&lt;br /&gt;    av_dip = 0xfffffffec01f8400&lt;br /&gt;}&lt;br /&gt;&lt;/p&gt;&lt;p&gt;&gt; 0xfffffffec10d2f40::print 'struct autovec'&lt;br /&gt;{&lt;br /&gt;    av_link = 0&lt;br /&gt;    av_vector = ata_intr&lt;br /&gt;    av_intarg1 = 0xfffffffec00bc8c0&lt;br /&gt;    av_intarg2 = 0&lt;br /&gt;    av_ticksp = 0xfffffffec0528898&lt;br /&gt;    av_prilevel = 0x5&lt;br /&gt;    av_intr_id = 0xfffffffec10cbe78&lt;br /&gt;    av_dip = 0xfffffffec01fd9c0&lt;br /&gt;}&lt;br /&gt;&gt;&lt;br /&gt;&lt;br /&gt;Here is DTrace output for the ath interrupt we have been discussing:&lt;br /&gt;&lt;br /&gt;bash-3.00# dtrace -n av_dispatch_autovect:entry'/`autovect[args[0]].avh_link-&gt;av_vector/{@[args[0]]=count(); printf("%a, %x", `autovect[args[0]].avh_link-&gt;av_vector, args[0])}'&lt;br /&gt;&lt;/p&gt;&lt;p&gt;  1   2391       
av_dispatch_autovect:entry ath`ath_intr, 13&lt;br /&gt;  1   2391       av_dispatch_autovect:entry ath`ath_intr, 13&lt;br /&gt;  1   2391       av_dispatch_autovect:entry ath`ath_intr, 13&lt;br /&gt;  1   2391       av_dispatch_autovect:entry ath`ath_intr, 13&lt;br /&gt; &lt;/p&gt;&lt;p&gt;Anish has a very interesting blog entry on APIC and Solaris x86 interrupt handling at this &lt;a title="Anish's entry on Solaris x86 interrupt handling" href="http://blogs.sun.com/anish/date/20050614"&gt;link&lt;/a&gt;.&lt;br /&gt; &lt;/p&gt;&lt;h4&gt;&lt;u&gt;How the Solaris APIC implementation allocates interrupts&lt;/u&gt; &lt;/h4&gt;&lt;p&gt;Now that we have looked at how the APIC is structured in Solaris x86 and how interrupts are handled, let's look at how interrupts are allocated. There are three types of interrupts, listed in the order in which they evolved: DDI_INTR_TYPE_FIXED, DDI_INTR_TYPE_MSI and DDI_INTR_TYPE_MSIX. The DDI routine ddi_intr_get_supported_types(9F) returns the interrupt types supported by the device.&lt;br /&gt;&lt;/p&gt;&lt;p&gt;For MSI, apic_alloc_msi_vectors() is called to allocate the requested number of interrupt vectors; for MSI-x, apic_alloc_msix_vectors() is called. Note that MSI supports up to 32 vectors per device function and MSI-x up to 2048. However, Solaris x86 only supports 2 MSI-x interrupt vectors per device, which is why I started studying APIC and MSI-x. On SPARC, Solaris supports more MSI-x interrupts, configured through the #msix-request property in DDI. 
This hard limit is determined by the i_ddi_get_msix_alloc_limit() function; however, even on SPARC it seems we limit it to 8.&lt;br /&gt;&lt;/p&gt;&lt;p&gt;msix_alloc_limit = MAX(DDI_MAX_MSIX_ALLOC, ddi_msix_alloc_limit);&lt;/p&gt;&lt;p&gt;/* Default number of MSI-X resources to allocate */&lt;br /&gt;#define DDI_DEFAULT_MSIX_ALLOC  2&lt;br /&gt;&lt;br /&gt;/* Maximum number of MSI-X resources to allocate */&lt;br /&gt;#define DDI_MAX_MSIX_ALLOC      8&lt;br /&gt;&lt;/p&gt;&lt;p&gt;These limits will change when the Interrupt Resource Management (IRM) framework is integrated into Solaris.&lt;/p&gt;&lt;p&gt;Anyway, let's get back to the topic. Depending on the interrupt type and the bus's interrupt ops, Solaris jumps to the corresponding interrupt ops routine. In our case, ddi_intr_alloc(9F) takes us into pci_common_intr_ops() with cmd DDI_INTROP_ALLOC to allocate the interrupts. We will not go into FIXED type interrupts, as they are hard-wired via the IOAPIC and (I suppose) fairly easy. psm_intr_ops then comes into action with cmd PSM_INTR_OP_ALLOC_VECTORS, and we land in apic_intr_ops(). 
&lt;/p&gt;&lt;p&gt;apic_intr_ops&lt;br /&gt;{&lt;br /&gt;        [.]&lt;br /&gt;        case PSM_INTR_OP_ALLOC_VECTORS:&lt;br /&gt;                if (hdlp-&gt;ih_type == DDI_INTR_TYPE_MSI)&lt;br /&gt;                        *result = apic_alloc_msi_vectors(dip, hdlp-&gt;ih_inum,&lt;br /&gt;                            hdlp-&gt;ih_scratch1, hdlp-&gt;ih_pri,&lt;br /&gt;                            (int)(uintptr_t)hdlp-&gt;ih_scratch2);&lt;br /&gt;                else&lt;br /&gt;                        *result = apic_alloc_msix_vectors(dip, hdlp-&gt;ih_inum,&lt;br /&gt;                            hdlp-&gt;ih_scratch1, hdlp-&gt;ih_pri,&lt;br /&gt;                            (int)(uintptr_t)hdlp-&gt;ih_scratch2);&lt;br /&gt;                break;&lt;br /&gt;                [.]&lt;br /&gt;}&lt;br /&gt;&lt;/p&gt;&lt;p&gt;&lt;br /&gt;&lt;a href="http://src.opensolaris.org/source/xref/onnv/onnv-gate/usr/src/uts/i86pc/io/pcplusmp/apic.c"&gt;&lt;i&gt;&lt;b&gt;apic_alloc_msi_vectors()&lt;/b&gt;&lt;/i&gt;&lt;/a&gt; - This function allocates 'count' vectors for the device. 'count' has to be a power of 2, and the priority is passed by the caller. The first thing this function does is call apic_navail_vector() to check whether enough vectors are available at that priority to satisfy the request. Next we search for a block of contiguous vectors, and the value returned by &lt;a href="http://src.opensolaris.org/source/xref/onnv/onnv-gate/usr/src/uts/i86pc/io/pcplusmp/apic_introp.c#181"&gt;apic_find_multi_vectors()&lt;/a&gt; is the starting vector. MSI needs contiguous vectors because a function granted multiple messages signals message N by modifying the low-order bits of the MSI data value. The vectors therefore have to be contiguous and aligned to the power-of-2 count.&lt;br /&gt;&lt;/p&gt;&lt;p&gt;The next step is to check whether there are enough free IRQ entries in apic_irq_table[]. This is done by the function apic_check_free_irqs().  
If enough IRQ entries are found in the table, apic_alloc_msi_vectors() proceeds to allocate each IRQ by calling &lt;a href="http://src.opensolaris.org/source/xref/onnv/onnv-gate/usr/src/uts/i86pc/io/mp_platform_common.c#2513"&gt;apic_allocate_irq().&lt;/a&gt; The IRQ number returned by this function is what is eventually used to index the autovect[] table. We will come back to autovect[] later, but for now let's see how the CPU is selected. &lt;a href="http://src.opensolaris.org/source/xref/onnv/onnv-gate/usr/src/uts/i86pc/io/mp_platform_common.c#2251"&gt;apic_bind_intr()&lt;/a&gt; selects the CPU for the first of the 'count' vectors, and the subsequent vectors are bound to the same CPU. These steps are repeated in a loop 'count' times.&lt;/p&gt;&lt;p&gt;Now that the IRQ entries in apic_irq_table[] are set up with priority, vector, target CPU and so on, we are ready to enable the interrupts. By the way, all of this is usually done in the driver's attach(9E) entry point, in two phases: (i) allocate and add the interrupts, and (ii) enable them.&lt;br /&gt;&lt;/p&gt;&lt;p&gt;&lt;a href="http://src.opensolaris.org/source/xref/onnv/onnv-gate/usr/src/uts/i86pc/io/pcplusmp/apic.c"&gt;&lt;i&gt;&lt;b&gt;apic_alloc_msix_vectors()&lt;/b&gt;&lt;/i&gt;&lt;/a&gt; - This function does similar work to the MSI case, with two differences. It allocates each vector individually, in addition to the IRQ entry in apic_irq_table[]. It also binds each interrupt to a CPU by calling &lt;a href="http://src.opensolaris.org/source/xref/onnv/onnv-gate/usr/src/uts/i86pc/io/mp_platform_common.c#2251"&gt;apic_bind_intr()&lt;/a&gt; for every one of the 'count' requests. MSI-x does not have the contiguous-vector limitation that MSI has. 
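&lt;/p&gt;&lt;p&gt;The difference between the two allocators can be sketched in Python. This is a simplified model, not apic_find_multi_vectors() or apic_allocate_vector() themselves:&lt;/p&gt;&lt;ul&gt;&lt;li&gt;The vector range for an IPL runs from apic_ipltopri[ipl - 1] + 0x10 to apic_ipltopri[ipl] + 0x0F, dropping back 0x10 when both IPLs share a priority. The apic_ipltopri[] values are the ones printed by mdb earlier.&lt;/li&gt;&lt;li&gt;The MSI search returns the first free block of 'count' vectors aligned to 'count'.&lt;/li&gt;&lt;li&gt;The reserved-vector checks done by the real code are ignored.&lt;/li&gt;&lt;/ul&gt;&lt;p&gt;The vector_to_irq list plays the role of apic_vector_to_irq[].&lt;/p&gt;&lt;pre&gt;
```python
APIC_RESV_IRQ = 0xFE          # marks a free vector in apic_vector_to_irq[]
APIC_VECTOR_PER_IPL = 0x10
APIC_VECTOR_MASK = 0x0F

# apic_ipltopri[] as printed by mdb in this post (first 16 entries).
apic_ipltopri = [0x10, 0x20, 0x20, 0x20, 0x30, 0x50, 0x70, 0x80,
                 0x80, 0x80, 0x90, 0xa0, 0xb0, 0xc0, 0xd0, 0xf0]


def vector_range(ipl):
    """Lowest and highest vector usable at this IPL."""
    highest = apic_ipltopri[ipl] + APIC_VECTOR_MASK
    lowest = apic_ipltopri[ipl - 1] + APIC_VECTOR_PER_IPL
    if lowest > highest:                 # ipl and ipl - 1 map to the same pri
        lowest -= APIC_VECTOR_PER_IPL
    return lowest, highest


def find_msi_block(vector_to_irq, ipl, count):
    """MSI: first free run of count vectors starting on a multiple of count."""
    lowest, highest = vector_range(ipl)
    start = (lowest + count - 1) // count * count    # round up to the alignment
    while highest + 1 - count >= start:
        block = range(start, start + count)
        if all(vector_to_irq[v] == APIC_RESV_IRQ for v in block):
            return start
        start += count
    return None


def find_msix_vectors(vector_to_irq, ipl, count):
    """MSI-X: any count free vectors in the range, taken one at a time."""
    lowest, highest = vector_range(ipl)
    free = [v for v in range(lowest, highest + 1)
            if vector_to_irq[v] == APIC_RESV_IRQ]
    return free[:count] if len(free) >= count else None


table = [APIC_RESV_IRQ] * 256
table[0x60] = 72                      # pretend IRQ 72 already owns vector 0x60

print([hex(v) for v in vector_range(6)])                  # ['0x60', '0x7f']
print(hex(find_msi_block(table, 6, 4)))                   # 0x64
print([hex(v) for v in find_msix_vectors(table, 6, 2)])   # ['0x61', '0x62']
```
&lt;/pre&gt;&lt;p&gt;With vector 0x60 taken, a 4-vector MSI block has to start at 0x64 even though 0x61 to 0x63 are free. MSI-X can simply use 0x61 and 0x62.&lt;/p&gt;&lt;p&gt;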
Vector allocation is done by the routine &lt;a href="http://src.opensolaris.org/source/xref/onnv/onnv-gate/usr/src/uts/i86pc/io/pcplusmp/apic.c#2095"&gt;apic_allocate_vector()&lt;/a&gt;, which returns a free vector by walking the apic_vector_to_irq[] table and looking for an APIC_RESV_IRQ slot. The search range is determined by the priority passed to it. For example, if the priority passed is 6, the range is computed as:&lt;/p&gt;&lt;p&gt;        highest = apic_ipltopri[ipl] + APIC_VECTOR_MASK;&lt;br /&gt;        lowest = apic_ipltopri[ipl - 1] + APIC_VECTOR_PER_IPL;&lt;br /&gt;&lt;br /&gt;        if (highest &lt; lowest) /* Both ipl and ipl - 1 map to same pri */&lt;br /&gt;                lowest -= APIC_VECTOR_PER_IPL;&lt;/p&gt;&lt;p&gt;highest is 0x7f (0x70 + 0x0f) and lowest is 0x60 (0x50 + 0x10), which matches the IPL 6 range we saw at the beginning of this post. &lt;/p&gt;&lt;p&gt;A typical flow of this dance looks like this:&lt;br /&gt;&lt;/p&gt;&lt;p&gt;  1  22557    apic_alloc_msix_vectors:entry name pciex8086,10a7, inum :  0, count : 2, pri :6&lt;br /&gt;              pcplusmp`apic_intr_ops+0x114&lt;br /&gt;              npe`pci_common_intr_ops+0x8f1&lt;br /&gt;              npe`npe_intr_ops+0x21&lt;br /&gt;              unix`i_ddi_intr_ops+0x54&lt;br /&gt;              unix`i_ddi_intr_ops+0x54&lt;br /&gt;              genunix`ddi_intr_alloc+0x263&lt;br /&gt;              igb`igb_alloc_intrs_msix+0x134&lt;br /&gt;              igb`igb_alloc_intrs+0x64&lt;br /&gt;              igb`igb_attach+0xcb&lt;br /&gt;              genunix`devi_attach+0x87&lt;br /&gt;&lt;br /&gt;  1  22485         apic_navail_vector:entry name : pciex8086,10a7, pri 6&lt;br /&gt;  1  22486        apic_navail_vector:return                31&lt;br /&gt;  1  22547          apic_allocate_irq:entry        72&lt;br /&gt;  1  22419         apic_find_free_irq:entry start :72, end : 253&lt;br /&gt;  1  22417          apic_find_io_intr:entry        72&lt;br /&gt;  1  22548         apic_allocate_irq:return               
 72&lt;br /&gt;  1  22479       apic_allocate_vector:entry ipl : 6, irq: 72, pri: 1&lt;br /&gt;  1  22480      apic_allocate_vector:return                96&lt;br /&gt;  1  22473             apic_bind_intr:entry name : pciex8086,10a7, irq  72&lt;br /&gt;  1  22474            apic_bind_intr:return                 0 &lt;/p&gt;&lt;p&gt;Now let's talk about how a driver enables interrupts once they are allocated. Interrupts can be enabled as a block (several at once, with ddi_intr_block_enable(9F)) or one at a time by calling ddi_intr_enable(9F) for each interrupt. Here we discuss ddi_intr_enable(9F). Once again we end up in pci_common_intr_ops(), which calls pci_enable_intr(). pci_enable_intr() mainly does two things:&lt;/p&gt;&lt;p&gt;-  Translate the interrupt if needed. This is done by &lt;a href="http://src.opensolaris.org/source/xref/onnv/onnv-gate/usr/src/uts/i86pc/io/mp_platform_common.c#1693"&gt;apic_introp_xlate().&lt;/a&gt; If the interrupt is MSI or MSI-x and its IRQ entry in apic_irq_table[] is not set up yet, we call &lt;a href="http://src.opensolaris.org/source/xref/onnv/onnv-gate/usr/src/uts/i86pc/io/mp_platform_common.c#2087"&gt;apic_setup_irq_table()&lt;/a&gt;. In our example this has already been done, so apic_introp_xlate() just returns the IRQ number from 'apic_vector_to_irq[airqp-&gt;airq_vector]'. Here airqp is the apic_irq_table[] entry found by &lt;a href="http://src.opensolaris.org/source/xref/onnv/onnv-gate/usr/src/uts/i86pc/io/pcplusmp/apic_introp.c#228"&gt;apic_find_irq().&lt;/a&gt;&lt;br /&gt;&lt;/p&gt;&lt;p&gt;-  Add the interrupt handler by calling &lt;a href="http://src.opensolaris.org/source/xref/onnv/onnv-gate/usr/src/uts/common/io/avintr.c#224"&gt;add_avintr().&lt;/a&gt; We have already touched on this routine. It is worth pointing out that this is the point in the interrupt setup life cycle where the interrupt handler (ISR, Interrupt Service Routine) is bound to the vector. 
The main task of &lt;a href="http://src.opensolaris.org/source/xref/onnv/onnv-gate/usr/src/uts/common/io/avintr.c#224"&gt;add_avintr()&lt;/a&gt; is to insert the &lt;i&gt;'autovec'&lt;/i&gt; entry at the appropriate index by calling insert_av(). Its other, and most important, job is to program the interrupt, which is done by addspl(). addspl() is another function pointer, from the same family as setlvl, setspl, etc. In the APIC case it is apic_addspl(), which is just a wrapper around apic_addspl_common(). apic_addspl_common() takes four arguments:&lt;/p&gt;&lt;p&gt;apic_addspl_common(int irqno, int ipl, int min_ipl, int max_ipl)&lt;br /&gt;&lt;/p&gt;&lt;p&gt;We first fetch the entry in apic_irq_table[] indexed by irqno. We then check whether the vector needs to be upgraded, or, if this interrupt is shared, just check the IPL. Eventually we land in apic_setup_io_intr(), which does the main work: it calls &lt;a href="http://src.opensolaris.org/source/xref/onnv/onnv-gate/usr/src/uts/i86pc/io/mp_platform_common.c#2765"&gt;apic_rebind()&lt;/a&gt;, the routine that actually binds an interrupt to a CPU. 
For MSI/MSI-X, once &lt;a href="http://src.opensolaris.org/source/xref/onnv/onnv-gate/usr/src/uts/i86pc/io/mp_platform_common.c#2765"&gt;apic_rebind()&lt;/a&gt; has done its sanity checks, it calls &lt;a href="http://src.opensolaris.org/source/xref/onnv/onnv-gate/usr/src/uts/i86pc/io/pcplusmp/apic_introp.c#79"&gt;apic_pci_msi_enable_vector().&lt;/a&gt; The following statements build the MSI address and data values that are written to program the interrupt:&lt;/p&gt;&lt;pre&gt;        /* MSI Address */&lt;br /&gt;        msi_addr = (MSI_ADDR_HDR | (target_apic_id &lt;&lt; MSI_ADDR_DEST_SHIFT));&lt;br /&gt;        msi_addr |= ((MSI_ADDR_RH_FIXED &lt;&lt; MSI_ADDR_RH_SHIFT) |&lt;br /&gt;            (MSI_ADDR_DM_PHYSICAL &lt;&lt; MSI_ADDR_DM_SHIFT));&lt;br /&gt;&lt;br /&gt;        /* MSI Data: MSI is edge triggered according to spec */&lt;br /&gt;        msi_data = ((MSI_DATA_TM_EDGE &lt;&lt; MSI_DATA_TM_SHIFT) | vector);&lt;/pre&gt;&lt;p&gt;apic_rebind() also calls apic_pci_msi_enable_mode() to enable the interrupt once it has been programmed. I suppose that is how per-vector masking is controlled.&lt;/p&gt;Since we are touching on how an interrupt is bound to a CPU, I should also mention how Solaris selects that CPU. The routine &lt;a href="http://src.opensolaris.org/source/xref/onnv/onnv-gate/usr/src/uts/i86pc/io/mp_platform_common.c#2251"&gt;apic_bind_intr()&lt;/a&gt; is responsible for this, and the decision is based on the value of the tunable 'apic_intr_policy'. Three policies can be defined:&lt;br /&gt;(a) INTR_ROUND_ROBIN_WITH_AFFINITY: a round-robin, affinity-based policy that returns the same CPU for the same dip (device). This is the default policy.&lt;br /&gt;(b) INTR_LOWEST_PRIORITY: I can't describe it because it is not implemented.&lt;br /&gt;(c) INTR_ROUND_ROBIN: selects CPUs in round-robin fashion using the global variable 'apic_next_bind_cpu'. 
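&lt;br /&gt;&lt;br /&gt;The following small user-space sketch contrasts the two implemented policies. It is not the apic_bind_intr() source. The device index standing in for the dip, NCPUS, NDEVICES and bind_cpu() are all hypothetical, and the sketch ignores details such as CPUs that are offline or unable to take interrupts.&lt;pre&gt;
```c
#include "assert.h"

/* Stand-ins for the two implemented values of apic_intr_policy. */
enum intr_policy { INTR_ROUND_ROBIN_WITH_AFFINITY, INTR_ROUND_ROBIN };

#define	NCPUS		4	/* hypothetical number of CPUs taking interrupts */
#define	NDEVICES	8	/* hypothetical number of devices (dips) */

static int next_bind_cpu;		/* plays the role of apic_next_bind_cpu */
static int dev_cpu_plus1[NDEVICES];	/* 0: no CPU chosen yet, else CPU + 1 */

/* Hand out CPUs 0, 1, ..., NCPUS - 1, 0, 1, ... in turn. */
static int
next_cpu_round_robin(void)
{
	int cpu = next_bind_cpu;

	next_bind_cpu = (next_bind_cpu + 1) % NCPUS;
	return (cpu);
}

/*
 * Pick a CPU for an interrupt of device 'dev' (an index standing in for
 * the dip).  With affinity, every interrupt of a device goes to the CPU
 * that its first interrupt got; plain round robin moves on every time.
 */
int
bind_cpu(enum intr_policy policy, int dev)
{
	assert(dev >= 0);
	assert(NDEVICES > dev);

	if (policy == INTR_ROUND_ROBIN)
		return (next_cpu_round_robin());

	if (dev_cpu_plus1[dev] == 0)
		dev_cpu_plus1[dev] = next_cpu_round_robin() + 1;
	return (dev_cpu_plus1[dev] - 1);
}
```
&lt;/pre&gt;With affinity, repeated calls for the same device keep returning the CPU that its first interrupt received, so that device's interrupts stay on one CPU. With plain round robin, each new interrupt moves on to the next CPU, which spreads the interrupt load.&lt;br /&gt;&lt;br /&gt;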
Choosing between INTR_ROUND_ROBIN_WITH_AFFINITY and INTR_ROUND_ROBIN may not be easy, but I think the decision should depend on whether throughput or locality matters more.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/3121096342469843106-5404766346168960259?l=mishrasdiary.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://mishrasdiary.blogspot.com/feeds/5404766346168960259/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=3121096342469843106&amp;postID=5404766346168960259' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/3121096342469843106/posts/default/5404766346168960259'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/3121096342469843106/posts/default/5404766346168960259'/><link rel='alternate' type='text/html' href='http://mishrasdiary.blogspot.com/2009/03/solaris-apic-implementation-with.html' title='Solaris APIC implementation with respect to MSI/MSI-x interrupts'/><author><name>Saurabh Mishra</name><uri>http://www.blogger.com/profile/02167304494816941944</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='30' height='32' src='http://2.bp.blogspot.com/_6UgirKk5E5c/SX6dSzcCqUI/AAAAAAAAN2U/hjGoSLVtFQ4/S220/saurabh1.jpg'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-3121096342469843106.post-232753834417733027</id><published>2008-01-14T14:53:00.000-08:00</published><updated>2009-08-24T14:54:20.571-07:00</updated><title type='text'>xVM experience so far</title><content type='html'>I recently configured xVM on Solaris with both HVM (hardware-assisted virtual machine) and PV (paravirtualized) guest (domU) domains. 
I could easily install Solaris 10 Update 5 as an HVM domU, boot it, configure a network interface and assign an IP address. The plan is to build a testbed of multiple domUs running Solaris 10 and Solaris Nevada. This cuts down on machines, and sanity checks can be done quickly because I don't have to install and boot an OS every time. I can easily run functional tests, if not performance benchmarks. The performance of Solaris 10 as an HVM domain is not as good as that of Solaris Nevada as a PV domU, especially with more than one VCPU, but I guess this is being worked on. I think performance will improve drastically once we have PV (paravirtualized) drivers for Solaris 10. I'll soon experiment with installing xVM on my laptop and configuring Windows XP as an HVM domain.&lt;br /&gt;&lt;br /&gt;Here's a small demo describing my experience so far with xVM:&lt;br /&gt;&lt;br /&gt;To install the Solaris PV domU, I used this sample script:&lt;br /&gt;&lt;pre&gt;bash-3.2# cat snv.1.py&lt;br /&gt;name = 'solaris-pv'&lt;br /&gt;memory = '1024'&lt;br /&gt;vcpus = 4&lt;br /&gt;# for installation&lt;br /&gt;disk = [ 'file:/var/tmp/solarisdvd.iso,6:cdrom,r', 'phy:/dev/zvol/dsk/snv-pool/vol,0,w' ]&lt;br /&gt;on_poweroff = 'restart'&lt;br /&gt;on_reboot   = 'restart'&lt;br /&gt;on_crash    = 'preserve'&lt;/pre&gt;In 'disk', each entry starts with a prefix that specifies what kind of media it is. 'file:' is a plain file (the DVD ISO image here) and 'phy:' is a physical device (a ZFS volume here). 
After the location in each 'disk' entry, you also specify the access mode: read-only (r) or read/write (w).&lt;br /&gt;&lt;br /&gt;When you run 'xm create script.py', you will see the OS installation screen. After the installation completed, I used a similar script with the solarisdvd entry removed from 'disk':&lt;br /&gt;&lt;pre&gt;name = 'solaris-pv'&lt;br /&gt;memory = '1024'&lt;br /&gt;vcpus = 4&lt;br /&gt;disk = [ 'phy:/dev/zvol/dsk/snv-pool/vol,0,w' ]&lt;br /&gt;on_poweroff = 'destroy'&lt;br /&gt;on_reboot   = 'restart'&lt;br /&gt;on_crash    = 'preserve'&lt;br /&gt;vif = [ 'mac=0:14:4f:2:12:35, ip=10.5.63.98, bridge=nge1' ]&lt;/pre&gt;The 'vif' property specifies which network interface the guest uses. If you want to override the default NIC, you can also set the 'config/default-nic' property of the xvm/xend service. Once you have booted the guest domain, you will see the interface as rtls0. Run 'dladm show-dev' to check whether the network interface is really configured, then run ifconfig(1M) to plumb it.&lt;br /&gt;&lt;br /&gt;You can see the resources of each domain as follows:&lt;br /&gt;&lt;pre&gt;bash-3.2# xm list&lt;br /&gt;Name                                      ID   Mem VCPUs      State   Time(s)&lt;br /&gt;Domain-0                                   0  4973     4     r-----   4019.6&lt;br /&gt;S10U5HVM                                   8  2056     1     r-----     40.8&lt;br /&gt;solaris-pv                                10  1024     1     r-----      5.0&lt;/pre&gt;I also found the following links very helpful while learning how to configure a domU:&lt;br /&gt;&lt;a href="http://blogs.sun.com/cwb/date/20070719"&gt; Write-up from Chris Beal&lt;/a&gt;&lt;br /&gt;&lt;a href="http://blogs.sun.com/mbrowarski/category/virtualization+-+EN"&gt; Write-up from mbrowarski&lt;/a&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' 
src='https://blogger.googleusercontent.com/tracker/3121096342469843106-232753834417733027?l=mishrasdiary.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://mishrasdiary.blogspot.com/feeds/232753834417733027/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=3121096342469843106&amp;postID=232753834417733027' title='1 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/3121096342469843106/posts/default/232753834417733027'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/3121096342469843106/posts/default/232753834417733027'/><link rel='alternate' type='text/html' href='http://mishrasdiary.blogspot.com/2008/01/xvm-experience-so-far.html' title='xVM experience so far'/><author><name>Saurabh Mishra</name><uri>http://www.blogger.com/profile/02167304494816941944</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='30' height='32' src='http://2.bp.blogspot.com/_6UgirKk5E5c/SX6dSzcCqUI/AAAAAAAAN2U/hjGoSLVtFQ4/S220/saurabh1.jpg'/></author><thr:total>1</thr:total></entry><entry><id>tag:blogger.com,1999:blog-3121096342469843106.post-9048354048199614618</id><published>2006-10-10T14:54:00.000-07:00</published><updated>2009-08-24T14:54:53.561-07:00</updated><title type='text'>Multi-CPU Binding in Solaris</title><content type='html'>We are working on a framework that would allow processes/threads to have affinity to more than one CPU. 
The affinities fall into three categories: (a) strong affinity, (b) weak affinity and (c) negative affinity.&lt;br /&gt;&lt;br /&gt;(a) strong affinity: processes/threads may run only on the specified CPUs.&lt;br /&gt;&lt;br /&gt;(b) weak affinity: processes/threads run on their home lgroup or on the specified CPUs. If they can't run there, they may run on any CPU. The Solaris dispatcher follows the same order of preference when it chooses a CPU.&lt;br /&gt;&lt;br /&gt;(c) negative affinity: processes/threads may not run on the specified CPUs.&lt;br /&gt;&lt;br /&gt;At present, only strong/negative affinity can change a thread's home lgroup, so users on a NUMA machine need to be more cautious. These affinities are stored in a bitmask of CPUs &lt;a href="http://cvs.opensolaris.org/source/xref/on/usr/src/uts/common/sys/cpuvar.h#366"&gt;(cpuset_t)&lt;/a&gt;. When a CPU is offlined, it is removed from each thread's bitmask. If it happens to be the only CPU in the bitmask, we generate an event through the contract filesystem. Applications can then take appropriate action when their affinity is revoked, whether because of an offline or because a CPU leaves a processor set.&lt;br /&gt;&lt;br /&gt;The boundaries set by CPU partitions still apply: Multi-CPU binding does not allow processes/threads to cross partitions (processor sets).&lt;br /&gt;&lt;br /&gt;The idle thread has also been modified to look for work accordingly. A thread with strong affinity can't be stolen by a CPU that is not in its bitmask. Weak affinity threads can be stolen. 
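&lt;br /&gt;&lt;br /&gt;The following sketch summarises the per-CPU rules above. It is not the prototype's code. affinity_t, the one-flag-per-CPU array standing in for the cpuset_t bitmask, MAX_CPUS and the aff_*() functions are hypothetical, and lgroups and processor sets are left out. Only revocation of strong affinity comes from the text. Leaving weak and negative affinity in place when their last CPU goes offline is an assumption of this sketch.&lt;pre&gt;
```c
#include "assert.h"
#include "stdbool.h"
#include "stdlib.h"

#define	MAX_CPUS	1024	/* hypothetical upper bound on CPU ids */

typedef enum { AFF_NONE, AFF_STRONG, AFF_WEAK, AFF_NEGATIVE } aff_kind_t;

/* Hypothetical per-thread affinity; cpus[] stands in for a cpuset_t. */
typedef struct {
	aff_kind_t	kind;
	int		ncpus;		/* number of CPUs in the set */
	bool		cpus[MAX_CPUS];
} affinity_t;

affinity_t *
aff_create(aff_kind_t kind)
{
	affinity_t *a = calloc(1, sizeof (*a));	/* starts with an empty set */

	if (a != NULL)
		a->kind = kind;
	return (a);
}

void
aff_add_cpu(affinity_t *a, int cpu)
{
	assert(cpu >= 0);
	assert(MAX_CPUS > cpu);
	if (!a->cpus[cpu]) {
		a->cpus[cpu] = true;
		a->ncpus++;
	}
}

/*
 * May the thread run on, or be stolen by, 'cpu'?  Strong: only CPUs in
 * the set.  Negative: only CPUs outside it.  Weak or none: any CPU, since
 * a weak set only expresses a preference.
 */
bool
aff_allows(const affinity_t *a, int cpu)
{
	switch (a->kind) {
	case AFF_STRONG:
		return (a->cpus[cpu]);
	case AFF_NEGATIVE:
		return (!a->cpus[cpu]);
	default:
		return (true);
	}
}

/*
 * 'cpu' goes offline: drop it from the set.  Returns true only when this
 * removed the last CPU of a strong set, i.e. the affinity is revoked and
 * the application should be notified.
 */
bool
aff_cpu_offline(affinity_t *a, int cpu)
{
	if (!a->cpus[cpu])
		return (false);
	a->cpus[cpu] = false;
	a->ncpus--;
	if (a->ncpus != 0 || a->kind != AFF_STRONG)
		return (false);	/* nothing to revoke */
	a->kind = AFF_NONE;	/* revoke: no CPU left to run on */
	return (true);
}
```
&lt;/pre&gt;For instance, create a strong affinity to CPUs 528-530 and then call aff_cpu_offline() for 529 and 528. Each call returns false, and only 530 is still allowed. Offlining 530 as well returns true and revokes the affinity, mirroring the pbind demo in this post.&lt;br /&gt;&lt;br /&gt;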
Run queue balancing done by &lt;a href="http://cvs.opensolaris.org/source/xref/on/usr/src/uts/common/disp/disp.c#1153"&gt;setbackdq()&lt;/a&gt; applies to all the affinity types.&lt;br /&gt;&lt;br /&gt;Here is an example:&lt;br /&gt;&lt;pre&gt;bash-3.00# ./pbind -s 528-530 `pgrep aff`&lt;br /&gt;&lt;br /&gt;bash-3.00# dtrace -s ./a.d    ## D script capturing context switches.&lt;br /&gt;CPU no. of times ran&lt;br /&gt;     529              197&lt;br /&gt;     528              208&lt;br /&gt;     530              210&lt;br /&gt;&lt;br /&gt;bash-3.00# ./pbind -q `pgrep aff`&lt;br /&gt;process id 3211: not bound&lt;br /&gt;process id 3211: strong affinity to: 528-530&lt;br /&gt;&lt;br /&gt;bash-3.00# psradm -f 529 528&lt;br /&gt;&lt;br /&gt;bash-3.00# dtrace -s ./a.d     ## D script capturing context switches.&lt;br /&gt;CPU no. of times ran&lt;br /&gt;     530              255&lt;/pre&gt;If you were to offline CPU 530 as well, we would have to revoke the affinity. The process has strong affinity, and there would be no CPU left where it can run. The purpose is to let the offline go through (for DR or other FMA events). The same holds true for processor sets: if a CPU is removed from the pset and it happens to be the last CPU in the thread's CPU bitmask, the affinity is revoked.&lt;br /&gt;&lt;br /&gt;We could preserve affinity to a CPU when it is offlined, provided it's not the last CPU in the bitmask. Then, when the CPU is brought back, users wouldn't have to bother finding a suitable CPU again. I'm not sure whether that would be good, or whether we really want to do this. 
I do have a prototype based on that.&lt;br /&gt;&lt;br /&gt;The demo above just illustrates what we are trying to achieve; the work is still in the prototyping stage.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/3121096342469843106-9048354048199614618?l=mishrasdiary.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://mishrasdiary.blogspot.com/feeds/9048354048199614618/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=3121096342469843106&amp;postID=9048354048199614618' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/3121096342469843106/posts/default/9048354048199614618'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/3121096342469843106/posts/default/9048354048199614618'/><link rel='alternate' type='text/html' href='http://mishrasdiary.blogspot.com/2006/10/multi-cpu-binding-in-solaris.html' title='Multi-CPU Binding in Solaris'/><author><name>Saurabh Mishra</name><uri>http://www.blogger.com/profile/02167304494816941944</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='30' height='32' src='http://2.bp.blogspot.com/_6UgirKk5E5c/SX6dSzcCqUI/AAAAAAAAN2U/hjGoSLVtFQ4/S220/saurabh1.jpg'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-3121096342469843106.post-5839782831597656516</id><published>2006-03-08T14:54:00.000-08:00</published><updated>2009-08-24T14:55:40.920-07:00</updated><title type='text'>VFS/Vnode Layer in Solaris</title><content type='html'>In the past I have mostly written about dispatcher locks (thread locks), the scheduler, signals and procfs. This is the first time I'm writing about filesystems. 
I hope it helps increase awareness of filesystem internals and makes developing filesystem-specific code on Solaris easier.&lt;br /&gt;&lt;br /&gt;In this blog, I'll describe how to implement the VFS (Virtual Filesystem) layer and the vnode layer for a filesystem. There are two ways to read disk data:&lt;br /&gt;&lt;br /&gt;(a) using the buffer cache: bread() reads a block of the device, and the block number is always relative to the device. brelse() must be called once the buffer data has been read from buf_t-&gt;b_un.b_addr.&lt;br /&gt;&lt;br /&gt;(b) using the segmap driver and setting up the pages.&lt;br /&gt;&lt;br /&gt;Through the VFS layer, we export the following filesystem operations:&lt;br /&gt;&lt;br /&gt;(a) mount: we first need to check whether the device can be mounted at all. We also need to read the super-block, taking into account whether the device is a primary or a logical partition. We are also required to create a pseudo device, using the following calls:&lt;br /&gt;&lt;pre&gt;pseudodev = makedevice(getmajor(xdev), minor);  // xdev is the device passed to mount(1M)&lt;br /&gt;devvp = makespecvp(xdev, VBLK);                 // devvp is used to do reads&lt;/pre&gt;Once the pseudo device is created, we open the device to read the super-block and check the filesystem signature. This information is copied to the in-core super-block. Now comes the hard work of mounting the filesystem. We get the root vnode of the filesystem being mounted and mark it VROOT (vp-&gt;v_flag). The vfs structure is also filled in. For instance, vfs_data points to the filesystem-specific structure (struct ufsvfs in UFS), which holds the super-block and other general information about the filesystem. The VFS layer routines take care of linking the vfs structure into the list of mounted filesystems. The global array 'vfssw' (of type struct vfssw) is the table of registered filesystem types.&lt;br /&gt;&lt;br /&gt;(b) unmount: this operation is very critical. Unmount should not go through while processes are inside the mount point, unless the force flag (-f) is passed to umount(1M). 
We need to maintain a reference count so that unmount doesn't go through while a process's current working directory is inside the mount point. For this, we can increment the count whenever a vnode is allocated and decrement it whenever a vnode is released via VOP_INACTIVE(). The xxx_unmount() operation should therefore first check whether it's safe to unmount the filesystem. The VFS layer routines purge the DNLC before we land in the filesystem-specific unmount operation.&lt;br /&gt;&lt;br /&gt;(c) statvfs on the filesystem: df(1M) calls statvfs for each mount point. In this operation, we are required to return the following information in the statvfs64 structure:&lt;br /&gt;&lt;br /&gt; &lt;pre wrap="true"&gt;f_bsize    // block size&lt;br /&gt;f_frsize   // fundamental block (fragment) size; UFS uses fragments to accommodate small files&lt;br /&gt;f_blocks   // total number of blocks in the filesystem&lt;br /&gt;f_bfree    // free blocks&lt;br /&gt;f_files = (fsfilcnt64_t)-1;&lt;br /&gt;f_ffree = (fsfilcnt64_t)-1;&lt;br /&gt;f_favail = (fsfilcnt64_t)-1;&lt;br /&gt;f_fsid     // filesystem id&lt;br /&gt;(void) strcpy(sp-&gt;f_basetype, vfssw[vfsp-&gt;vfs_fstype].vsw_name);   // name&lt;br /&gt;f_flag = vf_to_stf(vfsp-&gt;vfs_flag);   // flag&lt;br /&gt;f_namemax            // maximum filename length&lt;br /&gt;&lt;/pre&gt;&lt;br /&gt;&lt;br /&gt;(d) sync: a read-only filesystem doesn't need to implement sync. Otherwise, sync is used to flush the filesystem's dirty pages.&lt;br /&gt;&lt;br /&gt;(e) root: used by lookups to find the root vnode of the filesystem (the vnode at the mount point). We are required to hold the vnode before returning it.&lt;br /&gt;&lt;br /&gt;The vnode layer exports the following operations. We will focus on the operations required to support reads on the filesystem. Writes are much trickier, as you need to implement a host of other operations and lock the filesystem.&lt;br /&gt;&lt;br /&gt;(a) read: this operation is invoked whenever read(2) is called. 
In this routine, we use segmap to read the file data. We force a fault on the pages using&lt;br /&gt;&lt;br /&gt; segmap_getmapflt(segkmap, vp, (off + mapon), &amp;lt;length&amp;gt;, 1, S_READ);&lt;br /&gt;&lt;br /&gt;and then uiomove() is called to copy the data out to userland. Once uiomove() is done, we release the smp (segmap mapping) using segmap_release(). Note that segmap works on MAXBSIZE (8192-byte) windows, so you need to manage the window offset (off) and the offset within the window (mapon). They are calculated as:&lt;br /&gt;&lt;br /&gt; off = uoff &amp;amp; (offset_t)MAXBMASK;&lt;br /&gt; mapon = (u_offset_t)(uoff &amp;amp; (offset_t)MAXBOFFSET);&lt;br /&gt;&lt;br /&gt;(b) getattr: here we need to return the 'vattr' structure, which is what 'ls -l' reads. The following members are relevant:&lt;br /&gt;&lt;br /&gt; &lt;pre wrap="true"&gt;va_type   // type of vnode&lt;br /&gt;va_mode   // mode&lt;br /&gt;va_uid    // uid&lt;br /&gt;va_gid     // gid&lt;br /&gt;va_atime.tv_sec // access time&lt;br /&gt;va_mtime.tv_sec    // modification time&lt;br /&gt;va_ctime.tv_sec    // attribute change time&lt;br /&gt;va_size     // size&lt;br /&gt;va_nlink    // link count&lt;br /&gt;va_blksize  // block size&lt;br /&gt;va_nblocks  // number of blocks&lt;br /&gt;&lt;/pre&gt;&lt;br /&gt;&lt;br /&gt;(c) lookup: this is the heart of any filesystem. We must provide lookup before we can read files or search a directory, and this routine is the one that understands the filesystem's structure. Here you can also use the DNLC (Directory Name Lookup Cache) to speed up lookups. The vnode and name are cached, so we don't have to go to the disk every time we search for a file or directory. dnlc_enter() puts an entry in the DNLC, and dnlc_lookup() checks whether a vnode for a given name can be found in the DNLC. Both routines increment v_count using VN_HOLD().&lt;br /&gt;&lt;br /&gt;(d) getpage_miss/getpage: this routine reads the block of a file at a given offset. 
Here we need to set up the pages using page_create_va() and prepare to read the block data using pageio_setup(). To issue the I/O, we call bdev_strategy(), biowait() and then pageio_done(), in that order. To support read-ahead, we can use the pvn_read_kluster() routine. The filesystem-specific getpage() routine calls getpage_miss() to read the block. In getpage(), we also call page_lookup() first, to avoid going to disk if the page is already in memory.&lt;br /&gt;&lt;br /&gt;(e) readdir: this operation reads directory entries. The key thing here is the uio_offset passed in the uio structure. If uio_offset equals the size of the directory, we have read all the entries. Otherwise, we read entries starting from the offset passed to us in uio_offset. At the end, we must return the new offset in uio_offset, so that the next readdir() call can read more directory entries.&lt;br /&gt;&lt;br /&gt;A host of other operations, such as putpage and write, are required when writes are also supported on the filesystem. 
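&lt;br /&gt;&lt;br /&gt;Going back to the read operation, the off/mapon arithmetic can be checked with a small user-space sketch that splits a read request into MAXBSIZE windows. The sketch models only the arithmetic; it does not call segmap_getmapflt(), uiomove() or segmap_release(). struct window and split_read() are hypothetical names. Because MAXBSIZE is a power of two, the remainder modulo MAXBSIZE equals masking with MAXBOFFSET, and subtracting that remainder equals masking with MAXBMASK.&lt;pre&gt;
```c
#include "assert.h"
#include "stddef.h"
#include "stdint.h"

#define	MAXBSIZE	8192u	/* segmap window size given in the post */

/* One segmap window touched by a read (hypothetical helper type). */
struct window {
	uint64_t off;	/* window start: uoff rounded down to MAXBSIZE */
	uint32_t mapon;	/* where the data starts inside the window */
	uint32_t n;	/* bytes copied out of this window */
};

/*
 * Split a read of 'resid' bytes at file offset 'uoff' into the pieces a
 * read loop would map at (off + mapon) with segmap_getmapflt() and copy
 * with uiomove(), one window at a time.  Returns the number of windows.
 */
size_t
split_read(uint64_t uoff, uint64_t resid, struct window *w, size_t maxw)
{
	size_t count = 0;

	while (resid != 0) {
		uint64_t mapon = uoff % MAXBSIZE;	/* like masking with MAXBOFFSET */
		uint64_t off = uoff - mapon;		/* like masking with MAXBMASK */
		uint64_t n = MAXBSIZE - mapon;		/* room left in this window */

		assert(maxw > count);
		if (n > resid)
			n = resid;
		w[count].off = off;
		w[count].mapon = (uint32_t)mapon;
		w[count].n = (uint32_t)n;
		count++;
		uoff += n;
		resid -= n;
	}
	return (count);
}
```
&lt;/pre&gt;For example, a read of 10000 bytes at offset 10000 touches two windows. It first copies 6384 bytes starting at mapon 1808 in the window at 8192, then the remaining 3616 bytes from the start of the window at 16384.&lt;br /&gt;&lt;br /&gt;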
To support mmap(), we need to use the segvn segment driver instead of segmap.&lt;/length&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/3121096342469843106-5839782831597656516?l=mishrasdiary.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://mishrasdiary.blogspot.com/feeds/5839782831597656516/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=3121096342469843106&amp;postID=5839782831597656516' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/3121096342469843106/posts/default/5839782831597656516'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/3121096342469843106/posts/default/5839782831597656516'/><link rel='alternate' type='text/html' href='http://mishrasdiary.blogspot.com/2006/03/vfsvnode-layer-in-solaris.html' title='VFS/Vnode Layer in Solaris'/><author><name>Saurabh Mishra</name><uri>http://www.blogger.com/profile/02167304494816941944</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='30' height='32' src='http://2.bp.blogspot.com/_6UgirKk5E5c/SX6dSzcCqUI/AAAAAAAAN2U/hjGoSLVtFQ4/S220/saurabh1.jpg'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-3121096342469843106.post-4945677654085423383</id><published>2005-10-14T14:56:00.000-07:00</published><updated>2009-08-24T14:56:50.511-07:00</updated><title type='text'>Dispatcher locks and Bug 5017148</title><content type='html'>As part of the OpenSolaris release, I'm going to describe dispatcher locks, thread locks and a bug which I root-caused last year. 
The investigation didn't take much time, but it was an interesting one because doors do some magic in the kernel during the handoff to the other thread (client to server or server to client). So let me begin with what a dispatcher lock is:&lt;br /&gt;&lt;br /&gt;&lt;h3&gt;1. What's a dispatcher lock&lt;/h3&gt;A dispatcher lock is a one-byte lock (disp_lock_t) which is acquired at high PIL (DISP_LEVEL). &lt;a href="http://cvs.opensolaris.org/source/xref/usr/src/uts/sparc/sys/machlock.h"&gt;DISP_LEVEL&lt;/a&gt; is the interrupt level at which dispatcher operations should be performed. There are other symbolic interrupt levels, viz. CLOCK_LEVEL and LOCK_LEVEL, in &lt;a href="http://cvs.opensolaris.org/source/xref/usr/src/uts/sparc/sys/machlock.h"&gt;machlock.h&lt;/a&gt;.&lt;br /&gt;&lt;br /&gt;The following dispatcher lock interfaces are implemented in &lt;a href="http://cvs.opensolaris.org/source/xref/usr/src/uts/common/disp/disp_lock.c"&gt;disp_lock.c&lt;/a&gt;:&lt;br /&gt;&lt;br /&gt;&lt;a href="http://cvs.opensolaris.org/source/xref/usr/src/uts/common/disp/disp_lock.c#disp_lock_init"&gt;disp_lock_init()&lt;/a&gt; initializes a dispatcher lock.&lt;br /&gt;&lt;a href="http://cvs.opensolaris.org/source/xref/usr/src/uts/common/disp/disp_lock.c#disp_lock_destroy"&gt;disp_lock_destroy()&lt;/a&gt; destroys a dispatcher lock.&lt;br /&gt;&lt;a href="http://cvs.opensolaris.org/source/xref/usr/src/uts/common/disp/disp_lock.c#disp_lock_enter"&gt;disp_lock_enter()&lt;/a&gt; acquires a dispatcher lock.&lt;br /&gt;&lt;a href="http://cvs.opensolaris.org/source/xref/usr/src/uts/common/disp/disp_lock.c#disp_lock_exit"&gt;disp_lock_exit()&lt;/a&gt; releases a dispatcher lock and checks for kernel preemption.&lt;br /&gt;&lt;a
href="http://cvs.opensolaris.org/source/xref/usr/src/uts/common/disp/disp_lock.c#disp_lock_exit_nopreempt"&gt;disp_lock_exit_nopreempt()&lt;/a&gt; releases a dispatcher lock without checking for kernel preemption.&lt;br /&gt;&lt;a href="http://cvs.opensolaris.org/source/xref/usr/src/uts/common/disp/disp_lock.c#disp_lock_enter_high"&gt;disp_lock_enter_high()&lt;/a&gt; acquires another dispatcher lock when the thread is already holding a dispatcher lock.&lt;br /&gt;&lt;a href="http://cvs.opensolaris.org/source/xref/usr/src/uts/common/disp/disp_lock.c#disp_lock_exit_high"&gt;disp_lock_exit_high()&lt;/a&gt; releases the top-level dispatcher lock.&lt;br /&gt;&lt;br /&gt;Here are the facts about dispatcher locks:&lt;br /&gt;&lt;br /&gt;(a) Dispatcher locks are spin locks acquired at a high interrupt level. They should therefore be held only briefly, and you must not make blocking calls while holding one.&lt;br /&gt;(b) When releasing a dispatcher lock, you can be preempted if &lt;a href="http://cvs.opensolaris.org/source/xref/usr/src/uts/common/sys/cpuvar.h"&gt;cpu_kprunrun&lt;/a&gt; (the kernel preemption flag) is set. You can use &lt;a href="http://cvs.opensolaris.org/source/xref/usr/src/uts/common/disp/disp_lock.c#disp_lock_exit_nopreempt"&gt;disp_lock_exit_nopreempt()&lt;/a&gt; if you don't want to be preempted.&lt;br /&gt;(c) While holding a dispatcher lock, you are not preemptible.&lt;br /&gt;(d) Since acquiring a dispatcher lock raises the PIL to DISP_LEVEL, the old PIL is saved in &lt;a href="http://cvs.opensolaris.org/source/xref/usr/src/uts/common/sys/thread.h"&gt;t_oldspl&lt;/a&gt; of the thread structure (&lt;a href="http://cvs.opensolaris.org/source/xref/usr/src/uts/common/sys/thread.h"&gt;kthread_t&lt;/a&gt;).&lt;br /&gt;&lt;br /&gt;&lt;h3&gt;2.
What's a thread lock&lt;/h3&gt;The thread lock is a per-thread entity which protects &lt;a href="http://cvs.opensolaris.org/source/xref/usr/src/uts/common/sys/thread.h"&gt;t_state&lt;/a&gt; and the state-related flags of a kernel thread. The thread lock hangs off &lt;a href="http://cvs.opensolaris.org/source/xref/usr/src/uts/common/sys/thread.h"&gt;kthread_t&lt;/a&gt; as t_lockp. &lt;a href="http://cvs.opensolaris.org/source/xref/usr/src/uts/common/sys/thread.h"&gt;t_lockp&lt;/a&gt; is a pointer to the thread's dispatcher lock, and the pointer changes whenever the state of the kernel thread changes. One acquires the thread lock by passing the kernel thread pointer to the &lt;a href="http://cvs.opensolaris.org/source/xref/usr/src/uts/common/disp/disp_lock.c"&gt;thread_lock()&lt;/a&gt; routine, which is responsible for getting the correct dispatcher lock for the thread. The dance done by thread_lock() is interesting because t_lockp is a pointer, and it can change while we are spinning for the dispatcher lock it points to. 
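&lt;br /&gt;&lt;br /&gt;The retry loop this calls for can be sketched in user space with C11 atomics. This is a simplified model, not the kernel's thread_lock(). Here disp_lock_t is modelled as an atomic_flag, whereas the real lock also raises the PIL to DISP_LEVEL. The sketch also assumes that t_lockp is only changed by a thread holding the lock it currently points to; that assumption is what makes a single re-check sufficient.&lt;pre&gt;
```c
#include "assert.h"
#include "stdatomic.h"
#include "stddef.h"

/* Model of the one-byte disp_lock_t; the real lock also raises the PIL. */
typedef atomic_flag disp_lock_t;

/* Only the kthread_t field this sketch needs. */
typedef struct {
	_Atomic(disp_lock_t *) t_lockp;	/* changes with the thread state */
} kthread_t;

void
disp_lock_enter(disp_lock_t *lp)
{
	while (atomic_flag_test_and_set_explicit(lp, memory_order_acquire))
		;	/* spin until the holder clears the flag */
}

void
disp_lock_exit(disp_lock_t *lp)
{
	atomic_flag_clear_explicit(lp, memory_order_release);
}

/*
 * Acquire whichever dispatcher lock currently protects t.  t_lockp can
 * be switched to another lock while we spin, so once we hold a lock we
 * re-read t_lockp: if it still points at the lock we hold, nobody can
 * move it until we release it; otherwise drop the stale lock and retry.
 */
disp_lock_t *
thread_lock(kthread_t *t)
{
	for (;;) {
		disp_lock_t *lp = t->t_lockp;	/* atomic load */

		assert(lp != NULL);	/* a thread always has some lock */
		disp_lock_enter(lp);
		if (lp == t->t_lockp)	/* still current: we own it */
			return (lp);
		disp_lock_exit(lp);	/* moved while we spun: retry */
	}
}
```
&lt;/pre&gt;The loop goes round a second time only when another CPU moves t_lockp, for example on a state change, while we are spinning on the old lock.&lt;br /&gt;&lt;br /&gt;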
thread_lock() therefore saves the t_lockp pointer it locked and re-checks it, which ensures that we end up holding the right thread lock.&lt;br /&gt;&lt;br /&gt;Now let's take a look at the thread lock interfaces in the Solaris kernel, which are described in &lt;a href="http://cvs.opensolaris.org/source/xref/usr/src/uts/common/disp/disp_lock.c"&gt;disp_lock.c&lt;/a&gt; and &lt;a href="http://cvs.opensolaris.org/source/xref/usr/src/uts/common/sys/thread.h"&gt;thread.h&lt;/a&gt;:&lt;br /&gt;&lt;br /&gt;thread_lock() is called to acquire the thread lock.&lt;br /&gt;thread_unlock() is called to release the thread lock, and it checks for kernel preemption.&lt;br /&gt;thread_lock_high() is called to acquire another thread lock while holding one.&lt;br /&gt;thread_unlock_high() is called to release a thread lock while still holding another.&lt;br /&gt;thread_unlock_nopreempt() is called to release the thread lock without checking for kernel preemption.&lt;br /&gt;&lt;br /&gt;&lt;h3&gt;3. Various types of thread locks in the Solaris kernel&lt;/h3&gt;Now that I've described the thread lock, it's very important to understand which dispatcher lock is acquired depending on the state of the thread. 
To work this out, you first need to understand the mapping between the state of a thread and its corresponding dispatcher lock:&lt;br /&gt;&lt;br /&gt;&lt;a href="http://cvs.opensolaris.org/source/xref/usr/src/uts/common/sys/thread.h"&gt;TS_RUN&lt;/a&gt; (runnable) ---&gt; &lt;a href="http://cvs.opensolaris.org/source/xref/usr/src/uts/common/sys/disp.h"&gt;disp_lock&lt;/a&gt; of the dispatch queue of a CPU (&lt;a href="http://cvs.opensolaris.org/source/xref/usr/src/uts/common/sys/cpuvar.h"&gt;cpu_t&lt;/a&gt;), or of the global preemption queue of a CPU partition&lt;br /&gt;&lt;a href="http://cvs.opensolaris.org/source/xref/usr/src/uts/common/sys/thread.h"&gt;TS_ONPROC&lt;/a&gt; (running) ---&gt; &lt;a href="http://cvs.opensolaris.org/source/xref/usr/src/uts/common/sys/cpuvar.h"&gt;cpu_thread_lock&lt;/a&gt; of a CPU (cpu_t)&lt;br /&gt;&lt;a href="http://cvs.opensolaris.org/source/xref/usr/src/uts/common/sys/thread.h"&gt;TS_SLEEP&lt;/a&gt; (sleeping) ---&gt; &lt;a href="http://cvs.opensolaris.org/source/xref/usr/src/uts/common/sys/sleepq.h"&gt;sleepq bucket lock&lt;/a&gt; or &lt;a href="http://cvs.opensolaris.org/source/xref/usr/src/uts/common/os/turnstile.c"&gt;turnstile chain lock&lt;/a&gt;&lt;br /&gt;&lt;a href="http://cvs.opensolaris.org/source/xref/usr/src/uts/common/sys/thread.h"&gt;TS_STOPPED&lt;/a&gt; (stopped) ---&gt; &lt;a href="http://cvs.opensolaris.org/source/xref/usr/src/uts/common/disp/disp.c"&gt;stop_lock&lt;/a&gt;, a global dispatcher lock for stopped threads&lt;br /&gt;&lt;br /&gt;There are two more global dispatcher locks: &lt;a href="http://cvs.opensolaris.org/source/xref/usr/src/uts/common/disp/disp.c"&gt;shuttle_lock&lt;/a&gt; and &lt;a
href="http://cvs.opensolaris.org/source/xref/usr/src/uts/common/disp/disp.c"&gt;transition_lock&lt;/a&gt;        in Solaris Kernel. When thread lock of a thread is pointing to shuttle_lock,        it means that the thread is sleeping on a door and when thread lock  points      to transition_lock, it means that thread is in transition to another state     (for instance when the state of the thread sleeping on a semaphore is changed   from TS_SLEEP to TS_RUN or during &lt;a href="http://cvs.opensolaris.org/source/xref/usr/src/uts/common/syscall/yield.c"&gt;yield()&lt;/a&gt;).        transition_lock is always held and is never released.&lt;br /&gt;            &lt;br /&gt;                          &lt;h3&gt;4. Examples of thread lock&lt;/h3&gt;           Now lets understand what all thread locks will be involved from  wakeup    (or  unsleep) to onproc (running) of a thread.  Lets assume  that T1   (thread   1) is blocked on a condition variable CV1 and T2 (thread  2)    signals   T1 as part of wakeup.  First &lt;a href="http://cvs.opensolaris.org/source/xref/usr/src/uts/common/os/condvar.c#cv_signal"&gt;cv_signal()&lt;/a&gt;         grabs sleepq bucket lock and decrements the waiters count on CV1. It   then    calls &lt;a href="http://cvs.opensolaris.org/source/xref/usr/src/uts/common/os/sleepq.c#sleepq_wakeone_chan"&gt;sleepq_wakeone_chan()&lt;/a&gt;        to wakeup T1. &lt;a href="http://cvs.opensolaris.org/source/xref/usr/src/uts/common/os/sleepq.c#sleepq_wakeone_chan"&gt;sleepq_wakeone_chan()'s&lt;/a&gt;        responsibility is to unlink T1 from the sleepq list (using t_link of   kthread_t)     and calls &lt;a href="http://cvs.opensolaris.org/source/xref/usr/src/uts/common/sys/class.h#CL_WAKEUP"&gt;CL_WAKEUP&lt;/a&gt;        (scheduling class specific wakeup routine). Assuming T1 is in time sharing      class (TS),  &lt;a href="http://cvs.opensolaris.org/source/xref/usr/src/uts/common/disp/ts.c#ts_wakeup"&gt;ts_wakeup()&lt;/a&gt;        gets called. 
Now &lt;a href="http://cvs.opensolaris.org/source/xref/usr/src/uts/common/disp/ts.c#ts_wakeup"&gt;ts_wakeup()&lt;/a&gt; in turn calls a dispatcher enqueue routine (setfrontdq() or setbackdq()), which changes the state of T1 to TS_RUN and changes its t_lockp to point to the disp_lock of the chosen CPU. Then sleepq_wakeone_chan() drops the disp_lock of the dispatch queue, and finally the sleepq bucket lock is released in &lt;a href="http://cvs.opensolaris.org/source/xref/usr/src/uts/common/os/condvar.c#cv_signal"&gt;cv_signal()&lt;/a&gt;. Once T1 is chosen to run, &lt;a href="http://cvs.opensolaris.org/source/xref/usr/src/uts/common/disp/disp.c#disp"&gt;disp()&lt;/a&gt; removes T1 from the dispatch queue of the CPU and changes its state to TS_ONPROC and its t_lockp to the cpu_thread_lock of that CPU.&lt;br /&gt;&lt;br /&gt;&lt;pre&gt;void&lt;br /&gt;cv_signal(kcondvar_t *cvp)&lt;br /&gt;{&lt;br /&gt;       condvar_impl_t *cp = (condvar_impl_t *)cvp;&lt;br /&gt;&lt;br /&gt;       /* make sure the cv_waiters field looks sane */&lt;br /&gt;       ASSERT(cp-&gt;cv_waiters &lt;= CV_MAX_WAITERS);&lt;br /&gt;       if (cp-&gt;cv_waiters &gt; 0) {&lt;br /&gt;               sleepq_head_t *sqh = SQHASH(cp);&lt;br /&gt;               disp_lock_enter(&amp;amp;sqh-&gt;sq_lock);&lt;br /&gt;               ASSERT(CPU_ON_INTR(CPU) == 0);&lt;br /&gt;               if (cp-&gt;cv_waiters &amp;amp; CV_WAITERS_MASK) {&lt;br /&gt;                       kthread_t *t;&lt;br /&gt;                       cp-&gt;cv_waiters--;&lt;br /&gt;                       t = sleepq_wakeone_chan(&amp;amp;sqh-&gt;sq_queue, cp);&lt;br /&gt;                       /*&lt;br /&gt;                        * If cv_waiters is non-zero (and less than&lt;br /&gt;                        * CV_MAX_WAITERS) there should be a thread&lt;br /&gt;                        * in the queue.&lt;br /&gt;                        */&lt;br /&gt;                       ASSERT(t != NULL);&lt;br /&gt;               } else if (sleepq_wakeone_chan(&amp;amp;sqh-&gt;sq_queue, cp) == NULL) {&lt;br /&gt;                       cp-&gt;cv_waiters = 0;&lt;br /&gt;               }&lt;br /&gt;               disp_lock_exit(&amp;amp;sqh-&gt;sq_lock);&lt;br /&gt;       }&lt;br /&gt;}&lt;br /&gt;&lt;/pre&gt;&lt;br /&gt;&lt;br /&gt;The second example is from preemption. We know that there are two types of preemption in the Solaris kernel, viz. user preemption (cpu_runrun) and kernel preemption (cpu_kprunrun). Assume that T1 is being preempted in favour of a higher-priority thread. As a result, T1 will call &lt;a href="http://cvs.opensolaris.org/source/xref/usr/src/uts/common/disp/disp.c#preempt"&gt;preempt()&lt;/a&gt; once it realizes that it has to give up the CPU (there are hooks in the Solaris kernel to determine this). &lt;a href="http://cvs.opensolaris.org/source/xref/usr/src/uts/common/disp/disp.c#preempt"&gt;preempt()&lt;/a&gt; first grabs T1's own thread lock (effectively the cpu_thread_lock) and calls &lt;a href="http://cvs.opensolaris.org/source/xref/usr/src/uts/common/sys/thread.h"&gt;THREAD_TRANSITION()&lt;/a&gt; to change t_lockp to transition_lock. Note that the state of T1 is still TS_ONPROC while t_lockp points to transition_lock, because T1 is in a transition phase (TS_ONPROC -&gt; TS_RUN). &lt;a href="http://cvs.opensolaris.org/source/xref/usr/src/uts/common/sys/thread.h"&gt;THREAD_TRANSITION()&lt;/a&gt; also releases the previous dispatcher lock, because transition_lock is always held. &lt;a href="http://cvs.opensolaris.org/source/xref/usr/src/uts/common/disp/disp.c#preempt"&gt;preempt()&lt;/a&gt; then calls CL_PREEMPT(), the scheduling-class-specific preemption routine, to enqueue T1 on a particular CPU. 
From here on, it's the same as described in the first example.&lt;br /&gt;&lt;br /&gt;&lt;pre&gt;void&lt;br /&gt;preempt()&lt;br /&gt;{&lt;br /&gt;       kthread_t       *t = curthread;&lt;br /&gt;       klwp_t          *lwp = ttolwp(curthread);&lt;br /&gt;&lt;br /&gt;       if (panicstr)&lt;br /&gt;               return;&lt;br /&gt;&lt;br /&gt;       TRACE_0(TR_FAC_DISP, TR_PREEMPT_START, "preempt_start");&lt;br /&gt;&lt;br /&gt;       thread_lock(t);&lt;br /&gt;&lt;br /&gt;       if (t-&gt;t_state != TS_ONPROC || t-&gt;t_disp_queue != CPU-&gt;cpu_disp) {&lt;br /&gt;               /*&lt;br /&gt;                * this thread has already been chosen to be run on&lt;br /&gt;                * another CPU. Clear kprunrun on this CPU since we're&lt;br /&gt;                * already headed for swtch().&lt;br /&gt;                */&lt;br /&gt;               CPU-&gt;cpu_kprunrun = 0;&lt;br /&gt;               thread_unlock_nopreempt(t);&lt;br /&gt;               TRACE_0(TR_FAC_DISP, TR_PREEMPT_END, "preempt_end");&lt;br /&gt;       } else {&lt;br /&gt;               if (lwp != NULL)&lt;br /&gt;                       lwp-&gt;lwp_ru.nivcsw++;&lt;br /&gt;               CPU_STATS_ADDQ(CPU, sys, inv_swtch, 1);&lt;br /&gt;               THREAD_TRANSITION(t);&lt;br /&gt;               CL_PREEMPT(t);&lt;br /&gt;               DTRACE_SCHED(preempt);&lt;br /&gt;               thread_unlock_nopreempt(t);&lt;br /&gt;&lt;br /&gt;               TRACE_0(TR_FAC_DISP, TR_PREEMPT_END, "preempt_end");&lt;br /&gt;&lt;br /&gt;               swtch();                /* clears CPU-&gt;cpu_runrun via disp() */&lt;br /&gt;       }&lt;br /&gt;}&lt;br /&gt;&lt;/pre&gt;&lt;h3&gt;5. 
An example of a dispatcher lock and Bug &lt;a href="http://bugs.opensolaris.org/bugdatabase/view_bug.do?bug_id=5017148"&gt;5017148&lt;/a&gt;.&lt;/h3&gt;&lt;br /&gt;Apart from illustrating a dispatcher lock, I'll also describe a problem which I found a while back. It involves the kernel door implementation too.&lt;br /&gt;&lt;br /&gt;Whenever I look at a crash dump from a system hang, I usually begin by looking at what the CPUs are doing:&lt;br /&gt;&lt;pre&gt;&gt; ::cpuinfo&lt;br /&gt; ID ADDR        FLG NRUN BSPL PRI RNRN KRNRN SWITCH THREAD      PROC&lt;br /&gt;  0 0001041d2b0  1b    1    0  60   no    no    t-0 3001ba04900 cluster&lt;br /&gt;  1 30019fe4030  1d    2    0 101   no    no    t-0 3003d873a40 rgmd&lt;br /&gt;  2 3001a38aab8  1d    1    0 165  yes   yes    t-0 2a1003ebd20 sched&lt;br /&gt;  3 0001041b778  1d    2    0  60  yes   yes    t-0 3004fac3c80 cluster&lt;br /&gt;&lt;/pre&gt;CPU 0 is spinning for mutex 0x30001d7cae0, which is held by thread 0x3004fac3c80 running on CPU 3. 
Note that a thread spins for a mutex only when the owner is running; in this case the owner of the mutex happens to be onproc on CPU 3.&lt;br /&gt;&lt;p&gt;&gt; 0x30001d7cae0$&amp;lt;mutex&lt;br /&gt;0x30001d7cae0: owner/waiters&lt;br /&gt;3004fac3c80&lt;br /&gt;&gt;&lt;/p&gt;CPU 3 is our clock interrupt CPU (run ::cycinfo -v to see where the clock handler is registered), and thread 0x3004fac3c80 on CPU 3 seems to be spinning in &lt;a href="http://cvs.opensolaris.org/source/xref/usr/src/uts/common/os/condvar.c"&gt;cv_block()&lt;/a&gt; for a sleepq bucket lock (&lt;a href="http://cvs.opensolaris.org/source/xref/usr/src/uts/common/os/sleepq.c"&gt;sleepq_head[]&lt;/a&gt;). To find out which sleepq bucket this thread is waiting for, we can look at its wait channel &lt;a href="http://cvs.opensolaris.org/source/xref/usr/src/uts/common/sys/thread.h"&gt;t_wchan&lt;/a&gt; (t_lwpchan.lc_wchan); applying the hash function &lt;a href="http://cvs.opensolaris.org/source/xref/usr/src/uts/common/sys/sleepq.h"&gt;SQHASH()&lt;/a&gt; to it, I found the right bucket. Since this thread is already holding its thread lock (effectively the cpu_thread_lock of CPU 3) while spinning for the sleepq bucket lock, this would have blocked clock interrupts on CPU 3 too. 
This can be verified from the pending clock interrupts in ::cycinfo -v.&lt;br /&gt;&lt;br /&gt;Let's disassemble cv_block() where thread 3004fac3c80 is stuck:&lt;br /&gt;&lt;p&gt;cv_block+0x9c: add %i2, 8, %i0&lt;br /&gt;cv_block+0xa0: call -0x460e0 &amp;lt;disp_lock_enter_high&amp;gt;&lt;br /&gt;cv_block+0xa4: mov %i0, %o0&lt;/p&gt;&lt;p&gt;&gt; 0x3004fac3c80::print kthread_t t_lockp&lt;br /&gt;t_lockp = cpu0+0xb8&lt;br /&gt;&gt; cpu0=J&lt;br /&gt;1041b778 // CPU 3&lt;br /&gt;&gt; 0x3004fac3c80::print kthread_t ! grep wchan&lt;br /&gt;lc_wchan = 0x3006fc52d20&lt;br /&gt;&lt;/p&gt;And the sleepq bucket is:&lt;br /&gt;&lt;p&gt;&gt; 0x10471d88::print sleepq_head_t&lt;br /&gt;{&lt;br /&gt;sq_queue = {&lt;br /&gt;sq_first = 0x3001b476ee0&lt;br /&gt;}&lt;br /&gt;sq_lock = 0xff &amp;lt;----- dispatcher lock is held&lt;br /&gt;}&lt;br /&gt;&lt;/p&gt;Thread 3003d873a40, running on CPU 1, is spinning in thread_lock_high(). 
&lt;br /&gt;&lt;p&gt;&gt; 3003d873a40::findstack&lt;br /&gt;stack pointer for thread 3003d873a40: 2a1025964a1&lt;br /&gt;[ 000002a1025964a1 panic_idle+0x1c() ]&lt;br /&gt;000002a102596551 prom_rtt()&lt;br /&gt;000002a1025966a1 thread_lock_high+0xc()&lt;br /&gt;000002a102596751 sema_p+0x60()&lt;br /&gt;000002a102596801 kobj_open+0x84()&lt;br /&gt;000002a1025968d1 kobj_open_file+0x44()&lt;br /&gt;[.]&lt;br /&gt;000002a102597011 xdoor_proxy+0x20c()&lt;br /&gt;000002a1025971f1 door_call+0x204()&lt;br /&gt;000002a1025972f1 syscall_trap32+0xa8()&lt;br /&gt;&gt;&lt;br /&gt;&lt;/p&gt;This is an interesting stack. Looking at the sema_p() code, we see that it first grabs the sleepq bucket lock and then tries to grab the thread lock.&lt;br /&gt;&lt;br /&gt;Since the hash function SQHASH() returns the same index for 0x3006fc52d20 and 0x300819f3118, sema_p() is stuck on the thread lock, which is held by the thread running on CPU 3, while the thread running on CPU 3 is stuck because the sleepq bucket lock is held by the thread running on CPU 1.&lt;br /&gt;&lt;p&gt;&gt; 0x3003d873a40::print kthread_t t_lockp&lt;br /&gt;t_lockp = cpu0+0xb8&lt;br /&gt;&gt; cpu0+0xb8/x&lt;br /&gt;cpu0+0xb8: ff00&lt;br /&gt;&lt;/p&gt;Now let's find the real cause of this deadlock. Looking at t_cpu of thread 0x3003d873a40, we see that this thread, running on CPU 1, has t_lockp pointing to CPU 3's cpu_thread_lock. 
This is really nasty, as we would expect it to point to CPU 1's cpu_thread_lock.&lt;br /&gt;&lt;p&gt;&gt; 0x3003d873a40::print kthread_t ! grep cpu&lt;br /&gt;t_bound_cpu = 0&lt;br /&gt;t_cpu = 0x30019fe4030&lt;br /&gt;t_lockp = cpu0+0xb8 // CPU 3's cpu_thread_lock&lt;br /&gt;t_disp_queue = cpu0+0x78&lt;br /&gt;&lt;/p&gt;The cause of this problem is that &lt;a href="http://cvs.opensolaris.org/source/xref/usr/src/uts/common/fs/doorfs/door_sys.c"&gt;door_get_server()&lt;/a&gt;, while handing off to the server thread, gets preempted because disp_lock_exit() checks for kernel preemption: &lt;pre&gt;static kthread_t *&lt;br /&gt;door_get_server(door_node_t *dp)&lt;br /&gt;{&lt;br /&gt; [.]&lt;br /&gt;               /*&lt;br /&gt;                * Mark the thread as ONPROC and take it off the list&lt;br /&gt;                * of available server threads. We are committed to&lt;br /&gt;                * resuming this thread now.&lt;br /&gt;                */&lt;br /&gt;               disp_lock_t *tlp = server_t-&gt;t_lockp;&lt;br /&gt;               cpu_t *cp = CPU;&lt;br /&gt;&lt;br /&gt;               pool-&gt;dp_threads = server_t-&gt;t_door-&gt;d_servers;&lt;br /&gt;               server_t-&gt;t_door-&gt;d_servers = NULL;&lt;br /&gt;               /*&lt;br /&gt;                * Setting t_disp_queue prevents erroneous preemptions&lt;br /&gt;                * if this thread is still in execution on another processor&lt;br /&gt;                */&lt;br /&gt;               server_t-&gt;t_disp_queue = cp-&gt;cpu_disp;&lt;br /&gt;               CL_ACTIVE(server_t);&lt;br /&gt;               /*&lt;br /&gt;                * We are calling thread_onproc() instead of&lt;br /&gt;                * THREAD_ONPROC() because compiler can reorder&lt;br /&gt;                * the two stores of t_state and t_lockp in&lt;br /&gt;                * THREAD_ONPROC().&lt;br /&gt;                */&lt;br /&gt;               thread_onproc(server_t, cp);&lt;br /&gt;               disp_lock_exit(tlp);&lt;br /&gt;               return (server_t);&lt;br /&gt; [.]&lt;br /&gt;&lt;/pre&gt;&lt;br /&gt;&lt;br /&gt;As a result, the server thread's t_lockp points to the wrong cpu_thread_lock: by the time the client thread did &lt;a href="http://cvs.opensolaris.org/source/xref/usr/src/uts/common/disp/shuttle.c"&gt;shuttle_resume()&lt;/a&gt; to the server thread, it had been preempted and was running on a different CPU. 
We can see that &lt;a href="http://cvs.opensolaris.org/source/xref/usr/src/uts/common/fs/doorfs/door_calls.c"&gt;door_return()&lt;/a&gt; (which returns the results to the caller) releases the dispatcher lock without getting preempted, so we didn't notice this problem in door_return().&lt;br /&gt;&lt;br /&gt;On to cracking another problem now... In fact, we don't get any sleep until we've taken a look at the crash dump :-)&lt;p&gt; &lt;/p&gt;&lt;hr /&gt; Technorati Tag: &lt;a href="http://www.technorati.com/tag/OpenSolaris" rel="tag"&gt;OpenSolaris&lt;/a&gt; &lt;br /&gt; Technorati Tag: &lt;a href="http://www.technorati.com/tag/Solaris" rel="tag"&gt;Solaris&lt;/a&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/3121096342469843106-4945677654085423383?l=mishrasdiary.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://mishrasdiary.blogspot.com/feeds/4945677654085423383/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=3121096342469843106&amp;postID=4945677654085423383' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/3121096342469843106/posts/default/4945677654085423383'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/3121096342469843106/posts/default/4945677654085423383'/><link rel='alternate' type='text/html' href='http://mishrasdiary.blogspot.com/2005/10/dispatcher-locks-and-bug-5017148.html' title='Dispatcher locks and Bug 5017148'/><author><name>Saurabh Mishra</name><uri>http://www.blogger.com/profile/02167304494816941944</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='30' height='32' 
src='http://2.bp.blogspot.com/_6UgirKk5E5c/SX6dSzcCqUI/AAAAAAAAN2U/hjGoSLVtFQ4/S220/saurabh1.jpg'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-3121096342469843106.post-6131240597961225685</id><published>2005-10-14T14:55:00.000-07:00</published><updated>2009-08-24T14:56:15.661-07:00</updated><title type='text'>Compiler reordering problem</title><content type='html'>I'm going to write about a compiler reordering problem in the &lt;a href="http://cvs.opensolaris.org/source/xref/usr/src/uts/common/fs/doorfs/door_calls.c#door_return"&gt;door_return()&lt;/a&gt; function, which was observed in July 2002. The customer was able to reproduce the problem for us, and it took me a while to figure out that it was a compiler reordering problem. I must thank our customers for being so cooperative when we get such issues; I must have provided instrumented kernels at least five times before I found the problem. It's bug &lt;a href="http://bugs.opensolaris.org/bugdatabase/view_bug.do?bug_id=4699850"&gt;4699850&lt;/a&gt;.&lt;br /&gt;&lt;br /&gt;The symptoms were very clear. The system used to panic in Solaris kernel dispatcher routines, and one of the symptoms was a panic in dispdeq() while removing a kernel thread from the dispatch queue of a CPU.&lt;br /&gt;&lt;br /&gt;We know that the compiler can reorder C statements if they are independent. Consider this piece of C code:&lt;br /&gt;&lt;br /&gt;&lt;pre&gt;#define THREAD_SET_STATE(tp, state, lp) \&lt;br /&gt;               ((tp)-&gt;t_state = state, (tp)-&gt;t_lockp = lp)&lt;br /&gt;&lt;/pre&gt; &lt;br /&gt;t_lockp is a pointer to a dispatcher lock, and we don't know whether lp is held or not. When a thread is made TS_ONPROC, its t_lockp points to the cpu_thread_lock of the CPU (cpu_t). 
In the above C code, the compiler can reorder these two stores, so lp should be held while setting the thread's state.&lt;br /&gt;&lt;br /&gt;In &lt;a href="http://cvs.opensolaris.org/source/xref/usr/src/uts/common/fs/doorfs/door_calls.c#door_return"&gt;door_return()&lt;/a&gt;, when the server thread is about to hand off to the client thread to return the results, it makes the client thread TS_ONPROC and calls shuttle_resume() on the client thread. &lt;a href="http://cvs.opensolaris.org/source/xref/usr/src/uts/common/disp/shuttle.c#shuttle_resume"&gt;shuttle_resume()&lt;/a&gt; is responsible for making the client/server thread TS_ONPROC and putting the caller to sleep on the shuttle_lock sync object.&lt;br /&gt;&lt;br /&gt;While putting a thread onproc, dispatcher routines need not hold cpu_thread_lock; hence, if door_return() calls THREAD_ONPROC(), we can effectively lose the thread lock on the client thread.&lt;br /&gt;&lt;br /&gt;Now let's look at the two stores again. If t_lockp reaches global visibility before t_state, we can effectively lose the thread lock on the thread. Assume another thread on a different CPU is sending a signal to the client door thread. Once the thread lock on the client thread is lost, the signalling thread could see the old state of the client thread (in this case TS_SLEEP). Since the state is TS_SLEEP, eat_signal() will call setrun() on the client thread, which enqueues it on the dispatch queue of a CPU. 
As a result, some very strange things can happen, including the dispdeq() panic.&lt;br /&gt;&lt;br /&gt;The following code in door_return() was faulty; it is shown here with the fix, which calls thread_onproc() instead of THREAD_ONPROC():&lt;br /&gt;&lt;br /&gt;&lt;pre&gt;int&lt;br /&gt;door_return(caddr_t data_ptr, size_t data_size,&lt;br /&gt;       door_desc_t *desc_ptr, uint_t desc_num, caddr_t sp)&lt;br /&gt;{&lt;br /&gt; [.]&lt;br /&gt;                       tlp = caller-&gt;t_lockp;&lt;br /&gt;                       /*&lt;br /&gt;                        * Setting t_disp_queue prevents erroneous preemptions&lt;br /&gt;                        * if this thread is still in execution on another&lt;br /&gt;                        * processor&lt;br /&gt;                        */&lt;br /&gt;                       caller-&gt;t_disp_queue = cp-&gt;cpu_disp;&lt;br /&gt;                       CL_ACTIVE(caller);&lt;br /&gt;                       /*&lt;br /&gt;                        * We are calling thread_onproc() instead of&lt;br /&gt;                        * THREAD_ONPROC() because compiler can reorder&lt;br /&gt;                        * the two stores of t_state and t_lockp in&lt;br /&gt;                        * THREAD_ONPROC().&lt;br /&gt;                        */&lt;br /&gt;                       thread_onproc(caller, cp);&lt;br /&gt;                       disp_lock_exit_high(tlp);&lt;br /&gt;                       shuttle_resume(caller, &amp;amp;door_knob);&lt;br /&gt; [.]&lt;br /&gt;}&lt;br /&gt;&lt;/pre&gt;&lt;br /&gt;&lt;br /&gt;I had used TNF (Trace Normal Form) to find this problem. 
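&lt;br /&gt;&lt;br /&gt;The ordering the fix depends on (t_state must be visible no later than the new t_lockp) can be modelled in user space with C11 release/acquire atomics. This is only a sketch under stated assumptions: thread_model_t, onproc_model() and observe_model() are hypothetical names, the TS_* values are illustrative, and the post does not show how thread_onproc() itself enforces the ordering inside the kernel.&lt;br /&gt;&lt;pre&gt;
/* Hypothetical user-space model of the t_state / t_lockp ordering. */
#include &amp;lt;assert.h&amp;gt;
#include &amp;lt;stdatomic.h&amp;gt;

enum { TS_SLEEP = 1, TS_RUN = 2, TS_ONPROC = 4 };   /* illustrative values */

typedef struct {
    _Atomic int     t_state;
    _Atomic(void *) t_lockp;
} thread_model_t;

/*
 * Make the thread ONPROC: store the new state first, then publish the new
 * lock pointer with release ordering, so a CPU that reads t_lockp with
 * acquire ordering and sees the new value also sees the new t_state.
 */
static void
onproc_model(thread_model_t *t, void *cpu_thread_lock)
{
    atomic_store_explicit(&amp;amp;t-&gt;t_state, TS_ONPROC, memory_order_relaxed);
    atomic_store_explicit(&amp;amp;t-&gt;t_lockp, cpu_thread_lock, memory_order_release);
}

/*
 * Another CPU: read t_lockp with acquire ordering, then t_state. If it
 * sees the new lock pointer, it is guaranteed to see TS_ONPROC and never
 * the stale TS_SLEEP.
 */
static int
observe_model(thread_model_t *t, void **lockpp)
{
    assert(lockpp != NULL);
    *lockpp = atomic_load_explicit(&amp;amp;t-&gt;t_lockp, memory_order_acquire);
    return (atomic_load_explicit(&amp;amp;t-&gt;t_state, memory_order_relaxed));
}
&lt;/pre&gt;With the two plain stores of THREAD_SET_STATE(), the compiler is free to reorder them, which opens exactly the window described above: a signalling thread on another CPU can see the new t_lockp together with the stale TS_SLEEP.&lt;br /&gt;&lt;br /&gt;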
Today, though, we have a powerful tool for tracing from userland into the kernel, and of course that's Dtrace.&lt;br /&gt;&lt;br /&gt;&lt;hr /&gt; Technorati Tag: &lt;a href="http://www.technorati.com/tag/OpenSolaris" rel="tag"&gt;OpenSolaris&lt;/a&gt; &lt;br /&gt; Technorati Tag: &lt;a href="http://www.technorati.com/tag/Solaris" rel="tag"&gt;Solaris&lt;/a&gt; &lt;br /&gt;Technorati Tag: &lt;a href="http://www.technorati.com/tag/DTrace" rel="tag"&gt;DTrace&lt;/a&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/3121096342469843106-6131240597961225685?l=mishrasdiary.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://mishrasdiary.blogspot.com/feeds/6131240597961225685/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=3121096342469843106&amp;postID=6131240597961225685' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/3121096342469843106/posts/default/6131240597961225685'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/3121096342469843106/posts/default/6131240597961225685'/><link rel='alternate' type='text/html' href='http://mishrasdiary.blogspot.com/2005/10/compiler-reordering-problem.html' title='Compiler reordering problem'/><author><name>Saurabh Mishra</name><uri>http://www.blogger.com/profile/02167304494816941944</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='30' height='32' src='http://2.bp.blogspot.com/_6UgirKk5E5c/SX6dSzcCqUI/AAAAAAAAN2U/hjGoSLVtFQ4/S220/saurabh1.jpg'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-3121096342469843106.post-3918505966631925384</id><published>2005-07-21T14:56:00.000-07:00</published><updated>2009-08-24T14:57:38.033-07:00</updated><title 
type='text'>An interesting signal delivery related problem</title><content type='html'>&lt;pre wrap="true"&gt;Recently, we found an interesting performance problem using Dtrace. The problem showed up when using a virtual timer created with setitimer(2). The interval passed was 10 ms (one clock tick), but the SIGVTALRM signal used to arrive late, sometimes by 6 ticks or more. Now how would you Dtrace the code, and where would you start tracing? I'd start by tracing from signal generation to delivery. In the Solaris kernel, a signal is posted with sigtoproc(), and eat_signal() is called on the target thread to get it running (TS_ONPROC), depending upon its state (TS_RUN, TS_SLEEP, TS_STOPPED). psig() is called when the kernel finds a pending signal (for instance, when returning from a trap).&lt;br /&gt;&lt;br /&gt;The program spins in userland after setting up the timer. Since the thread's state would be TS_ONPROC, the kernel has to poke the target CPU if the thread happens to be running on a different CPU. So I started tracing the following functions: sigtoproc(), eat_signal(), poke_cpu() and psig(). 
Now let's take a look at the Dtrace probe output:&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;CPU Probe ID              Function&lt;br /&gt; 8  11263                 eat_signal:entry  1027637980027920 sig : 28&lt;br /&gt; 8   2981                   poke_cpu:entry  1027637980030560 cpu : 9&lt;br /&gt; 8  11263                 eat_signal:entry  1027637990025440 sig : 28&lt;br /&gt; 8   2981                   poke_cpu:entry  1027637990032160 cpu : 9&lt;br /&gt; 8  11263                 eat_signal:entry  1027638000036320 sig : 28&lt;br /&gt; 8   2981                   poke_cpu:entry  1027638000043600 cpu : 9&lt;br /&gt; 8  11263                 eat_signal:entry  1027638010025520 sig : 28&lt;br /&gt; 8   2981                   poke_cpu:entry  1027638010032240 cpu : 9&lt;br /&gt; 8  11263                 eat_signal:entry  1027638020023840 sig : 28&lt;br /&gt; 8   2981                   poke_cpu:entry  1027638020031280 cpu : 9&lt;br /&gt; 8  11263                 eat_signal:entry  1027638030028720 sig : 28&lt;br /&gt; 8   2981                   poke_cpu:entry  1027638030035920 cpu : 9&lt;br /&gt; 8  11263                 eat_signal:entry  1027638040024480 sig : 28&lt;br /&gt;[.]&lt;br /&gt; 9   8317                       psig:entry  1027638170086480 sig : 28&lt;br /&gt;&lt;br /&gt;If you calculate the difference between the timestamps of psig() and the first eat_signal(), you will notice that it is huge:&lt;br /&gt;&lt;br /&gt;1027638170086480 - 1027637980027920 = 190058560 ns&lt;br /&gt;190058560 ns = ~190 ms = 19 ticks&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;We also noticed that CPU 8 (where sigtoproc() is being called from clock_tick()) keeps poking CPU 9; however, CPU 9 does not preempt the currently running thread (the spinning program). So why does this happen? To understand this, I'll first briefly describe how preemption works in Solaris. To preempt a running thread, the kernel sets t_astflag (using the aston() macro) and also sets the appropriate CPU preemption flag. 
There are two CPU preemption flags, viz. cpu_runrun for user-level preemptions and cpu_kprunrun for kernel-level preemptions. RT threads can preempt TS, SYS or IA class threads in the kernel, since kernel-level preemption typically kicks in when the thread being made runnable has a priority &gt;= 100 (KPQPRI). For signals we don't set the CPU-level preemption flags; we just need to set t_sig_check and t_astflag, followed by a poke.&lt;br /&gt;&lt;br /&gt;Since we are interested in user-level preemption, we should know what happens when CPU 8 pokes CPU 9 (using a cross call). If the currently running thread on CPU 9 is in userland, we call user_rtt(), which calls trap() if the check for t_astflag succeeds. So let's check whether t_astflag is set when we call eat_signal(). And that's where the problem was: if the target thread in eat_signal() is TS_ONPROC, we should set t_astflag and then poke the CPU, but t_astflag was not being set. It is clear from the following probes that the running thread on CPU 9 was preempted only because its time quantum expired and the clock set t_astflag in cpu_surrender().&lt;br /&gt;&lt;br /&gt; 9  15055               post_syscall:entry  1027637970269440&lt;br /&gt; 8  11263                 eat_signal:entry  1027637980027920 sig : 28&lt;br /&gt; 8   2981                   poke_cpu:entry  1027637980030560 cpu : 9&lt;br /&gt;[.]&lt;br /&gt; 8  11263                 eat_signal:entry  1027638040024480 sig : 28&lt;br /&gt; 8   2981                   poke_cpu:entry  1027638040026800 cpu : 9&lt;br /&gt;[.]&lt;br /&gt; 8  11263                 eat_signal:entry  1027638110024160 sig : 28&lt;br /&gt; 8   2981                   poke_cpu:entry  1027638110026560 cpu : 9&lt;br /&gt; 8   2435              cpu_surrender:entry  1027638170024720 t:3001b7af3e0&lt;br /&gt; 8   2981                   poke_cpu:entry  1027638170027280 cpu : 9&lt;br /&gt; 8  11263                 eat_signal:entry  1027638170032720 sig : 28&lt;br /&gt; 9   2919              poke_cpu_intr:entry  1027638170033760&lt;br /&gt; 8   2981                   poke_cpu:entry  1027638170034400 cpu : 9&lt;br /&gt; 9   3390                       trap:entry  1027638170037840 type :512, pc: 10984, ast:1&lt;br /&gt; 8   2981                   poke_cpu:entry  1027638170038640 cpu : 9&lt;br /&gt; 9   2919              poke_cpu_intr:entry  1027638170045680&lt;br /&gt; 9   1497               trap_cleanup:entry  1027638170054880 0&lt;br /&gt; 9   8317                       psig:entry  1027638170086480 sig : 28&lt;br /&gt; 9   2278                   trap_rtt:entry  1027638170117440&lt;br /&gt; 9  15055               post_syscall:entry  1027638170143360&lt;br /&gt; 9   8317                       psig:entry  1027638170150880 sig : 2&lt;br /&gt;&lt;br /&gt;So Dtrace did help us find where the problem was. This is just one example. Happy Dtracing...&lt;/pre&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/3121096342469843106-3918505966631925384?l=mishrasdiary.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://mishrasdiary.blogspot.com/feeds/3918505966631925384/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=3121096342469843106&amp;postID=3918505966631925384' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/3121096342469843106/posts/default/3918505966631925384'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/3121096342469843106/posts/default/3918505966631925384'/><link rel='alternate' type='text/html' href='http://mishrasdiary.blogspot.com/2005/07/interesting-signal-delivery-related.html' title='An interesting signal delivery related problem'/><author><name>Saurabh Mishra</name><uri>http://www.blogger.com/profile/02167304494816941944</uri><email>noreply@blogger.com</email><gd:image 
rel='http://schemas.google.com/g/2005#thumbnail' width='30' height='32' src='http://2.bp.blogspot.com/_6UgirKk5E5c/SX6dSzcCqUI/AAAAAAAAN2U/hjGoSLVtFQ4/S220/saurabh1.jpg'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-3121096342469843106.post-8048972185391819679</id><published>2005-07-19T14:57:00.000-07:00</published><updated>2009-08-24T14:58:15.127-07:00</updated><title type='text'>Dtrace rocks...</title><content type='html'>&lt;pre wrap="true"&gt;Some time back I had a problem with my desktop: it started crawling whenever the Java ticker kicked in. I think I must share this with the rest of the world. I'd also like to share a performance-related kernel problem that we cracked. Dtrace has helped in solving many problems so far.&lt;br /&gt;&lt;br /&gt;My desktop running Solaris 10 was crawling, and prstat(1M) showed that Xsun was eating up 68% of the CPU :-&lt;br /&gt;&lt;br /&gt;# prstat&lt;br /&gt;  PID USERNAME  SIZE   RSS STATE  PRI NICE      TIME  CPU PROCESS/NLWP&lt;br /&gt;  594 ******     85M   78M run     30    0  14:03:19  68% Xsun/1&lt;br /&gt;  796 root       16M   13M sleep   59    0   1:06:25 5.8% stfontserverd/18&lt;br /&gt;[.]&lt;br /&gt;&lt;br /&gt;I then started Dtrac'ing Xsun and noticed that it was making the lwp_sigmask() system call far too frequently. 
Here is the data :-&lt;br /&gt;&lt;br /&gt;# ./syscall.d&lt;br /&gt;^C&lt;br /&gt;Ran for 26 seconds&lt;br /&gt;&lt;br /&gt;&lt;br /&gt; writev                                                         2832&lt;br /&gt; pollsys                                                        3261&lt;br /&gt; read                                                           5910&lt;br /&gt; doorfs                                                        27199&lt;br /&gt; lwp_sigmask                                                  217592&lt;br /&gt;&lt;br /&gt;LWP ID     COUNT&lt;br /&gt;1          217592&lt;br /&gt;&lt;br /&gt;             libc.so.1`__systemcall6+0x20&lt;br /&gt;             libc.so.1`pthread_sigmask+0x1b4&lt;br /&gt;             libc.so.1`sigprocmask+0x20&lt;br /&gt;             libc.so.1`sighold+0x54&lt;br /&gt;             libST.so.1`fsexchange+0x78&lt;br /&gt;             libST.so.1`FSSessionDisposeFontInstance+0x8c&lt;br /&gt;            9063&lt;br /&gt;&lt;br /&gt;             libc.so.1`__systemcall6+0x20&lt;br /&gt;             libc.so.1`pthread_sigmask+0x1b4&lt;br /&gt;             libc.so.1`sigprocmask+0x20&lt;br /&gt;             libc.so.1`sigrelse+0x54&lt;br /&gt;             libST.so.1`fsexchange+0xc0&lt;br /&gt;             libST.so.1`FSSessionGetFontRenderingParams+0x8c&lt;br /&gt;&lt;br /&gt;...and many more such stack traces from libST.so.1`fsexchange().&lt;br /&gt;&lt;br /&gt;In fact, the full stack looks like this :-&lt;br /&gt;&lt;br /&gt;             libc.so.1`__systemcall6+0x20&lt;br /&gt;             libc.so.1`pthread_sigmask+0x1b4&lt;br /&gt;             libc.so.1`sigprocmask+0x20&lt;br /&gt;             libc.so.1`sighold+0x54&lt;br /&gt;             libST.so.1`fsexchange+0x90&lt;br /&gt;             libST.so.1`FSSessionGetFontRenderingParams+0x8c&lt;br /&gt;             libST.so.1`GetRenderProps+0x344&lt;br /&gt;             libST.so.1`GlyphVectorRepQuery+0xf4&lt;br /&gt;             libST.so.1`STGlyphVectorQuery+0xd0&lt;br /&gt;             
SUNWXst.so.1`_XSTUseCache+0x68&lt;br /&gt;&lt;br /&gt;Notice that in these stack traces we are calling sighold() and sigrelse() very frequently, and each of those calls ends in an lwp_sigmask() system call. So this process keeps blocking and unblocking signals for some reason. It looks like we are rendering characters, but why do we block and unblock signals in this path? Here is the Dtrace script that was used :-&lt;br /&gt;&lt;br /&gt;#!/usr/sbin/dtrace -s&lt;br /&gt;&lt;br /&gt;#pragma D option quiet&lt;br /&gt;&lt;br /&gt;BEGIN&lt;br /&gt;{&lt;br /&gt;       start = timestamp;&lt;br /&gt;&lt;br /&gt;}&lt;br /&gt;&lt;br /&gt;syscall:::entry&lt;br /&gt;/execname == "Xsun"/&lt;br /&gt;{&lt;br /&gt;       @s[probefunc] = count();&lt;br /&gt;}&lt;br /&gt;&lt;br /&gt;syscall::lwp_sigmask:entry&lt;br /&gt;/execname == "Xsun"/&lt;br /&gt;{&lt;br /&gt;       @c[curthread-&gt;t_tid] = count();&lt;br /&gt;       @st[ustack(6)] = count();&lt;br /&gt;}&lt;br /&gt;&lt;br /&gt;END&lt;br /&gt;{&lt;br /&gt;       printf("Ran for %d seconds\n\n", (timestamp - start) / 1000000000);&lt;br /&gt;&lt;br /&gt;       trunc(@s,5);&lt;br /&gt;       printa(@s);&lt;br /&gt;&lt;br /&gt;       printf("\n%-10s %-10s\n", "LWP ID", "COUNT");&lt;br /&gt;       printa("%-10d %@d\n", @c);&lt;br /&gt;&lt;br /&gt;       printa(@st);&lt;br /&gt;}&lt;br /&gt;&lt;br /&gt;In fact, Dtrace can help in solving much more complex problems. 
Happy Dtrac'ing...&lt;br /&gt;&lt;/pre&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/3121096342469843106-8048972185391819679?l=mishrasdiary.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://mishrasdiary.blogspot.com/feeds/8048972185391819679/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=3121096342469843106&amp;postID=8048972185391819679' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/3121096342469843106/posts/default/8048972185391819679'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/3121096342469843106/posts/default/8048972185391819679'/><link rel='alternate' type='text/html' href='http://mishrasdiary.blogspot.com/2005/07/dtrace-rocks.html' title='Dtrace rocks...'/><author><name>Saurabh Mishra</name><uri>http://www.blogger.com/profile/02167304494816941944</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='30' height='32' src='http://2.bp.blogspot.com/_6UgirKk5E5c/SX6dSzcCqUI/AAAAAAAAN2U/hjGoSLVtFQ4/S220/saurabh1.jpg'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-3121096342469843106.post-1738936023073678916</id><published>2005-05-24T14:58:00.000-07:00</published><updated>2009-08-24T14:58:44.487-07:00</updated><title type='text'>::cpupart -v for mdb(1m)</title><content type='html'>&lt;pre wrap="true"&gt;Most of you will have used ::cpupart in mdb(1m) to determine the partitions on your system. For those who don't know what a partition is: it's an object (a kernel entity) which consists of a set of CPUs and a global dispatch queue (or global preemption queue). 
In fact, processor sets (which are created from userland using psrset(1M)) are an abstraction of CPU partitions.&lt;br /&gt;&lt;br /&gt;One of the things I'm currently working on is a new option to ::cpupart which will print all the runnable threads in the global dispatch queue of a CPU partition. It's very similar to what ::cpuinfo -v does. Here is some sample output :-&lt;br /&gt;&lt;br /&gt;On x86 :-&lt;br /&gt;---------&lt;br /&gt;&gt; ::cpupart -v&lt;br /&gt;ID     ADDR NRUN #CPU CPUS&lt;br /&gt; 0 fec2a1f8  298    2 0-1&lt;br /&gt;               |&lt;br /&gt;               +--&gt;  PRI THREAD   PROC&lt;br /&gt;                     100 d19b1000 sema&lt;br /&gt;                     100 d19aca00 sema&lt;br /&gt;                     100 d19ab200 sema&lt;br /&gt;                     100 d19a7000 sema&lt;br /&gt;                     100 d0b90a00 sema&lt;br /&gt;                     100 d19a5a00 sema&lt;br /&gt;                     100 d19a2e00 sema&lt;br /&gt;                     100 d19aea00 sema&lt;br /&gt;                     100 d19b4200 sema&lt;br /&gt;                     [.]&lt;br /&gt;&gt;&lt;br /&gt;&lt;br /&gt;On SPARC :-&lt;br /&gt;-----------&lt;br /&gt;&gt; ::cpupart -v&lt;br /&gt;ID             ADDR NRUN #CPU CPUS&lt;br /&gt; 0          18a8c50   25    8 4-11&lt;br /&gt;                       |&lt;br /&gt;                       +--&gt;  PRI THREAD      PROC&lt;br /&gt;                             100 3000a7b1660 sema&lt;br /&gt;                             100 3000a7c55e0 sema&lt;br /&gt;                             100 3000a7b0d00 sema&lt;br /&gt;                             100 3000a7c4960 sema&lt;br /&gt;                             100 3000a7b5c80 sema&lt;br /&gt;                             100 3000a7b4380 sema&lt;br /&gt;                             100 300084a2c80 sema-1&lt;br /&gt;                             100 3000826c3a0 sema-1&lt;br /&gt;                             [.]&lt;br /&gt;&gt;&lt;br /&gt;&lt;br 
/&gt;&lt;/pre&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/3121096342469843106-1738936023073678916?l=mishrasdiary.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://mishrasdiary.blogspot.com/feeds/1738936023073678916/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=3121096342469843106&amp;postID=1738936023073678916' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/3121096342469843106/posts/default/1738936023073678916'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/3121096342469843106/posts/default/1738936023073678916'/><link rel='alternate' type='text/html' href='http://mishrasdiary.blogspot.com/2005/05/cpupart-v-for-mdb1m.html' title='::cpupart -v for mdb(1m)'/><author><name>Saurabh Mishra</name><uri>http://www.blogger.com/profile/02167304494816941944</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='30' height='32' src='http://2.bp.blogspot.com/_6UgirKk5E5c/SX6dSzcCqUI/AAAAAAAAN2U/hjGoSLVtFQ4/S220/saurabh1.jpg'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-3121096342469843106.post-7739475828515609843</id><published>2005-02-16T14:58:00.000-08:00</published><updated>2009-08-24T14:59:19.796-07:00</updated><title type='text'>Dtrace : calculating time spent in functions</title><content type='html'>&lt;pre wrap="true"&gt;The other day (10th Feb 2005), I was giving a demo of &lt;a href="http://www.sun.com/2004-0518/feature/index.html"&gt;Dtrace&lt;/a&gt; to the BEA India folks at ITPL Bangalore and thought it would be nice to have a &lt;a href="http://www.sun.com/2004-0518/feature/index.html"&gt;Dtrace&lt;/a&gt; script which demonstrates the following things :-&lt;br /&gt;&lt;br /&gt;(a) time 
spent in functions called&lt;br /&gt;(b) both user-level and kernel-level tracing in the same function call context&lt;br /&gt;(c) the use of $target and -c to trace a process from the moment it is launched.&lt;br /&gt;&lt;br /&gt;I hope people find it interesting. Here is the output from the script. Note that the time spent is expressed in nanoseconds.&lt;br /&gt;&lt;br /&gt;# which w&lt;br /&gt;/usr/bin/w&lt;br /&gt;# dtrace -c w -Fs ./userfunc.target.d main&lt;br /&gt;[.]&lt;br /&gt; 5  -&gt; malloc                                        0&lt;br /&gt; 5    -&gt; assert_no_libc_locks_held                   0&lt;br /&gt; 5    &lt;- assert_no_libc_locks_held                        6880&lt;br /&gt; 5    -&gt; lmutex_lock                                 0&lt;br /&gt; 5    &lt;- lmutex_lock                                      7840&lt;br /&gt; 5    -&gt; _malloc_unlocked                            0&lt;br /&gt; 5      -&gt; cleanfree                                 0&lt;br /&gt; 5      &lt;- cleanfree                                      7600&lt;br /&gt; 5      -&gt; realfree                                  0&lt;br /&gt; 5      &lt;- realfree                                       8000&lt;br /&gt; 5    &lt;- _malloc_unlocked                                35360&lt;br /&gt; 5    -&gt; lmutex_unlock                               0&lt;br /&gt; 5    &lt;- lmutex_unlock                                    8240&lt;br /&gt; 5  &lt;- malloc                                            89200&lt;br /&gt; 5  -&gt; sysinfo                                       0              &lt;-------- getting into kernel&lt;br /&gt; 5    -&gt; pre_syscall                                 0&lt;br /&gt; 5      -&gt; syscall_mstate                            0&lt;br /&gt; 5      &lt;- syscall_mstate                                 4560&lt;br /&gt; 5    &lt;- pre_syscall                                     13120&lt;br /&gt; 5    -&gt; systeminfo                                  
0&lt;br /&gt; 5    &lt;- systeminfo                                       5760&lt;br /&gt; 5    -&gt; post_syscall                                0&lt;br /&gt; 5      -&gt; clear_stale_fd                            0&lt;br /&gt; 5      &lt;- clear_stale_fd                                 5200&lt;br /&gt; 5      -&gt; syscall_mstate                            0&lt;br /&gt; 5      &lt;- syscall_mstate                                 4000&lt;br /&gt; 5    &lt;- post_syscall                                    21680&lt;br /&gt;&lt;br /&gt; 5  &lt;- sysinfo                                           60720&lt;br /&gt;[.]&lt;br /&gt;&lt;br /&gt;# cat userfunc.target.d&lt;br /&gt;/*&lt;br /&gt;* This script calculates the time spent in all the functions (userland &amp;amp; kernel)&lt;br /&gt;* called once a particular function is traced.&lt;br /&gt;*&lt;br /&gt;* Please see /usr/demo/dtrace/userfunc.d also.&lt;br /&gt;*&lt;br /&gt;* Usage:&lt;br /&gt;* # dtrace -c &amp;lt;command&amp;gt; -Fs ./userfunc.target.d &amp;lt;userland_function&amp;gt;&lt;br /&gt;*/&lt;br /&gt;&lt;br /&gt;BEGIN&lt;br /&gt;{&lt;br /&gt;       self-&gt;depth = 0;&lt;br /&gt;}&lt;br /&gt;&lt;br /&gt;pid$target::$1:entry&lt;br /&gt;{&lt;br /&gt;       self-&gt;trace = 1;&lt;br /&gt;       self-&gt;depth = 0;&lt;br /&gt;       self-&gt;timestamp[self-&gt;depth++] = timestamp;&lt;br /&gt;}&lt;br /&gt;&lt;br /&gt;pid$target::$1:return&lt;br /&gt;/self-&gt;trace/&lt;br /&gt;{&lt;br /&gt;       self-&gt;trace = 0;&lt;br /&gt;       trace(timestamp - self-&gt;timestamp[0]);&lt;br /&gt;}&lt;br /&gt;&lt;br /&gt;fbt:::entry,&lt;br /&gt;pid$target:::entry&lt;br /&gt;/self-&gt;trace &amp;amp;&amp;amp; self-&gt;timestamp[self-&gt;depth - 1]/&lt;br /&gt;{&lt;br /&gt;       self-&gt;timestamp[self-&gt;depth++] = timestamp;&lt;br /&gt;       trace(0);&lt;br /&gt;}&lt;br /&gt;&lt;br /&gt;fbt:::return,&lt;br /&gt;pid$target:::return&lt;br /&gt;/self-&gt;trace &amp;amp;&amp;amp; self-&gt;timestamp[self-&gt;depth - 1]/&lt;br /&gt;{&lt;br
/&gt;       self-&gt;depth--;&lt;br /&gt;       trace(timestamp - self-&gt;timestamp[self-&gt;depth]);&lt;br /&gt;       self-&gt;timestamp[self-&gt;depth] = 0;&lt;br /&gt;}&lt;br /&gt;&lt;/pre&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/3121096342469843106-7739475828515609843?l=mishrasdiary.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://mishrasdiary.blogspot.com/feeds/7739475828515609843/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=3121096342469843106&amp;postID=7739475828515609843' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/3121096342469843106/posts/default/7739475828515609843'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/3121096342469843106/posts/default/7739475828515609843'/><link rel='alternate' type='text/html' href='http://mishrasdiary.blogspot.com/2005/02/dtrace-calculating-time-spent-in.html' title='Dtrace : calculating time spent in functions'/><author><name>Saurabh Mishra</name><uri>http://www.blogger.com/profile/02167304494816941944</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='30' height='32' src='http://2.bp.blogspot.com/_6UgirKk5E5c/SX6dSzcCqUI/AAAAAAAAN2U/hjGoSLVtFQ4/S220/saurabh1.jpg'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-3121096342469843106.post-5963259819817819652</id><published>2005-02-03T14:59:00.000-08:00</published><updated>2009-08-24T14:59:55.454-07:00</updated><title type='text'>Small demo of Resource Management, Contracts and Service Management Framework</title><content type='html'>&lt;pre wrap="true"&gt;I thought I should share these small demos with you folks. 
I've started playing with Resource Management, Zones, CPU shares, the Service Management Framework and Contracts. I hope you will find them interesting.&lt;br /&gt;&lt;span style="font-weight: bold;font-size:180%;" &gt;&lt;a href="http://docs.sun.com/app/docs/doc/817-1592/6mhahuoh6?a=view"&gt;&lt;br /&gt;Projects &lt;/a&gt;&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;I first created two projects and added the following resource controls to them. projadd(1M) is used to add a project and projects(1) is used to list the projects on the system.&lt;br /&gt;&lt;br /&gt;# projadd -G other -c "Project 1" -K "project.cpu-shares=(privileged,1,none)" proj1&lt;br /&gt;# projmod -a -K "task.max-lwps=(privileged,3,deny)" proj1&lt;br /&gt;# projmod -a -K "rcap.max-rss=20971520" proj1&lt;br /&gt;# projadd -G other -c "Project 2" -K "project.cpu-shares=(privileged,3,none)" -K "rcap.max-rss=10485760" proj2&lt;br /&gt;&lt;br /&gt;Now let's assign proj1 to user1 and proj2 to user2, so that these resource controls come into effect whenever user1 or user2 logs in.&lt;br /&gt;&lt;br /&gt;# projmod -U user1 proj1&lt;br /&gt;# projmod -U user2 proj2&lt;br /&gt;&lt;br /&gt;We need to add these two lines to /etc/user_attr :-&lt;br /&gt;&lt;br /&gt;# diff /etc/user_attr.org /etc/user_attr&lt;br /&gt;12a13,14&lt;br /&gt;&gt; user1::::project=proj1&lt;br /&gt;&gt; user2::::project=proj2&lt;br /&gt;&lt;br /&gt;So this is how our two projects look :-&lt;br /&gt;&lt;br /&gt;# projects -l&lt;br /&gt;[.]&lt;br /&gt;proj1&lt;br /&gt;       projid : 100&lt;br /&gt;       comment: "Project 1"&lt;br /&gt;       users  : user1&lt;br /&gt;       groups : other&lt;br /&gt;       attribs: project.cpu-shares=(privileged,1,none)&lt;br /&gt;                rcap.max-rss=20971520&lt;br /&gt;                task.max-lwps=(privileged,3,deny)&lt;br /&gt;proj2&lt;br /&gt;       projid : 101&lt;br /&gt;       comment: "Project 2"&lt;br /&gt;       users  : user2&lt;br /&gt;       
groups : other&lt;br /&gt;       attribs: project.cpu-shares=(privileged,3,none)&lt;br /&gt;                rcap.max-rss=10485760&lt;br /&gt;#&lt;br /&gt;&lt;br /&gt;proj1 specifies three things&lt;br /&gt;&lt;br /&gt;(a) it is assigned 1 CPU share&lt;br /&gt;(b) its resident set size (RSS) is capped at 20 MB (20971520 bytes)&lt;br /&gt;(c) each task in the project can have at most 3 LWPs&lt;br /&gt;&lt;br /&gt;proj2 specifies two things&lt;br /&gt;&lt;br /&gt;(a) it is assigned 3 CPU shares&lt;br /&gt;(b) its resident set size is capped at 10 MB (10485760 bytes)&lt;br /&gt;&lt;br /&gt;What's the intent of this demo on CPU shares? I'd like to show that when the projects compete for the CPU, proj1 gets 1/(1+3) = 25% and proj2 gets 3/(1+3) = 75% of it. Since both projects run the same workload (a program that just spins and hogs the CPU), this is easy to demonstrate here.&lt;br /&gt;&lt;br /&gt;# prstat -J&lt;br /&gt;[.]&lt;br /&gt;PROJID    NPROC  SIZE   RSS MEMORY      TIME  CPU PROJECT                    &lt;br /&gt;  101        2 2408K 1816K   0.0%   0:01:33  74% proj2                      &lt;br /&gt;  100        2 2408K 1816K   0.0%   0:00:56  25% proj1                      &lt;br /&gt;    1        2 7504K 6352K   0.0%   0:00:00 0.2% user.root                  &lt;br /&gt;    0       67  336M  183M   0.3%   0:00:27 0.1% system                     &lt;br /&gt;&lt;br /&gt;&lt;br /&gt;Total: 73 processes, 247 lwps, load averages: 1.75, 0.77, 0.32&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;Now let's change the CPU shares of proj2 to 2 and see what happens :-&lt;br /&gt;&lt;br /&gt;# prctl -n project.cpu-shares -r -v 2 -i project proj2&lt;br /&gt;# prctl -n project.cpu-shares -i project proj2&lt;br /&gt;project: 101: proj2&lt;br /&gt;NAME    PRIVILEGE       VALUE    FLAG   ACTION                       RECIPIENT&lt;br /&gt;project.cpu-shares&lt;br /&gt;       privileged          2       -   none                                 -&lt;br /&gt;       system          65.5K     max   none                             
    -&lt;br /&gt;# prstat -J&lt;br /&gt;[.]&lt;br /&gt;PROJID    NPROC  SIZE   RSS MEMORY      TIME  CPU PROJECT                    &lt;br /&gt;  101        2 2408K 1816K   0.0%   0:03:24  65% proj2                      &lt;br /&gt;  100        2 2408K 1816K   0.0%   0:01:44  33% proj1                      &lt;br /&gt;    1        2 7504K 6344K   0.0%   0:00:00 0.2% user.root                  &lt;br /&gt;    0       67  336M  183M   0.3%   0:00:27 0.1% system                  &lt;br /&gt;&lt;br /&gt;The split now follows the new shares: 1/(1+2) = 33% for proj1 and 2/(1+2) = 67% for proj2.&lt;br /&gt;&lt;br /&gt;&lt;span style="font-size:180%;"&gt;&lt;a href="http://docs.sun.com/app/docs/doc/817-1592/6mhahuosu?a=view" style="font-weight: bold;"&gt;Resource Management in Zones&lt;/a&gt;&lt;span style="font-weight: bold;"&gt; &lt;/span&gt;&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;Here I show how to create a simple zone (without a logical interface attached to it) and then assign CPU shares to the newly created zones.&lt;br /&gt;&lt;br /&gt;# zonecfg -z myzone1&lt;br /&gt;myzone1: No such zone configured&lt;br /&gt;Use 'create' to begin configuring a new zone.&lt;br /&gt;zonecfg:myzone1&gt; create&lt;br /&gt;zonecfg:myzone1&gt; add fs&lt;br /&gt;zonecfg:myzone1:fs&gt; set dir=/mnt/local&lt;br /&gt;zonecfg:myzone1:fs&gt; set special=/opt/sfw&lt;br /&gt;zonecfg:myzone1:fs&gt; set type=lofs&lt;br /&gt;zonecfg:myzone1:fs&gt; set options=[ro,nodevices]&lt;br /&gt;zonecfg:myzone1:fs&gt; end&lt;br /&gt;zonecfg:myzone1&gt; add rctl&lt;br /&gt;zonecfg:myzone1:rctl&gt; set name=zone.cpu-shares&lt;br /&gt;zonecfg:myzone1:rctl&gt; add value (priv=privileged,limit=1,action=none)&lt;br /&gt;zonecfg:myzone1:rctl&gt; end&lt;br /&gt;zonecfg:myzone1&gt; add attr&lt;br /&gt;zonecfg:myzone1:attr&gt; set name=comment&lt;br /&gt;zonecfg:myzone1:attr&gt; set type=string&lt;br /&gt;zonecfg:myzone1:attr&gt; set value="first zone"&lt;br /&gt;zonecfg:myzone1:attr&gt; end&lt;br /&gt;zonecfg:myzone1&gt; set autoboot=true&lt;br /&gt;zonecfg:myzone1&gt; set 
zonepath=/export/home/zone/myzone1&lt;br /&gt;zonecfg:myzone1&gt; verify&lt;br /&gt;zonecfg:myzone1&gt; info&lt;br /&gt;zonepath: /export/home/zone/myzone1&lt;br /&gt;autoboot: true&lt;br /&gt;pool:&lt;br /&gt;inherit-pkg-dir:&lt;br /&gt; dir: /lib&lt;br /&gt;inherit-pkg-dir:&lt;br /&gt; dir: /platform&lt;br /&gt;inherit-pkg-dir:&lt;br /&gt; dir: /sbin&lt;br /&gt;inherit-pkg-dir:&lt;br /&gt; dir: /usr&lt;br /&gt;fs:&lt;br /&gt; dir: /mnt/local&lt;br /&gt; special: /opt/sfw&lt;br /&gt; raw not specified&lt;br /&gt; type: lofs&lt;br /&gt; options: [ro,nodevices]&lt;br /&gt;rctl:&lt;br /&gt; name: zone.cpu-shares&lt;br /&gt; value: (priv=privileged,limit=1,action=none)&lt;br /&gt;attr:&lt;br /&gt; name: comment&lt;br /&gt; type: string&lt;br /&gt; value: "first zone"&lt;br /&gt;zonecfg:myzone1&gt;&lt;br /&gt;&lt;br /&gt;Now let's install and boot the zone. Remember that you can boot, reboot and halt zones independently.&lt;br /&gt;&lt;br /&gt;# zoneadm -z myzone1 install&lt;br /&gt;# zoneadm -z myzone1 boot&lt;br /&gt;&lt;br /&gt;Repeat these steps to create myzone2, but assign 3 shares to myzone2.&lt;br /&gt;&lt;br /&gt;myzone2 will look like this :-&lt;br /&gt;&lt;br /&gt;# zonecfg -z myzone2&lt;br /&gt;zonecfg:myzone2&gt; info&lt;br /&gt;zonepath: /export/home/zone/myzone2&lt;br /&gt;autoboot: true&lt;br /&gt;pool:&lt;br /&gt;inherit-pkg-dir:&lt;br /&gt; dir: /lib&lt;br /&gt;inherit-pkg-dir:&lt;br /&gt; dir: /platform&lt;br /&gt;inherit-pkg-dir:&lt;br /&gt; dir: /sbin&lt;br /&gt;inherit-pkg-dir:&lt;br /&gt; dir: /usr&lt;br /&gt;fs:&lt;br /&gt; dir: /mnt/local&lt;br /&gt; special: /opt/sfw&lt;br /&gt; raw not specified&lt;br /&gt; type: lofs&lt;br /&gt; options: [ro,nodevices]&lt;br /&gt;rctl:&lt;br /&gt; name: zone.cpu-shares&lt;br /&gt; value: (priv=privileged,limit=3,action=none)&lt;br /&gt;attr:&lt;br /&gt; name: comment&lt;br /&gt; type: string&lt;br /&gt; value: "Second Zone"&lt;br /&gt;zonecfg:myzone2&gt;&lt;br /&gt;&lt;br /&gt;Now I run the spin program from 
/mnt/local to put load on each zone.&lt;br /&gt;&lt;br /&gt;# prstat -Z&lt;br /&gt;ZONEID    NPROC  SIZE   RSS MEMORY      TIME  CPU ZONE                       &lt;br /&gt;    3        9   27M   22M   0.1%   0:11:57  74% myzone2                    &lt;br /&gt;    2        9   27M   21M   0.1%   0:04:17  25% myzone1                    &lt;br /&gt;    0       50  188M  123M   0.3%   0:01:17 0.3% global                     &lt;br /&gt;    1        6   24M   19M   0.1%   0:00:10 0.0% myzone3                    &lt;br /&gt;&lt;br /&gt;Total: 74 processes, 256 lwps, load averages: 2.02, 2.00, 1.73&lt;br /&gt;&lt;br /&gt;# prctl -n zone.cpu-shares -i zone myzone1&lt;br /&gt;zone: 2: myzone1&lt;br /&gt;NAME    PRIVILEGE       VALUE    FLAG   ACTION                       RECIPIENT&lt;br /&gt;zone.cpu-shares&lt;br /&gt;       privileged          1       -   none                                 -&lt;br /&gt;       system          65.5K     max   none                                 -&lt;br /&gt;# prctl -n zone.cpu-shares -i zone myzone2&lt;br /&gt;zone: 3: myzone2&lt;br /&gt;NAME    PRIVILEGE       VALUE    FLAG   ACTION                       RECIPIENT&lt;br /&gt;zone.cpu-shares&lt;br /&gt;       privileged          3       -   none                                 -&lt;br /&gt;       system          65.5K     max   none                                 -&lt;br /&gt;#&lt;br /&gt;&lt;br /&gt;You can use the following command to see the status of the zones :-&lt;br /&gt;&lt;br /&gt;# zoneadm list -cv&lt;br /&gt;&lt;br /&gt;&lt;span style="font-weight: bold;font-size:180%;" &gt;&lt;span style="text-decoration: underline;"&gt;&lt;br /&gt;Contracts&lt;/span&gt;&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;ctrun(1) can be used to restart a command automatically in the event of a signal, a core dump or a hardware error. You don't need ps(1) and other monitoring scripts to keep an eye on your command. See contract(4) for more details. 
Here is a small demo of ctrun(1) :-&lt;br /&gt;&lt;br /&gt;# ctrun -f signal -r 0 /spin&lt;br /&gt;# ps -eaf | grep spin&lt;br /&gt;   root  1175  1083   0 13:27:52 pts/1       0:00 ctrun -f signal -r 0 /spin&lt;br /&gt;   root  1177     1  20 13:27:52 pts/1       0:05 /spin&lt;br /&gt;# kill -9 1177&lt;br /&gt;# ps -eaf | grep spin&lt;br /&gt;   root  1175  1083   0 13:27:52 pts/1       0:00 ctrun -f signal -r 0 /spin&lt;br /&gt;   root  1181     1  15 13:28:04 pts/1       0:04 /spin&lt;br /&gt;#&lt;br /&gt;&lt;br /&gt;When the process is killed with SIGKILL, ctrun restarts the command automatically. The -r option specifies the retry count, and zero means retry indefinitely.&lt;br /&gt;&lt;br /&gt;You can also use the -i option to see what events are taking place. For instance, if a process is added to the contract or exits it, the newly created contract posts those events.&lt;br /&gt;&lt;br /&gt;&lt;a href="http://docs.sun.com/app/docs/doc/817-1985/6mhm8o5n0?a=view"&gt;&lt;span style="font-size:180%;"&gt;&lt;span style="text-decoration: underline;"&gt;Service Management Framework&lt;/span&gt;&lt;/span&gt;&lt;/a&gt;&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;This is the first service I created using the Service Management Framework. 
It demonstrates how a simple service can be created and what happens when the service dies.&lt;br /&gt;&lt;br /&gt;# svccfg&lt;br /&gt;svc:&gt; import /&lt;a href="http://blog.sun.com/roller/resources/saurabh_mishra/spin-service.xml"&gt;sample.xml&lt;/a&gt;&lt;br /&gt;svc:&gt; quit&lt;br /&gt;# svcs -p spin&lt;br /&gt;STATE          STIME    FMRI&lt;br /&gt;online         13:41:45 svc:/system/spin:default&lt;br /&gt;              13:41:45     1257 spin&lt;br /&gt;# svcs -l spin&lt;br /&gt;fmri         svc:/system/spin:default&lt;br /&gt;name         First SMF service&lt;br /&gt;enabled      true&lt;br /&gt;state        online&lt;br /&gt;next_state   none&lt;br /&gt;state_time   Tue Feb 01 13:41:45 2005&lt;br /&gt;logfile      /var/svc/log/system-spin:default.log&lt;br /&gt;restarter    svc:/system/svc/restarter:default&lt;br /&gt;contract_id  181&lt;br /&gt;dependency   require_all/none svc:/system/filesystem/local (online)&lt;br /&gt;# ps -eaf | grep spin&lt;br /&gt;   root  1257     1  54 13:41:46 ?           0:17 /spin&lt;br /&gt;# kill -9 1257&lt;br /&gt;# svcs -p spin&lt;br /&gt;STATE          STIME    FMRI&lt;br /&gt;online         13:42:08 svc:/system/spin:default&lt;br /&gt;              13:42:08     1263 spin&lt;br /&gt;# ps -eaf | grep spin&lt;br /&gt;   root  1263     1  45 13:42:08 ?           
0:13 /spin&lt;br /&gt;# svccfg&lt;br /&gt;svc:&gt; select spin&lt;br /&gt;svc:/system/spin&gt; listprop&lt;br /&gt;general                   framework&lt;br /&gt;general/entity_stability  astring  Unstable&lt;br /&gt;general/single_instance   boolean  true&lt;br /&gt;fs                        dependency&lt;br /&gt;fs/entities               fmri     svc:/system/filesystem/local&lt;br /&gt;fs/grouping               astring  require_all&lt;br /&gt;fs/restart_on             astring  none&lt;br /&gt;fs/type                   astring  service&lt;br /&gt;start                     method&lt;br /&gt;start/exec                astring  /spin&lt;br /&gt;start/project             astring  :default&lt;br /&gt;start/resource_pool       astring  :default&lt;br /&gt;start/timeout_seconds     count    60&lt;br /&gt;start/type                astring  method&lt;br /&gt;start/working_directory   astring  :default&lt;br /&gt;application               framework&lt;br /&gt;application/auto_enable   boolean  true&lt;br /&gt;application/stability     astring  Evolving&lt;br /&gt;tm_common_name            template&lt;br /&gt;tm_common_name/C          ustring  "First SMF service"&lt;br /&gt;svc:/system/spin&gt;&lt;br /&gt;&lt;br /&gt;Isn't that great? No need for fancy scripts to monitor the daemons. Moreover, you can specify dependencies between services. For example, if you want to make sure that the network service is up and running before your service starts, you would declare that in a dependency property group (like the fs group above). 
You can also add methods like stop (run whenever the service is stopped, for example with svcadm disable) or refresh (run whenever the service configuration is re-read).&lt;br /&gt;&lt;br /&gt;These two programs were used for the demo.&lt;br /&gt;&lt;br /&gt;spin.c&lt;br /&gt;------&lt;br /&gt;int main(void)&lt;br /&gt;{&lt;br /&gt; for (;;)&lt;br /&gt;  ; /* spin forever, hogging the CPU */&lt;br /&gt;}&lt;br /&gt;&lt;br /&gt;spin-service.c&lt;br /&gt;--------------&lt;br /&gt;#include &amp;lt;unistd.h&amp;gt;&lt;br /&gt;&lt;br /&gt;int main(void)&lt;br /&gt;{&lt;br /&gt; if (fork() == 0) {&lt;br /&gt;  /*&lt;br /&gt;   * Child process: let it spin forever.&lt;br /&gt;   */&lt;br /&gt;  for (;;)&lt;br /&gt;   ;&lt;br /&gt; }&lt;br /&gt; /* The parent returns right away; the spinning child keeps running. */&lt;br /&gt; return (0);&lt;br /&gt;}&lt;br /&gt;&lt;br /&gt;Here are a few good links which should come in handy&lt;br /&gt;(a) &lt;a href="http://docs.sun.com/app/docs/doc/817-1592/6mhahuomo?a=view"&gt;Resource Management Configuration Example&lt;/a&gt;&lt;br /&gt;(b) &lt;a href="http://docs.sun.com/app/docs/doc/817-1592/6mhahuojf?a=view"&gt;CPU-shares&lt;/a&gt;&lt;br /&gt;(c) &lt;a href="http://docs.sun.com/app/docs/doc/817-1592/6mhahuoif?a=view"&gt;Resource Controls&lt;/a&gt;&lt;br /&gt;(d) &lt;a href="http://docs.sun.com/app/docs/doc/817-1592/6mhahuoh6?a=view"&gt;Overview on Projects and Tasks&lt;/a&gt;&lt;br /&gt;(e) &lt;a href="http://docs.sun.com/app/docs/doc/817-1592/6mhahuoka?a=view"&gt;Resource Capping&lt;/a&gt;&lt;br /&gt;(f) &lt;a href="http://docs.sun.com/app/docs/doc/817-1592/6mhahuosu?a=view"&gt;Resource Management in Solaris Zones&lt;/a&gt;&lt;br /&gt;&lt;/pre&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/3121096342469843106-5963259819817819652?l=mishrasdiary.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://mishrasdiary.blogspot.com/feeds/5963259819817819652/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=3121096342469843106&amp;postID=5963259819817819652' title='0 
Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/3121096342469843106/posts/default/5963259819817819652'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/3121096342469843106/posts/default/5963259819817819652'/><link rel='alternate' type='text/html' href='http://mishrasdiary.blogspot.com/2005/02/small-demo-of-resource-management.html' title='Small demo of Resource Management, Contracts and Service Management Framework'/><author><name>Saurabh Mishra</name><uri>http://www.blogger.com/profile/02167304494816941944</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='30' height='32' src='http://2.bp.blogspot.com/_6UgirKk5E5c/SX6dSzcCqUI/AAAAAAAAN2U/hjGoSLVtFQ4/S220/saurabh1.jpg'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-3121096342469843106.post-6233165540849526211</id><published>2005-01-06T15:00:00.000-08:00</published><updated>2009-08-24T15:00:27.915-07:00</updated><title type='text'>inter-process mutex</title><content type='html'>Recently, Surya Prakki (a colleague of mine in the Kernel Sustaining Group) and I worked on an interesting problem. We initially thought it was random memory corruption, but later we figured out that it was an application problem. The customer also told us that the application had worked fine on Solaris 9 and started dumping core after they migrated to Solaris 10.&lt;br /&gt;&lt;br /&gt;The owner field of a mutex_t was getting corrupted with an invalid address, which made the application dump core (SIGSEGV). The problem was easily reproducible on a single-CPU machine. The application was dying in mutex_trylock_adaptive() when it tried to dereference the owner. The owner field of the mutex held strange data at the time of the fault, yet when we looked at the core dump the owner field was zero. 
This surprised us a lot, and we suspected some kind of race.&lt;br /&gt;&lt;br /&gt;All sorts of ideas came to mind, including corruption of registers when the CPU switches context to another thread, or some other thread overwriting a member of the mutex through an array overrun or underrun.&lt;br /&gt;&lt;br /&gt;We started debugging the problem with procfs watchpoints, first setting a watchpoint on the virtual address of the owner field of the mutex_t structure. We used mdb's &lt;addr&gt;:w macro for this purpose. The watchpoint fired frequently because the application had two threads that contended for the lock quite often, so we used a script containing "$c, $r, :c" (print the stack backtrace, print the registers, continue). But whenever the corruption happened, the target process never received the corresponding watchpoint trap, which left us wondering how that was possible.&lt;br /&gt;&lt;br /&gt;We then turned to DTrace and truss to find out what was happening between the point where mutex_unlock() clears the owner field and the point where the process dumps core. Along the way, we ruled out the theories we had started with. We were running out of ideas when we looked carefully at the mutex_t members and noticed that the magic number was correct and the type of the mutex was USYNC_THREAD. We then added DTrace probes on context switches to see whether context switching played any role. In doing so, we noticed that another process got onto the CPU after the process that dumped core had released the mutex. That rang a bell. We also noticed that the same mutex virtual address was involved when this context switch happened.&lt;br /&gt;&lt;br /&gt;We looked at the pmap(1) output and noticed that the mutex was in a shared memory segment. The other process had used the same shared memory key (see shmget(2)). 
In other words, the mutex was being used by both processes. We noticed that the corrupted value was a valid address in the other process (a thread address, in fact). This surprised us again, because the type of the mutex was USYNC_THREAD and we had *believed* that the mutex was used only among the threads of one process. This disappointed us a lot. If a mutex is to be used between processes, its type has to be USYNC_PROCESS: the owner field of a USYNC_THREAD mutex holds a thread address that is only meaningful inside the process that took the lock, so another process cannot safely dereference it.&lt;br /&gt;&lt;br /&gt;From the mutex_init(3THR) man page:&lt;br /&gt;&lt;br /&gt;    USYNC_THREAD&lt;br /&gt;          The mutex can synchronize threads only in this process. arg is ignored.&lt;br /&gt;&lt;br /&gt;    USYNC_PROCESS&lt;br /&gt;          The mutex can synchronize threads in this process and other processes. arg is ignored. The object initialized with this attribute must be allocated in memory shared between processes, either in System V shared memory (see shmop(2)) or in memory mapped to a file (see mmap(2)). If the object is not allocated in such shared memory, it will not be shared between processes.&lt;br /&gt;&lt;br /&gt;We asked the submitter of the bug to change the mutex type to USYNC_PROCESS, and everything worked fine. 
The customer later confirmed our diagnosis and modified the application accordingly.&lt;br /&gt;&lt;br /&gt;After spending a week or so on this, the bottom line is: don't take things for granted :)&lt;/addr&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/3121096342469843106-6233165540849526211?l=mishrasdiary.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://mishrasdiary.blogspot.com/feeds/6233165540849526211/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=3121096342469843106&amp;postID=6233165540849526211' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/3121096342469843106/posts/default/6233165540849526211'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/3121096342469843106/posts/default/6233165540849526211'/><link rel='alternate' type='text/html' href='http://mishrasdiary.blogspot.com/2005/01/inter-process-mutex.html' title='inter-process mutex'/><author><name>Saurabh Mishra</name><uri>http://www.blogger.com/profile/02167304494816941944</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='30' height='32' src='http://2.bp.blogspot.com/_6UgirKk5E5c/SX6dSzcCqUI/AAAAAAAAN2U/hjGoSLVtFQ4/S220/saurabh1.jpg'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-3121096342469843106.post-4566775550372839380</id><published>2004-12-13T15:00:00.000-08:00</published><updated>2009-08-24T15:00:59.422-07:00</updated><title type='text'>Solaris 10 presentation to ISV's</title><content type='html'>&lt;pre wrap="true"&gt;The other day, I presented cool features of Solaris 10 to ISVs in Mumbai (9th Dec) and New Delhi (10th Dec). 
This technology show was organized by Sun for ISVs, and most of the attendees were Java developers. It was very well received. People didn't ask many questions because of lack of time, but most of the questions that did come up were about Solaris Containers and Solaris x86. Also because of lack of time, I couldn't show the DTrace, Zones and ZFS demos. Since the talk was meant for ISVs only, the audience was only around 80 people. We showed a demo of the 3-D Looking Glass desktop as well, and people were very happy to see it. The other talks were on J2EE and Java. Thanks.&lt;/pre&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/3121096342469843106-4566775550372839380?l=mishrasdiary.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://mishrasdiary.blogspot.com/feeds/4566775550372839380/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=3121096342469843106&amp;postID=4566775550372839380' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/3121096342469843106/posts/default/4566775550372839380'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/3121096342469843106/posts/default/4566775550372839380'/><link rel='alternate' type='text/html' href='http://mishrasdiary.blogspot.com/2004/12/solaris-10-presentation-to-isvs.html' title='Solaris 10 presentation to ISV&apos;s'/><author><name>Saurabh Mishra</name><uri>http://www.blogger.com/profile/02167304494816941944</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='30' height='32' 
src='http://2.bp.blogspot.com/_6UgirKk5E5c/SX6dSzcCqUI/AAAAAAAAN2U/hjGoSLVtFQ4/S220/saurabh1.jpg'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-3121096342469843106.post-7768939789918634010</id><published>2004-10-11T15:01:00.000-07:00</published><updated>2009-08-24T15:01:35.830-07:00</updated><title type='text'>Solaris 10 Presentation</title><content type='html'>&lt;pre wrap="true"&gt;Yesterday (11th October), Pramod Batni and I presented cool features of Solaris 10 and DTrace to Hughes Software (www.hssworld.com). It was well received by the folks there, and they were very keen on deploying Solaris Zones (http://wwws.sun.com/software/solaris/10/inside.jsp). They also wanted to use DTrace (http://wwws.sun.com/software/solaris/10/inside.jsp) to improve the performance of their application. Only a few people could attend the Solaris 10 and DTrace talk, but those who did were willing to adopt the new technologies in Solaris 10 and were very keen on a more detailed talk on Solaris 10, especially on Solaris Zones and DTrace. Girish (from GSO) also gave a presentation on our processor roadmap, Sun Cluster, and volume server products.&lt;br /&gt;&lt;br /&gt;This is not our first presentation to an Indian customer. We have presented the cool features of Solaris 10 to many companies in India, and I have presented them at places like Sun Developer Days (New Delhi), Sun Technology Conference (Bangalore), Infosys (Finacle division), and Nucleus Software. It has been a great experience for me to present Solaris 10 at such places. In fact, we gave a DTrace demo to the Infosys folks (developers of the Finacle software). 
The Infosys folks were very impressed with DTrace and want to use it for Java programs as well.&lt;br /&gt;&lt;br /&gt;We are looking for more such companies and developers to adopt Solaris 10.&lt;br /&gt;&lt;br /&gt;Thanks,&lt;br /&gt;--&lt;br /&gt;Saurabh Mishra, Solaris Kernel Sustaining and Engineering.&lt;br /&gt;&lt;br /&gt;&lt;/pre&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/3121096342469843106-7768939789918634010?l=mishrasdiary.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://mishrasdiary.blogspot.com/feeds/7768939789918634010/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=3121096342469843106&amp;postID=7768939789918634010' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/3121096342469843106/posts/default/7768939789918634010'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/3121096342469843106/posts/default/7768939789918634010'/><link rel='alternate' type='text/html' href='http://mishrasdiary.blogspot.com/2004/10/solaris-10-presentation.html' title='Solaris 10 Presentation'/><author><name>Saurabh Mishra</name><uri>http://www.blogger.com/profile/02167304494816941944</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='30' height='32' src='http://2.bp.blogspot.com/_6UgirKk5E5c/SX6dSzcCqUI/AAAAAAAAN2U/hjGoSLVtFQ4/S220/saurabh1.jpg'/></author><thr:total>0</thr:total></entry></feed>
