Wednesday, September 2, 2009

Writing a new Ethernet device driver for Solaris

This blog entry describes what you should keep in mind while writing a new Ethernet device driver for Solaris. It does not cover LSO, hardware checksum offload, or multiple RX rings, as I have not written code for those features.

Most Ethernet controllers use descriptor-based TX and RX. The starting point for writing a new device driver is getting attach() and detach() working. That part is fairly easy, but you will usually want to do the following things in attach():

- Get the vendor/device-id and check the revision to make sure you have the correct chip.

- Pre-allocate all DMA buffers for TX. You have to pre-allocate all RX buffers anyway. This is the simplest model you can think of, but it requires a bcopy (an extra copy during TX/RX). But hey, you are just starting...

- Allocate interrupts, and register with MAC and MII.

- Reset the PHY if required, and do it before starting MII (the mii_start() function). Reset the device too...

- You must enable device interrupts before returning from attach(), and this should be the last operation attach() performs.

- The MII layer in Solaris takes care of PHY operations and dladm link properties too, so you need getprop and setprop entries in your MAC callbacks (m_callbacks). MII can also take care of some common statistics and ndd. You need to implement the PHY read/write/reset operations, which are PHY specific.
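
The ordering of the steps above matters most when something fails halfway. Here is a rough user-space sketch of one way to keep track of it, assuming a progress bitmask and an unwind routine that undoes completed steps in reverse order. The bitmask idiom, the step names, and drv_attach()/drv_unwind() are all hypothetical, and the step bodies are stand-ins for the real DDI, MAC and MII calls.

```c
#include <assert.h>
#include <stdbool.h>

/* Attach steps, listed in the order they run (names are hypothetical). */
enum {
	STEP_CHIP_ID,	/* check vendor/device-id and revision */
	STEP_DMA,	/* pre-allocate rings and TX/RX buffers */
	STEP_INTR,	/* allocate interrupts */
	STEP_MAC_MII,	/* register with MAC and MII */
	STEP_RESET,	/* reset PHY and device, then start MII */
	STEP_INTR_ON,	/* enable device interrupts: always last */
	NSTEPS
};

typedef struct drv_state {
	unsigned done;		/* bitmask of completed steps */
	int fail_step;		/* test knob: step that fails, -1 for none */
	int undo_log[NSTEPS];	/* order in which steps were undone */
	int nundone;		/* number of entries in undo_log */
} drv_state_t;

/* Stand-in for the real work of step i; it fails only when told to. */
static bool
do_step(drv_state_t *sp, int i)
{
	if (i == sp->fail_step)
		return (false);
	sp->done |= 1u << i;
	return (true);
}

/* Undo completed steps in reverse order; detach() could share this. */
void
drv_unwind(drv_state_t *sp)
{
	int i;

	for (i = NSTEPS - 1; i >= 0; i--) {
		if (sp->done & (1u << i)) {
			assert(sp->nundone < NSTEPS);
			sp->undo_log[sp->nundone++] = i;
			sp->done &= ~(1u << i);	/* real driver: free/unregister */
		}
	}
}

/* Run every step in order; on failure undo whatever was completed. */
bool
drv_attach(drv_state_t *sp)
{
	int i;

	sp->done = 0;
	sp->nundone = 0;
	for (i = 0; i < NSTEPS; i++) {
		if (!do_step(sp, i)) {
			drv_unwind(sp);
			return (false);
		}
	}
	return (true);
}
```

With fail_step set to STEP_MAC_MII, drv_attach() returns false after undoing the interrupt, DMA and chip-ID steps in that order, and device interrupts are never enabled.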

One thing I'd like to point out here: have a single pair of functions to allocate and free a DMA handle and its memory. It simplifies the code a lot. The same function can be used to allocate the TX/RX descriptor rings, the DMA buffers for TX/RX, and memory for statistics or a control block. You need to pass a DMA attribute structure and a flag (the DMA read/write flag). A typical example of such a function looks like this:

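/* Everything needed to use, and later free, one DMA allocation. */
typedef struct xxxx_dma_data {
	ddi_dma_handle_t	hdl;	/* DMA handle */
	ddi_acc_handle_t	acchdl;	/* access handle for the memory */
	ddi_dma_cookie_t	cookie;	/* first DMA cookie (device address) */
	caddr_t			addr;	/* kernel virtual address */
	size_t			len;	/* length actually allocated */
	uint_t			count;	/* number of DMA cookies */
} xxxx_dma_t;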


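/*
 * Allocate a DMA handle, allocate memory for it and bind the two.
 * 'flag' is the transfer direction (DDI_DMA_READ and/or DDI_DMA_WRITE).
 * xxxx_mem_attr is the driver's global ddi_device_acc_attr_t.
 * Returns NULL on failure.
 */
xxxx_dma_t *
xxxx_alloc_a_dma_blk(xxxx_t *xxxxp, ddi_dma_attr_t *attr, size_t size, int flag)
{
	int err;
	xxxx_dma_t *dma;

	dma = kmem_zalloc(sizeof (xxxx_dma_t), KM_SLEEP);

	err = ddi_dma_alloc_handle(xxxxp->xxxx_dip, attr,
	    DDI_DMA_SLEEP, NULL, &dma->hdl);

	if (err != DDI_SUCCESS) {
		goto fail;
	}

	err = ddi_dma_mem_alloc(dma->hdl,
	    size, &xxxx_mem_attr, DDI_DMA_CONSISTENT, DDI_DMA_SLEEP, NULL,
	    &dma->addr, &dma->len, &dma->acchdl);

	if (err != DDI_SUCCESS) {
		ddi_dma_free_handle(&dma->hdl);
		goto fail;
	}

	/*
	 * Only the first cookie is saved; if the block must be one
	 * contiguous region, set dma_attr_sgllen to 1 in 'attr'.
	 */
	err = ddi_dma_addr_bind_handle(dma->hdl, NULL, dma->addr,
	    dma->len, flag | DDI_DMA_CONSISTENT, DDI_DMA_SLEEP,
	    NULL, &dma->cookie, &dma->count);

	/* A successful bind returns DDI_DMA_MAPPED. */
	if (err != DDI_DMA_MAPPED) {
		ddi_dma_mem_free(&dma->acchdl);
		ddi_dma_free_handle(&dma->hdl);
		goto fail;
	}

	return (dma);

fail:
	kmem_free(dma, sizeof (xxxx_dma_t));
	return (NULL);
}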

/* Undo xxxx_alloc_a_dma_blk() in reverse order: unbind, free memory, free handle. */
void
xxxx_free_a_dma_blk(xxxx_dma_t *dma)
{
	if (dma != NULL) {
		(void) ddi_dma_unbind_handle(dma->hdl);
		ddi_dma_mem_free(&dma->acchdl);
		ddi_dma_free_handle(&dma->hdl);
		kmem_free(dma, sizeof (xxxx_dma_t));
	}
}


Some of the corner cases you must take care of:

- Test the code path where no more TX descriptors are available for the driver to send a packet. You must call mac_tx_update() once a descriptor is reclaimed. Some drivers start reclaiming once a threshold is reached.

- Make sure you handle the RX FIFO overflow interrupt properly. The driver may not have enough RX descriptors left to receive further packets, so you must consume the posted RX descriptors. Some chips require a reset on RX FIFO overflow.
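
The TX exhaustion path is easy to get wrong, so here is a small user-space model of it. It is only a sketch under assumptions: tx_ring_t, the ring size, the reclaim threshold and every function name are made up for illustration, the done[] array stands in for the hardware's "descriptor completed" status bit, and the tx_updates counter stands in for the real call to mac_tx_update().

```c
#include <assert.h>
#include <stdbool.h>

#define	TX_RING_SIZE		8	/* hypothetical ring size */
#define	TX_RECLAIM_THRESH	2	/* hypothetical: reclaim when this few are free */

typedef struct tx_ring {
	unsigned head;			/* next descriptor to hand to hardware */
	unsigned tail;			/* oldest descriptor not yet reclaimed */
	unsigned nfree;			/* descriptors available to tx_send() */
	bool done[TX_RING_SIZE];	/* stand-in for the hardware "done" bit */
	bool blocked;			/* a send failed for lack of descriptors */
	unsigned tx_updates;		/* counts simulated mac_tx_update() calls */
} tx_ring_t;

void
tx_ring_init(tx_ring_t *r)
{
	unsigned i;

	r->head = r->tail = 0;
	r->nfree = TX_RING_SIZE;
	r->blocked = false;
	r->tx_updates = 0;
	for (i = 0; i < TX_RING_SIZE; i++)
		r->done[i] = false;
}

/* Reclaim completed descriptors from the tail; returns how many were freed. */
unsigned
tx_reclaim(tx_ring_t *r)
{
	unsigned n = 0;

	while (r->nfree < TX_RING_SIZE && r->done[r->tail]) {
		r->done[r->tail] = false;
		r->tail = (r->tail + 1) % TX_RING_SIZE;
		r->nfree++;
		n++;
	}
	return (n);
}

/* Queue one packet. Returns false when no descriptor is available. */
bool
tx_send(tx_ring_t *r)
{
	if (r->nfree <= TX_RECLAIM_THRESH)
		(void) tx_reclaim(r);
	if (r->nfree == 0) {
		r->blocked = true;	/* remember to wake the MAC layer later */
		return (false);
	}
	r->head = (r->head + 1) % TX_RING_SIZE;
	r->nfree--;
	return (true);
}

/* TX-complete interrupt path: reclaim, then wake MAC if it was blocked. */
void
tx_complete_intr(tx_ring_t *r)
{
	if (tx_reclaim(r) > 0 && r->blocked) {
		r->blocked = false;
		r->tx_updates++;	/* real driver: mac_tx_update() here */
	}
}

/* Test helper: pretend hardware finished the n oldest in-flight packets. */
void
hw_complete(tx_ring_t *r, unsigned n)
{
	unsigned i;

	assert(n <= TX_RING_SIZE - r->nfree);
	for (i = 0; i < n; i++)
		r->done[(r->tail + i) % TX_RING_SIZE] = true;
}
```

Filling the ring, failing one more send, and then completing a few descriptors should produce exactly one update. If the blocked flag is never set, or never checked after a reclaim, the MAC layer stops sending to the driver and the link appears hung.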

General things that you may want to consider:

- Call mac_tx_update() outside the lock.

- Try to raise a software interrupt whenever a hardware interrupt is raised. Don't spend too much time processing packets in the hardware interrupt context.

- Make sure the chip is quiesced when detach() is called.

- Use DDI's ddi_periodic_add(9F) instead of timeout(9F).

- Test suspend/resume and quiesce (for fast reboot to work).

- I think most multicast filters are hash-based, but I have seen a CAM (Content Addressable Memory) based filter too. Supporting multicast can get tricky, and when it does, just enable ALL multicast. Hash-based multicast filters are easy to implement. You can keep a reference count for every bit in the 64-bit hash variable. Once the reference count for a bit reaches zero, you clear the bit. Otherwise it should remain set.

- Make sure you handle link status changes properly and re-program the MAC registers if the new link speed/duplex requires it.

- Look for memory leaks (set kmem_flags = 0xf in /etc/system, take a crash dump, then run ::findleaks in mdb).
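
The reference-counted hash filter described above can be sketched in plain user-space C. Treat it as a sketch under assumptions: mcast_filter_t, mcast_hash_bit(), mcast_add() and mcast_remove() are names made up for illustration, and the hash (the top 6 bits of a CRC-32 over the 6-byte address) is a placeholder, since every chip defines its own hash function.

```c
#include <assert.h>
#include <stdint.h>

#define	MCAST_HASH_BITS	64

typedef struct mcast_filter {
	uint64_t hash;				/* value to program into the chip */
	uint32_t refcnt[MCAST_HASH_BITS];	/* addresses mapped to each bit */
} mcast_filter_t;

/*
 * Placeholder hash (an assumption, check the chip's datasheet):
 * CRC-32 over the address, keeping the top 6 bits, so the result is 0..63.
 */
uint32_t
mcast_hash_bit(const uint8_t addr[6])
{
	uint32_t crc = 0xFFFFFFFFU;
	int i, b;

	for (i = 0; i < 6; i++) {
		crc ^= (uint32_t)addr[i] << 24;
		for (b = 0; b < 8; b++) {
			if (crc & 0x80000000U)
				crc = (crc << 1) ^ 0x04C11DB7U;
			else
				crc <<= 1;
		}
	}
	return (crc >> 26);
}

/* Add an address. Returns 1 if the hash value changed, 0 otherwise. */
int
mcast_add(mcast_filter_t *f, const uint8_t addr[6])
{
	uint32_t bit = mcast_hash_bit(addr);

	if (f->refcnt[bit]++ == 0) {
		f->hash |= UINT64_C(1) << bit;	/* first user of this bit */
		return (1);
	}
	return (0);
}

/* Remove an address. Returns 1 if the hash value changed, 0 otherwise. */
int
mcast_remove(mcast_filter_t *f, const uint8_t addr[6])
{
	uint32_t bit = mcast_hash_bit(addr);

	assert(f->refcnt[bit] > 0);	/* removing an address never added */
	if (--f->refcnt[bit] == 0) {
		f->hash &= ~(UINT64_C(1) << bit);	/* last user is gone */
		return (1);
	}
	return (0);
}
```

Both functions return nonzero only when the 64-bit value actually changed, so the driver knows when it has to rewrite the chip's hash registers and when it can skip the register write.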


You can use NICDRV or HCTS for testing. NICDRV stress tests most of the components in your driver, including MAXQ, FTP, ping with different payloads, load/unload of the driver, multicast, dladm(1M) features, VLAN, VNIC, etc.