This post continues the kernel driver development series with “advanced” examples using the AXI Timer and AXI DMA. In fact, they are not as advanced as they could be, though they should be sufficient to begin with. A more advanced usage would be the Scatter-Gather (SG) mode, a cyclic mode, or even fully custom RTL code with its own drivers. Nevertheless, writing an RTL trigger or timer is not that hard. Furthermore, it’s always a good idea to have a reference when in doubt. A good reference might be the Xilinx DMA drivers on GitHub.

Note: This post was originally password-protected, but in the end (late 2021) I decided to remove the protection 😇 Thank you for your support and have fun reading and coding!

AXI Timer with simple IRQ Handler

Now that we know how to build a kernel module, let’s see how to implement an interrupt handler. For this purpose, we are going to use the AXI Timer IP core from Xilinx mentioned above. The simplest way to use it is to instantiate the component in the IP Integrator (Block Diagram) that already contains the PS part of the processor (either the Zynq or the ZynqMP processing system). We need to access the AXI4-L interface of the IP and to route the interrupt line into the PS. The IRQ_F2P port needs to be enabled in the PS IP configuration to allow the PL to interrupt the PS. The IP Integrator’s address map needs to assign the AXI4-L address range for the timer through either GP_0 or GP_1 (for Zynq7000). Which port you choose is irrelevant, as is the address, as long as the address range complies with the IP requirements. The minimum working scheme is shown below (pdf here):

It is important to note the concat block, which by default uses “auto” width for all ports. I HIGHLY recommend not doing this and instead manually specifying the width of each port of the concat IP. The reason is that if you connect your interrupt so that some ports stay unused (i.e. your interrupt uses index 3, but index 2 is unused and configured to auto), then you are likely to get a headache if you modify the device tree yourself rather than letting the tool do it. The tool knows that index 2 is unused and connects your interrupt to port 2 instead of port 3. Manually specifying port 3 in the DeviceTree will therefore never work. Before we delve into the interrupt handler and the DeviceTree, let’s see how interrupts are defined there:

  • DT node: interrupts = <0 31 4>; which has the general form <X Y Z>
  • X: Interrupt type, either SPI or PPI. Note that SPI = Shared Peripheral Interrupt and PPI = Private Peripheral Interrupt.
  • Y: Interrupt line identification
  • Z: Interrupt sensitivity

Interrupt type (X):

  • 0 – SPI (shared peripheral interrupt)
  • 1 – PPI (private, per-CPU interrupt)

Interrupt sensitivity (Z):

  • 1 (rising edge) – The interrupt is triggered only on a transition from low to high.
  • 4 (level sensitive, active high) – The interrupt is triggered continuously for as long as the line remains high.

Interrupt Line (Y):

This is the most important number, so always double check it. Look up the corresponding TRM and find the interrupt mappings into the GIC (Generic Interrupt Controller). You will find that IRQF2P[7:0] maps to [68:61] and IRQF2P[15:8] maps to [91:84] (for Zynq7000). These “are” the numbers you are looking for, but not in the form the device tree expects. They are GIC interrupt IDs, and the GIC reserves IDs 0 to 31 for software-generated and private interrupts, so the SPIs start at ID 32, whereas the device tree counts SPIs from 0. The PL interrupts are SPIs, so you specify 0 for X, and the correct interrupt identification for Y is the TRM number “-32”. I.e. IRQF2P[7:0] maps NOT to [68:61], but to [36:29], 29 being PL index 0. The Vivado tool flow handles this for you, but if you want to write the entry yourself, you have to take care.
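The “-32” rule above is easy to get wrong by hand, so here is a minimal sketch of the arithmetic in plain user-space C. The function names are my own and purely illustrative; the ID ranges are the Zynq7000 ones quoted above, so the second helper does not apply to ZynqMP.

```c
#include <assert.h>
#include <stdint.h>

/* GIC IDs 0..31 are software-generated and private interrupts;
 * shared peripheral interrupts (SPIs) start at ID 32. */
#define GIC_SPI_BASE 32u

/* Convert a GIC interrupt ID taken from the TRM into the number that goes
 * into the second cell (Y) of a device tree "interrupts" entry for an SPI. */
uint32_t gic_id_to_dt_spi(uint32_t gic_id)
{
    assert(gic_id >= GIC_SPI_BASE);
    return gic_id - GIC_SPI_BASE;
}

/* Zynq7000 only: IRQ_F2P[7:0] -> GIC IDs 61..68,
 *                IRQ_F2P[15:8] -> GIC IDs 84..91. */
uint32_t zynq7000_f2p_to_gic_id(uint32_t f2p_index)
{
    assert(f2p_index < 16u);
    return (f2p_index < 8u) ? 61u + f2p_index : 84u + (f2p_index - 8u);
}

/* Combined helper: IRQ_F2P index -> value for the Y cell. */
uint32_t zynq7000_f2p_to_dt_spi(uint32_t f2p_index)
{
    return gic_id_to_dt_spi(zynq7000_f2p_to_gic_id(f2p_index));
}
```

For example, zynq7000_f2p_to_dt_spi(0) gives 29 and zynq7000_f2p_to_dt_spi(1) gives 30, which correspond to entries of the form <0 29 4> and <0 30 4>.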

Before moving on to the DT, one last important thing: the interrupt handler we are going to write later on MUST clear the interrupt flag. Let’s see what the overall device tree entry looks like for the AXI Timer:

What is most important here: again, the interrupt identification and the “compatible” property, which says which driver is compatible with the device. I manually changed the default value (compatible = “xlnx,axi-timer-2.0”, “xlnx,xps-timer-1.00.a”;) to “xlnx,beechwood-irq-driver” so that Xilinx’s own driver for the core is not used (the kernel will not know which driver to use for the DT node until we load our module). Also note the “reg” property, which defines the address at which the AXI4-L interface is located. For this example, I have chosen 0x80020000, which lies on the GP1 port of the Zynq PS (yes, unlike in the previous picture) and has a dedicated range of 64KB (0x10000). However, do not try to load the device driver while the memory address is not accessible (i.e. until the FPGA is programmed with the AXI Timer bitstream), because the system will freeze. The other node properties are not especially important for us. You could remove the clocking section completely, as we are not going to disable, enable or change the clock frequencies when loading or removing the kernel driver.

I have tried to make the code as self-explanatory as possible, but here are the most important points anyway. The module is a platform driver. The most important part of the code is the table that defines the compatibility options. Upon loading the driver, the kernel goes through the device tree and tries to find a compatible node, which in our case is “xlnx,beechwood-irq-driver”. The module must have “probe” and “remove” functions associated with it. In the probe function, we register the interrupt handler by looking at the corresponding device tree node (the irq_of_parse_and_map and request_irq functions). After registering the interrupt handler routine, we configure the AXI Timer to generate interrupts. In this example, I have chosen an interrupt period of 0.25 sec, based on the frequency the timer is running at. In my case this is the FCLK_CLK0 PL fabric clock, which needs to have a frequency of 100 MHz. This is where you need to know how the FSBL has configured your platform in order to get this period: if the period you observe differs, then the FCLK0 clock is not running at 100 MHz. The code that sets up the period is based on PG079. A small hint here: the Xilinx timer generates an interrupt on the overflow of the 32b counter. Therefore you have to adjust the load value so that the overflow occurs after the required time (0.25 s).
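To make the load value arithmetic concrete, here is a minimal user-space sketch. It assumes the generate-mode timing formulas as I read them in PG079 (interval = (MAX_COUNT - TLR + 2) clock cycles when counting up, (TLR + 2) cycles when counting down); treat the “+2” offset as an assumption to verify against your version of the guide. All function names are hypothetical and are not taken from the driver described above.

```c
#include <assert.h>
#include <stdint.h>

#define AXI_TIMER_MAX_COUNT 0xFFFFFFFFu

/* Number of timer clock cycles in the requested period. */
uint64_t axi_timer_cycles(uint32_t clk_hz, uint32_t period_ms)
{
    return ((uint64_t)clk_hz * period_ms) / 1000u;
}

/* Load value for an up-counting timer (interrupt on overflow).
 * Assumed formula: cycles = MAX_COUNT - TLR + 2. */
uint32_t axi_timer_load_up(uint32_t clk_hz, uint32_t period_ms)
{
    uint64_t cycles = axi_timer_cycles(clk_hz, period_ms);
    assert(cycles >= 2u && cycles <= (uint64_t)AXI_TIMER_MAX_COUNT + 2u);
    return (uint32_t)((uint64_t)AXI_TIMER_MAX_COUNT + 2u - cycles);
}

/* Load value for a down-counting timer.
 * Assumed formula: cycles = TLR + 2. */
uint32_t axi_timer_load_down(uint32_t clk_hz, uint32_t period_ms)
{
    uint64_t cycles = axi_timer_cycles(clk_hz, period_ms);
    assert(cycles >= 2u && cycles <= (uint64_t)AXI_TIMER_MAX_COUNT + 2u);
    return (uint32_t)(cycles - 2u);
}

/* Inverse of axi_timer_load_up: the period (in ms) you actually get from a
 * given load value if the fabric clock is clk_hz. */
uint32_t axi_timer_period_ms_up(uint32_t clk_hz, uint32_t load)
{
    uint64_t cycles = (uint64_t)AXI_TIMER_MAX_COUNT + 2u - load;
    assert(clk_hz > 0u);
    return (uint32_t)((cycles * 1000u) / clk_hz);
}
```

With a 100 MHz clock and a 250 ms period, the up-counting load value comes out as 0xFE8287C1. Feeding the same load value into axi_timer_period_ms_up with a 50 MHz clock gives 500 ms, which is the kind of mismatch you would see in dmesg if the FSBL configured FCLK0 differently than expected.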

After successfully compiling the module, programming the FPGA with the new bitstream, modifying the DeviceTree and inserting the module, you should see a kernel log via dmesg similar to the following output (note that the timestamp differences should be 0.25 sec). Important note: make sure the interrupt gets cleared in the handler. If it is not, interrupts will be generated continuously (for a level-sensitive interrupt definition) and the system will freeze! Also don’t forget to unregister the interrupt handler in the “remove” function.
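Why the handler has to write the flag back can be shown with a small software model of the timer’s control/status register. This is only a sketch: the bit positions are the TCSR0 ones as I read them in PG079 (T0INT is cleared by writing 1 to it), and the model function is a stand-in for the hardware, not driver code.

```c
#include <assert.h>
#include <stdint.h>

/* TCSR0 bits (assumed from PG079). */
#define TCSR_ARHT0 (1u << 4)  /* auto reload */
#define TCSR_ENIT0 (1u << 6)  /* interrupt enable */
#define TCSR_ENT0  (1u << 7)  /* timer enable */
#define TCSR_T0INT (1u << 8)  /* interrupt flag, write 1 to clear */

/* Value the handler should write back to acknowledge the interrupt:
 * keep the current configuration and set T0INT to clear the flag. */
uint32_t tcsr_ack_value(uint32_t tcsr_read)
{
    return tcsr_read | TCSR_T0INT;
}

/* Software model of the register: configuration bits take the written
 * value, T0INT stays pending unless a 1 is written to it. */
uint32_t tcsr_model_write(uint32_t current, uint32_t written)
{
    uint32_t still_pending = current & TCSR_T0INT & ~written;
    return (written & ~TCSR_T0INT) | still_pending;
}

/* Small self-check of the two cases. */
void tcsr_self_check(void)
{
    uint32_t cfg = TCSR_ARHT0 | TCSR_ENIT0 | TCSR_ENT0;
    uint32_t pending = cfg | TCSR_T0INT;

    /* Correct acknowledge: the flag is cleared, the configuration is kept. */
    assert(tcsr_model_write(pending, tcsr_ack_value(pending)) == cfg);

    /* Masking the flag out before writing leaves the interrupt pending. */
    assert(tcsr_model_write(pending, pending & ~TCSR_T0INT) == pending);
}
```

The second case in tcsr_self_check is the trap: with a level-sensitive definition (Z = 4), a flag that stays pending keeps the line high and the handler is re-entered immediately.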

Congratulations! Now we are able to use the AXI Timer, modify the FPGA, DeviceTree and register some interrupts.

DMA With Interrupt and Cache Coherency

 

Before we play with the DMA and transfer some pseudo-random bytes on a 100 MHz clock, a few notes. For simplicity, we are going to use the AXI DMA IP (PG021), with both the S2MM and MM2S channels and only in Direct Register mode. This simplifies the “testbench”, as using Scatter-Gather would be more complicated. Furthermore, we assume that the S_AXI_HP0 port has been pre-configured by the FSBL to a width of 32b (Zynq supports 32b/64b; ZynqMP also supports a 128b width). If we used the hardware cache coherency ports (S_AXI_ACP for Zynq7000 or S_AXI_HPC[0:1] for ZynqMP), we would not need to worry about coherency and could just write a basic user-space driver for the DMA. But because the intention is not to make things that simple, we will use the S_AXI_HP0 port of the Zynq PS. Let’s start with the IP Integrator (pdf here):

Shown is the simplest diagram we can use. There is one AXI4 interconnect for the DMA’s AXI4-L interface connected to GP0 and another AXI4 interconnect in front of the S_AXI_HP0 port (make sure your platform was initialized with the correct bit width, as this would otherwise cause problems). All the clocks in the design are synchronous, and the S2MM (Stream to Memory) and MM2S (Memory to Stream) interrupts are routed to the IRQ_F2P port at indexes 0 and 1. What this diagram does not show, however, is how the S2MM and MM2S streams interact. This is external to the diagram: basically, there is an additional Xilinx FIFO and a simple piece of RTL code that increments each incoming 32b word by 1:

As you can see, the FIFO is there only because I didn’t want to bother with proper handling of the tready signaling. Instead, a fill threshold of 500 out of 512 samples is used as the criterion for deasserting tready towards the MM2S channel. It’s also worth mentioning that when an AXI DMA channel is reset, the reset is propagated into the FIFO (the mm2s_rst_n and s2mm_rst_n signals). It is good practice to reset the FIFO before starting additional transfers. For some applications, it might be necessary to use FIFOs for both channels, but one FIFO is sufficient here. Now that we have the RTL ready, let’s move on to the device tree:

The reg property is, as always, based on the IP Integrator’s address editor, and this time it represents the GP_1 port of the Zynq PS. Also, the interrupt numbers are not identical to the picture provided; the correct values would be <0 29 4>,<0 30 4>. The reg property is useful if we are going to parse the address from the device tree. In that case, the platform_get_resource and devm_ioremap_resource functions could be used. However, since we know the exact addresses (and in fact also the interrupt numbers), the reg property is not of much use to us and we can just use request_mem_region and ioremap_nocache (ioremap_nocache was removed in Linux 5.6; on such kernels use plain ioremap, which is uncached as well). In fact, writing an entire driver without any device tree modification is an option, but it’s somewhat more convenient to have the kernel module parse the parameters and then change any settings in the DT afterwards. What is important is again the compatible property, which is set to “xlnx,beechwood-dma-driver”. This can really be an arbitrary string, as long as the node matches the compatible value in the module’s table.

I have also used an LCG generator to generate some pseudo-random data and verify that there are no issues with coherency and the transfers (this is a PRN sequence generator: once data have been generated from a seed, we can regenerate them using the same seed, which the code does with the statement Seed = Seed_Orig;). Quite obviously, we also need 2 buffers for the RD and WR operations (although theoretically a single buffer could be used for this purpose as well). The order in which the DMA channels are reset doesn’t matter, but I do recommend starting the channels in the following order: S2MM first and then MM2S. This way, you can be sure that when data start coming from MM2S, the S2MM channel is ready to accept them, since it has already been configured and is waiting for valid transfers. There are also 2 separate interrupt handlers for the channels. Since the correct usage of interrupts wasn’t the goal of this exercise, they just mirror the “polling” mechanism (which is by default left out using IFDEF macros), and the only thing they do is clear the interrupt flags of the AXI DMA. The most important thing, however, is the following piece of code:

  • dma_alloc_coherent

This function uses the DMA framework and guarantees that any CPU access to the allocated region is treated as a “cache miss”, i.e. it goes directly to the memory. On the other hand, this also implies that accesses to this memory are costly. Without the hardware coherency ports, however, it is necessary to allocate the memory this way for the DMA operations. The correct usage of the AXI DMA is all based on the PG021 document. Additional information on controlling the DMA may be found here: Lauri’sBlog. An additional note on the configuration: CHAN_CR_IOC_IrqEn is the flag required for the DMA to trigger the interrupt. CHAN_CR_Err_IrqEn is optional.
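The test flow described above (LCG pattern, S2MM started before MM2S, interrupt enable flags) can be sketched in plain user-space C. This is not the driver code: the LCG constants are the common Numerical Recipes ones and are my assumption, the register offsets and bit positions are the direct register mode ones as I read them in PG021, and the “registers” are an ordinary array standing in for the ioremap’ed region. In the real module, the two addresses would be the DMA handles returned by dma_alloc_coherent.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* AXI DMA register offsets, direct register mode (assumed from PG021). */
#define MM2S_DMACR    0x00u
#define MM2S_SA       0x18u
#define MM2S_LENGTH   0x28u
#define S2MM_DMACR    0x30u
#define S2MM_DA       0x48u
#define S2MM_LENGTH   0x58u
#define DMA_REG_WORDS 24u          /* covers offsets 0x00..0x5C */

/* Control register bits; CHAN_CR_RS is my own name for the run/stop bit. */
#define CHAN_CR_RS        (1u << 0)
#define CHAN_CR_IOC_IrqEn (1u << 12)
#define CHAN_CR_Err_IrqEn (1u << 14)

/* One LCG step (illustrative constants, not the ones from the post). */
uint32_t lcg_next(uint32_t *state)
{
    *state = *state * 1664525u + 1013904223u;
    return *state;
}

/* Fill the TX buffer with the sequence that starts at `seed`. */
void lcg_fill(uint32_t *buf, size_t n, uint32_t seed)
{
    uint32_t state = seed;
    for (size_t i = 0; i < n; i++)
        buf[i] = lcg_next(&state);
}

/* Regenerate the sequence from the same seed and compare it against the RX
 * buffer. The RTL increments every word by 1, so expect value + 1. */
size_t lcg_count_mismatches(const uint32_t *rx, size_t n, uint32_t seed)
{
    uint32_t state = seed;
    size_t bad = 0;
    for (size_t i = 0; i < n; i++)
        if (rx[i] != lcg_next(&state) + 1u)
            bad++;
    return bad;
}

/* Stand-in for iowrite32() on the mapped register base. */
void dma_reg_write(volatile uint32_t *base, uint32_t offset, uint32_t value)
{
    base[offset / 4u] = value;
}

/* Start a loopback transfer: receive channel first, then transmit.
 * Writing the LENGTH register is what kicks off each channel. */
void dma_start_loopback(volatile uint32_t *base, uint32_t tx_phys,
                        uint32_t rx_phys, uint32_t len_bytes)
{
    const uint32_t cr = CHAN_CR_RS | CHAN_CR_IOC_IrqEn | CHAN_CR_Err_IrqEn;

    assert(len_bytes > 0u);

    dma_reg_write(base, S2MM_DMACR, cr);
    dma_reg_write(base, S2MM_DA, rx_phys);
    dma_reg_write(base, S2MM_LENGTH, len_bytes);

    dma_reg_write(base, MM2S_DMACR, cr);
    dma_reg_write(base, MM2S_SA, tx_phys);
    dma_reg_write(base, MM2S_LENGTH, len_bytes);
}
```

Resetting the verifier’s state to the original seed plays the role of the Seed = Seed_Orig; statement: the expected data never has to be stored, only regenerated. A non-zero result from lcg_count_mismatches after a transfer would point either to a transfer problem or to the CPU reading stale cached data.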

Again, after FPGA bitstream update, device tree changes, module compilation and insertion, you should see a similar output in the kernel log:

Overall, I would say that writing Linux kernel drivers is not that hard once you know the basics and the limitations. The biggest problem, I believe, is finding a good source of information, mostly because the majority of embedded Linux developers and users just point you to the code documentation, which is in most cases useless. I really don’t understand this behavior. What is actually sad is that the internet is full of “half-useful” answers and posts like the ones I described. That was the primary reason why I decided to write this post (apart from gaining more experience with driver development). I hope that my next post on kernel drivers will also include some form of PCIe functionality.