Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

integration/soc/add_sdcard: switch to edge-triggered cmd-done irq #1787

Closed
wants to merge 1 commit into from

Conversation

gsomlo
Copy link
Collaborator

@gsomlo gsomlo commented Sep 25, 2023

The current level-triggered command completion irq stays asserted until a new command is launched, requiring the software driver to disable and re-enable the cmd-done interrupt for each request.

Switching to an edge-triggered cmd-done interrupt will allow for a more elegantly written software driver that can also easily use data/dma completion interrupts.

Updates: e6765f8

Depends on enjoy-digital/litesdcard#31

The current level-triggered command completion irq stays asserted
until a new command is launched, requiring the software driver to
disable and re-enable the cmd-done interrupt for each request.

Switching to an edge-triggered cmd-done interrupt will allow for
a more elegantly written software driver that can also easily
use data/dma completion interrupts.

Updates: e6765f8

Signed-off-by: Gabriel Somlo <gsomlo@gmail.com>
@gsomlo
Copy link
Collaborator Author

gsomlo commented Sep 25, 2023

@paulusmack After talking to @Dolu1990 I realized the current Linux driver for LiteSDCard only uses interrupt support for command completion, and still polls for data/DMA completion.

In my attempt to figure out how to also support DMA completion IRQs for LiteSDCard, I realized that the level-triggered nature of the current interrupt, along with the fact that it stays asserted, in effect, "forever" (from completion until the setup of the next command, however long that is), causes the current driver to have to do contorted stuff like this.

Since there's no clear rule about which of a LiteSDCard command and its associated concurrent DMA data transfer would finish first, adding and removing "currently enabled" IRQs is very race-y, with lots of opportunity for TOCTOU bugs to happen. So IMHO the only reasonable way forward is to switch the command-completion IRQ to edge-triggered, which will then allow us to have all IRQs (command, dma-read, dma-write) enabled all the time, and rewrite the driver (see litex-hub/linux#15) to take advantage of all IRQs provided by the gateware.

Please LMK what you think. Once the gateware updates (litesdcard, litex) are in place, I'm planning to submit litex-hub/linux#15 to LKML and push dma irq support into mainline.

Any feedback much appreciated -- Thanks!

@paulusmack
Copy link
Contributor

[Replying here as well as in email]

@paulusmack After talking to @Dolu1990 I realized the current Linux driver for LiteSDCard only uses interrupt support for command completion, and still polls for data/DMA completion.

I'll try to find some time to instrument the driver to find out how long we are typically spending doing the polling. (It may of course depend on the specific SD card being used.) I think what I remember from looking at traces of the SD bus is that the card usually didn't signal command completion until the data transfer was complete or almost complete, but I may be wrong (that was a couple of years ago).

In my attempt to figure out how to also support DMA completion IRQs for LiteSDCard, I realized that the level-triggered nature of the current interrupt, along with the fact that it stays asserted, in effect, "forever" (from completion until the setup of the next command, however long that is), causes the current driver to have to do contorted stuff like this.

Like I said in email, as far as I can see it's a single register write either way (either to clear the pending bit for edge-triggered, or to disable the interrupt if level-triggered), so I don't see how that is "contorted".

Since there's no clear rule about which of a LiteSDCard command and its associated concurrent DMA data transfer would finish first, adding and removing "currently enabled" IRQs is very race-y, with lots of opportunity for TOCTOU bugs to happen.

Sorry, I don't get this. There would be a possible race if the updates to the IRQ_ENABLE register were done in a read-modify-write fashion, but they aren't; we just write one of two values to it. There is also a potential race between disabling/enabling interrupts and the event occurrence for edge-triggered interrupts, but not for level-triggered interrupts. That's one of the reasons why I prefer level-triggered interrupts. With edge-triggered interrupts you always have to check after enabling interrupts whether the event occurred while you had interrupts disabled, whereas you don't have to check for level-triggered interrupts. And the check is generally a register read, which is usually slow.

I also don't get the reference to TOCTOU. That's generally a security-related term, and I don't see any security implications here.

So IMHO the only reasonable way forward is to switch the command-completion IRQ to edge-triggered, which will then allow us to have all IRQs (command, dma-read, dma-write) enabled all the time, and rewrite the driver (see litex-hub/linux#15) to take advantage of all IRQs provided by the gateware.

I would want to see performance measurements to check that we aren't actually slowing things down by always using interrupts.

My great preference is to provide as much backwards and forwards compatibility as is reasonably possible without overly complicating things. As I said in email, I'm not at all a fan of the model where everything has to be kept in perfect sync by compiling the entire stack from top to bottom every time anything anywhere gets changed.

Please LMK what you think. Once the gateware updates (litesdcard, litex) are in place, I'm planning to submit litex-hub/linux#15 to LKML and push dma irq support into mainline.

My experience has generally been that edge-triggered interrupts usually make drivers harder to get right, not easier. (And that experience does go back quite a way, over many different kinds of systems.) The alternative to level-triggered interrupts that does seem to work well and efficiently is to use a message-based system that has in-memory queues of commands and responses, with effectively edge-triggered interrupts on message delivery. That avoids races by having a record in memory of what events have occurred. But that kind of system tends to be quite a lot more complicated in hardware than litesdcard is.

@gsomlo
Copy link
Collaborator Author

gsomlo commented Oct 13, 2023

I'll try to find some time to instrument the driver to find out how long we are typically spending doing the polling. (It may of course depend on the specific SD card being used.) I think what I remember from looking at traces of the SD bus is that the card usually didn't signal command completion until the data transfer was complete or almost complete, but I may be wrong (that was a couple of years ago).

I ran some quick tests, using nexys-video, litex, and a 4-core rocket-chip running at 50MHz. I read a 32.4M file I had handy on the sdcard, and wrote it back, to get a rough idea of my read and write speeds:

mount /dev/mmcblk0p1 /mnt; date; cp /mnt/Image ./foo; date; umount /mnt

for read speed, and

mount /dev/mmcblk0p1 /mnt; date; cp foo /mnt/; umount /mnt; date

for write speed.

The results (in KBytes/second):

mode read write comments
cmd irq only 737.28 650.54 current setup, select commands using irq, no dma irqs
dma irqs 663.55 705.91 dma-completion irq support, still only select commands using irq
dma + all cmds 691.20 650.54 dma-completion and command completion for all commands

In my attempt to figure out how to also support DMA completion IRQs for LiteSDCard, I realized that the level-triggered nature of the current interrupt, along with the fact that it stays asserted, in effect, "forever" (from completion until the setup of the next command, however long that is), causes the current driver to have to do contorted stuff like this.

Like I said in email, as far as I can see it's a single register write either way (either to clear the pending bit for edge-triggered, or to disable the interrupt if level-triggered), so I don't see how that is "contorted".

If all you care about is command-completion IRQ support, then yes, turning it on and off is as easy as "enabling"
CARD_DETECT only vs. CARD_DETECT | COMMAND. But if you try to add support for edge-triggered DMA read and write IRQs, things get sort-of ugly. IMO it's much cleaner after applying commits 5 and 6 in https://github.com/litex-hub/linux/pull/15/commits (as a matter of principle, if nothing else :D)

Since there's no clear rule about which of a LiteSDCard command and its associated concurrent DMA data transfer would finish first, adding and removing "currently enabled" IRQs is very race-y, with lots of opportunity for TOCTOU bugs to happen.

Sorry, I don't get this. There would be a possible race if the updates to the IRQ_ENABLE register were done in a read-modify-write fashion, but they aren't; we just write one of two values to it. There is also a potential race between disabling/enabling interrupts and the event occurrence for edge-triggered interrupts, but not for level-triggered interrupts. That's one of the reasons why I prefer level-triggered interrupts. With edge-triggered interrupts you always have to check after enabling interrupts whether the event occurred while you had interrupts disabled, whereas you don't have to check for level-triggered interrupts. And the check is generally a register read, which is usually slow.

That's why we should prefer enabling all interrupts at the time the driver is loaded, and keep them enabled throughout. If we re-arm our completion before programming the sdcard gateware to initiate a command and/or data transfer, and wait for completion after doing that, we won't ever miss anything, and won't have to worry about what happens if we mix and match edge and level triggered interrupts: see patch 5/10 and patch 6/10 of the proposed series.

I also don't get the reference to TOCTOU. That's generally a security-related term, and I don't see any security implications here.

It's time-of-check vs. time-of-use (I was thinking about having to read whether the command IRQ is enabled or not before I write to the enable register to turn on other IRQs, such as dma-read or dma-write). If you want to turn something on or off, you need to know whether other things are themselves on or off, to avoid turning them off when they're supposed to be on or vice versa. That means reading the enable register, and-ing (or or-ing it) with something, then writing it back.

Alternatively, we could maintain a variable that contains the current set of enabled IRQs and modify, then write it out whenever we want to change what IRQs are enabled, but once again, I humbly propose that my way is just cleaner... :)

So IMHO the only reasonable way forward is to switch the command-completion IRQ to edge-triggered, which will then allow us to have all IRQs (command, dma-read, dma-write) enabled all the time, and rewrite the driver (see litex-hub/linux#15) to take advantage of all IRQs provided by the gateware.

I would want to see performance measurements to check that we aren't actually slowing things down by always using interrupts.

So in my case (50MHz rv64gc rocket-chip) we're not slowing them down significantly, they're in the margin of error.
I think the irq-handling overhead is constant in terms of cpu clock cycles, so the faster we get the CPU, the less wall-clock time irq handling would take in absolute terms, and therefore it would go down as a percentage of total sdcard transfer time. Of course, if one cares about maximum transfer rates at low cpu clock speeds, one should definitely turn off IRQ support (e.g., via DTS) -- the driver will keep being just as fast in polling mode as it was before this change.

My great preference is to provide as much backwards and forwards compatibility as is reasonably possible without overly complicating things. As I said in email, I'm not at all a fan of the model where everything has to be kept in perfect sync by compiling the entire stack from top to bottom every time anything anywhere gets changed.

So as far as I'm concerned, the choices are:

  1. drop this whole affair, leave everything as-is, do not support dma completion interrupts, keep using level-triggered command completion IRQs only
  2. make the change, update the linux driver (making it much cleaner IMHO), and incur the wrath of whomever will have to recompile their bitstream to get it working with the updated kernel driver :)

I'm definitely NOT up for maintaining two LiteSDCard drivers, or something that tries to work with both "versions" of the gateware (It's F/OSS gateware, after all, and I expect the set of people both unwilling to update the gateware and adamant about running the latest kernel to be extremely small :)

My experience has generally been that edge-triggered interrupts usually make drivers harder to get right, not easier. (And that experience does go back quite a way, over many different kinds of systems.) The alternative to level-triggered interrupts that does seem to work well and efficiently is to use a message-based system that has in-memory queues of commands and responses, with effectively edge-triggered interrupts on message delivery. That avoids races by having a record in memory of what events have occurred. But that kind of system tends to be quite a lot more complicated in hardware than litesdcard is.

I am definitely familiar with your significant history with Linux (have seen your email and name fly by numerous times as I was slowly working my way up to submit my first patch over the course of many years) :) But I think after 5/10 and 6/10 (as linked above) are applied (after litex is modified to generate edge irqs for commands), the driver remains rock solid by

  1. leaving IRQs enabled all the time and
  2. re-arming the completion queue before the gateware is configured to do anything that would generate the next interrupt.

Could I ask you for (yet another) huge favor, to actually have a look at the proposed Linux driver changes, and to LMK what you think I'm missing when I claim they're making it all cleaner than it was ? :)

Many thanks,
--Gabriel

@gsomlo
Copy link
Collaborator Author

gsomlo commented Oct 13, 2023

I forgot to include actual polling results (for 50MHz 4-core rocket on nexys video, results in KBytes/second):

mode read write comments
cmd irq only 737.28 650.54 current setup, select commands using irq, no dma irqs
dma irqs 663.55 705.91 dma-completion irq support, still only select commands using irq
dma + all cmds 691.20 650.54 dma-completion and command completion for all commands
poll 5us 372.78 182.29 with current SD_SLEEP_US = 5, lots of cmd. errors & retries
poll 0us 850.70 947.93 with modified SD_SLEEP_US = 0, lots of cmd. errors & retries

With the current value of SD_SLEEP_US of 5us, the values are actually worse compared to using IRQs. When SD_SLEEP_US is set to 0 (i.e., aggressive polling), we get the fastest transfer speeds. However, polling-mode also causes lots of command errors and retries to be reported by the driver.

I've written a whole novel in my previous comment, so please allow me to re-iterate the TL;DR part of my point re. edge-triggered interrupts:

If they're always on, and our driver operates like this:

1. if irq then reinit_completion
2. set up the gateware for the operation that will generate the next interrupt
3. if irq then wait_for_completion
4. polling loop

Then there's no way to miss an interrupt, and the polling loop will consist of a single, successful check. If irq is off, the polling loop will spin until done. but there's no path through which there could be an unexpected IRQ, since we're ready for it (reinit_completion) even before the hardware is told to do anything that could generate one.

Thanks,
--G

@Dolu1990
Copy link
Collaborator

50MHz 4-core rocket on nexys video

Hi @gsomlo
Do you know the configuration of that SoC ? (l1 i$ d$ size, internal L2 $size)

Thanks ^^
Charles

@gsomlo
Copy link
Collaborator Author

gsomlo commented Oct 13, 2023 via email

@gsomlo
Copy link
Collaborator Author

gsomlo commented Oct 14, 2023

@paulusmack I've re-worked the linux patches here: litex-hub/linux#16, with 5/10 and 6/10 being even more precise and clear in what they're trying to accomplish.

I guess the other argument to be made is that all other interrupts generated by LiteX blocks (including the dma completion ones for LiteSDCard) are edge-triggered, so whether level-triggered IRQs are philosophically better in the big scheme of things, having just one "stick out" by bucking the trend is a complication we might be better off avoiding :)

Once again, I value your opinion a lot, and would really like to be able to convince you... :D

Thanks,
--G

@paulusmack
Copy link
Contributor

@paulusmack I've re-worked the linux patches here: litex-hub/linux#16, with 5/10 and 6/10 being even more precise and clear in what they're trying to accomplish.

I'll take a look, and I am also instrumenting the driver to let me look at the distribution of how long commands take and how long we are polling for data transfer/DMA completion. It will probably take a few days before I can comment in more detail, since I do have some other things occupying my time at the moment.

I guess the other argument to be made is that all other interrupts generated by LiteX blocks (including the dma completion ones for LiteSDCard) are edge-triggered, so whether level-triggered IRQs are philosophically better in the big scheme of things, having just one "stick out" by bucking the trend is a complication we might be better off avoiding :)

If the way it works is that there is an interrupt pending bit that is set by the occurrence of an event (such as command completion or transfer finished), and the device persists in interrupting until the pending bit is cleared, then I am happy with that. You can call it "edge triggered" on the event, and I can call it "level triggered" on the pending bit, and we're both happy. :)

@gsomlo
Copy link
Collaborator Author

gsomlo commented Oct 16, 2023

If the way it works is that there is an interrupt pending bit that is set by the occurrence of an event (such as command completion or transfer finished), and the device persists in interrupting until the pending bit is cleared, then I am happy with that. You can call it "edge triggered" on the event, and I can call it "level triggered" on the pending bit, and we're both happy. :)

Yes, unless one acknowledges the specific "edge" interrupt by writing to the respective bit of the pending register
(e.g., here), the interrupt will keep firing until acknowledged.

In the end, that can be done with a single register write for all interrupts handled during one invocation of the handler, like this

I very much appreciate your "pre-review" and advance feedback (before final decision to actually make the change in litex and then submit all this driver stuff to LKML) -- Thanks!

@paulusmack
Copy link
Contributor

Here are some observations from instrumenting litex_mmc_send_cmd(). I added MMIO writes to set and clear some GPIO bits at various points and observed the GPIO outputs, along with the interrupt request from the litesdcard interface, with an oscilloscope.

I did the same tests that you mentioned above using a 32MiB file (32768 kiB), copying the file to and from an ext2 partition (12GiB in size) on a Sandisk 32GB "high endurance" micro-SD card. The CPU is a 100MHz microwatt CPU with 8kiB icache and 8kiB dcache (both 2-way set associative). The SD card clock is 25MHz.

Reading the 32768kiB consistently takes 11 seconds (to the nearest second, which is the best one can do using the date command), i.e. 3000 kiB/s.

Writing the 32768kiB takes about 25 seconds, i.e. 1300 kiB/s.

Of the commands that don't use interrupts, i.e. those for which transfer == SD_CTL_DATA_XFER_NONE and response_len != SD_CTL_RESP_SHORT_BUSY, some are very quick, taking only a few microseconds, but most of them take between 75 and 90 microseconds.

The commands that do use interrupts take between 400 and 700 microseconds when reading and between about 1.75 and 2 ms when writing. This is from when the command is issued to the point when the interrupt request is raised.

The latency from when the interrupt request is raised to when litex_mmc_send_cmd gets to the point just after the first litex_mmc_sdcard_wait_done() call is mostly about 120 microseconds, but varies and can be as high as 500 microseconds.

The second call to litex_mmc_sdcard_wait_done() (the one waiting on LITEX_CORE_DATAEVT) and the call to readx_poll_timeout() to wait for DMA completion always take 2 to 3 microseconds each. I never saw any case where they took a long time (i.e. 10 microseconds or more).

Given the latencies that I'm seeing here, I think that using interrupts more aggressively is going to slow things down. That could be dependent on the particular SD card and CPU speed that I'm using, but with a slower CPU (such as you are using) I would think that adding more trips through an interrupt handler and the scheduler would have a bigger impact.

Anyway, I'll try your new version of the driver and see what happens to the transfer rates.

@Dolu1990
Copy link
Collaborator

@paulusmack One question about the linux kernel configuration, CONFIG_SMP is off ? If i remember well, it was introducing quite some overhead to have it on when using network/filesystem, but i tested it a very long time ago XD

Maybe the best would be to just add a dts configuration to tell if :

  • Always do a active pulling
  • Active pulling on small things, interrupt on big thing
  • Always interrupt

Then there would be no compromise <3

@paulusmack
Copy link
Contributor

CONFIG_SMP is off in my setup, since I only have one core.

Regarding using the DTS to configure the behaviour, we can do that if necessary, but the disadvantage is that it does increase the amount of testing needed.

@paulusmack
Copy link
Contributor

@gsomlo with your proposed change, how does the command-done irq pending bit work? In particular, does anything happen to it when the irq enable bit is cleared? It seems to me that if the behaviour was such that the pending bit is cleared when a zero is written to the enable bit, then the old driver will continue to work. (I mean that the pending bit is cleared once at the point where the write to the enable register is done, not that the pending bit is forced to zero while ever the enable bit in the register is zero.)

@gsomlo
Copy link
Collaborator Author

gsomlo commented Oct 17, 2023

@gsomlo with your proposed change, how does the command-done irq pending bit work? In particular, does anything happen to it when the irq enable bit is cleared? It seems to me that if the behaviour was such that the pending bit is cleared when a zero is written to the enable bit, then the old driver will continue to work.

The old linux driver does indeed keep working after the gateware is switched to edge-irq for sdcard command completion (for a very "tolerant" definition of "working", that is: doing e.g. an md5 on a large file will result in lots of 110 errors and retries on cmd17 & cmd18, but will get the right md5sum in the end :)

So if you care about running older kernels on up-to-date gateware, that should be possible (with the above caveat).

What's not going to work is using the new kernel driver on gateware that's still using "level" IRQ for sdcard command completion -- no amount of writing to the pending bit for command-done will "shut up" the asserted "level", the only way to do so is to disable the interrupt (i.e., write 0 to the enable register for that bit).

I guess it would be possible to treat the command IRQ the "old" way (level) and use edge-mode on all the rest, by applying this final patch on top of my entire series:

diff --git a/drivers/mmc/host/litex_mmc.c b/drivers/mmc/host/litex_mmc.c
index baa0f62ddc01..ed829067e90a 100644
--- a/drivers/mmc/host/litex_mmc.c
+++ b/drivers/mmc/host/litex_mmc.c
@@ -137,8 +137,12 @@ static int litex_mmc_send_cmd(struct litex_mmc_host *host,
                      cmd << 8 | transfer << 5 | response_len);
        litex_write8(host->sdcore + LITEX_CORE_CMDSND, 1);
 
-       if (host->irq > 0)
+       if (host->irq > 0) {
+               /* enable CMD_DONE irq *AFTER* setting up command,
+                * which clears the *previous* command's asserted LEVEL
+                */
+               litex_write32(host->sdirq + LITEX_IRQ_ENABLE,
+                       SDIRQ_CARD_DETECT | SDIRQ_CMD_DONE |
+                       SDIRQ_SD_TO_MEM_DONE | SDIRQ_MEM_TO_SD_DONE);
                wait_for_completion(&host->cmd_done);
+       }
 
        ret = litex_mmc_sdcard_wait_done(host->sdcore + LITEX_CORE_CMDEVT, dev);
        if (ret) {
@@ -262,6 +266,9 @@ static irqreturn_t litex_mmc_interrupt(int irq, void *arg)
        /* Check for command completed */
        if (pending & SDIRQ_CMD_DONE) {
                handled |= SDIRQ_CMD_DONE;
+               /* Disable (level-triggered) CMD_DONE IRQ to prevent it from firing */
+               litex_write32(host->sdirq + LITEX_IRQ_ENABLE,
+                       SDIRQ_CARD_DETECT |
+                       SDIRQ_SD_TO_MEM_DONE | SDIRQ_MEM_TO_SD_DONE);
                complete(&host->cmd_done);
        }
 
@@ -534,10 +541,10 @@ static int litex_mmc_irq_init(struct platform_device *pdev,
 
        /* Clear & enable interrupts */
        litex_write32(host->sdirq + LITEX_IRQ_PENDING,
-                       SDIRQ_CARD_DETECT | SDIRQ_CMD_DONE |
+                       SDIRQ_CARD_DETECT |
                        SDIRQ_SD_TO_MEM_DONE | SDIRQ_MEM_TO_SD_DONE);
        litex_write32(host->sdirq + LITEX_IRQ_ENABLE,
-                       SDIRQ_CARD_DETECT | SDIRQ_CMD_DONE |
+                       SDIRQ_CARD_DETECT |
                        SDIRQ_SD_TO_MEM_DONE | SDIRQ_MEM_TO_SD_DONE);
 
        init_completion(&host->cmd_done);

We absolutely can't enable the command-done IRQ before setting up the command (because the previous command's irq level is still high right up until the new command is launched!!!).

So with this patch we'd be bending over backwards to keep the command-done IRQ handling compatible with a self-inflicted earlier design decision... :) I'd much rather see us clean it up and have a nice, elegant, uniform way of handling interrupts...

@gsomlo
Copy link
Collaborator Author

gsomlo commented Oct 20, 2023

Closing w/o merging: this change was motivated by my belief that using DMA completion IRQs in addition to the existing command completion ones would improve performance. However, data transfers are usually complete before the command completes, so IRQ handling and gateware behavior can (and therefore probably should) be kept as-is :)

@gsomlo gsomlo closed this Oct 20, 2023
@gsomlo gsomlo deleted the gls-cmd-edge-irq branch October 20, 2023 19:34
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants