-
Notifications
You must be signed in to change notification settings - Fork 22
Simultaneous interrupts #2
Comments
@jrrk the entire waveform shows that PLIC generates correct eip to the target from 0 to the end of the waveform. You mentioned that the CPU stop responding to the interrupt (I assume it doesn't claim and complete). To help me understand clear, could you please share the interrupt handling code? I am worrying that it might happen to clear mie and never turn it on at the corner case. |
This file should have everything you need. The CPU sits in wait_for_interrupt() when idle, the handler begins in do_IRQ(). The source code is inline, but if needed, is available at linux-4.20-rc2, with driver patches here: https://github.com/lowRISC/ariane-sdk/blob/rng-tools/configs/0099-lowrisc-ethernet.patch |
I don't think the mie theory can be correct because the CPU continues to receive Ethernet interrupts after the MMC has locked up. Unbalanced claim and complete statements perhaps could cause the same issue. |
@zarubaf Could you please show the connection in Ariane? I am seeing bit 1 and bit 2 of
The behavior looks correct but I doubt that the core can complete interrupt routine that quickly. It took just a cycle to complete ISR. Is the waveform from dedicated simulation environment not using the linux image attached above? |
The waveform only shows the cycles where the PLIC is accessed. There can be a delay of thousands of cycles in between. |
I see. It was long time ago to seeing VCD. OK then, I will wait until Florian confirms the connection. Based on the source at Ariane ( https://github.com/pulp-platform/ariane/blob/master/tb/ariane_peripherals.sv ) It looks like bit 1 is for SPI and bit 2 is for Ethernet. So I cannot connect the relationship between the waveform and the interrupt sources. |
OK then bit1 is MMC. The waveform shows it kept receiving the interrupt from bit 1. last beat of the waveform shows it completes the interrupt. Sorry but I cannot find anything wrong inside PLIC. |
As you noticed bit 0 is uart, bit1 is SD and bit 2 is Ethernet. Since that waveform has been uploaded I have generated others (attached). The behaviour might be more interesting (i.e. faulty) |
This may be observed by opening ilaplic3.vcd in gtkwave and the previously provided simult.gtkw and scrolling to the extreme right (6157 samples). Here we can see that Ethernet interrupts are being serviced but SD interrupts are not, even though SD is asking for service. Since you have a trace of what registers have been written I would hope we can establish if this is a software or hardware bug. |
I see the value of eip_targets_o, ip are changed to 0 at the end of ilaplic3.vcd. Meaning, it is sucessfully claimed but not completed. Indeed it looks like a software bug.
|
@zarubaf OK, this seems to be race hazard in the specification of the PLiC itself. Because some bright spark made reading the claim register self-clearing, there is the possibility that an interrupt may be claimed and appear on the local register bus, before being accepted by the AXI interface. I'm speculating that this could happen if an instruction is issued but not committed due to an interrupt or other exception. This theory could be tested by making the PLIC claim routine a critical section if it isn't already. If somebody can tell me I am wrong about this, that would be a relief. |
I'm not sure I follow the theory. If an issued but uncommitted instruction generates read/writes on a memory mapped peripheral that would be a processor bug surely. Could you elaborate on what you mean by the "local register bus". My understanding is that the software will:
|
By definition all the data for a load instruction must be available before it can be committed. Therefore the interrupt that has been claimed will be cancelled at the PLIC end before the instruction is committed. Therefore memory-mapped peripherals that have side-effects on read are deprecated in modern multiple-issue environments. |
I meant if the instruction is issued, uncommitted, and is never committed. |
@jrrk Yes, your observation is absolutely correct. The problem is that the |
Thanks a ton for all your hard debugging work! |
It's actually the idempotency PMA which Ariane needs to implement and respect (either set dynamically in some platform defined way, or fixed at design time). This is separate to the PMPs. |
Even if PMAs were implemented, loads are at the start of the pipeline and commits are at the end so there is a vulnerability period of several cycles every time you check for an interrupt. Any chance that using an atomic instruction could help ? |
Yeah, I meant |
Given the prevalence of existing peripherals with side-effects on loads, and that whenever the Unix platform spec actually exists it's almost certain to require support for side-effecting MMIO loads via idempotency PMAs or by never issuing loads speculatively, I think supporting the PMAs is probably the sensible path forwards. Shall we close this issue for now given that the current hypothesis is that this isn't actually a PLIC problem? |
I think we need to demonstrate that if we work around this issue by whatever means, then all is well and there are no other lurking reliability issues. Perhaps the previous PLIC failed for exactly the same reason. |
Would this idea work for Ariane: convert loads to atomic loads if an address below the cacheable threshold is detected, and upgrade peripherals to support the atomic extensions? I believe we should observe occasionally missed characters on the UART for the same reason. I have already converted it from VHDL to Verilog to make life easier for Verilator.
…Sent from my iPhone
On 5 Apr 2019, at 15:49, Florian Zaruba ***@***.***> wrote:
Yeah, I meant PMAs. Always confusing this with PMPs. I've changed it in the comment above. @jrrk Yeah it would mean some hacky way around and disabling all sources of speculation (shortly disabling the interrupts and draining the pipeline before issuing the load). I also thought about using atomics for this, but then that would mean that the plic needs to support atomic transactions (not super difficult to implement). Tbh, that is a rather messy situation the hardware has to deal with, I think the "RISC-V" way would be to support PMAs in the core.
—
You are receiving this because you were mentioned.
Reply to this email directly, view it on GitHub, or mute the thread.
|
@zarubaf how about this for an idea to keep us going, generate a handshake pulse from the commit stage of Ariane for the specific case of a non-idempotent load (at first just recognising the address), and feed this via the user signal over AXI bus to the PLIC, which can then be modified to only mask the interrupt when it gets the acknowledgement. |
I am very (very) much in favor of cleaning it up completely e.g. pretty much as Chris explained in his answer. It is really interesting that I never seem to saw a missed character though. I hope I can prioritize this. Unfortunately, we have a conference deadline this week. |
@zarubaf The Ethernet typically goes wrong with a half-life of about 10MBytes. We aren't seeing that degree of traffic on the UART, and you might not even notice at that level. I have seen the UART lockup completely by the same PLIC mechanism though. |
Another theory to be investigated, perhaps the claiming of the interrupt is directly related to timing of the next interrupt. If we add a deliberate delay in the PLIC before asserting interrupts again, it might improve safety. |
I notice that reads from the PLIC under Linux are qualified afterward with
|
I have implemented the fence i,r as an acknowledge to the PLIC. Of course this is a temporary hack. The Debian chroot works quite well now. The SD-Card interface still locks up after about 46000 interrupts, but I think this is a driver problem and nothing to do with the PLIC. Onward and upward. |
I am glad to hear that it is working better for you now. Sorry for my radio silence. I've been implementing the non-speculative loads (here: openhwgroup/cva6#213). I think it will still require some bug hunting as the change was unfortunately quite intrusive. |
On Ariane, the CPU ceases to respond to interrupts after a while, and I am testing the hypothesis that it is the simultaneous arrival of multiple interrupts that is causing the problem. I attach a waveform captured from the hardware under this circumstance for further study. To save memory, only the portion of the waveform with req_i.valid high is captured. Let me know if an alternative arrangement would be more useful.
plic.zip
@zarubaf @eunchan @asb
The text was updated successfully, but these errors were encountered: