Simultaneous interrupts #2

jrrk · 2019-04-04T10:20:25Z

On Ariane, the CPU ceases to respond to interrupts after a while, and I am testing the hypothesis that it is the simultaneous arrival of multiple interrupts that is causing the problem. I attach a waveform captured from the hardware under this circumstance for further study. To save memory, only the portion of the waveform with req_i.valid high is captured. Let me know if an alternative arrangement would be more useful.

plic.zip

@zarubaf @eunchan @asb

eunchan · 2019-04-04T16:24:06Z

@jrrk the entire waveform shows that PLIC generates correct eip to the target from 0 to the end of the waveform.

You mentioned that the CPU stop responding to the interrupt (I assume it doesn't claim and complete). To help me understand clear, could you please share the interrupt handling code? I am worrying that it might happen to clear mie and never turn it on at the corner case.

jrrk · 2019-04-04T16:42:21Z

vmlinux.dis.zip

This file should have everything you need. The CPU sits in wait_for_interrupt() when idle, the handler begins in do_IRQ(). The source code is inline, but if needed, is available at linux-4.20-rc2, with driver patches here: https://github.com/lowRISC/ariane-sdk/blob/rng-tools/configs/0099-lowrisc-ethernet.patch

jrrk · 2019-04-04T16:50:06Z

I don't think the mie theory can be correct because the CPU continues to receive Ethernet interrupts after the MMC has locked up. Unbalanced claim and complete statements perhaps could cause the same issue.

eunchan · 2019-04-04T17:13:51Z

@zarubaf Could you please show the connection in Ariane? I am seeing bit 1 and bit 2 of irq_sources_i were set but don't know which bit is for which module (MMC, ethernet).

In the above waveform, bit 1 interrupt is asserted following bit 2. So irq_sources_i was 0x6.
This is the first & last interrupt for bit 2 (ID 3h). It is claimed and completed correctly. And never new interrupt for bit2 arrived.

The behavior looks correct but I doubt that the core can complete interrupt routine that quickly. It took just a cycle to complete ISR. Is the waveform from dedicated simulation environment not using the linux image attached above?

jrrk · 2019-04-04T17:27:37Z

The waveform only shows the cycles where the PLIC is accessed. There can be a delay of thousands of cycles in between.

eunchan · 2019-04-04T17:32:35Z

I see. It was long time ago to seeing VCD. OK then, I will wait until Florian confirms the connection. Based on the source at Ariane ( https://github.com/pulp-platform/ariane/blob/master/tb/ariane_peripherals.sv ) It looks like bit 1 is for SPI and bit 2 is for Ethernet. So I cannot connect the relationship between the waveform and the interrupt sources.

zarubaf · 2019-04-04T17:57:26Z

I think the trace is from a version of @jrrk, not the upstream one. @jrrk can you please shine some light on what is connected where?

jrrk · 2019-04-04T18:00:48Z

My version here

eunchan · 2019-04-04T18:06:03Z

OK then bit1 is MMC. The waveform shows it kept receiving the interrupt from bit 1. last beat of the waveform shows it completes the interrupt. Sorry but I cannot find anything wrong inside PLIC.

jrrk · 2019-04-04T18:14:09Z

As you noticed bit 0 is uart, bit1 is SD and bit 2 is Ethernet. Since that waveform has been uploaded I have generated others (attached). The behaviour might be more interesting (i.e. faulty)

plic2.zip

jrrk · 2019-04-04T18:23:19Z

This may be observed by opening ilaplic3.vcd in gtkwave and the previously provided simult.gtkw and scrolling to the extreme right (6157 samples). Here we can see that Ethernet interrupts are being serviced but SD interrupts are not, even though SD is asking for service. Since you have a trace of what registers have been written I would hope we can establish if this is a software or hardware bug.

eunchan · 2019-04-04T18:36:24Z

I see the value of eip_targets_o, ip are changed to 0 at the end of ilaplic3.vcd. Meaning, it is sucessfully claimed but not completed. Indeed it looks like a software bug.

$$var reg 2 ') i_ariane_peripherals/i_plic/eip_targets_o [1:0] $end 

... SIM ongoing ...

#6157                                                                                                                                                                                            
b0 ?#
b0 2%
b0 /&
b0 O&
b0 ')

jrrk · 2019-04-05T13:02:15Z

@zarubaf OK, this seems to be race hazard in the specification of the PLiC itself. Because some bright spark made reading the claim register self-clearing, there is the possibility that an interrupt may be claimed and appear on the local register bus, before being accepted by the AXI interface. I'm speculating that this could happen if an instruction is issued but not committed due to an interrupt or other exception. This theory could be tested by making the PLIC claim routine a critical section if it isn't already. If somebody can tell me I am wrong about this, that would be a relief.

asb · 2019-04-05T13:25:36Z

I'm not sure I follow the theory. If an issued but uncommitted instruction generates read/writes on a memory mapped peripheral that would be a processor bug surely.

Could you elaborate on what you mean by the "local register bus". My understanding is that the software will:

Read the memory-mapped claim register. If non-zero, they successfully claimed an interrupt.
Service the interrupt
Signal completion by writing the claimed id back to the memory-mapped claim register

jrrk · 2019-04-05T13:31:33Z

By definition all the data for a load instruction must be available before it can be committed. Therefore the interrupt that has been claimed will be cancelled at the PLIC end before the instruction is committed. Therefore memory-mapped peripherals that have side-effects on read are deprecated in modern multiple-issue environments.

asb · 2019-04-05T13:54:01Z

I meant if the instruction is issued, uncommitted, and is never committed.

zarubaf · 2019-04-05T14:24:40Z

@jrrk Yes, your observation is absolutely correct. The problem is that the claim register is not idempotent but the core issues loads speculatively. Ariane does not implement PMAs which I think should govern that. Quickly discussed a fix with @msfschaffner, but this will probably take a bit longer than I can afford today.

zarubaf · 2019-04-05T14:25:02Z

Thanks a ton for all your hard debugging work!

asb · 2019-04-05T14:36:07Z

It's actually the idempotency PMA which Ariane needs to implement and respect (either set dynamically in some platform defined way, or fixed at design time). This is separate to the PMPs.

jrrk · 2019-04-05T14:36:43Z

Even if PMAs were implemented, loads are at the start of the pipeline and commits are at the end so there is a vulnerability period of several cycles every time you check for an interrupt. Any chance that using an atomic instruction could help ?

zarubaf · 2019-04-05T14:49:32Z

Yeah, I meant PMAs. Always confusing this with PMPs. I've changed it in the comment above. @jrrk Yeah it would mean some hacky way around and disabling all sources of speculation (shortly disabling the interrupts and draining the pipeline before issuing the load). I also thought about using atomics for this, but then that would mean that the plic needs to support atomic transactions (not super difficult to implement). Tbh, that is a rather messy situation the hardware has to deal with, I think the "RISC-V" way would be to support PMAs in the core.

asb · 2019-04-05T18:28:49Z

Given the prevalence of existing peripherals with side-effects on loads, and that whenever the Unix platform spec actually exists it's almost certain to require support for side-effecting MMIO loads via idempotency PMAs or by never issuing loads speculatively, I think supporting the PMAs is probably the sensible path forwards.

Shall we close this issue for now given that the current hypothesis is that this isn't actually a PLIC problem?

jrrk · 2019-04-05T21:36:07Z

I think we need to demonstrate that if we work around this issue by whatever means, then all is well and there are no other lurking reliability issues. Perhaps the previous PLIC failed for exactly the same reason.

jrrk · 2019-04-06T06:26:33Z

Would this idea work for Ariane: convert loads to atomic loads if an address below the cacheable threshold is detected, and upgrade peripherals to support the atomic extensions? I believe we should observe occasionally missed characters on the UART for the same reason. I have already converted it from VHDL to Verilog to make life easier for Verilator.

…

Sent from my iPhone

On 5 Apr 2019, at 15:49, Florian Zaruba ***@***.***> wrote: Yeah, I meant PMAs. Always confusing this with PMPs. I've changed it in the comment above. @jrrk Yeah it would mean some hacky way around and disabling all sources of speculation (shortly disabling the interrupts and draining the pipeline before issuing the load). I also thought about using atomics for this, but then that would mean that the plic needs to support atomic transactions (not super difficult to implement). Tbh, that is a rather messy situation the hardware has to deal with, I think the "RISC-V" way would be to support PMAs in the core. — You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub, or mute the thread.

jrrk · 2019-04-07T06:01:55Z

@zarubaf how about this for an idea to keep us going, generate a handshake pulse from the commit stage of Ariane for the specific case of a non-idempotent load (at first just recognising the address), and feed this via the user signal over AXI bus to the PLIC, which can then be modified to only mask the interrupt when it gets the acknowledgement.

zarubaf · 2019-04-07T07:51:30Z

I am very (very) much in favor of cleaning it up completely e.g. pretty much as Chris explained in his answer. It is really interesting that I never seem to saw a missed character though. I hope I can prioritize this. Unfortunately, we have a conference deadline this week.

jrrk · 2019-04-07T08:05:01Z

@zarubaf The Ethernet typically goes wrong with a half-life of about 10MBytes. We aren't seeing that degree of traffic on the UART, and you might not even notice at that level. I have seen the UART lockup completely by the same PLIC mechanism though.

jrrk · 2019-04-07T11:58:36Z

Another theory to be investigated, perhaps the claiming of the interrupt is directly related to timing of the next interrupt. If we add a deliberate delay in the PLIC before asserting interrupts again, it might improve safety.

jrrk · 2019-04-08T14:08:24Z

I notice that reads from the PLIC under Linux are qualified afterward with

fence i,r
At the moment Ariane does not decode this combination and interprets it as a plain fence (flushing the D$). Can we use this to safely do a pipeline flush instead, and would that help ?

jrrk · 2019-04-09T09:18:01Z

I have implemented the fence i,r as an acknowledge to the PLIC. Of course this is a temporary hack. The Debian chroot works quite well now. The SD-Card interface still locks up after about 46000 interrupts, but I think this is a driver problem and nothing to do with the PLIC. Onward and upward.

zarubaf · 2019-04-09T09:22:43Z

I am glad to hear that it is working better for you now. Sorry for my radio silence. I've been implementing the non-speculative loads (here: openhwgroup/cva6#213). I think it will still require some bug hunting as the change was unfortunately quite intrusive.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Simultaneous interrupts #2

Simultaneous interrupts #2

jrrk commented Apr 4, 2019

eunchan commented Apr 4, 2019

jrrk commented Apr 4, 2019 •

edited

Loading

jrrk commented Apr 4, 2019

eunchan commented Apr 4, 2019

jrrk commented Apr 4, 2019 •

edited

Loading

eunchan commented Apr 4, 2019

zarubaf commented Apr 4, 2019 •

edited

Loading

jrrk commented Apr 4, 2019

eunchan commented Apr 4, 2019

jrrk commented Apr 4, 2019

jrrk commented Apr 4, 2019

eunchan commented Apr 4, 2019

jrrk commented Apr 5, 2019

asb commented Apr 5, 2019 •

edited

Loading

jrrk commented Apr 5, 2019

asb commented Apr 5, 2019

zarubaf commented Apr 5, 2019 •

edited

Loading

zarubaf commented Apr 5, 2019

asb commented Apr 5, 2019

jrrk commented Apr 5, 2019 •

edited

Loading

zarubaf commented Apr 5, 2019

asb commented Apr 5, 2019

jrrk commented Apr 5, 2019

jrrk commented Apr 6, 2019 via email

jrrk commented Apr 7, 2019

zarubaf commented Apr 7, 2019

jrrk commented Apr 7, 2019

jrrk commented Apr 7, 2019

jrrk commented Apr 8, 2019

jrrk commented Apr 9, 2019

zarubaf commented Apr 9, 2019

Simultaneous interrupts #2

Simultaneous interrupts #2

Comments

jrrk commented Apr 4, 2019

eunchan commented Apr 4, 2019

jrrk commented Apr 4, 2019 • edited Loading

jrrk commented Apr 4, 2019

eunchan commented Apr 4, 2019

jrrk commented Apr 4, 2019 • edited Loading

eunchan commented Apr 4, 2019

zarubaf commented Apr 4, 2019 • edited Loading

jrrk commented Apr 4, 2019

eunchan commented Apr 4, 2019

jrrk commented Apr 4, 2019

jrrk commented Apr 4, 2019

eunchan commented Apr 4, 2019

jrrk commented Apr 5, 2019

asb commented Apr 5, 2019 • edited Loading

jrrk commented Apr 5, 2019

asb commented Apr 5, 2019

zarubaf commented Apr 5, 2019 • edited Loading

zarubaf commented Apr 5, 2019

asb commented Apr 5, 2019

jrrk commented Apr 5, 2019 • edited Loading

zarubaf commented Apr 5, 2019

asb commented Apr 5, 2019

jrrk commented Apr 5, 2019

jrrk commented Apr 6, 2019 via email

jrrk commented Apr 7, 2019

zarubaf commented Apr 7, 2019

jrrk commented Apr 7, 2019

jrrk commented Apr 7, 2019

jrrk commented Apr 8, 2019

jrrk commented Apr 9, 2019

zarubaf commented Apr 9, 2019

jrrk commented Apr 4, 2019 •

edited

Loading

jrrk commented Apr 4, 2019 •

edited

Loading

zarubaf commented Apr 4, 2019 •

edited

Loading

asb commented Apr 5, 2019 •

edited

Loading

zarubaf commented Apr 5, 2019 •

edited

Loading

jrrk commented Apr 5, 2019 •

edited

Loading