This repository has been archived by the owner on Jun 30, 2021. It is now read-only.

Simultaneous interrupts #2

Open
jrrk opened this issue Apr 4, 2019 · 31 comments

@jrrk
Member

jrrk commented Apr 4, 2019

On Ariane, the CPU ceases to respond to interrupts after a while, and I am testing the hypothesis that it is the simultaneous arrival of multiple interrupts that is causing the problem. I attach a waveform captured from the hardware under this circumstance for further study. To save memory, only the portion of the waveform with req_i.valid high is captured. Let me know if an alternative arrangement would be more useful.

plic.zip

@zarubaf @eunchan @asb

@eunchan

eunchan commented Apr 4, 2019

@jrrk The entire waveform shows that the PLIC generates the correct eip to the target from time 0 to the end of the waveform.

You mentioned that the CPU stops responding to the interrupt (I assume it doesn't claim and complete). To help me understand clearly, could you please share the interrupt handling code? I am worried that it might clear mie and never turn it back on in some corner case.

@jrrk
Member Author

jrrk commented Apr 4, 2019

vmlinux.dis.zip

This file should have everything you need. The CPU sits in wait_for_interrupt() when idle; the handler begins in do_IRQ(). The source code is inline, but if needed it is available at linux-4.20-rc2, with driver patches here: https://github.com/lowRISC/ariane-sdk/blob/rng-tools/configs/0099-lowrisc-ethernet.patch

@jrrk
Member Author

jrrk commented Apr 4, 2019

I don't think the mie theory can be correct because the CPU continues to receive Ethernet interrupts after the MMC has locked up. Unbalanced claim and complete statements perhaps could cause the same issue.

@eunchan

eunchan commented Apr 4, 2019

@zarubaf Could you please show the connection in Ariane? I see that bit 1 and bit 2 of irq_sources_i were set, but I don't know which bit belongs to which module (MMC, Ethernet).

[Image: waveform screenshot]
In the above waveform, the bit 1 interrupt is asserted after bit 2, so irq_sources_i was 0x6.
This is the first and last interrupt for bit 2 (ID 3h). It is claimed and completed correctly, and no new interrupt for bit 2 ever arrives.

The behavior looks correct, but I doubt that the core can complete the interrupt routine that quickly. It took just one cycle to complete the ISR. Is the waveform from a dedicated simulation environment that does not use the Linux image attached above?

@jrrk
Member Author

jrrk commented Apr 4, 2019

The waveform only shows the cycles where the PLIC is accessed. There can be a delay of thousands of cycles in between.

@eunchan

eunchan commented Apr 4, 2019

I see. It has been a long time since I last looked at a VCD. OK then, I will wait until Florian confirms the connection. Based on the source at Ariane ( https://github.com/pulp-platform/ariane/blob/master/tb/ariane_peripherals.sv ), it looks like bit 1 is for SPI and bit 2 is for Ethernet, so I cannot yet relate the waveform to the interrupt sources.

@zarubaf

zarubaf commented Apr 4, 2019

I think the trace is from @jrrk's version, not the upstream one. @jrrk can you please shed some light on what is connected where?

@jrrk
Member Author

jrrk commented Apr 4, 2019

My version here

@eunchan

eunchan commented Apr 4, 2019

OK, then bit 1 is MMC. The waveform shows that it kept receiving the interrupt from bit 1, and the last beat of the waveform shows that it completes the interrupt. Sorry, but I cannot find anything wrong inside the PLIC.

@jrrk
Member Author

jrrk commented Apr 4, 2019

As you noticed, bit 0 is UART, bit 1 is SD and bit 2 is Ethernet. Since that waveform was uploaded I have generated others (attached). The behaviour might be more interesting (i.e. faulty).
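
The bit assignments above can be sketched as a small C helper. This is only an illustration: the function and array names are made up, and it assumes that a PLIC source ID is the bit index plus one, which is consistent with the earlier remark that bit 2 corresponds to ID 3h.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Bit assignments of irq_sources_i in this build, as stated in the thread. */
static const char *const source_names[] = { "uart", "sd", "ethernet" };
#define NUM_SOURCES (sizeof source_names / sizeof source_names[0])

/* Assumption: PLIC source ID = bit index + 1 (ID 0 means "no interrupt"). */
static unsigned bit_to_plic_id(unsigned bit)
{
    assert(bit < NUM_SOURCES);
    return bit + 1;
}

/* Store the PLIC IDs of all asserted bits in ids[]; return how many were stored. */
static size_t decode_irq_sources(uint32_t irq_sources, unsigned ids[], size_t max_ids)
{
    size_t n = 0;
    for (unsigned bit = 0; bit < NUM_SOURCES && n < max_ids; bit++) {
        if (irq_sources & (UINT32_C(1) << bit))
            ids[n++] = bit_to_plic_id(bit);
    }
    return n;
}

/* Map a PLIC ID back to the name of its source, or "none" if out of range. */
static const char *plic_id_name(unsigned id)
{
    return (id >= 1 && id <= NUM_SOURCES) ? source_names[id - 1] : "none";
}
```

With irq_sources_i at 0x6, decode_irq_sources() yields IDs 2 and 3, i.e. SD and Ethernet under the stated assumption.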

plic2.zip

@jrrk
Member Author

jrrk commented Apr 4, 2019

[Image: Screenshot from ilaplic3 vcd 2019-04-04 19-18-50]

This may be observed by opening ilaplic3.vcd in gtkwave and the previously provided simult.gtkw and scrolling to the extreme right (6157 samples). Here we can see that Ethernet interrupts are being serviced but SD interrupts are not, even though SD is asking for service. Since you have a trace of what registers have been written I would hope we can establish if this is a software or hardware bug.

@eunchan

eunchan commented Apr 4, 2019

I see that the values of eip_targets_o and ip change to 0 at the end of ilaplic3.vcd, meaning the interrupt is successfully claimed but not completed. Indeed, it looks like a software bug.

$var reg 2 ') i_ariane_peripherals/i_plic/eip_targets_o [1:0] $end

... SIM ongoing ...

#6157
b0 ?#
b0 2%
b0 /&
b0 O&
b0 ')

@jrrk
Member Author

jrrk commented Apr 5, 2019

@zarubaf OK, this seems to be a race hazard in the specification of the PLIC itself. Because some bright spark made reading the claim register self-clearing, there is the possibility that an interrupt may be claimed and appear on the local register bus before being accepted by the AXI interface. I'm speculating that this could happen if an instruction is issued but not committed due to an interrupt or other exception. This theory could be tested by making the PLIC claim routine a critical section, if it isn't already. If somebody can tell me I am wrong about this, that would be a relief.

@asb
Member

asb commented Apr 5, 2019

I'm not sure I follow the theory. If an issued but uncommitted instruction generates read/writes on a memory mapped peripheral that would be a processor bug surely.

Could you elaborate on what you mean by the "local register bus"? My understanding is that the software will:

  • Read the memory-mapped claim register. If the value is non-zero, it has successfully claimed an interrupt.
  • Service the interrupt
  • Signal completion by writing the claimed id back to the memory-mapped claim register
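
The three steps above can be sketched in C against a toy software model of the PLIC. The struct, the function names and the "lowest pending ID wins" rule are assumptions made for illustration only; real hardware uses memory-mapped registers and priorities, which are left out here.

```c
#include <assert.h>
#include <stdint.h>

/* Toy software model of one PLIC target; NOT the real register layout. */
struct plic_model {
    uint32_t pending;    /* bit n set: source ID n is pending (bit 0 unused) */
    uint32_t in_service; /* bit n set: ID n has been claimed but not completed */
};

/* Models a read of the claim register: returns the lowest pending ID and,
 * as a side effect of the read, moves it from pending to in-service.
 * Returns 0 when nothing is pending. Priorities are ignored in this sketch. */
static uint32_t plic_claim(struct plic_model *p)
{
    for (uint32_t id = 1; id < 32; id++) {
        uint32_t mask = UINT32_C(1) << id;
        if (p->pending & mask) {
            p->pending &= ~mask;
            p->in_service |= mask;
            return id;
        }
    }
    return 0;
}

/* Models writing the claimed ID back to signal completion. */
static void plic_complete(struct plic_model *p, uint32_t id)
{
    assert(id >= 1 && id < 32);
    p->in_service &= ~(UINT32_C(1) << id);
}

/* Claim, service, complete, repeated until the claim read returns 0.
 * service may be a null pointer. Returns the number of interrupts handled. */
static unsigned handle_irqs(struct plic_model *p, void (*service)(uint32_t id))
{
    unsigned handled = 0;
    uint32_t id;
    while ((id = plic_claim(p)) != 0) {
        if (service)
            service(id);
        plic_complete(p, id);
        handled++;
    }
    return handled;
}
```

Every successful claim is paired with exactly one completion in this loop, which is the balance that the earlier comment about "unbalanced claim and complete" refers to.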

@jrrk
Member Author

jrrk commented Apr 5, 2019

By definition all the data for a load instruction must be available before it can be committed. Therefore the interrupt that has been claimed will be cancelled at the PLIC end before the instruction is committed. Therefore memory-mapped peripherals that have side-effects on read are deprecated in modern multiple-issue environments.

@asb
Member

asb commented Apr 5, 2019

I meant if the instruction is issued, uncommitted, and is never committed.

@zarubaf

zarubaf commented Apr 5, 2019

@jrrk Yes, your observation is absolutely correct. The problem is that the claim register is not idempotent but the core issues loads speculatively. Ariane does not implement PMAs which I think should govern that. Quickly discussed a fix with @msfschaffner, but this will probably take a bit longer than I can afford today.
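
One way to picture the failure is the following C sketch of a single interrupt source. It is a simplified model under stated assumptions, not Ariane or the PLIC RTL: all names are hypothetical, and it assumes the source cannot raise a new request while a claimed one is still awaiting completion.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Toy model of one PLIC source; names and structure are illustrative only. */
struct plic_src {
    bool pending;    /* interrupt waiting to be claimed */
    bool in_service; /* claimed, waiting for the completion write */
};

/* Assumption: a new request is only accepted when none is in flight. */
static void src_raise(struct plic_src *s)
{
    if (!s->in_service)
        s->pending = true;
}

/* Claim read with a side effect. The PLIC cannot tell whether the load
 * that caused this read will ever commit in the core. */
static bool src_claim_read(struct plic_src *s)
{
    if (!s->pending)
        return false;
    s->pending = false;
    s->in_service = true;
    return true;
}

static void src_complete(struct plic_src *s)
{
    assert(s->in_service);
    s->in_service = false;
}

/* One claim attempt by the core. If squashed is true, the load reached the
 * PLIC speculatively but its result was discarded, so software never sees
 * the claim and never completes it. Returns true if software serviced
 * the interrupt. */
static bool core_claim_attempt(struct plic_src *s, bool squashed)
{
    bool got = src_claim_read(s);
    if (squashed)
        return false;
    if (got)
        src_complete(s); /* stands in for servicing plus completion write */
    return got;
}
```

In this model, a single squashed claim leaves the source in service forever: the re-executed load reads nothing, and later requests from the same source are never accepted. That matches the "claimed but not completed" state seen at the end of ilaplic3.vcd.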

@zarubaf

zarubaf commented Apr 5, 2019

Thanks a ton for all your hard debugging work!

@asb
Member

asb commented Apr 5, 2019

It's actually the idempotency PMA which Ariane needs to implement and respect (either set dynamically in some platform defined way, or fixed at design time). This is separate to the PMPs.

@jrrk
Member Author

jrrk commented Apr 5, 2019

Even if PMAs were implemented, loads are at the start of the pipeline and commits are at the end, so there is a vulnerability period of several cycles every time you check for an interrupt. Any chance that using an atomic instruction could help?

@zarubaf

zarubaf commented Apr 5, 2019

Yeah, I meant PMAs. I am always confusing them with PMPs. I've changed it in the comment above. @jrrk Yeah, it would mean some hacky workaround that disables all sources of speculation (briefly disabling interrupts and draining the pipeline before issuing the load). I also thought about using atomics for this, but that would mean the PLIC needs to support atomic transactions (not super difficult to implement). Tbh, that is a rather messy situation for the hardware to deal with; I think the "RISC-V" way would be to support PMAs in the core.

@asb
Member

asb commented Apr 5, 2019

Given the prevalence of existing peripherals with side-effects on loads, and that whenever the Unix platform spec actually exists it's almost certain to require support for side-effecting MMIO loads via idempotency PMAs or by never issuing loads speculatively, I think supporting the PMAs is probably the sensible path forwards.

Shall we close this issue for now given that the current hypothesis is that this isn't actually a PLIC problem?

@jrrk
Member Author

jrrk commented Apr 5, 2019

I think we need to demonstrate that if we work around this issue by whatever means, then all is well and there are no other lurking reliability issues. Perhaps the previous PLIC failed for exactly the same reason.

@jrrk
Member Author

jrrk commented Apr 6, 2019 via email

@jrrk
Member Author

jrrk commented Apr 7, 2019

@zarubaf How about this for an idea to keep us going: generate a handshake pulse from the commit stage of Ariane for the specific case of a non-idempotent load (at first just recognising the address), and feed this via the user signal over the AXI bus to the PLIC, which can then be modified to mask the interrupt only when it gets the acknowledgement.
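
The handshake idea can be sketched in C as a two-step claim on a single source. This is a hypothetical model of the proposal, not an implementation of it: the names are invented, and the commit pulse carried on the AXI user signal is represented by a plain function call.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Toy model of one PLIC source with the proposed handshake. */
struct plic_src_hs {
    bool pending;    /* interrupt waiting to be claimed */
    bool in_service; /* claim acknowledged, waiting for completion */
};

/* The claim read no longer changes state; it only reports it. */
static bool hs_claim_read(const struct plic_src_hs *s)
{
    return s->pending;
}

/* Pulse from the commit stage: only now does the PLIC mask the interrupt. */
static void hs_commit_ack(struct plic_src_hs *s)
{
    if (s->pending) {
        s->pending = false;
        s->in_service = true;
    }
}

static void hs_complete(struct plic_src_hs *s)
{
    assert(s->in_service);
    s->in_service = false;
}

/* One claim attempt by the core. A squashed load never reaches the commit
 * stage, so no acknowledgement is sent and the PLIC state is unchanged.
 * Returns true if software serviced the interrupt. */
static bool hs_core_claim_attempt(struct plic_src_hs *s, bool squashed)
{
    bool got = hs_claim_read(s);
    if (squashed)
        return false;
    if (got) {
        hs_commit_ack(s);
        hs_complete(s); /* stands in for servicing plus completion write */
    }
    return got;
}
```

Because the read itself is side-effect free in this model, a squashed speculative load leaves the interrupt pending and the re-executed load can still claim it.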

@zarubaf

zarubaf commented Apr 7, 2019

I am very (very) much in favor of cleaning it up completely, e.g. pretty much as Chris explained in his answer. It is really interesting that I never seem to have seen a missed character, though. I hope I can prioritize this. Unfortunately, we have a conference deadline this week.

@jrrk
Member Author

jrrk commented Apr 7, 2019

@zarubaf The Ethernet typically goes wrong with a half-life of about 10 MBytes. We aren't seeing that degree of traffic on the UART, and you might not even notice at that level. I have seen the UART lock up completely by the same PLIC mechanism, though.

@jrrk
Member Author

jrrk commented Apr 7, 2019

Another theory to be investigated: perhaps the claiming of the interrupt is directly related to the timing of the next interrupt. If we add a deliberate delay in the PLIC before asserting interrupts again, it might improve safety.

@jrrk
Member Author

jrrk commented Apr 8, 2019

I notice that reads from the PLIC under Linux are followed by

fence i,r

At the moment Ariane does not decode this combination and interprets it as a plain fence (flushing the D$). Can we use this to safely do a pipeline flush instead, and would that help?
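
For illustration, recognising this particular fence could look like the following C sketch. The field positions follow the RISC-V base ISA encoding of FENCE (predecessor set in bits 27:24, successor set in bits 23:20, each ordered I, O, R, W from the top bit). The constant and function names are made up here, the fm field in bits 31:28 is ignored, and this is a software model rather than Ariane's actual decoder.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Bits of the FENCE predecessor/successor sets: I, O, R, W (MSB to LSB). */
enum { FENCE_W = 1, FENCE_R = 2, FENCE_O = 4, FENCE_I = 8 };

#define OPCODE_MISC_MEM 0x0Fu

struct fence_fields {
    unsigned pred; /* predecessor set, instruction bits 27:24 */
    unsigned succ; /* successor set, instruction bits 23:20 */
};

/* Returns true and fills *f if insn is a FENCE (MISC-MEM, funct3 = 000). */
static bool decode_fence(uint32_t insn, struct fence_fields *f)
{
    assert(f != 0);
    if ((insn & 0x7Fu) != OPCODE_MISC_MEM)
        return false;
    if (((insn >> 12) & 0x7u) != 0)
        return false; /* funct3 = 001 would be FENCE.I */
    f->pred = (insn >> 24) & 0xFu;
    f->succ = (insn >> 20) & 0xFu;
    return true;
}

/* True only for "fence i,r": device input ordered before memory reads. */
static bool is_fence_i_r(uint32_t insn)
{
    struct fence_fields f;
    return decode_fence(insn, &f) && f.pred == FENCE_I && f.succ == FENCE_R;
}
```

Under this encoding "fence i,r" assembles to 0x0820000F, while a full "fence iorw,iorw" is 0x0FF0000F, so the two can be told apart from the pred and succ fields alone.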

@jrrk
Member Author

jrrk commented Apr 9, 2019

I have implemented the fence i,r as an acknowledge to the PLIC. Of course this is a temporary hack. The Debian chroot works quite well now. The SD-Card interface still locks up after about 46000 interrupts, but I think this is a driver problem and nothing to do with the PLIC. Onward and upward.

@zarubaf

zarubaf commented Apr 9, 2019

I am glad to hear that it is working better for you now. Sorry for my radio silence. I've been implementing the non-speculative loads (here: openhwgroup/cva6#213). I think it will still require some bug hunting as the change was unfortunately quite intrusive.
