
DMA for pulse sequences #553

Closed
dhslichter opened this issue Sep 7, 2016 · 19 comments

@dhslichter
Contributor

@sbourdeauducq @jordens @whitequark @dleibrandt @r-srinivas @dtcallcock I'm posting this here to provide a central place for discussions regarding how one might use direct memory access (DMA) to program pulse sequences that, for whatever reason, are undesirable to implement in the typical way (the processor calculating and pushing timestamps to the RTIO core one by one).

For long pulse sequences where the average time between RTIO output events is less than the time it takes for the processor to compute and push timestamps (roughly 1 us each), the slack will be steadily reduced and will eventually cause an RTIO underflow. The solution to this is to calculate timestamps farther in advance. The simplest way to do this is to make sufficiently deep FIFOs for the relevant channels, and then set the initial slack sufficiently large (e.g. using a single long delay) to enable the processor to calculate and push all the timestamps with enough slack remaining at the end.
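
A minimal sketch of this approach, assuming a kernel with a single ttl0 output; the 10 ms delay, the loop count, and the per-event timing comments are illustrative only, not figures from this issue:

@kernel
def run(self):
    self.core.break_realtime()
    delay(10*ms)  # single long delay: builds up slack before the burst
    for i in range(5000):
        # each iteration advances the timeline by only 200 ns, while the CPU
        # needs on the order of 1 us per event to compute and push it, so the
        # pre-built slack (and the channel's FIFO depth) absorbs the difference
        self.ttl0.pulse(100*ns)
        delay(100*ns)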

This method may not be desirable in some cases (e.g. where experimental duty cycle is important), so another solution would be to precompute RTIO timestamps on the PC, load them into RAM on the core device, and have the processor call for the timestamps to be pushed directly from RAM to the output FIFOs at the appropriate time.

There are numerous technical/implementation questions to be discussed. I list a few below, some with suggested answers and some without.

  • how are the timestamps to be put in RAM generated? One would want to be able to write standard ARTIQ code, probably with some kind of decorator or with statement, that the compiler recognizes as something that needs to be converted to timestamps on the PC and uploaded to RAM.
  • how to precompute timestamps on the PC without knowledge of the value of the RTIO counter at the appropriate point in the experiment? One suggestion here would be to have DMA store only relative timestamps, referenced to a particular timestamp that the core device will provide at runtime. One could then have the DMA engine add this offset timestamp to every timestamp it pulls from memory before pushing it to the RTIO FIFOs. The first timestamp in any DMA sequence would always be "0", with all others relative to it. Thus, if you want to run 100 repetitions of the same pulse sequence, you only have to compute and store one copy of the relative timestamps for a single sequence in RAM (a rough software model of this scheme is sketched just after this list).
  • how does the memory controller handle memory usage requests from the processor while a pulse sequence is being sent out to the FIFOs using DMA? If this is blocking, the processor can't do much while waiting for the DMA to finish.
  • if the processor, after calling a DMA pulse sequence, is tasked with putting additional RTIO events on an output FIFO that is also receiving events from the DMA process, how can we ensure that the DMA has completed (and thus that the processor doesn't put its event in the middle of the DMA stream, resulting in an RTIOSequenceError)?
  • what is the steady-state average rate that timestamps can be DMA'ed from RAM into the RTIO FIFOs? Is there a requirement from the physics end on what this needs to be?
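
To make the relative-timestamp scheme in the second bullet concrete, here is a rough software model; the function and the event tuple layout are illustrative assumptions, not actual ARTIQ gateware or API:

def replay(recorded_events, base_timestamp_mu, push_to_fifo):
    # recorded_events: list of (relative_timestamp_mu, channel, address, data)
    # tuples, with the relative timestamps referenced to the start of the
    # recorded sequence. At playback, the runtime base timestamp is added to
    # each stored offset before the event is pushed into the RTIO FIFOs.
    for rel_t, channel, address, data in recorded_events:
        push_to_fifo(base_timestamp_mu + rel_t, channel, address, data)

# Running 100 repetitions of the same sequence then needs only one stored
# copy of the relative timestamps; only base_timestamp_mu changes per run.
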
@jordens
Member

jordens commented Sep 7, 2016

Yea. You are describing the design that has been thrown around for a few years now.

  • DMA sequences need to be persistent. Otherwise they seem almost pointless.
  • There are two use cases: generating the RTIO event sequences at compile time or at runtime. Runtime seems more powerful and generic.
  • Timestamps are relative. But purely additive: no scaling of timestamps/durations.
  • The first timestamp is not "0" but whatever delay it has w.r.t. the logical start of the segment.
  • The arbiter is round-robin. You will get a slow-down due to sharing of the bandwidth and DRAM dynamics but no (inherent) starvation. The slow-down seems irrelevant since the DRAM can easily outpace the CPU.
  • DMA().play() would be blocking. Otherwise, interlocking access seems problematic.
  • DMA should be able to saturate a few FIFOs and the PHYs. That is one event per coarse RTIO cycle for a few channels in parallel: 125 MHz for a few channels.
my_burst = DMA("my_burst")
with my_burst.record():
    delay(10*ns)
    ttl0.pulse(20*ns)
    for i in range(100):
        dds2.pulse(300*MHz + i*1*MHz, 220*ns)
# timeline is unaltered and rewound to before the `with`

# ... new experiment, new kernel

my_pulse = DMA("my_burst")  # retrieve the persistent recording by name
t = now_mu()
for i in range(100):
    ttl2.pulse(3*us)
    my_pulse.play()
    # timeline advanced by the length of my_burst (10 ns + 20 ns + 100*220 ns)
assert t + seconds_to_mu(100*(3*us + 10*ns + 20*ns + 100*220*ns)) == now_mu()

@dhslichter
Contributor Author

dhslichter commented Sep 8, 2016

DMA().play() would be blocking. Otherwise, interlocking access seems problematic.

Since the DMA'ed events, and thus the full list of channels to which RTIO events will be sent by DMA, would necessarily be known at compile time, would it be possible to have DMA().play() block only when a processor instruction is reached that requests a read/write on one of the channels involved in the DMA? That way, the processor could continue to put RTIO events onto other channels' FIFOs while the DMA is occurring (or perform other processing tasks, such as counting pulses from different channels or calculating a Bayesian update of some sort). This would help preserve slack, especially if the DMA'ed sequence contains a very large number of very short pulses, such that the DMA bandwidth spec given above is not enough to keep the slack from shrinking.

@whitequark
Contributor

Since the DMA'ed events, and thus the full list of all channels to which RTIO events will be sent by DMA, would necessarily be known at compile time

Not necessarily.

if ttl0.input():
  dma0.play()
else:
  dma1.play()

@dhslichter
Contributor Author

OK, but the full lists of channels for both dma0 and dma1 are known, even though they may be different. For example:

if ttl0.input():
    dma0.play()
    ttl4.pulse(10*us)
else:
    dma1.play()
    ttl4.pulse(10*us)

If ttl4 is involved in dma0, but not in dma1, it should be possible at compile time to know that the processor should block on the dma0.play() but not on dma1.play().

If another channel, ttl7, is involved in both dma0 and dma1, then the code below would block on dma0, but would allow the ttl4 pulse to be issued while dma1 occurs, then block until dma1 completes before issuing the ttl7 pulse.

if ttl0.input():
    dma0.play()
    ttl4.pulse(10*us)
    ttl7.pulse(2*us)
else:
    dma1.play()
    ttl4.pulse(10*us)
    ttl7.pulse(2*us)

@whitequark
Contributor

If ttl4 is involved in dma0, but not in dma1, it should be possible at compile time to know that the processor should block on the dma0.play() but not on dma1.play().

Yes, we could add the list of touched RTIO channels to function signatures, much like iodelay in the compiler-assisted interleaving.
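
A rough model of that analysis (hypothetical, not existing compiler code; the channel sets simply mirror the ttl4/ttl7 example above):

# each DMA sequence and each RTIO statement carries the set of channels it touches
dma0_channels = {"ttl4", "ttl7"}
dma1_channels = {"ttl7"}

def needs_wait(pending_dma_channels, statement_channels):
    # block only when the pending DMA and the next RTIO access share a channel
    return bool(pending_dma_channels & statement_channels)

assert needs_wait(dma0_channels, {"ttl4"})      # must wait before ttl4.pulse()
assert not needs_wait(dma1_channels, {"ttl4"})  # ttl4 pulse can overlap dma1
assert needs_wait(dma1_channels, {"ttl7"})      # but ttl7 pulse must wait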

@jordens
Member

jordens commented Sep 8, 2016

  • How would that RTIO channel access be enforced? And by which component? Would that component then have to traverse the entire "in-use-by-DMA" channel list on every access by the CPU?
  • This would also break the current design where the (local) RTIO channels are all accessed through the same interface. Also, I am calculating channel numbers for Sayma dynamically (even though the compiler can fold it). Why do you think you can predict them?
  • Large DMA sequences are mostly stalled by FIFO depth and won't be limited by DMA. In other cases they will compete for memory bandwidth and slow down the CPU.
  • Coroutines or threads will need similar interlocking anyway. And once that lands, you would write
with concurrent:  # instead of `parallel`
    my_burst.play()
    ttl2.pulse(2*us)

and it should do that with "true" parallelism. But trying to hack concurrent DMA and CPU RTIO access into this right now before having CPU concurrency seems misguided to me.

  • AFAICT we would arbitrate and interlock access to the entire RTIO interface and not to the individual channels.
  • This design will also need to foresee distributed DMA. And then concurrency becomes yet a bit trickier. The arbitration would be at the remote end and would need to reject output submissions, similar to RTIOOverflow right now (to be attempted again).

@jordens
Member

jordens commented Sep 8, 2016

Not RTIOOverflow but the behavior on RTIO FIFO full.

@dhslichter
Contributor Author

Good points here @jordens. Seems like the subtleties on this are pretty problematic in the version discussed above.

One variant: how about allowing the CPU to perform RTIO reads, but not writes, during the DMA, as well as any operations not involving the RTIO core (calculations, RPCs, etc.)? This would enable one to profitably use the DMA playback time for tasks like count() on the data from the previous iteration of an experiment, or for computing Bayesian updates to parameters. I am thinking here more about something like a clock experiment, where overall duty cycle is important and it would be very useful to spend as much of that time as possible on soft-CPU calculations.

Alternatively, one could block on any RTIO command (read or write) but allow all other types of CPU operations to proceed during the DMA.
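
A sketch of the usage pattern the first variant would enable; here probe_pulses is a previously recorded DMA sequence, pmt is an input channel gated during the previous shot, and bayesian_update is a placeholder for the parameter-update math (none of this is the actual ARTIQ DMA API):

probe_pulses.play()        # DMA playback starts on the output channels
n = self.pmt.count()       # RTIO read from the previous shot: allowed during DMA
p = bayesian_update(p, n)  # pure computation: allowed during DMA
self.ttl4.pulse(1*us)      # RTIO write: blocks here until the DMA has completed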

@sbourdeauducq
Member

Funded by Oxford.

@dhslichter
Contributor Author

Awesome! Will the specification be posted somewhere?

@jordens
Member

jordens commented Nov 16, 2016

Sure. Currently the specification is the (virtual/perceived) consensus of this issue plus a bunch of details from IRC. It'll probably also change a bit when we take a step back, see DRTIO in its full glory, and have a clear perspective on how to hook DMA to it.
Do you have specific questions?

@dhslichter
Contributor Author

No specific questions, just wanted to know if there would be a public place where the current "vision" is maintained.

@jordens
Member

jordens commented Nov 16, 2016

Yeah. That's something we want to do.

@jordens
Member

jordens commented Nov 18, 2016

@sbourdeauducq
Member

Basic gateware written, works in functional simulation.

@vontell

vontell commented Jan 12, 2017

Has a timeline been posted for the completion of this?

@sbourdeauducq
Member

Work on this will begin as soon as the network stack fiasco is resolved, and it shouldn't take long to get the basic functionality working (~week at most) unless there is another series of obscure bugs.

@jbqubit
Contributor

jbqubit commented Jan 26, 2017

It would be helpful to have a development checklist (e.g. on the wiki) for keeping track of what the steps are and how things are progressing.

@sbourdeauducq sbourdeauducq added this to the 3.0 milestone Jan 29, 2017
@sbourdeauducq
Member

There was another series of obscure bugs, which are now dealt with. Only #700 remains.
