
DMA for pulse sequences #553

Closed
dhslichter opened this issue Sep 7, 2016 · 19 comments

@dhslichter
Contributor

@sbourdeauducq @jordens @whitequark @dleibrandt @r-srinivas @dtcallcock I'm posting this here to provide a central place for discussions regarding how one might use direct memory access (DMA) to program pulse sequences that, for whatever reason, are undesirable to implement in the typical way (the processor calculating and pushing timestamps to the RTIO core one by one).

For long pulse sequences where the average time between RTIO output events is less than the time it takes for the processor to compute and push timestamps (roughly 1 us each), the slack will be steadily reduced and will eventually cause an RTIO underflow. The solution to this is to calculate timestamps farther in advance. The simplest way to do this is to make sufficiently deep FIFOs for the relevant channels, and then set the initial slack sufficiently large (e.g. using a single long delay) to enable the processor to calculate and push all the timestamps with enough slack remaining at the end.
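
A minimal sketch of this approach, assuming a kernel with a single ttl0 output; the 10 ms delay, the loop count, and the per-event timing comments are illustrative only, not figures from this issue:

@kernel
def run(self):
    self.core.break_realtime()
    delay(10*ms)  # single long delay: builds up slack before the burst
    for i in range(5000):
        # each iteration advances the timeline by only 200 ns, while the CPU
        # needs on the order of 1 us per event to compute and push it, so the
        # pre-built slack (and the channel's FIFO depth) absorbs the difference
        self.ttl0.pulse(100*ns)
        delay(100*ns)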

This method may not be desirable in some cases (e.g. where experimental duty cycle is important), so another solution would be to precompute RTIO timestamps on the PC, load them into RAM on the core device, and have the processor call for the timestamps to be pushed directly from RAM to the output FIFOs at the appropriate time.

There are numerous technical/implementation questions to be discussed. I list a few below, some with suggested answers and some without.

  • how are the timestamps to be put in RAM generated? One would want to be able to write standard ARTIQ code, probably with some kind of decorator or with statement, that the compiler recognizes as something that needs to be converted to timestamps on the PC and uploaded to RAM.
  • how to precompute timestamps on the PC without knowledge of the value of the RTIO counter at the appropriate point in the experiment? One suggestion here would be to have DMA store only relative timestamps, referenced to a particular timestamp that the core device will provide at runtime. One could then have the DMA engine add this offset timestamp to every timestamp it pulls from memory before pushing it to the RTIO FIFOs. The first timestamp in any DMA sequence would always be "0", with all others relative to it. Thus, if you want to run 100 repetitions of the same pulse sequence, you only have to compute and store one copy of the relative timestamps for a single sequence in RAM (a rough software model of this scheme is sketched just after this list).
  • how does the memory controller handle memory usage requests from the processor while a pulse sequence is being sent out to the FIFOs using DMA? If this is blocking, the processor can't do much while waiting for the DMA to finish.
  • if the processor, after calling a DMA pulse sequence, is tasked with putting additional RTIO events on an output FIFO that is also receiving events from the DMA process, how can we ensure that the DMA has completed (and thus that the processor doesn't put its event in the middle of the DMA stream, resulting in an RTIOSequenceError)?
  • what is the steady-state average rate that timestamps can be DMA'ed from RAM into the RTIO FIFOs? Is there a requirement from the physics end on what this needs to be?
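
To make the relative-timestamp scheme in the second bullet concrete, here is a rough software model; the function and the event tuple layout are illustrative assumptions, not actual ARTIQ gateware or API:

def replay(recorded_events, base_timestamp_mu, push_to_fifo):
    # recorded_events: list of (relative_timestamp_mu, channel, address, data)
    # tuples, with the relative timestamps referenced to the start of the
    # recorded sequence. At playback, the runtime base timestamp is added to
    # each stored offset before the event is pushed into the RTIO FIFOs.
    for rel_t, channel, address, data in recorded_events:
        push_to_fifo(base_timestamp_mu + rel_t, channel, address, data)

# Running 100 repetitions of the same sequence then needs only one stored
# copy of the relative timestamps; only base_timestamp_mu changes per run.
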
@jordens
Member

jordens commented Sep 7, 2016

Yea. You are describing the design that has been thrown around for a few years now.

  • DMA sequences need to be persistent. Otherwise they seem almost pointless.
  • There are two use cases: generating the RTIO event sequences at compile time or at runtime. Runtime seems more powerful and generic.
  • Timestamps are relative. But purely additive: no scaling of timestamps/durations.
  • The first timestamp is not "0" but whatever delay it has w.r.t. the logical start of the segment.
  • The arbiter is round-robin. You will get a slow-down due to sharing of the bandwidth and DRAM dynamics but no (inherent) starvation. The slow-down seems irrelevant since the DRAM can easily outpace the CPU.
  • DMA().play() would be blocking. Otherwise, interlocking access seems problematic.
  • DMA should be able to saturate a few FIFOs and the PHYs. That is one event per coarse RTIO cycle for a few channels in parallel: 125 MHz for a few channels.
my_burst = DMA("my_burst")
with my_burst.record():
    delay(10*ns)
    ttl0.pulse(20*ns)
    for i in range(100):
        dds2.pulse(300*MHz + i*1*MHz, 220*ns)
# timeline is unaltered and rewound to before the `with`

# ... new experiment, new kernel

my_pulse = DMA("my_burst")  # retrieve the persistent recording by name
t = now_mu()
for i in range(100):
    ttl2.pulse(3*us)
    my_pulse.play()
    # timeline advanced by the length of my_burst (10 ns + 20 ns + 100*220 ns)
assert t + seconds_to_mu(100*(3*us + 10*ns + 20*ns + 100*220*ns)) == now_mu()

@dhslichter
Contributor Author

dhslichter commented Sep 8, 2016

DMA().play() would be blocking. Otherwise, interlocking access seems problematic.

Since the DMA'ed events, and thus the full list of channels to which RTIO events will be sent by DMA, would necessarily be known at compile time, would it be possible to have DMA().play() block only when a processor instruction is reached that requests a read/write on one of the channels involved in the DMA? That way, the processor could continue to put RTIO events onto other channels' FIFOs while the DMA is occurring (or perform other processing tasks, such as counting pulses from different channels or calculating a Bayesian update of some sort). This would help preserve slack, especially if the DMA'ed sequence contains a very large number of very short pulses, such that the DMA bandwidth spec given above is not enough to keep the slack from shrinking.

@whitequark
Contributor

Since the DMA'ed events, and thus the full list of all channels to which RTIO events will be sent by DMA, would necessarily be known at compile time

Not necessarily.

if ttl0.input():
  dma0.play()
else:
  dma1.play()

@dhslichter
Contributor Author

OK, but the full lists of channels for both dma0 and dma1 are known, even though they may be different. For example:

if ttl0.input():
    dma0.play()
    ttl4.pulse(10*us)
else:
    dma1.play()
    ttl4.pulse(10*us)

If ttl4 is involved in dma0, but not in dma1, it should be possible at compile time to know that the processor should block on the dma0.play() but not on dma1.play().

If another channel, ttl7, is involved in both dma0 and dma1, then the code below would block on dma0, but would allow the ttl4 pulse to be issued while dma1 occurs, then block until dma1 completes before issuing the ttl7 pulse.

if ttl0.input():
    dma0.play()
    ttl4.pulse(10*us)
    ttl7.pulse(2*us)
else:
    dma1.play()
    ttl4.pulse(10*us)
    ttl7.pulse(2*us)

@whitequark
Contributor

If ttl4 is involved in dma0, but not in dma1, it should be possible at compile time to know that the processor should block on the dma0.play() but not on dma1.play().

Yes, we could add the list of touched RTIO channels to function signatures, much like iodelay in the compiler-assisted interleaving.
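
A rough model of that analysis (hypothetical, not existing compiler code; the channel sets simply mirror the ttl4/ttl7 example above):

# each DMA sequence and each RTIO statement carries the set of channels it touches
dma0_channels = {"ttl4", "ttl7"}
dma1_channels = {"ttl7"}

def needs_wait(pending_dma_channels, statement_channels):
    # block only when the pending DMA and the next RTIO access share a channel
    return bool(pending_dma_channels & statement_channels)

assert needs_wait(dma0_channels, {"ttl4"})      # must wait before ttl4.pulse()
assert not needs_wait(dma1_channels, {"ttl4"})  # ttl4 pulse can overlap dma1
assert needs_wait(dma1_channels, {"ttl7"})      # but ttl7 pulse must wait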

@jordens
Member

jordens commented Sep 8, 2016

  • How would that RTIO channel access be enforced? And by which component? Would that component then have to traverse the entire "in-use-by-DMA" channel list on every access by the CPU?
  • This would also break the current design where the (local) RTIO channels are all accessed through the same interface. Also, I am calculating channel numbers for Sayma dynamically (even though the compiler can fold it). Why do you think you can predict them?
  • Large DMA sequences are mostly stalled by FIFO depth and won't be limited by DMA. In other cases they will compete for memory bandwidth and slow down the CPU.
  • Coroutines or threads will need similar interlocking anyway. And once that lands, you would write
with concurrent:  # instead of `parallel`
    my_burst.play()
    ttl2.pulse(2*us)

and it should do that with "true" parallelism. But trying to hack concurrent DMA and CPU RTIO access into this right now before having CPU concurrency seems misguided to me.

  • AFAICT we would arbitrate and interlock access to the entire RTIO interface and not to the individual channels.
  • This design will also need to foresee distributed DMA. And then concurrency becomes yet a bit trickier. The arbitration would be at the remote end and would need to reject output submissions, similar to RTIOOverflow right now (to be attempted again).

@jordens
Member

jordens commented Sep 8, 2016

Not RTIOOverflow but the behavior on RTIO FIFO full.

@dhslichter
Contributor Author

Good points here @jordens. Seems like the subtleties on this are pretty problematic in the version discussed above.

One variant: how about allowing the CPU to perform RTIO reads, but not writes, during the DMA, as well as any operations not involving the RTIO core (calculations, RPCs, etc.)? This would enable one to profitably use the DMA playback time for tasks like count() on the data from the previous iteration of an experiment, or for computing Bayesian updates to parameters. I am thinking here more about something like a clock experiment, where overall duty cycle is important and it would be very useful to spend as much of that time as possible on soft-CPU calculations.

Alternatively, one could block on any RTIO command (read or write) but allow all other types of CPU operations to proceed during the DMA.
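
A sketch of the usage pattern the first variant would enable; here probe_pulses is a previously recorded DMA sequence, pmt is an input channel gated during the previous shot, and bayesian_update is a placeholder for the parameter-update math (none of this is the actual ARTIQ DMA API):

probe_pulses.play()        # DMA playback starts on the output channels
n = self.pmt.count()       # RTIO read from the previous shot: allowed during DMA
p = bayesian_update(p, n)  # pure computation: allowed during DMA
self.ttl4.pulse(1*us)      # RTIO write: blocks here until the DMA has completed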

@sbourdeauducq
Member

Funded by Oxford.

@dhslichter
Contributor Author

Awesome! Will the specification be posted somewhere?

@jordens
Member

jordens commented Nov 16, 2016

Sure. Currently the specification is the (virtual/perceived) consensus of this issue plus a bunch of details from IRC. It'll probably also change a bit when we take a step back, see DRTIO in its full glory, and have a clear perspective on how to hook DMA to it.
Do you have specific questions?

@dhslichter
Contributor Author

No specific questions, just wanted to know if there would be a public place where the current "vision" is maintained.

@jordens
Member

jordens commented Nov 16, 2016

Yeah. That's something we want to do.

@jordens
Member

jordens commented Nov 18, 2016

@sbourdeauducq
Member

Basic gateware written, works in functional simulation.

@vontell

vontell commented Jan 12, 2017

Has a timeline been posted for the completion of this?

@sbourdeauducq
Member

Work on this will begin as soon as the network stack fiasco is resolved, and it shouldn't take long to get the basic functionality working (~week at most) unless there is another series of obscure bugs.

@jbqubit
Contributor

jbqubit commented Jan 26, 2017

It would be helpful to have a development checklist (e.g. on the wiki) for keeping track of what the steps are and how things are progressing.

@sbourdeauducq sbourdeauducq added this to the 3.0 milestone Jan 29, 2017
@sbourdeauducq
Member

There was another series of obscure bugs, which are now dealt with. Only #700 remains.
