[Issue]: free_callback passed into rdma_get_pages() never called #169

Open
BrendanCunningham opened this issue Aug 14, 2024 · 17 comments

@BrendanCunningham

Problem Description

TLDR

amdgpu does not call the free callback that our Linux kernel driver passes into rdma_get_pages(). This results in stale ROCm-buffer:DMA-mapped-page cache entries and incorrect cache-entry lookups. Ultimately, the wrong pages are used to construct packets and the wrong data is sent.

Additionally, our driver requires assurance that, once acquired, DMA addresses for AMD GPU pages remain valid until either our driver calls rdma_put_pages(), or our page-invalidation callback has been called and has returned; the callback must be able to delay DMA-unmapping until our outstanding I/O operations on those pages have completed.

Detailed description

Our Linux kernel driver (hfi1) for our HPC fabric interface card (HFI) calls rdma_get_pages() in amdgpu to get DMA-mapped AMD GPU pages for a ROCm buffer. It does this so it can pass the DMA addresses into the HFI's packet construction engine (SDMA) to fill the packet payloads using DMA.

To avoid repeated rdma_get_pages() calls for the same buffer, hfi1 maintains a cache of ROCm-virtual-address-range:DMA-mapped-AMD-GPU-pages mappings ("VA:DMA" entries), with one cache per user context (userspace process). This is hfi1’s AMD-DMA cache.

hfi1 provides rdma_get_pages() with a free callback function.

hfi1 evicts entries from the DMA cache in two cases (a sketch of this flow follows the list):

  1. When the cache is full. In this case, hfi1 calls amdgpu's rdma_put_pages().
  2. When hfi1’s free callback is called, presumably because the ROCm buffer has been freed.
    • In this case, our free callback waits for outstanding I/O involving those pages to complete before returning.
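
For reference, here is a minimal sketch of this flow: pin and DMA-map a ROCm VA range through amdgpu's RDMA interface, cache the result keyed by VA, and register the free callback for eviction case 2. The wrapper prototypes, struct layout, and helper names below are illustrative assumptions, not the actual pin_amd.c code, and the real amdgpu interface may take additional arguments.

#include <linux/types.h>
#include <linux/pid.h>

struct amd_p2p_info;    /* DMA-mapped GPU pages, defined by amdgpu */

/*
 * Hypothetical wrappers around amdgpu's RDMA get/put hooks. The real
 * prototypes come from amdgpu's RDMA interface and may take additional
 * arguments; treat these as illustrative.
 */
int rdma_get_pages(u64 va, u64 len, struct pid *pid,
                   struct amd_p2p_info **p2p,
                   void (*free_callback)(void *client_priv),
                   void *client_priv);
int rdma_put_pages(struct amd_p2p_info **p2p);

/* One VA:DMA cache entry (illustrative layout). */
struct amd_pinning {
        u64 va;
        u64 len;
        struct amd_p2p_info *p2p;
};

/* Hypothetical helper: evict the entry and wait for in-flight SDMA I/O. */
void evict_pinning_and_wait(struct amd_pinning *pin);

/* Eviction case 2: amdgpu is expected to call this when the buffer is freed. */
static void hfi1_free_callback(void *client_priv)
{
        struct amd_pinning *pin = client_priv;

        evict_pinning_and_wait(pin);
}

static int pin_and_cache(struct pid *pid, u64 va, u64 len,
                         struct amd_pinning *pin)
{
        int ret;

        ret = rdma_get_pages(va, len, pid, &pin->p2p,
                             hfi1_free_callback, pin);
        if (ret)
                return ret;

        pin->va = va;
        pin->len = len;
        /*
         * Insert into the per-context VA:DMA cache here; when the cache is
         * full, evict older entries with rdma_put_pages() (eviction case 1).
         */
        return 0;
}

In practice, hfi1_free_callback() above is never invoked, so only the cache-full path (case 1) ever releases pages.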

In our experience, however, hfi1's free callback is never called. As a result, hfi1 cannot remove the VA:DMA entry for a ROCm buffer when the ROCm buffer is freed.

This causes a problem: When a new ROCm buffer gets a virtual address that overlaps with the VA range of a previously allocated-and-freed ROCm buffer that was passed into hfi1, hfi1 will find the old ROCm buffer's VA:DMA entry in its cache. hfi1 will pass the stale entry's DMA address into the HFI's SDMA engine. The adapter will then construct the packet from the wrong pages with the wrong data.

We are able to reproduce this problem in testing using osu_multi_lat --warmup 0 --validation-warmup 0 -i 1 -c -m 1:4200000 H D. Our driver with AMD-DMA support is not yet public.

We can work around this by disabling hfi1's AMD-DMA cache. With the cache disabled, we call rdma_put_pages() as soon as we detect that the packet has been sent. However, this may come at a performance cost.

Additionally, hfi1 needs assurance that DMA addresses for DMA-mapped pages will remain valid until either:

  • hfi1 calls rdma_put_pages(), OR
  • after amdgpu calls hfi1’s free callback for the GPU pages and hfi1 returns from that callback.
    • It is important that hfi1 be able to delay returning until the outstanding I/O has completed.

Operating System

Red Hat Enterprise Linux 9.4 (Plow)

CPU

Intel(R) Xeon(R) CPU E5-2699 v4 @ 2.20GHz

GPU

AMD Instinct MI100

ROCm Version

ROCm 6.2.0

ROCm Component

ROCK-Kernel-Driver

Steps to Reproduce

No response

(Optional for Linux users) Output of /opt/rocm/bin/rocminfo --support

ROCk module version 6.8.5 is loaded
=====================    
HSA System Attributes    
=====================    
Runtime Version:         1.14
Runtime Ext Version:     1.6
System Timestamp Freq.:  1000.000000MHz
Sig. Max Wait Duration:  18446744073709551615 (0xFFFFFFFFFFFFFFFF) (timestamp count)
Machine Model:           LARGE                              
System Endianness:       LITTLE                             
Mwaitx:                  DISABLED
DMAbuf Support:          YES

==========               
HSA Agents               
==========               
*******                  
Agent 1                  
*******                  
  Name:                    Intel(R) Xeon(R) CPU E5-2699 v4 @ 2.20GHz
  Uuid:                    CPU-XX                             
  Marketing Name:          Intel(R) Xeon(R) CPU E5-2699 v4 @ 2.20GHz
  Vendor Name:             CPU                                
  Feature:                 None specified                     
  Profile:                 FULL_PROFILE                       
  Float Round Mode:        NEAR                               
  Max Queue Number:        0(0x0)                             
  Queue Min Size:          0(0x0)                             
  Queue Max Size:          0(0x0)                             
  Queue Type:              MULTI                              
  Node:                    0                                  
  Device Type:             CPU                                
  Cache Info:              
    L1:                      32768(0x8000) KB                   
  Chip ID:                 0(0x0)                             
  ASIC Revision:           0(0x0)                             
  Cacheline Size:          64(0x40)                           
  Max Clock Freq. (MHz):   3600                               
  BDFID:                   0                                  
  Internal Node ID:        0                                  
  Compute Unit:            44                                 
  SIMDs per CU:            0                                  
  Shader Engines:          0                                  
  Shader Arrs. per Eng.:   0                                  
  WatchPts on Addr. Ranges:1                                  
  Memory Properties:       
  Features:                None
  Pool Info:               
    Pool 1                   
      Segment:                 GLOBAL; FLAGS: FINE GRAINED        
      Size:                    32286776(0x1eca838) KB             
      Allocatable:             TRUE                               
      Alloc Granule:           4KB                                
      Alloc Recommended Granule:4KB                                
      Alloc Alignment:         4KB                                
      Accessible by all:       TRUE                               
    Pool 2                   
      Segment:                 GLOBAL; FLAGS: KERNARG, FINE GRAINED
      Size:                    32286776(0x1eca838) KB             
      Allocatable:             TRUE                               
      Alloc Granule:           4KB                                
      Alloc Recommended Granule:4KB                                
      Alloc Alignment:         4KB                                
      Accessible by all:       TRUE                               
    Pool 3                   
      Segment:                 GLOBAL; FLAGS: COARSE GRAINED      
      Size:                    32286776(0x1eca838) KB             
      Allocatable:             TRUE                               
      Alloc Granule:           4KB                                
      Alloc Recommended Granule:4KB                                
      Alloc Alignment:         4KB                                
      Accessible by all:       TRUE                               
  ISA Info:                
*******                  
Agent 2                  
*******                  
  Name:                    Intel(R) Xeon(R) CPU E5-2699 v4 @ 2.20GHz
  Uuid:                    CPU-XX                             
  Marketing Name:          Intel(R) Xeon(R) CPU E5-2699 v4 @ 2.20GHz
  Vendor Name:             CPU                                
  Feature:                 None specified                     
  Profile:                 FULL_PROFILE                       
  Float Round Mode:        NEAR                               
  Max Queue Number:        0(0x0)                             
  Queue Min Size:          0(0x0)                             
  Queue Max Size:          0(0x0)                             
  Queue Type:              MULTI                              
  Node:                    1                                  
  Device Type:             CPU                                
  Cache Info:              
    L1:                      32768(0x8000) KB                   
  Chip ID:                 0(0x0)                             
  ASIC Revision:           0(0x0)                             
  Cacheline Size:          64(0x40)                           
  Max Clock Freq. (MHz):   3600                               
  BDFID:                   0                                  
  Internal Node ID:        1                                  
  Compute Unit:            44                                 
  SIMDs per CU:            0                                  
  Shader Engines:          0                                  
  Shader Arrs. per Eng.:   0                                  
  WatchPts on Addr. Ranges:1                                  
  Memory Properties:       
  Features:                None
  Pool Info:               
    Pool 1                   
      Segment:                 GLOBAL; FLAGS: FINE GRAINED        
      Size:                    33015296(0x1f7c600) KB             
      Allocatable:             TRUE                               
      Alloc Granule:           4KB                                
      Alloc Recommended Granule:4KB                                
      Alloc Alignment:         4KB                                
      Accessible by all:       TRUE                               
    Pool 2                   
      Segment:                 GLOBAL; FLAGS: KERNARG, FINE GRAINED
      Size:                    33015296(0x1f7c600) KB             
      Allocatable:             TRUE                               
      Alloc Granule:           4KB                                
      Alloc Recommended Granule:4KB                                
      Alloc Alignment:         4KB                                
      Accessible by all:       TRUE                               
    Pool 3                   
      Segment:                 GLOBAL; FLAGS: COARSE GRAINED      
      Size:                    33015296(0x1f7c600) KB             
      Allocatable:             TRUE                               
      Alloc Granule:           4KB                                
      Alloc Recommended Granule:4KB                                
      Alloc Alignment:         4KB                                
      Accessible by all:       TRUE                               
  ISA Info:                
*******                  
Agent 3                  
*******                  
  Name:                    gfx908                             
  Uuid:                    GPU-95386651081add54               
  Marketing Name:          AMD Instinct MI100                 
  Vendor Name:             AMD                                
  Feature:                 KERNEL_DISPATCH                    
  Profile:                 BASE_PROFILE                       
  Float Round Mode:        NEAR                               
  Max Queue Number:        128(0x80)                          
  Queue Min Size:          64(0x40)                           
  Queue Max Size:          131072(0x20000)                    
  Queue Type:              MULTI                              
  Node:                    2                                  
  Device Type:             GPU                                
  Cache Info:              
    L1:                      16(0x10) KB                        
    L2:                      8192(0x2000) KB                    
  Chip ID:                 29580(0x738c)                      
  ASIC Revision:           1(0x1)                             
  Cacheline Size:          64(0x40)                           
  Max Clock Freq. (MHz):   1502                               
  BDFID:                   1280                               
  Internal Node ID:        2                                  
  Compute Unit:            120                                
  SIMDs per CU:            4                                  
  Shader Engines:          8                                  
  Shader Arrs. per Eng.:   1                                  
  WatchPts on Addr. Ranges:4                                  
  Coherent Host Access:    FALSE                              
  Memory Properties:       
  Features:                KERNEL_DISPATCH 
  Fast F16 Operation:      TRUE                               
  Wavefront Size:          64(0x40)                           
  Workgroup Max Size:      1024(0x400)                        
  Workgroup Max Size per Dimension:
    x                        1024(0x400)                        
    y                        1024(0x400)                        
    z                        1024(0x400)                        
  Max Waves Per CU:        40(0x28)                           
  Max Work-item Per CU:    2560(0xa00)                        
  Grid Max Size:           4294967295(0xffffffff)             
  Grid Max Size per Dimension:
    x                        4294967295(0xffffffff)             
    y                        4294967295(0xffffffff)             
    z                        4294967295(0xffffffff)             
  Max fbarriers/Workgrp:   32                                 
  Packet Processor uCode:: 67                                 
  SDMA engine uCode::      18                                 
  IOMMU Support::          None                               
  Pool Info:               
    Pool 1                   
      Segment:                 GLOBAL; FLAGS: COARSE GRAINED      
      Size:                    33538048(0x1ffc000) KB             
      Allocatable:             TRUE                               
      Alloc Granule:           4KB                                
      Alloc Recommended Granule:2048KB                             
      Alloc Alignment:         4KB                                
      Accessible by all:       FALSE                              
    Pool 2                   
      Segment:                 GLOBAL; FLAGS: EXTENDED FINE GRAINED
      Size:                    33538048(0x1ffc000) KB             
      Allocatable:             TRUE                               
      Alloc Granule:           4KB                                
      Alloc Recommended Granule:2048KB                             
      Alloc Alignment:         4KB                                
      Accessible by all:       FALSE                              
    Pool 3                   
      Segment:                 GROUP                              
      Size:                    64(0x40) KB                        
      Allocatable:             FALSE                              
      Alloc Granule:           0KB                                
      Alloc Recommended Granule:0KB                                
      Alloc Alignment:         0KB                                
      Accessible by all:       FALSE                              
  ISA Info:                
    ISA 1                    
      Name:                    amdgcn-amd-amdhsa--gfx908:sramecc+:xnack-
      Machine Models:          HSA_MACHINE_MODEL_LARGE            
      Profiles:                HSA_PROFILE_BASE                   
      Default Rounding Mode:   NEAR                               
      Default Rounding Mode:   NEAR                               
      Fast f16:                TRUE                               
      Workgroup Max Size:      1024(0x400)                        
      Workgroup Max Size per Dimension:
        x                        1024(0x400)                        
        y                        1024(0x400)                        
        z                        1024(0x400)                        
      Grid Max Size:           4294967295(0xffffffff)             
      Grid Max Size per Dimension:
        x                        4294967295(0xffffffff)             
        y                        4294967295(0xffffffff)             
        z                        4294967295(0xffffffff)             
      FBarrier Max Size:       32                                 
*** Done ***             

Additional Information

No response

@ppanchad-amd

@BrendanCunningham Internal ticket has been created to investigate this issue. Thanks!

@schung-amd

Hi @BrendanCunningham, while the driver is not yet public, is it possible to provide some code with your caching logic that reproduces the issue? I've reached out to our kernel driver team, and my understanding is that the buffer should not be freed before rdma_put_pages is called, so this should be working in theory. Of course, this would mean that the free_callback is vestigial, so I'd like to figure out if this buffer can be freed somewhere else, as a search through our codebase does not show anywhere this would be used.

@BrendanCunningham
Author

@schung-amd

Hi @BrendanCunningham, while the driver is not yet public, is it possible to provide some code with your caching logic that reproduces the issue?

Yes. I misspoke before; our AMD DMA code is not ready but it is public. hfi1/pin_amd.c has our send-side AMD DMA support, i.e. construct packets using DMA instead of CPU-driven copies.

I've reached out to our kernel driver team, and my understanding is that the buffer should not be freed before rdma_put_pages is called, so this should be working in theory.

Does that mean that DMA addresses for DMA-mapped AMD pages obtained with rdma_get_pages() are guaranteed to remain valid until after we explicitly call rdma_put_pages()?

Of course, this would mean that the free_callback is vestigial, so I'd like to figure out if this buffer can be freed somewhere else, as a search through our codebase does not show anywhere this would be used.

Does amdgpu have a pin+DMA-mapped page cache to avoid frequent unpinning and repinning of DMA-mapped pages, within the context of a user process?

@schung-amd

Thanks for the quick response!

Yes. I misspoke before; our AMD DMA code is not ready but it is public. hfi1/pin_amd.c has our send-side AMD DMA support, i.e. construct packets using DMA instead of CPU-driven copies.

Perfect, thanks. I'll try to reproduce what you're seeing here.

Does that mean that DMA addresses for DMA-mapped AMD pages obtained with rdma_get_pages() are guaranteed to remain valid until after we explicitly call rdma_put_pages()?

Yes, this is what I have been told. When calling rdma_get_pages(), the amdgpu VRAM manager pins VRAM pages for the user which are not released until rdma_put_pages() is called. When there is VRAM pressure, the VRAM manager stops user processes and migrates pages to relieve the pressure, but pages pinned by rdma_get_pages() are exempt and will not be evicted by the VRAM manager, so your cached addresses should remain valid. However, as this does not appear to match your observations, I'm discussing with the internal team for further understanding on this.

Does amdgpu have a pin+DMA-mapped page cache to avoid frequent unpinning and repinning of DMA-mapped pages, within the context of a user process?

Do you mean a cache that can be accessed by the user to retrieve these pages, to eliminate the need for a cache on your end? I'll discuss this with the internal team and update you when I have more information.

@BrendanCunningham
Author

Thanks for the quick response!

Yes. I misspoke before; our AMD DMA code is not ready but it is public. hfi1/pin_amd.c has our send-side AMD DMA support, i.e. construct packets using DMA instead of CPU-driven copies.

Perfect, thanks. I'll try to reproduce what you're seeing here.

I have a workaround in pin_amd.c that disables the cache. This workaround is not public yet. It makes two changes:

  1. In insert_amd_pinning(), do not insert struct amd_pintree_node objects into the pintree.
    1. This makes our driver call rdma_get_pages() for every ROCm VA range passed into it.
  2. Manage struct amd_pintree_node lifetime with struct kref and suitable kref_put() release callbacks instead of the current atomic_t-refcount-based code.
    1. This makes our driver call rdma_put_pages() as soon as our hardware signals packet-send completion, rather than waiting for cache eviction.

With this workaround, we no longer send incorrect data.
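
A minimal sketch of the kref-based lifetime change in step 2, assuming each in-flight packet holds a reference that is dropped on send completion (the field and function names are illustrative, not the actual pin_amd.c code):

#include <linux/kernel.h>
#include <linux/kref.h>
#include <linux/slab.h>

struct amd_p2p_info;                            /* from amdgpu's RDMA interface */
int rdma_put_pages(struct amd_p2p_info **p2p);  /* prototype approximate */

struct amd_pintree_node {
        struct kref refcount;           /* replaces the old atomic_t refcount */
        struct amd_p2p_info *p2p;       /* DMA-mapped GPU pages for this entry */
};

/* kref release function: runs when the last reference is dropped. */
static void amd_node_release(struct kref *kref)
{
        struct amd_pintree_node *node =
                container_of(kref, struct amd_pintree_node, refcount);

        rdma_put_pages(&node->p2p);     /* unpin right away, no cache eviction */
        kfree(node);
}

/* Each packet that uses the node takes a reference before it is posted. */
static void amd_node_get(struct amd_pintree_node *node)
{
        kref_get(&node->refcount);
}

/* Called when the hardware signals send completion; the last put unpins. */
static void amd_node_put(struct amd_pintree_node *node)
{
        kref_put(&node->refcount, amd_node_release);
}

Whether rdma_put_pages() may safely be called from the completion interrupt is the question raised below.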

Does that mean that DMA addresses for DMA-mapped AMD pages obtained with rdma_get_pages() are guaranteed to remain valid until after we explicitly call rdma_put_pages()?

Yes, this is what I have been told. When calling rdma_get_pages(), the amdgpu VRAM manager pins VRAM pages for the user which are not released until rdma_put_pages() is called. When there is VRAM pressure, the VRAM manager stops user processes and migrates pages to relieve the pressure, but pages pinned by rdma_get_pages() are exempt and will not be evicted by the VRAM manager, so your cached addresses should remain valid. However, as this does not appear to match your observations, I'm discussing with the internal team for further understanding on this.

To be absolutely clear, we are not having problems with pages obtained via rdma_get_pages() being migrated; it would simply be a problem for us if DMA-mapped pages were migrated underneath us or the DMA addresses otherwise became invalid.

The problem that we do have is that, since our free_callback is not called, we can't maintain our VA-range:DMA-mapping (struct amd_p2p_info) cache properly and end up with stale cache entries, which can then lead to us using the wrong DMA addresses.

Does amdgpu have a pin+DMA-mapped page cache to avoid frequent unpinning and repinning of DMA-mapped pages, within the context of a user process?

Do you mean a cache that can be accessed by the user to retrieve these pages, to eliminate the need for a cache on your end? I'll discuss this with the internal team and update you when I have more information.

I mean a cache in the amdgpu driver for other kernel drivers like ours, not for userspace; i.e., is there a cache behind rdma_get_pages()? But yes, the goal would be to eliminate the need for a cache on our end.

@BrendanCunningham
Author

BrendanCunningham commented Aug 29, 2024

@schung-amd Also, is it safe to call rdma_put_pages() from an interrupt context? My workaround involves calling rdma_put_pages() in our hardware completion handler, which is called via interrupt.

@schung-amd

Hi @BrendanCunningham, thanks for following up on this! Sorry for the delay, I'm trying to collect more information from our internal teams before providing answers because I don't have a complete understanding of DMA, so some of these points and followup questions may be unclear; feel free to correct me or clarify on anything. So far my understanding is:

  • We guarantee that pages pinned by rdma_get_pages() are resident until rdma_put_pages() is called;
  • free_callback() is never called, as you have already observed in testing and also in the codebase, and is probably not needed;
  • I don't think there is a cache in the driver you can use, but still inquiring about this; and
  • I don't know how your cached addresses are becoming invalid.

Are you relying purely on the callback to inform when to remove entries from your cache, even when invoking rdma_put_pages()? In this case, the callback will never be called so the cache will still have stale entries. Or are the addresses changing underneath the hood somewhere, and you are expecting free_callback to be invoked? From your initial post, it sounds like pages are being freed without being handled by the cache, but if my understanding is correct then this should not be happening. Again, my understanding of DMA is incomplete, so there may be something obvious here I'm missing.

I will inquire regarding calling rdma_put_pages() in an interrupt context. I suspect it is not safe to do so, but I'll let you know what the internal teams suggest as soon as I have that information.
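
If it turns out that rdma_put_pages() cannot be called safely from the completion interrupt, one standard kernel pattern would be to defer the release to process context with a workqueue. A hedged sketch, reusing the hypothetical node layout from the kref example above:

#include <linux/kernel.h>
#include <linux/workqueue.h>
#include <linux/slab.h>

struct amd_p2p_info;                            /* from amdgpu's RDMA interface */
int rdma_put_pages(struct amd_p2p_info **p2p);  /* prototype approximate */

struct amd_pintree_node {
        struct work_struct put_work;    /* deferred-unpin work item */
        struct amd_p2p_info *p2p;
};

/* Runs later in process context, where sleeping and blocking are allowed. */
static void amd_node_put_work(struct work_struct *work)
{
        struct amd_pintree_node *node =
                container_of(work, struct amd_pintree_node, put_work);

        rdma_put_pages(&node->p2p);
        kfree(node);
}

/* Called from the interrupt-driven send-completion handler. */
static void amd_node_send_complete(struct amd_pintree_node *node)
{
        INIT_WORK(&node->put_work, amd_node_put_work);
        schedule_work(&node->put_work);
}

This trades a small amount of completion latency for not depending on rdma_put_pages() being interrupt-safe.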

@BrendanCunningham
Author

BrendanCunningham commented Aug 29, 2024

Hi @BrendanCunningham, thanks for following up on this! Sorry for the delay, I'm trying to collect more information from our internal teams before providing answers because I don't have a complete understanding of DMA, so some of these points and followup questions may be unclear; feel free to correct me or clarify on anything. So far my understanding is:

* We guarantee that pages pinned by rdma_get_pages() are resident until rdma_put_pages() is called;

* free_callback() is never called, as you have already observed in testing and also in the codebase, and is probably not needed;

* I don't think there is a cache in the driver you can use, but still inquiring about this; and

* I don't know how your cached addresses are becoming invalid.

Are you relying purely on the callback to inform when to remove entries from your cache, even when invoking rdma_put_pages()? In this case, the callback will never be called so the cache will still have stale entries.

Correct; that is the problem: we are given a ROCm buffer and add a cache entry; at some point that buffer is freed, and a new ROCm buffer with different physical pages but the same or overlapping virtual address range as the old buffer is passed into our driver, and we fetch the stale cache entry.

Or are the addresses changing underneath the hood somewhere, and you are expecting free_callback to be invoked?

Edit: No, the DMA addresses are not changing underneath us; I only asked if the addresses were guaranteed stable to rule that out as a source of error.

From your initial post, it sounds like pages are being freed without being handled by the cache, but if my understanding is correct then this should not be happening. Again, my understanding of DMA is incomplete, so there may be something obvious here I'm missing.

I will inquire regarding calling rdma_put_pages() in an interrupt context. I suspect it is not safe to do so, but I'll let you know what the internal teams suggest as soon as I have that information.

Okay, thanks.

@schung-amd

@BrendanCunningham Still gathering information re: calling rdma_put_pages() from an interrupt context; the internal team initially recommends against calling it, but is digging into the code to check.

Correct; that is the problem: we are given a ROCm buffer and add a cache entry; at some point that buffer is freed, and a new ROCm buffer with different physical pages but the same or overlapping virtual address range as the old buffer is passed into our driver, and we fetch the stale cache entry.

Is your driver in control of freeing the buffer, or is something else freeing the buffer? Is it possible to place the free_callback logic that modifies your cache entries directly where you call rdma_put_pages()?

@BrendanCunningham
Author

@BrendanCunningham Still gathering information re: calling rdma_put_pages() from an interrupt context; the internal team initially recommends against calling it, but is digging into the code to check.

Correct; that is the problem: we are given a ROCm buffer and add a cache entry; at some point that buffer is freed, and a new ROCm buffer with different physical pages but the same or overlapping virtual address range as the old buffer is passed into our driver, and we fetch the stale cache entry.

Is your driver in control of freeing the buffer, or is something else freeing the buffer? Is it possible to place the free_callback logic that modifies your cache entries directly where you call rdma_put_pages()?

Our driver is not in control of freeing the buffer; we can only react to others (userspace or amdgpu) freeing the buffers, and only if free_callback is called when they do.

Absent free_callback, our only good/safe option is disabling our caching code.

@schung-amd

According to the internal team, your driver should have full control over the lifetime of the buffers; amdgpu guarantees that your buffers are resident until the driver calls rdma_put_pages() on them, and by design userspace should have no way of freeing them safely outside of your driver's control. Do you have any logs showing how/when your buffers are being freed without the driver knowing?

@ddalessa

We will update with logs soon.

@BrendanCunningham
Author

Here are pr_debug() logs from two hosts in a 2-rank, 2-host job run with our driver (hfi1) and our AMD-DMA cache enabled:

Note the absence of invalidate_sdma_pages_gpu lines in either log; those events would only be printed from our free_callback function, which is not called. unpin_sdma_pages_gpu lines are not the same as invalidate_sdma_pages_gpu. In these logs, unpin_sdma_pages_gpu lines are printed when the user file descriptor is torn down and any entries in our SDMA cache are freed.

Here is the job log:

The job fails on a data validation error because our driver cannot maintain correct DMA cache entries.

Here is the script I ran on both hosts after loading our driver:

#!/bin/bash
set -x
# Helper to write dynamic-debug commands to the control file.
alias ddcmd='echo $* > /proc/dynamic_debug/control'
shopt -s expand_aliases
# Raise the console log level and clear the kernel ring buffer.
dmesg -n debug
dmesg -C
# Clear existing dynamic-debug settings, then enable pr_debug() output
# for hfi1's pin_amd.c and for all of amdgpu.
ddcmd '-p; module hfi1 file pin_amd.c +p'
ddcmd 'module amdgpu +p'

@schung-amd

Thanks for the logs! I'll pass them on to the internal team for more insight. As discussed, I wouldn't expect the callback to be called anywhere, as the internal team has stated that it is not used, so the absence of invalidate_sdma_pages_gpu is expected. Apologies if these seem like trivial questions, but to confirm: where unpin_sdma_pages_gpu is being called, is this happening outside of your control (i.e. from amdgpu, which should be guaranteed to not happen, or from userspace somehow)? Is it possible to modify your cache inside unpin_sdma_pages_gpu?

@BrendanCunningham
Author

Thanks for the logs! I'll pass them on to the internal team for more insight. As discussed, I wouldn't expect the callback to be called anywhere, as the internal team has stated that it is not used, so the absence of invalidate_sdma_pages_gpu is expected. Apologies if these seem like trivial questions, but to confirm: where unpin_sdma_pages_gpu is being called, is this happening outside of your control (i.e. from amdgpu, which should be guaranteed to not happen, or from userspace somehow)?

unpin_sdma_pages_gpu is under our control.

Note that the actual function in our code is unpin_amd_node(); unpin_sdma_pages_gpu is the trace event emitted by the tracepoint in unpin_amd_node(). But to avoid further confusion, I'll stick to calling it unpin_sdma_pages_gpu() since that's what's in our logs and is only used in unpin_amd_node().

Is it possible to modify your cache inside unpin_sdma_pages_gpu?

No, it is not possible; we will only call unpin_sdma_pages_gpu() after we have evicted the entry from our cache and the refcount for that pinning object hits 0.

Even if amdgpu did call our free_callback, our code wouldn't call unpin_sdma_pages_gpu() in that path on the assumption that amdgpu will unpin after we return and that calling into amdgpu from inside the free_callback path could deadlock.

Our free_callback would evict the cache entry from the cache as soon as we get the lock for our cache tree. This is to prevent calls into our driver from other CPUs from getting the old cache entry. Our free_callback would then wait for all outstanding I/O on our cache object to complete before returning.
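
For concreteness, here is a sketch of what that free_callback path would look like; the interval-tree, locking, and waitqueue details below are illustrative assumptions rather than the actual pin_amd.c implementation:

#include <linux/interval_tree.h>
#include <linux/spinlock.h>
#include <linux/wait.h>
#include <linux/atomic.h>

struct amd_pintree {
        spinlock_t lock;
        struct rb_root_cached root;             /* VA-range keyed cache */
};

struct amd_pintree_node {
        struct interval_tree_node it_node;      /* keyed by ROCm VA range */
        struct amd_pintree *tree;
        atomic_t inflight;                      /* packets still using these pages */
        wait_queue_head_t wait;                 /* woken by the SDMA completion path */
};

static void hfi1_gpu_free_callback(void *client_priv)
{
        struct amd_pintree_node *node = client_priv;
        struct amd_pintree *tree = node->tree;

        /* Evict first so concurrent lookups can no longer find this entry. */
        spin_lock(&tree->lock);
        interval_tree_remove(&node->it_node, &tree->root);
        spin_unlock(&tree->lock);

        /* Wait for every in-flight packet that uses these pages to finish. */
        wait_event(node->wait, atomic_read(&node->inflight) == 0);

        /*
         * Do not call rdma_put_pages() here: the assumption is that amdgpu
         * unpins after this callback returns, and calling back into amdgpu
         * from this path could deadlock.
         */
}

The last step only works if amdgpu actually invokes the callback, which is the subject of this issue.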

@ddalessa

ddalessa commented Sep 20, 2024

I want to make sure we are all on the same page here and that I understand things. I think what I'm hearing is that the high-level issue is that our driver doesn't know when the VA-to-PA mapping changes. The PAs are fine; those pages are pinned, and the put/get ops handle that.

Basically what happens is this:

  1. User allocates a buffer on GPU (probably with something like hipMalloc()). This gets the user a pointer, call it VA1 for the buffer
  2. User goes to ask the hfi1 driver to send that buffer telling it the VA1
  3. Now hfi1 goes to pin those pages and effectively cache a VA1 to PA1 mapping. This is a time consuming operation and why we need to cache.
  4. User does a hipFree() or something else happens to free that VA.
  5. AMD Driver can free that mapping. The pages are still pinned. The hfi can still access the memory. This is not the problem.
  6. User does another allocation and gets another buffer; somehow this VA is the same as it was in step 1, so consider it VA1 still.
  7. User asks hfi1 to send the new buffer again with VA1, and the hfi1 looks up the address associated in our cache and finds PA1, never pinning PA2. The end result is hfi1 sends data from the wrong PA.

There is no mechanism right now for the hfi1 driver to know that VA1 mapping should now be pointing at PA2.

The ask here is that the AMD driver use the callback to inform our driver so that we can handle the scenario in step (5).

So while you are correct, @schung-amd, that the AMD driver doesn't NEED to call the callback for its own purposes, other users of the buffer do need it to be called; otherwise we can't cache those VA<->PA mappings. For things that use a separate memory-registration scheme this may not be an issue, but for our pseudo on-demand-paging scheme it is.
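
To illustrate steps 6 and 7, here is a sketch of the stale lookup, reusing the hypothetical struct amd_pintree/amd_pintree_node layout from the free_callback sketch earlier in the thread (not the actual hfi1 code):

#include <linux/kernel.h>
#include <linux/interval_tree.h>
#include <linux/spinlock.h>

/*
 * Lookup by VA range. After steps 4-6, with no callback to trigger eviction,
 * this still returns the node that maps VA1 to PA1.
 */
static struct amd_pintree_node *lookup_pinning(struct amd_pintree *tree,
                                               u64 va, u64 len)
{
        struct interval_tree_node *it;
        struct amd_pintree_node *node = NULL;

        spin_lock(&tree->lock);
        it = interval_tree_iter_first(&tree->root, va, va + len - 1);
        if (it)
                node = container_of(it, struct amd_pintree_node, it_node);
        spin_unlock(&tree->lock);

        return node;    /* stale: its DMA addresses still point at PA1, not PA2 */
}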

@BrendanCunningham Please correct any of my misunderstanding in the above.

@BrendanCunningham
Author

@ddalessa No corrections; that is a good summary.
