How did you calculate the NextState? #1

cyborgdennett · 2023-10-14T15:27:10Z

Hi, great work. I was looking through this and did not see an explanation of how you went from the matrix version, which uses [4, 1, 2, 1, 4, 4, 1, 1, 1, 2, 0, 0, 0, 0, 0, 0], to [[0, 1, 0, 0], [2, 2, 0, 0], [2, 0, 1, 0]] in the SIMD version. I would like to write this for 256- and 512-bit SIMD, so I would like to know how I could make the matrix larger.


Wunkolo · 2023-10-14T18:46:04Z

The [[0, 1, 0, 0], [2, 2, 0, 0], [2, 0, 1, 0]] that you see in the SIMD implementation are actually left bit-shift amounts! Each left shift by one bit is the same as a multiplication by two.
So each 1 you see is actually a multiplication by 2, and each 2 is a multiplication by 4. The 0 elements leave the values unchanged (a multiplication by 1). So the SIMD matrix that you see is basically the original matrix with each value n replaced by log2(n).
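As a quick sanity check of the shift-versus-multiply equivalence described above, here is a minimal Python sketch. The names `coefficients`, `shifts`, and `mul_by_shift` are made up for illustration and are not part of qFib.

```python
import math

# Nonzero coefficients that appear in the matrix quoted above.
coefficients = [1, 2, 4]

# log2 of each coefficient gives the shift amount that replaces the multiply.
shifts = {c: int(math.log2(c)) for c in coefficients}


def mul_by_shift(x, coefficient):
    """Multiply x by a power-of-two coefficient using a left shift."""
    return x << shifts[coefficient]


# The shifted value always equals the ordinary product.
for c in coefficients:
    for x in range(50):
        assert mul_by_shift(x, c) == x * c
```

A zero coefficient has no log2, so it needs separate handling (the Rust code later in this thread uses an out-of-range shift count to zero a lane).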

> I would like to write this for 256- and 512-bit SIMD, so I would like to know how I could make the matrix larger.

The issue I was encountering was that to make larger matrices, I needed to generalize the derivation in a way that kept things as powers of two, so that I could use bit-shifts again rather than a regular multiplication. Another issue is that 32-bit and 64-bit values get exhausted very quickly since Fibonacci numbers increase so fast, so it provided little benefit over just using a LUT.
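To put a number on how quickly fixed-width integers run out, here is a small Python sketch (the helper name `largest_fib_index` is made up for illustration) that finds the largest n for which F(n) still fits in an unsigned integer of a given width.

```python
def largest_fib_index(bits):
    """Return the largest n such that F(n) fits in an unsigned `bits`-bit integer."""
    limit = 1 << bits
    a, b = 0, 1  # a = F(n), b = F(n + 1)
    n = 0
    while b < limit:
        a, b = b, a + b
        n += 1
    return n


print(largest_fib_index(32))  # 47
print(largest_fib_index(64))  # 93
```

So a complete lookup table needs only 48 entries for 32-bit results and 94 entries for 64-bit results, which is why a LUT is hard to beat at these widths.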

I got some more insight here on this reddit thread that might be of value to you too though.

cyborgdennett · 2023-10-19T12:42:47Z

I found one nice matrix that calculates the next n+2 through n+5. The only catch is that you need one more SIMD vector, but it should be a bit faster.

The matrix:

n+5 [1 16 16 4]
n+4 [8 4 1 1]
n+3 [4 4 1 0]
n+2 [1 4 2 0]

left shiftable:

[2,0,!0,!0]
[4,0,0,1]
[4,2,2,2]
[0,3,2,0]
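Here is a scalar Python sketch of one step of this matrix, for checking it without SIMD. It assumes the columns multiply [F(n-1), F(n-2), F(n-3), F(n-4)] in that order; that reading is inferred from the numbers and is not stated in the post. `None` stands in for the `!0` entries (lanes that get zeroed), and all names are illustrative only.

```python
# Rows of the matrix above, keyed by the offset k in F(n+k).
MATRIX = {
    5: [1, 16, 16, 4],
    4: [8, 4, 1, 1],
    3: [4, 4, 1, 0],
    2: [1, 4, 2, 0],
}


def fib_list(count):
    """Return [F(0), F(1), ..., F(count - 1)]."""
    fibs = [0, 1]
    while len(fibs) < count:
        fibs.append(fibs[-1] + fibs[-2])
    return fibs[:count]


def step(window):
    """Map [F(n-4), F(n-3), F(n-2), F(n-1)] to [F(n+2), F(n+3), F(n+4), F(n+5)]."""
    newest_first = window[::-1]
    return [sum(c * f for c, f in zip(MATRIX[k], newest_first)) for k in (2, 3, 4, 5)]


# Every coefficient is zero or a power of two, so each one maps to a shift.
# Reading the columns from right to left reproduces the shift table above.
columns = list(zip(MATRIX[5], MATRIX[4], MATRIX[3], MATRIX[2]))
SHIFTS = [[None if c == 0 else c.bit_length() - 1 for c in col]
          for col in reversed(columns)]

fibs = fib_list(40)
for start in range(30):
    # Each step advances the four-element window by 6 positions.
    assert step(fibs[start:start + 4]) == fibs[start + 6:start + 10]
```

Under that column order, each step moves the window forward by six positions, which matches the stride used in the Rust code.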

Rust implementation

#[target_feature(enable = "avx2")]
#[cfg(target_arch = "x86_64")]
pub unsafe fn fib_parallel_qfib_stride_6_avx2(n: u32) -> u64 {
    use std::arch::x86_64::*;

    let mut n = n;
    // for the inbetween cases where n%6 > 3 : i.e. 4 and 5
    // shift the fibstate for these cases
    let mut fibstate = if n % 6 > 3 {
        n -= 2;
        _mm_set_epi32(5, 3, 2, 1)
    } else {
        _mm_set_epi32(2, 1, 1, 0)
        // _mm_set_epi32(3, 2, 1, 1)
    };

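    // Per-lane left-shift amounts. A count of !0 (== -1, i.e. 0xFFFF_FFFF) is
    // greater than 31, so _mm_sllv_epi32 zeroes that lane.
    let nextstates: [__m128i; 4] = [
        _mm_set_epi32(2, 0, !0, !0),
        _mm_set_epi32(4, 0, 0, 1),
        _mm_set_epi32(4, 2, 2, 2),
        _mm_set_epi32(0, 3, 2, 0),
    ];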

    for _ in (0..(n / 6 * 6)).step_by(6) {
        let mut result = _mm_setzero_si128();
        // fibstate = _mm_alignr_epi8(fibstate, fibstate, 4);
        for nextstate in nextstates {
            let product = _mm_sllv_epi32(_mm_broadcastd_epi32(fibstate), nextstate);
            fibstate = _mm_alignr_epi8(_mm_setzero_si128(), fibstate, 4);
            result = _mm_add_epi32(result, product);
        }
        fibstate = result;
    }

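    // After the adjustment above, n % 6 is always in 0..=3, so it indexes a valid lane.
    let lanes: [u32; 4] = std::mem::transmute(fibstate);
    u64::from(lanes[(n % 6) as usize])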
}

edit:
The stride-6 version is slightly slower when calculating up to 43 (the max fib you can calculate on i32).
qFib (original): 11.5 ns
qFib (stride 6): 13 ns
I would guess it is because you have to load one more vector into a register.

cyborgdennett · 2023-10-20T22:36:14Z

> The issue I was encountering was that to make larger matrices, I needed to generalize the derivation in a way that kept things as powers of two, so that I could use bit-shifts again rather than a regular multiplication.

I found a way to find bigger matrices made up entirely of powers of two. It's not really a mathematical solution but more of a brute-force approach. But what I found does work, and I have gotten up to n+10 already.

I will leave this notebook here if you want to take a look:

The basis of this is that when you try to move all terms to the left side, where n-1 is at, a matrix will always have the following structure:

attempting to solve: 2
sifted [4, 0, 0, 1]
attempting to solve: 3
sifted [6, 1, 0, 1]
attempting to solve: 4
sifted [11, 0, 0, 0, 0, 1]
attempting to solve: 5
sifted [17, 1, 0, 1, 0, 1]
attempting to solve: 6
sifted [29, 0, 0, 0, 0, 0, 0, 1]
attempting to solve: 7
sifted [46, 1, 0, 1, 0, 1, 0, 1]
attempting to solve: 8
sifted [76, 0, 0, 0, 0, 0, 0, 0, 0, 1]

For even k, the formula yields F(n+k) -> x * F(n-1) + F(n-k-2)
For odd k, the formula is F(n+k) -> (x+1) * F(n-1) - F(n-k-2)
(here x is the first entry of the sifted row for k)

Using this trick, one can make matrices where every item is a power of 2, then sift to the left and see if it is a Fibonacci matrix or not.
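The sifted rows listed above can be checked numerically with a short Python sketch. It assumes each row multiplies [F(n-1), F(n-2), ...] from left to right and produces F(n+k); that interpretation is inferred from the numbers, not taken from the notebook. One observation that is not made in the thread: the leading values (x for even k, x+1 for odd k) coincide with the Lucas numbers L(k+1).

```python
def fib_list(count):
    """Return [F(0), F(1), ..., F(count - 1)]."""
    fibs = [0, 1]
    while len(fibs) < count:
        fibs.append(fibs[-1] + fibs[-2])
    return fibs[:count]


# "sifted" rows copied from the output above, keyed by k.
SIFTED = {
    2: [4, 0, 0, 1],
    3: [6, 1, 0, 1],
    4: [11, 0, 0, 0, 0, 1],
    5: [17, 1, 0, 1, 0, 1],
    6: [29, 0, 0, 0, 0, 0, 0, 1],
    7: [46, 1, 0, 1, 0, 1, 0, 1],
    8: [76, 0, 0, 0, 0, 0, 0, 0, 0, 1],
}

LUCAS = [2, 1, 3, 4, 7, 11, 18, 29, 47, 76]  # L(0) .. L(9)


def apply_row(row, fibs, n):
    """Dot the row with [F(n-1), F(n-2), ...] (assumed column order)."""
    return sum(c * fibs[n - 1 - i] for i, c in enumerate(row))


fibs = fib_list(60)
for k, row in SIFTED.items():
    lead = row[0] if k % 2 == 0 else row[0] + 1
    assert lead == LUCAS[k + 1]
    for n in range(k + 2, 40):
        # The full row reproduces F(n+k).
        assert apply_row(row, fibs, n) == fibs[n + k]
        # Two-term form: + for even k, - for odd k.
        sign = 1 if k % 2 == 0 else -1
        assert lead * fibs[n - 1] + sign * fibs[n - k - 2] == fibs[n + k]
```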

Edit: matrixes -> matrices

Wunkolo · 2023-10-21T17:22:04Z

Thanks for the valuable insight! Def might revisit this to see if there's some viability in making higher-precision Fibonacci calculations with SIMD. Keeping things as pow-2 bit-shifts means something like 128/256/512-bit fib-matrices can possibly be much easier to calculate within big SIMD registers such as AVX512 or SVE and possibly beat any LUT-based implementation!

cyborgdennett · 2023-10-24T15:53:07Z

Hey @Wunkolo I found one more thing that can really speed up the fib calculations!

That thing is negative fibonacci numbers.

All you have to do to the normal Fibonacci algorithm is check whether n is negative and divisible by 2. If so, negate the result: if n < 0 and n % 2 == 0: c = -c

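def fib(n):
  # c ends up as F(|n|); for n = 0 the loop does not run and c stays 0
  a, b, c = 1, 0, 0

  for _ in range(abs(n)):
    c = a + b
    a = b
    b = c
  # F(-n) = -F(n) when n is even, F(-n) = F(n) when n is odd
  if n < 0 and n % 2 == 0:
    c = -c
  return c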

How will this speed up calculations?

Take for example the sifted n+4 row.

n+4 [11, 0, 0, 0, 0, 1]

Without negative Fibonacci numbers, you would have to know the previous Fibonacci numbers up to where the 1 is, since negative fibs didn't exist. But with negative fibs we can do the following:

fibonacci seq

n=7 13     
n=6 8      
n=5 5      <- 11
n=4 3      <- 0
n=3 2      <- 0  
n=2 1      <- 0     previous situation
n=1 1      <- 0
n=0 0      <- 1    // this was the absolute lowest you could go.
n=-1 1
n=-2 -1 
n=-3 2
n=-4 -3
n=-5 5
n=-6 -8
n=-7 13

The new solution can start from anywhere.

n=7 13     
n=6 8      
n=5 5      
n=4 3      
n=3 2      <- 11  
n=2 1      <- 0    
n=1 1      <- 0
n=0 0      <- 0    new situation
n=-1 1     <- 0
n=-2 -1    <- 1   // You can go as low as you want, but this is optimal for fast fibonacci calculation.
n=-3 2     
n=-4 -3
n=-5 5
n=-6 -8
n=-7 13

So a new formula for fast Fibonacci calculation can be formulated, one whose step size is also exponential.

F(n) -> x * F(n/2) +/- F(n/2-1)
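One identity that is consistent with the diagrams above is F(m+k) = L(k) * F(m) + (-1)^(k+1) * F(m-k), where L(k) is a Lucas number. This is a related, known identity and not necessarily the exact formula meant above. The Python sketch below (the helper names `lucas` and `jump` are made up for illustration) checks it with negative indices, including the 11 * F(3) + F(-2) = F(8) example.

```python
def fib(n):
    """F(n) for any integer n, using F(-n) = (-1)**(n + 1) * F(n)."""
    a, b = 0, 1
    for _ in range(abs(n)):
        a, b = b, a + b
    if n < 0 and n % 2 == 0:
        a = -a
    return a


def lucas(k):
    """Lucas number L(k) = F(k - 1) + F(k + 1)."""
    return fib(k - 1) + fib(k + 1)


def jump(m, k):
    """Compute F(m + k) from F(m) and F(m - k); m - k may be negative."""
    sign = 1 if k % 2 else -1
    return lucas(k) * fib(m) + sign * fib(m - k)


# The "new situation" diagram: 11 * F(3) + 1 * F(-2) = F(8).
assert lucas(5) == 11 and fib(-2) == -1
assert jump(3, 5) == fib(8)

# Halving: with m = ceil(n / 2) and k = floor(n / 2), m - k is 0 or 1.
for n in range(1, 60):
    assert jump((n + 1) // 2, n // 2) == fib(n)
```

With the halving choice of m and k, the second term only ever needs F(0) or F(1), which gives the same overall shape as the formula above.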

Wunkolo · 2023-10-24T16:20:24Z

Awesome work!

> F(n) -> x * F(n/2) +/- F(n/2-1)

This looks somewhat similar to the derivation of Chun-Min Chang's fast-doubling method that I have been evaluating in the current benchmarks here:

qFib/tests/bench.cpp

Lines 156 to 165 in ebfafce

    
if( mask & n )
{ // n_j is odd: k = (n_j-1)/2 => n_j = 2k + 1
    a = d;        //   F(n_j) = F(2k + 1)
    b = c + d;    //   F(n_j + 1) = F(2k + 2) = F(2k) + F(2k + 1)
}
else
{ // n_j is even: k = n_j/2 => n_j = 2k
    a = c;        //   F(n_j) = F(2k)
    b = d;        //   F(n_j + 1) = F(2k + 1)
}

Particularly this bit looked kinda similar

if (n % 2) { // n is odd: F(n) = F(((n-1) / 2) + 1)^2 + F((n-1) / 2)^2
  unsigned int k = (n - 1) / 2;
  return fib(k) * fib(k) + fib(k + 1) * fib(k + 1);
} else { // n is even: F(n) = F(n/2) * [ 2 * F(n/2 + 1) - F(n/2) ]
  unsigned int k = n / 2;
  return fib(k) * (2 * fib(k + 1) - fib(k));
}

and identified some key relations here:

  if (n % 2) {
    k = (n - 1) / 2;
    fib_helper(k, f);
    uint64_t a = f[0];            // F(k) = F((n-1)/2)
    uint64_t b = f[1];            // F(k + 1) = F((n-1)/2 + 1)
    uint64_t c = a * (2 * b - a); // F(n-1) = F(2k) = F(k) * [2 * F(k + 1) - F(k)]
    uint64_t d = a * a + b * b;   // F(n) = F(2k + 1) = F(k)^2 + F(k+1)^2
    f[0] = d;                     // F(n)
    f[1] = c + d;                 // F(n+1) = F(n-1) + F(n)
  } else {
    k = n / 2;
    fib_helper(k, f);
    uint64_t a = f[0];            // F(k) = F(n/2)
    uint64_t b = f[1];            // F(k + 1) = F(n/2 + 1)
    f[0] = a * (2 * b - a);       // F(n) = F(2k) = F(k) * [2 * F(k + 1) - F(k)]
    f[1] = a * a + b * b;         // F(n + 1) = F(2k + 1) = F(k)^2 + F(k+1)^2
  }
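For anyone who wants to experiment with these relations outside of C, here is a compact Python sketch of the same fast-doubling identities. It is a plain recursive version written for illustration (the name `fib_pair` is made up) and is not the code from the benchmark or the blog post.

```python
def fib_pair(n):
    """Return (F(n), F(n + 1)) using the fast-doubling identities."""
    if n == 0:
        return 0, 1
    a, b = fib_pair(n // 2)   # a = F(k), b = F(k + 1), with k = n // 2
    c = a * (2 * b - a)       # F(2k)     = F(k) * [2 * F(k + 1) - F(k)]
    d = a * a + b * b         # F(2k + 1) = F(k)^2 + F(k + 1)^2
    if n % 2:
        return d, c + d       # n = 2k + 1: (F(2k + 1), F(2k + 2))
    return c, d               # n = 2k:     (F(2k), F(2k + 1))


print(fib_pair(10)[0])  # 55
```

Python integers do not overflow, so this version sidesteps the 64-bit limit that the `uint64_t` code runs into past F(93).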

Though rather than doubling to get F(2n), you've found the derivation of the current F(N)-term given by halving the previous terms. A SIMD-optimized version of this method would be very valuable. For the Even/Odd part, AVX512 has the ability to put the lowest bit of each 32-bit term into a mask-register with something like:

vprord xmm1, xmm1, 1   // rotate right by 1: bit 0 of each 32-bit lane moves to the top bit
vpmovd2m k1, xmm1      // k1[i] = top bit of lane i
// Use k1 to negate `F(n/2-1)` before adding

and has similar versions for 64-bit elements. Def check out the blog post!
