Fortran Fast math

A collection of functions for fast number crunching using Fortran.

To get the best performance from this library, compile with "-O3 -march=native" (or the equivalent flags for your compiler).

Available functions

| function | name(s) | shapes | types |
|---|---|---|---|
| sum | fsum, fsum_kahan (1) | 1d | real32, real64 |
| dot | fprod, fprod_kahan (2) | 1d | real32, real64 |
| cos | fcos | elemental | real32, real64 |
| sin | fsin | elemental | real32, real64 |
| tan | ftan | elemental | real32, real64 |
| tanh | ftanh | elemental | real32, real64 |
| acos | facos | elemental | real32, real64 |
| atan | fatan | elemental | real32, real64 |
| erf | ferf | elemental | real32, real64 |
| log | flog_p3, flog_p5 | elemental | real64 |
| rsqrt (3) | frsqrt | elemental | real32, real64 |

  • (1) Fast (and precise) sum for 1D arrays, with an optional mask. fsum: the fastest method; its precision is at worst equal to that of the intrinsic sum and can be about 1 order of magnitude better. It accumulates chunks of values into a temporary working batch, which is summed once at the end. fsum_kahan: the highest precision, close to that of a sum accumulated in the next higher precision (real64 for real32 input, real128 for real64 input). It uses the same chunking principle, with an elemental Kahan operator applied on top.

  • (2) Fast (and precise) dot product for 1D arrays, with an optional third weighting array. fprod: the fastest method and, at worst, 1 order of magnitude more precise than the intrinsic dot_product. It can run between 3x and 8x faster than the intrinsic. It accumulates chunks of products into a temporary working batch, which is summed once at the end (based on fsum). fprod_kahan: the same idea as fsum_kahan, applied on top of chunked products.

  • (3) rsqrt: reciprocal square root $f(x) = 1/\sqrt{x}$
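
The chunking and Kahan ideas described in notes (1) and (2) can be illustrated with a minimal sketch. This is not the library's source: the module name, the procedure names `chunked_sum` and `chunked_kahan_sum`, and the batch width of 64 are all assumptions made for illustration, and the optional mask is left out. Only real32 is shown.

```fortran
module chunked_sum_sketch
    use iso_fortran_env, only: sp => real32
    implicit none
    private
    public :: chunked_sum, chunked_kahan_sum

    ! Hypothetical batch width chosen for illustration; not the library's value.
    integer, parameter :: chunk = 64

contains

    !> Sum a 1D array through `chunk` independent partial sums.
    pure function chunked_sum(a) result(s)
        real(sp), intent(in) :: a(:)
        real(sp) :: s
        real(sp) :: batch(chunk)
        integer :: i, rem

        rem = mod(size(a), chunk)
        batch = 0.0_sp
        ! The leftover elements seed the first `rem` slots of the batch.
        batch(1:rem) = a(1:rem)
        ! Each slot accumulates independently, so this update can vectorize.
        do i = rem + 1, size(a), chunk
            batch = batch + a(i:i+chunk-1)
        end do
        ! The batch is reduced only once, at the end.
        s = sum(batch)
    end function chunked_sum

    !> Same chunking, with a Kahan compensation term kept per slot.
    pure function chunked_kahan_sum(a) result(s)
        real(sp), intent(in) :: a(:)
        real(sp) :: s
        real(sp) :: batch(chunk), comp(chunk), c
        integer :: i, rem

        rem = mod(size(a), chunk)
        batch = 0.0_sp
        comp = 0.0_sp
        batch(1:rem) = a(1:rem)
        do i = rem + 1, size(a), chunk
            call kahan_add(batch, comp, a(i:i+chunk-1))
        end do
        ! Final reduction: fold in each slot and its pending correction.
        s = 0.0_sp
        c = 0.0_sp
        do i = 1, chunk
            call kahan_add(s, c, batch(i))
            call kahan_add(s, c, -comp(i))
        end do
    end function chunked_kahan_sum

    !> One compensated (Kahan) update, usable on scalars or whole batches.
    elemental subroutine kahan_add(total, comp, x)
        real(sp), intent(inout) :: total, comp
        real(sp), intent(in) :: x
        real(sp) :: y, t

        y = x - comp            ! apply the correction carried so far
        t = total + y
        comp = (t - total) - y  ! rounding error of this addition
        total = t
    end subroutine kahan_add

end module chunked_sum_sketch
```

Both functions are called like the intrinsic, for example `chunked_sum(x)` on a `real(sp)` array `x`. The compensation step only survives if the compiler does not reassociate floating-point operations, so flags such as gfortran's `-ffast-math` would defeat the Kahan variant in this sketch.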

API documentation

To generate the API documentation for fast_math using ford, run the following command:

ford ford.yml

TODO

  • Contribution guidelines
  • Polish autodoc

Elapsed time examples and precision

Warning: The following values are only indicative and show how much results can differ between compilers. Actual speed-ups (or slow-downs) should be measured under real use conditions to account for inlining (or the lack of it) and other effects.
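
For measuring under your own conditions, a small harness along the following lines may help. It is a sketch, not the project's test suite: the module and procedure names are hypothetical, "ns/eval" is assumed to mean elapsed time divided by the number of array elements processed, and the reference value is assumed to be nonzero. The timed function is passed as a procedure argument, which typically prevents inlining and therefore mimics the "no inlining" case.

```fortran
module bench_sketch
    use iso_fortran_env, only: sp => real32, dp => real64, int64
    implicit none
    private
    public :: sum_fn, intrinsic_sum, relative_error, time_sum

    ! Hypothetical interface for any real32 1D summation routine.
    abstract interface
        function sum_fn(a) result(s)
            import :: sp
            real(sp), intent(in) :: a(:)
            real(sp) :: s
        end function sum_fn
    end interface

contains

    !> Wrapper so the intrinsic can be passed as an actual argument.
    function intrinsic_sum(a) result(s)
        real(sp), intent(in) :: a(:)
        real(sp) :: s
        s = sum(a)
    end function intrinsic_sum

    !> |val - reference| / |reference|; assumes reference is nonzero.
    pure function relative_error(val, reference) result(err)
        real(dp), intent(in) :: val, reference
        real(dp) :: err
        err = abs(val - reference) / abs(reference)
    end function relative_error

    !> Time `repeats` calls of f(x) and compare the result with a real64 sum.
    subroutine time_sum(f, x, repeats, ns_per_element, rel_err)
        procedure(sum_fn) :: f
        real(sp), intent(in) :: x(:)
        integer, intent(in) :: repeats
        real(dp), intent(out) :: ns_per_element, rel_err
        integer(int64) :: t0, t1, rate
        real(sp) :: s
        integer :: k

        s = 0.0_sp
        call system_clock(count=t0, count_rate=rate)
        do k = 1, repeats
            s = f(x)
        end do
        call system_clock(count=t1)

        ! Elapsed nanoseconds divided by the total number of elements summed.
        ns_per_element = 1.0e9_dp * real(t1 - t0, dp) / real(rate, dp) &
                         / (real(repeats, dp) * real(size(x), dp))
        ! The reference is the same data accumulated in real64.
        rel_err = relative_error(real(s, dp), sum(real(x, dp)))
    end subroutine time_sum

end module bench_sketch
```

A call such as `call time_sum(intrinsic_sum, x, 1000, ns, err)` gives a baseline; passing a wrapper around another summation routine with the same interface gives the figure to compare against it. The speed-up is then the ratio of the two `ns` values.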

Windows gfortran 14.1
> fpm test --flag "-O3 -march=native -mtune=native"
CPU: Intel(R) Core(TM) i7-8565U CPU @ 1.80GHz 1.99 GHz

sum r32 [ns/eval] Speed-Up relative error
intrinsic 1.2100 1.00 3.3794E-06
kahan 0.1800 6.72 1.0425E-07
chunk 0.1100 11.00 1.1265E-07
sum r64 [ns/eval] Speed-Up relative error
intrinsic 1.3000 1.00 5.9269E-15
kahan 0.3100 4.19 1.7286E-16
chunk 0.1500 8.67 2.1416E-16
sum r32 mask [ns/eval] Speed-Up relative error
intrinsic 4.1250 1.00 1.5687E-06
kahan 0.1600 25.78 9.1493E-08
chunk 0.1600 25.78 8.8453E-08
sum r64 mask [ns/eval] Speed-Up relative error
intrinsic 4.0350 1.00 2.9428E-15
kahan 0.3750 10.76 1.2179E-16
chunk 0.2450 16.47 1.2768E-16
dot r32 [ns/eval] Speed-Up relative error
intrinsic 1.0600 1.00 3.2735E-06
kahan 0.1500 7.07 9.8348E-08
chunk 0.1000 10.60 1.1587E-07
dot r64 [ns/eval] Speed-Up relative error
intrinsic 1.2100 1.00 5.8091E-15
kahan 0.3300 3.67 1.8407E-16
chunk 0.2000 6.05 2.0528E-16
trigo [ns/eval] Speed-Up relative error
fsin r32 2.8840 13.82 3.4749E-07
fsin r64 3.1040 12.17 4.0784E-16
facos r32 1.6600 28.64 2.9135E-05
facos r64 1.6800 6.89 2.9274E-14
fatan r32 1.6720 23.36 1.7730E-06
fatan r64 2.5120 3.94 6.6869E-06
hyperb [ns/eval] Speed-Up relative error
ftanh r32 2.1640 8.61 5.9480E-08
ftanh r64 2.3480 7.16 1.3282E-09
ferf r32 2.3600 27.21 7.9573E-08
ferf r64 4.1200 15.60 9.6298E-08
rsqrt [ns/eval] Speed-Up relative error
frsqrt r32 1.7720 0.26 9.4039E-04
frsqrt r64 2.2280 0.64 8.9297E-04

Windows ifx 2025.0.4
> fpm test --flag "/O3 /Qxhost"
CPU: Intel(R) Core(TM) i7-8565U CPU @ 1.80GHz 1.99 GHz

sum r32 [ns/eval] Speed-Up relative error
intrinsic 0.4300 1.00 3.8308E-07
kahan 0.1700 2.53 6.0938E-08
chunk 0.0100 43.00 6.0938E-08
sum r64 [ns/eval] Speed-Up relative error
intrinsic 0.3500 1.00 1.5061E-15
kahan 0.1800 1.94 1.3033E-16
chunk 0.0200 17.50 1.3886E-16
sum r32 mask [ns/eval] Speed-Up relative error
intrinsic 0.3000 1.00 2.0369E-07
kahan 0.2200 1.36 5.2360E-08
chunk 0.1750 1.71 5.2515E-08
sum r64 mask [ns/eval] Speed-Up relative error
intrinsic 0.3500 1.00 3.7423E-16
kahan 0.2900 1.21 8.3862E-17
chunk 0.2800 1.25 9.4422E-17
dot r32 [ns/eval] Speed-Up relative error
intrinsic 0.3400 1.00 3.9539E-07
kahan 0.1600 2.12 6.7639E-08
chunk 0.1600 2.12 6.6906E-08
dot r64 [ns/eval] Speed-Up relative error
intrinsic 0.7100 1.00 1.4730E-15
kahan 0.1500 4.73 1.2270E-16
chunk 0.1700 4.18 1.2459E-16
trigo [ns/eval] Speed-Up relative error
fsin r32 3.0960 0.26 2.0412E-08
fsin r64 2.7080 1.01 3.5190E-17
facos r32 1.6440 0.46 1.3946E-05
facos r64 1.7560 1.51 2.0708E-11
fatan r32 2.6880 0.28 4.4950E-06
fatan r64 1.9000 1.73 6.6869E-06
hyperb [ns/eval] Speed-Up relative error
ftanh r32 2.3200 0.48 1.0284E-08
ftanh r64 2.3080 2.19 1.3282E-09
ferf r32 3.3160 0.23 7.5974E-07
ferf r64 2.9760 0.89 9.6298E-08
rsqrt [ns/eval] Speed-Up relative error
frsqrt r32 1.7280 0.21 9.4033E-04
frsqrt r64 1.6520 0.90 8.7360E-04

WSL2 nvfortran 24.3
> fpm test --flag "-Mpreprocess -fast"
CPU: Intel(R) Core(TM) i7-8565U CPU @ 1.80GHz 1.99 GHz

sum r32 [ns/eval] Speed-Up relative error
intrinsic 0.2100 1.00 1.1295E-07
kahan 0.3200 0.66 9.8169E-08
chunk 0.1400 1.50 7.1764E-08
sum r64 [ns/eval] Speed-Up relative error
intrinsic 0.3300 1.00 3.8969E-16
kahan 0.3200 1.03 1.8086E-16
chunk 0.2200 1.50 9.0372E-17
sum r32 mask [ns/eval] Speed-Up relative error
intrinsic 0.2400 1.00 2.0742E-07
kahan 0.3050 0.79 8.9645E-08
chunk 0.1550 1.55 5.8651E-08
sum r64 mask [ns/eval] Speed-Up relative error
intrinsic 0.4150 1.00 3.8136E-16
kahan 0.5000 0.83 1.2734E-16
chunk 0.2850 1.46 2.4869E-17
dot r32 [ns/eval] Speed-Up relative error
intrinsic 0.2500 1.00 1.1426E-07
kahan 0.2600 0.96 9.7811E-08
chunk 0.1400 1.79 7.2122E-08
dot r64 [ns/eval] Speed-Up relative error
intrinsic 0.2600 1.00 3.9246E-16
kahan 0.3800 0.68 1.9229E-16
chunk 0.1900 1.37 9.0927E-17
trigo [ns/eval] Speed-Up relative error
fsin r32 0.0600 190.80 1.0325E-07
fsin r64 0.0320 357.25 5.0118E-17
facos r32 0.0280 221.43 1.0563E-06
facos r64 0.0160 546.75 3.7996E-15
fatan r32 0.0240 300.50 5.4993E-06
fatan r64 0.0400 244.40 6.6869E-06
hyperb [ns/eval] Speed-Up relative error
ftanh r32 0.0280 510.71 5.5308E-08
ftanh r64 0.0360 348.56 1.3282E-09
ferf r32 0.0400 496.90 9.1205E-08
ferf r64 0.0360 532.44 9.6298E-08
rsqrt [ns/eval] Speed-Up relative error
frsqrt r32 16.3120 0.03 9.4387E-04
frsqrt r64 16.7680 0.11 8.6745E-04

Acknowledgement

  • Compilation of this library was possible thanks to Transvalor S.A. research activities.
  • Part of this library is based on the work of Perini and Reitz, which was funded through the Sandia National Laboratories by the U.S. Department of Energy, Office of Vehicle Technologies, program managers Leo Breton and Gurpreet Singh.
  • Discussions in the fortran-lang community, such as "Some Intrinsic SUMS" and "fastGPT".

Contribution of open-source developers:

jalvesz

perazz