JAX/Flax models 2x slower on Sapphire Rapids (c7i) than Ice Lake (c6i) instances | x86 #23296
-
Problem
Flax models run up to 2x slower on the latest c7i EC2 instances (Sapphire Rapids) than on c6i instances (Ice Lake).

Steps to repro:
Run the Flax MLP script (see Code below) on a c7i and a c6i instance and compare latencies.

Result
The latency of the script was up to 2x higher on c7i than on c6i. Similar results were seen using Flax models such as bert-base-uncased from Hugging Face.

Questions
PyTorch has a blog post claiming that AMX is auto-picked if available, and that it improves performance. Does XLA:CPU pick AMX automatically on Sapphire Rapids as well?

Code
Flax MLP
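A minimal sketch of such a Flax MLP benchmark, under stated assumptions (the layer widths, batch size, and iteration count are illustrative, not the reporter's exact values):

```python
import time

import jax
import jax.numpy as jnp
import flax.linen as nn


class MLP(nn.Module):
    """Simple MLP; the layer widths here are illustrative assumptions."""
    features: tuple = (1024, 1024, 1024)

    @nn.compact
    def __call__(self, x):
        for feat in self.features[:-1]:
            x = nn.relu(nn.Dense(feat)(x))
        return nn.Dense(self.features[-1])(x)


model = MLP()
x = jnp.ones((256, 1024), dtype=jnp.float32)
params = model.init(jax.random.PRNGKey(0), x)

apply_fn = jax.jit(model.apply)
apply_fn(params, x).block_until_ready()  # compile once, outside the timed loop

iters = 100
start = time.perf_counter()
for _ in range(iters):
    apply_fn(params, x).block_until_ready()
print(f"mean latency: {(time.perf_counter() - start) / iters * 1e3:.2f} ms")
```

The loop calls block_until_ready so that JAX's asynchronous dispatch does not hide the actual compute latency.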
-
XLA:CPU supports AMX in contraction ops through custom calls to oneDNN. We have recently transitioned to a new runtime which doesn't support these oneDNN custom calls yet (support coming soon in 1-2 weeks). In the meantime, the old runtime supports these custom calls and can use AMX; setting the XLA_FLAGS environment variable to opt back into the old runtime should restore AMX usage.
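A sketch of opting back into the old runtime from Python; the exact flag name below is an assumption to verify against your XLA version, and XLA_FLAGS must be set before jax is first imported:

```python
import os

# Assumed flag for disabling the new thunk runtime; verify the exact name
# against your XLA version. Must be set before jax is imported.
os.environ["XLA_FLAGS"] = "--xla_cpu_use_thunk_runtime=false"

import jax.numpy as jnp

# With the old runtime, contractions like this can dispatch to the oneDNN
# custom calls (and AMX on Sapphire Rapids).
a = jnp.ones((1024, 1024), dtype=jnp.bfloat16)
print((a @ a).block_until_ready().dtype)
```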
-
cc: @agramesh1 @TensorFlow-MKL (Intel oneDNN-XLA integration team)
-
@Rohanjames1997 @penpornk We have tested the code on both c6i.4xlarge and c7i.4xlarge EC2 instances, with the XLA_FLAGS environment variable set as described above.

c7i.4xlarge (Sapphire Rapids)

c6i.4xlarge (Ice Lake)

The performance difference between Sapphire Rapids and Ice Lake for float32 numerics can be attributed to the higher frequency of Ice Lake. Code using bfloat16 numerics benefits from AMX on Sapphire Rapids; see the bert-base-uncased measurements below.
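As a hedged illustration of that distinction (float32 throughput tracks core frequency, while bfloat16 contractions can use AMX), a small matmul microbenchmark; the shapes and iteration count are arbitrary choices, not the values measured above:

```python
import time

import jax
import jax.numpy as jnp


def bench(dtype, n=4096, iters=20):
    # Illustrative square-matmul benchmark, not the exact code measured above.
    a = jax.random.normal(jax.random.PRNGKey(0), (n, n)).astype(dtype)
    f = jax.jit(lambda x: x @ x)
    f(a).block_until_ready()  # compile once, outside the timed loop
    start = time.perf_counter()
    for _ in range(iters):
        f(a).block_until_ready()
    return (time.perf_counter() - start) / iters


for dtype in (jnp.float32, jnp.bfloat16):
    print(dtype.__name__, f"{bench(dtype) * 1e3:.1f} ms")
```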
-
@Rohanjames1997 @penpornk We also measured the Hugging Face bert-base-uncased model and observed improved performance with bfloat16 numerics, using jax.numpy.bfloat16 directly. Example code is added at the end.

c7i.4xlarge (Sapphire Rapids)

c6i.4xlarge (Ice Lake)
-
@Rohanjames1997 the recommended way to use AMX on Sapphire Rapids in JAX/Flax is to use the bfloat16 datatype, as @mdfaijul has shown. You can also set DNNL_DEFAULT_FPMATH_MODE=BF16, but it will not give you the full benefit of AMX on Sapphire Rapids.
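A short sketch contrasting the two options; DNNL_DEFAULT_FPMATH_MODE is the oneDNN variable named above, and whether it takes effect depends on the contraction actually dispatching to oneDNN:

```python
import os

# Option 1 (partial benefit): let oneDNN downconvert float32 contractions to
# bfloat16 internally. Must be set before the CPU backend initializes.
os.environ["DNNL_DEFAULT_FPMATH_MODE"] = "BF16"

import jax.numpy as jnp

# Option 2 (recommended): compute in bfloat16 directly so that matmuls are
# bf16 x bf16 and can map onto the AMX tile instructions.
a = jnp.ones((2048, 2048), dtype=jnp.bfloat16)
b = jnp.ones((2048, 2048), dtype=jnp.bfloat16)
print((a @ b).dtype)  # bfloat16
```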
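The example code referenced above ("added at the end") would look roughly like the sketch below, using the transformers Flax classes; the batch size, sequence length, and timing loop are assumptions, not the exact attached script:

```python
import time

import jax
import jax.numpy as jnp
from transformers import AutoTokenizer, FlaxBertModel

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
# dtype=jnp.bfloat16 runs the forward computation in bfloat16; to_bf16 also
# casts the weights, so matmuls are bf16 x bf16 and can use AMX.
model = FlaxBertModel.from_pretrained("bert-base-uncased", dtype=jnp.bfloat16)
model.params = model.to_bf16(model.params)

inputs = tokenizer(
    ["a short benchmark sentence"] * 8,  # batch size is an assumption
    padding="max_length",
    max_length=128,
    return_tensors="np",
)


@jax.jit
def forward(params, input_ids, attention_mask):
    return model(
        input_ids=input_ids, attention_mask=attention_mask, params=params
    ).last_hidden_state


# Compile once, outside the timed loop.
forward(model.params, inputs["input_ids"], inputs["attention_mask"]).block_until_ready()

iters = 50
start = time.perf_counter()
for _ in range(iters):
    forward(
        model.params, inputs["input_ids"], inputs["attention_mask"]
    ).block_until_ready()
print(f"mean latency: {(time.perf_counter() - start) / iters * 1e3:.2f} ms")
```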