Performance considerations for NNX #4224

cgarciae · 2024-09-24T13:45:33Z

cgarciae
Sep 24, 2024
Maintainer

Currently nnx.jit traverses the object graph in Python. This is slow and primarily affects the small model regime, as the Python overhead starts to disappear as the model's width grows. To solve this in general, we will be developing a Rust extension called flaxlib (see first steps in #4196) to speedup some of the traversal logic in graph.py, similar to how JAX solved the same issue with jaxlib for standard pytrees.

Meanwhile, there is a pattern you can use to remove the python overhead using regular jax.jit + nnx.split / nnx.merge to stage out the traversal logic. Take this code that uses nnx.jit as an example:

from flax import nnx
import jax
import jax.numpy as jnp
import optax

class Model(nnx.Module):
  def __init__(self, din, dmid, dout, rngs: nnx.Rngs):
    self.linear = nnx.Linear(din, dmid, rngs=rngs)
    self.bn = nnx.BatchNorm(dmid, rngs=rngs)
    self.dropout = nnx.Dropout(0.2, rngs=rngs)
    self.linear_out = nnx.Linear(dmid, dout, rngs=rngs)

  def __call__(self, x):
    x = nnx.relu(self.dropout(self.bn(self.linear(x))))
    return self.linear_out(x)

model = Model(2, 64, 3, rngs=nnx.Rngs(0))  # eager initialization
optimizer = nnx.Optimizer(model, optax.adam(1e-3))  # reference sharing
metrics = nnx.MultiMetric(
  loss=nnx.metrics.Average('loss'),
)

@nnx.jit  # <== currently slow
def train_step(model, optimizer, metrics, x, y):
  def loss_fn(model):
    y_pred = model(x)  # call methods directly
    return ((y_pred - y) ** 2).mean()

  loss, grads = nnx.value_and_grad(loss_fn)(model)
  optimizer.update(grads)  # in-place updates
  metrics.update(loss=loss)

  return loss
  
for _ in range(1000):
  x, y = jnp.ones((32, 2)), jnp.zeros((32, 3))
  loss = train_step(model, optimizer, metrics, x, y)

To speed it up you can use nnx.split before starting the training loop to create a graphdef and state for the NNX objects which are fast to traverse, and then call merge + split inside the jax.jit-decorated function so they only run once during tracing:

model = Model(2, 64, 3, rngs=nnx.Rngs(0))  # eager initialization
optimizer = nnx.Optimizer(model, optax.adamw(1e-3))  # reference sharing
metrics = nnx.MultiMetric(
  loss=nnx.metrics.Average('loss'),
)
graphdef, state = nnx.split((model, optimizer, metrics))

@jax.jit  # regular JAX
def train_step(graphdef, state, x, y):
  model, optimizer, metrics = nnx.merge(graphdef, state)

  def loss_fn(model):
    y_pred = model(x)  # call methods directly
    return ((y_pred - y) ** 2).mean()

  loss, grads = nnx.value_and_grad(loss_fn)(model)
  optimizer.update(grads)  # in-place updates
  metrics.update(loss=loss)

  _, state = nnx.split((model, optimizer, metrics))
  return state, loss

for _ in range(1000):
  x, y = jnp.ones((32, 2)), jnp.zeros((32, 3))
  state, loss = train_step(graphdef, state, x, y)

nnx.update((model, optimizer, metrics), state)

After the training loop is done (or whenever need) nnx.update can be used to update model, optimizer, and metrics to a new state.

rademacher-p · 2024-10-09T14:56:00Z

rademacher-p
Oct 9, 2024

Thanks for posting this update, I remembered you commented on this performance consideration somewhere months back and couldn't find it. Been doing this split/merge w/ standard JAX transforms in my code just in case (and to stay as pure JAX as possible). If a PR goes through to address this, I'll try switching to the NNX transforms! 👍

0 replies

Tomas542 · 2024-10-11T20:35:24Z

Tomas542
Oct 11, 2024

Thanks for posting this! Is there a way to do update with metrics? Or only nnx.merge from #4045? Like

graphdef, state = nnx.split((model, optimizer, metrics))
...
nnx.update((model, optimizer, metrics), state)

Cause now it raises an error:

ValueError: Cannot set key 'count' on immutable node of type ScaleByAdamState

And also it will be great to create page with speed up tips for NNX API!

4 replies

cgarciae Oct 12, 2024
Maintainer Author

Ideally we need no tricks in the future :)
Can you post a minimal example with your issue?

Tomas542 Oct 14, 2024

Sure! It is similar to your code for MNIST from #4045 and nnx MNIST tutorial, but I'll sent it with iris for faster launch. We have imports and data init

# annotations for code readability
import typing as tp

# data visualization and split
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

# nn
import jax
from jax import numpy as jnp
from flax import nnx
from flax.typing import Array
import optax

iris = load_iris()
X = iris.data
y = iris.target

X_train, X_test, y_train, y_test = train_test_split(X, 
                                                    y, 
                                                    test_size=0.2, 
                                                    shuffle=True, 
                                                    random_state=42)

model class, initialization for metrics dict function and model, optimizer and metrics init

def _init_metrics_hisory() -> dict[str, Array]:
    """Creates metrics_history
    Example:
    >>> metrics = _init_metrics_hisory()
    >>> metrics
    {'train_loss': [], 'train_accuracy': [], 'test_loss': [], 'test_accuracy': []}
    """
class MLP(nnx.Module):
    """Simple MLP with variable hidden layers' size
    
    Args:
        - input_dim (int) - num of input features. Default: 60
        - hidden (tuple[int]) - optional sizes of hidden layers. Default: None
        - output_dim (int) - num of output features. Default: 10
        - rngs (nnx.Rngs) - random generator.
    
    Example:
    >>> import jax.numpy as jnp
    >>> from flax import nnx
    >>> model = MLP(input_dim=60, hidden=(5,), output_dim=10, rngs=nnx.Rngs(42))
    >>> y = jnp.ones(60)
    >>> model(y)
    Array([0.1900208], dtype=float32)
    """
    def __init__(self,
                 input_dim: int = 60,
                 hidden: tp.Optional[tuple[int]] = None, 
                 output_dim: int = 10,
                 rngs: tp.Optional[nnx.Rngs] = None) -> None:
        # init random generator for Jax
        if rngs is None:
            rngs = nnx.Rngs(42)
            
        layers = [] # store all layers

        # without hidden
        if hidden is None:
            layers.append(nnx.Linear(input_dim, output_dim, rngs=rngs))

        # init hidden layers with swish
        else:
            last_layer = input_dim # out_featuers of the last layer
            for hidden_layer in hidden:
                layers.append(nnx.Linear(in_features=last_layer, 
                                         out_features=hidden_layer, 
                                         rngs=rngs))
                layers.append(nnx.swish)
                # layers.append(nnx.Dropout(0.05, rngs=rngs))
                last_layer = hidden_layer
            layers.append(nnx.Linear(last_layer, output_dim, rngs=rngs))

        # unpack all layers to call them later
        self.nn = nnx.Sequential(*layers)
        del layers

    def __call__(self, x: Array) -> Array:
        x = self.nn(x)
        return x
        
        
    metrics_history = {
        'train_loss': [],
        'train_accuracy': [0],
        'test_loss': [],
        'test_accuracy': [0],
    }
    return metrics_history

model = MLP(input_dim=X_train.shape[1], hidden=(5,), output_dim=3, rngs=nnx.Rngs(42))
optimizer = nnx.Optimizer(model, optax.adamw(5e-3, 0.9))
metrics_history = _init_metrics_hisory()
metrics = nnx.MultiMetric(
    accuracy=nnx.metrics.Accuracy(),
    loss=nnx.metrics.Average('loss'),
)

then train and eval steps with CE loss

@nnx.jit
def loss_fn(model: MLP, 
            features: Array, 
            labels: Array) -> tuple[Array]:
    """Computes CE-Loss with optax. Returns loss and logits"""
    
    logits = nnx.vmap(model)(features)
    loss = jnp.mean(jax.vmap(
        optax.softmax_cross_entropy_with_integer_labels)
        (logits=logits, labels=labels)
    )
    return loss, logits


@nnx.jit
def train_step(graphdef: nnx.GraphDef, 
               state: nnx.GraphState,   
               features: Array, 
               labels: Array) -> nnx.GraphState:
    """Train for a single step."""
    model, optimizer, metrics = nnx.merge(graphdef, state)
    grad_fn = nnx.value_and_grad(loss_fn, has_aux=True)
    (loss, logits), grads = grad_fn(model, features, labels)
    metrics.update(loss=loss, logits=logits, labels=labels)
    optimizer.update(grads)
    _, state = nnx.split((model, optimizer, metrics))
    return state


@nnx.jit
def eval_step(graphdef: nnx.GraphDef, 
              state: nnx.GraphState,   
              features: Array,
              labels: Array) -> nnx.GraphState:
    """Eval for single step"""
    model, optimizer, metrics = nnx.merge(graphdef, state)
    loss, logits = loss_fn(model, features, labels)
    metrics.update(loss=loss, logits=logits, labels=labels)
    _, state = nnx.split((model, optimizer, metrics))
    return state

And training loop with batching

num_epochs = 5
batch_size = 10
for i in range(num_epochs):
    # train
    model.train()
    graphdef, state = nnx.split((model, optimizer, metrics))
    for j in range(0, len(X_train), batch_size):
        state = train_step(graphdef=graphdef,
                            state=state,
                            features=X_train[j:j+batch_size], 
                            labels=y_train[j:j+batch_size])

    model, optimizer, metrics = nnx.merge(graphdef, state)
    # nnx.update((model, optimizer, metrics), state) # ERROR

    # store train metrics
    for metric, value in metrics.compute().items():     
        metrics_history[f'train_{metric}'].append(value)
    metrics.reset()

    # eval
    model.eval()
    graphdef, state = nnx.split((model, optimizer, metrics))
    for j in range(0,len(X_test), batch_size):
        state = eval_step(graphdef=graphdef,
                            state=state,
                            features=X_test[j:j+batch_size], 
                            labels=y_test[j:j+batch_size])
    
    model, optimizer, metrics = nnx.merge(graphdef, state)
    # nnx.update((model, optimizer, metrics), state) # ERROR

    # store eval metrics
    for metric, value in metrics.compute().items():    
        metrics_history[f'test_{metric}'].append(value)
    metrics.reset() 

    print(
        f"[train] eposh: {i}, "
        f"loss: {metrics_history['train_loss'][-1]}, "
        f"accuracy: {metrics_history['train_accuracy'][-1] * 100}"
    )
    print(
        f"[test] epoch: {i}, "
        f"loss: {metrics_history['test_loss'][-1]}, "
        f"accuracy: {metrics_history['test_accuracy'][-1] * 100}"
    )

In the last piece coda before metrics storing is made with nnx.merge, and nnx.update raises an error.

The problem could be, that nnx.Optimizer it trying to share with metrics nnx.Average (Accuracy is built on top of average) and the line self.count.value += 1 if isinstance(values, (int, float)) else values.size in Average class trying to take values.size from state of optimizer.

cgarciae Oct 31, 2024
Maintainer Author

@Tomas542 sorry for the super late reply. I've updated the example above to use metrics but it worked fine. Can you try upgrading to the latest flax version?

Tomas542 Nov 3, 2024

@cgarciae Yeah, I've tested it with latest version and now it works, thank you!

rademacher-p · 2024-10-30T01:16:08Z

rademacher-p
Oct 30, 2024

@cgarciae As I mention above, I've been sticking to the split/merge + JAX transforms to future proof against any performance hits. However, I would consider switching to NNX transforms for my current dev if the expectation is that the Rust extension would definitively close the performance gap. Can you comment on the expected gains with flaxlib?

1 reply

cgarciae Oct 31, 2024
Maintainer Author

Hey @rademacher-p! Just to clarify, you only need to do this for the top-level jax.jit transform, inside the jitted function you can use NNX transforms without performance loss e.g. nnx.grad in the example.

Can you comment on the expected gains with flaxlib

Ideally the overhead becomes negligible. Similar to jaxlib, we could eventually specialize the traversal for certain objects such as Module, list, dict, etc, to minimize the Python footprint.

galah92 · 2024-12-22T17:38:27Z

galah92
Dec 22, 2024

@cgarciae in your example, at the end, metrics hold the latest value instead epoch values. Shouldn't you call nnx.update((model, optimizer, metrics), state) after each train_step?

2 replies

cgarciae Dec 23, 2024
Maintainer Author

Ideally not to save some compute.

galah92 Dec 23, 2024

Can you elaborate? What's the best practice here to correctly retrieve all the metrics?

kvablack · 2025-01-03T19:55:08Z

kvablack
Jan 3, 2025

Big fan of NNX!

I personally think there are reasons other than performance to use split/merge and standard JAX transforms. It's "closer to the metal," if you will -- once you understand the split/merge API and JAX's core APIs, you're empowered to do pretty much anything, with a little more boilerplate (holding on to the graphdef) which is not too bad in my opinion (especially since y'all have done such a great job with the static typing!). You can mix NNX's mutable reference semantics with JAX's pure functional semantics to write both convenient and bug-free code.

I worry that encouraging NNX transforms only, while sweeping split/merge under the rug, would be especially bad for newer JAX users. NNX transforms add a layer of abstraction that completely hides the underlying JAX abstractions, which may make it harder to pick up important concepts like tracing/staging out, PyTrees, sharding, etc. As a more experienced JAX user, I've definitely been finding split/merge with explicit state management more comfortable and legible. Another argument for encouraging this pattern is that, at least right now, you must understand split/merge and explicit state management to save and load checkpoints.

I realize not everyone will agree with me! My vote would be to document both split/merge and NNX transforms side-by-side as equivalent ways of doing things, even after flaxlib is complete. That way, even if people do want to use NNX transforms to save on boilerplate, they can still acquire a mental model of what is happening under the hood.

3 replies

8bitmp3 Jan 3, 2025
Maintainer

Thanks @kvablack !

... to document both split/merge and NNX transforms side-by-side as equivalent ways of doing things, even after flaxlib is complete. That way, even if people do want to use NNX transforms to save on boilerplate, they can still acquire a mental model of what is happening under the hood.

@cgarciae @IvyZX @levskaya WDYT

8bitmp3 Jan 3, 2025
Maintainer

fyi For flaxlib context, recent PR #4469 by @cgarciae

cgarciae Jan 4, 2025
Maintainer Author

Hey @kvablack, thanks for the feedback! I'm glad you like the Functional API (split/merge), some JAX contributors like it a lot.

I have two things I'd like to note regarding the preference for NNX transforms and our documentation strategy:

NNX Transforms have been modeled to match the API of JAX transforms (see JAX-style NNX Transforms) in hope that all the concepts like tracing/staging out, PyTrees, sharding, remain intact. In other words, you should be able to swap a JAX transform for its NNX counterpart and it should still work as is. Our goal is for users to think of NNX transforms as "The same JAX transforms but you can pass NNX objects", ideally this remains true.
Currently we are trying to follow the progressive disclosure of complexity principle by showing NNX transforms first since they provide automatic state management and should be easier to use for beginers, and then showing how to use the Functional API for more advanced use cases. You can see this in the ordering of the sections in Flax Basics. However, I do agree we should showcase the Functional API more as users need to use it for thinks like checkpointing or whenever they need to interact with JAX transforms we currently don't support or 3rd party libraries.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Performance considerations for NNX #4224

{{title}}

{{editor}}'s edit

{{editor}}'s edit

Replies: 5 comments 10 replies

{{title}}

{{title}}

{{editor}}'s edit

{{editor}}'s edit

{{title}}

{{title}}

{{editor}}'s edit

{{editor}}'s edit

{{title}}

{{title}}

{{title}}

{{title}}

{{title}}

{{title}}

{{title}}

{{title}}

{{title}}

{{title}}

{{title}}

{{editor}}'s edit

{{editor}}'s edit

Select a reply

Performance considerations for NNX #4224

cgarciae Sep 24, 2024 Maintainer

Replies: 5 comments · 10 replies

cgarciae Oct 12, 2024 Maintainer Author

cgarciae Oct 31, 2024 Maintainer Author

cgarciae Oct 31, 2024 Maintainer Author

cgarciae Dec 23, 2024 Maintainer Author

8bitmp3 Jan 3, 2025 Maintainer

8bitmp3 Jan 3, 2025 Maintainer

cgarciae Jan 4, 2025 Maintainer Author

cgarciae
Sep 24, 2024
Maintainer

Replies: 5 comments 10 replies

cgarciae Oct 12, 2024
Maintainer Author

cgarciae Oct 31, 2024
Maintainer Author

cgarciae Oct 31, 2024
Maintainer Author

cgarciae Dec 23, 2024
Maintainer Author

8bitmp3 Jan 3, 2025
Maintainer

8bitmp3 Jan 3, 2025
Maintainer

cgarciae Jan 4, 2025
Maintainer Author