
Conversation

@anschaible
Collaborator

When calculating the datacube for many stellar particles (on the order of hundreds of thousands), we ran into memory issues: the final datacube contained negative flux in many spaxels.

We thought this was solved by switching from pmap to shard_map, but the error still occurred. Our suspicion was an over/underflow at the point where the pipeline assigns the spectra to the stellar particles: rubix held all spectra for the individual particles in memory at once, which led to a memory spike, and on GPUs the code even failed because it ran out of memory.

Therefore this branch changes the spectra assignment and the datacube calculation. Rubix now looks up the spectrum for one particle, mass-weights it, Doppler-shifts it, resamples it, and then adds the spectrum to the datacube at the spaxel given by the spaxel_assignment, using lax.scan over all particles. Testing on the MaStar SSP template, this removes our issue with negative flux, and the computation time does not increase with this method. For a comparison see the notebook rubix_pipeline_single_function_shard_map_memory.ipynb.
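For illustration, a minimal sketch of this particle-wise accumulation; the helpers `lookup_spectrum` and `doppler_shift_and_resample` as well as the argument names are hypothetical placeholders, not the actual rubix API:

```python
import jax.numpy as jnp
from jax import lax

def calc_datacube_particlewise(masses, velocities, spaxel_assignment,
                               lookup_spectrum, doppler_shift_and_resample,
                               n_spaxels, n_wave):
    """Accumulate one spectrum at a time into the datacube via lax.scan."""
    init_cube = jnp.zeros((n_spaxels, n_wave))

    def body(cube, i):
        spec = lookup_spectrum(i)                               # SSP template lookup
        spec = spec * masses[i]                                 # mass weighting
        spec = doppler_shift_and_resample(spec, velocities[i])  # shift + resample
        cube = cube.at[spaxel_assignment[i]].add(spec)          # add to assigned spaxel
        return cube, None

    cube, _ = lax.scan(body, init_cube, jnp.arange(masses.shape[0]))
    return cube
```

Because only one spectrum is materialised per scan step instead of one per particle, the peak memory stays flat while lax.scan keeps the whole loop inside a single compiled computation.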

I am opening the pull request already to get feedback at an early stage of this major change in the code structure. Things still to do on this branch before we merge it into main:

  • test whether the memory issue is also resolved for GPUs (so far only tested on CPUs)
  • fsps is still behaving strangely, even for very few particles (e.g. 100); look into the template to see what is going wrong there
  • fix the pytests
  • clean up the code by removing old functions once we have agreed on a code version

Left: old method; right: new method with lax.scan, using the MaStar template.


anschaible and others added 30 commits March 26, 2025 18:00
…, padding the input data; only the typechecking for the RubixData class has to be commented out because it conflicts with NamedSharding; now tested on single and multiple GPUs
…, when directly adding to the cube, but hopefully more memory efficient; will be tested as soon as jarvis is back online
Collaborator

@TobiBu left a comment

there is still some cleaning to be done.

Collaborator

do we need this just for debugging and can it be deleted once we are about to merge? Or shall it stay here?

Collaborator Author

true, we can delete this

Collaborator

same here. is this only for testing during development or shall it stay forever?

Collaborator Author

yes, can be deleted

args: []
kwargs: {}

calc_ifu_memory:
Collaborator

will this be called calc_ifu_memory forever or do we rename once we are done developing this feature?

Collaborator Author

Maybe we should rename it, otherwise it could be confusing. I will delete the original calc_ifu config, and then we also have to change it in the notebooks.

density: "rho"
temperature: "temp"
metallicity: "metals"
metals: "metals"
Collaborator

we have metals and metallicity. what's the difference? there should be none, right?

Collaborator Author

That is how it is in the config; metallicity should probably be removed? metals is, however, set by hand again in input_handler/pynbody, but yes, we should take another look at whether metallicity could be dropped entirely.

else:
representationString.append(f"{k}: None")
return "\n\t".join(representationString)
# def __repr__(self):
Collaborator

Shall those commented lines stay, or can they be removed? I think this was from William's experiments, right?

Collaborator Author

Carrying this through the pipeline does not work with the sharding; I will remove it, as it is commented out anyway.

Comment on lines 285 to 301
# if the particle number is not divisible by the device number, we have to pad
# a few empty particles to make it work
# this is a bit of a hack, but it works
n = inputdata.stars.coords.shape[0]
pad = (num_devices - (n % num_devices)) % num_devices

if pad:
    # pad along the first axis with empty particles
    inputdata.stars.coords = jnp.pad(inputdata.stars.coords, ((0, pad), (0, 0)))
    inputdata.stars.velocity = jnp.pad(
        inputdata.stars.velocity, ((0, pad), (0, 0))
    )
    inputdata.stars.mass = jnp.pad(inputdata.stars.mass, (0, pad))
    inputdata.stars.age = jnp.pad(inputdata.stars.age, (0, pad))
    inputdata.stars.metallicity = jnp.pad(
        inputdata.stars.metallicity, (0, pad)
    )
Collaborator

are we gonna implement this?

local_cube = out_local.stars.datacube  # shape (25, 25, 5994)
# in-XLA all-reduce across the "data" axis:
summed_cube = lax.psum(local_cube, axis_name="data")
return summed_cube  # replicated on each device
Collaborator

do we really have to replicate on each device???

Collaborator Author

Not sure how it works otherwise. Feel free to change it; this was the first thing that worked.

Collaborator

I don't quite get the logic behind this. Wouldn't this allocate a lot of memory on each device? What happens on the devices afterwards that would need the result on all devices? Could this be a source of memory issues?

I would propose to refactor the run_sharded function to support a more flexible approach here.
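For context, a minimal sketch of the pattern in question (the mesh name, `compute_local_cube` and `particles` are hypothetical placeholders): with `out_specs=P()` the `psum` result is replicated, so every device ends up holding a full copy of the summed cube.

```python
import numpy as np
import jax
from jax import lax
from jax.sharding import Mesh, PartitionSpec as P
from jax.experimental.shard_map import shard_map

mesh = Mesh(np.array(jax.devices()), axis_names=("data",))

def per_device(particles_shard):
    local_cube = compute_local_cube(particles_shard)  # hypothetical per-shard cube
    # all-reduce over the device axis; together with out_specs=P() this
    # leaves an identical full cube on every device
    return lax.psum(local_cube, axis_name="data")

cube = shard_map(per_device, mesh, in_specs=P("data"), out_specs=P())(particles)
```

If the full copy per device turns out to be the memory problem, something like `lax.psum_scatter` combined with a sharded `out_specs` could keep only a slice of the cube on each device, but that would need the refactor discussed here.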

Collaborator

did this refactor happen or shall we open an issue for this?

getattr(self.sim, cls), fields[cls], units[cls], cls
)

# for cls in self.data:
Collaborator

do we still need all those commented lines?

Collaborator Author

removed it

print("Sample_inputs:")
for key in sample_inputs:
sample_inputs[key] = reshape_array(sample_inputs[key])
# sample_inputs[key] = reshape_array(sample_inputs[key])
Collaborator

do we need this commented line?

Collaborator Author

no, removed it

@anschaible linked an issue Jul 4, 2025 that may be closed by this pull request
@MaHaWo
Collaborator

MaHaWo commented Jul 17, 2025

could we maybe move the notebooks into a separate PR to reduce the size a little?

MaHaWo previously requested changes Jul 22, 2025

Collaborator

@MaHaWo left a comment

A few comments here and there, but I would prefer if the run_sharded function could be refactored to allow passing a device configuration and perhaps an output device. Maybe the actual sharding could happen in a separate function too, to structure the code a little better.
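A rough sketch of what such a signature could look like; everything here is a hypothetical proposal, not existing rubix code:

```python
import numpy as np
import jax
from jax.sharding import Mesh, PartitionSpec as P
from jax.experimental.shard_map import shard_map

def run_sharded(pipeline_fn, inputdata, devices=None, out_specs=P()):
    """Run pipeline_fn sharded over the given devices.

    devices   -- which devices to use (defaults to all available ones)
    out_specs -- desired layout of the result, e.g. P() for replicated
    """
    devices = jax.devices() if devices is None else devices
    mesh = Mesh(np.array(devices), axis_names=("data",))
    sharded_fn = shard_map(pipeline_fn, mesh, in_specs=P("data"), out_specs=out_specs)
    return sharded_fn(inputdata)
```

The caller would then decide both where the computation runs and whether the final cube is replicated or sharded.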

@@ -0,0 +1,114 @@
import os
Collaborator

I wouldn't put test code like this file into the code base.

"source": [
"# NBVAL_SKIP\n",
"#import os\n",
"# os.environ['SPS_HOME'] = '/mnt/storage/annalena_data/sps_fsps'\n",
Collaborator

Maybe don't put hardcoded paths anywhere; this is not executable by anyone other than yourself. Rather leave it open and explain what to do here, perhaps?
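As a concrete illustration of the suggestion (assuming the cell needs `SPS_HOME`, as the commented line suggests), something like this could replace the hardcoded path:

```python
import os

# hypothetical sketch: require the user to point SPS_HOME at their own template directory
if "SPS_HOME" not in os.environ:
    raise RuntimeError(
        "Please set the SPS_HOME environment variable to your local SPS/FSPS template directory."
    )
```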

@@ -0,0 +1,377 @@
{
Collaborator

This looks to me a bit like it is experimental? In general, code that just tests functionality or exists for experimenting is not something that should be merged into main, imho. Rather, put it somewhere else or remove it again before the final version of the branch is merged.

"met = inputdata.stars.metallicity\n",
"factor = 1\n",
"inputdata.stars.coords = jnp.concatenate([coords]*factor, axis=0)\n",
"inputdata.stars.velocity = jnp.concatenate([vel]*factor, axis=0)\n",
Collaborator

Dummy data? Maybe add a markdown cell to make the purpose clear?

"source": [
"# NBVAL_SKIP\n",
"import jax.numpy as jnp\n",
"gpu_number = jnp.array([1, 2, 3, 4, 5, 6, 7])\n",
Collaborator

What are all these hardcoded numbers? If these are performance experiments, they should not go here...

pynbody.analysis.angmom.faceon(halo.s)
ang_mom_vec = pynbody.analysis.angmom.ang_mom_vec(halo.s)
rotation_matrix = pynbody.analysis.angmom.calc_sideon_matrix(ang_mom_vec)
if not os.path.exists("./data"):
Collaborator

is this path general/robust enough? In general, it would be better to avoid hardcoding paths. Rather draw them from an environment variable or config ...

def handler_with_mock_data(mock_simulation, mock_config):
"""
with patch("pynbody.load", return_value=mock_simulation):
with patch("pynbody.analysis.angmom.faceon", return_value=None):
Collaborator

same as before, what is this multiline comment for?

expected_result = jnp.stack(
def test_get_calculate_datacube_particlewise():
# Setup config and telescope
config = {
Collaborator

It could make sense to turn these things into fixtures, as well as the other classes and configs that exist in this file.
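A minimal sketch of the fixture idea; the config keys below are placeholders, not the actual rubix schema:

```python
import pytest

@pytest.fixture
def telescope_config():
    # hypothetical minimal config shared by the datacube tests
    return {
        "telescope": {"name": "MUSE"},
        "ssp": {"template": {"name": "MaStar"}},
    }

def test_get_calculate_datacube_particlewise(telescope_config):
    config = telescope_config
    # ... build the telescope and run the particle-wise datacube calculation ...
```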

# The cube should have nonzero values (sanity check)
assert jnp.any(output_cube != 0)

print("run_sharded output shape:", output_cube.shape)
Collaborator

could we remove these?

n_particles = num_devices if num_devices > 1 else 2 # At least two for sanity

# Mock input data
input_data = RubixData(
Collaborator

could this conceivably be a fixture?

local_cube = out_local.stars.datacube  # shape (25, 25, 5994)
# in-XLA all-reduce across the "data" axis:
summed_cube = lax.psum(local_cube, axis_name="data")
return summed_cube  # replicated on each device
Collaborator

did this refactor happen or shall we open an issue for this?

@TobiBu merged commit d9041c9 into main Nov 10, 2025
6 checks passed


Development

Successfully merging this pull request may close these issues.

Refactor Pipeline parallelization using jax.sharding

4 participants