Skip to content

Conversation

@VictorForouhar
Copy link
Collaborator

The value of TrackId of the same subhaloes can change across HBT-HERONS re-runs, even when running on the same simulation and FoF group catalogue. This is because the values are assigned based on the index location of the new subhaloes in the std::vector<Subhalo_t> position local to each MPI rank. This means that the value can change randomly across runs if the number of MPI ranks are changed, and even unchanged because of different random ordering.

At best, it is annoying to compare how subhalo properties change across re-runs because you need to match subhalos. At worst, it can mean having to re-run a whole HBT-HERONS analysis if you accidentally overwrite early outputs. The new TrackId values are no longer reflective of those used in the original outputs, and hence the merger trees become a nightmare.

This PR changes the assignment procedure to be based on the global ranking of the HostHaloId of new subhaloes. The change will ensure reproducibility because ordering of std::vector<Subhalo_t> and MPI decomposition no longer play a role in setting TrackId.

Hence, if you re-run on the same simulation and FoF group catalogue, you will obtain the same TrackId for the same subhaloes across re-runs. Of course, if you change the FoF group catalogue the values will change (as HostHaloId will generally change), but in those cases you want to re-run HBT-HERONS anyway.

This change makes the TrackId of re-runs of HBT-HERONS on the same simulation reproducible at the TrackId level (as long as the FoF catalogue is unchanged).
I find the fact that there are two definitions stored in a single variable name cumbersome.
@VictorForouhar
Copy link
Collaborator Author

Here is an example debug print of how this works for the first snapshot of the COLIBRE test box:

# Number of new subhaloes in each rank
Rank 0 has 0 new subhaloes
Rank 1 has 3 new subhaloes

# HostHaloId values of the new subhaloes across ranks
Rank 0 has the following HostIDs with new subhaloes: ()
Rank 1 has the following HostIDs with new subhaloes: (2, 3, 1, )

# Above information is gathered across ranks as a vector 
Number of new births across all ranks: (0, 3, )

# Counts and displacements for MPI_Gatherv in rank 0.
Rank 0 will receive 3 entries.
Rank 0 offsets are (0, 0, )

# Gathered information from all ranks in rank 0
Rank 0 has GlobalHostHaloIds (2, 3, 1, )

# Do an argort on HostHaloID to get unique values to assign
Rank 0 has argsort (2, 0, 1, )

# Add offset of the number of pre-existing subhaloes (first snapshot, so offset = 0)
Rank 0 has offset argsort (2, 0, 1, )

# Do MPI_Scatterv into original ranks and assign the TrackIds.
Rank 0 has new TrackIds ()
Rank 1 has new TrackIds (2, 0, 1, )

@VictorForouhar
Copy link
Collaborator Author

Seems to be working as intended. I have run two HBT-HERONS re-runs using this new branch on the COLIBRE test box. Running the following code confirms that we do not change TrackId values for subhaloes that form at the same SnapshotOfBirth and HostHaloId (i.e. the same subhalos!).

new_changes_run_1 = HBTReader("/cosma7/data/dp004/dc-foro1/colibre/ReproducibleTrackIdChanges_run_1")
new_changes_run_2 = HBTReader("/cosma7/data/dp004/dc-foro1/colibre/ReproducibleTrackIdChanges_run_2")

number_of_mismatchs = 0

for snap_nr in new_changes_run_1.SnapshotIdList:
    
    subhalos_run_1 = new_changes_run_1.LoadSubhalos(snap_nr)
    subhalos_run_2 = new_changes_run_2.LoadSubhalos(snap_nr)

    # Only get TrackId of new subhalos
    new_subhalos_run_1 = subhalos_run_1[subhalos_run_1["SnapshotOfBirth"] == snap_nr]
    new_subhalos_run_2 = subhalos_run_2[subhalos_run_2["SnapshotOfBirth"] == snap_nr]
    # Sort by host halo id
    new_subhalos_run_1 = new_subhalos_run_1[np.argsort(new_subhalos_run_1["HostHaloId"])]
    new_subhalos_run_2 = new_subhalos_run_2[np.argsort(new_subhalos_run_2["HostHaloId"])]

    number_of_mismatchs += (new_subhalos_run_1["TrackId"] != new_subhalos_run_2["TrackId"]).sum()

print (number_of_mismatchs)

> 0

On the other hand, re-running twice with the master version does result in mismatches:

master_branch_run_1 = HBTReader("/cosma7/data/dp004/dc-foro1/colibre/master_run_1")
master_branch_run_2 = HBTReader("/cosma7/data/dp004/dc-foro1/colibre/master_run_2")

....

print (number_of_mismatchs)

> 56863

@VictorForouhar
Copy link
Collaborator Author

The bound mass functions are unchanged as well. Ready to be reviewed.
image

@VictorForouhar
Copy link
Collaborator Author

VictorForouhar commented Nov 2, 2025

Latest commit implements one of the changes of #16, but I use the TrackId instead of ParticleId. I did two re-runs and compared whether the catalogues at each snapshot were the same (I set the MaxSampleSizePotentialEstimate to 200, to exacerbate differences).

new_changes_run_1 = HBTReader("/cosma7/data/dp004/dc-foro1/colibre/ReproducibleTracksRun1_FixSeed//")
new_changes_run_2 = HBTReader("/cosma7/data/dp004/dc-foro1/colibre/ReproducibleTracksRun2_FixSeed//")

number_of_mismatchs = 0

for snap_nr in new_changes_run_1.SnapshotIdList:
    
    subhalos_run_1 = new_changes_run_1.LoadSubhalos(snap_nr)
    subhalos_run_2 = new_changes_run_2.LoadSubhalos(snap_nr)

    # Sort by TrackId to get the same subhaloes 
    subhalos_run_1 = subhalos_run_1[np.argsort(subhalos_run_1["TrackId"])]
    subhalos_run_2 = subhalos_run_2[np.argsort(subhalos_run_2["TrackId"])]

    number_of_mismatchs += (subhalos_run_1 != subhalos_run_2).sum()

Before the latest commit, number_of_mismatchs = 138821. With the fixed seed for RNG, number_of_mismatchs = 8179. The first difference appears in the second snapshot.

@VictorForouhar
Copy link
Collaborator Author

Examining the first subhalo where there is a difference between runs, it is actually due to loss of precision in MboundType. All other entries are the same, but MboundType differs by:

[-3.7252903e-09  2.9802322e-08  0.0000000e+00  0.0000000e+00
   0.0000000e+00  0.0000000e+00]

@VictorForouhar
Copy link
Collaborator Author

In short, when fixing the seed and finding all subhaloes that differ in their Nbound, we have 53 subhaloes with differences. If I further disable subhalo sinking it goes down to 39.

I though subhalo sinking may play a role because of the order in which overlap is evaluated, and hence the resulting ordering of the Particles vector for the subhalo that accretes said particles.

@VictorForouhar
Copy link
Collaborator Author

Example subhalo which deviates between runs. Both instances agree in Nbound until one snapshot after infall, at which point they differ by 2 particles.

image

However, differences already appear when looking at the ordering of bound and sorted particles before, at snapshot 84. 6 particles differ in their binding energy ranking, as shown below. The differences grow to 41 particles at the following snapshot. Eventually, this leads to different Nbound.

image image

I will look into why there is a swap of those three particles in snap 84.

Copy link
Collaborator

@robjmcgibbon robjmcgibbon left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good, I agree that the subhalo_unbind.cpp changes would be better in another PR though

@VictorForouhar VictorForouhar merged commit 4399e98 into master Nov 6, 2025
4 checks passed
@VictorForouhar VictorForouhar deleted the reproducible_trackids branch November 6, 2025 14:53
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants