
Conversation


@mayunaise commented Feb 9, 2026

What does this PR do?

This feature implements Cluster Analyse, a visualization tool for the rapid analysis of reinforcement learning (RL) performance data. It parses profiling data collected via the VeRL framework to generate timeline charts for each stage of the RL process.


Sample


Checklist Before Starting

  • Search for similar PRs. Paste at least one query link here: ...
  • Format the PR title as [{modules}] {type}: {description} (This will be checked by the CI)
    • {modules} include fsdp, megatron, veomni, sglang, vllm, rollout, trainer, ci, training_utils, recipe, hardware, deployment, ray, worker, single_controller, misc, perf, model, algo, env, tool, ckpt, doc, data, cfg, reward
    • If this PR involves multiple modules, separate them with , like [megatron, fsdp, doc]
    • {type} is in feat, fix, refactor, chore, test
    • If this PR breaks any API (CLI arguments, config, function signature, etc.), add [BREAKING] to the beginning of the title.
    • Example: [BREAKING][fsdp, megatron] feat: dynamic batching

Test

For changes that cannot be tested by CI (e.g., algorithm implementations, new model support), validate with experiments and show results such as training-curve plots, evaluation results, etc.

API and Usage Example

Demonstrate how the API changes if any, and provide usage example(s) if possible.

python cluster_analysis.py --input-path ./data --output-path ./output
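Per the review excerpts below, each parsed event is normalized from microseconds to milliseconds before charting. A minimal standalone sketch of that conversion — the `US_TO_MS` value and the `make_event` helper are illustrative assumptions, not the PR's actual code:

```python
US_TO_MS = 1000.0  # assumed stand-in for Constant.US_TO_MS

def make_event(roll, rank_id, tid, start_us, end_us):
    """Build one timeline event, converting microsecond timestamps to ms."""
    start_ms = start_us / US_TO_MS
    duration_ms = (end_us - start_us) / US_TO_MS
    return {
        "name": roll,
        "roll": roll,
        "domain": "default",
        "start_time_ms": start_ms,
        "end_time_ms": start_ms + duration_ms,
        "duration_ms": duration_ms,
        "rank_id": rank_id,
        "tid": tid,
    }

event = make_event("actor", rank_id=0, tid=1234, start_us=2000.0, end_us=14500.0)
print(event["duration_ms"])  # -> 12.5
```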

Design & Code Changes

Demonstrate the high-level design if this PR is complex, and list the specific changes.

Checklist Before Submitting

Important

Please check all the following items before requesting a review, otherwise the reviewer might deprioritize this PR for review.

@CLAassistant

CLAassistant commented Feb 9, 2026

CLA assistant check
All committers have signed the CLA.

Contributor

@gemini-code-assist bot left a comment


Code Review

This pull request introduces a new tool for scheduling analysis based on profiling data. The implementation is comprehensive, including data preprocessing, parsing, visualization, and testing. However, I've identified several critical issues related to code robustness, portability, and correctness that could lead to runtime crashes or unexpected behavior. These include fragile logic that relies on specific file naming conventions, potential null reference exceptions, and incorrect type hint imports. Addressing these issues will significantly improve the tool's reliability and maintainability.

Comment on lines 37 to 42
try:
    for (map_key, dir_list) in rank_id_map.items():
        dir_list.sort(key=lambda x: x.split('_')[-3])
        self.data_map[map_key] = dir_list
except Exception as e:
    raise RuntimeError("Found invalid directory name!") from e

critical

The sorting key lambda x: x.split('_')[-3] is very fragile and will raise an IndexError if a directory name doesn't have at least three underscores. This error is caught by a broad except Exception block which then raises a generic RuntimeError, crashing the program with an unhelpful message. This should be made more robust to handle unexpected directory names without crashing.

Suggested change
-try:
-    for (map_key, dir_list) in rank_id_map.items():
-        dir_list.sort(key=lambda x: x.split('_')[-3])
-        self.data_map[map_key] = dir_list
-except Exception as e:
-    raise RuntimeError("Found invalid directory name!") from e
+for (map_key, dir_list) in rank_id_map.items():
+    try:
+        dir_list.sort(key=lambda x: x.split('_')[-3])
+    except IndexError:
+        logger.warning(f"Could not sort paths for {map_key} due to unexpected directory name format. The order may be incorrect.")
+    self.data_map[map_key] = dir_list

Comment on lines 152 to 169
        # Convert to milliseconds
        us_to_ms = Constant.US_TO_MS
        start_time_ms = start_ids / us_to_ms
        duration_ms = (end_ids - start_ids) / us_to_ms
        end_time_ms = start_time_ms + duration_ms

        event_data = {
            'name': roll,
            "roll": roll,
            'domain': "default",
            'start_time_ms': start_time_ms,
            'end_time_ms': end_time_ms,
            'duration_ms': duration_ms,
            'rank_id': rank_id,
            'tid': row["pid"]
        }

        events.append(event_data)

        return events

critical

There are two issues here. First, if no events are found for the process_id, start_ids will remain None, and line 154 will raise a TypeError. Second, if the loop at line 124 finds no matching events, row will be unbound, and line 166 will raise an UnboundLocalError. You should handle the case where no events are found, and use process_id for the tid.

        if start_ids is None:
            return events

        # Convert to milliseconds
        us_to_ms = Constant.US_TO_MS
        start_time_ms = start_ids / us_to_ms
        duration_ms = (end_ids - start_ids) / us_to_ms
        end_time_ms = start_time_ms + duration_ms

        event_data = {
            'name': roll,
            "roll": roll,
            'domain': "default",
            'start_time_ms': start_time_ms,
            'end_time_ms': end_time_ms,
            'duration_ms': duration_ms,
            'rank_id': rank_id,
            'tid': process_id
        }

        events.append(event_data)
            
        return events

Comment on lines +125 to +136
    t0 = df["Start"].min()
    df["Start"] -= t0
    df["Finish"] -= t0
    df["Duration"] = df["Finish"] - df["Start"]
    return df, t0

critical

If the dataframe df is empty after the cleaning steps above (e.g., all rows have NaNs or invalid start/finish times), df["Start"].min() will raise a ValueError because min() will be called on an empty sequence. You should handle the case of an empty dataframe gracefully, for example by returning early.

Suggested change
-    t0 = df["Start"].min()
-    df["Start"] -= t0
-    df["Finish"] -= t0
-    df["Duration"] = df["Finish"] - df["Start"]
-    return df, t0
+    if df.empty:
+        return df, 0.0
+    t0 = df["Start"].min()
+    df["Start"] -= t0
+    df["Finish"] -= t0
+    df["Duration"] = df["Finish"] - df["Start"]
+    return df, t0

for dir_name in dirs:
    if dir_name.endswith(Constant.PT_PROF_SUFFIX):
        path = os.path.join(root, dir_name)
        ascend_pt_dirs.append({"roll": os.path.dirname(path).split('/')[-1], "path": path})

high

The use of split('/') to extract the directory name is not portable and will fail on operating systems that use a different path separator, such as Windows. You should use os.path.basename to make this code cross-platform.

Suggested change
-        ascend_pt_dirs.append({"roll": os.path.dirname(path).split('/')[-1], "path": path})
+        ascend_pt_dirs.append({"roll": os.path.basename(os.path.dirname(path)), "path": path})
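The portability point can be sanity-checked with a quick standalone sketch; the sample directory name is made up:

```python
import os

# os.path.basename splits on whatever separator the host OS uses,
# whereas a hard-coded split('/') only understands POSIX paths.
path = os.path.join("profiles", "actor_rank0_ascend_pt", "trace")
roll_dir = os.path.basename(os.path.dirname(path))
print(roll_dir)  # -> actor_rank0_ascend_pt
```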

Comment on lines 298 to 292
ClusterParserFn = Callable[
    [
        str,
        str,
        Dict
    ],
    pd.DataFrame,
]

high

The type hint ClusterParserFn is defined as Callable[[str, str, Dict], pd.DataFrame], but the registered functions like cluster_parser_mstx have the signature (config: Dict) -> pd.DataFrame. The type hint should match the actual function signature to avoid confusion and potential issues with static analysis tools.

ClusterParserFn = Callable[[Dict], pd.DataFrame]

@tardis-key tardis-key marked this pull request as draft February 9, 2026 09:44
@mayunaise mayunaise force-pushed the cluster_analysis branch 2 times, most recently from 8f7ad0c to 01268e5 Compare February 9, 2026 10:46
    if dir_name.endswith(Constant.PT_PROF_SUFFIX):
        path = os.path.join(root, dir_name)
        ascend_pt_dirs.append({"roll": os.path.dirname(path).split("/")[-1], "path": path})
data_processor = DataPreprocessor(ascend_pt_dirs)
Collaborator


The code here needs to be generic, not Ascend-specific.



def main():
    arg_parser = argparse.ArgumentParser(description="集群调度可视化")  # "Cluster scheduling visualization"
Collaborator


Translate the comments and help information into English.

def __init__(self, path_list: list[dict]) -> None:
    self.path_list = path_list
    self.data_map = {}
    pass
Collaborator


Remove the redundant `pass`.

        rank_id_map[(task_roll, rank_id)].append(dir_name)
    try:
        for map_key, dir_list in rank_id_map.items():
            dir_list.sort(key=lambda x: x.split("_")[-3])
Collaborator


Magic number — give `-3` a named constant.
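One way to address this, sketched under the assumption that the sort key is the third-from-last underscore-separated field (the real directory-name format comes from the profiler output, so the constant and helper below are illustrative):

```python
# Hypothetical refactor: name the index so the directory-name convention
# is documented in one place instead of a bare -3 in a lambda.
SORT_FIELD_INDEX = -3  # position of the sort key among "_"-separated parts

def dir_sort_key(dir_name: str) -> str:
    parts = dir_name.split("_")
    if len(parts) < abs(SORT_FIELD_INDEX):
        return dir_name  # fall back to whole-name ordering for odd names
    return parts[SORT_FIELD_INDEX]

print(dir_sort_key("actor_20260209_104600_rank0_ascend_pt"))  # -> rank0
```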


class DataMap(TypedDict):
    rank_id: int
    roll: str
Collaborator


What's the meaning of `roll`?

def main():
    arg_parser = argparse.ArgumentParser(description="Cluster scheduling visualization")
    arg_parser.add_argument("--input-path", default="test", help="Raw path of profiling data")
    arg_parser.add_argument("--profiler-type", default="mstx", help="Profiling data type")
Author


Add the full set of supported profiler types to the help text, i.e. CLUSTER_PARSER_REGISTRY.keys().
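The author's suggestion — surface the registry keys as the valid `--profiler-type` values — can be sketched as follows; `CLUSTER_PARSER_REGISTRY` is stubbed here, since only its keys matter for the help text:

```python
import argparse

# Stub standing in for the PR's registry of parser functions.
CLUSTER_PARSER_REGISTRY = {"mstx": lambda config: None}

parser = argparse.ArgumentParser(description="Cluster scheduling visualization")
parser.add_argument(
    "--profiler-type",
    default="mstx",
    choices=sorted(CLUSTER_PARSER_REGISTRY),
    help="Profiling data type, one of: " + ", ".join(sorted(CLUSTER_PARSER_REGISTRY)),
)
args = parser.parse_args([])
print(args.profiler_type)  # -> mstx
```

With `choices`, argparse both documents the valid values in `--help` output and rejects unknown types up front.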

