SGLang + Verl #3852
base: main
Conversation
print(f"hf response: {tokenizer.batch_decode(response)}")

tensor_model_parallel_size = 4
This runs on TP=4, so it needs four GPUs to run? We only have two GPUs for testing right now. If needed, we can create one for this use case.
Currently I name it "adhoc" and will remove it later; it is modified from guangming's #2736. It is currently here both for extra testing and because I guess @ocss884 may need it as a reference for the verl side.
There is another example, offline_batch_inference_torchrun.py, for testing, and also a test_verl_engine.py containing things like weight updates, comparison tests, etc.
There are a lot of names like vllm in the script because guangming's original script uses them; I tried to make as few changes as possible and deliberately commented out the original code instead of removing it, to keep it aligned and easy to check.
Does this need to be a real example? If so, surely this script needs a big refactor.
@fzyzcjy Yeah I got your point. Thanks!
@zhaochenyang20 as far as I know the TP size is not important here. If we prefer keeping this script for testing, I would recommend just cleaning it up and changing to TP=2. But in fact it is more like a minimal dev example which showcases "how the actor_rollout part inits and updates weights in verl using SGLang rollout". So I don't think it is quite necessary for SGLang to contain such an example; it is more like a verl example.
We should PR this to verl?
Btw, test_verl_engine.py somewhat mimics this adhoc_verl_torchrun.py, doing things like comparison tests and weight updates.
# for debug
# if rank == 0:
#     lines = ["------------------------ state_dict ------------------------"]
#     for k, v in state_dict.items():
#         v_local = v.to_local()
#         lines.append(
#             f"{k}\t: {v.shape=} {v_local.shape=} {v.dtype=} {v_local.dtype=} {type(v)=} {type(v_local)=}"
#         )
#     print("\n".join(lines))
# NOTE MODIFIED
# sampling_params = SamplingParams(temperature=0,
#                                  top_p=1,
#                                  n=1,
#                                  max_tokens=response_length,
#                                  logprobs=1,
#                                  ignore_eos=True,
#                                  detokenize=False)
Could we remove this?
(see above)
temperature=0, top_p=1, n=1, max_new_tokens=response_length, ignore_eos=True
)

tp_size, dp_size = 4, 1
In this case we can also test TP=2, DP=2. This test can run longer.
(see above)
# llm = LLM(model=None,
#           tokenizer=tokenizer,
#           model_hf_config=actor_model_config,
#           tensor_parallel_size=tensor_model_parallel_size,
#           enforce_eager=True,
#           dtype='bfloat16',
#           load_format='dummy_dtensor',
#           gpu_memory_utilization=0.1,
#           trust_remote_code=True)
delete this?
(see above)
Sorry I don't quite understand why this is running on verl-VLLM. Where is the SGLang one?
(see above)
@@ -422,3 +466,53 @@ def _check_and_enable_sdpa(config, hard_check_only: bool = False):
    return config


setattr(Gemma2PreTrainedModel, "_check_and_enable_sdpa", _check_and_enable_sdpa)


# TODO Ask: is it ok to refactor test code like this
LGTM
write_param_name = f"model.layers.6.self_attn.qkv_proj.weight"
read_param_name = f"model.layers.6.self_attn.k_proj.weight"
Shall we check more tensors? This should be quick.
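For what it's worth, a minimal sketch of what checking several tensors could look like. The layer indices and projection names below are illustrative (they depend on the model under test), and the actual write/read round-trip check is assumed to be the same one already used for the single tensor above:

```python
# Hypothetical extension of the single-tensor check above: loop over a few
# layers and fused/split projection pairs. Names are illustrative only.
layers_to_check = [2, 6, 10]
projection_pairs = [
    ("self_attn.qkv_proj.weight", "self_attn.k_proj.weight"),
    ("mlp.gate_up_proj.weight", "mlp.gate_proj.weight"),
]
for layer in layers_to_check:
    for write_suffix, read_suffix in projection_pairs:
        write_param_name = f"model.layers.{layer}.{write_suffix}"
        read_param_name = f"model.layers.{layer}.{read_suffix}"
        # ... perform the same write/read comparison as for the single tensor above
```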
)

t = time.time()
if 0:
what's the meaning of this line?
(see above - this is a to-be-deleted adhoc script)
self,
named_tensors: List[Tuple[str, torch.Tensor]],
load_format: Optional[str] = None,
has_more: bool = False,
Can we rename it flush_cache? The current argument has_more assumes one specific use case for keeping a cache.
It seems flush_cache is a bit of an implementation detail. For example, suppose one day SGLang decides to add a fancy_optimization that is slow to execute after a weight update. If we call the argument has_more, we can run fancy_optimization only when has_more=False. But if we call it flush_cache, we may not be able to skip fancy_optimization when has_more=True.
The users may need flexibility to determine the optimization they need. If a new optional optimization is implemented, we can provide another argument so that the users can better control system behavior. In addition, the current argument name has_more is a little bit confusing. The users do not know what will happen when they turn on this argument. Using flush_cache is more specific.
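To make the trade-off concrete, here is a rough caller-side sketch under the signature quoted above. The engine object, the weight_chunks variable, and the flush_cache variant are placeholders introduced only for illustration:

```python
from typing import List, Tuple

import torch


def push_weights(engine, weight_chunks: List[List[Tuple[str, torch.Tensor]]]) -> None:
    """Stream weight shards to the engine in several chunks (illustrative helper)."""
    for i, chunk in enumerate(weight_chunks):
        engine.update_weights_from_tensor(
            named_tensors=chunk,
            load_format=None,
            # has_more: the engine decides what to defer (cache flush, any future
            # post-update work) until the final chunk arrives
            has_more=(i + 1 < len(weight_chunks)),
        )
        # flush_cache variant: the caller names the specific action instead, e.g.
        # flush_cache=(i + 1 == len(weight_chunks))
```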
self._tp_size = device_mesh_cpu.size()
tp_size_per_node = self._tp_size // nnodes
node_rank = self._tp_rank // tp_size_per_node
first_rank_in_node = self._tp_rank % tp_size_per_node == 0
first_rank_in_group
The semantics here seem to mean the first rank in a (physical) machine, because we need to execute one (and exactly one) SGLang Engine per machine. If we rename it to "group", I wonder whether the semantics would be clearer, since "group" can mean arbitrary groups.
I misunderstood its usage. This question is resolved.
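For reference, a worked example of the rank bookkeeping quoted above, with made-up numbers: with 8 TP ranks over nnodes = 2 machines, each machine holds 4 ranks and only ranks 0 and 4 are marked as the first rank in their node, i.e. exactly one engine entry point per physical machine.

```python
# Illustration of the arithmetic above with made-up values.
tp_size, nnodes = 8, 2
tp_size_per_node = tp_size // nnodes  # 4 ranks per physical machine

for tp_rank in range(tp_size):
    node_rank = tp_rank // tp_size_per_node                # 0 0 0 0 1 1 1 1
    first_rank_in_node = tp_rank % tp_size_per_node == 0   # True only for ranks 0 and 4
    print(tp_rank, node_rank, first_rank_in_node)
```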
class VerlEngine:
    def __init__(
        self,
        device_mesh_cpu: DeviceMesh,
This device mesh has only one dimension. Can we use ProcessGroup instead?
I am personally OK with whatever API here, but the original feature request #2736 seems to pass in a 1D DeviceMesh, so my default is to align with that.
EDIT: Btw, I quickly searched but ProcessGroup does not seem to have an API like device_mesh_cpu.mesh[0].
ProcessGroup does not seem to have API like device_mesh_cpu.mesh[0].
Can we use dist.get_global_rank(group, 0) or dist.get_process_group_ranks(group)[0]?
I feel that the SGLang community is more familiar with ProcessGroup. It would be great if we can keep such consistency.
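For context, a minimal sketch of the two options being discussed. The import paths may differ across torch versions, and the functions below cover only the first-rank lookup, not the full constructor:

```python
import torch.distributed as dist
from torch.distributed.device_mesh import DeviceMesh  # path may vary by torch version


# Option A: keep the 1D DeviceMesh from the current signature.
def first_rank_from_mesh(device_mesh_cpu: DeviceMesh) -> int:
    # mesh[0] is the global rank of the first entry along the single mesh dimension
    return int(device_mesh_cpu.mesh[0])


# Option B: accept a plain ProcessGroup, as suggested in the review.
def first_rank_from_group(group: dist.ProcessGroup) -> int:
    # map group-local rank 0 back to its global rank
    return dist.get_global_rank(group, 0)
    # equivalently: dist.get_process_group_ranks(group)[0]
```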
(name, _unwrap_tensor(tensor, tp_rank=self.tp_rank))
for name, tensor in named_tensors
]
# TODO should we name it "direct" or "megatron"?
based on its implementation, I recommend "direct".
P.S. #2736 named it "megatron", but I feel "direct" may be a bit more suitable, so I left the question here.
python/sglang/test/runners.py
@@ -292,8 +308,8 @@ def __init__(
tp_size=tp_size,
dtype=get_dtype_str(torch_dtype),
port=port,
mem_fraction_static=mem_fraction_static,
trust_remote_code=False,
mem_fraction_static=0.65,
Why not use the input argument mem_fraction_static?
Yes, this should be fixed.
EDIT: Looks like it was caused by merging (e79f742 updates it), and it has been fixed.
@@ -269,6 +212,79 @@ def __exit__(self, exc_type, exc_value, traceback):
self.model_proc.terminate()
self.in_queue = self.out_queue = None

@staticmethod
Is this refactor necessary?
(see below)
@@ -408,6 +374,84 @@ def __exit__(self, exc_type, exc_value, traceback):
self.engine.shutdown()
del self.engine

@staticmethod
Is this refactor necessary?
(see below)
mem_fraction_static=mem_fraction_static,
trust_remote_code=False,
mem_fraction_static=0.65,
trust_remote_code=True,
Is this change necessary? Many other tests use this code. It would be better to keep the original version.
For changes in test/runners.py:
Firstly, it is OK for me either to refactor (to avoid code duplication) or to copy (to avoid changing existing code), though I personally slightly prefer refactoring; thus I commented "# TODO Ask: is it ok to refactor test code like this" in the code. Indeed, zhaochenyang20 above seems to say LGTM.
Secondly, it is refactored because, in test_verl_engine.py, I made some comparison tests to ensure HuggingFace outputs are the same as SGLang outputs. test_verl_engine.py roughly mimics adhoc_verl_torchrun.py, which is a minimal modification of guangming's Verl integration test script. This is quite similar to how comparison tests are done in test_generation_models.py, so the common logic is extracted.
For trust_remote_code, IIRC it is because some model (maybe THUDM/glm-4-9b-chat?) requires it. I copied the list of models from test_generation_models.py into test_verl_engine.py to test them, and this model comes from there.
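As a rough illustration of the kind of comparison test being described (the helper name and the tolerance below are made up; the actual checks live in test_verl_engine.py and test/runners.py):

```python
import numpy as np


def assert_close_logprobs(hf_logprobs, sgl_logprobs, tolerance: float = 2e-2) -> None:
    """Compare per-token logprobs from a HuggingFace run against an SGLang run."""
    hf = np.asarray(hf_logprobs, dtype=np.float64)
    sgl = np.asarray(sgl_logprobs, dtype=np.float64)
    assert hf.shape == sgl.shape, f"shape mismatch: {hf.shape} vs {sgl.shape}"
    max_diff = float(np.abs(hf - sgl).max())
    assert max_diff < tolerance, f"max logprob diff {max_diff} exceeds {tolerance}"
```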
I leave the decision to @zhaochenyang20 as he is more knowledgeable about this refactor's potential impact.
@@ -130,7 +130,7 @@ def start_model_process(self, in_queue, out_queue, model_path, torch_dtype):
self.base_model = AutoModelForCausalLM.from_pretrained(
    model_path,
    torch_dtype=torch_dtype,
    trust_remote_code=False,
    trust_remote_code=True,
Is this change necessary? Many other tests use this code. It would be better to keep the original version.
(see above)
self.tokenizer = get_tokenizer(model_path, torch_dtype=torch.dtype)
self.tokenizer = get_tokenizer(
    model_path, torch_dtype=torch.dtype, trust_remote_code=True
)
Is this change necessary? Many other tests use this code. It would be better to keep the original version.
(see above)
Motivation
Still WIP, mark as "ready for review" just to check CI. Ready for review.
Modifications
Checklist