SGLang + Verl #3852
base: main
Conversation
print(f"hf response: {tokenizer.batch_decode(response)}")

tensor_model_parallel_size = 4
This runs on TP=4, so it needs four GPUs to run? We only have two GPUs for testing right now. If needed, we can create one for this use case.
Currently I name it "adhoc" and will remove it later; it is modified from guangming's #2736. It is currently here both for extra testing and because I guess @ocss884 may need it as a reference for the verl side.
There is another example, offline_batch_inference_torchrun.py, for testing, and also a test_verl_engine.py containing things like weight updates, comparison tests, etc.
There are a lot of names like vllm in the script because guangming's original script uses them; I tried to make as few changes as possible and deliberately commented out the original code instead of removing it, to keep it aligned and easy to check.
Does this need to be a real example? If so, surely this script needs a big refactor.
@fzyzcjy Yeah I got your point. Thanks!
@zhaochenyang20 as far as I know the TP size is not important here. If we prefer keeping this script for testing, I would recommend just cleaning it up and changing to TP=2. But in fact it is more like a minimal dev example which showcases "how the actor_rollout part inits and updates weights in verl using SGLang rollout". So I don't think it is quite necessary for SGLang to contain such an example; it is more like a verl example.
We should PR this to verl?
Btw, test_verl_engine.py somewhat mimics this adhoc_verl_torchrun.py, doing things like comparison tests and weight updates.
# for debug
# if rank == 0:
#     lines = ["------------------------ state_dict ------------------------"]
#     for k, v in state_dict.items():
#         v_local = v.to_local()
#         lines.append(
#             f"{k}\t: {v.shape=} {v_local.shape=} {v.dtype=} {v_local.dtype=} {type(v)=} {type(v_local)=}"
#         )
#     print("\n".join(lines))
# NOTE MODIFIED
# sampling_params = SamplingParams(temperature=0,
#                                  top_p=1,
#                                  n=1,
#                                  max_tokens=response_length,
#                                  logprobs=1,
#                                  ignore_eos=True,
#                                  detokenize=False)
Could we remove this?
(see above)
temperature=0, top_p=1, n=1, max_new_tokens=response_length, ignore_eos=True
)

tp_size, dp_size = 4, 1
In this case we can also test TP=2, DP=2. This test can run longer.
(see above)
# llm = LLM(model=None,
#           tokenizer=tokenizer,
#           model_hf_config=actor_model_config,
#           tensor_parallel_size=tensor_model_parallel_size,
#           enforce_eager=True,
#           dtype='bfloat16',
#           load_format='dummy_dtensor',
#           gpu_memory_utilization=0.1,
#           trust_remote_code=True)
delete this?
(see above)
Sorry I don't quite understand why this is running on verl-VLLM. Where is the SGLang one?
(see above)
@@ -422,3 +466,53 @@ def _check_and_enable_sdpa(config, hard_check_only: bool = False):
    return config


setattr(Gemma2PreTrainedModel, "_check_and_enable_sdpa", _check_and_enable_sdpa)


# TODO Ask: is it ok to refactor test code like this
LGTM
write_param_name = f"model.layers.6.self_attn.qkv_proj.weight"
read_param_name = f"model.layers.6.self_attn.k_proj.weight"
Shall we check more tensors? This should be quick.
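For what it's worth, a minimal sketch of what checking several tensors could look like. The layer indices and projection names below are illustrative (they depend on the model under test), and the actual write/read round-trip check is assumed to be the same one already used for the single tensor above:

```python
# Hypothetical extension of the single-tensor check above: loop over a few
# layers and fused/split projection pairs. Names are illustrative only.
layers_to_check = [2, 6, 10]
projection_pairs = [
    ("self_attn.qkv_proj.weight", "self_attn.k_proj.weight"),
    ("mlp.gate_up_proj.weight", "mlp.gate_proj.weight"),
]
for layer in layers_to_check:
    for write_suffix, read_suffix in projection_pairs:
        write_param_name = f"model.layers.{layer}.{write_suffix}"
        read_param_name = f"model.layers.{layer}.{read_suffix}"
        # ... perform the same write/read comparison as for the single tensor above
```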
)

t = time.time()
if 0:
what's the meaning of this line?
(see above - this is a to-be-deleted adhoc script)
self,
named_tensors: List[Tuple[str, torch.Tensor]],
load_format: Optional[str] = None,
has_more: bool = False,
Can we rename it flush_cache? The current argument has_more assumes one specific use case for keeping a cache.
It seems flush_cache is a bit of an implementation detail. For example, suppose one day SGLang decides to add a fancy_optimization that is slow to execute after a weight update. If we call the argument has_more, we can run fancy_optimization only when has_more=False. But if we call it flush_cache, we may not be able to skip fancy_optimization when has_more=True.
The users may need flexibility to determine the optimization they need. If a new optional optimization is implemented, we can provide another argument so that the users can better control system behavior. In addition, the current argument name has_more is a little bit confusing. The users do not know what will happen when they turn on this argument. Using flush_cache is more specific.
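To make the trade-off concrete, here is a rough caller-side sketch under the signature quoted above. The engine object, the weight_chunks variable, and the flush_cache variant are placeholders introduced only for illustration:

```python
from typing import List, Tuple

import torch


def push_weights(engine, weight_chunks: List[List[Tuple[str, torch.Tensor]]]) -> None:
    """Stream weight shards to the engine in several chunks (illustrative helper)."""
    for i, chunk in enumerate(weight_chunks):
        engine.update_weights_from_tensor(
            named_tensors=chunk,
            load_format=None,
            # has_more: the engine decides what to defer (cache flush, any future
            # post-update work) until the final chunk arrives
            has_more=(i + 1 < len(weight_chunks)),
        )
        # flush_cache variant: the caller names the specific action instead, e.g.
        # flush_cache=(i + 1 == len(weight_chunks))
```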
self._tp_size = device_mesh_cpu.size()
tp_size_per_node = self._tp_size // nnodes
node_rank = self._tp_rank // tp_size_per_node
first_rank_in_node = self._tp_rank % tp_size_per_node == 0
first_rank_in_group
The semantics here seem to mean the first rank in a (physical) machine, because we need to execute one (and exactly one) SGLang Engine per machine. If we rename it to "group", I wonder whether the semantics would be clearer, since "group" can mean arbitrary groups.
I misunderstood its usage. This question is resolved.
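For reference, a worked example of the rank bookkeeping quoted above, with made-up numbers: with 8 TP ranks over nnodes = 2 machines, each machine holds 4 ranks and only ranks 0 and 4 are marked as the first rank in their node, i.e. exactly one engine entry point per physical machine.

```python
# Illustration of the arithmetic above with made-up values.
tp_size, nnodes = 8, 2
tp_size_per_node = tp_size // nnodes  # 4 ranks per physical machine

for tp_rank in range(tp_size):
    node_rank = tp_rank // tp_size_per_node                # 0 0 0 0 1 1 1 1
    first_rank_in_node = tp_rank % tp_size_per_node == 0   # True only for ranks 0 and 4
    print(tp_rank, node_rank, first_rank_in_node)
```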
class VerlEngine:
    def __init__(
        self,
        device_mesh_cpu: DeviceMesh,
This device mesh has only one dimension. Can we use ProcessGroup instead?
I am personally OK with whatever API here, but the original feature request #2736 seems to pass in a 1D DeviceMesh, so my default is to align with that.
EDIT: Btw, I quickly searched but ProcessGroup does not seem to have an API like device_mesh_cpu.mesh[0].
ProcessGroup does not seem to have API like device_mesh_cpu.mesh[0].
Can we use dist.get_global_rank(group, 0) or dist.get_process_group_ranks(group)[0]?
I feel that the SGLang community is more familiar with ProcessGroup. It would be great if we can keep such consistency.
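For context, a minimal sketch of the two options being discussed. The import paths may differ across torch versions, and the functions below cover only the first-rank lookup, not the full constructor:

```python
import torch.distributed as dist
from torch.distributed.device_mesh import DeviceMesh  # path may vary by torch version


# Option A: keep the 1D DeviceMesh from the current signature.
def first_rank_from_mesh(device_mesh_cpu: DeviceMesh) -> int:
    # mesh[0] is the global rank of the first entry along the single mesh dimension
    return int(device_mesh_cpu.mesh[0])


# Option B: accept a plain ProcessGroup, as suggested in the review.
def first_rank_from_group(group: dist.ProcessGroup) -> int:
    # map group-local rank 0 back to its global rank
    return dist.get_global_rank(group, 0)
    # equivalently: dist.get_process_group_ranks(group)[0]
```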
(name, _unwrap_tensor(tensor, tp_rank=self.tp_rank))
for name, tensor in named_tensors
]
# TODO should we name it "direct" or "megatron"?
based on its implementation, I recommend "direct".
P.S. #2736 named it "megatron", but I feel "direct" may be a bit more suitable, so I left the question here.
python/sglang/test/runners.py
@@ -292,8 +308,8 @@ def __init__(
tp_size=tp_size,
dtype=get_dtype_str(torch_dtype),
port=port,
mem_fraction_static=mem_fraction_static,
trust_remote_code=False,
mem_fraction_static=0.65,
Why not use the input argument mem_fraction_static?
Yes, this should be fixed.
EDIT: Looks like it was caused by merging (e79f742 updates it), and it has been fixed.
@@ -269,6 +212,79 @@ def __exit__(self, exc_type, exc_value, traceback):
self.model_proc.terminate()
self.in_queue = self.out_queue = None

@staticmethod
Is this refactor necessary?
(see below)
@@ -408,6 +374,84 @@ def __exit__(self, exc_type, exc_value, traceback):
self.engine.shutdown()
del self.engine

@staticmethod
Is this refactor necessary?
(see below)
mem_fraction_static=mem_fraction_static,
trust_remote_code=False,
mem_fraction_static=0.65,
trust_remote_code=True,
Is this change necessary? Many other tests use this code. It would be better to keep the original version.
For changes in test/runners.py:
Firstly, it is OK for me either to refactor (to avoid code duplication) or to copy (to avoid changing existing code), though I personally slightly prefer refactoring; thus I commented "# TODO Ask: is it ok to refactor test code like this" in the code. Indeed, zhaochenyang20 above seems to say LGTM.
Secondly, it is refactored because, in test_verl_engine.py, I made some comparison tests to ensure HuggingFace outputs are the same as SGLang outputs. test_verl_engine.py roughly mimics adhoc_verl_torchrun.py, which is a minimal modification of guangming's Verl integration test script. This is quite similar to how comparison tests are done in test_generation_models.py, so the common logic is extracted.
For trust_remote_code, IIRC it is because some model (maybe THUDM/glm-4-9b-chat?) requires it. I copied the list of models from test_generation_models.py into test_verl_engine.py to test them, and this model comes from there.
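As a rough illustration of the kind of comparison test being described (the helper name and the tolerance below are made up; the actual checks live in test_verl_engine.py and test/runners.py):

```python
import numpy as np


def assert_close_logprobs(hf_logprobs, sgl_logprobs, tolerance: float = 2e-2) -> None:
    """Compare per-token logprobs from a HuggingFace run against an SGLang run."""
    hf = np.asarray(hf_logprobs, dtype=np.float64)
    sgl = np.asarray(sgl_logprobs, dtype=np.float64)
    assert hf.shape == sgl.shape, f"shape mismatch: {hf.shape} vs {sgl.shape}"
    max_diff = float(np.abs(hf - sgl).max())
    assert max_diff < tolerance, f"max logprob diff {max_diff} exceeds {tolerance}"
```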
I leave the decision to @zhaochenyang20 as he is more knowledgeable about this refactor's potential impact.
@@ -130,7 +130,7 @@ def start_model_process(self, in_queue, out_queue, model_path, torch_dtype):
self.base_model = AutoModelForCausalLM.from_pretrained(
    model_path,
    torch_dtype=torch_dtype,
    trust_remote_code=False,
    trust_remote_code=True,
Is this change necessary? Many other tests use this code. It would be better to keep the original version.
(see above)
self.tokenizer = get_tokenizer(model_path, torch_dtype=torch.dtype)
self.tokenizer = get_tokenizer(
    model_path, torch_dtype=torch.dtype, trust_remote_code=True
)
Is this change necessary? Many other tests use this code. It would be better to keep the original version.
(see above)
Motivation
Still WIP, mark as "ready for review" just to check CI. Ready for review.
Modifications
Checklist