SGLang + Verl #3852

Open · wants to merge 113 commits into base: main

Conversation

fzyzcjy (Collaborator) commented Feb 25, 2025

Motivation

Still WIP; marked as "ready for review" just to check CI.
Now ready for review.

Modifications

Checklist

fzyzcjy changed the title from "[WIP] SGLang + Verl" to "SGLang + Verl" on Feb 26, 2025

print(f"hf response: {tokenizer.batch_decode(response)}")

tensor_model_parallel_size = 4
Collaborator:

This runs on TP=4, so it needs four GPUs to run? We only have two GPUs for testing right now. If needed, we can create one for this use case.

Collaborator Author (fzyzcjy), Feb 26, 2025:

Currently I name it "adhoc" and will remove it later; it is modified from guangming's #2736. It is here both for extra testing and because I guess @ocss884 may need it as a reference for the verl side.

There is another example, offline_batch_inference_torchrun.py, for testing, and also a test_verl_engine.py covering things like weight updates, comparison tests, etc.

There are a lot of vllm-named things in the script because guangming's original script names them like that; I try to make as few changes as possible and deliberately comment out original code instead of removing it, so the two stay aligned and are easy to check against each other.

Does this need to be a real example? If so, surely this script needs a big refactor.

Reply:

@fzyzcjy Yeah, I got your point. Thanks!
@zhaochenyang20 As far as I know, the TP size is not important here. If we prefer keeping this script for testing, I would recommend just cleaning it up and changing it to TP=2. But in fact it is more like a minimal dev example that showcases how the actor_rollout part initializes and updates weights in verl using the SGLang rollout. So I don't think it is really necessary for SGLang to contain such an example; it is more of a verl example.

Collaborator:

Should we PR this to verl?

Collaborator Author (fzyzcjy):

Btw, test_verl_engine.py roughly mimics this adhoc_verl_torchrun.py, doing things like comparison tests and weight updates.

Comment on lines +131 to +148
# for debug
# if rank == 0:
# lines = ["------------------------ state_dict ------------------------"]
# for k, v in state_dict.items():
# v_local = v.to_local()
# lines.append(
# f"{k}\t: {v.shape=} {v_local.shape=} {v.dtype=} {v_local.dtype=} {type(v)=} {type(v_local)=}"
# )
# print("\n".join(lines))

# NOTE MODIFIED
# sampling_params = SamplingParams(temperature=0,
# top_p=1,
# n=1,
# max_tokens=response_length,
# logprobs=1,
# ignore_eos=True,
# detokenize=False)
Collaborator:

Could we remove this?

Collaborator Author (fzyzcjy):

(see above)

temperature=0, top_p=1, n=1, max_new_tokens=response_length, ignore_eos=True
)

tp_size, dp_size = 4, 1
Collaborator:

In this case, we can also test TP=2, DP=2. This test can run longer.

Collaborator Author (fzyzcjy):

(see above)

Comment on lines +164 to +172
# llm = LLM(model=None,
# tokenizer=tokenizer,
# model_hf_config=actor_model_config,
# tensor_parallel_size=tensor_model_parallel_size,
# enforce_eager=True,
# dtype='bfloat16',
# load_format='dummy_dtensor',
# gpu_memory_utilization=0.1,
# trust_remote_code=True)
Collaborator:

delete this?

Collaborator Author (fzyzcjy):

(see above)

Collaborator:

Sorry, I don't quite understand why this is running on verl-vLLM. Where is the SGLang one?

Collaborator Author (fzyzcjy):

(see above)

@@ -422,3 +466,53 @@ def _check_and_enable_sdpa(config, hard_check_only: bool = False):
return config

setattr(Gemma2PreTrainedModel, "_check_and_enable_sdpa", _check_and_enable_sdpa)


# TODO Ask: is it ok to refactor test code like this
Collaborator:

LGTM

Comment on lines +32 to +33
write_param_name = f"model.layers.6.self_attn.qkv_proj.weight"
read_param_name = f"model.layers.6.self_attn.k_proj.weight"
Collaborator:

Shall we check more tensors? This should be quick.

)

t = time.time()
if 0:
Contributor:

what's the meaning of this line?

Collaborator Author (fzyzcjy), Feb 26, 2025:

(see above - this is a to-be-deleted adhoc script)

self,
named_tensors: List[Tuple[str, torch.Tensor]],
load_format: Optional[str] = None,
has_more: bool = False,
Contributor:

Can we rename it flush_cache? The current argument has_more assumes one specific use case for keeping a cache.

Collaborator Author (fzyzcjy):

flush_cache seems a bit like an implementation detail. For example, suppose one day SGLang adds a fancy_optimization that is slow to execute after a weight update. If we call the argument has_more, we can run fancy_optimization when has_more=False. But if we call it flush_cache, we may not be able to skip the fancy_optimization when has_more=True.

Contributor:

Users may need the flexibility to decide which optimizations they want. If a new optional optimization is implemented, we can provide another argument so that users can better control system behavior. In addition, the current argument name has_more is a bit confusing: users do not know what will happen when they turn it on. flush_cache is more specific.
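
For illustration, a minimal sketch of how a caller might use the has_more flag debated here to stream weights in several chunks; only the update_weights_from_tensor(named_tensors, load_format, has_more) signature comes from the diff above, while the helper function, the engine object, and the chunking are hypothetical.

# Hypothetical sketch of the has_more protocol; not code from this PR.
from typing import Iterable, List, Tuple
import torch

def push_weights_in_chunks(engine, chunks: Iterable[List[Tuple[str, torch.Tensor]]]):
    chunks = list(chunks)
    for i, named_tensors in enumerate(chunks):
        engine.update_weights_from_tensor(
            named_tensors=named_tensors,
            load_format=None,
            # has_more=True defers any post-update work (cache flush, or a
            # hypothetical slow optimization) until the final chunk arrives.
            has_more=(i < len(chunks) - 1),
        )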

self._tp_size = device_mesh_cpu.size()
tp_size_per_node = self._tp_size // nnodes
node_rank = self._tp_rank // tp_size_per_node
first_rank_in_node = self._tp_rank % tp_size_per_node == 0
Contributor:

Rename this to first_rank_in_group?

Collaborator Author (fzyzcjy), Feb 26, 2025:

The semantics here mean the first rank on a (physical) machine, because we need to run exactly one SGLang Engine per machine. If we rename it to "group", I wonder whether the semantics would actually become clearer, since "group" can mean an arbitrary group.

Contributor:

I misunderstood its usage. This question is resolved.
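
To make the rank arithmetic above concrete, a small standalone sketch assuming 2 nodes and a total TP size of 8 (4 ranks per node); the variable names mirror the diff, but nothing else is taken from the PR.

# Worked example of the rank arithmetic above (assumed: nnodes=2, tp_size=8).
nnodes = 2
tp_size = 8
tp_size_per_node = tp_size // nnodes  # 4 ranks per physical machine
for tp_rank in range(tp_size):
    node_rank = tp_rank // tp_size_per_node               # 0,0,0,0,1,1,1,1
    first_rank_in_node = tp_rank % tp_size_per_node == 0  # True for ranks 0 and 4
    print(tp_rank, node_rank, first_rank_in_node)
# Exactly one rank per machine (ranks 0 and 4 here) would launch the SGLang Engine.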

class VerlEngine:
def __init__(
self,
device_mesh_cpu: DeviceMesh,
Contributor:

This device mesh has only one dimension. Can we use ProcessGroup instead?

Collaborator Author (fzyzcjy), Feb 26, 2025:

I am personally OK with whatever API we use here, but the original feature request #2736 seems to pass in a 1D DeviceMesh, so my default is to align with that.

EDIT: Btw, I quickly searched, but ProcessGroup does not seem to have an API like device_mesh_cpu.mesh[0].

Contributor (ch-wan), Feb 26, 2025:

> ProcessGroup does not seem to have API like device_mesh_cpu.mesh[0].

Can we use dist.get_global_rank(group, 0) or dist.get_process_group_ranks(group)[0]?

I feel that the SGLang community is more familiar with ProcessGroup. It would be great if we could keep that consistency.
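
For reference, a hedged sketch showing that both approaches expose the same "first rank" information; it assumes a recent PyTorch with the gloo backend, a launch under torchrun, and is not taken from this PR.

# Sketch only: compares a 1D CPU DeviceMesh with a plain ProcessGroup.
import torch.distributed as dist
from torch.distributed.device_mesh import init_device_mesh

if not dist.is_initialized():
    dist.init_process_group("gloo")  # torchrun provides MASTER_ADDR/PORT etc.
world_size = dist.get_world_size()

# Option A: 1D CPU DeviceMesh, as in feature request #2736.
device_mesh_cpu = init_device_mesh("cpu", (world_size,))
first_rank_a = device_mesh_cpu.mesh[0].item()

# Option B: plain ProcessGroup, as suggested in the review.
group = dist.new_group(ranks=list(range(world_size)))
first_rank_b = dist.get_global_rank(group, 0)
# Equivalent: dist.get_process_group_ranks(group)[0]

assert first_rank_a == first_rank_b == 0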

(name, _unwrap_tensor(tensor, tp_rank=self.tp_rank))
for name, tensor in named_tensors
]
# TODO should we name it "direct" or "megatron"?
Contributor:

based on its implementation, I recommend "direct".

Collaborator Author (fzyzcjy):

P.S. #2736 named it "megatron", but I feel "direct" may be a bit more suitable, hence the question here.

@@ -292,8 +308,8 @@ def __init__(
tp_size=tp_size,
dtype=get_dtype_str(torch_dtype),
port=port,
mem_fraction_static=mem_fraction_static,
trust_remote_code=False,
mem_fraction_static=0.65,
Contributor:

Why not use the input argument mem_fraction_static?

Collaborator Author (fzyzcjy), Feb 26, 2025:

Yes, this should be fixed.

EDIT: Looks like it was caused by merging (e79f742 updates it); fixed now.

@@ -269,6 +212,79 @@ def __exit__(self, exc_type, exc_value, traceback):
self.model_proc.terminate()
self.in_queue = self.out_queue = None

@staticmethod
Contributor:

Is this refactor necessary?

Collaborator Author (fzyzcjy):

(see below)

@@ -408,6 +374,84 @@ def __exit__(self, exc_type, exc_value, traceback):
self.engine.shutdown()
del self.engine

@staticmethod
Contributor:

Is this refactor necessary?

Collaborator Author (fzyzcjy):

(see below)

mem_fraction_static=mem_fraction_static,
trust_remote_code=False,
mem_fraction_static=0.65,
trust_remote_code=True,
Contributor:

Is this change necessary? Many other tests use this code. It would be better to keep the original version.

Collaborator Author (fzyzcjy), Feb 26, 2025:

For the changes in test/runners.py:

Firstly, I am OK with either refactoring (to avoid code duplication) or copying (to avoid changing existing code), though I personally slightly prefer refactoring; that is why I left the comment # TODO Ask: is it ok to refactor test code like this in the code. Indeed, zhaochenyang20 above seems to say LGTM.

Secondly, it is refactored because, in test_verl_engine.py, I added comparison tests to ensure HuggingFace outputs are the same as SGLang outputs. test_verl_engine.py roughly mimics adhoc_verl_torchrun.py, which is a minimal modification of guangming's verl integration test script. This is quite similar to how comparison tests are done in test_generation_models.py, thus the common logic is extracted.

For trust_remote_code, IIRC some model (maybe THUDM/glm-4-9b-chat?) requires it. I copied the list of models from test_generation_models.py into test_verl_engine.py to test them, and that model comes from the list.
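
A minimal sketch of the kind of HuggingFace-vs-engine comparison test described above, assuming a single prompt and an engine object whose generate(prompt, sampling_params) returns a dict with a "text" field; this is not the actual test_verl_engine.py code.

# Sketch of the comparison-test idea; the engine API here is an assumption.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

def compare_outputs(model_path, prompt, engine, max_new_tokens=16):
    tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)
    hf_model = AutoModelForCausalLM.from_pretrained(
        model_path, torch_dtype=torch.bfloat16, trust_remote_code=True
    )
    input_ids = tokenizer(prompt, return_tensors="pt").input_ids
    with torch.no_grad():
        response = hf_model.generate(
            input_ids, do_sample=False, max_new_tokens=max_new_tokens
        )
    # Decode only the newly generated tokens so it is comparable with the rollout.
    hf_text = tokenizer.batch_decode(response[:, input_ids.shape[1]:])[0]

    engine_text = engine.generate(
        prompt, {"temperature": 0, "max_new_tokens": max_new_tokens}
    )["text"]
    assert hf_text.strip() == engine_text.strip(), (hf_text, engine_text)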

Contributor:

I leave the decision to @zhaochenyang20, as he is more knowledgeable about this refactor's potential impact.

@@ -130,7 +130,7 @@ def start_model_process(self, in_queue, out_queue, model_path, torch_dtype):
self.base_model = AutoModelForCausalLM.from_pretrained(
model_path,
torch_dtype=torch_dtype,
trust_remote_code=False,
trust_remote_code=True,
Contributor:

Is this change necessary? Many other tests use this code. It would be better to keep the original version.

Collaborator Author (fzyzcjy):

(see above)

self.tokenizer = get_tokenizer(model_path, torch_dtype=torch.dtype)
self.tokenizer = get_tokenizer(
model_path, torch_dtype=torch.dtype, trust_remote_code=True
)
Contributor:

Is this change necessary? Many other tests use this code. It would be better to keep the original version.

Collaborator Author (fzyzcjy):

(see above)
