feat: Wrappers for Consistent Response Formatting in GEM #121
tl;dr
Add an `EncapsulateWrapper` to detect and enforce common response signposts used by LLMs (like `\boxed{}` and `<answer></answer>`).

Motivation
Currently, GEM as a library appears somewhat undecided about how to extract unambiguous answers from LLMs: `QAEnv` uses `<answer>` HTML tags, while the TextArena environments appear to use `\boxed{}` TeX tags. Because these conventions are baked into the environments, the environments themselves become harder to use whenever a different tagging convention is desired (whether due to language-model constraints/quirks or for experimental consistency).
Given this potential diversity of tagging conventions, this PR proposes a direction for decoupling action extraction from action handling within GEM environments, so that environments are useful to a wider audience and simpler to implement (environments no longer need to assume a particular tagging format).
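For concreteness, here is a minimal sketch of what such a wrapper could look like. The class name `EncapsulateWrapper` comes from this PR, but the constructor arguments, the gym-style `step()` delegation, and the empty-action fallback are all illustrative assumptions, not the PR's actual implementation:

```python
import re


class EncapsulateWrapper:
    """Sketch of a response-signpost wrapper (API details are assumptions).

    Extracts the content between a configurable (open, close) delimiter
    pair from a raw LLM response before the wrapped environment sees it,
    so the same environment can run with <answer></answer> or \\boxed{}
    prompts without changing its own parsing code.
    """

    def __init__(self, env, open_tag="<answer>", close_tag="</answer>"):
        self.env = env
        # Non-greedy match; findall() returns every tagged span so we can
        # take the last one, since models often restate the tags while
        # reasoning before committing to a final answer.
        # Caveat: this does not balance nested braces, so extracting
        # \boxed{\frac{1}{2}} correctly would need real brace counting.
        self._pattern = re.compile(
            re.escape(open_tag) + r"(.*?)" + re.escape(close_tag),
            re.DOTALL,
        )

    def step(self, action: str):
        matches = self._pattern.findall(action)
        if not matches:
            # "Enforce": the response carried no well-formed signpost.
            # Forwarding an empty action is a placeholder policy; a
            # format-penalty reward or truncation are other options.
            return self.env.step("")
        return self.env.step(matches[-1].strip())

    def __getattr__(self, name):
        # Delegate everything else (reset(), render(), ...) to the
        # wrapped environment, in the usual gym-wrapper style.
        return getattr(self.env, name)
```

Under this design, switching an environment from `<answer>` tags to `\boxed{}` becomes a wrapper argument (e.g. `EncapsulateWrapper(env, open_tag=r"\boxed{", close_tag="}")`) rather than an edit to the environment's parsing code.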
Word of Caution
Although this PR points out the inconsistencies in GEM's response tagging and extraction, no changes have yet been made to existing or in-development environments, their prompts, or their action-parsing procedures.
Those would be breaking changes that may affect experimental results (due to differences in prompting), and I am currently looking for ways to resolve this with minimal impact.