Code Review
- Review the code together
- Merge branches into main
LLM evaluation
- Check that correct/incorrect judgments are evaluated one at a time, not batched as lists
- Each environment returns the list of actions it supports, and these are injected into the prompt, so new environments with new actions can be supported (see the sketch after this list)
- Finalize the LLM prompts for the tutor and the simulated student with DeepSeek models (even small ones if they are faster), then run with the larger models and the paid models
- Tutoring evaluation prompt
- Simulated student evaluation
- Consider adding a "Get hint" action to the tutor.
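A minimal sketch of the two evaluation points above: each step gets its own LLM call, and whichever actions the current environment advertises are injected into the prompt. All names here (`get_supported_actions`, `build_prompt`, `judge_steps`, the template text) are hypothetical, not the repo's actual API.

```python
# Hypothetical sketch -- none of these names are the repo's actual API.

ACTION_PROMPT_TEMPLATE = """You are evaluating one step in a tutoring session.
The environment supports these actions:
{actions}

Student step: {step}
Answer with exactly one word: CORRECT or INCORRECT."""

def build_prompt(env, step) -> str:
    # Each environment advertises its own action set, so a new
    # environment with new actions needs no prompt changes.
    actions = "\n".join(f"- {a}" for a in env.get_supported_actions())
    return ACTION_PROMPT_TEMPLATE.format(actions=actions, step=step)

def judge_steps(env, steps, llm) -> list[bool]:
    # One LLM call per step: each correct/incorrect judgment is made
    # independently rather than over a whole list at once.
    return [llm(build_prompt(env, step)).strip().upper() == "CORRECT"
            for step in steps]
```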
RL Wrapper
- Test to make sure we have something working 🤷 (a minimal smoke test is sketched below)
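One cheap way to check that the wrapper works at all: a random-policy smoke test. This assumes the wrapper follows the modern Gymnasium API (`reset()` returning `(obs, info)`, `step()` returning five values); if it uses the older Gym interface, the signatures differ.

```python
import gymnasium as gym

def smoke_test(env: gym.Env, episodes: int = 3) -> None:
    """Roll a random policy through a few episodes; a crash here
    reveals a broken wrapper before any real RL training is attempted."""
    for _ in range(episodes):
        obs, info = env.reset()
        done = False
        while not done:
            action = env.action_space.sample()  # random policy is enough here
            obs, reward, terminated, truncated, info = env.step(action)
            done = terminated or truncated
    env.close()
```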
CTAT Env
- Need to make as much of the matcher interpreter work as possible
- Involves implementing several subroutines
- There is a global unordered attribute on graphs that isn't directly implemented (see the sketch after this list)
- Double-check the multiple-next-action behavior; make sure the behavior recorder is working as we expect
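For the unordered attribute, a loose sketch of the intended semantics: when a behavior-graph group is unordered, a student action may match any remaining link in the group rather than only the next one in sequence. `matches()` and the link objects are stand-ins for whatever the matcher interpreter actually uses.

```python
def match_unordered(action, remaining_links):
    """Return the first remaining link that `action` matches, in any
    order, and consume it; return None if nothing matches."""
    for link in remaining_links:
        if link.matches(action):  # assumed matcher predicate
            remaining_links.remove(link)
            return link
    return None
```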
OA Env
- Are we properly capturing the hint/substep sequences?
Update documentation
- Add a README that explains how to download and run everything
- Make it easy to run LLM models
- Outline how someone can add a new environment (might require some refactoring; see the sketch after this list)
- Do we need a separate content repository?
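If the refactoring goes toward a pluggable design, one common shape is a small registry that new environments opt into. The decorator, registry dict, and example class below are assumptions, not existing code.

```python
ENV_REGISTRY: dict[str, type] = {}

def register_env(name: str):
    """Class decorator that adds an environment to the registry."""
    def decorator(cls):
        ENV_REGISTRY[name] = cls
        return cls
    return decorator

@register_env("fraction_arithmetic")
class FractionArithmeticEnv:
    # A new environment only needs to register itself and declare its
    # actions; prompt construction can pick the rest up generically.
    def get_supported_actions(self):
        return ["input_value", "press_done", "request_hint"]
```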
Overall Ideas for the future
- Consider using Cohen's kappa as a way to evaluate model performance (see the sketch after this list)
- Consider generating negative examples for each tutor using the LLM student
- Consider: ReAct: Synergizing reasoning and acting in language models (https://arxiv.org/abs/2210.03629)
- Consider: Tree of thoughts: Deliberate problem solving with large language models (https://proceedings.neurips.cc/paper_files/paper/2023/hash/271db9922b8d1f4dd7aaef84ed5ac703-Abstract-Conference.html)
- Consider trying to get CTAT human data
- Consider trying to get OA tutor data
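Cohen's kappa is a chance-corrected agreement measure, so it is a reasonable check of LLM judgments against ground-truth labels. scikit-learn ships it directly; the label lists below are made-up examples.

```python
from sklearn.metrics import cohen_kappa_score

ground_truth  = [1, 0, 1, 1, 0, 1]  # 1 = correct, 0 = incorrect
llm_judgments = [1, 0, 1, 0, 0, 1]

kappa = cohen_kappa_score(ground_truth, llm_judgments)
print(f"Cohen's kappa: {kappa:.3f}")  # 1.0 = perfect, ~0 = chance-level
```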