Rough Draft of a Data Generation Flow Chart #41
robbiegwald
Hey all!
I've been reading through all the various discussions recently and thought I would combine some of the current theories and ideas into a cohesive "plan". Please keep in mind that this is just my interpretation of how we could replicate Strawberry. It is also a very rough draft, as I'm sure we'll learn much more about Strawberry as time goes on. I'm not saying we should conclude the planning phase here; I still think there is a lot to consider. I also think we should start out with a more easily validated problem set, like chess. Suggestions for improvement are very much welcome. Feel free to build off of this chart and plan. Here's the mermaid code:
Flow Chart
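Here is a minimal version, reconstructed from the steps described below (the node labels are just the step names; the original chart may have had more detail):

```mermaid
flowchart TD
    S1["Step 1: Prompt generation"] --> S2["Step 2: Actor-critic CoT reasoning"]
    S2 --> S3["Step 3a/b: Validation"]
    S3 --> S4["Step 4: Human confirmation of validations (temporary)"]
    S4 --> S5["Step 5: Synthesize final chains-of-thought"]
    S5 --> S6["Step 6: Grade CoT on a rubric"]
    S6 --> S7["Step 7: Fine-tune the model"]
    S7 --> S8["Step 8: Evaluate"]
```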
Step 1: Prompt Generation
Generate high-quality prompts using both models and crowd-sourced queries. This work has already begun. I think we should mix both synthetic and human-generated prompts. Every prompt should be tagged with a category (math, history, physics, etc.) to make the validation process easier. For building a chess problem set, we can create a simple Python program that generates a legal chess game and then asks for the best move (see the sketch below). See this video for more on prompt generation.
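As a starting point, here's a minimal sketch of that generator using the python-chess library. The prompt wording and record format are placeholders, and a real run would pair each position with an engine move (e.g. from Stockfish) as the reference answer:

```python
import json
import random

import chess  # pip install python-chess

def random_position(max_plies: int = 40) -> chess.Board:
    """Play random legal moves from the start to reach a legal position."""
    board = chess.Board()
    for _ in range(random.randint(10, max_plies)):
        if board.is_game_over():
            break
        board.push(random.choice(list(board.legal_moves)))
    return board

def make_prompt(board: chess.Board) -> dict:
    """Wrap the position in a prompt record; the category tag helps validation later."""
    return {
        "category": "chess",
        "fen": board.fen(),
        "prompt": (
            f"Here is a chess position (FEN): {board.fen()}\n"
            f"It is {'White' if board.turn else 'Black'} to move. "
            "What is the best move? Answer in UCI notation."
        ),
    }

if __name__ == "__main__":
    for _ in range(5):
        print(json.dumps(make_prompt(random_position())))
```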
Step 2: Model CoT reasoning using actor-critic
Some work has also already begun on this. From what I can tell, it makes more sense for the "actor" (the one solving the problem and generating the CoT) to be the same model as the one we will fine-tune (e.g. Llama 3.1 70B). The "critic" model can be whatever we want it to be, likely o1-preview/mini or Claude 3.5 Sonnet. The critic should give simple 1-2 sentence answers; I've found that giving it more room to explain itself causes it to give bad feedback. Dave breaks down the reasoning process quite well in this video. This process needs some more prompt engineering, and possibly some novel solutions. In my own testing, the actor can get stuck in loops, unable to solve the problem yet not trying any new strategies. Perhaps this could be fixed with prompt engineering. OpenAI likely solved this with brute force (generate so many answers that one of them has to be right). A rough sketch of the loop is below.
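Here, `call_actor` and `call_critic` are hypothetical stand-ins for whatever API wrappers we end up using, and the repeated-attempt check is just one cheap idea for breaking the loops:

```python
def call_actor(history: list[str]) -> str:
    """Hypothetical wrapper around the actor model (e.g. Llama 3.1 70B)."""
    raise NotImplementedError

def call_critic(problem: str, attempt: str) -> str:
    """Hypothetical wrapper around the critic (e.g. o1-mini or Claude 3.5
    Sonnet), prompted to reply with a 1-2 sentence critique only."""
    raise NotImplementedError

def actor_critic_cot(problem: str, max_rounds: int = 8) -> list[str]:
    history = [problem]
    seen: set[str] = set()
    for _ in range(max_rounds):
        attempt = call_actor(history)
        # Crude loop detection: if the actor repeats an attempt verbatim,
        # steer it toward a new strategy instead of critiquing it again.
        if attempt in seen:
            history.append("You already tried that. Try a different strategy.")
            continue
        seen.add(attempt)
        critique = call_critic(problem, attempt)
        history.extend([attempt, critique])
        # Assumes the critic is prompted to open with "Correct"/"Incorrect".
        if critique.strip().lower().startswith("correct"):
            break
    return history
```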
Step 3a/b: Validation
This is probably the trickiest section of them all. I think it's pretty clear that brute force can solve this problem, but we don't have the millions of dollars of compute that OpenAI has, so we have to be more creative. One solution that has been floating around, and that I quite like, is giving a validator model the ability to execute simple Python scripts. This would make verifying certain problems trivially easy, particularly math and physics problems, though it wouldn't work for everything (see the sketch below). It's possible that with some prompt engineering you could get a single-model validator to give high-accuracy validations, but I doubt it. We could use a semi-brute-force method: a jury of models that evaluates each answer. We could also leverage the open-source nature of the project and create a public web interface where people evaluate model answers. This may create data-quality problems, but it would be much cheaper.
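Here's a sketch of the script-execution idea. The validator model would emit a short Python check (the PASS/FAIL convention here is just an assumption), and we run it in a subprocess with a timeout. Running model-written code safely needs real sandboxing; this is only a sketch:

```python
import subprocess
import sys

def run_verification_script(code: str, timeout_s: int = 10) -> bool:
    """Run a model-written Python check and read the verdict from stdout.
    WARNING: a bare subprocess is not a sandbox; use containers/isolation
    before running untrusted model-generated code at scale."""
    try:
        result = subprocess.run(
            [sys.executable, "-c", code],
            capture_output=True, text=True, timeout=timeout_s,
        )
    except subprocess.TimeoutExpired:
        return False
    return result.returncode == 0 and "PASS" in result.stdout

# Example: verifying a claimed answer to "What is 17 * 23?"
check = 'print("PASS" if 17 * 23 == 391 else "FAIL")'
print(run_verification_script(check))  # True
```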
Step 4: Human confirmation of model validations (temporary)
When first attempting this process in a larger run, we should have humans validate some or all of the approved answers to ensure that they are actually accurate.
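"Some or all" could simply be a sampling rate that starts at 100% and gets dialed down as the validators prove themselves; a trivial sketch:

```python
import random

def sample_for_human_review(approved: list[dict], rate: float = 1.0) -> list[dict]:
    """Send a random fraction of validator-approved answers to human review.
    Start at rate=1.0 (check everything) and lower it as spot-checks stay clean."""
    if not approved:
        return []
    return random.sample(approved, max(1, round(len(approved) * rate)))
```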
Step 5: Synthesize final chains-of-thought
This step involves taking the full actor-critic chat and combining it into one solid block of reasoning. It's trickier than it may seem, since getting LLMs to reproduce long chats verbatim without paraphrasing is a challenge, especially on harder problems that require more reasoning. I've placed this step here because I think it's better to validate before rephrasing, though there's an argument to be made that it should come before validation. One possible workaround is sketched below.
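The workaround (sketched here with a hypothetical `call_rewriter` LLM wrapper) is to never ask the model to repeat anything: concatenate the actor's turns programmatically, use the LLM only for connective tissue, and fall back to the raw concatenation if the rewrite drops a step. The verbatim check is deliberately crude:

```python
def call_rewriter(raw_chain: str) -> str:
    """Hypothetical LLM call that smooths the transitions between steps
    while being instructed to keep each step verbatim."""
    raise NotImplementedError

def synthesize_cot(history: list[str]) -> str:
    # Assumes a cleaned transcript that alternates actor/critic turns after
    # the problem; we keep only the actor's turns for the final CoT.
    actor_turns = history[1::2]
    raw_chain = "\n\n".join(actor_turns)
    final = call_rewriter(raw_chain)
    # Crude verbatim check: every actor step should survive the rewrite.
    if all(turn in final for turn in actor_turns):
        return final
    return raw_chain  # fall back rather than accept a paraphrase
```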
Step 6: Grade CoT on a rubric
This step really only applies if we massively scale up the project. The idea is that if you have multiple correct chains-of-thought for a given prompt, you can have another model grade each one on the quality of its reasoning: how consistent it is, how well the model corrects its own mistakes, and so on. We likely won't be generating enough data for this step to be necessary at first. If our smaller runs do produce prompts with multiple correct responses, there would likely be few enough of them that we could select among them manually.
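A sketch of what that grading could look like; the rubric dimensions and the `grade_chain` wrapper are placeholders:

```python
RUBRIC = """Score this chain-of-thought from 1-5 on each dimension:
- consistency: does each step follow from the previous one?
- self_correction: are mistakes caught and fixed?
- clarity: is the reasoning easy to follow?
Reply as JSON, e.g. {"consistency": 4, "self_correction": 5, "clarity": 3}."""

def grade_chain(chain: str) -> dict[str, int]:
    """Hypothetical grader-model call that applies RUBRIC and parses the JSON."""
    raise NotImplementedError

def best_chain(correct_chains: list[str]) -> str:
    """Keep the highest-scoring correct chain for the fine-tuning set."""
    return max(correct_chains, key=lambda c: sum(grade_chain(c).values()))
```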
Step 7: Fine-tune the model
Pretty self-explanatory: fine-tune an open-source model on our dataset.
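The main open question here is the training-example format. A minimal sketch of one JSONL record, assuming we train the model to emit its reasoning before the final answer (the field names and tags are placeholders; any standard SFT pipeline should accept something like this):

```python
import json

example = {
    "prompt": "Here is a chess position (FEN): ... What is the best move?",
    "completion": (
        "<reasoning>\n...synthesized chain-of-thought...\n</reasoning>\n"
        "<answer>e2e4</answer>"
    ),
}
with open("strawberry_sft.jsonl", "a") as f:
    f.write(json.dumps(example) + "\n")
```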
Step 8: Evaluate
Also pretty self-explanatory. Possible evaluations include SimpleBench, AIME, MATH, GSM8K, Mensa IQ tests, etc.
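For the math benchmarks, evaluation mostly reduces to extracting a final answer and exact-matching it against the reference. A GSM8K-style sketch (the extraction regex is a simple assumption and will need hardening):

```python
import re

def extract_final_number(text: str) -> str | None:
    """Pull the last number out of a response (GSM8K answers are numeric)."""
    numbers = re.findall(r"-?\d[\d,]*(?:\.\d+)?", text)
    return numbers[-1].replace(",", "") if numbers else None

def exact_match_accuracy(responses: list[str], golds: list[str]) -> float:
    hits = sum(extract_final_number(r) == g for r, g in zip(responses, golds))
    return hits / len(golds)

print(exact_match_accuracy(["So the answer is 42."], ["42"]))  # 1.0
```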
Conclusion
Let me know what you think of this process. It may be missing something you feel is critical to the project, or it may contain an unnecessary step that you feel is wasting resources. Please give any feedback you have. I've focused on existing ideas that are known to work and would be (relatively) easy to implement. I agree with Dave that Strawberry doesn't seem to have much secret sauce, or that the novel ideas it does have aren't super important.