This is a framework for measuring the effectiveness of AI agents in generating ReactJS code.
It was created to evaluate GitWit, but it can easily be used with your own code generation tool or agent.
You can run experiments to answer questions like:
- Which tasks does my agent excel at?
- What issues does the code generated by my agent have?
- How does changing the underlying LLM affect my agent's performance?
In short, you can batch run a list of prompts on your agent, then run the resulting code in a ReactJS sandbox, storing all errors and screenshots for later analysis.
To clone the repository and install dependencies, run:
```
git clone https://github.com/gitwitorg/react-eval/
cd react-eval
yarn install
```
Generated code is evaluated inside a custom E2B sandbox. To access E2B, install the E2B CLI tool and authenticate:
```
yarn global add @e2b/cli@latest
e2b auth login
```
Now, you need an OpenAI API key (or Azure API key) and an E2B API key.
Finally, copy `.env.example` to `.env`, or create a `.env` file with the following:
```
E2B_API_KEY=e2b_xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
OPENAI_API_KEY=sk-xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
```
Once these steps are finished, you can go on to the instructions below.
The inputs and outputs of evaluations are structured like this:
| Component | Location |
|---|---|
| Evaluation Tasks | `/evals/[eval].json` |
| Generated Code | `/runs/[run]/generations.json` |
| Evaluation Results | `/runs/[run]/evaluations.json` |
| Logs, Screenshots | `/runs/[run]/logs`, `/runs/[run]/screenshots` |
A typical workflow is 1) generate 2) evaluate 3) view. For example:
```
yarn generate react
yarn evaluate 1
yarn view 1
```
Each of these steps is explained below.
Generate
The command `yarn generate [eval-name]` runs the agent M x N times, where M is the number of prompts in `[eval-name]` and N is a number you can configure in `config.json`. The results are stored in JSON format in a directory such as `/runs/1/`. (If 1 is taken, 2 will be used, and so on.)
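As a quick sanity check on a run, you can count how many generations it produced; the total should equal M x N. The sketch below is illustrative only and assumes `generations.json` is a top-level JSON array (adjust it if the actual format differs):

```ts
import { readFileSync } from "fs";

// Assumption: generations.json is a top-level JSON array of generation records.
const generations = JSON.parse(readFileSync("runs/1/generations.json", "utf-8"));
console.log(`runs/1 contains ${generations.length} generations`);
```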
Evaluate
Run `yarn evaluate [run-name]` to evaluate each code generation. The evaluation results will appear in the same folder as the generations.
View
Run `yarn view [run-name]` to see the results in HTML format. You don't have to wait for the evaluation to finish before doing this; just run the command again to update the generated HTML.
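Because a run's outputs are plain JSON files and directories, you can also inspect them programmatically instead of (or in addition to) the HTML view. Here is a minimal sketch, assuming run 1 has finished; the evaluation records are treated as opaque objects since their schema isn't documented here:

```ts
import { readFileSync, readdirSync } from "fs";
import { join } from "path";

const runDir = join("runs", "1");

// Each record is treated as opaque; inspect the file to see the actual fields.
const evaluations: Record<string, unknown>[] = JSON.parse(
  readFileSync(join(runDir, "evaluations.json"), "utf-8")
);

console.log(`Loaded ${evaluations.length} evaluation records`);
console.log("Screenshots:", readdirSync(join(runDir, "screenshots")));
```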
To integrate React Eval with your own code generation tool, see where the `generateCode` function is called in `generate.ts`. Currently, this function applies one prompt to one file and replaces the entire file with the result.
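For illustration, a drop-in replacement might look like the sketch below. The actual `generateCode` signature in `generate.ts` may differ; this assumes it receives the prompt and the current file contents and resolves to the full replacement file, and the endpoint and payload shown are hypothetical:

```ts
// Hypothetical generateCode replacement; adjust to match the real signature.
async function generateCode(prompt: string, fileContents: string): Promise<string> {
  // Call your own code generation backend (hypothetical endpoint and payload).
  const response = await fetch("http://localhost:3000/generate", {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({ prompt, code: fileContents }),
  });
  const { code } = (await response.json()) as { code: string };
  return code; // React Eval replaces the entire file with this result.
}
```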
React Eval is configured to generate code with the GitWit server by default. To modify the server code, first fork and clone gitwit-server to your computer. In the `gitwit-server` directory, run `yarn link`. Then, in the `react-eval` directory, run `yarn link gitwit-server`.
If you make any changes inside the `sandbox` directory, you need to build a new E2B sandbox template as follows. This requires Docker to be installed and running.
```
cd sandbox
e2b template build --name "your-sandbox-name"
```
Then, change `react-evals` in `evaluate.ts` to your new sandbox name.
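For example, if the template name is stored in a constant (the actual code in `evaluate.ts` may be organized differently), the change might look like this:

```ts
// Illustrative only: point the sandbox template at your new build.
const SANDBOX_TEMPLATE = "your-sandbox-name"; // previously "react-evals"
```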