
Commit 2835ecc

data
husseinmozannar committed Apr 3, 2024
1 parent: 4dbb657 · commit: 2835ecc
Showing 3 changed files with 224 additions and 222 deletions.
README.md: 17 changes (9 additions, 8 deletions)
@@ -1,21 +1,20 @@
# The RealHumanEval

- Associated code and data for the paper "The Real HumanEval: Evaluating Large Language Models’ Abilities to Support Programmers".
+ Associated code, data and interface for the paper "The Real HumanEval: Evaluating Large Language Models’ Abilities to Support Programmers".


<img src="./static/fig1.jpg" alt="Overview of RealHumanEval" width="50%"/>


# What is it?

- Coding benchmarks such as HumanEval evaluate the ability of large language models (LLMs) to generate code that passes unit tests.
- While these benchmarks play an important role in evaluating LLMs, they do not necessarily reflect the current use of LLMs to assist programmers.
- We conduct a user study (N=213), the RealHumanEval, to measure the ability of different LLMs to support programmers in their tasks.
- We developed an online web app in which users interacted with one of six different LLMs integrated into an editor through either autocomplete support, akin to GitHub Copilot, or chat support, akin to ChatGPT, in addition to a condition with no LLM assistance.

+ This repository introduces "RealHumanEval", an interface for evaluating humans writing code with large language models (LLMs). Users can interact with LLMs integrated into an editor through either autocomplete support, akin to GitHub Copilot, or chat support, akin to ChatGPT.


+ Using this interface, we ran a user study (N=213) to measure the ability of different LLMs to support programmers in their tasks.
We measure user performance in terms of the speed and number of tasks completed, as well as user satisfaction metrics such as perceived LLM helpfulness.
While we find general correspondence between benchmark performance and user performance (i.e., less performant models tended to slow users down and reduce the number of tasks completed), the gaps in benchmark performance are not proportional to gaps in human performance metrics.
Furthermore, benchmark performance does not translate into user perceptions of helpfulness.
Our study also reveals that the benefits of LLM support for programming may currently be overstated; thus, we caution against over-optimizing for benchmark performance and highlight multiple avenues to improve both autocomplete and chat systems. We also hope that LLM developers will evaluate the coding ability of their models with the help of our RealHumanEval platform.

In this repository, you can find the data from participants' study sessions as well as code to analyze that data.

@@ -27,7 +26,8 @@ In this repository, you can find the data of participants study sessions as well

# Data

- You can find our data on Huggingface hub at [realhumaneval](https://huggingface.co/datasets/hsseinmz/realhumaneval), or for a direct download link, you can use the following link: [link](https://storage.googleapis.com/public-research-data-mozannar/realhumaneval_data.zip).
+ You can find our data on Huggingface hub at [realhumaneval](https://huggingface.co/datasets/hsseinmz/realhumaneval), or, for a direct download, see [./data](./data).
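As a quick start for the Hugging Face route, here is a minimal loading sketch; the dataset name comes from the link above, but whether a subset or config name must be passed is not stated here, so consult the dataset card if the bare call fails.

```python
# Minimal sketch: load the RealHumanEval data from the Hugging Face hub.
# Assumption: no subset/config argument is required; if the hub dataset
# defines named subsets, pass one as the second argument to load_dataset.
from datasets import load_dataset

dataset = load_dataset("hsseinmz/realhumaneval")
print(dataset)
```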


The data released consists of four parts (can also be found in the folder [./data](./data)):

@@ -53,6 +53,7 @@ This repository is organized as follows:

- [data](data) should contain the raw data used for analysis

+ - [interface](interface) (COMING SOON)


# Paper Reproducibility
data/study_data.csv: 429 changes (215 additions, 214 deletions); large diff not rendered.

data/study_data.pkl: binary file modified (contents not shown).
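For a quick local look at the two data files above, a sketch assuming pandas is installed and the repository root is the working directory; the column layout and the pickled object's type are not documented in this commit, so only shapes and types are printed.

```python
# Sketch: inspect the updated study data files in a local checkout.
# Assumes pandas is installed and the paths below exist.
import pandas as pd

study_csv = pd.read_csv("data/study_data.csv")
print("study_data.csv shape:", study_csv.shape)

# read_pickle returns whatever object was serialized (often a DataFrame,
# but that is not confirmed by this commit), so only report its type.
study_pkl = pd.read_pickle("data/study_data.pkl")
print("study_data.pkl type:", type(study_pkl))
```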
