Artifact for HinglishEval: Evaluating the Effectiveness of Code-generation Models on Hinglish Prompts.
We used GPT-4 to translate the prompts in OpenAI's HumanEval benchmark to Hinglish and manually verified and corrected these translations. Our benchmark, HinglishEval, is available as a JSON file (HinglishEval.json) in this repository.
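The benchmark can be loaded with standard JSON tooling. The sketch below is a minimal example that assumes the file contains a list of problems with HumanEval-style fields such as task_id and prompt; adjust it to the actual schema of HinglishEval.json.

```python
import json

# Minimal loading sketch; the exact schema of HinglishEval.json may differ
# (e.g. a dict keyed by task_id instead of a list of problem objects).
with open("HinglishEval.json", encoding="utf-8") as f:
    problems = json.load(f)

for problem in problems[:3]:
    # Assumed HumanEval-style fields: "task_id" and "prompt", where the
    # prompt holds the function signature plus the Hinglish docstring.
    print(problem["task_id"])
    print(problem["prompt"])
```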
Hindi is one of the most widely spoken languages in the world, and the most widely spoken language in India. A majority of the population in India does not speak English as their first language, so language models that can understand prompts in native languages are important for wider accessibility. Hinglish is a blend of Hindi and English, with frequent use of English words in sentences that follow standard Hindi grammar. This is not representative of everyday spoken Hindi for most people, but it is common in conversations involving technical language, especially in the context of programming.
It is therefore most natural for Hindi-speaking users to prompt LLMs in Hinglish when they want to generate code or ask for help with programming in general (such as explanations or debugging). This benchmark is an attempt to understand how well LLMs can understand and generate code when prompted in such a language.
The HinglishEval benchmark contains all the problems in the HumanEval benchmark, with their prompts translated to Hinglish. The translation does not modify function signatures or doctests, and is limited to the purpose statement (supplied as a docstring in Python) of each function. The translations were manually verified and corrected to ensure that they sound like idiomatic Hinglish.
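For illustration, a translated prompt looks roughly like the following. This is a hypothetical rendering of HumanEval/0 (has_close_elements), not the exact text in HinglishEval.json; the signature and doctests are left untouched and only the purpose statement is in Hinglish.

```python
from typing import List


def has_close_elements(numbers: List[float], threshold: float) -> bool:
    """ Check karo ki diye gaye list of numbers mein koi do numbers
    ek doosre ke itne kareeb hain ki unka difference threshold se kam ho.
    >>> has_close_elements([1.0, 2.0, 3.0], 0.5)
    False
    >>> has_close_elements([1.0, 2.8, 3.0, 4.0, 5.0, 2.0], 0.3)
    True
    """
```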
We have publicly released completions generated by 18 models on the prompts in the HinglishEval benchmark.
We evaluate models on the HinglishEval dataset using the pass@1 metric as well as Item Response Theory (IRT).
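For reference, pass@1 is typically computed with the unbiased pass@k estimator of Chen et al. (2021). The sketch below is a generic reference implementation of that estimator (for k=1 it reduces to the fraction of passing samples per problem), not necessarily the exact evaluation script used for these results.

```python
import numpy as np

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator (Chen et al., 2021).

    n: total completions sampled for a problem
    c: completions that pass all tests
    k: evaluation budget (k=1 for pass@1)
    """
    if n - c < k:
        return 1.0
    return 1.0 - float(np.prod(1.0 - k / np.arange(n - c + 1, n + 1)))

# Example: 20 samples with 7 passing -> pass@1 = 7/20 = 0.35
print(pass_at_k(n=20, c=7, k=1))
```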