Description: Website for the paper The Counterfeit Conundrum
While language models are increasingly proficient at code generation, they still frequently generate incorrect programs. Many of these programs are obviously wrong, but others are more subtly flawed and pass weaker correctness checks such as compiling. In this work, we focus on these counterfeit samples: programs sampled from a language model that 1) have a high enough log-probability to be generated at a moderate temperature, 2) are incorrect, and 3) pass weak correctness checks.
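To make this definition concrete, here is a minimal sketch of how counterfeits could be filtered from a batch of model generations. This is not the paper's actual pipeline: `compiles`, `passes_tests`, and `find_counterfeits` are hypothetical helpers, and the weak check here is simply that the program parses.

```python
import ast

def compiles(program: str) -> bool:
    """Weak correctness check: the program parses as valid Python."""
    try:
        ast.parse(program)
        return True
    except SyntaxError:
        return False

def passes_tests(program: str, tests) -> bool:
    """Strong correctness check: the program passes all unit tests."""
    namespace = {}
    try:
        exec(program, namespace)  # assumes trusted, sandboxed execution
        return all(test(namespace) for test in tests)
    except Exception:
        return False

def find_counterfeits(samples, tests):
    """Samples (drawn at moderate temperature) that pass the weak check
    but fail the strong check are counterfeits."""
    return [p for p in samples if compiles(p) and not passes_tests(p, tests)]
```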
Overall, we discover that open-source models have a very shallow understanding of counterfeits. First, models mistakenly classify them as correct. Second, they are worse at reasoning about the execution behaviour of counterfeits and often predict their execution results as if the programs were correct. Third, when asked to fix counterfeits, the likelihood of a model successfully repairing a counterfeit is often even lower than that of sampling a correct program from scratch. Many self-repair and self-verification pipelines implicitly assume that models understand their own incorrect generations, and these findings call that assumption into question.
Caveat: this paper focuses on open-source models, primarily CL-34B and DS-I-33B. We also present limited results on GPT-3.5 and GPT-4, which suggest that GPT-3.5 behaves similarly to the open-source models while GPT-4 has a much better understanding of counterfeits. Nevertheless, we find that GPT-4 still exhibits some of these misunderstandings.