Result: Analyzing code generation by AI models: an ANOVA-based study of quality, consistency, and composite ranking
https://louis.uah.edu/context/uah-theses/article/1771/viewcontent/rajavi_11161.pdf
https://louis.uah.edu/context/uah-theses/article/1771/filename/0/type/additional/viewcontent/Anova_Analysis_11161.ipynb
https://louis.uah.edu/context/uah-theses/article/1771/filename/1/type/additional/viewcontent/DataSheet_Thesis_11161.csv
Further information
As generative Artificial Intelligence (AI) tools become increasingly common in software development, there is a growing need to understand how well these tools perform beyond just producing code that runs. This thesis examines the performance of four popular generative AI models, ChatGPT (GPT-4 mini), GitHub Copilot, Code LLaMA 3.3, and DeepSeek Web, in generating code that is not only functionally correct but also efficient and maintainable. To do this, we tested each model on six real-world-style coding problems sourced from LeetCode, covering a range of algorithmic challenges such as dynamic programming, graph traversal, and array manipulation. Using a consistent prompting strategy, we collected Python code samples from each model and evaluated them with established software engineering metrics: Lines of Code, Cyclomatic Complexity, Halstead Complexity, and the Maintainability Index. We then applied a detailed statistical analysis, including ANOVA, post hoc testing, and nonparametric methods, to determine which models performed best most consistently. Our results show that the type of problem has the largest impact on the complexity and length of the generated code, but the AI model itself strongly influences how maintainable that code is. LLaMA produced the most maintainable code across the board, while GitHub Copilot often generated more complex, harder-to-maintain solutions. ChatGPT and DeepSeek showed similar and generally solid performance, landing somewhere in the middle. This research goes beyond simple pass/fail benchmarks and provides a clearer, more nuanced picture of how generative AI tools behave on practical programming tasks. Developers, educators, and tool makers can use these findings to choose the right AI assistant for their needs and to better understand where these models shine and where they still fall short.
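
The metric extraction and statistical comparison described above can be sketched in Python. This is a minimal illustration under stated assumptions, not the analysis from Anova_Analysis_11161.ipynb: the radon library is used here as one common way to compute the four metrics, and the column names "model" and "maintainability_index" are hypothetical placeholders for the structure of DataSheet_Thesis_11161.csv.

# Sketch: compute the four metrics for one generated solution with radon,
# then compare a metric across models with ANOVA, Tukey HSD, and Kruskal-Wallis.
# The radon usage and the CSV column names ("model", "maintainability_index")
# are assumptions, not taken from the thesis materials.
import pandas as pd
from scipy import stats
from statsmodels.stats.multicomp import pairwise_tukeyhsd
from radon.raw import analyze
from radon.complexity import cc_visit
from radon.metrics import mi_visit, h_visit

def code_metrics(source: str) -> dict:
    """Return LOC, Cyclomatic Complexity, Halstead volume, and Maintainability Index."""
    return {
        "loc": analyze(source).loc,
        "cyclomatic_complexity": sum(block.complexity for block in cc_visit(source)),
        "halstead_volume": h_visit(source).total.volume,
        "maintainability_index": mi_visit(source, multi=True),
    }

# One row per (model, problem) sample, e.g. loaded from the accompanying CSV.
df = pd.read_csv("DataSheet_Thesis_11161.csv")

# Group a metric's values by AI model (ChatGPT, Copilot, LLaMA, DeepSeek).
groups = [g["maintainability_index"].to_numpy() for _, g in df.groupby("model")]

# One-way ANOVA: does the choice of model explain variance in the metric?
f_stat, p_anova = stats.f_oneway(*groups)
print(f"ANOVA: F = {f_stat:.3f}, p = {p_anova:.4f}")

# Kruskal-Wallis as the nonparametric counterpart (no normality assumption).
h_stat, p_kw = stats.kruskal(*groups)
print(f"Kruskal-Wallis: H = {h_stat:.3f}, p = {p_kw:.4f}")

# Tukey HSD post hoc test: which pairs of models differ significantly?
print(pairwise_tukeyhsd(df["maintainability_index"], df["model"]).summary())

Swapping "maintainability_index" for another metric column would repeat the same comparison for Lines of Code, Cyclomatic Complexity, or the Halstead measures.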