Refactoring Python Code with LLM-Based Multi-Agent Systems: An Empirical Study in ML Software Projects
Refactoring is essential for improving software maintainability, yet it often remains a validation-intensive, developer-guided task, particularly in Python projects shaped by fast-paced experimentation and iterative workflows, as is common in the machine learning (ML) domain. Recent advances in large language models (LLMs) have opened new possibilities for automating refactoring, but many existing approaches rely on single-model prompting and lack structured coordination or task specialization. This study presents an empirical evaluation of a modular LLM-based multi-agent system (LLM-MAS), orchestrated through the MetaGPT framework, which enables sequential coordination and reproducible communication among specialized agents for static analysis, refactoring strategy planning, and code transformation. The system was applied to 1,719 Python files drawn from open-source ML repositories, and its outputs were compared against both the original and the human-refactored versions using eight static metrics related to complexity, modularity, and code size. Results show that the system consistently produces more compact and modular code, with measurable reductions in function length and structural complexity. However, the absence of a validation agent led to 281 syntactically invalid outputs, reinforcing the importance of incorporating semantic and syntactic verification to ensure transformation correctness and build trust in automated refactoring. These findings highlight the potential of LLM-based multi-agent systems to automate structural code improvements and establish a foundation for future domain-aware refactoring in ML software.
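
To make the described pipeline concrete, the following is a minimal, illustrative sketch of a sequential three-agent refactoring flow (static analysis, strategy planning, code transformation) followed by the kind of syntactic validation step the study identifies as missing. It is not the authors' MetaGPT implementation: the agent classes, the `query_llm` helper, and the prompts are hypothetical placeholders, and only Python's standard `ast` module is used for analysis and validation.

```python
# Illustrative sketch of a sequential LLM-MAS refactoring pipeline (assumed design,
# not the authors' MetaGPT code). query_llm() is a hypothetical backend call.
import ast
from dataclasses import dataclass


def query_llm(prompt: str) -> str:
    """Placeholder for a call to whichever LLM backend the system uses."""
    raise NotImplementedError


@dataclass
class Message:
    sender: str
    content: str


class StaticAnalysisAgent:
    def run(self, source: str) -> Message:
        # Collect simple structural facts that downstream agents can reason over.
        tree = ast.parse(source)
        funcs = [n for n in ast.walk(tree) if isinstance(n, ast.FunctionDef)]
        longest = max((n.end_lineno - n.lineno + 1 for n in funcs), default=0)
        return Message("analysis", f"{len(funcs)} functions; longest spans {longest} lines")


class PlanningAgent:
    def run(self, source: str, analysis: Message) -> Message:
        plan = query_llm(
            "Propose a refactoring strategy for this Python file.\n"
            f"Static analysis: {analysis.content}\n\n{source}"
        )
        return Message("plan", plan)


class TransformationAgent:
    def run(self, source: str, plan: Message) -> Message:
        code = query_llm(
            f"Apply this refactoring plan and return only code.\n{plan.content}\n\n{source}"
        )
        return Message("code", code)


def validate_syntax(code: str) -> bool:
    """The verification step the study found missing: reject outputs that do not parse."""
    try:
        ast.parse(code)
        return True
    except SyntaxError:
        return False


def refactor(source: str) -> str:
    # Sequential hand-off: analysis -> plan -> transformation -> validation.
    analysis = StaticAnalysisAgent().run(source)
    plan = PlanningAgent().run(source, analysis)
    result = TransformationAgent().run(source, plan)
    # Fall back to the original file if the transformed output is syntactically invalid,
    # avoiding the 281 invalid outputs reported in the study.
    return result.content if validate_syntax(result.content) else source
```

In this sketch, the validation check simply discards invalid transformations; a dedicated validation agent could instead feed the parse error back to the transformation agent for another attempt.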