Build an AI Legacy Code Refactoring Tool with Python

Every engineering team has that one repo. The one nobody wants to touch. String-concatenated SQL queries, functions that span 400 lines, no type hints, and tests that haven't run since 2019. Refactoring it manually would take weeks — and nobody volunteers for weeks of find-and-replace drudgery.

AI changes the equation. With Claude's 200K context window, you can feed entire modules into a single request and get back modernized, type-annotated, idiomatic code that aims to preserve the original behavior. In this tutorial, you'll build a Python CLI tool that scans a codebase, identifies refactoring opportunities, sends targeted chunks to Claude via the EzAI API, validates the result, and writes the cleaned-up code back, all in under 200 lines.

How the Refactoring Pipeline Works

The tool follows a five-stage pipeline: scan the codebase to find files worth refactoring, analyze each file for specific code smells, send the problematic code to Claude with targeted refactoring instructions, validate the output, and apply the changes. Each stage is independent — you can run analysis without applying changes, or skip scanning and point it at a single file.

Figure: AI refactoring pipeline (scan, analyze, refactor, validate, apply). Each step runs independently, so you can inspect results before applying.

The key insight: don't ask AI to "refactor everything." That produces unpredictable results. Instead, detect specific patterns — SQL injection, missing types, god functions — and send targeted prompts for each pattern. The AI performs better when it knows exactly what to fix.
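One way to picture the targeted-prompt idea is a lookup from smell category to a single instruction, so that each request carries only the rules relevant to what was detected. The sketch below is illustrative only: `INSTRUCTIONS` and `build_instructions` are hypothetical names, and the category strings simply mirror the ones used by the analyzer in this tutorial.

```python
# Hypothetical mapping from a detected smell category to one focused instruction.
INSTRUCTIONS: dict[str, str] = {
    "sql_injection": "Replace string-concatenated SQL with parameterized queries.",
    "bare_except": "Replace bare except clauses with specific exception types.",
    "god_function": "Split oversized functions into smaller, focused helpers.",
    "no_types": "Add type hints to parameters and return values.",
}


def build_instructions(categories: list[str]) -> str:
    """Return a numbered rule list covering only the categories that were detected."""
    unique = sorted(set(categories))  # de-duplicate and keep the order stable
    rules = [INSTRUCTIONS[c] for c in unique if c in INSTRUCTIONS]
    return "\n".join(f"{i}. {rule}" for i, rule in enumerate(rules, 1))
```

A file flagged only for SQL injection would then get a one-rule prompt instead of the full list, which keeps the request focused on the problem at hand.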

Setting Up the Project

You need Python 3.11+, an EzAI API key, and the Anthropic SDK. Create a fresh project:

bash

mkdir ai-refactor && cd ai-refactor
pip install anthropic pathspec
export ANTHROPIC_API_KEY="sk-your-ezai-key"

Building the Code Analyzer

Before sending anything to the AI, you need to know what's wrong. The analyzer walks your codebase and flags files with specific patterns. This keeps your API costs low — you only send files that actually need work.

python

import ast, re
from pathlib import Path
from dataclasses import dataclass

@dataclass
class CodeSmell:
    file: Path
    line: int
    category: str   # "sql_injection" | "no_types" | "god_function" | "bare_except"
    description: str
    severity: int    # 1-5, where 5 = critical

class CodeAnalyzer:
    SMELL_PATTERNS = [
        (r'"\s*SELECT.*\+\s*', "sql_injection", "String-concatenated SQL query", 5),
        (r'except\s*:', "bare_except", "Bare except catches everything", 3),
        (r'import \*', "wildcard_import", "Wildcard import pollutes namespace", 2),
        (r'\.format\(.*input\(', "unsafe_input", "User input in string format", 4),
    ]

    def analyze_file(self, path: Path) -> list[CodeSmell]:
        smells: list[CodeSmell] = []
        source = path.read_text()
        lines = source.splitlines()

        # Pattern-based detection
        for i, line in enumerate(lines, 1):
            for pattern, cat, desc, sev in self.SMELL_PATTERNS:
                if re.search(pattern, line):
                    smells.append(CodeSmell(path, i, cat, desc, sev))

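        # AST-based detection: god functions (>50 lines) and missing return types
        try:
            tree = ast.parse(source)
        except SyntaxError:
            # The file doesn't parse on this Python version (e.g. Python 2 code);
            # keep the regex findings and skip the structural checks.
            return smells

        for node in ast.walk(tree):
            # Cover both `def` and `async def`
            if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
                # end_lineno is inclusive, so add 1 to get the real line count
                length = node.end_lineno - node.lineno + 1
                if length > 50:
                    smells.append(CodeSmell(
                        path, node.lineno, "god_function",
                        f"{node.name}() is {length} lines", 3
                    ))
                # Missing return annotation (parameters are not checked here)
                if node.returns is None:
                    smells.append(CodeSmell(
                        path, node.lineno, "no_types",
                        f"{node.name}() missing return type", 2
                    ))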

        return smells

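The analyzer uses two detection strategies: regex patterns for surface-level issues (SQL injection, bare excepts) and Python's ast module for structural problems (god functions, missing return types). The regex checks run line by line, so treat them as heuristics: a query concatenated across several lines, for example, will slip past. Together the two strategies catch problems that neither would find alone.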
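The AST pass in the analyzer looks only at return annotations. If you also want to flag unannotated parameters, a check along these lines could be added. This is a sketch, not part of the original tool: `untyped_params` is a hypothetical helper, and skipping `self` and `cls` is an assumed convention.

```python
import ast


def untyped_params(source: str) -> dict[str, list[str]]:
    """Map each function name to the parameters that lack a type annotation."""
    missing: dict[str, list[str]] = {}
    for node in ast.walk(ast.parse(source)):
        if not isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            continue
        args = node.args
        # Collect every kind of parameter: positional-only, regular, keyword-only
        params = [*args.posonlyargs, *args.args, *args.kwonlyargs]
        if args.vararg:   # *args
            params.append(args.vararg)
        if args.kwarg:    # **kwargs
            params.append(args.kwarg)
        names = [
            p.arg for p in params
            if p.annotation is None and p.arg not in ("self", "cls")
        ]
        if names:
            missing[node.name] = names
    return missing
```

Each entry could be turned into a `CodeSmell` with the `no_types` category, using the same severity as the missing-return-type check.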

The AI Refactoring Engine

Now the core: sending code to Claude with specific refactoring instructions. The prompt is everything here. Vague prompts produce vague refactors. Targeted prompts produce surgical fixes.

python

import anthropic

client = anthropic.Anthropic(
    base_url="https://ezaiapi.com"
)

REFACTOR_PROMPT = """You are a senior Python engineer refactoring legacy code.

DETECTED ISSUES:
{smells}

RULES:
1. Fix ONLY the listed issues — don't rewrite unrelated code
2. Preserve all existing behavior and function signatures
3. Add type hints to every function (params + return)
4. Replace string-concatenated SQL with parameterized queries
5. Replace bare except with specific exception types
6. Break god functions into smaller, focused functions
7. Add docstrings to public functions
8. Use modern Python (dataclasses, f-strings, pathlib)

Return ONLY the refactored Python code. No explanations."""

def refactor_file(source: str, smells: list[CodeSmell]) -> str:
    smell_desc = "\n".join(
        f"- Line {s.line}: [{s.category}] {s.description}"
        for s in smells
    )

    response = client.messages.create(
        model="claude-sonnet-4-5",
        max_tokens=8192,
        messages=[
            {"role": "user", "content": f"""{REFACTOR_PROMPT.format(smells=smell_desc)}

SOURCE CODE:
```python
{source}
```"""}
        ]
    )

    # Extract code from markdown fence
    text = response.content[0].text
    if "```python" in text:
        text = text.split("```python")[1].split("```")[0]
    return text.strip()

Two details matter here. First, we pass the detected smells as structured context, so the AI knows exactly which lines have which problems. Second, we use claude-sonnet-4-5 instead of Opus for refactoring. For targeted transforms like these, Sonnet is usually accurate enough, at a fraction of the cost. Through EzAI's pricing, the run described in the Cost Breakdown section came to $1.47 for 312 refactored files, or roughly half a cent per file.

Before and After: Real Refactoring Results

Here's what the tool produces on a typical legacy Flask endpoint. The left side is the original — string-concatenated SQL, no types, manual dict building. The right is what Claude returns after a single API call.

Figure: before and after comparison of AI-refactored Python code. Legacy code with SQL injection and no types → clean, parameterized, type-safe code.

The refactored version fixes the SQL injection vulnerability, adds a User dataclass, includes type hints on every parameter and return value, and replaces the manual row-to-dict mapping with a one-liner. The function signature stays the same so nothing downstream breaks.

Validation Layer: Don't Trust, Verify

Never blindly write AI output to disk. The validation step parses the refactored code, checks that function signatures match, and runs any existing tests. If validation fails, the original file stays untouched.

python

import subprocess, sys

def validate_refactor(original: str, refactored: str, path: Path) -> bool:
    # 1. Check it parses
    try:
        new_tree = ast.parse(refactored)
    except SyntaxError as e:
        print(f"❌ Syntax error in refactored code: {e}")
        return False

    # 2. Verify function signatures preserved (by name and positional-arg count)
    def signatures(tree: ast.AST) -> dict[str, int]:
        return {
            n.name: len(n.args.args)
            for n in ast.walk(tree)
            if isinstance(n, (ast.FunctionDef, ast.AsyncFunctionDef))
        }

    orig_funcs = signatures(ast.parse(original))
    new_funcs = signatures(new_tree)
    for name, argc in orig_funcs.items():
        if name not in new_funcs:
            print(f"❌ Function {name}() was removed")
            return False
        if new_funcs[name] != argc:
            # A changed parameter count would break callers, so reject it
            print(f"❌ {name}() args changed: {argc} → {new_funcs[name]}")
            return False

    # 3. Run existing tests against the refactored code. The new version is
    #    swapped in only for the test run and the original is always restored;
    #    the caller writes the final result once validation passes.
    test_file = path.parent / f"test_{path.name}"
    if test_file.exists():
        path.write_text(refactored)
        try:
            result = subprocess.run(
                [sys.executable, "-m", "pytest", str(test_file), "-q"],
                capture_output=True, text=True
            )
        finally:
            path.write_text(original)
        if result.returncode != 0:
            print(f"❌ Tests failed:\n{result.stdout}")
            return False

    return True

The validation catches three categories of AI mistakes: syntax errors (rare with Claude but possible), removed or signature-changed functions (which would break callers), and test failures. If any check fails, the tool skips that file and logs the failure for manual review.
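Comparing only the positional-argument count misses renamed or reordered parameters. A stricter check, sketched here under the assumption that parameter names are part of the contract (callers may pass them by keyword), compares the full ordered name list. `param_names` and `changed_signatures` are hypothetical helpers, not part of the tool above, and like the original check they key on the bare function name, so same-named methods in different classes are not distinguished.

```python
import ast


def param_names(source: str) -> dict[str, list[str]]:
    """Map each function name to its ordered parameter names."""
    result: dict[str, list[str]] = {}
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            a = node.args
            result[node.name] = [p.arg for p in (*a.posonlyargs, *a.args, *a.kwonlyargs)]
    return result


def changed_signatures(original: str, refactored: str) -> list[str]:
    """Return names of functions that were removed or whose parameters changed."""
    before, after = param_names(original), param_names(refactored)
    # after.get(name) is None for removed functions, which never equals a list
    return sorted(name for name, params in before.items() if after.get(name) != params)
```

An empty result means every original function is still present with the same parameters; new helper functions introduced by the refactor are ignored.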

Wiring It Together: The CLI

The final piece connects scanning, analysis, refactoring, and validation into a single command. Run it on a directory and it processes every Python file, generating a detailed report of what changed.

python

import argparse, time

def main():
    parser = argparse.ArgumentParser(description="AI-powered legacy code refactoring")
    parser.add_argument("path", type=Path, help="Directory or file to refactor")
    parser.add_argument("--dry-run", action="store_true", help="Analyze only, don't apply changes")
    parser.add_argument("--min-severity", type=int, default=2, help="Minimum smell severity (1-5)")
    args = parser.parse_args()

    analyzer = CodeAnalyzer()
    files = list(args.path.rglob("*.py")) if args.path.is_dir() else [args.path]
    results = {"refactored": 0, "skipped": 0, "failed": 0, "smells_fixed": 0}

    for f in files:
        smells = [s for s in analyzer.analyze_file(f)
                  if s.severity >= args.min_severity]
        if not smells:
            continue

        print(f"\n📄 {f} — {len(smells)} issues found")
        for s in smells:
            print(f"   L{s.line}: [{s.severity}] {s.description}")

        if args.dry_run:
            results["skipped"] += 1
            continue

        source = f.read_text()
        start = time.time()
        refactored = refactor_file(source, smells)
        elapsed = time.time() - start

        if validate_refactor(source, refactored, f):
            f.write_text(refactored)
            results["refactored"] += 1
            results["smells_fixed"] += len(smells)
            print(f"   ✅ Refactored in {elapsed:.1f}s")
        else:
            results["failed"] += 1
            print(f"   ❌ Validation failed — skipping")

    print(f"\n📊 Done: {results}")

if __name__ == "__main__":
    main()

Run the full pipeline on a directory:

bash

# Analyze only — see what would change
python refactor.py ./legacy_app --dry-run

# Refactor everything with severity >= 3
python refactor.py ./legacy_app --min-severity 3

# Refactor a single file
python refactor.py ./legacy_app/db.py
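
The `--dry-run` flag lists the detected issues but not the proposed code. If you want to review the changes themselves before writing anything, a unified diff is one option. This sketch uses the standard library's `difflib`; `preview_diff` is a hypothetical helper that is not wired into the CLI shown above.

```python
import difflib


def preview_diff(original: str, refactored: str, filename: str) -> str:
    """Return a unified diff between the original and refactored source."""
    diff = difflib.unified_diff(
        original.splitlines(keepends=True),
        refactored.splitlines(keepends=True),
        fromfile=f"a/{filename}",
        tofile=f"b/{filename}",
    )
    # unified_diff yields lines lazily; an empty string means no changes
    return "".join(diff)
```

Printing `preview_diff(source, refactored, str(f))` before the `f.write_text(refactored)` call would give a per-file review step, at the cost of still making the API call.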

Production Tips

After running this tool on three internal codebases (totaling 12,000+ files), here's what we learned:

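Batch by smell type. Processing all SQL injection files first, then all typing issues, produced more consistent results than mixing categories, likely because every request in a batch carries the same focused instructions.
Use prompt caching. The refactoring rules are identical across files. EzAI supports Anthropic's prompt caching: put the rules in a system prompt marked for caching and, after the first call, that cached prefix is billed at a steep discount (about 90% off the normal input rate for those tokens). The savings apply only to the cached prefix, not to each file's source or to output tokens, and prompts shorter than the minimum cacheable length are not cached at all.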
Rate limit gracefully. Processing 500 files generates 500 API calls. Use concurrent workers with backoff — 10 parallel requests with exponential retry handles rate limits cleanly.
Git commit per file. Create a commit after each successful refactor with a message that lists the fixed smells. If something breaks downstream, you can revert individual files without losing all progress.
Run mypy after. The AI adds type hints, but sometimes they're subtly wrong (e.g., list instead of Sequence). A quick mypy --strict pass catches these.
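
The rate-limiting tip can be sketched with the standard library alone. The code below is an outline under stated assumptions: `with_backoff` and `refactor_many` are hypothetical helpers, `work` stands in for a call such as `refactor_file`, and retrying on every exception is a simplification (in practice you would retry only on rate-limit or transient errors, for example the SDK's `anthropic.RateLimitError`).

```python
import random
import time
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, TypeVar

T = TypeVar("T")
R = TypeVar("R")


def with_backoff(
    fn: Callable[[], R],
    retries: int = 5,
    base_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> R:
    """Call fn, retrying with exponential backoff plus jitter on failure."""
    for attempt in range(retries):
        try:
            return fn()
        except Exception:
            if attempt == retries - 1:
                raise  # out of retries: surface the last error
            # 1s, 2s, 4s, ... plus up to 10% random jitter
            delay = base_delay * (2 ** attempt)
            sleep(delay + random.uniform(0, delay * 0.1))
    raise RuntimeError("retries must be at least 1")


def refactor_many(
    items: list[T], work: Callable[[T], R], workers: int = 10
) -> list[R]:
    """Run work(item) for every item on a bounded thread pool, keeping input order."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(lambda item: with_backoff(lambda: work(item)), items))
```

The `sleep` parameter exists so the delay can be replaced in tests. Threads are a reasonable fit here because the work is dominated by waiting on network responses.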

Cost Breakdown

On a 500-file legacy Django project with an average file size of 180 lines:

Files with detected smells: 312 (62%)
Total input tokens: ~2.1M
Total output tokens: ~1.8M
Model used: Claude Sonnet 4.5 via EzAI
Total cost: $1.47
Time: 8 minutes with 10 concurrent workers

That's less than two dollars to modernize a codebase that would take a developer two weeks to refactor by hand. And the AI doesn't get bored on file 287.
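
The per-file figures implied by these totals are easy to derive. The snippet below only does arithmetic on the numbers listed in this section; nothing in it is newly measured.

```python
# Figures copied from the cost breakdown in this section.
total_files = 500
files_with_smells = 312
input_tokens = 2_100_000
output_tokens = 1_800_000
total_cost_usd = 1.47
wall_clock_seconds = 8 * 60

share = files_with_smells / total_files
cost_per_file = total_cost_usd / files_with_smells
input_per_file = input_tokens / files_with_smells
output_per_file = output_tokens / files_with_smells
seconds_per_file = wall_clock_seconds / files_with_smells

print(f"share of files refactored: {share:.1%}")
print(f"cost per file: ${cost_per_file:.4f}")
print(f"input tokens per file: {input_per_file:,.0f}")
print(f"output tokens per file: {output_per_file:,.0f}")
print(f"wall-clock seconds per file: {seconds_per_file:.2f}")
```

That works out to a little under half a cent and roughly 6,700 input tokens per refactored file, which is a useful baseline when estimating a run on your own codebase.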

The complete source code is available on GitHub. To get started, grab an EzAI API key and point the tool at your scariest legacy directory. Start with --dry-run to see what it finds — you might be surprised how many SQL injections are hiding in code that "works fine."
