Serpent — Python Specialist LoRA

Adapter codename: serpent
Agent: build-python
Base model: Qwen/Qwen3.5-27B

Objective

Teach the model to generate idiomatic Python that follows the build-python agent's system prompt. The adapter should internalize:

  • Type hints on all function signatures (params + return)
  • pathlib.Path over os.path
  • uv for dependency management, pyproject.toml as source of truth
  • Pydantic v2 patterns (.model_dump(), not .dict())
  • pytest + pytest-asyncio for testing
  • ruff for lint + format
  • ruff → pytest → mypy verification cycle
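
As a rough illustration of the target style, the sketch below combines two of the conventions listed above: full type hints on the signature and pathlib.Path instead of os.path. The function name `find_python_sources` and the excluded directory names are hypothetical and are not taken from the agent's system prompt.

```python
from pathlib import Path

# Hypothetical default: directory names that should never be scanned.
DEFAULT_EXCLUDES: frozenset[str] = frozenset({".venv", "__pycache__"})


def find_python_sources(
    root: Path, exclude_dirs: frozenset[str] = DEFAULT_EXCLUDES
) -> list[Path]:
    """Return all .py files under root, skipping excluded directory names."""
    return sorted(
        path
        for path in root.rglob("*.py")
        # Compare only the path components below root against the excludes.
        if not exclude_dirs.intersection(path.relative_to(root).parts)
    )
```

The same function written with os.walk and os.path.join would be the kind of output the adapter is meant to steer away from.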

Data Sources

Session extraction (~100–200 examples)

Classify opencode build sessions as Python-related by:

  • File paths: .py, pyproject.toml, requirements.txt, setup.py
  • Bash commands: python, pip, uv, pytest, ruff, mypy
  • Error patterns: SyntaxError, TypeError, ImportError, traceback format
  • Framework detection: FastAPI, Django, Flask imports
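
The four signals above could be combined into a simple classifier along these lines. This is only a sketch: the `Session` shape, the signal names, and the two-signal threshold are assumptions made for illustration, not the actual extraction pipeline.

```python
import re
from dataclasses import dataclass, field
from pathlib import PurePosixPath

PYTHON_FILE_NAMES = frozenset({"pyproject.toml", "requirements.txt", "setup.py"})
PYTHON_COMMANDS = frozenset({"python", "pip", "uv", "pytest", "ruff", "mypy"})
ERROR_PATTERN = re.compile(
    r"\b(SyntaxError|TypeError|ImportError)\b|Traceback \(most recent call last\)"
)
FRAMEWORK_PATTERN = re.compile(
    r"^\s*(?:from|import)\s+(fastapi|django|flask)\b", re.MULTILINE
)


@dataclass
class Session:
    """Hypothetical flattened view of one opencode build session."""

    file_paths: list[str] = field(default_factory=list)
    bash_commands: list[str] = field(default_factory=list)
    text: str = ""  # concatenated file contents and tool output


def python_signals(session: Session) -> set[str]:
    """Return the names of the Python-related signals present in a session."""
    signals: set[str] = set()
    for raw_path in session.file_paths:
        path = PurePosixPath(raw_path)
        if path.suffix == ".py" or path.name in PYTHON_FILE_NAMES:
            signals.add("file_path")
    for command in session.bash_commands:
        # Crude token match: also catches "uv run pytest" or "cd x && ruff check".
        if any(word in PYTHON_COMMANDS for word in command.split()):
            signals.add("bash_command")
    if ERROR_PATTERN.search(session.text):
        signals.add("error_pattern")
    if FRAMEWORK_PATTERN.search(session.text):
        signals.add("framework")
    return signals


def is_python_session(session: Session, min_signals: int = 2) -> bool:
    """Assumed rule: require at least min_signals distinct signal types."""
    return len(python_signals(session)) >= min_signals
```

Requiring more than one signal is a design choice meant to avoid picking up sessions that merely mention a Python tool in passing; the right threshold would need to be tuned against real sessions.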

Git repo mining (~50–100 examples)

Target repos:

  • madcat-os/lora — training scripts (this repo)
  • Automation scripts in ~/.config/opencode/scripts/
  • Any Python projects in madcat-os org

Extract patterns:

  • Data processing scripts (JSONL, datasets)
  • ML training pipelines (unsloth, torch, datasets)
  • CLI tools and automation
  • FastAPI endpoint implementations

Synthetic (~50–100 examples)

  • async/await with asyncio patterns
  • Pydantic v2 model definitions with validators
  • pathlib file operations
  • Type-narrowing with TypeGuard, assert_never
  • Dataclass vs Pydantic decision points
  • uv project setup and dependency resolution

Example Format

{
  "messages": [
    {
      "role": "system",
      "content": "<build-python agent system prompt>"
    },
    {
      "role": "user",
      "content": "Add a CLI argument to specify output format (json/csv) in the extraction script"
    },
    {
      "role": "assistant",
      "content": null,
      "tool_calls": [
        {
          "type": "function",
          "function": {
            "name": "read",
            "arguments": {"filePath": "extract.py"}
          }
        }
      ]
    },
    {
      "role": "tool",
      "content": "<existing script>"
    },
    {
      "role": "assistant",
      "content": null,
      "tool_calls": [
        {
          "type": "function",
          "function": {
            "name": "edit",
            "arguments": {
              "filePath": "extract.py",
              "oldString": "def main():",
              "newString": "def main(output_format: Literal[\"json\", \"csv\"] = \"json\") -> None:"
            }
          }
        }
      ]
    },
    {
      "role": "tool",
      "content": "Edit applied"
    },
    {
      "role": "assistant",
      "content": null,
      "tool_calls": [
        {
          "type": "function",
          "function": {
            "name": "bash",
            "arguments": {"command": "ruff check extract.py && ruff format --check extract.py"}
          }
        }
      ]
    },
    {
      "role": "tool",
      "content": "All checks passed"
    },
    {
      "role": "assistant",
      "content": "Added `--format` CLI arg via `argparse`. Type-hinted with `Literal[\"json\", \"csv\"]`. Ruff clean."
    }
  ]
}
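
Before training, each example in this format could be given a structural sanity check. The sketch below is one possible validator; the specific rules (system prompt first, tool results only after a tool call, final assistant turn with text) are inferred from the sample above and are assumptions rather than a documented schema.

```python
import json
from typing import Any

ALLOWED_ROLES = frozenset({"system", "user", "assistant", "tool"})


def validate_example(example: dict[str, Any]) -> list[str]:
    """Return a list of problems found in one training example (empty if none)."""
    messages = example.get("messages")
    if not isinstance(messages, list) or not messages:
        return ["'messages' must be a non-empty list"]

    problems: list[str] = []
    awaiting_tool_results = False
    for index, message in enumerate(messages):
        role = message.get("role")
        tool_calls = message.get("tool_calls") or []
        if role not in ALLOWED_ROLES:
            problems.append(f"message {index}: unknown role {role!r}")
        if role == "tool":
            if not awaiting_tool_results:
                problems.append(f"message {index}: tool result without a tool call")
        else:
            # Several tool results may follow one assistant turn with tool calls.
            awaiting_tool_results = role == "assistant" and bool(tool_calls)
        if role == "assistant" and message.get("content") is None and not tool_calls:
            problems.append(f"message {index}: empty assistant turn")

    if messages[0].get("role") != "system":
        problems.append("first message should be the system prompt")
    last = messages[-1]
    if last.get("role") != "assistant" or not last.get("content"):
        problems.append("last message should be an assistant turn with text")
    return problems


def validate_jsonl_line(line: str) -> list[str]:
    """Parse one JSONL line and validate it as a training example."""
    try:
        example = json.loads(line)
    except json.JSONDecodeError as exc:
        return [f"invalid JSON: {exc.msg}"]
    if not isinstance(example, dict):
        return ["top-level value must be an object"]
    return validate_example(example)
```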

Evaluation Criteria

  1. All functions have type hints (params and return)
  2. Uses pathlib.Path, not os.path
  3. ruff check and ruff format --check pass
  4. pytest tests pass
  5. Pydantic v2 patterns (no v1 .dict(), .json())
  6. No requirements.txt — uses pyproject.toml + uv
  7. Tool call sequence: read → edit → lint → test
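
Criteria 1 and 7 lend themselves to automated checks. The following is a minimal sketch of how they might be scored, assuming generated code is available as a source string and that each tool call has already been labelled as `read`, `edit`, `lint`, or `test` by an upstream step that is not shown; both function names are hypothetical.

```python
import ast

EXPECTED_ORDER = ("read", "edit", "lint", "test")


def functions_missing_hints(source: str) -> list[str]:
    """Return names of functions lacking a parameter or return annotation."""
    missing: list[str] = []
    for node in ast.walk(ast.parse(source)):
        if not isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            continue
        args = node.args
        params = [*args.posonlyargs, *args.args, *args.kwonlyargs]
        if args.vararg is not None:
            params.append(args.vararg)
        if args.kwarg is not None:
            params.append(args.kwarg)
        # self and cls are conventionally left unannotated.
        unannotated = [
            param.arg
            for param in params
            if param.annotation is None and param.arg not in {"self", "cls"}
        ]
        if unannotated or node.returns is None:
            missing.append(node.name)
    return missing


def follows_expected_order(steps: list[str]) -> bool:
    """True if read, edit, lint, test appear in that order within steps."""
    remaining = iter(steps)
    # "x in iterator" consumes items up to the first match, so this checks
    # that EXPECTED_ORDER is an ordered subsequence of steps.
    return all(expected in remaining for expected in EXPECTED_ORDER)
```

Treating the sequence as a subsequence check allows repeated reads or extra edit and lint rounds while still requiring all four stages. Under this check, a trace that stops at read → edit → lint, like the sample in Example Format, would not pass.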

Training Config Overrides

MAX_SEQ = 8192
LR      = 5e-5

Estimated Size

  • 200–400 examples
  • ~1.5M tokens
  • Training time: ~1.5 hrs on H100
  • Adapter size: ~305 MB
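
As a quick consistency check on these estimates, the implied average example length can be compared with the MAX_SEQ override. The arithmetic below uses only the figures stated in this document and treats ~1.5M tokens as the total across all examples, which is an assumption.

```python
MAX_SEQ = 8192
TOTAL_TOKENS = 1_500_000
EXAMPLE_COUNTS = (200, 400)  # low and high ends of the estimated range


def mean_tokens_per_example(total_tokens: int, example_count: int) -> float:
    """Average number of tokens per example."""
    return total_tokens / example_count


for count in EXAMPLE_COUNTS:
    mean = mean_tokens_per_example(TOTAL_TOKENS, count)
    print(f"{count} examples: ~{mean:,.0f} tokens each, within MAX_SEQ: {mean <= MAX_SEQ}")
# 200 examples: ~7,500 tokens each, within MAX_SEQ: True
# 400 examples: ~3,750 tokens each, within MAX_SEQ: True
```

These are averages only: at the low end of the range the mean is close to the limit, so individual multi-turn examples could still exceed MAX_SEQ and need truncation or filtering.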