Claude and the Swarm: the ML team
Part 1 of a series about using Claude Code agent teams on a side project.
This series was written by Claude, trying to copy my style of writing, from dumps of working sessions and random “post idea notes” I gave it. The tone is a bit awful, but let’s say this is part of the experiment.
I’m building the Hive/Swarm, a P2P swarm where browser tabs run small LLMs and contribute inference to a collective. Mostly an art experiment.
The swarm needs to route messages — swarm queries, weather questions, general chat.
It needs user input classified client-side in the browser (WASM, no server round-trip).
I was focused on the inference/deployment side and didn’t want to spend time on the training pipeline, but I wanted the real pipeline in place, not a placeholder.
So I thought: why not ask a “team” to do it? I didn’t really care about quality at this stage.
V1
The team:
- Lead (Opus): reads specs, coordinates, reviews. Owns docs/. Never runs training code.
- Data Engineer (Sonnet): works in data/. Never touches training scripts.
- Training Engineer (Sonnet): works in training/. Never edits dataset files.
The filesystem is the contract between them. data/ is the handoff point.
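Roughly, the layout and ownership looked like this (the top-level name is my guess; the three directories are the ones the team actually owned):

```
hive-ml/
├── docs/       # Lead (Opus): specs, reviews, coordination
├── data/       # Data Engineer (Sonnet): datasets (the handoff point)
└── training/   # Training Engineer (Sonnet): training scripts and model outputs
```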
Gave them some instructions: grab a small model, fine-tune it, and I spec'd the deliverables (files).
The first version was a binary classifier: swarm vs. not-swarm.
They chose to train a bert-mini (fine). The training engineer kept getting stuck during training sessions.
We can live with a worse model for the first version. 70% is acceptable. No need for perfection.
The Lead broke its own “never runs training code” rule and ran the training itself.
92.2% accuracy. Not bad for “I don’t really care about quality.”
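For reference, here is a minimal sketch of what a bert-mini fine-tune like this could look like with Hugging Face transformers; the checkpoint id, file paths, and hyperparameters are my assumptions, not the team's actual script:

```python
# Fine-tune a bert-mini checkpoint as a binary swarm / not-swarm classifier.
# Assumes data/*.jsonl lines look like {"text": "...", "label": 0 or 1}.
import numpy as np
from datasets import load_dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

MODEL_ID = "prajjwal1/bert-mini"  # a common bert-mini checkpoint (assumption)

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForSequenceClassification.from_pretrained(MODEL_ID, num_labels=2)

# data/ is the handoff point; the file names are hypothetical
dataset = load_dataset("json", data_files={"train": "data/train.jsonl",
                                           "test": "data/test.jsonl"})

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True,
                     padding="max_length", max_length=64)

dataset = dataset.map(tokenize, batched=True)

def compute_metrics(eval_pred):
    logits, labels = eval_pred
    preds = np.argmax(logits, axis=-1)
    return {"accuracy": (preds == labels).mean()}

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="training/out", num_train_epochs=3,
                           per_device_train_batch_size=32, report_to="none"),
    train_dataset=dataset["train"],
    eval_dataset=dataset["test"],
    compute_metrics=compute_metrics,
)
trainer.train()
print(trainer.evaluate())
```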
V2
For the 4-class expansion, accuracy dropped to 87.6%.
The ML team didn’t catch it.
The Hive Team did — the separate agent team working on the inference side. They noticed “How is the swarm today?” was classified as “other” regardless of content. I relayed the bug report back.
Diagnosis: question-format bias. The swarm-class training data was statement-heavy, so the model learned question-shaped → other.
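That kind of bias is easy to surface once you think to look for it: count how many of each class's examples are question-shaped. A minimal sketch, assuming a JSONL dataset with "text" and "label" fields (path and field names are my guess):

```python
# Fraction of each class's training examples that look like questions.
import json
from collections import Counter

totals, questions = Counter(), Counter()
with open("data/train.jsonl") as f:
    for line in f:
        row = json.loads(line)
        totals[row["label"]] += 1
        text = row["text"].strip().lower()
        if text.endswith("?") or text.startswith(("how", "what", "is ", "do ")):
            questions[row["label"]] += 1

for label, total in totals.items():
    print(f"{label}: {questions[label] / total:.0%} question-shaped")
```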
The ML team fixed a bug reported by the Hive Team, which had found it just by using the model. Fun.
V3
For v3, I wanted better results and a larger training set: gave some guidance on keywords, told them to generate data from templates, patterns × keyword lists.
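The expansion itself is just a cross product. A minimal sketch with hypothetical patterns, keywords, and labels (not the team's actual lists):

```python
# Expand a handful of seed patterns into many labeled examples:
# every pattern for a class gets filled with every keyword for that class.
import itertools
import json

PATTERNS = {
    "swarm":   ["how many {kw} are online?", "show me the {kw} status", "is the {kw} healthy?"],
    "weather": ["do I need {kw} today?", "what is the {kw} like tomorrow?"],
    "chat":    ["tell me a {kw}", "can you share a {kw}?"],
}
KEYWORDS = {
    "swarm":   ["peers", "nodes", "tabs", "workers"],
    "weather": ["an umbrella", "a coat", "rain", "sunshine"],
    "chat":    ["joke", "story", "fun fact"],
}

rows = []
for label in PATTERNS:
    for pattern, keyword in itertools.product(PATTERNS[label], KEYWORDS[label]):
        rows.append({"text": pattern.format(kw=keyword), "label": label})

# Write the generated set into data/, the handoff point (file name is a guess).
with open("data/generated.jsonl", "w") as f:
    for row in rows:
        f.write(json.dumps(row) + "\n")

print(f"{len(rows)} examples generated")
```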
1,508 examples from ~20 hand-written seeds. 94.4% accuracy, 98% swarm recall.
Three edge cases still wrong: “what questions are being asked”, “do I need an umbrella”, “hot”. Even a human would need context?
~850 lines of Python, ~800 lines of docs, 3 model versions, one hour.
The model classifies in milliseconds in the browser. Try it.
It’s probably not the best model (it’s fine), but given that all the real data behind it is ~20 examples I wrote in 2 minutes… not bad.
Why not just regex? Because I want more classes later, I wanted to test Candle/WASM inference, and honestly the real goal was testing the multi-team workflow. The classifier was the payload, the pipeline was the experiment.
Next: the hive team wires it into the live system.