AI & LLMs

Three Cobblers, One Zhuge Liang: Making Cheaper Models Work Together

A personal AI architecture lesson from the Chinese saying 三个臭皮匠，顶个诸葛亮: why cheaper models fail on giant prompt blobs, and how focused specialist sessions, orchestration, synthesis, and temperature control can make them useful.

April 30, 20265 min read

ai-architectureweak-modelsmulti-aiprompt-engineeringllm-costsorchestration

AI-Powered

AI-powered · Limited to 20 requests per hour

An ancient strategist silhouette looking over three focused AI workstations connected by soft threads of light

There is an old Chinese saying I keep coming back to when I think about AI architecture:

三个臭皮匠，顶个诸葛亮

Literally, it means "three humble cobblers can match one Zhuge Liang." Zhuge Liang was the legendary strategist from the Three Kingdoms era, the kind of figure people use as shorthand for impossible intelligence. The saying is not really about cobblers. It is about pooled perspective. Three ordinary people, if coordinated well, can compete with one genius.

That sentence started feeling very practical to me once token bills became part of the architecture discussion.

For a while, the default answer to any difficult AI workflow was simple: use the strongest model you can afford. Run Sonnet. Run Opus. Run the best GPT model available. If the output misses something, add more instructions. Eventually the prompt becomes a giant blob: requirements, examples, edge cases, logs, constraints, and "please be careful" all crushed into one request.

It feels reasonable. It is also where cheaper models start to fall apart.

The giant prompt trap

A small model node being overwhelmed by a single massive glowing prompt blob full of tangled requirements

Smaller models like Haiku-class systems are useful. They are fast, cheap enough to call repeatedly, and good at narrow tasks. But they are not compressed Opus.

Compared with Sonnet, a smaller model is more likely to miss the second or third constraint in a long prompt. It may follow the main instruction while forgetting the exception. Compared with Opus, the gap gets sharper: long-horizon planning, conflict resolution, and self-checking are weaker. When it makes a plausible mistake, it often polishes the mistake instead of catching it.

This is expected. The mistake is not that Haiku misses things. The mistake is designing the workflow as if it should not.

The first improvement: separate the job from the rules

The first fix I learned was embarrassingly simple: stop stuffing everything into the user prompt.

A clear system prompt changes the shape of the task. The system prompt defines the role, priorities, constraints, output contract, and evaluation lens. The user prompt carries the payload. That separation matters because the model no longer has to infer which parts are permanent rules and which parts are one-time data.

For weaker models, that difference is large. A focused system prompt acts like a rail. It tells the model what kind of judgment to apply before it sees the giant blob. "You are a requirements auditor. Only check missing acceptance criteria. Return findings as JSON." That is easier to follow than a long prompt that says, somewhere in paragraph twelve, "also act like a requirements auditor."

The reasoning is concrete: smaller models have less room to juggle instructions. When rules, examples, data, and desired output are mixed together, the model has to spread its attention across all of it. A system prompt anchors the behavior first, then lets the task data flow through it.

This does not make a weak model brilliant. It makes the task narrower.

The real architecture: three cobblers

Multiple focused model sessions examining different slices of one problem before passing structured notes forward

The professional answer is not "write a better giant prompt." It is "stop asking one session to be every profession at once."

Split the work.

One session checks requirements. Another goes after edge cases. A third extracts facts. The next looks for contradictions. The last rewrites for tone. Each session gets its own system prompt and narrow task prompt. None of them needs to be Zhuge Liang. They just need to be decent at their assigned corner.

Then a final synthesis session combines the results.

This is where the proverb becomes architecture. Three smaller models with focused responsibilities can cover more surface area than one overloaded model trying to remember everything. The improvement does not come from pretending weak models are strong. It comes from reducing the number of things each model can forget.

Parallelism helps when the subtasks are independent: security review, UX review, cost review, factual extraction. Chaining helps when one output becomes the input to the next: classify, extract, validate, summarize. In both cases, the important move is the same. Replace one broad judgment with several narrow judgments.

The hub-and-spoke version

A central orchestrator node routing structured context between specialized AI agents arranged around it

There is another pattern I like: the hub-and-spoke model.

One session acts as the orchestrator. It does not try to solve the whole problem directly. Instead, it decides which specialist should inspect which part. It passes only the relevant context, collects the replies, and asks follow-up questions when outputs conflict. Then it synthesizes the final answer.

This is useful when the work is not a clean pipeline. Real tasks are messy. A review agent might find a missing requirement. That missing requirement might need to go back to a planning agent. A cost agent might disagree with the proposed architecture. The orchestrator keeps the state moving without forcing every specialist to understand the whole world.

The trick is to keep the orchestrator honest. It should pass structured summaries, not vague vibes. It should preserve disagreements instead of smoothing them away. And when the spokes produce conflicting answers, the final synthesis should say so or escalate to a stronger model.

Cheap models are useful here because they become sensors. Each one looks from a specific angle. The orchestrator does not need them to be perfect. It needs enough coverage that important misses become less likely.

The last knob: temperature

A precise temperature control dial balancing predictable pipeline work with creative exploration

Temperature is not a cure for weak reasoning, but it is one of the simplest ways to make a pipeline less chaotic.

For extraction, validation, classification, synthesis, and review, I want low temperature. Predictability matters more than novelty. If the same input produces a different schema or a different judgment every run, the workflow becomes hard to debug.

For creative work, I raise it. Naming, brainstorming, metaphors, first-draft copy, visual ideas: those tasks benefit from variation. I do not want the model to return the safest average answer every time.

The mistake is using one temperature everywhere. Architecture tasks need different modes. A specialist that checks compliance should be boring. A specialist that proposes blog titles can be loose. The orchestrator should usually be conservative.

That is the lesson I keep learning: do not spend all your energy searching for one perfect model call. Design the work so imperfect calls can still be useful.

Three cobblers do not magically become Zhuge Liang. But if each one knows exactly what to look at, and someone sensible combines the result, the system can get surprisingly close.

License

Article text © 2026 Mark Huang. Licensed under Creative Commons Attribution-NonCommercial 4.0 International (CC BY-NC 4.0) unless otherwise noted. Article text is licensed for non-commercial sharing with attribution to the original article URL. Commercial use requires prior written permission and must clearly cite the original source.

Code snippets, screenshots, third-party assets, and site source code may have separate terms.

Suggested attribution: Based on "Three Cobblers, One Zhuge Liang: Making Cheaper Models Work Together" by Mark Huang, originally published at https://markhuang.ai/blog/three-cobblers-one-zhuge-liang-ai-architecture.

May 26, 20265 min read

System Prompt vs User Prompt: The Layer Under GenAI Features

A beginner-friendly explanation of system_prompt and user_prompt using ChatGPT, Claude Projects, Claude Cowork, and Claude Code examples.

Read article

Apr 8, 202610 min read

The 1+1 Hypothesis: Can You Break Coding Problems Small Enough for Any LLM?

Every LLM can do 100×100. Every coding LLM can rename a variable. But where does reliability break — and can harness engineering push that boundary? Exploring residual solution entropy, test-first contracts, layered defense architectures, and why blind consensus fails while verified search works.

Read article

Jun 3, 20267 min read

I Feel Sorry for AI

Why both AI hype and anti-AI hostility miss the same point: LLMs behave more like straight-A new graduates than senior experts, and useful agents need onboarding, skills, and maintained memory rather than impossible first-attempt expectations.

Read article

The giant prompt trap

The first improvement: separate the job from the rules

The real architecture: three cobblers

The hub-and-spoke version

The last knob: temperature

License

Related Articles

System Prompt vs User Prompt: The Layer Under GenAI Features

The 1+1 Hypothesis: Can You Break Coding Problems Small Enough for Any LLM?

I Feel Sorry for AI