
Why Every AI Engineer Should Learn Classical Chinese

ai.rs Apr 14, 2026

Or at least, why your agents should be writing in it.


Six months into any serious LLM-agent project, the same thing happens.

The conversation history, the decision log, the accumulated project context — all of it balloons past the model's context window. You start summarizing. The summaries lose fidelity. You feed the summaries back in and the model hedges more, hallucinates more, forgets the decisions it made a month ago. Every call pays for the same project preamble again. The API bill climbs.

If you're calling a frontier model at scale, the cost of context isn't theoretical. It's line one of your infra spend.

So when a GitHub issue crossed my feed claiming that Classical Chinese — 文言文, a literary language whose grammar stabilized around the time of Confucius — could compress agent memory by 28% compared to structured English shorthand, I did what any engineer does on seeing a claim like that.

I assumed it was nonsense and set out to prove it.

I was half right.

The claim, and the skeptic's case

The two projects in question:

  • MemPalace — an agent-memory architecture that shards long conversations into a "palace" of wings, rooms, closets, and drawers, each holding structured-English compressed notes in a format called AAAK. It scores 96.6% on LongMemEval without calling an LLM summarizer.
  • MemChinesePalace — a fork-in-spirit by a different author, replacing AAAK with what they call "Wenjian" (文简 — Classical Chinese shorthand). The issue proposing this was closed by the upstream maintainer within hours: "Classical Chinese wouldn't be natively readable by most LLMs."
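The wings/rooms/closets/drawers sharding can be pictured as a path-keyed store. This is a minimal sketch under assumed names — the path scheme and helpers here are illustrative, not MemPalace's actual API:

```python
# Minimal sketch of wing/room/closet/drawer sharding: each compressed note is
# filed under a slash-separated path, and recall walks a path prefix.
# (Path scheme and helper names are assumptions, not MemPalace's real API.)
palace: dict[str, list[str]] = {}

def file_note(path: str, note: str) -> None:
    """File one compressed memory record under a wing/room/closet/drawer path."""
    palace.setdefault(path, []).append(note)

def recall(path_prefix: str) -> list[str]:
    """Return every note whose path starts with the given prefix."""
    return [n for p, notes in palace.items() if p.startswith(path_prefix) for n in notes]

file_note("backend/auth/decisions/2026", "DECISION:auth.migrate:auth0->clerk")
```

The point of the hierarchy is that recall can load one wing's worth of notes instead of the whole history — that's what keeps the LongMemEval run cheap without an LLM summarizer.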

The case for skepticism looked strong:

Tokenizers don't love Chinese. OpenAI's older cl100k_base tokenizer (used by GPT-4 and GPT-3.5) splits most Chinese characters into 2–3 BPE tokens. "Character count" and "token count" are not the same thing, and Chinese often costs more tokens per character than English.

Classical Chinese is famously ambiguous. Two thousand years of commentators have argued over what any given passage of 文言文 means. For a memory system where you need deterministic recall, that's the opposite of what you want.

AAAK already works. A format like DECISION:auth.migrate:auth0->clerk is ugly but parses with a regex and leaves zero room for interpretation. It uses common English tokens. It's hard to see what Classical Chinese adds.
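"Parses with a regex" is literal. Using the record from the text (the exact pattern is my assumption, but any equivalent one works):

```python
import re

# One regex, zero interpretation: every field of the AAAK record is recoverable
# deterministically. The pattern itself is an assumption; AAAK's real grammar
# lives in the MemPalace repo.
record = "DECISION:auth.migrate:auth0->clerk"
m = re.fullmatch(r"(?P<kind>[A-Z]+):(?P<key>[\w.]+):(?P<frm>\w+)->(?P<to>\w+)", record)
print(m.groupdict())
```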

So the headline claim — "28% fewer tokens" — smelled like someone counting characters and calling them tokens.

Test one: does the token claim hold?

I wrote the smallest possible benchmark: five realistic memory samples (a decision, a bug finding, a milestone event, a team preference, a proposal), encoded in three formats each. Plain English, AAAK, and Wenjian. Then I fed them through tiktoken against two real BPE tokenizers.

The result, totalled across all five samples:

| Tokenizer | English | AAAK | Wenjian | Wenjian vs AAAK |
|---|---|---|---|---|
| cl100k_base (GPT-4 / 3.5) | 250 | 220 | 234 | +6.4% (worse) |
| o200k_base (GPT-4o / 5) | 253 | 220 | 191 | −13.2% |

My suspicion was right: the 28% figure was character-counted, not token-counted. On the older tokenizer, Wenjian actually loses to AAAK.

But my suspicion was also wrong: on the modern o200k_base tokenizer — the one used by every frontier OpenAI model today — Wenjian really is about 13% smaller. Not 28%, but not zero either.

Half a win for the Wenjian side. The real question, I thought, was whether the model could still read the compressed form accurately. That's where Wenjian's polysemy problem was supposed to bite.

Test two: can the model actually read it?

For this I used a local setup — ollama serving qwen3:32b, qwen3.5:27b, and (later) llama3.1:8b. Qwen is the strongest open model for Chinese, which makes it the fairest test of the "LLMs natively read 文言文" premise. If Wenjian can't perform there, it can't perform anywhere.

The protocol: for each of the five memory samples, I generated three factual questions. The model got only the compressed memory record and one question, and had to answer. Scoring was a deterministic keyword match (no LLM-as-judge — reproducible across runs).
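A minimal sketch of that protocol, assuming a local ollama server on the default port. The call helper is shown but not invoked here; the deterministic scorer is the part worth copying:

```python
import json
import urllib.request

def ask(model: str, memory: str, question: str) -> str:
    """One retrieval call against a local ollama server (assumed at :11434)."""
    payload = {
        "model": model,
        "prompt": f"Memory record:\n{memory}\n\nQuestion: {question}\nAnswer briefly.",
        "stream": False,
    }
    req = urllib.request.Request(
        "http://localhost:11434/api/generate",
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["response"]

def score(answer: str, accepted_keywords: list[str]) -> bool:
    """Deterministic keyword match: any accepted keyword in the answer counts.
    No LLM-as-judge, so a rerun scores identically."""
    a = answer.lower()
    return any(k.lower() in a for k in accepted_keywords)
```

Accepting multiple keywords per question is what lets "Priya" and "普雅" both score as correct later on.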

One hundred and twenty calls later, I had my answer:

| Model | English | AAAK | Wenjian |
|---|---|---|---|
| qwen3.5:27b | 15/15 (100%) | 14/15 (93%) | 15/15 (100%) |
| qwen3:32b | 15/15 (100%) | 15/15 (100%) | 15/15 (100%) |

Wenjian matches English. On both models.

The polysemy concern that I and the upstream maintainer had both raised — that Classical Chinese would be too ambiguous for reliable fact recall — simply didn't materialize. When asked "what was the target deadline?" of the line 议 26/Q1末 迁身份:Auth0→Clerk, the model answered "end of Q1 2026" without hesitation. When asked who discovered a bug encoded as 普雅设审中得 ("Priya, in the security audit, discovered"), it answered "Priya" or "普雅" — both scored correct.

At this point I had to update. The Wenjian claim isn't bullshit. On Chinese-strong models, it's a Pareto improvement over plain English: 24% smaller, same retrieval. The upstream maintainer was wrong to close the issue that fast.

A hybrid that nearly beat them both

While I was at it, I built a third format: a hybrid that keeps AAAK's deterministic KEY:value|key:value skeleton but inlines five Chinese idiom macros — 亡羊 (tech-debt / known-defect), 破竹 (major breakthrough), 金蝉 (migration / refactor), 定鼎 (final architecture decision), 一石 (single-action-multiple-wins).

These idioms are the genuinely novel contribution of Classical Chinese to this problem. Each one is 2–3 tokens but encodes a multi-token English concept. And because frontier models are trained on enough Chinese literature to know what they mean, there's no learning cost per session — just a one-line legend in the system prompt.
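The five macros and their expansions come straight from the format above; the legend wording itself is my assumption of what the one-line system-prompt legend could look like:

```python
# The five idiom macros of the hybrid format. Expansions are from the text;
# the legend string format is an assumed rendering.
MACROS = {
    "亡羊": "tech-debt / known defect",
    "破竹": "major breakthrough",
    "金蝉": "migration / refactor",
    "定鼎": "final architecture decision",
    "一石": "single action, multiple wins",
}

def legend() -> str:
    """The one-line legend injected once into the system prompt."""
    return "Macros: " + "; ".join(f"{m} = {v}" for m, v in MACROS.items())

print(legend())
```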

The hybrid scored best on tokens: 28% smaller than English, 17% smaller than AAAK. But when I ran the retrieval test, it stumbled — 87% combined across the two Qwen models. The failures were specific: the shorthand @Q1.26 was read as decoration rather than a deadline, and parenthesized reason-codes like (cwrites+json) were too cryptic to expand when asked "why is this preferred?".

So I wrote a v2 that used t:Q1.26 and why:cwrites,json. It cost nine extra tokens. Retrieval jumped from 87% to 97% on Qwen.
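A reconstructed illustration of that fix (the benchmark's actual records are in the repo; these mirror the shapes described above):

```python
# v1 packed the deadline and reasons into shorthand the model read as noise;
# v2 spends nine tokens on explicit keys and becomes machine-parseable too.
v1 = "定鼎 db:pg>mysql @Q1.26 (cwrites+json)"      # cryptic: @-date and parens misread
v2 = "定鼎 db:pg>mysql|t:Q1.26|why:cwrites,json"   # explicit t:/why: keys

def fields(record: str) -> dict[str, str]:
    """Parse the key:value tail segments of a hybrid-v2 record."""
    return dict(seg.split(":", 1) for seg in record.split("|")[1:])

print(fields(v2))
```

A nice side effect: the v2 tail parses deterministically, so the same record serves both the model and any non-LLM tooling that reads the memory store.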

Hybrid v2 now tied Wenjian on both axes — same compression, same recall. On Qwen. The interesting question was what would happen on a model that wasn't trained on a mountain of Chinese text.

Test three: does it survive a Western model?

I pulled llama3.1:8b — a small, general-purpose Meta model with much thinner CJK coverage than Qwen. This was the test the upstream maintainer had implicitly invoked, but never actually run, when he closed the issue.

| Format | Llama3.1:8b |
|---|---|
| English | 15/15 (100%) |
| AAAK | 13/15 (87%) |
| Wenjian | 13/15 (87%) |
| Hybrid v2 | 14/15 (93%) |

Three findings worth pulling out:

Wenjian didn't collapse. It dropped from 100% on Qwen to 87% on Llama, landing exactly where AAAK already was. The upstream maintainer's concern was directionally right but overstated — even an 8B Western-trained model extracts most of Wenjian's content correctly.

Hybrid v2 was the top compressed format. At 93%, it beat both Wenjian and AAAK on Llama. The design bet — "keep Latin keys for everything except the five macros" — paid off. The macros are common enough in LLM training data to survive anywhere; the Latin keys keep the rest tokenizer-stable.

Direction arrows broke Llama across every format. > and -> got inverted multiple times. pg>mysql was read as "mysql is preferred", and jenkins->gh_actions as "Jenkins is recommended". That's a format-neutral finding worth fixing in any compression scheme: textual from:X|to:Y is worth the extra tokens.
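The fix is mechanical. A sketch of an arrow-to-label rewrite (the regex is mine, not from either project):

```python
import re

def label_transitions(record: str) -> str:
    """Rewrite X->Y / X>Y / X→Y transitions as explicit from/to fields,
    removing the direction ambiguity that inverted answers on Llama."""
    return re.sub(r"(\w+)\s*(?:->|→|>)\s*(\w+)", r"from:\1|to:\2", record)

print(label_transitions("MIGRATE:ci:jenkins->gh_actions"))
```

Run it once at write time, before a record ever enters the memory store, and every downstream model sees the unambiguous form.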

Combined cross-model ranking, 45 questions each:

| Format | Tokens vs English | Retrieval | Behaviour |
|---|---|---|---|
| English | 0% | 100% | reference |
| Wenjian | −24% | 96% | peaks on Chinese-strong, drops to AAAK-parity elsewhere |
| Hybrid v2 | −24% | 96% | more uniform across model families |
| AAAK | −13% | 93% | solid but less compressed |
| Hybrid v1 | −28% | 87% | too aggressive, dominated by v2 |

The methodology surprise

The most useful single finding from this whole exercise wasn't about Classical Chinese at all. It was about how to evaluate a format in the first place.

The weakest model was the most informative.

  • qwen3:32b scored 4 of 5 formats at 100%. Ceiling effect. Almost no signal about which format is actually more robust.
  • qwen3.5:27b — somewhat fewer parameters, newer training — separated hybrid v1 from the pack but still saturated Wenjian and English.
  • llama3.1:8b was the only model that produced different failure modes per format, surfaced the direction-arrow bug, and cleanly separated hybrid v2 from the others.

If I had only run this on Claude Opus or GPT-5, I'd have concluded that all four compressed formats were equivalent. I'd have shipped the wrong one. The frontier models succeed despite the format, not because of it. Their format-robustness is invisible from their top-line score.

There's a sub-finding inside this that's worth calling out separately. Within the Qwen family, the newer model (3.5) scored worse than the older model (3.0) on every compressed format — 93% vs 100% on Hybrid v2, 80% vs 93% on v1. Both Q4_K_M, so quantization is constant. Two plausible reads: (a) five billion fewer parameters hurt literal-parsing capability more than a generation of training gains it, or (b) newer RLHF tunes models away from shorthand literalism — the 3.5 misses were mostly "the record does not specify…", the model hedging instead of committing to what's there.

Either way: do not assume the newest model in a family is the best format-reader. Test your actual deployment target.

What this means for you, practically

If you're building anything that persists memory across LLM calls — an agent, a copilot, a long-lived assistant, a RAG pipeline that stuffs retrieved docs into a context window — these numbers have a direct read.

For a fixed context window, compressing memory into a 24%-denser dialect stores roughly 30% more facts in the same tokens (1/0.76 ≈ 1.32). Equivalently, if you're paying per token over an API, that's a ~24% input-cost reduction at ~96% retrieval fidelity on mid-sized open models. Eight months of project context now fit where six did before.
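The back-of-envelope arithmetic, with entirely made-up volume and price numbers (only the 24% figure comes from the benchmark):

```python
# What a 24% input-token cut does to monthly spend and to a fixed window.
# price and volume are illustrative assumptions, not quotes.
price_per_mtok = 2.50                  # $ per million input tokens (assumed)
monthly_input_tokens = 800_000_000     # assumed agent-fleet volume

before = monthly_input_tokens / 1e6 * price_per_mtok
after = before * (1 - 0.24)
print(f"${before:,.0f}/mo -> ${after:,.0f}/mo")

# Same compression read the other way: capacity gain in a fixed window.
extra_capacity = 1 / (1 - 0.24) - 1
print(f"{extra_capacity:.0%} more facts in the same context window")
```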

The practical picks:

  • If your serving model is Chinese-strong (Qwen, DeepSeek, Yi, any Chinese-tuned Claude or GPT deployment), use pure Wenjian. It peaks there.
  • If you serve a mix — or you don't know what model the user picks — use the Hybrid v2 format. More uniform across model families, same compression, one miss per 15 on weak Western models.
  • Either way, replace direction arrows with textual labels. That's a universal improvement; it costs a few tokens and prevents a whole class of Llama-style inversions.

And the deeper lesson, applicable far beyond this one experiment: if you're comparing prompt formats, tool-call schemas, structured-output styles, or domain DSLs — evaluate them on small or mid-sized open models. Not on the flagship. The flagship's ceiling effect will hide the failures that show up in production on cheaper inference.

So — learn Classical Chinese?

Literally? No. You don't need to read 文言文 yourself. The point is that a language whose grammar stabilized two millennia ago, which removed every grammatical redundancy human writers could find to remove, and which modern LLMs were trained on because it's part of humanity's literary record anyway — that language is already sitting in your model, unused, ready to compress your memory by a quarter.

You don't have to learn Classical Chinese.

Your agents should be writing in it.


Full benchmark code, results, and raw data: github.com/.../MemChinese (upstream) — see the README for the unvarnished numbers and next steps.
