GLM-5.2 Is Here: Z.ai's 753B Open Model Built for Long-Horizon Agentic Work

Z.ai shipped GLM-5.2, an open-weight 753B-parameter MoE with a 1M-token context and the new IndexShare attention trick. Here are the real specs, the benchmarks, the honest caveats, and how much VRAM it actually needs.

By Jyoti Ranjan Swain
[Image: GLM-5.2 by Z.ai, a 753B-parameter open-weight Mixture-of-Experts model with a 1M-token context and MIT license]

Short Intro

GLM-5.2 is Z.ai's newest flagship open model, and the framing is unusually specific: it is built for long-horizon tasks — sustained, multi-step agentic work rather than one-shot answers. It is a Mixture-of-Experts model with 753 billion total parameters, a 1M-token context window, MIT-licensed open weights on Hugging Face, and a new attention optimization called IndexShare that the team says cuts per-token compute by 2.9x at full context.

The headline story is not raw benchmark dominance. GLM-5.2 trades blows with Claude Opus 4.8, GPT-5.5, and Gemini 3.1 Pro — winning some rows, losing others. The real story is that an open-weight model you can download and self-host is now genuinely competitive on agentic engineering, and it ships with the architectural tricks needed to make a 1M context practical. This post walks through what it is, what the numbers actually say, and what it costs to run.

What GLM-5.2 Actually Is

GLM-5.2 is the latest model in the GLM-5 line from Z.ai, the lab formerly known publicly as Zhipu, which publishes under the zai-org organization on Hugging Face. Z.ai positions it as their "latest flagship model for long-horizon tasks," and the model card leads with four claims:

  • A 1M-token context that "stably sustains long-horizon work" — the emphasis is on stable use of the window, not just a big number on the spec sheet.
  • Stronger coding with multiple selectable thinking effort levels, so you can trade latency for depth per request.
  • Architecture improvements — specifically IndexShare and an improved MTP (multi-token prediction) layer for faster speculative decoding.
  • An MIT open-source license, described by the team as "no regional limits, technical access without borders."

It is a text-generation model with English and Chinese support, distributed in BF16 with an FP8 variant (GLM-5.2-FP8) also published. The companion technical report is titled "GLM-5: from Vibe Coding to Agentic Engineering," which signals where the team aimed this release.

The Architecture, In Plain Numbers

Pulled straight from the model's published config.json, here is what GLM-5.2 is under the hood:

Field                    | Value
Total parameters         | ~753B (MoE)
Hidden layers            | 78
Hidden size              | 6,144
Attention heads          | 64
KV heads                 | 64
Head dimension           | 192
Dense intermediate size  | 12,288
MoE intermediate size    | 2,048
Routed experts           | 256
Experts per token        | 8
Shared experts           | 1
Max context              | 1,048,576 tokens
Vocab size               | 154,880
Weights dtype            | BF16

A few things worth flagging for anyone who plans to serve this:

  • It is a sparse MoE. 753B total parameters, but only 8 of 256 routed experts (plus 1 shared) fire per token. That keeps the active compute per token far below what the total parameter count implies — that is the whole point of the design.
  • KV heads equal attention heads (64 = 64). Unlike the grouped-query attention used in many Llama and Qwen models (which often cut KV heads to 8 or fewer), GLM-5.2 keeps full multi-head KV. Combined with a large head dim of 192, that makes the KV cache heavy per token, which matters enormously at long context. IndexShare exists largely to keep attention over that cache tractable.
  • 78 layers is deep, which compounds the KV cache cost. We will come back to this in the VRAM section.
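
To make the sparsity concrete, here is a rough per-layer estimate built from the config values above. It assumes each expert is a gated MLP with three weight matrices (gate, up, down) of shape hidden size × MoE intermediate size. That layout is common in MoE models but is an assumption here, not something the config table states, and the function names are illustrative.

```python
# Per-MoE-layer expert arithmetic from the config table above.
# Assumption (not from the source): each expert is a gated MLP with
# three weight matrices (gate, up, down), each hidden_size x moe_intermediate.
HIDDEN_SIZE = 6_144
MOE_INTERMEDIATE = 2_048
ROUTED_EXPERTS = 256
EXPERTS_PER_TOKEN = 8
SHARED_EXPERTS = 1
MATRICES_PER_EXPERT = 3  # assumed gated-MLP layout


def expert_params() -> int:
    """Parameters in a single expert under the assumed layout."""
    return MATRICES_PER_EXPERT * HIDDEN_SIZE * MOE_INTERMEDIATE


def layer_expert_params() -> tuple[int, int]:
    """Return (expert parameters stored, expert parameters used per token) for one MoE layer."""
    per_expert = expert_params()
    stored = (ROUTED_EXPERTS + SHARED_EXPERTS) * per_expert
    active = (EXPERTS_PER_TOKEN + SHARED_EXPERTS) * per_expert
    return stored, active


stored, active = layer_expert_params()
print(f"expert weights stored per MoE layer: {stored / 1e9:.2f}B")
print(f"expert weights used per token:       {active / 1e9:.2f}B")
print(f"active share of expert weights:      {active / stored:.1%}")
```

Under these assumptions only 9 of the 257 experts in a layer touch any given token, so the expert compute per token is a few percent of what is stored. All of those weights still have to sit in memory, which is why the sparsity helps throughput far more than it helps the VRAM budget.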

IndexShare and the 1M-Token Context

The marquee architectural change is IndexShare. In Z.ai's words, it "reuses the same indexer across every four sparse attention layers, reducing per-token FLOPs by 2.9x at a 1M context length." In practice: sparse attention needs an index to decide which tokens to attend to, computing that index is expensive at long context, and sharing one indexer across groups of four layers amortizes that cost. The result is that the 1M window is not just supported on paper but is meant to stay efficient deep into the context.
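
The sharing idea can be sketched in a few lines. This is a toy illustration, not Z.ai's implementation: the scoring input, the top-k selection, the contiguous grouping, and the choice to treat all 78 layers as sparse-attention layers are all simplifying assumptions, and every function name is hypothetical.

```python
# Toy illustration of sharing one sparse-attention indexer across groups of
# four layers. The scoring input, top-k size, contiguous grouping, and treating
# all 78 layers as sparse-attention layers are assumptions for illustration.
NUM_LAYERS = 78
SHARE_GROUP = 4  # "every four sparse attention layers", per Z.ai's description


def indexer_group(layer: int) -> int:
    """Which shared indexer a layer reads from under contiguous grouping."""
    return layer // SHARE_GROUP


def select_top_k(scores: list[float], k: int) -> list[int]:
    """Stand-in indexer: keep the positions of the k highest-scoring tokens."""
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:k])


def run_layers(scores: list[float], k: int) -> tuple[int, dict[int, list[int]]]:
    """Return (number of indexer calls, layer -> selected token positions)."""
    cache: dict[int, list[int]] = {}
    selections: dict[int, list[int]] = {}
    for layer in range(NUM_LAYERS):
        group = indexer_group(layer)
        if group not in cache:            # compute the index once per group
            cache[group] = select_top_k(scores, k)
        selections[layer] = cache[group]  # other layers in the group reuse it
    return len(cache), selections


calls, selections = run_layers([0.1, 0.9, 0.3, 0.7, 0.2], k=2)
print(f"indexer calls: {calls} instead of {NUM_LAYERS}")
print(f"layer 0 attends to positions {selections[0]}")
```

The drop in indexer calls is not the same quantity as the quoted 2.9x, which is Z.ai's figure for total per-token FLOPs at a 1M context; the sketch only shows where the saving comes from.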

The second change is an improved MTP layer for speculative decoding, which the team says raises acceptance length "by up to 20%." Longer accepted drafts mean fewer verification passes and faster tokens-per-second in practice.
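
A quick back-of-the-envelope check shows what a longer acceptance length buys. The baseline acceptance length below is a hypothetical value chosen for illustration; only the "up to 20%" improvement comes from Z.ai's description, and the model of one fixed-size accepted draft per pass is a simplification.

```python
# How a longer acceptance length reduces verification passes.
# BASELINE_ACCEPT is hypothetical; only the +20% best case is from the source.
import math


def verification_passes(tokens_to_generate: int, acceptance_length: float) -> int:
    """Main-model passes needed if each pass accepts this many tokens on average."""
    return math.ceil(tokens_to_generate / acceptance_length)


BASELINE_ACCEPT = 2.5                     # hypothetical tokens accepted per pass
IMPROVED_ACCEPT = BASELINE_ACCEPT * 1.20  # best case quoted: +20%

before = verification_passes(3_000, BASELINE_ACCEPT)
after = verification_passes(3_000, IMPROVED_ACCEPT)
print(f"passes before: {before}, after: {after}")
print(f"reduction: {1 - after / before:.1%}")
```

Whatever the baseline, a 20% longer acceptance length trims verification passes by at most about a sixth (1 − 1/1.2), so this is a useful speedup but not a transformative one on its own.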

The honest read: these are real, documented efficiency wins, and they target exactly the failure mode that kills long-context models — compute and memory blowing up as the window fills. But "2.9x fewer FLOPs" is a compute claim, not a free pass on memory. The KV cache still grows linearly with context, and as the numbers below show, that is where a 1M window gets expensive.

The Benchmark Numbers (And How To Read Them)

Z.ai published a head-to-head evaluation against GLM-5.1, Claude Opus 4.8, GPT-5.5, and Gemini 3.1 Pro, with all models run at their maximum thinking effort.

[Figure: GLM-5.2 LLM performance evaluation chart across eight benchmarks]

Here is a selection of the reported scores:

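Benchmark             | GLM-5.2              | GLM-5.1 | Claude Opus 4.8 | GPT-5.5 | Gemini 3.1 Pro
SWE-bench Pro         | 62.1                 | 58.4    | 69.2            | 58.6    | 54.2
Terminal-Bench 2.1    | 81.0                 | 63.5    | 85.0            | 84.0    | 74.0
NL2Repo               | 48.9                 | 42.7    | 69.7            | 50.7    | 33.4
DeepSWE               | 46.2                 | 18.0    | 58.0, 10.0 (one of the three competitor scores is missing in the source; column placement unclear)
ProgramBench          | 63.7                 | 50.9    | 71.9            | 70.8    | 39.5
MCP-Atlas             | 77.0                 | 71.8    | 77.8            | 75.3    | 69.2
Tool-Decathlon        | 48.2                 | 40.7    | 59.9            | 55.6    | 48.8
Humanity's Last Exam  | 40.5 (54.7 w/ tools) | 31.0    | 52.3            | 57.9    | 51.4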

Reported reasoning and math scores are also strong: GPQA-Diamond 91.2, AIME 2026 99.2, HMMT Feb 2026 92.5, IMOAnswerBench 91.0.

Two honest readings:

  1. The generational jump over GLM-5.1 is large and consistent. DeepSWE more than doubles (18.0 → 46.2), Terminal-Bench jumps from 63.5 to 81.0, and every coding and agentic row improves. This is a real release, not a point update.
  2. Against the frontier, it is mixed. GLM-5.2 comes within a point of Claude Opus 4.8 on MCP-Atlas (77.0 vs. 77.8) and posts a near-ceiling AIME 2026 score (99.2), but it trails Claude Opus 4.8 on most pure-coding benchmarks (SWE-bench Pro, NL2Repo, ProgramBench) and on Tool-Decathlon. The pitch is not "beats every closed model"; it is "an open-weight model that is genuinely in the conversation, and that you can run yourself."
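
Both readings can be checked directly against the reported scores. The snippet below copies the GLM-5.2, GLM-5.1, and Claude Opus 4.8 columns from the table above and computes the per-row differences. The DeepSWE entry for Claude is left empty because that row is incomplete in the source, and the helper names are illustrative.

```python
# Reported scores copied from the table above: (GLM-5.2, GLM-5.1, Claude Opus 4.8).
# DeepSWE's Claude entry is None because that row is incomplete in the source.
from typing import Optional

SCORES: dict[str, tuple[float, float, Optional[float]]] = {
    "SWE-bench Pro": (62.1, 58.4, 69.2),
    "Terminal-Bench 2.1": (81.0, 63.5, 85.0),
    "NL2Repo": (48.9, 42.7, 69.7),
    "DeepSWE": (46.2, 18.0, None),
    "ProgramBench": (63.7, 50.9, 71.9),
    "MCP-Atlas": (77.0, 71.8, 77.8),
    "Tool-Decathlon": (48.2, 40.7, 59.9),
    "Humanity's Last Exam": (40.5, 31.0, 52.3),
}


def generational_gain(benchmark: str) -> float:
    """GLM-5.2 minus GLM-5.1, in points."""
    new, old, _ = SCORES[benchmark]
    return round(new - old, 1)


def gap_to_claude(benchmark: str) -> Optional[float]:
    """GLM-5.2 minus Claude Opus 4.8, or None when the score is unavailable."""
    new, _, claude = SCORES[benchmark]
    return None if claude is None else round(new - claude, 1)


for name in SCORES:
    gap = gap_to_claude(name)
    gap_text = "n/a" if gap is None else f"{gap:+.1f}"
    print(f"{name:<22} vs GLM-5.1: {generational_gain(name):+.1f}   vs Claude: {gap_text}")
```

Every row shows a positive gain over GLM-5.1, while the gap to Claude Opus 4.8 is negative wherever a score is available, smallest on MCP-Atlas.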

The standard caveat applies: these are vendor-run numbers with each model on its own harness, so treat the cross-model rows as directional. The GLM-5.1 → GLM-5.2 comparison (same lab, same setup) is the most trustworthy one, and it is also the one that looks best. Before re-weighting any production routing, test the model on your own workload.

How Much VRAM Does GLM-5.2 Need?

This is where the architecture choices bite. GLM-5.2 is a 753B-parameter model with full KV heads (64), a large head dim (192), and 78 layers — so both the weights and the KV cache are large.

Rough weight footprint, before any KV cache or runtime overhead:

  • BF16 (16-bit): ~1,400 GiB — multi-node, server-grade only.
  • INT8: ~700 GiB.
  • INT4: ~350 GiB — still well beyond any single GPU, but reachable on a multi-GPU server node.
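
These figures follow from simple arithmetic: parameter count times bits per parameter, converted to GiB. The sketch below reproduces them, taking 753e9 as the parameter count (the model card's approximate total) and ignoring quantization metadata such as scales and zero points, so real checkpoints will be somewhat larger.

```python
# Weight-only memory estimate: parameters x bits per parameter, in GiB.
# Ignores KV cache, activations, runtime overhead, and quantization metadata.
TOTAL_PARAMS = 753e9  # approximate total from the model card
GIB = 2**30


def weight_gib(bits_per_param: int, params: float = TOTAL_PARAMS) -> float:
    """Memory needed to hold the weights alone at the given precision."""
    return params * bits_per_param / 8 / GIB


for name, bits in [("BF16", 16), ("INT8", 8), ("INT4", 4)]:
    print(f"{name}: {weight_gib(bits):,.0f} GiB")
```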

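And because KV heads were not reduced, the per-token KV cache is unusually heavy. A naive estimate from the config above (keys and values for 64 heads × 192 dims across 78 layers, stored in BF16) comes to about 3.7 MiB per token, which puts the cache for a full 1M context in the terabyte range per concurrent user before quantization or any cache compression. KV-cache quantization attacks that memory cost directly, while IndexShare targets the compute cost of attending over it. Either way the bill is real, and it is why "1M context" and "runs on a workstation" do not belong in the same sentence for this model.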
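
The per-token cost can be estimated from the published config. The sketch below assumes a plain cache that stores one key and one value vector per KV head in every layer, with no compression, sharing, or eviction. That is an assumption: a real serving stack may store considerably less, so treat the result as an upper bound rather than a measured figure.

```python
# Naive KV-cache estimate from the config values in this article.
# Assumes a plain K and V entry per head per layer, with no compression.
NUM_LAYERS = 78
KV_HEADS = 64
HEAD_DIM = 192
MAX_CONTEXT = 1_048_576


def kv_bytes_per_token(bytes_per_value: float = 2.0) -> float:
    """Bytes cached per token; the factor 2 covers keys and values."""
    return 2 * NUM_LAYERS * KV_HEADS * HEAD_DIM * bytes_per_value


def kv_cache_gib(context_tokens: int, bytes_per_value: float = 2.0) -> float:
    """Cache size in GiB for one sequence of the given length."""
    return kv_bytes_per_token(bytes_per_value) * context_tokens / 2**30


print(f"per token (BF16): {kv_bytes_per_token() / 2**20:.2f} MiB")
for label, width in [("BF16", 2.0), ("8-bit", 1.0)]:
    size = kv_cache_gib(MAX_CONTEXT, width)
    print(f"{label} cache at {MAX_CONTEXT:,} tokens: {size:,.0f} GiB")
```

Because the cache scales linearly in both context length and bytes per value, halving the cache precision halves the bill, and so does halving the context.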

Rather than guess, plug the exact architecture into the calculator. We have added GLM-5.2 as a preset:

Open the AI VRAM Calculator with GLM-5.2 pre-selected →

It breaks weights, KV cache, runtime overhead, and (for training) optimizer states and activations into separate lines for inference, full fine-tuning, and QLoRA — so you can see exactly how quantization and context length move the number. Drop the context to 8K and switch to INT4 to see how aggressively the footprint shrinks, then push context toward 1M to watch the KV cache take over.
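
The same experiment can be approximated offline. This is not the calculator's formula, just a rough stand-in that adds weight memory to a naive, uncompressed BF16 KV cache and ignores runtime overhead; the parameter count and layer figures are the ones quoted in this article, and the function names are illustrative.

```python
# Rough stand-in for the context sweep: weights plus a naive BF16 KV cache.
# Not the calculator's formula; runtime overhead and cache compression are ignored.
TOTAL_PARAMS = 753e9
NUM_LAYERS, KV_HEADS, HEAD_DIM = 78, 64, 192
GIB = 2**30


def weights_gib(bits_per_param: int) -> float:
    return TOTAL_PARAMS * bits_per_param / 8 / GIB


def kv_gib(context_tokens: int, bytes_per_value: float = 2.0) -> float:
    per_token = 2 * NUM_LAYERS * KV_HEADS * HEAD_DIM * bytes_per_value
    return per_token * context_tokens / GIB


def kv_share(context_tokens: int, bits_per_param: int) -> float:
    """Fraction of (weights + KV cache) taken by the KV cache."""
    kv = kv_gib(context_tokens)
    return kv / (kv + weights_gib(bits_per_param))


for context in (8_192, 131_072, 1_048_576):
    print(
        f"context {context:>9,}: INT4 weights {weights_gib(4):,.0f} GiB, "
        f"KV {kv_gib(context):,.0f} GiB, KV share {kv_share(context, 4):.0%}"
    )
```

At short contexts the weights dominate the total; as the context approaches 1M tokens, the cache becomes the larger term under these assumptions.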

The practical takeaway: GLM-5.2 is a server-class model, not a laptop model. For almost everyone, the right first step is the hosted API; self-hosting only makes sense if you have multi-GPU (ideally multi-node) hardware and a real reason to keep weights in-house.

Availability And Licensing

  • Open weights: downloadable from Hugging Face under the MIT license — commercial use allowed, no regional restrictions. An FP8 variant is published alongside the BF16 weights.
  • API: hosted access via the Z.ai API Platform.
  • Chat: a consumer interface at chat.z.ai.
  • Local serving: supported on SGLang, vLLM, Transformers, KTransformers, Unsloth, and Ascend NPU stacks.

MIT is about as permissive as open weights get, and it is a deliberate contrast with the more restrictive community licenses some labs attach. For teams that need to own their stack or operate in regions where API access is awkward, that license is a big part of the appeal.

Who Should Actually Use It

  • Reach for GLM-5.2 via API if you want a strong, open agentic-engineering model and care about long-horizon tasks — multi-file refactors, long agent runs, large-context analysis — without standing up your own cluster.
  • Self-host it only if you have multi-GPU/multi-node hardware and a concrete reason (data residency, cost at scale, customization) to keep weights local. Use INT4/FP8 and KV-cache quantization, and budget VRAM with the calculator first.
  • Prefer Claude Opus 4.8 or GPT-5.5 if your workload is dominated by the specific coding benchmarks where they still lead and you do not need open weights.
  • Compare against other open MoEs — Kimi K2.6/K2.7, DeepSeek-V4 — on your own tasks. They sit in the same server-class tier, and the right choice is workload-specific.

FAQ

What is GLM-5.2?

GLM-5.2 is Z.ai's flagship open-weight large language model, released in June 2026. It is a Mixture-of-Experts model with about 753B total parameters, a 1M-token context window, and MIT-licensed weights on Hugging Face, built specifically for long-horizon agentic tasks.

How big is GLM-5.2 and how many parameters are active?

It has ~753B total parameters but is sparse: only 8 of 256 routed experts (plus 1 shared expert) activate per token, so the active compute per token is a fraction of the total. It has 78 layers, a hidden size of 6,144, and 64 attention heads.

Is GLM-5.2 better than Claude Opus 4.8 or GPT-5.5?

Not uniformly. On Z.ai's own benchmarks GLM-5.2 comes within a point of the top score on MCP-Atlas and posts very high math results (AIME 2026: 99.2), but it trails Claude Opus 4.8 on several pure-coding benchmarks like SWE-bench Pro and NL2Repo. Its real advantage is being open-weight and self-hostable while staying competitive.

What is IndexShare?

IndexShare is GLM-5.2's attention optimization that reuses one sparse-attention indexer across every four layers, cutting per-token FLOPs by about 2.9x at a 1M-token context. It is what makes the very long context efficient enough to be usable in practice.

Can I run GLM-5.2 locally?

Only on serious hardware. In BF16 the weights alone need roughly 1,400 GiB (about 1.5 TB) of memory; even INT4 needs around 350 GiB, which is multi-GPU server territory. It is supported by vLLM, SGLang, Transformers, KTransformers, and Unsloth, but most users should start with the hosted API. Use the AI VRAM Calculator to size it for your quantization and context.

What license is GLM-5.2 under?

MIT — permissive, commercial-use-friendly, with no regional limits, applied to both the BF16 and FP8 weight releases.

Conclusion

GLM-5.2 is a clear statement that open-weight models are now real contenders for long-horizon agentic engineering. The generational jump over GLM-5.1 is large and consistent, the 1M-token context is backed by genuine efficiency work in IndexShare and the improved MTP layer, and the MIT license removes the usual friction around commercial and regional use.

The sober counterpoint is that it still trails the closed frontier on several coding benchmarks, the comparison numbers are vendor-run, and the architecture — full KV heads, deep stack, 753B parameters — makes it firmly a server-class model with a heavy KV-cache bill at long context. The honest move is the same as always: take the open weights or the API, wire it into your own test harness, size the hardware with a real VRAM budget, and let your workload decide.

Sources

  • GLM-5.2 — official model card and config, Hugging Face (huggingface.co/zai-org/GLM-5.2)
  • GLM-5.2 benchmark figure and logo — zai-org/GLM-5 repository (github.com/zai-org/GLM-5)
  • "GLM-5.2: Built for Long-Horizon Tasks" — Z.ai announcement
  • "GLM-5: from Vibe Coding to Agentic Engineering" — GLM-5 technical report
  • GLM-4.6 model card — Hugging Face (huggingface.co/zai-org/GLM-4.6), for lineage context
