
Teaching a Transformer to Solve the Rubik's Cube

Tags: machine learning, transformer, reinforcement learning, GRPO, Rubik's Cube

🎮 Live Demo — Rubik's Cube Transformer Solver

Input a scramble or click "Random" and watch the model solve it step by step in 3D.

⚠️ Hosted on a temporary Cloudflare tunnel — may be unavailable if the server is down. Full source on GitHub.

Why the Rubik's Cube?

I was curious about the so-called scaling law — the idea that adding more parameters and more data monotonically improves model performance. It works spectacularly for language models. But does it hold for combinatorial puzzles? A puzzle where one wrong move can invalidate an entire 20-step chain?

The Rubik's Cube is a great testbed for this. The state space is enormous (~4.3 × 10¹⁹ configurations), yet the rules are perfectly deterministic. You can generate infinite training data with zero annotation cost — just scramble and solve. And there's a known ground truth: every cube is solvable in at most 20 moves (God's Number). That means you can measure success with a crisp boolean: did the model solve it?

So I spent a few months training six versions of a Transformer-based solver, progressively trying different strategies: bigger models, data augmentation, reinforcement learning, difficulty-weighted rewards. Here's what happened.

The code and a live 3D web demo are on GitHub.

The Setup: Representing a Cube

A Rubik's Cube has 6 faces, each with 9 tiles — 54 colored stickers total. I encode each sticker as an integer (0–5), giving a flat 54-dimensional state vector. The model needs to output a sequence of moves from a vocabulary of 18 quarter-turn and half-turn operations (U, U', U2, D, D', D2, F, F', F2, B, B', B2, L, L', L2, R, R', R2).
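The encoding above is simple enough to sketch directly. This is a minimal illustration with hypothetical helper names (`encode_state` is not from the project's codebase), just to pin down the 54-token state and the 18-move vocabulary:

```python
# Hypothetical sketch of the state/move encoding described above.
COLORS = range(6)                                  # one integer per sticker color
FACES = ["U", "D", "F", "B", "L", "R"]
MOVES = [f + s for f in FACES for s in ["", "'", "2"]]   # 18-move vocabulary

def encode_state(stickers):
    # stickers: 54 color ids in a fixed face order -> flat state vector
    assert len(stickers) == 54 and all(c in COLORS for c in stickers)
    return list(stickers)

print(len(MOVES))   # 18
```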

The architecture is a standard encoder-decoder Transformer I call XLCubeSolver:

Input: cube state (54 tokens, each 0–5)
Encoder: embeds + positional encoding → 10 transformer layers
Decoder: autoregressive, attends to encoder → 10 transformer layers
Output: move sequence (up to 30 steps)

d_model = 768, n_heads = 12, ff_dim = 3072
Total parameters: 165.5 million

The decoder generates moves one at a time, attending back to the full encoder representation of the initial state. Think of it like a seq2seq model — the encoder "understands" the scrambled cube, and the decoder "writes out" the solution.
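As a sanity check on the 165.5M figure, the attention and feed-forward weight matrices alone nearly account for it. A rough count (ignoring embeddings, biases, and layer norms, which add the remaining fraction):

```python
# Back-of-envelope parameter count for the encoder-decoder config above.
d_model, ff_dim = 768, 3072
n_enc, n_dec = 10, 10

attn = 4 * d_model * d_model      # Q, K, V, and output projections
ff = 2 * d_model * ff_dim         # two feed-forward matrices

enc_layer = attn + ff             # self-attention + FF
dec_layer = 2 * attn + ff         # self-attention + cross-attention + FF

total = n_enc * enc_layer + n_dec * dec_layer
print(f"{total / 1e6:.1f}M")      # ~165.2M in weight matrices alone
```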

Curriculum Learning (V2 & V3): Crawl Before You Walk

The core challenge is that a randomly scrambled cube (20 moves) is extremely hard to solve directly. So I train with curriculum learning — starting from easy problems and gradually increasing difficulty.

The curriculum has 5 phases based on how many scramble moves were applied (ms):

  • Phase 1 — ms ≤ 3: nearly trivial
  • Phase 2 — ms ≤ 5: still easy
  • Phase 3 — ms ≤ 8: getting harder
  • Phase 4 — ms ≤ 12: moderate
  • Phase 5 — ms ≤ 20: the real challenge

Each phase trains the model until it reaches a target solve rate on that difficulty level before advancing. The training data is generated on-the-fly: scramble a solved cube by ms random moves, then use the known solution as the label.

def generate_sample(max_scramble_steps):
    cube = solved_cube()
    moves = random_moves(k=random.randint(1, max_scramble_steps))
    scrambled = apply(cube, moves)
    solution = invert(moves)   # ground truth: each move inverted, in reverse order
    return scrambled, solution
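One detail worth spelling out: undoing a scramble is not just the move list played backwards — each move must also be inverted (U becomes U', while half turns like U2 are self-inverse). A minimal sketch of that inversion step:

```python
def invert_move(m):
    # U -> U', U' -> U, U2 -> U2 (half turns are self-inverse)
    if m.endswith("2"):
        return m
    if m.endswith("'"):
        return m[:-1]
    return m + "'"

def invert(moves):
    # The inverse sequence: invert each move, applied in reverse order.
    return [invert_move(m) for m in reversed(moves)]

print(invert(["U", "R'", "F2"]))   # ["F2", "R", "U'"]
```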

V2 used a 58.9M parameter model. V3 scaled up to 165.5M (XLCubeSolver). Results for ms = 20 solve rate:

Version   Method                   Params    ms=20 Solve Rate
V2        Supervised (SL)           58.9M    47.0%
V3        Supervised (SL)          165.5M    44.2%
V4        SL + data augmentation   165.5M    48.6%

Notice something uncomfortable: scaling from 58.9M to 165.5M parameters didn't help — V3 was actually slightly worse than V2. Adding data augmentation (V4) recovered the lost ground and inched slightly ahead. But all three versions plateau between 45–49%. Something structural is going on.

Reinforcement Learning with GRPO (V5 & V6): Breaking the Ceiling?

Supervised learning teaches the model to imitate the training solutions, which are themselves just reversed scramble sequences — not necessarily the shortest or most generalizable paths. What if we let the model discover solutions on its own, and reinforce whatever works?

This is where GRPO (Group Relative Policy Optimization) comes in. GRPO is an RL algorithm popularized by DeepSeek in 2024. Unlike PPO, it doesn't need a separate value network — it computes advantages within a group of trajectories sampled for the same input.

How GRPO Works (for this problem)

For each training step:

  1. Sample a batch of scrambled cubes.
  2. For each cube, sample N = 8 different move sequences from the model (with temperature > 1 for exploration).
  3. Execute each sequence and record a binary reward: 1.0 if the cube ends up solved, 0.0 otherwise.
  4. Compute group-relative advantage: subtract the mean reward and divide by the standard deviation of the group. Successful paths get positive advantage; failed paths get negative.
  5. Update the model via PPO-style clipped gradient + KL penalty against a frozen reference model.

# For each cube in the batch:
sequences = [sample(model, cube, temp=1.2) for _ in range(8)]
rewards   = [1.0 if solves(seq, cube) else 0.0 for seq in sequences]

mean_r = mean(rewards)
std_r  = std(rewards)
advantages = [(r - mean_r) / (std_r + 1e-8) for r in rewards]   # epsilon avoids div-by-zero when all rewards agree

# PPO clip update
for seq, adv in zip(sequences, advantages):
    ratio = exp(log_prob_new(seq) - log_prob_old(seq))
    clipped = clip(ratio, 0.8, 1.2)
    loss = -min(ratio * adv, clipped * adv)
    loss += beta * KL(model, ref_model)   # beta = 0.02

The key insight: GRPO doesn't need any external oracle. The reward is purely physical — did the sequence actually solve the cube? No Kociemba, no lookup table, just physics as the judge.

V5: Standard GRPO

V5 fine-tunes from the V3 Phase 4 checkpoint using GRPO with a 3-phase curriculum (ms=12 → ms=15 → ms=20). The results are striking at the easier difficulties but tell a sobering story at ms=20:

Phase   Scramble   V3 SL ceiling   V5 GRPO best
1       ms = 12    65.8%           75.0%
2       ms = 15    —               59.3%
3       ms = 20    47–48%          45.3% ❌

GRPO clearly beats supervised learning at ms=12 — a 9 percentage point jump. But at ms=20, it plateaus at the same ~45% neighborhood. The ceiling didn't move.

V6: Difficulty-Weighted Reward

Here's a subtle flaw in standard GRPO for this task: the training curriculum mixes easy cubes (ms=3, already ~96% solved) and hard cubes (ms=20, ~5% solved) in the same batch. For the easy cubes, almost all 8 sampled trajectories succeed — so all advantages are near zero, and the gradient signal is tiny. The model gets little useful feedback from problems it already knows how to solve.
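The vanishing-gradient-on-easy-cubes problem is easy to see numerically. A toy illustration of the group-relative advantage computation (`group_advantages` is a stand-in name, not from the project's codebase):

```python
from statistics import mean, pstdev

def group_advantages(rewards, eps=1e-8):
    # Group-relative advantage: standardize rewards within one group.
    mu, sigma = mean(rewards), pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]

easy_group = [1.0] * 8           # ms=3: all 8 rollouts solve the cube
hard_group = [1.0] + [0.0] * 7   # ms=20: one lucky success

print(group_advantages(easy_group))   # all zeros -> no gradient signal
print(group_advantages(hard_group))   # strong positive/negative advantages
```

When every trajectory in the group gets the same reward, the standardized advantages collapse to zero — exactly the "nothing to learn from" regime described above.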

V6 addresses this by multiplying each reward by a difficulty weight based on the actual scramble depth:

from math import exp

def reward_weight(ms, R=15, floor=0.05):
    # inverted Gaussian: near-zero for small ms, approaches 1 for ms=20
    return max(floor, 1 - exp(-(ms / R) ** 2))

# Example values:
# ms=3  → 0.050 (floor)
# ms=8  → 0.248
# ms=12 → 0.473
# ms=15 → 0.632
# ms=20 → 0.831

reward = float(solved) * reward_weight(scramble_steps)

This pushes the gradient signal toward the hard problems, where the model actually needs to learn. The results:

Version              ms=12   ms=15   ms=20 (best)
V5 GRPO              75.0%   59.3%   45.3%
V6 GRPO + weighted   72.3%   60.3%   49.0%

V6 reached a new peak of 49.0% at ms=20 — the best across all six versions. But even this didn't break through 50%. The curve oscillated between 38–49% across 60 epochs without a clear upward trend.

The Ceiling: Why ~50% is a Structural Limit

After six training runs, the pattern is unmistakable:

V2  (SL,  58.9M)        → 47.0%
V3  (SL, 165.5M)        → 44.2%
V4  (SL + augmentation) → 48.6%
V5  (GRPO)              → 45.3%
V6  (GRPO + weighted)   → 49.0%  ← best, but still <50%

This isn't a training issue. The problem is deeper: greedy inference on a long action sequence.

All six models generate solutions step by step, picking the highest-probability move at each timestep. This is inherently fragile for ms=20. Think about it: a 20-step chain has 20 decisions. One slip — one wrong move in the middle — and the rest of the sequence likely fails. For a ~50% success rate per step (a rough intuition), that's 0.5²⁰ ≈ 0.000001. The model must do dramatically better than 50% at every step, which means it needs to find the very specific sequences that work, not just the most likely ones at each position.
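Running the back-of-envelope math the other way is even more sobering. Assuming independent steps, here is the per-move accuracy a greedy decoder would need just to reach a 50% end-to-end solve rate on a 20-move chain:

```python
# Per-step accuracy needed for a 50% end-to-end solve rate over 20
# greedy decisions (back-of-envelope; assumes independent steps).
per_step_needed = 0.5 ** (1 / 20)
print(f"{per_step_needed:.4f}")   # ~0.9659 -> about 96.6% per move

# And the converse: 50% per step compounds to near-zero over 20 steps.
print(0.5 ** 20)                  # ~9.5e-07
```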

What's missing is search at inference time. The model can't backtrack. It can't explore alternatives. It commits greedily and hopes for the best. More parameters and more RL tricks can move the needle slightly, but they can't fix this fundamental architecture choice.

What Comes Next: Search

The literature has known this for a while. DeepCubeA (Agostinelli et al., 2019) solved this by training a value network — not a policy, but a function V(s) that estimates how far state s is from the solved state. At inference, you use this value function as a heuristic for A* search, exploring the move tree guided by V(s) rather than committing greedily. The result: near 100% solve rate on ms=20.

The three most promising next steps, in order of effort:

  1. Beam search (hours of work): No retraining needed. Keep the top-K candidate solutions at each step. Expected: 65–75% ms=20 with beam width = 128.
  2. Kociemba supervised fine-tuning (1–2 days): Use the Kociemba algorithm to generate optimal solutions, then fine-tune on those. Expected: 70–80% ms=20.
  3. Value network + A* search (DeepCubeA style, 1–2 weeks): Add a value head, generate training data via backward BFS from the solved state. Expected: 95%+.
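A minimal sketch of what option 1 could look like. This is not the project's implementation; `score_moves`, `apply_move`, and `is_solved` are hypothetical stand-ins for the model's per-step log-probabilities and the cube simulator:

```python
import heapq

def beam_search(initial_state, score_moves, apply_move, is_solved,
                beam_width=128, max_depth=30):
    # Each beam entry: (cumulative log-prob, state, move sequence).
    beam = [(0.0, initial_state, [])]
    for _ in range(max_depth):
        candidates = []
        for logp, state, seq in beam:
            if is_solved(state):
                return seq               # highest-scoring solved candidate
            for move, lp in score_moves(state):
                candidates.append((logp + lp, apply_move(state, move), seq + [move]))
        # Keep only the top-K partial solutions by cumulative log-prob.
        beam = heapq.nlargest(beam_width, candidates, key=lambda c: c[0])
    return None                          # no solution within max_depth
```

Unlike greedy decoding, a low-probability but correct move can survive in the beam as long as the sequence containing it stays in the top K overall.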

All of this confirms what the scaling law crowd already knows, but in a more specific way: for combinatorial search problems, architecture matters more than scale. A 58.9M model with beam search would likely destroy a 165.5M pure policy model on this task.

Lessons Learned

The scaling law has limits. In language modeling, bigger models absorb more statistical patterns from text. But the Rubik's Cube isn't statistical — it's logical. The model needs to reason across a long decision horizon, and raw capacity doesn't substitute for the right inductive biases.

GRPO is genuinely powerful — but not magic. At ms=12, GRPO jumped from 65.8% to 75%, a real breakthrough. At ms=20, it hit the same wall as supervised learning. The reward signal is just too sparse at that difficulty: out of 8 sampled 20-step sequences, often zero succeed, so there's nothing to learn from.

Difficulty-weighted reward is a clean idea worth testing in other RL contexts. The intuition is simple: the model already knows how to solve easy cases, so those experiences carry little gradient signal. Reweighting by difficulty concentrates learning where it's actually needed.

Inference-time compute matters. The entire field has been rediscovering this — AlphaGo, chain-of-thought prompting, o1-style reasoning. For the Rubik's Cube, the lesson is the same: if you can't explore at inference, you're leaving a lot of accuracy on the table.

The Web Demo

The project includes a Three.js 3D visualizer — you can input a scramble and watch the model solve it in real time, move by move. The interface runs in the browser and hits an inference server running XLCubeSolver V3 (the strongest checkpoint before the catastrophic-forgetting bug I hit during Phase 5 training).

The full code — training scripts for V2 through V6, the inference server, the web demo — is on github.com/Arsene-flyingcat/proj-cube.

Closing Thoughts

When I started this project I half-expected the scaling law to just... work. Throw a big enough Transformer at the problem, train it long enough, and it solves the cube. That's the story for text, after all.

What I got instead was a cleaner lesson about where scaling works and where it doesn't. Language models benefit from scale because there's a huge amount of latent structure in language that larger models can extract. The Rubik's Cube has structure too — but it's the kind you need to actively search, not passively pattern-match. A trained policy alone is a memorized reflex. What you actually need is a trained intuition plus the ability to think ahead.

Six training runs, 165 million parameters, weeks of GPU time — and the most important insight turned out to be architectural rather than statistical. That feels right.
