Code is the right action interface for spatial reasoning agents.
†Affiliated with KAIST. Work done during Seokju Cho's internship at NVIDIA.
SpatialClaw lets a VLM-backed agent write Python in a persistent kernel, composing perception modules, inspecting intermediate results, and revising its strategy across steps.
It is training-free, with no benchmark- or model-specific adaptation, yet it beats a recent prior agent by +11.2 points on 20 benchmarks and improves consistently across six VLM backbones.
The capability of a tool-augmented agent is bounded not by which tools are available, but by how they can be composed.
Three action interfaces for spatial reasoning agents, contrasted in panels (a)–(c) and detailed below.
(a) Single-pass code. Writes a complete Python program in one shot. It cannot revise once execution starts, so any wrong assumption propagates straight to the answer.
(b) Structured tool-call. Dispatches typed tools through a fixed JSON interface. It cannot freely combine perception outputs with NumPy / SciPy primitives to express test-time computations.
(c) Code as action (SpatialClaw). Code is the action interface, backed by a persistent Python kernel. Perception outputs are ordinary variables: composable, inspectable, and reusable across steps.
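To make the persistent-kernel idea concrete, the following is a minimal sketch, not SpatialClaw's actual implementation. It assumes each agent step is a code string executed against one shared namespace, so variables created in an earlier step stay available to later ones. `PersistentKernel` and the per-frame depth values are hypothetical, chosen only for illustration.

```python
import contextlib
import io


class PersistentKernel:
    """Toy stand-in for a persistent Python kernel (hypothetical name)."""

    def __init__(self):
        # One namespace shared by every step; this is what makes state persist.
        self.namespace = {}

    def run(self, code: str) -> str:
        """Execute one step and return whatever it printed."""
        buffer = io.StringIO()
        with contextlib.redirect_stdout(buffer):
            exec(code, self.namespace)
        return buffer.getvalue()


kernel = PersistentKernel()

# Step 1: store an intermediate result (made-up per-frame depths in metres).
kernel.run("depths = [2.4, 2.1, 1.7]")

# Step 2: inspect the intermediate result before committing to a strategy.
observation = kernel.run("print(min(depths), max(depths))")

# Step 3: act on the observation, reusing `depths` from step 1.
kernel.run("approaching = all(a > b for a, b in zip(depths, depths[1:]))")
answer = kernel.namespace["approaching"]
```

A single-pass program would have to commit to step 3 before seeing the output of step 2; here the agent can read `observation` first and choose a different third step if the values look wrong.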
A real agent session, start to finish: it writes code, inspects the result, and revises.
A +11.2 pp margin over the prior spatial agent, and improvement on 19 of 20 benchmarks on the same backbone.
Per-category average accuracy. SpatialClaw consistently improves over the no-tool baseline across all six backbones, spanning two model families and 26B–397B parameters, with no benchmark- or model-specific tuning.
| Backbone | Method | Single-img | Multi-view | General | Video & 4D | Video Und. | Average | Δ |
|---|---|---|---|---|---|---|---|---|
| Qwen 3.5-397B-A17B | No-tool | 59.0 | 53.9 | 65.1 | 53.3 | 58.1 | 57.3 | – |
| | SpatialClaw | 60.8 | 60.7 | 64.7 | 58.5 | 59.7 | 60.4 | +3.1 |
| Qwen 3.5-122B-A10B | No-tool | 52.9 | 48.3 | 62.5 | 50.5 | 57.1 | 53.7 | – |
| | SpatialClaw | 58.8 | 54.6 | 62.3 | 54.3 | 56.5 | 56.9 | +3.2 |
| Qwen 3.6-35B-A3B | No-tool | 54.5 | 43.7 | 60.3 | 49.8 | 55.7 | 52.6 | – |
| | SpatialClaw | 58.8 | 55.4 | 62.4 | 54.0 | 57.8 | 57.2 | +4.6 |
| Qwen 3.6-27B | No-tool | 58.6 | 49.0 | 62.3 | 51.3 | 55.9 | 55.0 | – |
| | SpatialClaw | 61.7 | 64.0 | 67.0 | 60.6 | 62.7 | 62.7 | +7.7 |
| Gemma 4-31B | No-tool | 55.6 | 50.2 | 62.4 | 47.6 | 55.5 | 53.4 | – |
| | SpatialClaw | 61.9 | 62.5 | 64.8 | 55.1 | 59.4 | 59.9 | +6.5 |
| Gemma 4-26B-A4B | No-tool | 50.7 | 42.5 | 59.0 | 40.5 | 51.6 | 48.0 | – |
| | SpatialClaw | 56.0 | 56.8 | 62.7 | 48.7 | 52.8 | 54.3 | +6.3 |
All values are accuracy in %. The largest per-benchmark gains include DSI-Bench +17.6 pp (4D), MindCube +15.3 pp (multi-view), and MMSI +13.4 pp, in categories that benefit most from iterative multi-step geometric computation.
Same toolset, same prompt. Only the action interface differs. Gemma 4-31B, 20-benchmark average.
| Action interface | Avg. | Δ |
|---|---|---|
| No-tool baseline | 53.4 | – |
| Single-pass code | 55.2 | +1.8 |
| Structured tool-call | 56.7 | +3.3 |
| SpatialClaw (code as action) | 59.9 | +6.5 |
All methods use the same Gemma 4-31B backbone with official implementations.
| Method | Interface | Avg. | Δ |
|---|---|---|---|
| VADAR (Marsili '25) | Single-pass | 40.5* | −19.4 |
| pySpatial (Luo '26) | Single-pass | 47.8 | −12.1 |
| SpaceTools-Toolshed (Chen '26) | Tool-call | 48.7 | −11.2 |
| SpatialClaw | Code as action | 59.9 | best |
*VADAR does not support video / multi-image inputs; only single-image benchmarks averaged.
The gains trace to the action interface itself, not to engineered utilities or to the perception tooling.
We ablate two design choices. (I) No utility functions: remove all utility wrappers (tools.Mask, tools.Geometry, …), keeping only core perception tools (SAM 3, DA 3) and scientific libraries (NumPy, SciPy). (II) No perception tools: remove SAM 3 / DA 3 entirely, leaving only the code-as-action interface with scientific libraries.
Variant (I) nearly matches the full configuration (56.4 vs 56.9 avg): the persistent kernel with scientific primitives largely compensates for the absent utility wrappers. Variant (II) still improves over the no-tool baseline by +2.7 pp (51.4 vs 48.7 avg), isolating the contribution of the action interface itself, independent of the perception tools.
| Variant | Single-image | Multi-view | General | Video & 4D | Average | Δ |
|---|---|---|---|---|---|---|
| SpatialClaw (Full) | 53.8 | 60.0 | 56.8 | 54.9 | 56.9 | – |
| (I) No utility functions | 53.4 | 59.4 | 55.8 | 54.4 | 56.4 | −0.5 |
| (II) No perception tools | 53.1 | 46.1 | 53.3 | 47.9 | 51.4 | −5.5 |
| No-tool baseline | 48.1 | 43.3 | 53.1 | 46.1 | 48.7 | −8.2 |
Gemma 4-26B-A4B backbone, 15 benchmarks, 500 samples each.
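As a rough illustration of what variant (I) leaves the agent to write by hand, the sketch below lifts masked depth pixels to 3D with a pinhole camera model and measures the distance between two object centroids using only NumPy. The depth map, masks, and intrinsics are synthetic stand-ins for perception outputs, and `backproject` is a hypothetical helper rather than one of the removed utilities.

```python
import numpy as np


def backproject(depth, mask, fx, fy, cx, cy):
    """Lift masked pixels of a depth map to 3D camera-frame points (pinhole model)."""
    v, u = np.nonzero(mask)          # pixel rows (v) and columns (u) inside the mask
    z = depth[v, u]
    x = (u - cx) * z / fx
    y = (v - cy) * z / fy
    return np.stack([x, y, z], axis=1)


# Synthetic stand-ins for perception outputs (assumed shapes, not real model output).
depth = np.full((4, 4), 2.0)         # left half of the image is 2 m away
depth[:, 2:] = 4.0                   # right half of the image is 4 m away
mask_a = np.zeros((4, 4), dtype=bool)
mask_a[1:3, 0:2] = True              # object A in the left half
mask_b = np.zeros((4, 4), dtype=bool)
mask_b[1:3, 2:4] = True              # object B in the right half

fx = fy = 2.0                        # assumed focal lengths in pixels
cx = cy = 1.5                        # assumed principal point

centroid_a = backproject(depth, mask_a, fx, fy, cx, cy).mean(axis=0)
centroid_b = backproject(depth, mask_b, fx, fy, cx, cy).mean(axis=0)
distance = float(np.linalg.norm(centroid_a - centroid_b))
```

Because the masks and depth map are plain arrays, the whole computation is a handful of NumPy calls, which is consistent with the small gap between variant (I) and the full configuration.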

Primitive-usage frequency across 13 meta-categories.
Without any category-specific prompt or tool routing, the agent selects geometrically appropriate primitives purely from question semantics, for example KDTree search and vector norms.
This spontaneous, task-adaptive composition is precisely the behavior structured tool-call interfaces struggle to elicit, and it is what an expressive action interface unlocks.
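For a sense of what composing these primitives looks like, here is a small sketch that answers a "how close are these two objects" style question with a KDTree nearest-neighbour query and a vector norm. The two point sets are made-up coordinates standing in for object point clouds.

```python
import numpy as np
from scipy.spatial import KDTree

# Hypothetical 3D points (metres) sampled from two objects.
points_a = np.array([[0.0, 0.0, 0.0], [1.0, 0.0, 0.0]])
points_b = np.array([[4.0, 0.0, 0.0], [1.0, 3.0, 4.0]])

# Nearest neighbour in B for every point of A.
tree = KDTree(points_b)
nearest_dist, nearest_idx = tree.query(points_a)

# Closest surface-to-surface gap versus centroid-to-centroid distance.
min_gap = float(nearest_dist.min())
centroid_gap = float(np.linalg.norm(points_a.mean(axis=0) - points_b.mean(axis=0)))
```

The two quantities differ (`min_gap` measures the closest pair of points, `centroid_gap` the distance between object centres), and choosing between them depends on how the question is phrased.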

SpatialClaw secures a net advantage in 11 of 13 meta-categories over both Structured tool-call and Single-pass code. The largest lifts (+6 to +9 pp) concentrate in Camera motion, Multi-view / viewpoint reasoning, and Relative direction, the categories that require chained geometric computation across frames and viewpoints. Where gains are smaller, the bottleneck is perception quality on tasks already near-saturated by the backbone VLM.
Together, this breakdown confirms that the expressive action interface, rather than model capacity or tool coverage, is the primary driver of performance.
For every sample where SpatialClaw is correct but structured tool-call fails, an LLM judge (Gemini 3 Pro) reads both reasoning traces and the ground-truth answer, then assigns attribution categories, such as if / for branching over intermediate results.
Over 70% of wins trace directly to capabilities a fixed-API tool interface cannot easily provide.
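The branching capability mentioned above can be sketched in a few lines. This example assumes a hypothetical list of per-frame mask areas for one tracked object and uses a deliberately simple heuristic (apparent size growth suggests the object is getting closer); neither the data nor the heuristic comes from the paper.

```python
# Hypothetical per-frame mask areas (pixels); 0 means the object was not detected.
mask_areas = [0, 0, 1800, 2400, 0, 3100]
MIN_AREA = 500  # assumed threshold below which a detection is not trusted

visible_frames = []
for frame_idx, area in enumerate(mask_areas):
    if area < MIN_AREA:
        continue  # skip frames where the object is missing or too small
    visible_frames.append(frame_idx)

if len(visible_frames) >= 2:
    # Compare the first and last trustworthy frames.
    first, last = visible_frames[0], visible_frames[-1]
    growth = mask_areas[last] / mask_areas[first]
    verdict = "closer" if growth > 1.0 else "farther"
else:
    # Not enough evidence; the agent would pick a different strategy here.
    verdict = "undetermined"
```

The loop and both branches depend on values that only exist after perception has run, which is the kind of data-dependent control flow a fixed JSON tool schema has no direct way to express.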

LLM-as-judge attribution of wins over structured tool-call.
The gains come from the action interface itself, not from engineered utilities or perception tooling. Removing utility wrappers leaves performance essentially unchanged; removing perception tools still beats the no-tool baseline. Code is the right abstraction for spatial reasoning agents.
Every trajectory is a real run: the question, the code the agent wrote, the images it inspected, and the answer it submitted.




scipy.spatial. 76 cross-step variable references; 6 targeted VLM queries.


scipy.spatial. Iterative visual inspection: 12 show() calls.


show() calls, 17 targeted VLM queries: the persistent kernel at its most compositional.


matplotlib; computes cross products for orientation.


matplotlib for intermediate visualization; cross-product for viewpoint orientation.


matplotlib visualization to disambiguate subtask intent across frames.
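Several of the trajectories above use cross products to resolve orientation. A minimal sketch of that technique follows, assuming a right-handed world frame with z pointing up; `relative_side` and the example coordinates are hypothetical.

```python
import numpy as np

UP = np.array([0.0, 0.0, 1.0])  # assumed world up axis (right-handed, z up)


def relative_side(camera_pos, camera_forward, target_pos, tol=1e-9):
    """Classify a target as left or right of the camera's viewing direction."""
    to_target = np.asarray(target_pos, float) - np.asarray(camera_pos, float)
    # The sign of (forward x to_target) along the up axis tells which side the target is on.
    signed = float(np.dot(np.cross(camera_forward, to_target), UP))
    if signed > tol:
        return "left"
    if signed < -tol:
        return "right"
    return "ahead or behind"


camera_pos = [0.0, 0.0, 0.0]
camera_forward = [0.0, 1.0, 0.0]  # camera looks along +y

side_one = relative_side(camera_pos, camera_forward, [-2.0, 5.0, 0.0])
side_two = relative_side(camera_pos, camera_forward, [3.0, 1.0, 0.0])
side_three = relative_side(camera_pos, camera_forward, [0.0, 4.0, 0.0])
```

With a different axis convention (for example y up or a left-handed frame) the sign flips, so the convention has to be fixed before the result is interpreted.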

@article{cho2026spatialclaw,
title = {SpatialClaw: Rethinking Action Interface for Agentic Spatial Reasoning},
author = {Cho, Seokju and Hachiuma, Ryo and Badki, Abhishek and
Su, Hang and Lee, Byung-Kwan and Song, Chan Hee and
Liu, Sifei and Radhakrishnan, Subhashree and Kim, Seungryong and
Wang, Yu-Chiang Frank and Chen, Min-Hung},
journal = {arXiv preprint},
year = {2026}
}