SpatialClaw: Rethinking Action Interface for Agentic Spatial Reasoning

01 The thesis

Code is the right action interface for spatial reasoning.

SpatialClaw lets a VLM-backed agent write Python in a persistent kernel, composing perception modules, inspecting intermediate results, and revising its strategy across steps.

It is training-free, with no benchmark- or model-specific adaptation, yet it beats a recent prior agent by +11.2 points on 20 benchmarks and improves consistently across six VLM backbones.

02 The problem

Why the action interface matters.

The capability of a tool-augmented agent is bounded not by which tools are available, but by how they can be composed.

Three action interfaces for spatial reasoning agents, contrasted in panels (a)–(c) and detailed below.

(a) Single-pass code

Commit before observing

Writes a complete Python program in one shot. Cannot revise once execution starts, so any wrong assumption propagates straight to the answer.

(b) Structured tool-call

Limited composition

Dispatches typed tools through a fixed JSON interface. Cannot freely combine perception outputs with NumPy / SciPy primitives to express test-time computations.

(c) SpatialClaw

Compose, inspect, revise

Code as the action interface, backed by a persistent Python kernel. Perception outputs are ordinary variables: composable, inspectable, and reusable across steps.
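
To make the idea of a persistent kernel concrete, here is a minimal sketch using only the Python standard library. The `PersistentKernel` class and the hard-coded `points` list are hypothetical stand-ins invented for illustration; they are not SpatialClaw's actual kernel or perception tools. The point is only that state written in one step remains an ordinary variable in the next.

```python
import contextlib
import io


class PersistentKernel:
    """Toy stand-in for a persistent kernel: one namespace shared by all steps."""

    def __init__(self):
        self.namespace = {}

    def run(self, code: str) -> str:
        """Execute one agent step and return whatever it printed."""
        buffer = io.StringIO()
        with contextlib.redirect_stdout(buffer):
            exec(code, self.namespace)
        return buffer.getvalue()


kernel = PersistentKernel()

# Step 1: a (fake) perception call stores its output as an ordinary variable.
kernel.run("points = [(0.0, 0.0, 1.0), (0.0, 3.0, 1.0), (4.0, 0.0, 1.0)]")

# Step 2: the agent inspects the intermediate result before choosing what to do next.
observation = kernel.run("print(len(points), 'points; first =', points[0])")

# Step 3: a later step reuses the same variable to compute a distance.
answer = kernel.run(
    "import math\n"
    "print(round(math.dist(points[1], points[2]), 2))"
)
```

A single-pass program would have to commit to step 3 before seeing the output of step 2; here each step can be written after reading the previous observation.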

03 Live walkthrough

Watch SpatialClaw solve a benchmark question.

A real agent session, start to finish: it writes code, inspects the result, and revises.

50 more reasoning trajectories below

04 Results

Consistent gains across every backbone.

A +11.2 pp margin over the prior spatial agent, and improvement on 19 of 20 benchmarks on the same backbone.

✓ Improves on 19 of 20 benchmarks
✓ Average gain of +6.5 pp over the no-tool baseline (Gemma 4-31B)
✓ +11.2 pp over a recent prior agent (SpaceTools-Toolshed)

Main results across 20 benchmarks & 6 backbones

Per-category average accuracy. SpatialClaw consistently improves over the no-tool baseline across all six backbones, spanning two model families and 26B–397B parameters, with no benchmark- or model-specific tuning.

Backbone	Method	Single-image	Multi-view	General	Video & 4D	Video understanding	Average	Δ
Qwen 3.5-397B-A17B	No-tool	59.0	53.9	65.1	53.3	58.1	57.3	–
Qwen 3.5-397B-A17B	SpatialClaw	60.8	60.7	64.7	58.5	59.7	60.4	+3.1
Qwen 3.5-122B-A10B	No-tool	52.9	48.3	62.5	50.5	57.1	53.7	–
Qwen 3.5-122B-A10B	SpatialClaw	58.8	54.6	62.3	54.3	56.5	56.9	+3.2
Qwen 3.6-35B-A3B	No-tool	54.5	43.7	60.3	49.8	55.7	52.6	–
Qwen 3.6-35B-A3B	SpatialClaw	58.8	55.4	62.4	54.0	57.8	57.2	+4.6
Qwen 3.6-27B	No-tool	58.6	49.0	62.3	51.3	55.9	55.0	–
Qwen 3.6-27B	SpatialClaw	61.7	64.0	67.0	60.6	62.7	62.7	+7.7
Gemma 4-31B	No-tool	55.6	50.2	62.4	47.6	55.5	53.4	–
Gemma 4-31B	SpatialClaw	61.9	62.5	64.8	55.1	59.4	59.9	+6.5
Gemma 4-26B-A4B	No-tool	50.7	42.5	59.0	40.5	51.6	48.0	–
Gemma 4-26B-A4B	SpatialClaw	56.0	56.8	62.7	48.7	52.8	54.3	+6.3

All values are accuracy in %. The largest per-benchmark gains include DSI-Bench +17.6 pp (4D), MindCube +15.3 pp (multi-view), and MMSI +13.4 pp, in categories that benefit most from iterative multi-step geometric computation.
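
The Δ column can be re-derived from the Average column. The sketch below does that arithmetic and also takes an unweighted mean of the six gains; that mean is computed here for illustration and is not a figure reported in the source.

```python
# (backbone, no-tool average, SpatialClaw average), copied from the table above.
rows = [
    ("Qwen 3.5-397B-A17B", 57.3, 60.4),
    ("Qwen 3.5-122B-A10B", 53.7, 56.9),
    ("Qwen 3.6-35B-A3B", 52.6, 57.2),
    ("Qwen 3.6-27B", 55.0, 62.7),
    ("Gemma 4-31B", 53.4, 59.9),
    ("Gemma 4-26B-A4B", 48.0, 54.3),
]

# Gain in percentage points for each backbone, rounded like the table.
deltas = {name: round(ours - base, 1) for name, base, ours in rows}

# Unweighted mean over backbones (an illustrative summary, not from the source).
mean_gain = round(sum(deltas.values()) / len(deltas), 1)

smallest = min(deltas, key=deltas.get)
largest = max(deltas, key=deltas.get)
```

With these numbers the per-backbone gains range from +3.1 pp (Qwen 3.5-397B-A17B) to +7.7 pp (Qwen 3.6-27B), and the unweighted mean is about +5.2 pp.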

Action interface ablation

Same toolset, same prompt. Only the action interface differs. Gemma 4-31B, 20-benchmark average.

Action interface	Avg.	Δ
No-tool baseline	53.4	–
Single-pass code	55.2	+1.8
Structured tool-call	56.7	+3.3
SpatialClaw (code as action)	59.9	+6.5

Comparison with prior spatial agents

All methods use the same Gemma 4-31B backbone with official implementations.

Method	Interface	Avg.	Δ
VADAR (Marsili '25)	Single-pass	40.5*	−19.4
pySpatial (Luo '26)	Single-pass	47.8	−12.1
SpaceTools-Toolshed (Chen '26)	Tool-call	48.7	−11.2
SpatialClaw	Code as action	59.9	best

*VADAR does not support video or multi-image inputs, so its average covers single-image benchmarks only.

05 Analysis & insights

Four findings on why code works.

The gains trace to the action interface itself, not to engineered utilities or to the perception tooling.

Finding 1

SpatialClaw generalizes across diverse spatial tasks even without pre-defined utility tools.

We ablate two design choices. (I) No utility functions: remove all utility wrappers (tools.Mask, tools.Geometry, …), keeping only the core perception tools (SAM 3 and Depth Anything 3, abbreviated DA 3) and scientific libraries (NumPy, SciPy). (II) No perception tools: remove SAM 3 and DA 3 entirely, leaving only the code-as-action interface with scientific libraries.

Variant (I) nearly matches the full configuration (56.4 vs. 56.9 average). The persistent kernel with scientific primitives largely compensates for the absent utility tools. Variant (II) still improves over the no-tool baseline by +2.7 pp (51.4 vs. 48.7), isolating the contribution of the action interface itself, independent of the perception tools.

Variant	Single-image	Multi-view	General	Video & 4D	Average	Δ
SpatialClaw (Full)	53.8	60.0	56.8	54.9	56.9	–
(I) No utility functions	53.4	59.4	55.8	54.4	56.4	−0.5
(II) No perception tools	53.1	46.1	53.3	47.9	51.4	−5.5
No-tool baseline	48.1	43.3	53.1	46.1	48.7	−8.2

Gemma 4-26B-A4B backbone, 15 benchmarks, 500 samples each.

Finding 2

The agent spontaneously adapts its tool composition to the question type.

Primitive-usage frequency across 13 meta-categories.

Without any category-specific prompt or tool routing, the agent selects geometrically appropriate primitives purely from question semantics:

Distance questions invoke KDTree search and vector norms.
Direction questions rely on dot products and angular operations.
Camera motion draws on pose composition and transformation chains.

This spontaneous, task-adaptive composition is precisely the behavior structured tool-call interfaces struggle to elicit, and it is what an expressive action interface unlocks.
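
As an illustration of the third pattern above (pose composition for camera motion), here is a small NumPy sketch. The helper names, the 4x4 camera-to-world convention, and the example poses are assumptions made for this example; they are not taken from SpatialClaw's traces.

```python
import numpy as np


def make_pose(rotation_y_deg, translation):
    """Build a 4x4 camera-to-world pose from a rotation about the y axis and a translation."""
    theta = np.deg2rad(rotation_y_deg)
    c, s = np.cos(theta), np.sin(theta)
    pose = np.eye(4)
    pose[:3, :3] = [[c, 0.0, s], [0.0, 1.0, 0.0], [-s, 0.0, c]]
    pose[:3, 3] = translation
    return pose


def relative_motion(pose_a, pose_b):
    """Pose of camera b expressed in camera a's frame."""
    return np.linalg.inv(pose_a) @ pose_b


def rotation_angle_deg(rotation):
    """Rotation angle recovered from the trace of a 3x3 rotation matrix."""
    cos_angle = (np.trace(rotation) - 1.0) / 2.0
    return float(np.degrees(np.arccos(np.clip(cos_angle, -1.0, 1.0))))


# Two hypothetical camera poses: the camera moves 2 units along z and turns 90 degrees.
pose_a = make_pose(0.0, [1.0, 0.0, 0.0])
pose_b = make_pose(90.0, [1.0, 0.0, 2.0])

rel = relative_motion(pose_a, pose_b)
translation = rel[:3, 3]                   # displacement in camera a's frame
angle = rotation_angle_deg(rel[:3, :3])    # how far the camera turned
```

Chaining `np.linalg.inv` and `@` like this is the kind of free-form composition that a fixed set of typed tool calls would need a dedicated endpoint for.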

Finding 3

Gains are largest where chained geometric computation across frames and viewpoints is required.

SpatialClaw secures a net advantage in 11 of 13 meta-categories over both Structured tool-call and Single-pass code. The largest lifts (+6 to +9 pp) concentrate in Camera motion, Multi-view / viewpoint reasoning, and Relative direction, the categories that require chained geometric computation across frames and viewpoints. Where gains are smaller, the bottleneck is perception quality on tasks already near-saturated by the backbone VLM.

This breakdown indicates that the expressive action interface, rather than model capacity or tool coverage, is the primary driver of performance.

Finding 4

Composition is the main driver of SpatialClaw's gains over structured tool-call.

For every sample where SpatialClaw is correct but structured tool-call fails, an LLM judge (Gemini 3 Pro) reads both reasoning traces and the ground-truth answer, then assigns attribution categories:

Code composition (52.2%). Chaining multiple tool calls into a single coherent program.
Control flow (19.5%). if / for branching over intermediate results.
Interface-neutral (28.3%). Wins on perceptual tasks either interface could solve.

Over 70% of wins trace directly to capabilities a fixed-API tool interface cannot easily provide.

LLM-as-judge attribution of wins over structured tool-call.

Key takeaway

The gains come from the action interface itself, not from engineered utilities or perception tooling. Removing utility wrappers leaves performance essentially unchanged; removing perception tools still beats the no-tool baseline. Code is the right abstraction for spatial reasoning agents.

06 Reasoning trajectories

Browse real agent sessions.

Every trajectory is a real run: the question, the code the agent wrote, the images it inspected, and the answer it submitted.

Multi-view · MMSI · Gemma 4-31B

Question

Assuming the wall where the sink is located faces east, in Figure 1 what is the position of the door relative to the sink?

Composes 3D reconstruction with mask segmentation, decomposes via SVD. 63 cross-step variable references; 7 targeted VLM queries.

Single-image · Omni3D · Gemma 4-31B

Question

If the right-most chair is 0.2 m wide, how tall is the wooden cabinet?

Composes 3D reconstruction with mask segmentation, reaches into scipy.spatial. 76 cross-step variable references; 6 targeted VLM queries.
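
A question like this comes down to using a known size as a ruler. The sketch below assumes the reconstruction is only defined up to an unknown scale and that the scene's right and up axes are already known; the point arrays are made-up stand-ins for segmented object points, not data from this trajectory.

```python
import numpy as np


def extent_along(points, axis):
    """Length of a point set measured along a given direction."""
    axis = np.asarray(axis, dtype=float)
    axis = axis / np.linalg.norm(axis)
    projections = points @ axis
    return float(projections.max() - projections.min())


# Hypothetical object points in arbitrary reconstruction units.
chair = np.array([[0.0, 0.0, 2.0], [0.5, 0.0, 2.0], [0.0, 1.0, 2.0], [0.5, 1.0, 2.0]])
cabinet = np.array([[2.0, 0.0, 3.0], [3.0, 0.0, 3.0], [2.0, 4.5, 3.0], [3.0, 4.5, 3.0]])

right = [1.0, 0.0, 0.0]  # assumed horizontal axis of the scene
up = [0.0, 1.0, 0.0]     # assumed vertical axis of the scene

# The question states that the chair is 0.2 m wide, which fixes metres per unit.
metres_per_unit = 0.2 / extent_along(chair, right)

# Apply the same scale to the cabinet's vertical extent.
cabinet_height_m = metres_per_unit * extent_along(cabinet, up)
```

With these invented points the chair spans 0.5 units, so one unit is 0.4 m and the 4.5-unit cabinet comes out at about 1.8 m. The real trajectory's numbers will differ.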

Video & 4D · VSI-Bench-U · Gemma 4-31B

Question

Measuring from the closest point of each object, what is the distance between the sofa and the toilet (in meters)?

Composes 3D reconstruction with mask segmentation, reaches into scipy.spatial. Iterative visual inspection: 12 show() calls.
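
The closest-point distance between two segmented objects can be sketched with `scipy.spatial.cKDTree`. The grid-shaped point clouds below are synthetic placeholders for the sofa and toilet points, and `closest_point_distance` is a hypothetical helper name; this is one plausible way to write the computation, not the code from the trajectory.

```python
import numpy as np
from scipy.spatial import cKDTree


def box_points(x, y, z):
    """Regular grid of 3D points filling an axis-aligned box."""
    grid = np.meshgrid(x, y, z, indexing="ij")
    return np.stack(grid, axis=-1).reshape(-1, 3)


def closest_point_distance(points_a, points_b):
    """Smallest Euclidean distance between any point of A and any point of B."""
    tree = cKDTree(points_b)
    distances, _ = tree.query(points_a, k=1)  # nearest neighbour in B for each point of A
    return float(distances.min())


# Synthetic stand-ins for the two segmented objects, in metres.
sofa = box_points(np.linspace(0.0, 2.0, 5), np.linspace(0.0, 1.0, 3), np.linspace(0.0, 0.8, 3))
toilet = box_points(np.linspace(3.5, 4.0, 3), np.linspace(0.0, 0.5, 2), np.linspace(0.0, 0.4, 2))

distance = closest_point_distance(sofa, toilet)
```

Here the nearest faces of the two boxes are 1.5 m apart, which is the value `distance` takes. On real reconstructions, outlier points from imperfect masks can shrink this minimum, which is one reason inspecting intermediate results matters.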

Video & 4D · VSI-Bench-U · Gemma 4-31B

Question

Measuring from the closest point of each object, what is the distance between the stool and the washer (in meters)?

101 cross-step variable references, 18 show() calls, 17 targeted VLM queries: the persistent kernel at its most compositional.

Single-image · SPBench · Gemma 4-31B

Question

Standing by the laptop and facing the chair, is the monitor to my left-front, left-back, right-front, or right-back?

3D reconstruction + mask segmentation; plots intermediate state with matplotlib; computes cross products for orientation.
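
The left/right and front/back decision can be sketched with one cross product and one dot product on the ground plane. The function name, the top-down coordinate convention (x to the east, y to the north, counter-clockwise positive), and the example positions are assumptions made for illustration.

```python
def relative_quadrant(observer, facing, target):
    """Classify target as left/right and front/back for an observer looking at `facing`.

    All arguments are (x, y) positions on the ground plane, viewed from above.
    Points exactly on a boundary fall into "right" or "back" in this sketch.
    """
    fx, fy = facing[0] - observer[0], facing[1] - observer[1]
    tx, ty = target[0] - observer[0], target[1] - observer[1]

    dot = fx * tx + fy * ty    # > 0: target lies ahead of the observer
    cross = fx * ty - fy * tx  # > 0: target is counter-clockwise of the gaze, i.e. to the left

    side = "left" if cross > 0 else "right"
    depth = "front" if dot > 0 else "back"
    return f"{side}-{depth}"


# Hypothetical ground-plane positions for the objects in the question.
laptop = (0.0, 0.0)
chair = (0.0, 2.0)

first = relative_quadrant(laptop, chair, (-1.0, 1.0))   # "left-front"
second = relative_quadrant(laptop, chair, (1.0, -1.0))  # "right-back"
```

The sign convention flips if the ground plane is viewed from below or uses a left-handed frame, so checking one known case before trusting the answer is a sensible inspection step.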

General spatial · ViewSpatial · Gemma 4-31B

Question

Imagine standing at the desk looking towards the television. Where is the toilet?

Reconstruction + segmentation; uses matplotlib for intermediate visualization; cross-product for viewpoint orientation.

Video & 4D · PAI-Bench · Gemma 4-31B

Question

For the agent in the video performing "Checkout and scan barcode in the supermarket", what subtask is currently most plausible?

3D reconstruction + mask segmentation + matplotlib visualization to disambiguate subtask intent across frames.

Multi-view · MMSI · Gemma 4-31B

Question

The fireplace faces north; in which direction is the painting on the wall in the fitness area facing?

3D reconstruction across multiple views; uses SVD to recover the dominant wall plane and infer the painting's facing direction.
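
Recovering a dominant plane with SVD can be sketched as follows. The synthetic wall points, the `fit_plane_normal` helper, and the assumed room centre used to orient the normal are all illustrative choices, not details taken from the trajectory.

```python
import numpy as np


def fit_plane_normal(points):
    """Least-squares plane through a point set: returns (centroid, unit normal).

    The normal is the right singular vector with the smallest singular value
    of the centred points; its sign is arbitrary.
    """
    centroid = points.mean(axis=0)
    _, _, vt = np.linalg.svd(points - centroid, full_matrices=False)
    return centroid, vt[-1]


# Synthetic wall: a grid of points lying on the plane x = 2.
ys, zs = np.meshgrid(np.linspace(0.0, 3.0, 7), np.linspace(0.0, 2.5, 6), indexing="ij")
wall = np.stack([np.full(ys.size, 2.0), ys.ravel(), zs.ravel()], axis=1)

centroid, normal = fit_plane_normal(wall)

# Resolve the sign ambiguity by pointing the normal into the room.
room_center = np.array([0.0, 1.5, 1.2])  # assumed interior point
if np.dot(normal, room_center - centroid) < 0:
    normal = -normal
```

For this wall the oriented normal is (-1, 0, 0), meaning the wall faces the negative x direction. Mapping that direction to a compass heading then requires a reference such as the stated facing of the fireplace.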

07 References

Spatial reasoning agents

SpaceTools-Toolshed (CVPR 2026). Tool-Augmented Spatial Reasoning via Double Interactive RL.
pySpatial (ICLR 2026). Generating 3D Visual Programs for Zero-Shot Spatial Reasoning.
VADAR (CVPR 2025). Visual Agentic AI for Spatial Reasoning with a Dynamic API.

Perception tools

SAM 3 (ICLR 2026). Segment Anything with Concepts.
Depth Anything 3 (ICLR 2026, Oral). Recovering the Visual Space from Any Views.

VLM backbones

Qwen3.5 (Alibaba Qwen Team, 2026). Multimodal large language model series.
Gemma 4 (Google DeepMind, 2026). Open-weight multimodal foundation models.

Cite this work

BibTeX

@article{cho2026spatialclaw,
  title   = {SpatialClaw: Rethinking Action Interface for Agentic Spatial Reasoning},
  author  = {Cho, Seokju and Hachiuma, Ryo and Badki, Abhishek and
             Su, Hang and Lee, Byung-Kwan and Song, Chan Hee and
             Liu, Sifei and Radhakrishnan, Subhashree and Kim, Seungryong and
             Wang, Yu-Chiang Frank and Chen, Min-Hung},
  journal = {arXiv preprint},
  year    = {2026}
}