NVIDIA Research · 2026
Training-free spatial reasoning agent

SpatialClaw: Rethinking the action interface for agentic spatial reasoning.

Code is the right action interface for spatial reasoning agents.

Affiliated with KAIST. Work done during Seokju Cho's internship at NVIDIA.

01 The thesis

Code is the right action interface for spatial reasoning.

SpatialClaw lets a VLM-backed agent write Python in a persistent kernel, composing perception modules, inspecting intermediate results, and revising its strategy across steps.

It is training-free, with no benchmark- or model-specific adaptation, yet it beats SpaceTools-Toolshed, the strongest prior spatial agent in our comparison, by 11.2 points on the 20-benchmark average, and it improves over the no-tool baseline on all six VLM backbones.

02 The problem

Why the action interface matters.

The capability of a tool-augmented agent is bounded not by which tools are available, but by how they can be composed.

Figure: Three action interfaces for spatial reasoning agents, contrasted in panels (a)–(c) and detailed below.

(a) Single-pass code

Commit before observing

Writes a complete Python program in one shot. Cannot revise once execution starts, so any wrong assumption propagates straight to the answer.

(b) Structured tool-call

Limited composition

Dispatches typed tools through a fixed JSON interface. Cannot freely combine perception outputs with NumPy / SciPy primitives to express test-time computations.

(c) SpatialClaw

Compose, inspect, revise

Code as the action interface, backed by a persistent Python kernel. Perception outputs are ordinary variables: composable, inspectable, and reusable across steps.
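To make the "persistent kernel" idea concrete, here is a minimal sketch in plain Python. It is not SpatialClaw's implementation: `PersistentKernel`, `chair_xyz`, and the three steps are hypothetical names chosen for illustration. The only point it shows is that a namespace shared across steps lets a later step inspect and reuse what an earlier step produced.

```python
import contextlib
import io


class PersistentKernel:
    """Toy stand-in for a persistent Python kernel: one namespace shared by all steps."""

    def __init__(self):
        self.namespace = {}

    def run(self, code):
        """Execute one agent step and return whatever it printed."""
        buffer = io.StringIO()
        with contextlib.redirect_stdout(buffer):
            exec(code, self.namespace)
        return buffer.getvalue().strip()


kernel = PersistentKernel()

# Step 1: a (hypothetical) perception result is stored as an ordinary variable.
kernel.run("chair_xyz = [(0.0, 0.0, 2.0), (0.5, 0.0, 2.0)]")

# Step 2: the agent inspects the intermediate result before committing to an answer.
observation = kernel.run("print(len(chair_xyz), 'points')")

# Step 3: having seen the output, it reuses the same variable in a new computation.
width = kernel.run("print(chair_xyz[1][0] - chair_xyz[0][0])")

print(observation)
print(width)
```

A single-pass program would have to write all three steps before seeing any output, and a fixed JSON tool interface would have no place to keep `chair_xyz` between calls.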

03 Live walkthrough

Watch SpatialClaw solve a benchmark question.

A real agent session, start to finish: it writes code, inspects the result, and revises.

04 Results

Consistent gains across every backbone.

A +11.2 pp margin over the prior spatial agent, and improvement on 19 of 20 benchmarks on the same backbone.

Main results across 20 benchmarks & 6 backbones

Per-category average accuracy. SpatialClaw improves the overall average over the no-tool baseline on all six backbones, which span two model families and 26B–397B parameters, with no benchmark- or model-specific tuning.

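| Backbone           | Method      | Single-img | Multi-view | General | Video & 4D | Video Und. | Average | Δ    |
|--------------------|-------------|------------|------------|---------|------------|------------|---------|------|
| Qwen 3.5-397B-A17B | No-tool     | 59.0       | 53.9       | 65.1    | 53.3       | 58.1       | 57.3    |      |
| Qwen 3.5-397B-A17B | SpatialClaw | 60.8       | 60.7       | 64.7    | 58.5       | 59.7       | 60.4    | +3.1 |
| Qwen 3.5-122B-A10B | No-tool     | 52.9       | 48.3       | 62.5    | 50.5       | 57.1       | 53.7    |      |
| Qwen 3.5-122B-A10B | SpatialClaw | 58.8       | 54.6       | 62.3    | 54.3       | 56.5       | 56.9    | +3.2 |
| Qwen 3.6-35B-A3B   | No-tool     | 54.5       | 43.7       | 60.3    | 49.8       | 55.7       | 52.6    |      |
| Qwen 3.6-35B-A3B   | SpatialClaw | 58.8       | 55.4       | 62.4    | 54.0       | 57.8       | 57.2    | +4.6 |
| Qwen 3.6-27B       | No-tool     | 58.6       | 49.0       | 62.3    | 51.3       | 55.9       | 55.0    |      |
| Qwen 3.6-27B       | SpatialClaw | 61.7       | 64.0       | 67.0    | 60.6       | 62.7       | 62.7    | +7.7 |
| Gemma 4-31B        | No-tool     | 55.6       | 50.2       | 62.4    | 47.6       | 55.5       | 53.4    |      |
| Gemma 4-31B        | SpatialClaw | 61.9       | 62.5       | 64.8    | 55.1       | 59.4       | 59.9    | +6.5 |
| Gemma 4-26B-A4B    | No-tool     | 50.7       | 42.5       | 59.0    | 40.5       | 51.6       | 48.0    |      |
| Gemma 4-26B-A4B    | SpatialClaw | 56.0       | 56.8       | 62.7    | 48.7       | 52.8       | 54.3    | +6.3 |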

All values are accuracy in %. The largest per-benchmark gains include DSI-Bench +17.6 pp (4D), MindCube +15.3 pp (multi-view), and MMSI +13.4 pp, in categories that benefit most from iterative multi-step geometric computation.

Action interface ablation

Same toolset, same prompt. Only the action interface differs. Gemma 4-31B, 20-benchmark average.

| Action interface             | Avg. | Δ    |
|------------------------------|------|------|
| No-tool baseline             | 53.4 |      |
| Single-pass code             | 55.2 | +1.8 |
| Structured tool-call         | 56.7 | +3.3 |
| SpatialClaw (code as action) | 59.9 | +6.5 |
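The Δ column is each interface's average minus the no-tool baseline. The short check below reproduces that arithmetic from the figures in the table. The variable names are ours, and the last value, SpatialClaw's margin over structured tool-call, is derived here rather than listed in the table.

```python
baseline = 53.4  # no-tool baseline, Gemma 4-31B, 20-benchmark average

interfaces = {
    "Single-pass code": 55.2,
    "Structured tool-call": 56.7,
    "SpatialClaw (code as action)": 59.9,
}

# Gain of each interface over the no-tool baseline, in percentage points.
deltas = {name: round(acc - baseline, 1) for name, acc in interfaces.items()}

# Margin of code-as-action over the structured tool-call interface.
margin_over_tool_call = round(
    interfaces["SpatialClaw (code as action)"] - interfaces["Structured tool-call"], 1
)

for name, delta in deltas.items():
    print(f"{name}: +{delta:.1f} pp")
print(f"SpatialClaw vs structured tool-call: +{margin_over_tool_call:.1f} pp")
```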

Comparison with prior spatial agents

All methods use the same Gemma 4-31B backbone with official implementations.

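| Method                         | Interface      | Avg.  | Δ     |
|--------------------------------|----------------|-------|-------|
| VADAR (Marsili '25)            | Single-pass    | 40.5* | −19.4 |
| pySpatial (Luo '26)            | Single-pass    | 47.8  | −12.1 |
| SpaceTools-Toolshed (Chen '26) | Tool-call      | 48.7  | −11.2 |
| SpatialClaw                    | Code as action | 59.9  | best  |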

*VADAR does not support video / multi-image inputs; only single-image benchmarks averaged.

05 Analysis & insights

Four findings on why code works.

The gains trace to the action interface itself, not to engineered utilities or to the perception tooling.

Finding 1

SpatialClaw generalizes across diverse spatial tasks even without pre-defined utility tools.

We ablate two design choices. (I) No utility functions: remove all utility wrappers (tools.Mask, tools.Geometry, …), keeping only the core perception tools, SAM 3 and Depth Anything 3 (DA 3), and scientific libraries (NumPy, SciPy). (II) No perception tools: remove SAM 3 / DA 3 entirely, leaving only the code-as-action interface with scientific libraries.

Variant (I) nearly matches the full configuration (56.4 vs 56.9 avg): the persistent kernel with scientific primitives largely compensates for the absent utility tools. Variant (II) still improves over the no-tool baseline by +2.7 pp (51.4 vs 48.7), isolating the contribution of the action interface itself, independent of the perception tools.

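| Variant                   | Single-image | Multi-view | General | Video & 4D | Average | Δ    |
|---------------------------|--------------|------------|---------|------------|---------|------|
| SpatialClaw (Full)        | 53.8         | 60.0       | 56.8    | 54.9       | 56.9    |      |
| (I) No utility functions  | 53.4         | 59.4       | 55.8    | 54.4       | 56.4    | −0.5 |
| (II) No perception tools  | 53.1         | 46.1       | 53.3    | 47.9       | 51.4    | −5.5 |
| No-tool baseline          | 48.1         | 43.3       | 53.1    | 46.1       | 48.7    | −8.2 |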

Gemma 4-26B-A4B backbone, 15 benchmarks, 500 samples each.

Finding 2

The agent spontaneously adapts its tool composition to the question type.

Primitive usage by category

Primitive-usage frequency across 13 meta-categories.

Without any category-specific prompt or tool routing, the agent selects geometrically appropriate primitives purely from question semantics:

  • Distance questions invoke KDTree search and vector norms.
  • Direction questions rely on dot products and angular operations.
  • Camera motion draws on pose composition and transformation chains.

This spontaneous, task-adaptive composition is precisely the behavior structured tool-call interfaces struggle to elicit, and it is what an expressive action interface unlocks.
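The three primitive families named above are each a few lines of NumPy. The sketch below is illustrative only: the function names and the toy point clouds are assumptions, not code taken from an agent trace, and the distance function uses brute-force broadcasting rather than the KDTree search the agent is reported to use, which keeps the example NumPy-only.

```python
import numpy as np


def closest_distance(points_a, points_b):
    """Smallest Euclidean distance between any point in A and any point in B."""
    diffs = points_a[:, None, :] - points_b[None, :, :]
    return float(np.linalg.norm(diffs, axis=-1).min())


def angle_between_deg(u, v):
    """Unsigned angle between two vectors, from the normalized dot product."""
    cos = np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))
    return float(np.degrees(np.arccos(np.clip(cos, -1.0, 1.0))))


def relative_pose(world_from_a, world_from_b):
    """4x4 transform that maps camera-b coordinates into camera-a coordinates."""
    return np.linalg.inv(world_from_a) @ world_from_b


# Hypothetical point clouds (metres) standing in for segmented, reconstructed objects.
sofa = np.array([[0.0, 0.0, 0.0], [1.0, 0.0, 0.0]])
toilet = np.array([[4.0, 0.0, 0.0], [5.0, 0.0, 0.0]])
gap = closest_distance(sofa, toilet)

# Direction: angle between two hypothetical heading vectors.
turn = angle_between_deg(np.array([1.0, 0.0, 0.0]), np.array([0.0, 1.0, 0.0]))

# Camera motion: camera b sits 1 m along x from camera a, with no rotation.
world_from_a = np.eye(4)
world_from_b = np.eye(4)
world_from_b[:3, 3] = [1.0, 0.0, 0.0]
a_from_b = relative_pose(world_from_a, world_from_b)

print(gap, turn, a_from_b[:3, 3])
```

For large point clouds, a spatial index such as a KD-tree avoids materializing the full pairwise distance matrix.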

Finding 3

Gains are largest where chained geometric computation across frames and viewpoints is required.

Pairwise win/loss by meta-category

SpatialClaw secures a net advantage in 11 of 13 meta-categories over both Structured tool-call and Single-pass code. The largest lifts (+6 to +9 pp) concentrate in Camera motion, Multi-view / viewpoint reasoning, and Relative direction, the categories that require chained geometric computation across frames and viewpoints. Where gains are smaller, the bottleneck is perception quality on tasks already near-saturated by the backbone VLM.

This breakdown confirms that the expressive action interface, rather than model capacity or tool coverage, is the primary driver of performance.

Finding 4

Composition is the main driver of SpatialClaw's gains over structured tool-call.

For every sample where SpatialClaw is correct but structured tool-call fails, an LLM judge (Gemini 3 Pro) reads both reasoning traces and the ground-truth answer, then assigns attribution categories:

  • Code composition (52.2%). Chaining multiple tool calls into a single coherent program.
  • Control flow (19.5%). Branching and looping (if / for) over intermediate results.
  • Interface-neutral (28.3%). Wins on perceptual tasks either interface could solve.

Together, code composition and control flow account for 71.7% of wins, which trace directly to capabilities a fixed-API tool interface cannot easily provide.
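As a rough illustration of what "code composition" and "control flow" mean in practice, the sketch below loops over perception outputs, branches on an empty result, and feeds the rest into NumPy. The `detections` dictionary, its labels, and its coordinates are hypothetical stand-ins for segmentation plus 3D reconstruction output, not data from the paper.

```python
import numpy as np

# Hypothetical perception output: label -> (N, 3) array of 3D points in metres.
detections = {
    "lamp": np.array([[2.0, 0.0, 1.0], [2.0, 0.2, 1.0]]),
    "plant": np.empty((0, 3)),  # segmentation returned an empty mask
    "chair": np.array([[0.5, 0.0, 0.0], [0.7, 0.0, 0.0]]),
}
reference = np.array([0.0, 0.0, 0.0])  # hypothetical observer position

nearest_label, nearest_dist = None, float("inf")
skipped = []

for label, points in detections.items():  # control flow: loop over intermediate results
    if len(points) == 0:  # control flow: branch on a failed detection
        skipped.append(label)
        continue
    centroid = points.mean(axis=0)  # composition: NumPy applied to perception output
    dist = float(np.linalg.norm(centroid - reference))
    if dist < nearest_dist:
        nearest_label, nearest_dist = label, dist

print(nearest_label, round(nearest_dist, 3), skipped)
```

With a fixed tool schema, the skip-and-continue logic and the centroid computation would each need a dedicated tool or an extra round trip through the model.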

Win attribution breakdown

LLM-as-judge attribution of wins over structured tool-call.

Key takeaway

The gains come from the action interface itself, not from engineered utilities or perception tooling. Removing utility wrappers leaves performance essentially unchanged; removing perception tools still beats the no-tool baseline. Code is the right abstraction for spatial reasoning agents.

06 Reasoning trajectories

Browse real agent sessions.

Every trajectory is a real run: the question, the code the agent wrote, the images it inspected, and the answer it submitted.

Multi-view · MMSI · Gemma 4-31B
Question
Assuming the wall where the sink is located faces east, in Figure 1 what is the position of the door relative to the sink?
Composes 3D reconstruction with mask segmentation, decomposes via SVD. 63 cross-step variable references; 7 targeted VLM queries.
Single-image · Omni3D · Gemma 4-31B
Question
If the right-most chair is 0.2 m wide, how tall is the wooden cabinet?
Composes 3D reconstruction with mask segmentation, reaches into scipy.spatial. 76 cross-step variable references; 6 targeted VLM queries.
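One plausible way to answer a question of this form is to use the stated chair width to fix the metric scale of the reconstruction, then apply that scale to the cabinet. The sketch below assumes this approach; it is not taken from the agent's trajectory, and the function name and measured values are hypothetical.

```python
def rescale_height(known_width_m, measured_width, measured_height):
    """Convert a height in reconstruction units to metres using one known width."""
    scale = known_width_m / measured_width  # metres per reconstruction unit
    return measured_height * scale


# Hypothetical measurements in arbitrary reconstruction units.
chair_width_units = 0.5
cabinet_height_units = 2.5

cabinet_height_m = rescale_height(0.2, chair_width_units, cabinet_height_units)
print(f"{cabinet_height_m:.2f} m")
```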
Video & 4D · VSI-Bench-U · Gemma 4-31B
Question
Measuring from the closest point of each object, what is the distance between the sofa and the toilet (in meters)?
Composes 3D reconstruction with mask segmentation, reaches into scipy.spatial. Iterative visual inspection: 12 show() calls.
Video & 4D · VSI-Bench-U · Gemma 4-31B
Question
Measuring from the closest point of each object, what is the distance between the stool and the washer (in meters)?
101 cross-step variable references, 18 show() calls, 17 targeted VLM queries: the persistent kernel at its most compositional.
Single-image · SPBench · Gemma 4-31B
Question
Standing by the laptop and facing the chair, is the monitor to my left-front, left-back, right-front, or right-back?
3D reconstruction + mask segmentation; plots intermediate state with matplotlib; computes cross products for orientation.
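The cross-product step in a question like this can be sketched on the ground plane: the dot product with the facing direction separates front from back, and the sign of the 2D cross product separates left from right. The function name and coordinates below are hypothetical, and the sign convention assumes a right-handed frame viewed from above (z up).

```python
def egocentric_quadrant(observer, facing, target):
    """Classify target as left/right and front/back for an observer on the ground plane.

    Points are (x, y) ground-plane coordinates in a right-handed frame viewed
    from above (z up); that convention is an assumption of this sketch.
    """
    fx, fy = facing[0] - observer[0], facing[1] - observer[1]
    tx, ty = target[0] - observer[0], target[1] - observer[1]
    dot = fx * tx + fy * ty    # > 0: target is ahead of the observer
    cross = fx * ty - fy * tx  # > 0: target is counter-clockwise of forward, i.e. left
    side = "left" if cross > 0 else "right"
    depth = "front" if dot > 0 else "back"
    return f"{side}-{depth}"


# Hypothetical positions: stand at the laptop, face the chair, locate the monitor.
laptop, chair, monitor = (0.0, 0.0), (0.0, 2.0), (-1.0, 1.0)
answer = egocentric_quadrant(laptop, chair, monitor)
print(answer)
```

If the reconstruction uses a different handedness or up axis, the left/right sign flips, so the convention should be checked against a known example before trusting the label.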
General spatial · ViewSpatial · Gemma 4-31B
Question
Imagine standing at the desk looking towards the television. Where is the toilet?
Reconstruction + segmentation; uses matplotlib for intermediate visualization; cross-product for viewpoint orientation.
Video & 4D · PAI-Bench · Gemma 4-31B
Question
For the agent in the video performing "Checkout and scan barcode in the supermarket", what subtask is currently most plausible?
3D reconstruction + mask segmentation + matplotlib visualization to disambiguate subtask intent across frames.
Multi-view · MMSI · Gemma 4-31B
Question
The fireplace faces north; in which direction is the painting on the wall in the fitness area facing?
3D reconstruction across multiple views; uses SVD to recover the dominant wall plane and infer the painting's facing direction.
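Recovering a dominant plane with SVD, as several trajectories above do, is a standard least-squares fit: center the points, take the SVD, and read the normal off the right singular vector with the smallest singular value. The sketch below shows that technique on hypothetical wall points; it is not code from an agent session.

```python
import numpy as np


def fit_plane_normal(points):
    """Unit normal of the least-squares plane through an (N, 3) point set."""
    centered = points - points.mean(axis=0)
    # Rows of vt are right singular vectors, ordered by decreasing singular value.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return vt[-1]  # direction of least variance, defined only up to sign


# Hypothetical points sampled from a wall lying in the plane x = 2.
wall = np.array([
    [2.0, 0.0, 0.0],
    [2.0, 1.0, 0.0],
    [2.0, 0.0, 1.0],
    [2.0, 1.0, 1.0],
    [2.0, 0.5, 0.3],
])
normal = fit_plane_normal(wall)
print(np.abs(normal))
```

The sign of the normal is ambiguous, so deciding which way a wall or painting faces needs an extra cue, for example orienting the normal toward the camera position.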
07 References

Spatial reasoning agents

  • SpaceTools-Toolshed (CVPR 2026). Tool-Augmented Spatial Reasoning via Double Interactive RL.
  • pySpatial (ICLR 2026). Generating 3D Visual Programs for Zero-Shot Spatial Reasoning.
  • VADAR (CVPR 2025). Visual Agentic AI for Spatial Reasoning with a Dynamic API.

Perception tools

  • SAM 3 (ICLR 2026). Segment Anything with Concepts.
  • Depth Anything 3 (ICLR 2026, Oral). Recovering the Visual Space from Any Views.

VLM backbones

  • Qwen3.5 (Alibaba Qwen Team, 2026). Multimodal large language model series.
  • Gemma 4 (Google DeepMind, 2026). Open-weight multimodal foundation models.
Cite this work

BibTeX

@article{cho2026spatialclaw,
  title   = {SpatialClaw: Rethinking Action Interface for Agentic Spatial Reasoning},
  author  = {Cho, Seokju and Hachiuma, Ryo and Badki, Abhishek and
             Su, Hang and Lee, Byung-Kwan and Song, Chan Hee and
             Liu, Sifei and Radhakrishnan, Subhashree and Kim, Seungryong and
             Wang, Yu-Chiang Frank and Chen, Min-Hung},
  journal = {arXiv preprint},
  year    = {2026}
}