SpatialClaw
Rethinking the action interface for agentic spatial reasoning
Seokju Cho, Ryo Hachiuma, Abhishek Badki, Hang Su, Byung-Kwan Lee, Chan Hee Song, Sifei Liu, Subhashree Radhakrishnan, Seungryong Kim, Yu-Chiang Frank Wang, Min-Hung Chen
NVIDIA
Affiliated with KAIST · Work done during Seokju Cho's internship at NVIDIA
01 Motivation

Spatial reasoning is still hard for vision-language models.

Determining where objects are, how they relate, and how they move in 3D: effortless for humans, unreliable for state-of-the-art VLMs.

"Is the car moving toward the camera?"
"Which object is closest to the table?"
"Did the person turn left or right?"
"How far apart are these two objects?"
02 The core idea

Tools alone are not enough.
The action interface is the bottleneck.

The capability of a tool-augmented agent is bounded not by which tools are available, but by how those tools can be composed: which intermediate states are observable, and whether the agent can revise before committing.

03 Three action interfaces

The same tools, three very different ceilings.

(a) Single-pass code

Commit before observing

Writes a complete program in one shot. Cannot revise once execution starts, so any wrong assumption propagates to the answer.

(b) Structured tool-call

Limited composition

Dispatches typed tools through a fixed JSON schema. Cannot freely combine perception outputs with NumPy / SciPy.

(c) SpatialClaw

Compose · inspect · revise

Code as the action interface, backed by a persistent kernel. Perception outputs are ordinary variables.
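To make "perception outputs are ordinary variables" concrete, the following is a minimal sketch in which a segmentation mask and a depth map are plain NumPy arrays that can be combined freely. The arrays are synthetic stand-ins rather than outputs of the actual SpatialClaw tools, and `median_depth_in_mask` is a hypothetical helper introduced only for illustration.

```python
import numpy as np


def median_depth_in_mask(depth, mask):
    """Median depth over the pixels selected by a boolean mask (hypothetical helper)."""
    if not mask.any():
        raise ValueError("empty mask: nothing to measure")
    return float(np.median(depth[mask]))


# Synthetic stand-ins for perception outputs (assumed shape: H x W).
depth = np.full((4, 6), 5.0)   # background at 5 m
depth[1:3, 1:3] = 2.0          # object A at 2 m
depth[1:3, 4:6] = 3.5          # object B at 3.5 m

mask_a = np.zeros((4, 6), dtype=bool)
mask_a[1:3, 1:3] = True
mask_b = np.zeros((4, 6), dtype=bool)
mask_b[1:3, 4:6] = True

# Because masks and depth are ordinary arrays, any NumPy expression composes them.
depth_a = median_depth_in_mask(depth, mask_a)
depth_b = median_depth_in_mask(depth, mask_b)
closer = "A" if depth_a < depth_b else "B"
print(depth_a, depth_b, closer)
```

A fixed tool schema would need a dedicated "median depth inside mask" endpoint for this; with code as the interface it is a one-line composition.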

04 SpatialClaw

Code is the right action interface for spatial reasoning.

A training-free agent with a stateful Python kernel pre-loaded with input frames and perception + geometry primitives. The VLM writes one executable cell per step, conditioned on all prior outputs.

Persistent kernel

Masks, depth maps, point clouds and plots live as ordinary variables that persist across steps.

Visual inspection

show(...) embeds rendered images into the next observation, so the agent can see a mask before trusting it.

A new analysis ≠ a new API

A new computation is a new composition of primitives, assembled at test time.
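The persistent-kernel idea described above can be sketched with a toy class that runs every cell in one shared namespace. This is an illustration under simplifying assumptions, not the SpatialClaw implementation: `PersistentKernel` and `run_cell` are hypothetical names, and `show` here only records what was passed to it instead of rendering images.

```python
import contextlib
import io


class PersistentKernel:
    """Toy stateful kernel: every cell executes in the same namespace."""

    def __init__(self):
        self.shown = []  # items registered through show() during the current cell
        self.namespace = {"show": self.shown.append}

    def run_cell(self, source):
        """Execute one cell and return (captured stdout, items passed to show())."""
        self.shown.clear()
        buffer = io.StringIO()
        with contextlib.redirect_stdout(buffer):
            exec(source, self.namespace)
        return buffer.getvalue(), list(self.shown)


kernel = PersistentKernel()

# Step 1 defines a variable; nothing is printed or shown.
kernel.run_cell("points = [(0, 0), (3, 4)]")

# Step 2 reuses `points` from step 1, which is the point of a persistent kernel.
stdout, shown = kernel.run_cell(
    "dx = points[1][0] - points[0][0]\n"
    "dy = points[1][1] - points[0][1]\n"
    "print((dx ** 2 + dy ** 2) ** 0.5)\n"
    "show('distance-plot-placeholder')\n"
)
print(repr(stdout), shown)
```

The second cell succeeds only because `points` survived from the first one; in a stateless interface each step would have to recompute or re-serialize it.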

05 Method

A persistent kernel wrapped in a five-stage loop.

I. Planning
  • Separate LLM session
  • Sees question + tool docs, not the images
  • Outputs a plan
II. Code gen
  • Purpose · Reasoning
  • Next goal
  • One Python cell
III. Execution
  • AST security check
  • Persistent kernel
  • tools, show(), NumPy, SciPy
IV. Feedback
  • stdout from print()
  • Errors / tracebacks
  • Variable summaries
  • Visuals from show()
V. Answer
  • If ReturnAnswer() in code → terminate (if valid)
Iterate: return to Stage II while step < Nmax
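Stages II to V of the loop above can be sketched as follows. This is a simplified illustration, not the authors' code: the planning stage is omitted, a scripted list of cells stands in for the VLM, the banned-module list is an assumed policy, and `run_agent`, `is_safe`, and `AnswerSubmitted` are hypothetical names.

```python
import ast
import contextlib
import io

BANNED_MODULES = {"os", "subprocess", "sys"}  # assumed policy, for illustration only


class AnswerSubmitted(Exception):
    """Raised by ReturnAnswer() to signal that the loop should terminate."""

    def __init__(self, answer):
        super().__init__(answer)
        self.answer = answer


def return_answer(answer):
    raise AnswerSubmitted(answer)


def is_safe(source):
    """Reject cells that fail to parse or that import a banned module."""
    try:
        tree = ast.parse(source)
    except SyntaxError:
        return False
    for node in ast.walk(tree):
        if isinstance(node, ast.Import):
            names = [alias.name for alias in node.names]
        elif isinstance(node, ast.ImportFrom):
            names = [node.module or ""]
        else:
            continue
        if any(name.split(".")[0] in BANNED_MODULES for name in names):
            return False
    return True


def run_agent(generate_cell, n_max=5):
    namespace = {"ReturnAnswer": return_answer}  # persists across steps
    feedback = ""
    for step in range(n_max):
        cell = generate_cell(step, feedback)       # Stage II: one cell per step
        if not is_safe(cell):                      # Stage III: AST check
            feedback = "rejected by AST check"
            continue
        buffer = io.StringIO()
        try:
            with contextlib.redirect_stdout(buffer):
                exec(cell, namespace)              # Stage III: execution
            feedback = buffer.getvalue()           # Stage IV: stdout
        except AnswerSubmitted as done:            # Stage V: terminate
            return done.answer
        except Exception as err:                   # Stage IV: error feedback
            feedback = f"{type(err).__name__}: {err}"
    return None


# Scripted stand-in for the VLM: one rejected cell, one error, then a fix.
SCRIPT = ["import os", "x = 1 / 0", "x = 21\nprint(x)", "ReturnAnswer(x * 2)"]


def scripted_policy(step, feedback):
    return SCRIPT[step]


answer = run_agent(scripted_policy)
print(answer)
```

The rejected cell and the failing cell do not end the run; they become feedback for the next step, which is how the loop supports revision before committing.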
06 The persistent workspace

Six entry points, one self-contained world.

InputImages

The sampled frames or images the agent reads observations from.

Metadata

Frame rate, duration, and indices, letting the agent reason about temporal structure.

tools

SAM 3 segmentation · Depth Anything 3 reconstruction · geometry utilities.

show(...)

Registers an image into the agent's next observation. plt.show() works too.

vlm

A separate VLM session: vlm.locate() for grounding, ask_with_thinking() for commonsense.

ReturnAnswer(...)

Submits the final candidate answer and terminates the loop.
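As an example of how frame-rate and index metadata supports temporal reasoning, the sketch below converts sampled frame indices to timestamps and derives a speed. The dictionary keys (`fps`, `duration_s`, `frame_indices`) and all values are assumptions made for illustration; the actual layout of the Metadata object is not specified here.

```python
# Hypothetical metadata layout; key names and values are illustrative only.
metadata = {"fps": 30.0, "duration_s": 10.0, "frame_indices": [0, 75, 150, 225]}

# Timestamp (seconds) of each sampled frame in the original video.
timestamps = [index / metadata["fps"] for index in metadata["frame_indices"]]

# Time elapsed between consecutive sampled frames.
gaps = [later - earlier for earlier, later in zip(timestamps, timestamps[1:])]

# If an object moved 1.0 m between the first two sampled frames
# (an assumed measurement), its average speed over that interval is:
displacement_m = 1.0
speed_m_per_s = displacement_m / gaps[0]

print(timestamps, gaps, speed_m_per_s)
```

Without the frame rate, a displacement between two sampled frames could not be turned into a speed, since the sampling stride is not visible in the images themselves.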

07 Live walkthrough · real trajectory

Watch SpatialClaw solve a real benchmark question.

MMSI · sample 76 · Multi-view · Gemma 4-31B
In Figure 4 the wall holding the sink faces east (outward = east). In Figure 1, what is the position of the door relative to the sink?
A · Northeast   B · Southwest   C · Southeast   D · Northwest
Ground truth: A · Northeast
[Input views: Fig 1 · Fig 2 · Fig 3 · Fig 4]
07 The code–inspect–revise loop · real MMSI-76 run
spatialclaw · persistent kernel · mmsi/76
Plan
East = wall normal, North = Up × East. Segment, reconstruct, project door→sink onto the axes.
Step 1 · code
seg_sink = tools.SAM3.segment(InputImages[3], "sink")
seg_door = tools.SAM3.segment(InputImages[0], "door")
show([seg_sink.visualize(), seg_door.visualize()])
Feedback · show()
Both mask areas > 0 → no empty masks, verified.
Step 2 · compose
recon   = tools.Reconstruct(InputImages)
v_north = np.cross([0,1,0], v_east) # Up × East
ReturnAnswer("A")                  # Northeast
Feedback · answer
East +5.61 · North +4.33 → both positive: A · Northeast ✓
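The final projection step of the plan (East = wall normal, North = Up × East) can be sketched with plain NumPy. The coordinates below are hypothetical values chosen so that the components match the magnitudes reported in the feedback; they are not taken from the actual reconstruction, and `compass_direction` is an illustrative helper that assumes `up` and `east` are already perpendicular.

```python
import numpy as np


def compass_direction(target, reference, up, east):
    """Project (target - reference) onto East and North = Up x East."""
    up = np.asarray(up, dtype=float)
    east = np.asarray(east, dtype=float)
    east = east / np.linalg.norm(east)
    north = np.cross(up, east)  # same convention as the plan: North = Up x East
    north = north / np.linalg.norm(north)
    offset = np.asarray(target, dtype=float) - np.asarray(reference, dtype=float)
    east_component = float(offset @ east)
    north_component = float(offset @ north)
    label = ("North" if north_component > 0 else "South") + (
        "east" if east_component > 0 else "west"
    )
    return east_component, north_component, label


# Hypothetical positions in a frame where Up = +y and the sink wall normal = +x.
sink = [0.0, 0.0, 0.0]
door = [5.61, 0.3, -4.33]
east_c, north_c, label = compass_direction(door, sink, up=[0, 1, 0], east=[1, 0, 0])
print(f"East {east_c:+.2f} · North {north_c:+.2f} -> {label}")
```

With Up = +y and East = +x, the cross product gives North = -z, so a door at negative z relative to the sink has a positive North component. The vertical offset drops out because it is orthogonal to both axes.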
[Mask visualizations: SAM 3 sink mask (Fig 4) · SAM 3 door mask (Fig 1)]
08 Results

Improves on 19 of 20 benchmarks.

  • Average gain +6.5 pp over no-tool baseline
  • +11.2 pp over a recent prior agent

In the radar plot, each axis is rescaled so that SpatialClaw lands at a constant radius. Its outline is therefore a circle, which makes the consistency of its gains over the other methods visible at a glance.

Legend: SpatialClaw · SpaceTools · pySpatial · No-tool
08 Results · across backbones

Consistent gains across six VLM backbones.

Backbone             Single-img  Multi-view  General  Video & 4D  Video Und.  Average  Δ
Qwen 3.5-397B-A17B   60.8        60.7        64.7     58.5        59.7        60.4     +3.1
Qwen 3.5-122B-A10B   58.8        54.6        62.3     54.3        56.5        56.9     +3.2
Qwen 3.6-35B-A3B     58.8        55.4        62.4     54.0        57.8        57.2     +4.6
Qwen 3.6-27B         61.7        64.0        67.0     60.6        62.7        62.7     +7.7
Gemma 4-31B          61.9        62.5        64.8     55.1        59.4        59.9     +6.5
Gemma 4-26B-A4B      56.0        56.8        62.7     48.7        52.8        54.3     +6.3

Δ is the gain over each backbone's no-tool baseline. Two model families, 26B–397B parameters, no model-specific tuning.

08 Ablation · same tools, same prompt

Only the action interface differs.

No-tool baseline               53.4
Single-pass code               55.2  (+1.8)
Structured tool-call           56.7  (+3.3)
SpatialClaw (code as action)   59.9  (+6.5)
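The reported gains in the ablation can be checked directly from the four scores. The short sketch below recomputes each interface's delta over the no-tool baseline, using only the numbers listed above.

```python
# Ablation scores as reported above (same tools, same prompt).
scores = {
    "No-tool baseline": 53.4,
    "Single-pass code": 55.2,
    "Structured tool-call": 56.7,
    "SpatialClaw (code as action)": 59.9,
}

baseline = scores["No-tool baseline"]

# Gain of each interface over the baseline, in percentage points.
gains = {
    name: round(score - baseline, 1)
    for name, score in scores.items()
    if name != "No-tool baseline"
}

for name, gain in gains.items():
    print(f"{name}: +{gain} pp")
```

The recomputed deltas match the reported ones, and the code-as-action gain (+6.5 pp) is roughly twice that of the structured tool-call interface (+3.3 pp).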

Against prior agents on the same Gemma 4-31B backbone, SpatialClaw improves by +11.2 pp over SpaceTools-Toolshed.

09 Analysis

Composition is the main driver.

52% · Code composition: chaining tool calls into one program
20% · Control flow: branching over intermediate results
70%+ · Of wins trace to capabilities a fixed API cannot provide

Gains are largest where chained geometric computation across frames and viewpoints is required. The agent also adapts its tool composition to the question type on its own, with no category-specific routing.

Key takeaway

The gains come from the action interface itself, not engineered utilities or perception tooling. Code is the right abstraction for spatial reasoning agents.

@article{cho2026spatialclaw,
  title  = {SpatialClaw: Rethinking Action Interface for Agentic Spatial Reasoning},
  author = {Cho, Seokju and Hachiuma, Ryo and ... and Chen, Min-Hung},
  year   = {2026}
}