Rethinking the action interface for agentic spatial reasoning
Seokju Cho†, Ryo Hachiuma, Abhishek Badki, Hang Su, Byung-Kwan Lee, Chan Hee Song, Sifei Liu, Subhashree Radhakrishnan, Seungryong Kim†, Yu-Chiang Frank Wang, Min-Hung Chen
†Affiliated with KAIST · Work done during Seokju Cho's internship at NVIDIA
01 Motivation
Spatial reasoning is still hard for vision-language models.
Determining where objects are, how they relate, and how they move in 3D: effortless for humans, unreliable for state-of-the-art VLMs.
"Is the car moving toward the camera?"
"Which object is closest to the table?"
"Did the person turn left or right?"
"How far apart are these two objects?"
02 The core idea
Tools alone are not enough. The action interface is the bottleneck.
The capability of a tool-augmented agent is bounded not by which tools are available, but by how those tools can be composed: which intermediate states are observable, and whether the agent can revise before committing.
03 Three action interfaces
The same tools, three very different ceilings.
(a) Single-pass code
Commit before observing
Writes a complete program in one shot. Cannot revise once execution starts, so any wrong assumption propagates to the answer.
(b) Structured tool-call
Limited composition
Dispatches typed tools through a fixed JSON schema. Cannot freely combine perception outputs with NumPy / SciPy.
(c) SpatialClaw
Compose · inspect · revise
Code as the action interface, backed by a persistent kernel. Perception outputs are ordinary variables.
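To make "perception outputs are ordinary variables" concrete, here is a minimal sketch of the kind of composition a code interface allows. The arrays are synthetic stand-ins for a segmentation mask and a depth map, and `median_depth` is a hypothetical helper, not part of SpatialClaw.

```python
import numpy as np

# Synthetic stand-ins for perception outputs (assumption: a real agent would
# obtain these from a segmentation model and a depth estimator).
depth = np.full((4, 6), 5.0)   # background depth in metres
depth[1:3, 1:3] = 2.0          # object A is near the camera
depth[1:3, 4:6] = 3.5          # object B is farther away

mask_a = np.zeros((4, 6), dtype=bool)
mask_a[1:3, 1:3] = True
mask_b = np.zeros((4, 6), dtype=bool)
mask_b[1:3, 4:6] = True


def median_depth(depth_map, mask):
    """Median depth over the pixels selected by a boolean mask."""
    return float(np.median(depth_map[mask]))


# Because masks and depth are plain arrays, combining them is one line of NumPy.
depth_a = median_depth(depth, mask_a)
depth_b = median_depth(depth, mask_b)
closer = "A" if depth_a < depth_b else "B"
print(depth_a, depth_b, closer)
```

Under a fixed tool schema, the masked median would need its own dedicated tool. Here it is just indexing and a reduction.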
04 SpatialClaw
Code is the right action interface for spatial reasoning.
A training-free agent with a stateful Python kernel pre-loaded with input frames and perception + geometry primitives. The VLM writes one executable cell per step, conditioned on all prior outputs.
Persistent kernel
Masks, depth maps, point clouds and plots live as ordinary variables that persist across steps.
Visual inspection
show(...) embeds rendered images into the next observation, so the agent can see a mask before trusting it.
A new analysis ≠ a new API
A new computation is a new composition of primitives, assembled at test time.
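The persistent-kernel idea can be sketched in a few lines of standard-library Python. This is an illustration under stated assumptions, not the SpatialClaw implementation: `PersistentKernel` and `run_cell` are hypothetical names, and `show` here only queues an item, whereas the page describes it as embedding rendered images.

```python
import contextlib
import io


class PersistentKernel:
    """Toy stateful kernel: every cell runs in the same namespace."""

    def __init__(self):
        self.shown = []  # items queued by show() for the next observation
        self.namespace = {"show": self.shown.append}

    def run_cell(self, source):
        """Execute one cell and return (stdout, shown items) as the observation."""
        self.shown.clear()
        buffer = io.StringIO()
        with contextlib.redirect_stdout(buffer):
            exec(source, self.namespace)  # variables persist in self.namespace
        return buffer.getvalue(), list(self.shown)


kernel = PersistentKernel()
kernel.run_cell("mask_area = 12 * 8")  # step 1 defines a variable
out, shown = kernel.run_cell("print(mask_area // 2); show('mask preview')")
```

The second cell reads `mask_area` without recomputing it, which is the property that lets later steps build on earlier perception results.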
05 Method
A persistent kernel wrapped in a five-stage loop.
I Planning
Separate LLM session. Sees the question and tool docs, not the images. Outputs a plan.
II Code gen
Purpose · Reasoning · Next goal · One Python cell
III Execution
AST security check. Persistent kernel with tools, show(), NumPy, SciPy.
IV Feedback
stdout from print() · Errors / tracebacks · Variable summaries · Visuals from show()
V Answer
If ReturnAnswer() appears in the code → terminate (if valid)
Iterate: return to Stage II while step < Nmax
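The loop above can be sketched as follows. It is a simplified reading, not the actual system: the planning stage and the answer-validity check are omitted, `scripted_policy` is a hypothetical stand-in for the VLM, the import-rejecting `is_safe` is only one possible AST check, and `N_MAX = 5` is an assumed budget.

```python
import ast
import contextlib
import io
import traceback

N_MAX = 5  # assumed step budget; the page only names it Nmax


def is_safe(source):
    """Toy AST check: reject cells that fail to parse or contain imports."""
    try:
        tree = ast.parse(source)
    except SyntaxError:
        return False
    return not any(
        isinstance(node, (ast.Import, ast.ImportFrom)) for node in ast.walk(tree)
    )


def run_agent(write_cell, question):
    """Alternate code generation, execution and feedback until an answer appears."""
    answer = []
    namespace = {"ReturnAnswer": answer.append}  # shared across all steps
    feedback = ""
    for step in range(N_MAX):
        cell = write_cell(question, feedback, step)  # Stage II
        if not is_safe(cell):                        # Stage III: check first
            feedback = "rejected by AST check"
            continue
        buffer = io.StringIO()
        try:
            with contextlib.redirect_stdout(buffer):
                exec(cell, namespace)
            feedback = buffer.getvalue()             # Stage IV: stdout
        except Exception:
            feedback = traceback.format_exc()        # Stage IV: traceback
        if answer:                                   # Stage V
            return answer[0], step + 1
    return None, N_MAX


def scripted_policy(question, feedback, step):
    """Hypothetical stand-in for the VLM: replays three fixed cells."""
    cells = [
        "import os",                                  # rejected by the check
        "dx = 3 - 1\nprint(dx)",                      # stores state, prints a value
        "ReturnAnswer('right' if dx > 0 else 'left')",
    ]
    return cells[step]


result, steps_used = run_agent(scripted_policy, "Did the object move left or right?")
```

The third cell depends on `dx` from the second, so the answer is only reachable because the namespace survives between steps.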
06 The persistent workspace
Six entry points, one self-contained world.
InputImages
The sampled frames or images the agent reads observations from.
Metadata
Frame rate, duration, and indices, letting the agent reason about temporal structure.
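As one illustration of reasoning about temporal structure from metadata, sampled frame indices can be converted to timestamps. The helper `frame_timestamps` and the example values are hypothetical. The page does not specify how Metadata is laid out.

```python
def frame_timestamps(frame_indices, fps):
    """Convert sampled frame indices to timestamps in seconds."""
    if fps <= 0:
        raise ValueError("fps must be positive")
    return [index / fps for index in frame_indices]


# Hypothetical metadata: a clip sampled every 15 frames at 30 fps.
indices = [0, 15, 30, 45]
times = frame_timestamps(indices, fps=30.0)
elapsed = times[-1] - times[0]  # seconds between first and last sampled frame
```

With timestamps in hand, a displacement between two frames can be turned into a speed rather than a bare pixel or metre offset.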
Question · MMSI 76: Door (Fig 1) relative to the sink (Fig 4), sink-wall facing east?
GT: A · Northeast
Result: East +5.61 · North +4.33 → both positive: A · Northeast ✓
Figure panels: SAM 3 · sink mask (Fig 4); SAM 3 · door mask (Fig 1)
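The final step of this example, turning an east/north offset into a compass label, can be sketched as below. The example itself reasons from signs alone (both positive, hence Northeast). The 45-degree sector rule used here is an assumption that also handles near-axis offsets, and `compass_direction` is a hypothetical helper.

```python
import math


def compass_direction(east, north):
    """Map an (east, north) displacement to one of eight compass labels."""
    labels = ["East", "Northeast", "North", "Northwest",
              "West", "Southwest", "South", "Southeast"]
    # 0 degrees points East and angles grow counter-clockwise toward North.
    angle = math.degrees(math.atan2(north, east)) % 360.0
    # Each label owns a 45-degree sector centred on its direction.
    return labels[int((angle + 22.5) // 45) % 8]


direction = compass_direction(5.61, 4.33)
```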
08 Results
Improves on 19 of 20 benchmarks.
✓ Improves on 19 of 20 benchmarks
✓ Average gain +6.5 pp over no-tool baseline
✓ +11.2 pp over a recent prior agent
Radar chart: each axis is rescaled so that SpatialClaw lands at a constant radius; the resulting circle makes the consistency visible at a glance.
Legend: SpatialClaw · SpaceTools · pySpatial · No-tool
08 Results · across backbones
Consistent gains across six VLM backbones.
Backbone             Single-img  Multi-view  General  Video & 4D  Video Und.  Average  Δ
Qwen 3.5-397B-A17B   60.8        60.7        64.7     58.5        59.7        60.4     +3.1
Qwen 3.5-122B-A10B   58.8        54.6        62.3     54.3        56.5        56.9     +3.2
Qwen 3.6-35B-A3B     58.8        55.4        62.4     54.0        57.8        57.2     +4.6
Qwen 3.6-27B         61.7        64.0        67.0     60.6        62.7        62.7     +7.7
Gemma 4-31B          61.9        62.5        64.8     55.1        59.4        59.9     +6.5
Gemma 4-26B-A4B      56.0        56.8        62.7     48.7        52.8        54.3     +6.3
Δ is the gain over each backbone's no-tool baseline. Two model families, 26B–397B parameters, no model-specific tuning.
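Since Δ is defined as the gain over the no-tool baseline, each backbone's baseline average can be recovered as Average − Δ. The short check below does that arithmetic with the Average and Δ values from the table. The derived baselines are not reported on the page and inherit the table's rounding, so treat them as approximate to about 0.1 pp.

```python
# (average, delta) pairs copied from the table, in percentage points.
results = {
    "Qwen 3.5-397B-A17B": (60.4, 3.1),
    "Qwen 3.5-122B-A10B": (56.9, 3.2),
    "Qwen 3.6-35B-A3B": (57.2, 4.6),
    "Qwen 3.6-27B": (62.7, 7.7),
    "Gemma 4-31B": (59.9, 6.5),
    "Gemma 4-26B-A4B": (54.3, 6.3),
}

# Implied no-tool baseline = reported average minus reported gain.
implied_baseline = {
    name: round(average - delta, 1) for name, (average, delta) in results.items()
}

for name, baseline in implied_baseline.items():
    print(f"{name}: {baseline}")
```

For Gemma 4-31B this gives 53.4, which matches the no-tool baseline reported in the ablation.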
08 Ablation · same tools, same prompt
Only the action interface differs.
No-tool baseline · 53.4
Single-pass code · 55.2 (+1.8)
Structured tool-call · 56.7 (+3.3)
SpatialClaw (code as action) · 59.9 (+6.5)
Against prior agents on the same Gemma 4-31B backbone, SpatialClaw improves by +11.2 pp over SpaceTools-Toolshed.
09 Analysis
Composition is the main driver.
52% · Code composition: chaining tool calls into one program
20% · Control flow: branching over intermediate results
70%+ · Of wins trace to capabilities a fixed API cannot provide
Gains are largest precisely where chained geometric computation across frames and viewpoints is required. The agent also spontaneously adapts its tool composition to the question type, with no category-specific routing.
Key takeaway
The gains come from the action interface itself, not engineered utilities or perception tooling. Code is the right abstraction for spatial reasoning agents.
@article{cho2026spatialclaw,
  title  = {SpatialClaw: Rethinking Action Interface for Agentic Spatial Reasoning},
  author = {Cho, Seokju and Hachiuma, Ryo and ... and Chen, Min-Hung},
  year   = {2026}
}