SpatialClaw · Presentation

01 Motivation

Spatial reasoning is still hard
for vision-language models.

Determining where objects are, how they relate, and how they move in 3D is effortless for humans but unreliable for state-of-the-art VLMs.

"Is the car moving toward the camera?"

"Which object is closest to the table?"

"Did the person turn left or right?"

"How far apart are these two objects?"

02 The core idea

Tools alone are not enough.
The action interface is the bottleneck.

The capability of a tool-augmented agent is bounded not by which tools are available, but by how those tools can be composed: which intermediate states are observable, and whether the agent can revise before committing.

03 Three action interfaces

The same tools, three very different ceilings.

(a) Single-pass code

Commit before observing

Writes a complete program in one shot. Cannot revise once execution starts, so any wrong assumption propagates to the answer.

(b) Structured tool-call

Limited composition

Dispatches typed tools through a fixed JSON schema. Cannot freely combine perception outputs with NumPy / SciPy.

(c) Code as action

Compose · inspect · revise

Code as the action interface, backed by a persistent kernel. Perception outputs are ordinary variables.
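To make the contrast between a structured tool-call and code as action concrete, here is a minimal sketch. The `segment_area` tool, its made-up areas, and the JSON schema are all hypothetical stand-ins for illustration, not SpatialClaw's actual tools or schema.

```python
import json

# Hypothetical perception tool: pixel area of a named object's mask.
# The areas are made-up numbers for illustration only.
FAKE_AREAS = {"sink": 1200, "door": 4800}


def segment_area(label: str) -> int:
    return FAKE_AREAS[label]


TOOLS = {"segment_area": segment_area}


def dispatch(call_json: str) -> str:
    """Structured tool-call: one typed tool per call, result serialized back as JSON."""
    call = json.loads(call_json)
    result = TOOLS[call["tool"]](**call["args"])
    return json.dumps({"result": result})


# Structured tool-call: each call is isolated. Combining the two results
# needs either another registered tool or arithmetic done by the model in text.
reply_sink = dispatch('{"tool": "segment_area", "args": {"label": "sink"}}')
reply_door = dispatch('{"tool": "segment_area", "args": {"label": "door"}}')

# Code as action: outputs are ordinary variables, so a new analysis
# is just a new expression over them.
areas = {label: segment_area(label) for label in ("sink", "door")}
ratio = areas["door"] / areas["sink"]
print(reply_sink, reply_door, ratio)
```

The point of the sketch is only the shape of the two interfaces: the dispatcher can return what its schema names, while the code cell can compute anything expressible over the same outputs.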

04 SpatialClaw

Code is the right action interface
for spatial reasoning.

A training-free agent with a stateful Python kernel pre-loaded with input frames and perception + geometry primitives. The VLM writes one executable cell per step, conditioned on all prior outputs.

Persistent kernel

Masks, depth maps, point clouds and plots live as ordinary variables that persist across steps.

Visual inspection

show(...) embeds rendered images into the next observation, so the agent can see a mask before trusting it.

A new analysis ≠ a new API

A new computation is a new composition of primitives, assembled at test time.
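The persistent-kernel idea can be illustrated with a toy stand-in: one namespace shared by every cell, with stdout and errors returned as feedback. This is a sketch under stated assumptions (the `PersistentKernel` class and its `run_cell` method are hypothetical names, and the real system presumably uses a proper kernel rather than `exec`).

```python
import contextlib
import io
import traceback


class PersistentKernel:
    """Toy stand-in for a stateful Python kernel: one namespace shared by all cells."""

    def __init__(self, preload=None):
        # Pre-loaded variables (e.g. input frames) live in the same namespace as cell outputs.
        self.namespace = dict(preload or {})

    def run_cell(self, code: str) -> dict:
        """Execute one cell; return captured stdout and any traceback text."""
        buffer = io.StringIO()
        error = None
        try:
            with contextlib.redirect_stdout(buffer):
                exec(code, self.namespace)
        except Exception:
            error = traceback.format_exc()
        return {"stdout": buffer.getvalue(), "error": error}


kernel = PersistentKernel(preload={"InputImages": ["frame0", "frame1"]})
first = kernel.run_cell("n_frames = len(InputImages)\nprint(n_frames)")
second = kernel.run_cell("print(n_frames * 2)")   # n_frames survives from the first cell
third = kernel.run_cell("print(undefined_name)")  # errors come back as feedback, not as a crash
```

Because the namespace outlives each cell, a later step can reuse or inspect anything an earlier step produced, which is what lets the agent revise instead of starting over.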

05 Method

A persistent kernel wrapped in a five-stage loop.

I Planning

Separate LLM session
Sees question + tool docs, not the images
Outputs a plan

II Code gen

Purpose · Reasoning
Next goal
One Python cell

III Execution

AST security check
Persistent kernel
tools, show(), NumPy, SciPy

IV Feedback

stdout from print()
Errors / tracebacks
Variable summaries
Visuals from show()

V Answer

If ReturnAnswer() appears in the code → terminate (if the answer is valid)

Iterate: return to Stage II while step < N_max
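The five stages above can be sketched as a small loop. Everything here is an assumption made for illustration: the import allow-list in `is_safe`, the `write_plan` / `write_cell` callables standing in for the two model sessions, and the `valid_answers` check. The slides do not specify how the real AST check or answer validation works.

```python
import ast
import contextlib
import io
import traceback

ALLOWED_IMPORTS = {"math"}  # assumption: a tiny allow-list; the real policy is not specified


def is_safe(code: str) -> bool:
    """Toy AST check: reject syntax errors and imports outside the allow-list."""
    try:
        tree = ast.parse(code)
    except SyntaxError:
        return False
    for node in ast.walk(tree):
        if isinstance(node, ast.Import):
            names = [alias.name.split(".")[0] for alias in node.names]
        elif isinstance(node, ast.ImportFrom):
            names = [(node.module or "").split(".")[0]]
        else:
            continue
        if any(name not in ALLOWED_IMPORTS for name in names):
            return False
    return True


def run_agent(question, write_plan, write_cell, valid_answers, n_max=5):
    answer = {}

    def ReturnAnswer(value):
        answer["value"] = value

    namespace = {"ReturnAnswer": ReturnAnswer}          # persistent across steps
    plan = write_plan(question)                         # Stage I: planning
    history = []
    for _ in range(n_max):
        code = write_cell(question, plan, history)      # Stage II: one cell per step
        if not is_safe(code):                           # Stage III: check before executing
            history.append({"code": code, "stdout": "", "error": "rejected by AST check"})
            continue
        buffer, error = io.StringIO(), None
        try:
            with contextlib.redirect_stdout(buffer):
                exec(code, namespace)
        except Exception:
            error = traceback.format_exc()
        # Stage IV: feedback that conditions the next cell
        history.append({"code": code, "stdout": buffer.getvalue(), "error": error})
        if "value" in answer:                           # Stage V: terminate only on a valid answer
            value = answer.pop("value")
            if value in valid_answers:
                return value, history
            history[-1]["error"] = f"invalid answer: {value!r}"
    return None, history


# Scripted stand-in for the model: a rejected cell, a computation, then the answer.
SCRIPT = ["import os", "x = 3 * 4\nprint(x)", "ReturnAnswer('A' if x == 12 else 'B')"]


def scripted_cell(question, plan, history):
    return SCRIPT[len(history)]


final, trace = run_agent("toy question", lambda q: "compute, then answer",
                         scripted_cell, valid_answers={"A", "B"})
```

In this toy run the first cell is rejected, the second leaves `x` in the namespace, and the third reads `x` back to answer, so the history the model sees grows by one entry per step.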

06 The persistent workspace

Six entry points, one self-contained world.

`InputImages`

The sampled frames or images the agent reads observations from.

`Metadata`

Frame rate, duration, and indices, letting the agent reason about temporal structure.

`tools`

SAM 3 segmentation · Depth Anything 3 reconstruction · geometry utilities.

`show(...)`

Registers an image into the agent's next observation. plt.show() works too.

`vlm`

A separate VLM session: vlm.locate() for grounding, ask_with_thinking() for commonsense.

`ReturnAnswer(...)`

Submits the final candidate answer and terminates the loop.
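As a rough picture of how the six entry points could sit in one namespace, here is a toy assembly with stubbed tools. The `build_workspace` helper, the stub signatures for `tools.segment` and `vlm.locate`, the metadata keys, and the option check in `ReturnAnswer` are all hypothetical; only the six entry-point names come from the slide.

```python
from types import SimpleNamespace


def build_workspace(frames, metadata, options):
    """Assemble a toy namespace exposing the six entry points (all stubs)."""
    pending_images = []          # images registered by show(), waiting for the next observation
    state = {"answer": None}

    def show(image):
        pending_images.append(image)

    def ReturnAnswer(value):
        if value not in options:
            raise ValueError(f"{value!r} is not one of {sorted(options)}")
        state["answer"] = value

    def drain_images():
        """Hand the registered images to the next observation and reset the queue."""
        images = list(pending_images)
        pending_images.clear()
        return images

    namespace = {
        "InputImages": list(frames),
        "Metadata": dict(metadata),
        # Stub tool: returns a string tag instead of a real mask.
        "tools": SimpleNamespace(segment=lambda image, label: f"mask({label}@{image})"),
        "show": show,
        # Stub VLM session: grounding always fails in this toy version.
        "vlm": SimpleNamespace(locate=lambda image, query: None),
        "ReturnAnswer": ReturnAnswer,
    }
    return namespace, drain_images, state


ws, drain_images, state = build_workspace(
    frames=["f0", "f1"],
    metadata={"fps": 2.0, "duration": 1.0, "indices": [0, 1]},
    options={"A", "B"},
)
entry_points = sorted(ws)        # captured before exec() adds __builtins__
exec("mask = tools.segment(InputImages[0], 'sink')\nshow(mask)", ws)
observed = drain_images()        # what the agent would see in its next observation
exec("ReturnAnswer('A')", ws)
```

Keeping `show()` as a queue that is drained once per step is one simple way to get the "see a mask before trusting it" behaviour described earlier.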

07 Live walkthrough · real trajectory

Watch SpatialClaw solve a real benchmark question.

MMSI · sample 76 · Multi-view · Gemma 4-31B

In Figure 4 the wall holding the sink faces east (outward = east). In Figure 1, what is the position of the door relative to the sink?

A · Northeast    B · Southwest    C · Southeast    D · Northwest

Ground truth: A · Northeast

Fig 1

Fig 2

Fig 3

Fig 4

07 The code–inspect–revise loop · real MMSI-76 run

spatialclaw · persistent kernel · mmsi/76

Plan

East = wall normal, North = Up × East. Segment, reconstruct, project door→sink onto the axes.

Step 1 · code

seg_sink = tools.SAM3.segment(InputImages[3], "sink")
seg_door = tools.SAM3.segment(InputImages[0], "door")
show([seg_sink.visualize(), seg_door.visualize()])

Feedback · show()

Both mask areas > 0 → no empty masks, verified.

Step 2 · compose

recon   = tools.Reconstruct(InputImages)
# v_east (the sink-wall normal) is estimated in lines omitted from this excerpt
v_north = np.cross([0, 1, 0], v_east)  # Up × East
ReturnAnswer("A")                      # Northeast

Feedback · answer

East +5.61 · North +4.33 → both positive: A · Northeast ✓
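The projection step in the plan (North = Up × East, then project the sink-to-door offset onto both axes) can be worked through in a few lines. This is a sketch, not the run's actual code: the centroids and the east vector below are hypothetical values chosen so that the projections match the reported East +5.61 and North +4.33, and plain-Python `cross` / `dot` helpers are used so the example has no dependencies (in the kernel, `np.cross` and `np.dot` would play the same role). It assumes `v_east` is a unit vector perpendicular to up.

```python
def cross(a, b):
    """3D cross product a × b."""
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def compass_quadrant(sink, door, v_east, v_up=(0.0, 1.0, 0.0)):
    """Project the sink→door offset onto East and North and name the quadrant."""
    v_north = cross(v_up, v_east)                      # North = Up × East
    offset = tuple(d - s for d, s in zip(door, sink))  # door position relative to the sink
    east, north = dot(offset, v_east), dot(offset, v_north)
    label = ("North" if north > 0 else "South") + ("east" if east > 0 else "west")
    return east, north, label


OPTIONS = {"Northeast": "A", "Southwest": "B", "Southeast": "C", "Northwest": "D"}

# Hypothetical inputs, picked only to reproduce the reported projections.
east, north, label = compass_quadrant(
    sink=(0.0, 0.0, 0.0),
    door=(5.61, 0.2, -4.33),
    v_east=(1.0, 0.0, 0.0),
)
choice = OPTIONS[label]
print(f"East {east:+.2f} · North {north:+.2f} → {choice} · {label}")
```

The vertical component of the offset drops out because both axes are horizontal, so only the signs of the two projections decide the quadrant.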

Question · MMSI 76

Door (Fig 1) relative to the sink (Fig 4), sink-wall facing east?

GT: A · Northeast

08 Results

Improves on 19 of 20 benchmarks.

✓ Improves on 19 of 20 benchmarks
✓ Average gain of +6.5 pp over the no-tool baseline
✓ +11.2 pp over the prior agent SpaceTools-Toolshed on the same Gemma 4-31B backbone

Radar chart: each axis is rescaled so that SpatialClaw sits at a constant radius, so its circle makes the consistency of the gains visible at a glance.

Legend: SpatialClaw · SpaceTools · pySpatial · No-tool

08 Results · across backbones

Consistent gains across six VLM backbones.

Backbone	Single-image	Multi-view	General	Video & 4D	Video understanding	Average	Δ (pp)
Qwen 3.5-397B-A17B	60.8	60.7	64.7	58.5	59.7	60.4	+3.1
Qwen 3.5-122B-A10B	58.8	54.6	62.3	54.3	56.5	56.9	+3.2
Qwen 3.6-35B-A3B	58.8	55.4	62.4	54.0	57.8	57.2	+4.6
Qwen 3.6-27B	61.7	64.0	67.0	60.6	62.7	62.7	+7.7
Gemma 4-31B	61.9	62.5	64.8	55.1	59.4	59.9	+6.5
Gemma 4-26B-A4B	56.0	56.8	62.7	48.7	52.8	54.3	+6.3

Δ is the gain over each backbone's no-tool baseline. Two model families, 26B–397B parameters, no model-specific tuning.
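Since Δ is defined as the gain over each backbone's no-tool baseline, the implied baseline is simply Average − Δ. The short check below recomputes it from the table; because the table is rounded to one decimal, each implied baseline may be off by about ±0.1 from the unrounded value.

```python
# (average, delta) per backbone, copied from the table above.
ROWS = {
    "Qwen 3.5-397B-A17B": (60.4, 3.1),
    "Qwen 3.5-122B-A10B": (56.9, 3.2),
    "Qwen 3.6-35B-A3B": (57.2, 4.6),
    "Qwen 3.6-27B": (62.7, 7.7),
    "Gemma 4-31B": (59.9, 6.5),
    "Gemma 4-26B-A4B": (54.3, 6.3),
}

# Implied no-tool baseline = average - delta, rounded to the table's precision.
baselines = {name: round(avg - delta, 1) for name, (avg, delta) in ROWS.items()}

for name, value in baselines.items():
    print(f"{name:22s} {value:.1f}")
```

For Gemma 4-31B this gives 53.4, which agrees with the no-tool baseline quoted on the ablation slide.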

08 Ablation · same tools, same prompt

Only the action interface differs.

No-tool baseline

53.4

Single-pass code

55.2 +1.8

Structured tool-call

56.7 +3.3

SpatialClaw (code as action)

59.9 +6.5

Against prior agents on the same Gemma 4-31B backbone, SpatialClaw improves by +11.2 pp over SpaceTools-Toolshed.

09 Analysis

Composition is the main driver.

52%

Code composition: chaining tool calls into one program

20%

Control flow: branching over intermediate results

70%+

Of wins trace to capabilities a fixed API cannot provide

Gains are largest where chained geometric computation across frames and viewpoints is required. The agent also adapts its tool composition to the question type on its own, with no category-specific routing.

Key takeaway

The gains come from the action interface itself, not engineered utilities or perception tooling. Code is the right abstraction for spatial reasoning agents.

@article{cho2026spatialclaw,
  title  = {SpatialClaw: Rethinking Action Interface for Agentic Spatial Reasoning},
  author = {Cho, Seokju and Hachiuma, Ryo and ... and Chen, Min-Hung},
  year   = {2026}
}
