NotEvolve: A Notebook as the World Model for Self-Evolving Agents

Why a Notebook?

Evolvable State

A branch is a full notebook state, denoted as N_t⁽ⁱ⁾, rather than a single source file. Selection can preserve whole research trajectories.

Multimodal Memory

Cells store plans, code, text outputs, tables, images, and summaries. The renderer exposes only the useful parts to the LLM context.

Executable Environment

The kernel keeps helper functions, cached data, solver states, and best-known variables alive, so later cells can build on earlier experiments.
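The idea of a live kernel can be illustrated with a toy stand-in. This is a minimal sketch, not the NotEvolve kernel: `TinyKernel` and `run_cell` are hypothetical names, and a real Jupyter kernel runs in a separate process. It only shows how a namespace shared across cells lets a later cell reuse a helper function and a best-known value defined earlier.

```python
import contextlib
import io


class TinyKernel:
    """Toy stand-in for a notebook kernel: one namespace shared by all cells."""

    def __init__(self):
        self.namespace = {}

    def run_cell(self, source: str) -> str:
        """Execute a cell in the shared namespace and return its captured stdout."""
        buffer = io.StringIO()
        with contextlib.redirect_stdout(buffer):
            exec(source, self.namespace)
        return buffer.getvalue()


kernel = TinyKernel()

# Cell 1 defines a helper and a best-known variable.
kernel.run_cell("best_score = 0.8\ndef improve(s):\n    return round(s + 0.2, 2)")

# Cell 2 builds on the state left behind by cell 1.
output = kernel.run_cell("best_score = improve(best_score)\nprint(best_score)")
```

The second cell succeeds only because `improve` and `best_score` are still alive in the namespace, which is the property the notebook kernel provides across experiments.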

How NotEvolve Runs

NotEvolve has an outer loop and an inner loop. The outer loop is a general evolution controller: it receives a population of notebook branches and their scores, then proposes the next population by selection, mutation, recombination, tournament search, MAP-Elites, or another evolution strategy. The inner loop runs one branch for one round.
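As one possible instance of the outer loop, the sketch below uses tournament selection. It is an illustration under assumptions, not the project's controller: `Branch`, `tournament_select`, `evolve_step`, and `toy_round` are hypothetical names, higher scores are assumed to be better, and the inner loop is replaced by a stub that appends one cell and nudges the score.

```python
import random
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass(frozen=True)
class Branch:
    """Hypothetical branch record: a notebook snapshot plus its latest score."""
    cells: Tuple[str, ...]
    score: float


def tournament_select(population: List[Branch], k: int, rng: random.Random) -> Branch:
    """Sample k distinct branches and return the best-scoring one."""
    contenders = rng.sample(population, k)
    return max(contenders, key=lambda branch: branch.score)


def evolve_step(
    population: List[Branch],
    run_round: Callable[[Branch], Branch],
    k: int = 2,
    seed: int = 0,
) -> List[Branch]:
    """Build the next population: select a parent, then run one inner-loop round on it."""
    rng = random.Random(seed)
    return [run_round(tournament_select(population, k, rng)) for _ in population]


def toy_round(parent: Branch) -> Branch:
    """Stub for the inner loop: extend the notebook and report a new score."""
    return Branch(parent.cells + ("# new experiment",), parent.score + 0.1)


population = [Branch(("# seed",), s) for s in (0.2, 0.5, 0.8, 0.6)]
next_population = evolve_step(population, toy_round)
```

Because each child inherits the parent's full cell tuple, selection carries forward the whole trajectory rather than a single source file. Swapping `tournament_select` for another strategy would leave `evolve_step` unchanged.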

NotEvolve outer-loop and inner-loop architecture. Each branch enters a three-phase notebook-agent round. First, the context renderer converts the notebook into compact text and the agent writes a plan. Second, the LLM calls notebook tools to add, edit, run, summarize, fold, and unfold cells. Third, a cleanup phase folds or deletes low-value cells, inserts a round summary, records a score, and checkpoints the resulting notebook. The scored checkpoints become inputs to the next outer-loop evolution step.
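The three phases can be sketched as a single function in which the LLM is replaced by plain callables. Everything here is a simplifying assumption for illustration: `Cell`, `run_round`, `plan`, `act`, and `keep_threshold` are hypothetical, cleanup is reduced to a score threshold, and folding and checkpointing are omitted.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple


@dataclass
class Cell:
    kind: str                      # "plan", "code", or "summary"
    source: str
    score: Optional[float] = None  # only scored experiments carry a value


def run_round(
    cells: List[Cell],
    plan: Callable[[List[Cell]], str],
    act: Callable[[List[Cell]], List[Cell]],
    keep_threshold: float,
) -> Tuple[List[Cell], Optional[float]]:
    """One simplified round: plan, execute, then clean up and summarize."""
    cells = list(cells)

    # Phase 1: the agent reads the current notebook and writes a plan cell.
    cells.append(Cell("plan", plan(cells)))

    # Phase 2: tool calls add (and here, implicitly run and score) new cells.
    cells.extend(act(cells))

    # Phase 3: drop low-scoring experiments, then record a round summary.
    kept = [c for c in cells if c.score is None or c.score >= keep_threshold]
    scores = [c.score for c in kept if c.score is not None]
    best = max(scores) if scores else None
    kept.append(Cell("summary", f"kept {len(scores)} scored cells; best score: {best}"))
    return kept, best


start = [Cell("code", "baseline()", 0.8)]
result_cells, round_score = run_round(
    start,
    plan=lambda cells: "try two stronger variants",
    act=lambda cells: [Cell("code", "methodA()", 0.6), Cell("code", "methodB()", 1.0)],
    keep_threshold=0.8,
)
```

The returned score is what an outer-loop controller would consume, and the returned cell list is the state that would be checkpointed.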

Context Management

A raw Jupyter notebook is a verbose JSON document with execution metadata, widget state, and potentially long cell outputs. NotEvolve instead renders the notebook into a compact text view, exposes only the cells that are useful for the current decision, and lets the agent actively summarize, fold, unfold, and delete cells as the notebook grows.

Beginning

Render notebook JSON into compact LLM context.

Execution

Append new cells, summarize runs, and clip long outputs.

Cleanup

Fold useful history and remove failed branches of work.

Notebook view at the beginning of a round:

Cell 0 (round plan): try two stronger variants
Cell 1 (unfolded, 50 lines): baseline(), output: score = 0.8

Notebook view during execution:

Cell 0 (round plan): try two stronger variants
Cell 1 (unfolded, 50 lines): baseline(), output: score = 0.8
Cell 2 (unfolded, 32 lines): methodA(), output: score = 0.6
Cell 3 (unfolded, 45 lines): methodB(), output: score = 1.0

Notebook view after cleanup:

Cell 0 (round plan): try two stronger variants
Cell 1 (folded): desc: baseline, score = 0.8
Cell x (deleted): low-score attempt removed
Cell 2 (kept, 45 lines): methodB(), output: score = 1.0
Cell 3 (round summary): A deleted; B retained; next: refine B.

Context management is part of the agent loop. At the start of a round, the renderer converts notebook state into compact text. During execution, tool calls add and run cells while summaries and output clipping keep observations short. At the end of a round, cleanup folds reusable history, deletes low-value cells, and preserves a concise summary for the next branch state.
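A minimal renderer along these lines might look as follows. It is a sketch under assumptions rather than the NotEvolve renderer: it reads the standard notebook JSON fields (`cells`, `cell_type`, `source`, `outputs`, `metadata`), but the `folded` and `desc` metadata keys, the output format, and the choice to keep the first lines of a clipped output are all hypothetical.

```python
from typing import Any, Dict, List


def _text(value) -> str:
    """Notebook JSON stores text either as one string or as a list of strings."""
    return value if isinstance(value, str) else "".join(value)


def render_notebook(notebook: Dict[str, Any], max_output_lines: int = 3) -> str:
    """Render notebook JSON as compact text, honoring folds and clipping outputs."""
    rendered: List[str] = []
    for index, cell in enumerate(notebook["cells"]):
        meta = cell.get("metadata", {})

        # Folded cells collapse to a one-line description (hypothetical keys).
        if meta.get("folded"):
            rendered.append(f"Cell {index} [folded] {meta.get('desc', '')}".rstrip())
            continue

        rendered.append(f"Cell {index} [{cell['cell_type']}]")
        rendered.append(_text(cell["source"]).rstrip())

        # Only plain stream output is shown; images and widget state are skipped.
        stream = "".join(
            _text(out.get("text", ""))
            for out in cell.get("outputs", [])
            if out.get("output_type") == "stream"
        )
        out_lines = stream.splitlines()
        if len(out_lines) > max_output_lines:
            hidden = len(out_lines) - max_output_lines
            out_lines = out_lines[:max_output_lines] + [f"... ({hidden} more lines clipped)"]
        rendered.extend(f"  > {line}" for line in out_lines)
    return "\n".join(rendered)


example_notebook = {
    "cells": [
        {"cell_type": "code", "source": "baseline()", "outputs": [],
         "metadata": {"folded": True, "desc": "baseline, score = 0.8"}},
        {"cell_type": "code", "source": ["methodB()\n"], "metadata": {},
         "outputs": [{"output_type": "stream", "name": "stdout",
                      "text": "a\nb\nc\nd\nscore = 1.0\n"}]},
    ]
}
compact_view = render_notebook(example_notebook, max_output_lines=2)
```

Clipping from the head, as done here, can hide a score printed on the last line. A summary cell or a tail-preserving clip would be needed to keep that signal in context.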

Results

We evaluate NotEvolve across four settings: mathematical optimization with directly evaluable objectives, Terminal Bench software tasks with a local open-weight model, MLEBench-style Kaggle workflows with long-horizon data and modeling loops, and KernelBench GPU kernel optimization.

Mathematical Optimization

Under the same Gemini 3 Flash base model, NotEvolve matches or exceeds AlphaEvolve/OpenEvolve-style baselines on the tested mathematical optimization tasks.

Problem	AlphaEvolve	OpenEvolve	NotEvolve
Circle Packing ↑	2.635	2.4672	2.635983
Erdos Min-Overlap ↓	0.380923	0.4334	0.38089
Heilbronn Triangle ↑	0.03653	0.0349	0.03653
Min-Max Distribution ↑	0.24004706	0.2300	0.24005088

Terminal Bench

We adapt ten Terminal Bench tasks and run them with and without the NotEvolve harness, using the same locally served Nemotron-3-Nano-30B model in both conditions. The notebook harness gives the model persistent executable state, shell outputs, intermediate files, and error-recovery context.

The bare Nemotron model achieves a pass rate of only 33.8% (27/80 trials) and fails entirely on 5 of 10 tasks. With the NotEvolve harness, the same model passes all 10 tasks in a single trial each, and on every task its completion time is lower than the bare model's average across 8 attempts (1.4× to 9.6×). This demonstrates that the notebook harness can compensate for a weaker base model on practical software engineering tasks requiring multi-step reasoning, tool use, and error recovery.

Task	Bare Nemotron Pass Rate	Bare Nemotron Avg Time	NotEvolve Pass Rate	NotEvolve Time	Speedup
hello-world	8/8	41s	1/1	12s	3.4×
csv-to-parquet	8/8	156s	1/1	17s	9.2×
fix-permissions	8/8	45s	1/1	25s	1.8×
extract-safely	2/8	55s	1/1	40s	1.4×
simple-web-scraper	1/8	163s	1/1	51s	3.2×
download-youtube	0/8	311s	1/1	149s	2.1×
vim-terminal-task	0/8	303s	1/1	151s	2.0×
count-dataset-tokens	0/8	358s	1/1	254s	1.4×
oom	0/8	250s	1/1	26s	9.6×
train-fasttext	0/8	639s	1/1	82s	7.8×
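The aggregate figures quoted for this table can be recomputed from its rows. The sketch below copies the table values and assumes that the Speedup column is the bare model's average time divided by the NotEvolve time, which is our reading of the table rather than a stated definition. The variable names are ours.

```python
# (task, bare passes out of 8, bare avg seconds, NotEvolve seconds, reported speedup)
ROWS = [
    ("hello-world", 8, 41, 12, 3.4),
    ("csv-to-parquet", 8, 156, 17, 9.2),
    ("fix-permissions", 8, 45, 25, 1.8),
    ("extract-safely", 2, 55, 40, 1.4),
    ("simple-web-scraper", 1, 163, 51, 3.2),
    ("download-youtube", 0, 311, 149, 2.1),
    ("vim-terminal-task", 0, 303, 151, 2.0),
    ("count-dataset-tokens", 0, 358, 254, 1.4),
    ("oom", 0, 250, 26, 9.6),
    ("train-fasttext", 0, 639, 82, 7.8),
]
TRIALS_PER_TASK = 8

# Pass rate of the bare model over all trials.
bare_passes = sum(passes for _, passes, _, _, _ in ROWS)
bare_trials = TRIALS_PER_TASK * len(ROWS)
bare_pass_rate = bare_passes / bare_trials

# Tasks on which the bare model never passes.
zero_pass_tasks = [name for name, passes, _, _, _ in ROWS if passes == 0]

# Speedup under the assumed definition: bare average time / NotEvolve time.
speedups = {name: bare / ours for name, _, bare, ours, _ in ROWS}
```

Note that the NotEvolve times come from a single trial per task, so these ratios compare one run against an 8-run average.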

MLEBench-style Kaggle Tasks

We compare NotEvolve with AIDE, CheetahHarness, and OpenEvolve on public Kaggle-style MLEBench tasks using the same locally hosted Nemotron-120B-Super model. This setting stresses notebook memory: the agent must inspect data, build pipelines, cache intermediate artifacts, recover from errors, and iterate on submissions.

NotEvolve is competitive with CheetahHarness, the strongest realistic code-development baseline, and is best or tied-best on several tasks including random-acts-of-pizza, tps-dec-2021, and dog-breed. The tradeoff is token budget and wall-clock time: NotEvolve exposes rich notebook state to the model, which increases context cost and leads to more compute-intensive pipelines. Stronger context compaction is the key next step.

Task	Metric	NotEvolve	AIDE	CheetahHarness	OpenEvolve
random-acts-of-pizza	AUC ↑	0.7799	0.7746	0.7689	0.531
nomad2018	RMSLE ↓	0.0621	0.0675	0.0603	0.259
spaceship-titanic	acc ↑	0.8149	0.7977	0.8218	0.802
spooky-author	LL ↓	0.4164	0.5494	0.4634	0.363
jigsaw-toxic	AUC ↑	0.9748	0.9748	0.9730	—
tps-dec-2021	acc ↑	0.9608	0.9405	0.8934	0.908
tps-may-2022	acc ↑	0.9270	0.9041	0.9082	0.982
dog-breed	LL ↓	0.7782	0.8275	4.1775	4.814
nyc-taxi (5.3 GB)	RMSE ↓	4.4589	5.3568	4.6838	3.119
dogs-vs-cats (1.2 GB)	LL ↓	0.3542	0.1213	0.6645	0.628

KernelBench GPU Kernels

KernelBench extends our evaluation to systems optimization. Each task asks an agent to replace a PyTorch workload with a correct and faster GPU kernel, so success requires more than code generation: the agent must compile, test, benchmark, and revise candidate kernels. This makes it a natural stress test for notebook-based memory, since a notebook can preserve candidate kernels, error traces, benchmark outputs, and debugging notes across attempts.

Across 300 trials, NotEvolve achieves higher overall mean speedup than CheetahHarness, 1.096 vs. 0.982, and finds more successful accelerations: 76 vs. 49 trials above 1.0x speedup and 15 vs. 4 trials above 2.0x speedup. The gain is strongest on Level 1, while Level 2 is essentially tied. NotEvolve uses fewer input tokens on average, but more output tokens and slightly higher wall-clock time, so this is not a wall-clock speedup claim. The result suggests that notebook-based state improves speedup discovery overall, but stronger harness engineering is still needed to reduce compile-run-debug overhead.
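The summary statistics above combine a mean with threshold counts, and the two can tell different stories. The sketch below shows the computation on made-up illustrative speedups, not on KernelBench results. `summarize_speedups` is a hypothetical helper, and it counts trials strictly above each threshold, which is an assumption about how the boundary case is treated.

```python
from statistics import mean
from typing import Dict, Sequence


def summarize_speedups(speedups: Sequence[float], thresholds=(1.0, 2.0)) -> Dict:
    """Return the mean speedup and the number of trials strictly above each threshold."""
    return {
        "mean": mean(speedups),
        "above": {t: sum(s > t for s in speedups) for t in thresholds},
    }


# Illustrative values only: most trials are at or below 1.0x, one is a large win.
toy_speedups = [0.5, 0.9, 1.0, 1.2, 2.4]
toy_summary = summarize_speedups(toy_speedups)
```

In this toy case the mean exceeds 1.0x even though only two of five trials are faster than the reference, because a few large wins dominate the average. The reported numbers have the same shape: a mean above 1.0x with 76 of 300 trials above 1.0x.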

Figure: KernelBench radar, speedup quality vs. efficiency, comparing NotEvolve (Notebook) with CheetahHarness (Cheetah). All axes are normalized so that farther outward is better.

Level 1 (strongest gain). Raw values: speedup 1.170 vs. 0.992; tokens 22.8k vs. 46.5k; runtime 182.2s vs. 174.4s.

Level 2 (essentially tied). Raw values: speedup 0.966 vs. 0.969; tokens 16.7k vs. 30.1k; runtime 182.2s vs. 163.4s.

Overall: 1.096 vs. 0.982 mean speedup; 76 vs. 49 trials >1.0x; 15 vs. 4 trials >=2.0x.

State Matters

The largest gains appear when progress depends on intermediate artifacts: evaluated layouts, shell outputs, trained models, cached data, and recoverable failed attempts.

Cost Tradeoff

The notebook state improves long-horizon behavior, but rich context can be expensive. Better state distillation is a key direction for future versions.

Case Study: Circle Packing

The Circle Packing task asks the agent to place 26 circles inside the unit square and maximize the sum of radii while keeping all circles non-overlapping and within the boundary. The notebook is useful here because the agent can repeatedly score layouts, visualize gaps, keep the best layout in kernel variables, and warm-start new solvers from previous cells.

Current notebook behavior (R1 C2, score 2.080000): Grid baseline. A simple grid gives a valid but loose packing.

Executable Scoring

Each candidate is evaluated in the notebook, so the score becomes an immediate signal for the next branch and the next outer-loop selection step.
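A scoring cell for this task could be as small as the sketch below. It follows the task statement (circles inside the unit square, no overlaps, maximize the sum of radii), but the function names, the numerical tolerance, and the convention of scoring an invalid layout as 0.0 are assumptions. The 6 by 5 grid is one simple baseline and is not meant to reproduce the score shown in the case study.

```python
import itertools
import math


def is_valid_packing(centers, radii, tol=1e-9):
    """Check that every circle lies in the unit square and no two circles overlap."""
    for (x, y), r in zip(centers, radii):
        if r < 0 or x - r < -tol or x + r > 1 + tol or y - r < -tol or y + r > 1 + tol:
            return False
    for (i, ci), (j, cj) in itertools.combinations(enumerate(centers), 2):
        if math.dist(ci, cj) < radii[i] + radii[j] - tol:
            return False
    return True


def packing_score(centers, radii):
    """Sum of radii for a valid layout; 0.0 for an invalid one (assumed convention)."""
    return sum(radii) if is_valid_packing(centers, radii) else 0.0


def grid_layout(n=26, cols=6, rows=5):
    """Place n equal circles on the first n slots of a cols x rows grid."""
    radius = min(1 / (2 * cols), 1 / (2 * rows))
    slots = [((i + 0.5) / cols, (j + 0.5) / rows) for j in range(rows) for i in range(cols)]
    return slots[:n], [radius] * n


centers, radii = grid_layout()
baseline_score = packing_score(centers, radii)
```

Keeping `centers`, `radii`, and `baseline_score` as kernel variables is what lets later cells compare new candidates against the current best without recomputing it.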

Visual Diagnosis

Plots reveal empty regions and contact structure that are hard to infer from source code alone.

Warm-start Search

Later cells reuse variables such as the current best centers and radii, then try new local optimizers and relocation moves.
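One simple form of warm-started search is a jitter-and-inflate hill climb, sketched below. This is an illustrative stand-in for the local optimizers and relocation moves mentioned above, not the method used in the case study. `max_feasible_radius` and `warm_start_search` are hypothetical names, and the four-circle starting layout is a toy example chosen to be deliberately loose.

```python
import math
import random


def max_feasible_radius(index, centers, radii):
    """Largest radius circle `index` can take given the walls and the other circles."""
    x, y = centers[index]
    limit = min(x, 1 - x, y, 1 - y)
    for j, (other, r) in enumerate(zip(centers, radii)):
        if j != index:
            limit = min(limit, math.dist(centers[index], other) - r)
    return limit


def warm_start_search(best_centers, best_radii, steps=200, jitter=0.02, seed=0):
    """Start from the best-known layout; accept a move only if it grows a radius."""
    rng = random.Random(seed)
    centers, radii = list(best_centers), list(best_radii)
    for _ in range(steps):
        i = rng.randrange(len(centers))
        old_center = centers[i]
        # Relocation move: nudge one center, then inflate that circle as far as it fits.
        centers[i] = (
            old_center[0] + rng.uniform(-jitter, jitter),
            old_center[1] + rng.uniform(-jitter, jitter),
        )
        candidate = max_feasible_radius(i, centers, radii)
        if candidate > radii[i]:
            radii[i] = candidate
        else:
            centers[i] = old_center  # reject the move and restore the old position
    return centers, radii


# Toy warm start: four small circles with plenty of slack around them.
best_centers = [(0.25, 0.25), (0.75, 0.25), (0.25, 0.75), (0.75, 0.75)]
best_radii = [0.1, 0.1, 0.1, 0.1]
new_centers, new_radii = warm_start_search(best_centers, best_radii)
```

Since a move is accepted only when the moved circle's radius grows and the other circles are untouched, the total score never decreases and a valid starting layout stays valid.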

Discussion

These experiments support the main hypothesis of the project: for long-horizon agents, state representation is a core capability. The notebook gives the agent a persistent laboratory where partial solutions, failures, summaries, plots, and live objects remain available to future reasoning steps.

The next challenge is making this state more efficient and robust. Notebooks can accumulate many cells and long outputs, so stronger context distillation, folding policies, and cleanup are needed. Kernel state also needs better checkpointing, replay, dependency tracking, and sandboxing before notebook-state evolution can be deployed broadly across scientific, systems, and machine-learning engineering workloads.