Changelog#
Summary of all updates from UPDATES.md.
2026-03-20 (continued — PRs #23, #28, #29, #30, #31, #33)#
8. Principled Prover Confidence Scoring#
The Prover now replaces purely format-based heuristics with a structured LLM self-assessment block appended to every proof response.
Prompt change: the system prompt instructs the LLM to append a fenced `json` block after the proof body:
```json
{
  "confidence": 0.0-1.0,
  "completeness": "complete|partial|sketch",
  "gaps": ["...uncertain step descriptions..."],
  "weakest_step": "...",
  "techniques_used": ["..."]
}
```
Calibration guide embedded in the prompt:
| Range | Meaning |
|---|---|
| ≥ 0.95 | Every step elementary or from a named theorem; no hand-waving |
| 0.80–0.94 | All key steps present; at most one routine calculation left implicit |
| 0.60–0.79 | Sketch correct but one non-trivial step not fully worked out |
| 0.40–0.59 | Main idea right but genuine gap flagged with `[GAP: ...]` |
| < 0.40 | Incomplete proof or possibly wrong approach |
`_parse_proof_attempt()` pipeline (6 steps):

1. Extract inline `[GAP: ...]` tags
2. Parse the structured JSON self-assessment block
3. Heuristic fallback if no valid block (presence of “QED”, “□”, “this completes”)
4. Citation-integrity check: penalty −0.15 per uncited lemma ID (cap −0.30)
5. Weasel-word penalty: −0.05 per “clearly”/“obviously”/etc. (cap −0.15)
6. Strip the JSON block from the stored `proof_text`; extract the optional Lean4 sketch
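The penalty arithmetic in steps 4 and 5 can be sketched as follows (a minimal illustration; the function and variable names are hypothetical, not the actual implementation):

```python
import re

# Illustrative weasel-word list; the real prompt enumerates its own set
WEASEL_WORDS = ("clearly", "obviously", "trivially", "evidently")

def adjust_confidence(confidence: float, proof_text: str,
                      cited_ids: set[str], proved_ids: set[str]) -> float:
    """Apply the citation-integrity and weasel-word penalties."""
    # Citation integrity: -0.15 per proved lemma never cited, capped at -0.30
    uncited = proved_ids - cited_ids
    confidence -= min(0.15 * len(uncited), 0.30)

    # Weasel words: -0.05 per occurrence, capped at -0.15
    hits = sum(len(re.findall(rf"\b{w}\b", proof_text.lower()))
               for w in WEASEL_WORDS)
    confidence -= min(0.05 * hits, 0.15)

    return max(confidence, 0.0)
```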
9. Verifier Bug Fix: Format Code Error#
`Verifier._peer_review()` was raising `Unknown format code 'f' for object of type 'str'` because `settings.verifier_pass_confidence` came back as a `str` from Pydantic’s env-var parsing. Fixed with an explicit `float()` cast before `:.2f` formatting.
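The failure mode is easy to reproduce: a numeric format spec applied to a string raises, and env-var-sourced settings can arrive as strings. A minimal sketch of the bug and the fix (the variable here stands in for the Pydantic setting):

```python
# Env-var-sourced settings may arrive as strings rather than floats
verifier_pass_confidence = "0.85"

try:
    msg = f"threshold: {verifier_pass_confidence:.2f}"
except ValueError:
    pass  # "Unknown format code 'f' for object of type 'str'"

# Fix: cast explicitly before numeric formatting
msg = f"threshold: {float(verifier_pass_confidence):.2f}"
```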
10. pdfTeX Font Expansion Fix#
`! pdfTeX error (font expansion): auto expansion is only possible with scalable fonts` was caused by microtype’s automatic font expansion being incompatible with `lmss` (Latin Modern Sans) on TeX Live 2022. Fixed by disabling expansion via `\RequirePackage[expansion=false]{microtype}` in `eureka.cls`.
11. Token Limit UI Sliders (Extended)#
Five additional token-limit controls added to the Settings tab:
| Setting | UI label | Default |
|---|---|---|
| | Assembler | 4096 |
| | Crystallizer | 2500 |
| | Proof architect | 3000 |
| | Analysis stages | 1600 |
| | Proof sketch | 600 |
`server.py`’s `_CONFIG_FIELDS` mapping updated to expose all 12 token-limit knobs to the UI.
All hardcoded `max_tokens=...` literals across theory and learning agent files replaced with `settings.max_tokens_*` references.
12. Human Intervention for Empty Survey#
When the survey stage completes with zero papers found, the pipeline pauses and prompts the user:
```
⚠ Survey stage completed but found 0 papers.
Please provide a comma-separated list of paper IDs/titles to retry,
or press Enter to proceed without papers:
```
If the user provides titles, the survey task is re-injected with the paper list and re-executed once. Implemented in `MetaOrchestrator._handle_empty_survey_fallback()`.
13. ProofCheckpoint Bug Fixes (Issue #27)#
| Bug | Fix |
|---|---|
| | Added |
| | Fixed to |
| | Fixed to |
14. Minimax LLM Backend (PR #23)#
New backend shortcut LLM_BACKEND=minimax:
| Variable | Description |
|---|---|
| | Minimax API key |
| | Minimax model name (e.g. …) |
`active_model` / `active_fast_model` properties added to `Config`: they resolve the correct model string for any backend (Anthropic, Minimax, OpenAI-compat). All agents and theory stages updated to use `settings.active_model` / `settings.active_fast_model` instead of hardcoded model names.
15. PDF Extraction via Docling (PR #23)#
`PaperReader._extract_from_paper_pdf()` fetches the full arXiv PDF via Docling, filters to theorem/lemma sections with a regex pass, then runs the LLM extractor over the rich excerpt instead of the abstract only.
Enable with the `[pdf]` optional extra:

```
pip install eurekaclaw[pdf]
```
16. Memory Relevance-Based Retrieval (PR #30)#
Tier 4 domain memories are now ranked by cosine similarity to the current task query instead of most-recently-written ordering.
- `_index.json` now stores an `"embedding"` field per memory file (computed at write time using `embedding_utils.get_embedding()`)
- `MemoryManager.load_for_injection(domain, k, query)` accepts a `query` argument; if provided, it ranks candidates by cosine similarity and returns the top-k most relevant files
- New helper module: `eurekaclaw/memory/embedding_utils.py` (`get_embedding()`, `cosine_similarity()`)
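A plausible shape for the similarity half of the helper module (pure-Python sketch; the real `get_embedding()` presumably calls an embedding API, and the ranking helper below is an invented name):

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    if na == 0 or nb == 0:
        return 0.0
    return dot / (na * nb)

def rank_by_relevance(query_emb, memories, k):
    """memories: list of (path, embedding) pairs. Return top-k paths by similarity."""
    scored = sorted(memories,
                    key=lambda m: cosine_similarity(query_emb, m[1]),
                    reverse=True)
    return [path for path, _ in scored[:k]]
```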
17. smile.sty Math Macro Package (PR #28)#
A comprehensive SMiLe-group math macro package (`smile.sty`) is now bundled with the template:

- Blackboard bold shortcuts: `\RR`, `\EE`, `\PP`, `\NN`, `\ZZ`, `\QQ`, `\CC`, …
- Calligraphic shortcuts: `\cA` … `\cZ`
- Bold vectors (mathbf family): `\xb`, `\Wb`, …; (bm family): `\bx`, `\bW`, …
- Bold Greek: `\balpha`, `\bbeta`, `\bGamma`, …
- Norm/bracket macros: `\norm{x}`, `\nbr{x}`, `\bignorm{x}`, `\opnorm{x}{p}`
The writer agent system prompt is updated to list all available macros and instruct the LLM to use them instead of redefining equivalents.
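Definitions behind such shortcuts typically look like the following (illustrative guesses at a few of the listed macros, not the actual contents of `smile.sty`):

```latex
% Plausible definitions for a handful of the listed macros
\newcommand{\RR}{\mathbb{R}}
\newcommand{\EE}{\mathbb{E}}
\newcommand{\cA}{\mathcal{A}}
\newcommand{\xb}{\mathbf{x}}
\newcommand{\balpha}{\boldsymbol{\alpha}}
\newcommand{\norm}[1]{\left\lVert #1 \right\rVert}
\newcommand{\opnorm}[2]{\left\lVert #1 \right\rVert_{#2}}
```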
18. OpenClaw Hub Integration (PR #31)#
Skills can now be installed directly from the ClawHub registry:
```
eurekaclaw install-skills <skillname>
# e.g. eurekaclaw install-skills steipete/github
```
`install-skills` with no argument continues to install all bundled seed skills. The `clawhub` CLI must be installed separately. The implementation lives in `eurekaclaw/skills/from_hub.py`.
19. Bug Fix: pause AttributeError (PR #33)#
`cli.py` referenced `exc.paused_before_stage`, but `ProofPausedException` stores the field as `stage_name`. Fixed both call sites so pausing a run no longer crashes with `AttributeError`.
20. Direction Planning Fallback (PR #33)#
When `DivergentConvergentPlanner.diverge()` returns an empty list or throws an exception, the orchestrator now halts and prompts the user to enter a research direction manually instead of silently continuing with no direction. The survey’s open problems are shown as context. Empty input or Ctrl+C raises `RuntimeError` and exits cleanly.
2026-03-20#
1. Theory Review Gate#
A new `theory_review_gate` orchestrator task is inserted between the TheoryAgent and WriterAgent in `default_pipeline.yaml`. It is always shown, regardless of `--gate` mode.
- Displays a numbered lemma chain (L1, L2, …) with `✓ verified` / `~ low confidence` tags.
- `y` / Enter → proceed to writer.
- `n` → the user specifies the most problematic step (by number or ID) and describes the issue. The TheoryAgent re-runs once with the feedback injected into its task description, then shows the updated sketch once more.
2. Pause / Resume#
Two new CLI commands and a graceful SIGINT handler:
| Command | Description |
|---|---|
| | Write |
| | Load checkpoint and continue from the saved stage |
| Ctrl+C during | Same as |
Checkpoint file: `~/.eurekaclaw/sessions/<session_id>/checkpoint.json`
3. ProofArchitect Improvements#
- `source=None` crash fixed: `item.get("source") or ""` instead of `item.get("source", "")`, so a JSON `null` value never passes through as `None`.
- 3-layer fallback: full plan (4–10 lemmas) → 3-lemma simplified plan → single `main_result`. Single-goal is now truly a last resort.
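The distinction matters because `dict.get(key, default)` only applies the default when the key is absent, not when its value is an explicit `null`. A minimal illustration (the `item` payload here is hypothetical):

```python
item = {"source": None}  # parsed from JSON: {"source": null}

# Buggy: the default is ignored because the key exists, so None leaks through
assert item.get("source", "") is None

# Fixed: `or` coalesces both missing keys and explicit nulls to ""
source = item.get("source") or ""
```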
4. Citation Quality Improvements#
- Assembler prompt requires explicit `[lemma_id]` citations in the assembled proof (e.g., `By [arm_pull_count_bound], …`).
- ConsistencyChecker receives the list of proved lemma IDs, verifies that all appear in the proof, and returns `uncited_lemmas`.
- Retry routing: when a ConsistencyChecker failure is citation-related (“uncited” / “missing citation”), the retry loop re-runs the Assembler as well as the Crystallizer, not just the Crystallizer alone.
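The uncited-lemma check reduces to a scan for `[lemma_id]` tokens. A hedged sketch (illustrative only; the real ConsistencyChecker runs as part of an LLM-assisted review):

```python
import re

def find_uncited_lemmas(proof_text: str, proved_ids: list[str]) -> list[str]:
    """Return proved lemma IDs that are never cited as [id] in the proof."""
    cited = set(re.findall(r"\[([A-Za-z0-9_]+)\]", proof_text))
    return [lid for lid in proved_ids if lid not in cited]
```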
5. TheoremCrystallizer Fixes#
- `max_tokens` raised from 1500 → 2500 to prevent mid-expression truncation of LaTeX formulas.
- Added an explicit no-truncation constraint for math environments.
6. Knowledge Graph Write Timing#
Tier 3 KG writes now trigger whenever `proven_count > 0` (i.e., after any lemma is proved), not only when the full consistency check passes. Lemma nodes and dependency edges are preserved even when crystallization fails.
7. Bug Fixes#
| File | Bug | Fix |
|---|---|---|
| | | |
| | | Added |
| | | Added explicit re-raise |
| | | |
2026-03-19#
1. Robust Lemma Decomposer Parsing#
`_parse_lemmas` in `agents/theory/decomposer.py` now uses a 4-pass extraction strategy instead of 2, preventing the “Empty lemma list from decomposer” fallback in most cases:
| Pass | Strategy |
|---|---|
| 1 | JSON inside |
| 2 | First JSON object |
| 3 | First JSON array |
| 4 | Plain-text numbered/bulleted list heuristic |
`_normalize_list` accepts flexible field names per item (`id`/`lemma_id`/`name`/`title`, `statement`/`formal_statement`/`hypothesis`/`content`, etc.) so variant LLM output schemas are handled without falling back to single-theorem mode.
The same 4-pass strategy was also applied to `ProofArchitect._parse_lemmas`.
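The 4-pass cascade can be approximated as follows (a sketch under assumptions: the fenced-block and plain-list heuristics here are simplified stand-ins for the real passes):

```python
import json
import re

def extract_lemmas(text: str):
    """4-pass extraction: fenced JSON, first object, first array, plain list."""
    # Pass 1: JSON inside a fenced code block
    m = re.search(r"```(?:json)?\s*(.*?)```", text, re.DOTALL)
    if m:
        try:
            return json.loads(m.group(1))
        except json.JSONDecodeError:
            pass
    # Passes 2 and 3: first JSON object, then first JSON array
    for pattern in (r"\{.*\}", r"\[.*\]"):
        m = re.search(pattern, text, re.DOTALL)
        if m:
            try:
                return json.loads(m.group(0))
            except json.JSONDecodeError:
                continue
    # Pass 4: plain-text numbered/bulleted list heuristic
    items = re.findall(r"^\s*(?:\d+[.)]|[-*])\s+(.+)$", text, re.MULTILINE)
    return [{"statement": s} for s in items]
```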
2. UI Polling Log Suppression#
`GET /api/runs/<id> 200` status-poll requests are now logged at DEBUG level instead of INFO, removing repetitive log noise during long runs.
3. Bug Fixes#
| File | Bug | Fix |
|---|---|---|
| | | Wrapped |
| | | Uses dynamic |
| | | Added |
| | | Demoted to |
4. Always-On Stage Summary Cards#
`orchestrator/gate.py` now prints a Rich summary card after every completed pipeline stage regardless of `GATE_MODE`:
| Stage | Card shows |
|---|---|
| | Papers found, open problems, key mathematical objects |
| | Proof status, per-lemma breakdown with confidence tags |
| | Alignment score, per-lemma numerical check results |
| | Full session summary before final output |
5. Human Gate Improvements#
- Text feedback input: after approving a gate, users can type a correction or hint that is injected into the next agent’s task via `get_user_feedback()`
- Auto-escalation: if ≥ 1 lemma has `verified=False` after the theory stage, the gate auto-escalates from `auto` to `human` with a full lemma confidence breakdown
- Default changed: the `GATE_MODE` default changed from `none` to `auto`
6. Proof Readability Enforcement (WriterAgent)#
`ENFORCE_PROOF_STYLE=true` (default) injects `_PROOF_STYLE_RULES` into writer prompts:

- No skip words (“clearly”, “trivially”, “by standard arguments”) without immediate justification
- Every inequality must name the lemma or theorem it uses
- Each lemma proof opens with 1–2 sentences of informal explanation
- Low-confidence lemmas tagged `\textcolor{orange}{\textbf{[Unverified step]}}` in the PDF
- Limitations section explains all unverified steps
- Added `\usepackage{xcolor}` to `LATEX_PREAMBLE`
7. Targeted Numerical Testing (ExperimentAgent)#
ExperimentAgent now separates proven lemmas into `verified` and `low_confidence` groups:

- For each low-confidence lemma: runs a dedicated numerical test and computes a `violation_rate`
- Lemmas with `violation_rate > 1%` flagged as `numerically_suspect`
- Experiment stage summary card shows per-lemma check results with color coding
- WriterAgent adds stronger warnings for `numerically_suspect` lemmas
8. CLI Default Output Directory#
All three CLI commands (`prove`, `explore`, `from-papers`) now default to `./results` if `--output` is not specified.
9. Bibliography & Citation Fix#
Fixed the `\cite{}` → “?” bug in output PDFs:

- `_fix_missing_citations()` surgically removes `\cite{}` keys that have no matching entry in `references.bib`
- `LATEX_END` split into `LATEX_END_WITH_BIB` / `LATEX_END_NO_BIB`; `\bibliography{references}` is only added when the `.bib` file is non-empty
- WriterAgent prompt instructs the LLM to use exact cite keys (same as `_generate_bibtex`)
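A hedged, regex-level approximation of what `_fix_missing_citations()` plausibly does (the real routine may be more careful about optional arguments and spacing):

```python
import re

def fix_missing_citations(tex: str, bib_keys: set[str]) -> str:
    """Drop \\cite{...} keys that have no matching entry in references.bib."""
    def repl(m: re.Match) -> str:
        keys = [k.strip() for k in m.group(1).split(",")]
        kept = [k for k in keys if k in bib_keys]
        # Keep the command only if at least one key survives
        return r"\cite{" + ",".join(kept) + "}" if kept else ""
    return re.sub(r"\\cite\{([^}]*)\}", repl, tex)
```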
10. LaTeX Extraction Improvements#
Extended `_extract_latex` in WriterAgent with:

- Markdown heading → LaTeX section conversion
- `tikzpicture` environment removal
- Environment name normalization additions: `rem`→`remark`, `rema`→`remark`, `prop`→`proposition`, `defin`→`definition`, `corolary`→`corollary`, `Cor`→`corollary`, `Thm`→`theorem`, `Lem`→`lemma`
- `\endproof` → `\end{proof}` substitution
- QED box removal (`\begin{flushright}$\square$\end{flushright}`)
- Two-pass `_close_open_environments`: (1) remove orphaned `\end{X}`, (2) append missing `\end{X}`
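The environment repair can be sketched with a stack; this single-scan approximation combines both passes (drop orphaned closers, append missing ones) and is illustrative, not the actual `_close_open_environments`:

```python
import re

ENV = re.compile(r"\\(begin|end)\{(\w+)\}")

def close_open_environments(tex: str) -> str:
    """Drop orphaned \\end{X}; append \\end{X} for any unclosed \\begin{X}."""
    out, stack = [], []
    pos = 0
    for m in ENV.finditer(tex):
        out.append(tex[pos:m.start()])
        pos = m.end()
        kind, name = m.group(1), m.group(2)
        if kind == "begin":
            stack.append(name)
            out.append(m.group(0))
        elif stack and stack[-1] == name:
            stack.pop()
            out.append(m.group(0))
        # else: orphaned \end{X}, silently dropped
    out.append(tex[pos:])
    # Close any environments still open, innermost first
    out.extend(rf"\end{{{name}}}" for name in reversed(stack))
    return "".join(out)
```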
2026-03-18#
1. Context Compression#
Optimization summary:
| Stage | Before | After | Saving |
|---|---|---|---|
| Formalizer model | | | ~90% cost per call |
| Formalizer re-run | Every iteration | Only on change | Saves N-1 calls |
| Proven lemma context | 200-char proof text | Statement only (120) | ~40% input tokens |
| Verifier (high-conf) | Always LLM call | Auto-accept ≥ 0.85 | ~30% fewer calls |
| Verifier proof text | Raw (up to 3000 chars) | Head+tail (1000 chars) | ~67% input tokens |
| Agent loop history | Accumulates forever | Compressed every 6 turns | ~60% input tokens |
| Stagnation | Wastes 3+ iterations | Forced refinement | Saves full iterations |
Run performance: typical run now ~20 LLM calls (down from ~35); worst case ~55 (down from ~100).
Agent `max_turns` reductions:

| Agent | Before | After |
|---|---|---|
| SurveyAgent | 15 | 8 |
| IdeationAgent | 5 | 3 |
| ExperimentAgent | 10 | 5 |
| WriterAgent | 5 | 3 |
New config knobs: `CONTEXT_COMPRESS_AFTER_TURNS`, `AUTO_VERIFY_CONFIDENCE`, `STAGNATION_WINDOW`
2. Configurable Token Limits Per Call Type#
7 new .env variables:
| Variable | Default | Applies to |
|---|---|---|
| | | All agent reasoning loops |
| | | Proof generation |
| | | Research direction planning |
| | | Lemma decomposition |
| | | Formalization, refiner, counterexample |
| | | Proof verification |
| | | Context compression |
UI sliders added in the Settings tab for all 7 values.
3. Multi-Backend LLM Support#
Three named backends in `config.py` and `llm/factory.py`:

| Backend | | Notes |
|---|---|---|
| Anthropic native | | Default |
| OpenRouter | | Set |
| Local (vLLM/Ollama) | | Defaults to |
ccproxy / OAuth fallback: if `ANTHROPIC_API_KEY` is empty, automatically reads `~/.claude/.credentials.json` and routes through ccproxy (this lets Claude Pro/Max users run without a separate API key).
4. Additional Tuning Knobs#
| Variable | Default | Effect |
|---|---|---|
| | | Tool-use turns in SurveyAgent |
| | | Turns per theory stage |
| | | Turns for paper generation |
| | | Hard cap on arXiv results |
| | | Retry attempts on 5xx / rate-limit errors |
| | | Exponential backoff bounds |
5. Domain Plugin Architecture#
New three-tier plugin system:
```
EurekaClaw (general pipeline)
└── DomainPlugin (e.g. MAB)
    └── Workflow (per-domain prompt guidance)
```
New files:
- `domains/base.py`: `DomainPlugin` ABC with 4 methods
- `domains/__init__.py`: `@register_domain` decorator + `resolve_domain()`
- `domains/mab/`: MAB domain plugin (GaussianBandit, BernoulliBandit, 4 tools, 4 skills, 3-level benchmark)
Core changes:
- `SkillRegistry.add_skills_dir()`: load skills from domain directories
- `build_default_registry()`: now domain-agnostic; domain tools are registered via the plugin
- `MetaOrchestrator`: accepts a `domain_plugin` parameter
- `EurekaSession.run()`: auto-detects the domain plugin from `InputSpec.domain`
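The registry pattern behind `@register_domain` / `resolve_domain()` can be sketched as follows. The decorator and resolver names come from the changelog; the ABC’s method names and the MAB tool list below are invented for illustration (the real ABC declares 4 methods):

```python
from abc import ABC, abstractmethod

_REGISTRY: dict[str, type] = {}

class DomainPlugin(ABC):
    """Base class for domain plugins (method names here are hypothetical)."""
    @abstractmethod
    def tools(self) -> list: ...
    @abstractmethod
    def skills_dir(self) -> str: ...

def register_domain(name: str):
    """Class decorator mirroring domains/__init__.py's @register_domain."""
    def wrap(cls):
        _REGISTRY[name] = cls
        return cls
    return wrap

def resolve_domain(name: str):
    """Look up and instantiate a registered plugin, or return None."""
    cls = _REGISTRY.get(name)
    return cls() if cls else None

@register_domain("mab")
class MABPlugin(DomainPlugin):
    def tools(self): return ["gaussian_bandit", "bernoulli_bandit"]
    def skills_dir(self): return "domains/mab/skills"
```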
6. LaTeX Compilation Robustness#
- Added 7 `\newtheorem` declarations to `LATEX_PREAMBLE`: `assumption`, `maintheorem`, `conjecture`, `claim`, `example`, `fact`, `observation`
- Environment name normalization in `_extract_latex` (step 6): mis-cased names corrected
- `_close_open_environments()`: new static method, stack-based, prevents `\begin{tabular}` truncation from causing fatal errors
- Removed `_rescue_compile` and the `paper_rescue.tex` fallback entirely
- Full bibtex-aware compile sequence: `pdflatex → bibtex → pdflatex → pdflatex`
7. Bibliography & Reference Resolution#
- Write order fixed: `references.bib` is saved before `_compile_pdf()` is called
- bibtex added to the compile sequence: previously only `pdflatex` ran twice
- Cite key consistency: `_compute_cite_keys()` in WriterAgent uses the identical algorithm to `_generate_bibtex` in `main.py`
- Duplicate key handling: `_generate_bibtex` deduplicates with `a`, `b`, … suffixes
- `ResearchOutput.bibliography_json` field added
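The `a`, `b`, … suffix scheme can be sketched as follows (illustrative; whether the first occurrence keeps the bare key is an assumption about `_generate_bibtex`):

```python
import string

def dedup_cite_keys(keys: list[str]) -> list[str]:
    """Append a, b, ... suffixes to repeated cite keys, preserving order."""
    seen: dict[str, int] = {}
    out = []
    for key in keys:
        n = seen.get(key, 0)
        seen[key] = n + 1
        # First occurrence keeps the bare key; repeats get a, b, ... suffixes
        out.append(key if n == 0 else key + string.ascii_lowercase[n - 1])
    return out
```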
8. Experiment Mode Control#
New `EXPERIMENT_MODE` env var:

- `auto`: skip the experiment if the formal statement has no quantitative signals
- `true`: always run the experiment stage
- `false`: always skip the experiment stage