# Development Log
## v0.2.0 — 2026-04-20

### 1. Paper Q&A — Multi-Turn Dialogue with Tool Calls

After the writing stage completes, the Paper tab now includes an interactive Q&A panel. Users can ask questions about the generated paper to understand its content and to help draft reviewer rebuttals.
How it works:

- PaperQAAgent places the full LaTeX source in the system prompt (leveraging prompt caching for efficient multi-turn dialogue)
- Tool-call support: arxiv_search, semantic_scholar, web_search, latex_section_read. The agent can retrieve recent literature and look up specific sections of the paper, so answers stay grounded in evidence
- Each question allows up to 5 rounds of tool calls; answers stream live to both the CLI and the browser UI
- Q&A history is persisted to paper_qa_history.jsonl in the session directory and survives browser refreshes and session restores
New tool latex_section_read:

The agent can read any section of the paper by name, without embedding the full document in every message:

```
latex_section_read(section="Theorem 2")
→ returns the content from \section{} up to the next heading
```
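The extraction behind this tool can be sketched in a few lines (illustrative only; the real implementation in eurekaclaw/tools/latex_section.py may normalize section names and handle subsections differently):

```python
import re

def read_section(latex_source: str, section: str) -> str:
    """Return the body of \\section{section} up to the next sectioning command."""
    pattern = re.compile(
        # From the requested heading (escaped, since it is user input) up to
        # the next \section, \end{document}, or the end of the string.
        r"\\section\{" + re.escape(section)
        + r"\}(.*?)(?=\\section\{|\\end\{document\}|\Z)",
        re.DOTALL,
    )
    match = pattern.search(latex_source)
    if match is None:
        raise KeyError(f"section not found: {section}")
    return match.group(1).strip()
```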
Example CLI interaction:

```
Review the paper? [y/N]: y
Question (Enter to accept): The Theorem 2 bound looks loose. Is O(n²) tight?
✓ arxiv_search: "spectral gap tight bound"
✓ latex_section_read: section "Theorem 2"
The O(n²) bound comes from Weyl's inequality. [Chen et al. 2024] shows that for
regular graphs, the Cauchy interlacing theorem tightens it to O(n log n). The
current proof does not exploit regularity, so there is room for improvement.
What next? [a]ccept [q]uestion [r]ewrite: q
```
New files: eurekaclaw/agents/paper_qa/agent.py (multi-turn ask() method), eurekaclaw/tools/latex_section.py, eurekaclaw/orchestrator/paper_qa_handler.py

### 2. Paper Rewrite — Targeted Revision Driven by Q&A Feedback

Users can now trigger a full paper rewrite directly from the paper Q&A gate, in both the browser UI and the CLI. The rewrite injects the revision instruction as feedback and re-runs the complete theory proof → experiment validation → writing pipeline.

How it works:

- The user enters a revision instruction in the Q&A dialog (for example: "use the Cauchy interlacing theorem from Lemma 2.1 to tighten the Theorem 2 bound from O(n²) to O(n log n)")
- POST /api/runs/{id}/rewrite returns 202 Accepted immediately; the pipeline runs in the background, and the Live tab shows theory → experiment → writing progress in real time
- On success: the new paper version is saved (paper_v2.tex, paper_v3.tex, …) and the Paper tab refreshes automatically
- On failure: the previous paper version is left untouched and an error marker appears in the dialog

Unified rewrite interface: the endpoint behaves identically whether the session is still running (gated mode) or already finished (historical session). This eliminates the Failed to fetch error previously caused by the 30-second proxy timeout when rewriting historical sessions.

Rewrite scope: theory proof → experiment validation (honoring the EXPERIMENT_MODE setting) → writing. The experiment agent's own skip logic (EXPERIMENT_MODE=false) still applies; no extra configuration is needed.
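The 202-and-background-task pattern described above can be sketched framework-agnostically (illustrative names only; the actual endpoint lives in eurekaclaw/ui/server.py):

```python
import threading

_active: set[str] = set()   # run ids with a rewrite currently in flight
_lock = threading.Lock()

def start_rewrite(run_id: str, instruction: str, pipeline) -> tuple[int, dict]:
    """Return an HTTP-style (status, body) immediately; run the pipeline in the background."""
    with _lock:
        if run_id in _active:
            # A second rewrite for the same run is rejected while one is running.
            return 409, {"error": "rewrite already in progress"}
        _active.add(run_id)

    def worker() -> None:
        try:
            pipeline(run_id, instruction)  # theory -> experiment -> writing re-run
        finally:
            with _lock:
                _active.discard(run_id)

    threading.Thread(target=worker, daemon=True).start()
    return 202, {"status": "accepted", "run_id": run_id}
```

Because the handler only spawns a thread and returns, the HTTP response never waits on the pipeline, which is what avoids proxy timeouts.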
Paper version management:

```
~/.eurekaclaw/runs/<session_id>/
├── paper_v1.tex             # original version
├── paper_v2.tex             # after the first rewrite
├── paper_qa_history.jsonl   # full Q&A and rewrite log
└── rewrite_context_v2.json  # Q&A summary and revision instruction used for this rewrite
```
Error recovery:

| Failure scenario | Handling |
|---|---|
| Theory agent fails | Falls back to … |
| Writer agent fails | Falls back to … |
| Rewrite already in progress | Returns … |
| Session state missing/corrupted | Returns … |

Modified files: eurekaclaw/ui/server.py (new /rewrite endpoint, background task, _ensure_bus_activated helper), eurekaclaw/orchestrator/paper_qa_handler.py (_do_rewrite extended to cover the experiment stage), frontend/src/components/workspace/paper-review/usePaperSession.ts (simplified state machine, unified onRewrite)
## 2026-03-20 (continued — PRs #23, #28, #29, #30, #31, #33)

### 8. Principled Prover Confidence Scoring
The Prover now replaces purely format-based heuristics with a structured LLM self-assessment block appended to every proof response.
Prompt change: the system prompt instructs the LLM to append a fenced json block after the proof body:

```json
{
  "confidence": 0.0-1.0,
  "completeness": "complete|partial|sketch",
  "gaps": ["...uncertain step descriptions..."],
  "weakest_step": "...",
  "techniques_used": ["..."]
}
```
Calibration guide embedded in the prompt:
| Range | Meaning |
|---|---|
| ≥ 0.95 | Every step elementary or from a named theorem; no hand-waving |
| 0.80–0.94 | All key steps present; at most one routine calculation left implicit |
| 0.60–0.79 | Sketch correct but one non-trivial step not fully worked out |
| 0.40–0.59 | Main idea right but genuine gap flagged with [GAP: ...] |
| < 0.40 | Incomplete proof or possibly wrong approach |
_parse_proof_attempt() pipeline (6 steps):

1. Extract inline [GAP: ...] tags
2. Parse the structured JSON self-assessment block
3. Heuristic fallback if no valid block (presence of "QED", "□", "this completes")
4. Citation-integrity check: penalty −0.15 per uncited lemma ID (cap −0.30)
5. Weasel-word penalty: −0.05 per "clearly"/"obviously"/etc. (cap −0.15)
6. Strip the JSON block from the stored proof_text; extract an optional Lean4 sketch
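Steps 2–5 above can be sketched as follows (a minimal illustration; the fallback confidence values 0.5 and 0.7 are assumptions, not the real parser's constants):

```python
import json
import re

WEASEL = ("clearly", "obviously", "trivially")

def parse_confidence(proof: str, cited_ids: set[str], lemma_ids: set[str]) -> float:
    conf = 0.5  # default when neither JSON block nor heuristics apply (assumed value)
    m = re.search(r"```json\s*(\{.*?\})\s*```", proof, re.DOTALL)
    if m:
        # Step 2: parse the structured self-assessment block.
        try:
            conf = float(json.loads(m.group(1))["confidence"])
        except (json.JSONDecodeError, KeyError, ValueError):
            pass
    elif any(tok in proof for tok in ("QED", "□", "this completes")):
        conf = 0.7  # step 3: format-based heuristic fallback (assumed value)
    # Step 4: citation-integrity penalty, -0.15 per uncited lemma ID, cap -0.30.
    uncited = lemma_ids - cited_ids
    conf -= min(0.15 * len(uncited), 0.30)
    # Step 5: weasel-word penalty, -0.05 each, cap -0.15.
    n_weasel = sum(proof.lower().count(w) for w in WEASEL)
    conf -= min(0.05 * n_weasel, 0.15)
    return max(conf, 0.0)
```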
### 9. Verifier Bug Fix: Format Code Error
Verifier._peer_review() was raising Unknown format code 'f' for object of type 'str' because settings.verifier_pass_confidence returned a str via Pydantic's env-var parsing. Fixed by explicit float() cast before :.2f formatting.
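The failure reproduces in two lines, and the fix is the explicit cast:

```python
threshold = "0.85"          # Pydantic env-var parsing yielded a str, not a float
try:
    f"{threshold:.2f}"      # raises: Unknown format code 'f' for object of type 'str'
except ValueError as err:
    assert "Unknown format code" in str(err)

# The fix: cast to float before applying the float format spec.
assert f"{float(threshold):.2f}" == "0.85"
```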
### 10. pdfTeX Font Expansion Fix
The error ! pdfTeX error (font expansion): auto expansion is only possible with scalable fonts was caused by microtype's automatic font expansion being incompatible with lmss (Latin Modern Sans) on TeX Live 2022. Fixed by passing expansion=false via \RequirePackage[expansion=false]{microtype} in eureka.cls.
### 11. Token Limit UI Sliders (Extended)
Five additional token-limit controls added to the Settings tab:
| Setting | UI label | Default |
|---|---|---|
| | Assembler | 4096 |
| | Crystallizer | 2500 |
| | Proof architect | 3000 |
| | Analysis stages | 1600 |
| | Proof sketch | 600 |
server.py's _CONFIG_FIELDS mapping updated to expose all 12 token-limit knobs to the UI.
All hardcoded max_tokens=... literals across theory and learning agent files replaced with settings.max_tokens_* references.
### 12. Human Intervention for Empty Survey
When the survey stage completes with zero papers found, the pipeline pauses and prompts the user:
```
⚠ Survey stage completed but found 0 papers.
Please provide a comma-separated list of paper IDs/titles to retry,
or press Enter to proceed without papers:
```
If the user provides titles, the survey task is re-injected with the paper list and re-executed once. Implemented in MetaOrchestrator._handle_empty_survey_fallback().
### 13. ProofCheckpoint Bug Fixes (Issue #27)

| Bug | Fix |
|---|---|
| | Added |
| | Fixed to |
| | Fixed to |
### 14. Minimax LLM Backend (PR #23)
New backend shortcut LLM_BACKEND=minimax:
| Variable | Description |
|---|---|
| | Minimax API key |
| | Minimax model name (e.g. |
active_model / active_fast_model properties added to Config — resolve the correct model string for any backend (Anthropic, Minimax, OpenAI-compat). All agents and theory stages updated to use settings.active_model / settings.active_fast_model instead of hardcoded model names.
### 15. PDF Extraction via Docling (PR #23)
PaperReader._extract_from_paper_pdf() fetches the full arXiv PDF via Docling, filters to theorem/lemma sections with a regex pass, then runs the LLM extractor over the rich excerpt instead of the abstract only.
Enable with the [pdf] optional extra:
```
pip install eurekaclaw[pdf]
```
### 16. Memory Relevance-Based Retrieval (PR #30)
Tier 4 domain memories are now ranked by cosine similarity to the current task query instead of most-recently-written ordering.
- _index.json now stores an "embedding" field per memory file (computed at write time using embedding_utils.get_embedding())
- MemoryManager.load_for_injection(domain, k, query) accepts a query argument; if provided, it ranks candidates by cosine similarity and returns the top-k most relevant files
- New helper module: eurekaclaw/memory/embedding_utils.py (get_embedding(), cosine_similarity())
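The ranking step can be sketched as follows (illustrative; it assumes each _index.json entry carries a "file" name alongside its "embedding", which may not match the actual index schema):

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def rank_memories(index: list[dict], query_emb: list[float], k: int) -> list[str]:
    """Return the k memory files most similar to the query embedding."""
    scored = sorted(
        index,
        key=lambda entry: cosine_similarity(entry["embedding"], query_emb),
        reverse=True,  # highest similarity first
    )
    return [entry["file"] for entry in scored[:k]]
```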
### 17. smile.sty Math Macro Package (PR #28)
A comprehensive SMiLe-group math macro package (smile.sty) is now bundled with the template:
- Blackboard bold shortcuts: \RR, \EE, \PP, \NN, \ZZ, \QQ, \CC, …
- Calligraphic shortcuts: \cA … \cZ
- Bold vectors, mathbf family: \xb, \Wb, …; bm family: \bx, \bW, …
- Bold Greek: \balpha, \bbeta, \bGamma, …
- Norm/bracket macros: \norm{x}, \nbr{x}, \bignorm{x}, \opnorm{x}{p}
The writer agent system prompt is updated to list all available macros and instruct the LLM to use them instead of redefining equivalents.
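For illustration, a Chebyshev-style display written only with macros from the list above (assuming they carry their usual meanings):

```latex
% Illustrative usage of the smile.sty shortcuts listed above.
\[
  \PP\bigl( \norm{\xb - \EE[\xb]} > t \bigr)
  \le \frac{\EE \norm{\xb - \EE[\xb]}^2}{t^2},
  \qquad \xb \in \RR^d .
\]
```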
### 18. OpenClaw Hub Integration (PR #31)
Skills can now be installed directly from the ClawHub registry:
```
eurekaclaw install-skills <skillname>
# e.g. eurekaclaw install-skills steipete/github
```
install-skills with no argument continues to install all bundled seed skills. The clawhub CLI must be installed separately. The implementation lives in eurekaclaw/skills/from_hub.py.
### 19. Bug Fix: pause AttributeError (PR #33)
cli.py referenced exc.paused_before_stage but ProofPausedException stores the field as stage_name. Fixed both call sites so pausing a run no longer crashes with AttributeError.
### 20. Direction Planning Fallback (PR #33)
When DivergentConvergentPlanner.diverge() returns an empty list or throws an exception, the orchestrator now halts and prompts the user to enter a research direction manually instead of silently continuing with no direction. The survey's open problems are shown as context. Empty input or Ctrl+C raises RuntimeError and exits cleanly.
## 2026-03-20

### 1. Theory Review Gate
A new theory_review_gate orchestrator task is inserted between the TheoryAgent and WriterAgent in default_pipeline.yaml. It is always shown, regardless of --gate mode.
- Displays a numbered lemma chain (L1, L2, …) with ✓ verified / ~ low confidence tags.
- y / Enter → proceed to the writer.
- n → the user specifies the most problematic step (by number or ID) and describes the issue. The TheoryAgent re-runs once with the feedback injected into its task description, then shows the updated sketch once more.
### 2. Pause / Resume
Two new CLI commands and a graceful SIGINT handler:
| Command | Description |
|---|---|
| | Write |
| | Load checkpoint and continue from the saved stage |
| Ctrl+C during | Same as |
Checkpoint file: ~/.eurekaclaw/sessions/<session_id>/checkpoint.json
### 3. ProofArchitect Improvements
- source=None crash fixed: item.get("source") or "" instead of item.get("source", ""), so a JSON null value never passes through as None.
- 3-layer fallback: full plan (4–10 lemmas) → 3-lemma simplified plan → single main_result. Single-goal is now truly a last resort.
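The one-character difference matters because dict.get's default only applies when the key is absent:

```python
item = {"source": None}          # a JSON null deserializes to None

# Old code: the default "" is ignored because the key exists.
assert item.get("source", "") is None

# Fixed code: `or ""` also replaces an explicit None value.
assert (item.get("source") or "") == ""
```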
### 4. Citation Quality Improvements
- Assembler prompt requires explicit [lemma_id] citations in the assembled proof (e.g., By [arm_pull_count_bound], …).
- ConsistencyChecker receives the list of proved lemma IDs and verifies that all of them appear in the proof; returns uncited_lemmas.
- Retry routing: when a ConsistencyChecker failure is citation-related ("uncited" / "missing citation"), the retry loop re-runs the Assembler as well as the Crystallizer, not just the Crystallizer.
### 5. TheoremCrystallizer Fixes

- max_tokens raised from 1500 → 2500 to prevent mid-expression truncation of LaTeX formulas.
- Added an explicit no-truncation constraint for math environments.
### 6. Knowledge Graph Write Timing
Tier 3 KG writes now trigger whenever proven_count > 0 (i.e., after any lemma is proved), not only when the full consistency check passes. Lemma nodes and dependency edges are preserved even when crystallization fails.
### 7. Bug Fixes
| File | Bug | Fix |
|---|---|---|
| | | |
| | | Added |
| | | Added explicit re-raise |
| | | |
## 2026-03-19

### 1. Robust Lemma Decomposer Parsing
_parse_lemmas in agents/theory/decomposer.py now uses a 4-pass extraction strategy instead of 2, preventing the "Empty lemma list from decomposer" fallback in most cases:
| Pass | Strategy |
|---|---|
| 1 | JSON inside |
| 2 | First JSON object |
| 3 | First JSON array |
| 4 | Plain-text numbered/bulleted list heuristic |
_normalize_list accepts flexible field names per item (id/lemma_id/name/title, statement/formal_statement/hypothesis/content, etc.) so variant LLM output schemas are handled without falling back to single-theorem mode.
The same 4-pass strategy was also applied to ProofArchitect._parse_lemmas.
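A minimal sketch of that 4-pass order (illustrative, not the actual _parse_lemmas; in particular the "lemmas" wrapper key is an assumption):

```python
import json
import re

def parse_lemmas(text: str) -> list[dict]:
    # Pass 1: JSON inside a fenced code block.
    m = re.search(r"```(?:json)?\s*(\[.*?\]|\{.*?\})\s*```", text, re.DOTALL)
    candidates = [m.group(1)] if m else []
    # Passes 2/3: first bare JSON object, then first bare JSON array.
    for pat in (r"\{.*\}", r"\[.*\]"):
        m = re.search(pat, text, re.DOTALL)
        if m:
            candidates.append(m.group(0))
    for raw in candidates:
        try:
            data = json.loads(raw)
        except json.JSONDecodeError:
            continue
        items = data if isinstance(data, list) else data.get("lemmas", [])
        if items:
            return items
    # Pass 4: plain-text numbered-list heuristic.
    lines = re.findall(r"^\s*\d+[.)]\s+(.+)$", text, re.MULTILINE)
    return [{"id": f"l{i + 1}", "statement": s} for i, s in enumerate(lines)]
```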
### 2. UI Polling Log Suppression
GET /api/runs/<id> 200 status-poll requests are now logged at DEBUG level instead of INFO, removing repetitive log noise during long runs.
### 3. Bug Fixes
| File | Bug | Fix |
|---|---|---|
| | | Wrapped |
| | | Uses dynamic |
| | | Added |
| | | Demoted to |
### 4. Always-On Stage Summary Cards
orchestrator/gate.py now prints a Rich summary card after every completed pipeline stage regardless of GATE_MODE:
| Stage | Card shows |
|---|---|
| | Papers found, open problems, key mathematical objects |
| | Proof status, per-lemma breakdown with confidence tags |
| | Alignment score, per-lemma numerical check results |
| | Full session summary before final output |
### 5. Human Gate Improvements

- Text feedback input: after approving a gate, users can type a correction or hint that is injected into the next agent's task via get_user_feedback()
- Auto-escalation: if ≥ 1 lemma has verified=False after the theory stage, the gate auto-escalates from auto to human with a full lemma confidence breakdown
- Default changed: GATE_MODE default changed from none to auto
### 6. Proof Readability Enforcement (WriterAgent)
ENFORCE_PROOF_STYLE=true (default) injects _PROOF_STYLE_RULES into writer prompts:
- No skip words ("clearly", "trivially", "by standard arguments") without immediate justification
- Every inequality must name the lemma or theorem it uses
- Each lemma proof opens with 1–2 sentences of informal explanation
- Low-confidence lemmas are tagged \textcolor{orange}{\textbf{[Unverified step]}} in the PDF
- The Limitations section explains all unverified steps
- Added \usepackage{xcolor} to LATEX_PREAMBLE
### 7. Targeted Numerical Testing (ExperimentAgent)
ExperimentAgent now separates proven lemmas into verified and low_confidence groups:
- For each low-confidence lemma: runs a dedicated numerical test and computes its violation_rate
- Lemmas with violation_rate > 1% are flagged as numerically_suspect
- The experiment stage summary card shows per-lemma check results with color coding
- WriterAgent adds stronger warnings for numerically_suspect lemmas
### 8. CLI Default Output Directory
All three CLI commands (prove, explore, from-papers) now default to ./results if --output is not specified.
### 9. Bibliography & Citation Fix

Fixed the \cite{} → ? bug in output PDFs:

- _fix_missing_citations() surgically removes \cite{} keys that have no matching entry in references.bib
- LATEX_END split into LATEX_END_WITH_BIB / LATEX_END_NO_BIB; \bibliography{references} is only added when the .bib file is non-empty
- The WriterAgent prompt instructs the LLM to use exact cite keys (same algorithm as _generate_bibtex)
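The surgical removal can be sketched as follows (illustrative; the real _fix_missing_citations may handle more BibTeX edge cases):

```python
import re

def fix_missing_citations(tex: str, bib: str) -> str:
    """Drop \\cite keys with no matching @entry in references.bib (sketch)."""
    # Collect valid keys: the identifier after each @entrytype{ up to the comma.
    valid = set(re.findall(r"@\w+\{([^,\s]+)\s*,", bib))

    def repair(match: re.Match) -> str:
        keys = [k.strip() for k in match.group(1).split(",")]
        kept = [k for k in keys if k in valid]
        # If no key survives, remove the \cite{...} command entirely.
        return r"\cite{" + ",".join(kept) + "}" if kept else ""

    return re.sub(r"\\cite\{([^}]*)\}", repair, tex)
```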
### 10. LaTeX Extraction Improvements
Extended _extract_latex in WriterAgent with:
- Markdown heading → LaTeX section conversion
- tikzpicture environment removal
- Environment name normalization additions: rem → remark, rema → remark, prop → proposition, defin → definition, corolary → corollary, Cor → corollary, Thm → theorem, Lem → lemma
- \endproof → \end{proof} substitution
- QED box removal (\begin{flushright}$\square$\end{flushright})
- Two-pass _close_open_environments: (1) remove orphaned \end{X}, (2) append missing \end{X}
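The two-pass repair can be sketched with a simple environment stack (illustrative; the real method's handling of mismatched nesting may differ):

```python
import re

def close_open_environments(tex: str) -> str:
    """Pass 1: drop orphaned \\end{X}; pass 2: append missing \\end{X}."""
    out_lines = []
    stack: list[str] = []
    for line in tex.splitlines():
        for cmd, env in re.findall(r"\\(begin|end)\{(\w+)\}", line):
            if cmd == "begin":
                stack.append(env)
            elif stack and stack[-1] == env:
                stack.pop()  # properly matched \end
            else:
                # Pass 1: \end with no matching open \begin -> strip the command.
                line = line.replace(rf"\end{{{env}}}", "")
        out_lines.append(line)
    # Pass 2: close whatever is still open, innermost environment first.
    out_lines.extend(rf"\end{{{env}}}" for env in reversed(stack))
    return "\n".join(out_lines)
```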
## 2026-03-18

### 1. Context Compression
Optimization summary:
| Stage | Before | After | Saving |
|---|---|---|---|
| Formalizer model | | | ~90% cost per call |
| Formalizer re-run | Every iteration | Only on change | Saves N-1 calls |
| Proven lemma context | 200-char proof text | Statement only (120) | ~40% input tokens |
| Verifier (high-conf) | Always LLM call | Auto-accept ≥ 0.85 | ~30% fewer calls |
| Verifier proof text | Raw (up to 3000 chars) | Head+tail (1000 chars) | ~67% input tokens |
| Agent loop history | Accumulates forever | Compressed every 6 turns | ~60% input tokens |
| Stagnation | Wastes 3+ iterations | Forced refinement | Saves full iterations |
Run performance: typical run now ~20 LLM calls (down from ~35); worst case ~55 (down from ~100).
Agent max_turns reductions:
| Agent | Before | After |
|---|---|---|
| SurveyAgent | 15 | 8 |
| IdeationAgent | 5 | 3 |
| ExperimentAgent | 10 | 5 |
| WriterAgent | 5 | 3 |
New config knobs: CONTEXT_COMPRESS_AFTER_TURNS, AUTO_VERIFY_CONFIDENCE, STAGNATION_WINDOW
### 2. Configurable Token Limits Per Call Type
7 new .env variables:
| Variable | Default | Applies to |
|---|---|---|
| | | All agent reasoning loops |
| | | Proof generation |
| | | Research direction planning |
| | | Lemma decomposition |
| | | Formalization, refiner, counterexample |
| | | Proof verification |
| | | Context compression |
UI sliders added in the Settings tab for all 7 values.
### 3. Multi-Backend LLM Support
Three named backends in config.py and llm/factory.py:
| Backend | | Notes |
|---|---|---|
| Anthropic native | | Default |
| OpenRouter | | Set |
| Local (vLLM/Ollama) | | Defaults to |
ccproxy / OAuth fallback: if ANTHROPIC_API_KEY is empty, automatically reads ~/.claude/.credentials.json and routes through ccproxy (allows Claude Pro/Max users to run without a separate API key).
### 4. Additional Tuning Knobs
| Variable | Default | Effect |
|---|---|---|
| | | Tool-use turns in SurveyAgent |
| | | Turns per theory stage |
| | | Turns for paper generation |
| | | Hard cap on arXiv results |
| | | Retry attempts on 5xx / rate-limit errors |
| | | Exponential backoff bounds |
### 5. Domain Plugin Architecture
New three-tier plugin system:
```
EurekaClaw (general pipeline)
└── DomainPlugin (e.g. MAB)
    └── Workflow (per-domain prompt guidance)
```
New files:
- domains/base.py — DomainPlugin ABC with 4 methods
- domains/__init__.py — @register_domain decorator + resolve_domain()
- domains/mab/ — MAB domain plugin (GaussianBandit, BernoulliBandit, 4 tools, 4 skills, 3-level benchmark)
Core changes:
- SkillRegistry.add_skills_dir() — load skills from domain directories
- build_default_registry() — now domain-agnostic; domain tools registered via plugin
- MetaOrchestrator — accepts a domain_plugin parameter
- EurekaSession.run() — auto-detects the domain plugin from InputSpec.domain
### 6. LaTeX Compilation Robustness

- Added 7 \newtheorem declarations to LATEX_PREAMBLE: assumption, maintheorem, conjecture, claim, example, fact, observation
- Environment name normalization in _extract_latex (step 6): mis-cased names corrected
- _close_open_environments() — new stack-based static method; prevents \begin{tabular} truncation from causing fatal errors
- Removed the _rescue_compile and paper_rescue.tex fallback entirely
- Full bibtex-aware compile sequence: pdflatex → bibtex → pdflatex → pdflatex
### 7. Bibliography & Reference Resolution

- Write order fixed: references.bib is saved before _compile_pdf() is called
- bibtex added to the compile sequence: previously only pdflatex ran twice
- Cite key consistency: _compute_cite_keys() in WriterAgent uses the identical algorithm to _generate_bibtex in main.py
- Duplicate key handling: _generate_bibtex deduplicates with a, b, … suffixes
- New ResearchOutput.bibliography_json field added
### 8. Experiment Mode Control

New EXPERIMENT_MODE env var:

- auto — skip the experiment if the formal statement has no quantitative signals
- true — always run the experiment stage
- false — always skip the experiment stage
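The mode resolution reduces to a small predicate (a sketch; how the pipeline detects a quantitative signal is separate from this logic):

```python
def should_run_experiment(mode: str, has_quantitative_signal: bool) -> bool:
    """Resolve EXPERIMENT_MODE as described above (sketch)."""
    if mode == "true":
        return True
    if mode == "false":
        return False
    # "auto": run only when the formal statement carries a quantitative signal.
    return has_quantitative_signal
```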