hypothesizer updates, refactoring yaml stuff #168
base: main
Conversation
I am ready for this to be reviewed. Have fun testing it w/ the new SCI FI Bill of Rights hypothesizer example :). @mikegros, @awadell1, @luiarthur
mikegros
left a comment
A couple non-actionable thoughts and a couple suggestions. I think we should merge this soon though. I still need to run the tests; I haven't confirmed that the hypothesizer still passes them, though I suspect it does. But I wanted to send this back to you for a couple small changes (or if you don't think changes are needed, just toss a comment back).
Should the "hello world" example have fewer comments so that it is very compact to read? We have a lot of those other comments in other yaml files, so I could imagine chopping them out of this yaml so it is easy to read (like typical hello world scripts are).
We may even want to cut some of the options for cleanliness as well, treating this more like a typical hello world than a comprehensively documented case.
Probably not for this PR, but should we think about how to organize this with the Checkpointer class? We may want one cohesive checkpointing structure that handles our typical cases as well as things like this step-by-step checkpointing.
```python
return any(s in n for s in _SECRET_KEY_SUBSTRS)


def _mask_secret(value: str, keep_start: int = 6, keep_end: int = 4) -> str:
```
This is fail-unsafe, correct? (For instance, if an API key's name has no substring from that list, the full value gets displayed in logs.) Should displaying the sanitized key be an "opt-in" option?
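For example, a minimal sketch of the opt-in behavior, assuming the `_mask_secret` helper from this diff (the `sanitize_for_log` name and `show_masked` flag are hypothetical):

```python
def sanitize_for_log(value: str, show_masked: bool = False) -> str:
    # Fail-safe default: never echo a value, even if its key name
    # misses every entry in _SECRET_KEY_SUBSTRS
    if not show_masked:
        return "<redacted>"
    # Opt-in path: show only a masked preview of the value
    return _mask_secret(value)
```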
There is an "Issue" indicating "plan_execute_from_yaml" should probably be a workflow and/or integrated into the URSA CLI to make it better integrated into the overall, rather than being hidden off in the examples.
I could see this or something like it falling into that category as well. Two nearly ready PRs have an "experimental/agents" folder in src for agents that are being developed but not ready for primetime. I could see something like this being place in and imported from a folder like "experimental/workflow" or something like it.
Not actionable for this PR, but just a thought for the future.
awadell1
left a comment
A few high level comments:
- Merging the config loading here with what's in src/ursa/cli/config.py would be great. There's a decent amount of overlap, plus src/ursa/cli/config.py has the benefit of stronger type-checking/runtime validation (I'm biased).
- Some clarity on how "workspaces" and "runs" relate would be fantastic. I have a longer comment on this on the `setup_workspace` function.
- HypothesizerAgent should be updated to use `AgentWithTools` and the graph reworked.
- Tests! This thing adds a lot of functionality to src/ursa; adding tests to validate it and make sure future changes don't break things would be huge. Codex is great at this; one prompt would be:

```
Review the changes introduced by this PR against main. Expand the test suite in `tests/` to cover all changes introduced here.
Make sure the test suite passes; if it doesn't, iterate until it does. Do not modify any existing tests. If your tests fail, review the test and code and decide which should be corrected.
Remember to prefix your commands with `uv run`. I.e. `uv run python` or `uv run pytest`.
```

```python
print(f"{RED}Symlinked {src} (source) --> {dst} (dest)")
new_state["symlinkdir"]["is_linked"] = True
linked = ensure_symlink(
    workspace=new_state["workspace"], symlink_cfg=sd
)
```
Use `self.workspace` to get the path to the workspace. Putting the workspace in the state is on the way out.
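i.e., something like this, reusing the names from the quoted diff:

```python
linked = ensure_symlink(
    workspace=self.workspace,  # agent attribute instead of graph state
    symlink_cfg=sd,
)
```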
```python
self.strllm = self.llm | StrOutputParser()
self.max_iterations = max_iterations


def _content_to_text(self, content) -> str:
```
Remove this; use `msg.text` instead.
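For example (a sketch; in recent langchain_core releases `text` is a method on `BaseMessage`):

```python
# replaces self._content_to_text(msg.content); handles both plain-string
# and content-block message payloads
text = msg.text()
```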
```python
from langchain_core.tools import BaseTool
from langchain_core.tools import tool as lc_tool
```
Just use `from langchain_core.tools import BaseTool, tool`.
There doesn't seem to be a reason for renaming `tool` to `lc_tool`.
```python
default_tools = [read_file, list_input_docs]
if extra_tools:
    default_tools.extend(extra_tools)

# ---- coerce any plain callables into BaseTool via lc_tool ----
coerced_tools: list[BaseTool] = []
for t in default_tools:
    if isinstance(t, BaseTool):
        coerced_tools.append(t)
    elif callable(t):
        # wrap plain function into a LangChain-style tool
        coerced_tools.append(lc_tool(t))
    else:
        raise TypeError(f"Unsupported tool type: {type(t)} for {t}")

# pass tools to BaseAgent like ExecutionAgent does
self.tools = coerced_tools

# Bind LLM to tools so it can emit tool calls
# (this returns a tool-capable llm wrapper)
try:
    self.llm = self.llm.bind_tools(self.tools)
except Exception:
    # fallback: some LLMs/versions want an iterable of tool objects
    self.llm = self.llm.bind_tools(self.tools)

self.tool_node = ToolNode(self.tools)

# bind tools to the LLM for function/tool calling
self.llm_with_tools = llm.bind_tools(self.tools)

# debug print tools enabled
print("[HypothesizerAgent] Tools enabled:")
for t in self.tools:
    try:
        print(f" - {t.name}")
    except Exception:
        print(f" - (unnamed tool) {t}")

# keep existing setup
self.hypothesizer_prompt = hypothesizer_prompt
self.critic_prompt = critic_prompt
self.competitor_prompt = competitor_prompt
self.search_tool = DDGS()
# Only create DDGS if the import worked - this helps w/ offline mode too
self.search_tool = DDGS() if DDGS else None
```
Don't do this. Just add the `AgentWithTools` mixin. Right now you're missing out on the binding when a new tool gets added (which requires a graph rebuild). The mixin handles that.
See ursa/src/ursa/agents/execution_agent.py, lines 214 to 226 in 5e697fc:

```python
default_tools = [
    run_command,
    write_code,
    edit_code,
    read_file,
    run_web_search,
    run_osti_search,
    run_arxiv_search,
]
if extra_tools:
    default_tools.extend(extra_tools)
super().__init__(llm=llm, tools=default_tools, **kwargs)
```
```python
self.search_tool = DDGS()
# Only create DDGS if the import worked - this helps w/ offline mode too
self.search_tool = DDGS() if DDGS else None
self.strllm = self.llm | StrOutputParser()
```
Ditto here: `self.strllm` should be defined in `_build_graph` so updates get propagated correctly.
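A sketch of where that could live, assuming the agent already overrides `_build_graph`:

```python
def _build_graph(self):
    # derive strllm here so a re-bound self.llm (e.g. after bind_tools)
    # is picked up whenever the graph is rebuilt
    self.strllm = self.llm | StrOutputParser()
    ...
```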
```python
from rich.panel import Panel


def setup_workspace(
```
Should this be part of BaseAgent?
Also what is the workspace for and how does it map to runs? Right now the workspace seems like a folder where both logging/metric stuff (Checkpoint) and agent actions (write_code) get thrown. I think that's not great as it (technically) means Ursa could modify its checkpoints.
I don't think that's a huge risk, but:
- it adds confusion (is `checkpoints/` where URSA should save ckpts for models it's trained?)
- it makes it harder to separate what URSA did from its outputs
I think it would be good to have something like:

```
BaseAgent.workspace/
- sandbox/      # Place where execution agent runs
- checkpoints/  # Graph execution checkpoint
- metrics/      # Currently ursa_metrics
- ...           # Other folders currently created by various agents
```
This might be out of scope for this PR, but it seems like this function should handle this. In which case idk if the rich text is warranted.
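A minimal sketch of a `setup_workspace` along those lines (subfolder names from the layout above; the signature is hypothetical):

```python
from pathlib import Path

def setup_workspace(base: str) -> Path:
    ws = Path(base)
    # keep agent outputs, graph checkpoints, and metrics separated
    for sub in ("sandbox", "checkpoints", "metrics"):
        (ws / sub).mkdir(parents=True, exist_ok=True)
    return ws
```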
This one is definitely out of scope for this PR, but also I think something we should do. The fact that we dump everything in the same workspace is messy.
I actually think we might want to sit down and have a good discussion on packaging up things like checkpointing into a cohesive UI. Right now this involves setting the previous thread_id and pointing at the right workspace/checkpoint, but that can be tricky with checkpoints floating around in different workspaces, things being appended to thread_ids, etc. We should make it dead simple for a user to pick up a case where they left off.
But for this PR I think letting the plan_execute_from_yaml stuff have its own workspace handling is okay; bringing it all under one nice interface can be a later update.
```python
from typing import Callable, Optional


def ckpt_dir(workspace: str) -> Path:
```
HITL is putting this at `self.workspace/<agent_name>.db`. I like that more, but perhaps the HITL should update to match. See my comment on workspace.py.
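i.e., roughly the HITL convention (hypothetical signature):

```python
from pathlib import Path

def ckpt_dir(workspace: str, agent_name: str) -> Path:
    # one checkpoint db per agent, directly under the workspace
    return Path(workspace) / f"{agent_name}.db"
```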
```python
return chooser(workspace, timeout=timeout, input_fn=input_fn)


def restore_executor_from_snapshot(workspace: str, snapshot: Path) -> None:
```
I think this is taking a sqlite file (possibly with a WAL file), opening it, writing it to a new path (to flush the WAL), then deleting the WAL.
That seems like a bad idea. How do we detect that the WAL is owned by the current process and not a different one?
```python
ws = Path(workspace)
ckdir = ckpt_dir(workspace)
seen = {}
for base in (ckdir, ws):
```
Just do this with a glob: `list(self.workspace.glob("executor_*.db"))`.
idk if this should live in ursa.
This seems like a reimplementation of time travelling? I'm not sure what the purpose of the intermediate checkpoint files is.
This is a form of time traveling, and it does a couple of things that could be implemented with time travel but that work well for this workflow right now (see the sketch below):
- It sets distinct checkpoints after individual stages of the plan, so it's easy to restart the plan after a specific step without having to hunt for the particular checkpoint ID where the handoff between steps happens.
- It ensures that if a particular checkpoint gets borked, the others are stored separately so the run can still be picked up. This is almost certainly a "URSA" issue and could be handled through time traveling, but at the moment it's been useful for stability.
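A rough sketch of that per-stage checkpointing (all names hypothetical; `snapshot_db` is the backup-API helper sketched above):

```python
from pathlib import Path

def snapshot_after_stage(workspace: Path, stage_idx: int) -> Path:
    # copy the live executor checkpoint to a per-stage file, so a
    # corrupted db for one stage cannot take the earlier stages with it
    src = workspace / "executor.db"
    dst = workspace / f"executor_stage{stage_idx:02d}.db"
    snapshot_db(src, dst)
    return dst
```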
OK, like most complex things, this touched a lot of code, sorry. Basically the plan was:
This all works and is awesome. It's a good example of this technique; I can then privately swap in other LLMs and private docs we don't want to share, and the same workflow works.