hypothesizer updates, refactoring yaml stuff #168
base: main
Conversation
I am ready for this to be reviewed. Have fun testing it w/ the new SCI FI Bill of Rights hypothesizer example :). @mikegros, @awadell1, @luiarthur
mikegros
left a comment
A couple non-actionable thoughts and a couple suggestions. I think we should merge this soon though. I still need to run the tests; I haven't confirmed that the hypothesizer still passes them, though I suspect it does. But I wanted to send this back to you for a couple small changes (or if you don't think changes are needed, just toss a comment back).
Should the "hello world" example have fewer comments so that it is very compact to read? We have a lot of those other comments in other yaml files, so I could imagine chopping them out of this yaml so it is easy to read (like typical hello world scripts are).
We may even want to cut some of the options for cleanliness as well, treating this more like a typical hello world than a comprehensively documented case.
Probably not for this PR, but should we think about how to organize this with the Checkpointer class? We may want one cohesive checkpointing structure that handles our typical cases as well as things like this step-by-step checkpointing.
```python
return any(s in n for s in _SECRET_KEY_SUBSTRS)


def _mask_secret(value: str, keep_start: int = 6, keep_end: int = 4) -> str:
```
This is fail-unsafe, correct? (For instance, if an API key's name has no substring from that list, the full value gets displayed in logs.) Should displaying the sanitized key be an "opt-in" option?
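For example, a minimal sketch of the opt-in behavior, assuming the `_mask_secret` helper from this diff (the `sanitize_for_log` name and `show_masked` flag are hypothetical):

```python
def sanitize_for_log(value: str, show_masked: bool = False) -> str:
    # Fail-safe default: never echo a value, even if its key name
    # misses every entry in _SECRET_KEY_SUBSTRS
    if not show_masked:
        return "<redacted>"
    # Opt-in path: show only a masked preview of the value
    return _mask_secret(value)
```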
There is an "Issue" indicating "plan_execute_from_yaml" should probably be a workflow and/or integrated into the URSA CLI to make it better integrated into the overall, rather than being hidden off in the examples.
I could see this or something like it falling into that category as well. Two nearly ready PRs have an "experimental/agents" folder in src for agents that are being developed but not ready for primetime. I could see something like this being place in and imported from a folder like "experimental/workflow" or something like it.
Not actionable for this PR, but just a thought for the future.
awadell1
left a comment
A few high level comments:
- Merging the config loading here with what's in src/ursa/cli/config.py would be great. There's a decent amount of overlap, plus src/ursa/cli/config.py has the benefit of stronger type-checking/runtime validation (I'm biased).
- Some clarity on how "workspaces" and "runs" relate would be fantastic. I have a longer comment on this on the `setup_workspace` function.
- HypothesizerAgent should be updated to use `AgentWithTools` and the graph reworked.
- Tests! This thing adds a lot of functionality to src/ursa; adding tests to validate it and make sure future changes don't break things would be huge. Codex is great at this; one prompt would be:

```
Review the changes introduced by this PR against main. Expand the test suite in `tests/` to cover all changes introduced here.
Make sure the test suite passes; if it doesn't, iterate until it does. Do not modify any existing tests. If your tests fail, review the test and code and decide which should be corrected.
Remember to prefix your commands with `uv run`. I.e. `uv run python` or `uv run pytest`.
```

```python
print(f"{RED}Symlinked {src} (source) --> {dst} (dest)")
new_state["symlinkdir"]["is_linked"] = True
linked = ensure_symlink(
    workspace=new_state["workspace"], symlink_cfg=sd
)
```
Use `self.workspace` to get the path to the workspace. Putting the workspace in the state is on the way out.
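i.e., something like this, reusing the names from the quoted diff:

```python
linked = ensure_symlink(
    workspace=self.workspace,  # agent attribute instead of graph state
    symlink_cfg=sd,
)
```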
```python
self.strllm = self.llm | StrOutputParser()
self.max_iterations = max_iterations


def _content_to_text(self, content) -> str:
```
Remove this; use `msg.text` instead.
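For example (a sketch; in recent langchain_core releases `text` is a method on `BaseMessage`):

```python
# replaces self._content_to_text(msg.content); handles both plain-string
# and content-block message payloads
text = msg.text()
```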
```python
from langchain_core.tools import BaseTool
from langchain_core.tools import tool as lc_tool
```
Just use `from langchain_core.tools import BaseTool, tool`.
There doesn't seem to be a reason for renaming `tool` to `lc_tool`.
```python
default_tools = [read_file, list_input_docs]
if extra_tools:
    default_tools.extend(extra_tools)

# ---- coerce any plain callables into BaseTool via lc_tool ----
coerced_tools: list[BaseTool] = []
for t in default_tools:
    if isinstance(t, BaseTool):
        coerced_tools.append(t)
    elif callable(t):
        # wrap plain function into a LangChain-style tool
        coerced_tools.append(lc_tool(t))
    else:
        raise TypeError(f"Unsupported tool type: {type(t)} for {t}")

# pass tools to BaseAgent like ExecutionAgent does
self.tools = coerced_tools

# Bind LLM to tools so it can emit tool calls
# (this returns a tool-capable llm wrapper)
try:
    self.llm = self.llm.bind_tools(self.tools)
except Exception:
    # fallback: some LLMs/versions want an iterable of tool objects
    self.llm = self.llm.bind_tools(self.tools)

self.tool_node = ToolNode(self.tools)

# bind tools to the LLM for function/tool calling
self.llm_with_tools = llm.bind_tools(self.tools)

# debug print tools enabled
print("[HypothesizerAgent] Tools enabled:")
for t in self.tools:
    try:
        print(f" - {t.name}")
    except Exception:
        print(f" - (unnamed tool) {t}")

# keep existing setup
self.hypothesizer_prompt = hypothesizer_prompt
self.critic_prompt = critic_prompt
self.competitor_prompt = competitor_prompt
self.search_tool = DDGS()
# Only create DDGS if the import worked - this helps w/ offline mode too
self.search_tool = DDGS() if DDGS else None
```
Don't do this. Just add the `AgentWithTools` mixin. Right now you're missing out on the binding when a new tool gets added (which requires a graph rebuild). The mixin handles that.
See ursa/src/ursa/agents/execution_agent.py, lines 214 to 226 in 5e697fc:

```python
default_tools = [
    run_command,
    write_code,
    edit_code,
    read_file,
    run_web_search,
    run_osti_search,
    run_arxiv_search,
]
if extra_tools:
    default_tools.extend(extra_tools)
super().__init__(llm=llm, tools=default_tools, **kwargs)
```
```python
self.search_tool = DDGS()
# Only create DDGS if the import worked - this helps w/ offline mode too
self.search_tool = DDGS() if DDGS else None
self.strllm = self.llm | StrOutputParser()
```
Ditto here: `self.strllm` should be defined in `_build_graph` so updates get propagated correctly.
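A sketch of where that could live, assuming the agent already overrides `_build_graph`:

```python
def _build_graph(self):
    # derive strllm here so a re-bound self.llm (e.g. after bind_tools)
    # is picked up whenever the graph is rebuilt
    self.strllm = self.llm | StrOutputParser()
    ...
```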
```python
from rich.panel import Panel


def setup_workspace(
```
Should this be part of BaseAgent?
Also what is the workspace for and how does it map to runs? Right now the workspace seems like a folder where both logging/metric stuff (Checkpoint) and agent actions (write_code) get thrown. I think that's not great as it (technically) means Ursa could modify its checkpoints.
I don't think that's a huge risk, but:
- it adds confusion (is `checkpoints/` where URSA should save ckpts for models it's trained?)
- it makes it harder to separate what URSA did from its outputs
I think it would be good to have something like:

```
BaseAgent.workspace/
- sandbox/      # Place where execution agent runs
- checkpoints/  # Graph execution checkpoint
- metrics/      # Currently ursa_metrics
- ...           # Other folders currently created by various agents
```
This might be out of scope for this PR, but it seems like this function should handle this. In which case idk if the rich text is warranted.
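A minimal sketch of a `setup_workspace` along those lines (subfolder names from the layout above; the signature is hypothetical):

```python
from pathlib import Path

def setup_workspace(base: str) -> Path:
    ws = Path(base)
    # keep agent outputs, graph checkpoints, and metrics separated
    for sub in ("sandbox", "checkpoints", "metrics"):
        (ws / sub).mkdir(parents=True, exist_ok=True)
    return ws
```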
This one is definitely out of scope for this PR, but also I think something we should do. The fact that we dump everything in the same workspace is messy.
I actually think we might want to sit down and have a good discussion on packaging up things like checkpointing into a cohesive UI. Right now this involves setting the previous thread_id and pointing at the right workspace/checkpoint, but that can be tricky with checkpoints floating around in different workspaces, things being appended to thread_ids, etc. We should make it dead simple for a user to pick up a case where they left off.
But for this PR I think letting the plan_execute_from_yaml stuff have its own workspace handling is okay; bringing it all under one nice interface can be a later update.
```python
from typing import Callable, Optional


def ckpt_dir(workspace: str) -> Path:
```
HITL is putting this at `self.workspace/<agent_name>.db`. I like that more, but perhaps the HITL should update to match. See my comment on workspace.py.
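i.e., roughly the HITL convention (hypothetical signature):

```python
from pathlib import Path

def ckpt_dir(workspace: str, agent_name: str) -> Path:
    # one checkpoint db per agent, directly under the workspace
    return Path(workspace) / f"{agent_name}.db"
```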
```python
return chooser(workspace, timeout=timeout, input_fn=input_fn)


def restore_executor_from_snapshot(workspace: str, snapshot: Path) -> None:
```
I think this is taking a sqlite file (possibly with a WAL file), opening it, writing it to a new path (to flush the WAL), then deleting the WAL.
That seems like a bad idea. How do we detect that the WAL is owned by the current process and not a different one?
```python
ws = Path(workspace)
ckdir = ckpt_dir(workspace)
seen = {}
for base in (ckdir, ws):
```
Just do this with a glob: `list(self.workspace.glob("executor_*.db"))`.
idk if this should live in ursa.
This seems like a reimplementation of time travelling? I'm not sure what the purpose of the intermediate checkpoint files is.
This is a form of time traveling, and it does a couple of things that could be implemented with time travel but that work well for this workflow right now (see the sketch below):
- It sets distinct checkpoints after individual stages of the plan, so it's easy to restart the plan after a specific step without having to hunt for the particular checkpoint ID where the handoff between steps happens.
- It ensures that if a particular checkpoint gets borked, the others are stored separately so the run can still be picked up. This is almost certainly a "URSA" issue and could be handled through time traveling, but at the moment it's been useful for stability.
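A rough sketch of that per-stage checkpointing (all names hypothetical; `snapshot_db` is the backup-API helper sketched above):

```python
from pathlib import Path

def snapshot_after_stage(workspace: Path, stage_idx: int) -> Path:
    # copy the live executor checkpoint to a per-stage file, so a
    # corrupted db for one stage cannot take the earlier stages with it
    src = workspace / "executor.db"
    dst = workspace / f"executor_stage{stage_idx:02d}.db"
    snapshot_db(src, dst)
    return dst
```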
OK, like most complex things, this touched a lot of code, sorry. Basically the plan was:
This all works and is awesome. It's a good example of this technique; I can then privately swap in other LLMs and private docs we don't want to share, and the same workflow works.