Debugger integration #414

IsaacTrost · 2026-01-15T21:38:16Z

This pull request integrates a debugger into the seed-gen component of buttercup. It builds on seed-gen's existing vuln-discovery component, which uses an LLM to analyze the codebase and generate PoV inputs. When those PoV inputs fail, seed-gen spawns a sub-task that can use debug scripts and an interactive debug session to determine why the inputs failed to trigger the hypothesized vulnerability. The findings are then fed back to the calling task to inform the next round of PoV inputs.

* Improve coverage tracking precision

  Previously, coverage tracking included all regions. This change makes it more precise. Macros that result in code are represented as a single line in the function, and any executed line in the macro makes that line 'covered'. Beyond that, only CodeRegions count. For example, this prevents code that is guarded by #if and absent from the binary from being considered reachable. The idea is that seed-gen has more accurate information when selecting functions to target. In my testing, the new implementation generally reaches more lines and covers more functions.

* Fix macro coverage leaking across files due to missing filename in key

  The expansion_coverage map used (line, col) as its key without the filename, causing coverage from one file to incorrectly appear in another when macros were at the same coordinates.

* Recursively expand macros and add named types for coverage data

  - Process ExpansionRegion target_regions recursively instead of counting only the call site line
  - Add coordinate index for O(1) expansion lookups
  - Cache computed expansion lines to avoid recomputation across functions
  - Use bulk set operations for better performance
  - Prevent infinite loops with visited set for circular macro references
  - Add named types for coverage data structures:
    - RegionCoords, ExpansionKey, CachedExpansionLines
    - Type aliases: ExpansionMap, CoordToFilenames, ExpansionLinesCache
  - Add comprehensive tests for nested macros and edge cases
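The filename fix and the recursive expansion in the commit message above can be sketched roughly as follows. This is an illustration only, not the PR's code: the name `ExpansionKey` comes from the commit message, but its field layout, the `Expansion` record, and the `expand_lines` helper are assumptions made for the example.

```python
from __future__ import annotations

from typing import NamedTuple


class ExpansionKey(NamedTuple):
    """Macro call site. Keying on the filename as well as (line, col) keeps two
    files with a macro at the same coordinates from sharing coverage."""
    filename: str
    line: int
    col: int


class Expansion(NamedTuple):
    """Hypothetical record: executed lines in the macro body plus nested macro call sites."""
    executed_lines: frozenset[int]
    nested: tuple[ExpansionKey, ...]


def expand_lines(
    key: ExpansionKey,
    expansions: dict[ExpansionKey, Expansion],
    cache: dict[ExpansionKey, frozenset[int]],
    visited: set[ExpansionKey] | None = None,
) -> frozenset[int]:
    """Collect executed lines for a macro, following nested expansions recursively."""
    if key in cache:
        return cache[key]
    top_level = visited is None
    if visited is None:
        visited = set()
    if key in visited or key not in expansions:
        return frozenset()  # circular reference or unknown call site
    visited.add(key)
    expansion = expansions[key]
    lines = set(expansion.executed_lines)
    for child in expansion.nested:
        lines |= expand_lines(child, expansions, cache, visited)  # bulk set union
    result = frozenset(lines)
    if top_level:
        # Only cache complete results; inner results may be truncated by a cycle.
        cache[key] = result
    return result
```

Caching only top-level results is a design choice of this sketch: a result computed while a cycle is being broken may be incomplete, so it is not stored.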

…ofbits#408) Co-authored-by: kevin-valerio <kevin-valerio@users.noreply.github.com>

reytchison · 2026-01-20T13:47:57Z

orchestrator/src/buttercup/orchestrator/ui/competition_api/services/challenge_service.py

+            # Pull Git LFS files if the repository uses LFS
+            logger.info("Pulling Git LFS files (if any)")
+            result = subprocess.run(
+                ["git", "lfs", "pull"],
+                cwd=sub_path,
+                capture_output=True,
+                text=True,
+                check=False,  # Don't fail if LFS is not used or not installed
+            )
+            if result.returncode == 0:
+                logger.info(f"Git LFS pull output: {result.stdout}")
+            else:
+                logger.debug(f"Git LFS pull failed or not needed: {result.stderr}")
+


No longer need to add lfs pull logic in PR since this was merged: #413

reytchison · 2026-01-20T15:20:41Z

seed-gen/src/buttercup/seed_gen/seed_gen_bot.py

+        # Read probability overrides from environment variables
+        self.TASK_SEED_INIT_PROB_FULL = float(os.getenv("BUTTERCUP_SEED_INIT_PROB_FULL", self.TASK_SEED_INIT_PROB_FULL))
+        self.TASK_VULN_DISCOVERY_PROB_FULL = float(
+            os.getenv("BUTTERCUP_VULN_DISCOVERY_PROB_FULL", self.TASK_VULN_DISCOVERY_PROB_FULL)


Could you make these new options configurable by the pydantic settings in config.py, please? When you add them to the pydantic settings you'll get built-in support for env variables, too.

hbrodin

Well done! I've done an initial pass on it and left a bunch of comments. Impressive work overall.

hbrodin · 2026-01-19T09:34:31Z

deployment/k8s/charts/seed-gen/values.yaml

+    cpu: 2000m
+    memory: 8Gi
  requests:
-    cpu: 200m
-    memory: 256Mi
+    cpu: 500m
+    memory: 1Gi


Are these changes necessary to run with the debugger integrated? Bumping the requests will require more resources for the system to run and might prevent some systems from running Buttercup.

~~This might be an artifact of debugging the OOM, before the coverage bot was identified as the source of the OOM?~~ Edit: responded under wrong comment

hbrodin · 2026-01-19T09:39:10Z

deployment/k8s/values.yaml

+  # Resource limits for LiteLLM container to prevent OOMKilled errors
+  resources:
+    limits:
+      cpu: 2000m
+      memory: 4Gi
+    requests:
+      cpu: 500m
+      memory: 1Gi
+


Are the OOMs due to LiteLLM? Or is LiteLLM killed due to high memory pressure situation?

This might be an artifact of debugging the OOM, before the coverage bot was identified as the source of the OOM?

hbrodin · 2026-01-19T09:39:42Z

orchestrator/Dockerfile


 RUN DEBIAN_FRONTEND=noninteractive apt-get update && \
-    apt-get install -y git curl && \
+    apt-get install -y git git-lfs curl && \


Should rebase/merge once #413 is merged into main.

hbrodin · 2026-01-19T09:42:23Z

orchestrator/scripts/challenge.py

I don't think these changes are needed as you can just specify this on the command line. Or, am I missing something?

hbrodin · 2026-01-19T10:17:07Z

common/src/buttercup/common/docker_interactive.py

+            remaining_time = end_time - time.time()
+            if remaining_time <= 0:
+                logger.warning("Command timeout after %.1fs: %s", timeout, command)
+                lines.append(f"\n***timout waiting for end of output after {timeout} seconds***")


Suggested change

-                lines.append(f"\n***timout waiting for end of output after {timeout} seconds***")
+                lines.append(f"\n***timeout waiting for end of output after {timeout} seconds***")

hbrodin · 2026-01-20T21:43:17Z

common/src/buttercup/common/reproduce_multiple.py

+        # Log all available builds before testing
+        logger.info(f"Testing PoV '{pov.name}' against {len(self.build_outputs)} builds for harness '{harness_name}'")
+        for i, build in enumerate(self.build_outputs):
+            logger.info(
+                f"""  Build {i}: sanitizer={build.sanitizer}, engine={build.engine},
+                type={BuildType.Name(build.build_type)}, task_id={build.task_id}"""
+            )


Why? You are logging them again in the loop below I think.

hbrodin · 2026-01-20T21:46:55Z

common/src/buttercup/common/build_selection.py

+    return debug_binary_path is not None and debug_binary_path.exists()
+
+
+def resolve_actual_binary(


Consider breaking into smaller functions

hbrodin · 2026-01-20T21:52:09Z

seed-gen/src/buttercup/seed_gen/debug_subagent_unified.py

+        return "\n\n".join(str(attempt) for attempt in self.debug_attempts)
+
+
+class DebugSubagentUnified:


This is a very large class with some functions being very long. Is there maybe a way to break it down into smaller classes? What is the gain of having it like this over splitting it? Ideally you'd be able to compose the hybrid from the batch + interactive ones if splitting like that.
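One possible reading of "compose the hybrid from the batch + interactive ones" is sketched below. Every name here (`DebugResult`, `DebugStrategy`, `BatchDebugger`, `InteractiveDebugger`, `HybridDebugger`) is hypothetical and the strategy bodies are stubs; the only behavior taken from the PR is that hybrid mode tries batch first and falls back to interactive.

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Protocol


@dataclass
class DebugResult:
    success: bool
    summary: str


class DebugStrategy(Protocol):
    def run(self, pov_input: bytes) -> DebugResult: ...


class BatchDebugger:
    """Stub: would run a generated debug script to completion."""

    def run(self, pov_input: bytes) -> DebugResult:
        return DebugResult(success=False, summary="batch: no conclusion")


class InteractiveDebugger:
    """Stub: would drive an interactive debug session."""

    def run(self, pov_input: bytes) -> DebugResult:
        return DebugResult(success=True, summary="interactive: guard branch not taken")


class HybridDebugger:
    """Tries each strategy in order and stops at the first success."""

    def __init__(self, strategies: list[DebugStrategy]) -> None:
        self.strategies = strategies

    def run(self, pov_input: bytes) -> DebugResult:
        result = DebugResult(success=False, summary="no strategies configured")
        for strategy in self.strategies:
            result = strategy.run(pov_input)
            if result.success:
                break
        return result
```

With this shape, `HybridDebugger([BatchDebugger(), InteractiveDebugger()])` holds no debugging logic of its own, and each strategy can be tested separately.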

hbrodin · 2026-01-20T21:56:59Z

seed-gen/src/buttercup/seed_gen/seed_gen_bot.py

+            return
+
        build_dir = Path(builds[BuildType.FUZZER][0].task_dir)
+        logger.info(f"Build directory: {build_dir}")


hbrodin · 2026-01-20T22:02:39Z

seed-gen/src/buttercup/seed_gen/task.py

            function_name = call.arguments["function_name"]
-            result = Task._get_function_definition(function_name, state, tool_call_id)
-            results.append(result)
+            if isinstance(function_name, str):


if it is not a string, what happens then?
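To make the question concrete: if the `isinstance` check has no `else` branch, a non-string argument would presumably be dropped without a result. A minimal sketch of handling that case explicitly is below; `lookup_results` and the placeholder lookup string are hypothetical and stand in for the real `Task._get_function_definition` call.

```python
from typing import Any


def lookup_results(calls: list[dict[str, Any]]) -> list[str]:
    """Return one result per tool call, including an explicit error for bad arguments."""
    results: list[str] = []
    for call in calls:
        function_name = call.get("function_name")
        if isinstance(function_name, str):
            # Stand-in for the real definition lookup.
            results.append(f"definition of {function_name}")
        else:
            # Report the problem instead of silently skipping the call.
            results.append(
                f"error: function_name must be a string, got {type(function_name).__name__}"
            )
    return results
```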

reytchison · 2026-01-21T13:56:00Z

common/src/buttercup/common/docker_interactive.py

+
+    def interrupt(self) -> list[str]:
+        """Interrupt the docker process. Overwrite this with program logic,
+        return any lines you need to explain what the interuption did."""


Suggested change

-        return any lines you need to explain what the interuption did."""
+        return any lines you need to explain what the interruption did."""

reytchison · 2026-01-21T14:04:50Z

seed-gen/src/buttercup/seed_gen/debug_subagent_unified.py

+            return Command(update={"debug_script": debug_script})
+        except Exception as e:
+            logger.error(f"Failed to extract debug script from LLM response: {e}")
+            if "llm_response" in locals():


instead of checking locals, initialize llm_response to None and check if not None
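A small sketch of the pattern suggested here, under the assumption that the response is a plain string. `extract_debug_script`, `call_llm`, and `parse` are hypothetical names for the example, not functions from the PR.

```python
from __future__ import annotations

import logging
from collections.abc import Callable

logger = logging.getLogger(__name__)


def extract_debug_script(call_llm: Callable[[], str], parse: Callable[[str], str]) -> str | None:
    llm_response: str | None = None  # bound before the try, so it is always in scope
    try:
        llm_response = call_llm()
        return parse(llm_response)
    except Exception as e:
        logger.error("Failed to extract debug script from LLM response: %s", e)
        if llm_response is not None:
            # Only log the raw response if the LLM call itself succeeded.
            logger.debug("Raw LLM response: %s", llm_response)
        return None
```

This avoids inspecting `locals()` and makes it explicit which failure happened: the call itself or the parsing.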

reytchison · 2026-01-21T14:07:27Z

seed-gen/src/buttercup/seed_gen/seed_gen_bot.py

-        return [BuildType.FUZZER]
+        return [BuildType.FUZZER, BuildType.FUZZER_DEBUG]
+
+    def _is_harness_whitelisted(self, harness_name: str) -> bool:


Suggested change

-    def _is_harness_whitelisted(self, harness_name: str) -> bool:
+    def _is_harness_allowlisted(self, harness_name: str) -> bool:

Use allowlist instead

reytchison · 2026-01-21T14:46:52Z

deployment/k8s/values.yaml


 coverage-bot:
-  enabled: true
+  enabled: false


We shouldn't disable the coverage bot by default

reytchison · 2026-01-21T14:47:58Z

deployment/k8s/values.yaml

+  seedExploreProbFull: .2
+  seedExploreProbDelta: .2
+  minSeedInitRuns: 0
+  minVulnDiscoveryRuns: 0


the defaults in K8s should preserve the defaults in main

reytchison · 2026-01-21T14:48:13Z

deployment/k8s/values.yaml

+  seedExploreProbDelta: .2
+  minSeedInitRuns: 0
+  minVulnDiscoveryRuns: 0
+  # Harness whitelist - comma-separated list of harness names/substrings to allow


Use allowlist instead

reytchison · 2026-01-21T14:55:03Z

common/src/buttercup/common/reproduce_multiple.py

+                f"""  Result: did_run={result.did_run()}, did_crash={result.did_crash()},
+                returncode={result.command_result.returncode if result.command_result else "N/A"}"""


Suggested change

-                f"""  Result: did_run={result.did_run()}, did_crash={result.did_crash()},
-                returncode={result.command_result.returncode if result.command_result else "N/A"}"""
+                f"""Result: did_run={result.did_run()}, did_crash={result.did_crash()}, """
+                f"""returncode={result.command_result.returncode if result.command_result else "N/A"}"""

Make log one line instead of multiline with indentation
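For illustration, the difference between the two forms can be seen with plain variables standing in for the result object (the values below are made up):

```python
did_run, did_crash, returncode = True, False, 0

# Triple-quoted f-string spanning two source lines: the newline and the
# indentation of the second line become part of the log message.
multiline = f"""  Result: did_run={did_run}, did_crash={did_crash},
                returncode={returncode}"""

# Adjacent string literals are concatenated at compile time, so the source
# can wrap while the message stays on one line.
single = (
    f"Result: did_run={did_run}, did_crash={did_crash}, "
    f"returncode={returncode}"
)
```

Here `multiline` contains an embedded newline followed by sixteen spaces, while `single` is one line.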

reytchison · 2026-01-21T14:56:21Z

common/src/buttercup/common/reproduce_multiple.py

+                f"""  Build {i}: sanitizer={build.sanitizer}, engine={build.engine},
+                type={BuildType.Name(build.build_type)}, task_id={build.task_id}"""


Suggested change

-                f"""  Build {i}: sanitizer={build.sanitizer}, engine={build.engine},
-                type={BuildType.Name(build.build_type)}, task_id={build.task_id}"""
+                f"""Build {i}: sanitizer={build.sanitizer}, engine={build.engine}, """
+                f"""type={BuildType.Name(build.build_type)}, task_id={build.task_id}"""

Make log one line without indentation

reytchison · 2026-01-21T15:01:31Z

common/src/buttercup/common/docker_interactive.py

+            for src, dst in mount_dirs.items():
+                docker_cmd += ["-v", f"{src.resolve().as_posix()}:{dst.as_posix()}"]
+                logger.debug("Mounting %s -> %s", src, dst)


Could we make these volumes read-only? (At least the ones we can, I think scratchpad needs to be read/write).
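A sketch of how this could look, assuming the caller knows which container paths must stay writable. The `volume_args` helper and its `writable` parameter are hypothetical; the `:ro` / `:rw` suffix on `-v` is standard Docker bind-mount syntax.

```python
from pathlib import Path


def volume_args(mount_dirs: dict[Path, Path], writable: set[Path]) -> list[str]:
    """Build -v arguments, mounting everything read-only unless listed as writable."""
    args: list[str] = []
    for src, dst in mount_dirs.items():
        mode = "rw" if dst in writable else "ro"
        args += ["-v", f"{src.resolve().as_posix()}:{dst.as_posix()}:{mode}"]
    return args
```

Defaulting to read-only means a newly added mount is not writable by accident; only paths such as the scratchpad need to be named explicitly.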

reytchison · 2026-01-21T15:04:41Z

common/src/buttercup/common/docker_interactive.py

+            "Initializing DockerInteractive session: container=%s, timeout=%.1fs", container_image, global_timeout
+        )
+
+        docker_cmd = ["docker", "run", "--privileged", "--shm-size=2g", "-i", "--name", self.container_name]


Should it pass --rm to remove the container after it stops?

reytchison · 2026-01-21T15:17:47Z

common/src/buttercup/common/docker_interactive.py

+            "Initializing DockerInteractive session: container=%s, timeout=%.1fs", container_image, global_timeout
+        )
+
+        docker_cmd = ["docker", "run", "--privileged", "--shm-size=2g", "-i", "--name", self.container_name]


Could add --network none to isolate from network?
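Both this suggestion and the earlier `--rm` one could be folded into the command construction roughly as follows. `build_docker_cmd` and its `isolate_network` parameter are hypothetical; `--rm` and `--network none` are existing `docker run` options.

```python
def build_docker_cmd(container_name: str, image: str, *, isolate_network: bool = True) -> list[str]:
    """Assemble the docker run command with cleanup and optional network isolation."""
    cmd = ["docker", "run", "--rm", "--privileged", "--shm-size=2g", "-i", "--name", container_name]
    if isolate_network:
        # No network interfaces besides loopback inside the container.
        cmd += ["--network", "none"]
    cmd.append(image)
    return cmd
```

Note that `--rm` removes the container as soon as it exits, so anything needed afterwards has to be written to a mounted volume, and that `--privileged` is still present, so network isolation is only one part of containing the debugger.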

reytchison · 2026-01-21T15:23:18Z

seed-gen/src/buttercup/seed_gen/debug_subagent_unified.py

+                            for line in next_command.split("\n")
+                            if line.strip() and not line.strip().startswith("#")
+                        ]
+                        logger.info(


It might be worth adding a TODO to sanitize higher risk gdb commands (e.g. shell, python). The main focus is isolating the container since the LLM can run gdb but gdb command sanitization might be useful as defense in depth.
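A first cut of such a filter might look like the sketch below. The `filter_gdb_commands` helper and the contents of the denylist are assumptions for illustration. GDB accepts abbreviated command names, so a denylist like this is easy to bypass; an allowlist of expected commands would be a stronger check.

```python
# GDB commands that run shell commands or arbitrary Python (illustrative, not exhaustive).
BLOCKED_GDB_COMMANDS = {"shell", "!", "python", "py", "python-interactive", "pi", "pipe", "|"}


def filter_gdb_commands(script: str) -> tuple[list[str], list[str]]:
    """Split a script into (allowed, rejected) command lines, skipping blanks and comments."""
    allowed: list[str] = []
    rejected: list[str] = []
    for raw in script.split("\n"):
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        first_word = line.split()[0].lower()
        # "!cmd" and "|cmd" work without a space, so also check the prefix.
        if first_word in BLOCKED_GDB_COMMANDS or line.startswith(("!", "|")):
            rejected.append(line)
        else:
            allowed.append(line)
    return allowed, rejected
```

Returning the rejected lines separately lets the caller log them or report them back to the LLM instead of dropping them silently.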

reytchison · 2026-01-21T15:31:41Z

common/src/buttercup/common/challenge_task.py

+        **Language Agnostic**: This method works for any language. For C++ projects,
+        it sets CFLAGS/CXXFLAGS to include full debug symbols. For Java or other languages,
+        setting CFLAGS/CXXFLAGS has no effect (which is fine).


This comment is a bit unclear because it says language agnostic, but really the method is just for C/C++.

I modified the comment to make it clear that it will still build for java projects, and we don't have to know the language when we do this process, but it will only actually add debug symbols for C/C++ projects.

reytchison · 2026-01-21T15:51:04Z

seed-gen/src/buttercup/seed_gen/debug_subagent_unified.py

+            # Verify all source files exist before mounting
+            logger.debug("Verifying source files before Docker mount:")
+            logger.debug("  Debug script:")
+            logger.debug(f"    Path: {debug_script_path}")
+            logger.debug(f"    Exists: {debug_script_path.exists()}")
+            logger.debug(f"    Is file: {debug_script_path.is_file()}")
+            logger.debug(f"    Is directory: {debug_script_path.is_dir()}")
+            if debug_script_path.exists():
+                logger.debug(f"    Size: {debug_script_path.stat().st_size} bytes")
+                logger.debug(f"    Absolute: {debug_script_path.resolve()}")
+
+            logger.info("  PoV input:")
+            logger.debug(f"    Path: {pov_input_path}")
+            logger.debug(f"    Exists: {pov_input_path.exists()}")
+            logger.debug(f"    Is file: {pov_input_path.is_file()}")
+            logger.debug(f"    Is directory: {pov_input_path.is_dir()}")
+            if pov_input_path.exists():
+                logger.debug(f"    Size: {pov_input_path.stat().st_size} bytes")
+                logger.debug(f"    Absolute: {pov_input_path.resolve()}")
+
+            logger.debug("  Build dir:")
+            logger.debug(f"    Path: {build_dir}")
+            logger.debug(f"    Exists: {build_dir.exists()}")
+            logger.debug(f"    Is directory: {build_dir.is_dir()}")
+            logger.debug(f"    Absolute: {build_dir.resolve()}")
+
+            logger.info("  Out dir (parent of build_dir):")
+            logger.debug(f"    Path: {out_dir}")
+            logger.debug(f"    Exists: {out_dir.exists()}")
+            logger.debug(f"    Is directory: {out_dir.is_dir()}")
+            logger.debug(f"    Absolute: {out_dir.resolve()}")


This logging might be too verbose
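If the checks are still useful, one option is to collapse each path into a single debug line. The `describe_path` helper below is a hypothetical sketch, not code from the PR.

```python
from pathlib import Path


def describe_path(label: str, path: Path) -> str:
    """Summarize existence, kind, and size of a path in one line."""
    if not path.exists():
        return f"{label}: {path} (missing)"
    kind = "dir" if path.is_dir() else "file"
    size = f", {path.stat().st_size} bytes" if kind == "file" else ""
    return f"{label}: {path.resolve()} ({kind}{size})"
```

Each of the four blocks above would then become one call such as `logger.debug(describe_path("Debug script", debug_script_path))`.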

reytchison

I'm confirming I did my first pass of the PR, too.

IsaacTrost and others added 30 commits December 20, 2025 14:08

changed configuration to focus on seed-gen

cb715e5

added nomminally working debug_subagent_task

7c34f3d

debugging process

13639d9

debugging process

3d33d31

debugging process

5174f35

debugging process

f33b100

got debug subagent to work more reliably

b0d15d3

debugging why it cant find the seed file

ca33910

added more tests

3a8c4ba

more debugging of debugginf functionality

6143fd6

more debugging of debugger

8e7e276

debugging working, scripts bad because some symbols are not found

5343da1

mostly working live debugging, still a bit pricy though

ca3e946

Added hybrid mode that will do batch, then try interactive if that fails

c4eea75

fixes

ba47126

fixes, updating docker interactive

210ba28

updating mi parsing for gdb

337266a

modifications to the mi parser

c0e5ce0

fix: fixing line 110: set: -g: invalid option for Fish shell (trail…

391f323

…ofbits#408) Co-authored-by: kevin-valerio <kevin-valerio@users.noreply.github.com>

working state

fcfeb5d

working state for monolith, still needs refactor

5f1214a

limit grep output (whoops) and fix building to force optimization flags

81ab8ae

added function lookup tool

63e16ff

added better build selection logic, and avoided c ode duplication

6a06aff

fix fuzzer selection to ignore debug

c313b55

added debug builds as a seperate 'sanitizor'

1256fba

fixed build system, more in line with other dependancies now

364626b

adding new better debug targets, and logging

3bda549

final changes before testing

ec0b7b7

IsaacTrost and others added 2 commits January 18, 2026 20:45

linting

c607c69

Merge branch 'main' into final-pr

e94c560

IsaacTrost marked this pull request as ready for review January 19, 2026 02:08

IsaacTrost requested review from hbrodin, michaelbrownuc, ret2libc and reytchison as code owners January 19, 2026 02:08

reytchison reviewed Jan 20, 2026

reytchison reviewed Jan 20, 2026

hbrodin reviewed Jan 20, 2026

reytchison reviewed Jan 21, 2026

reytchison reviewed Jan 21, 2026

reytchison reviewed Jan 21, 2026

IsaacTrost added 2 commits January 22, 2026 01:10

Merge remote-tracking branch 'upstream/main' into final-pr

b57cc08

restore scripts to remove debugging changes

771c6df

IsaacTrost changed the base branch from main to winternship-debugger January 23, 2026 19:08
