Skip to content

Conversation

@IsaacTrost
Copy link
Collaborator

This is a pull request to integrate a debugger into the seed-gen component of buttercup. It works by building on the existing vuln-discovery component of seed-gen, which uses an llm to analyze the codebase and generate PoV inputs. When those PoV inputs fail, this program spawns a sub-task with the ability to use debug scripts and an interactive debug session to determine why the PoV inputs failed to trigger the hypothesized vulnerability. This is then fed back to the calling task to inform the next round of PoV inputs.

IsaacTrost and others added 30 commits December 20, 2025 14:08
* Improve coverage tracking precision

Previously coverage tracking included all regions. This changes coverage tracking to be more precise.

Macros that result in code are represented as a single line in the function, and any executed line in the macro will make that line 'covered'. Beyound that, only CodeRegions will count. This will prevent code guarded by #if that aren't in the binary from being considered reachable for example. The idea is that seed-gen would have more accurate information when selecting functions to target.

I've done some testing and it appears as if the new implementation in general reaches more lines and covers more functions.

* Fix macro coverage leaking across files due to missing filename in key

The expansion_coverage map used (line, col) as key without filename,
causing coverage from one file to incorrectly appear in another when
macros were at the same coordinates.

* Recursively expand macros and add named types for coverage data

- Process ExpansionRegion target_regions recursively instead of counting
  only the call site line
- Add coordinate index for O(1) expansion lookups
- Cache computed expansion lines to avoid recomputation across functions
- Use bulk set operations for better performance
- Prevent infinite loops with visited set for circular macro references
- Add named types for coverage data structures:
  - RegionCoords, ExpansionKey, CachedExpansionLines
  - Type aliases: ExpansionMap, CoordToFilenames, ExpansionLinesCache
- Add comprehensive tests for nested macros and edge cases
…ofbits#408)

Co-authored-by: kevin-valerio <kevin-valerio@users.noreply.github.com>
@IsaacTrost IsaacTrost marked this pull request as ready for review January 19, 2026 02:08
Comment on lines 114 to 127
# Pull Git LFS files if the repository uses LFS
logger.info("Pulling Git LFS files (if any)")
result = subprocess.run(
["git", "lfs", "pull"],
cwd=sub_path,
capture_output=True,
text=True,
check=False, # Don't fail if LFS is not used or not installed
)
if result.returncode == 0:
logger.info(f"Git LFS pull output: {result.stdout}")
else:
logger.debug(f"Git LFS pull failed or not needed: {result.stderr}")

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No longer need to add lfs pull logic in PR since this was merged: #413

# Read probability overrides from environment variables
self.TASK_SEED_INIT_PROB_FULL = float(os.getenv("BUTTERCUP_SEED_INIT_PROB_FULL", self.TASK_SEED_INIT_PROB_FULL))
self.TASK_VULN_DISCOVERY_PROB_FULL = float(
os.getenv("BUTTERCUP_VULN_DISCOVERY_PROB_FULL", self.TASK_VULN_DISCOVERY_PROB_FULL)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you make these new options configurable by the pydantic settings in config.py, please? When you add them to the pydantic settings you'll get built-in support for env variables, too.

Copy link
Collaborator

@hbrodin hbrodin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Well done! I've done an initial pass on it and left a bunch of comments. Impressive work overall.

Comment on lines +6 to +10
cpu: 2000m
memory: 8Gi
requests:
cpu: 200m
memory: 256Mi
cpu: 500m
memory: 1Gi
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are these changes necessary to run with the debugger integrated? Bumping the requests will require more resources for the system to run and might prevent some systems from running Buttercup.

Copy link
Contributor

@reytchison reytchison Jan 21, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This might be an artifact of debugging the OOM, before the coverage bot was identified as the source of the OOM? Edit: responded under wrong comment

Comment on lines +234 to +242
# Resource limits for LiteLLM container to prevent OOMKilled errors
resources:
limits:
cpu: 2000m
memory: 4Gi
requests:
cpu: 500m
memory: 1Gi

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are the OOMs due to LiteLLM? Or is LiteLLM killed due to high memory pressure situation?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This might be an artifact of debugging the OOM, before the coverage bot was identified as the source of the OOM?


RUN DEBIAN_FRONTEND=noninteractive apt-get update && \
apt-get install -y git curl && \
apt-get install -y git git-lfs curl && \
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should rebase/merge once #413 is merged into main.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think these changes are needed as you can just specify this on the command line. Or, am I missing something?

remaining_time = end_time - time.time()
if remaining_time <= 0:
logger.warning("Command timeout after %.1fs: %s", timeout, command)
lines.append(f"\n***timout waiting for end of output after {timeout} seconds***")
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
lines.append(f"\n***timout waiting for end of output after {timeout} seconds***")
lines.append(f"\n***timeout waiting for end of output after {timeout} seconds***")

Comment on lines +47 to +53
# Log all available builds before testing
logger.info(f"Testing PoV '{pov.name}' against {len(self.build_outputs)} builds for harness '{harness_name}'")
for i, build in enumerate(self.build_outputs):
logger.info(
f""" Build {i}: sanitizer={build.sanitizer}, engine={build.engine},
type={BuildType.Name(build.build_type)}, task_id={build.task_id}"""
)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why? You are logging them again in the loop below I think.

return debug_binary_path is not None and debug_binary_path.exists()


def resolve_actual_binary(
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Consider breaking into smaller functions

return "\n\n".join(str(attempt) for attempt in self.debug_attempts)


class DebugSubagentUnified:
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a very large class with some functions being very long. Is there maybe a way to break it down into smaller classes? What is the gain of having it like this over splitting it? Ideally you'd be able to compose the hybrid from the batch + interactive ones if splitting like that.

return

build_dir = Path(builds[BuildType.FUZZER][0].task_dir)
logger.info(f"Build directory: {build_dir}")
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

debug?

function_name = call.arguments["function_name"]
result = Task._get_function_definition(function_name, state, tool_call_id)
results.append(result)
if isinstance(function_name, str):
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if it is not a string, what happens then?


def interrupt(self) -> list[str]:
"""Interrupt the docker process. Overwrite this with program logic,
return any lines you need to explain what the interuption did."""
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
return any lines you need to explain what the interuption did."""
return any lines you need to explain what the interruption did."""

return Command(update={"debug_script": debug_script})
except Exception as e:
logger.error(f"Failed to extract debug script from LLM response: {e}")
if "llm_response" in locals():
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

instead of checking locals, initialize llm_response to None and check if not None

return [BuildType.FUZZER]
return [BuildType.FUZZER, BuildType.FUZZER_DEBUG]

def _is_harness_whitelisted(self, harness_name: str) -> bool:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
def _is_harness_whitelisted(self, harness_name: str) -> bool:
def _is_harness_allowlisted(self, harness_name: str) -> bool:

Use allowlist instead


coverage-bot:
enabled: true
enabled: false
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We shouldn't disable the coverage bot by default

seedExploreProbFull: .2
seedExploreProbDelta: .2
minSeedInitRuns: 0
minVulnDiscoveryRuns: 0
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the defaults in K8s should preserve the defaults in main

seedExploreProbDelta: .2
minSeedInitRuns: 0
minVulnDiscoveryRuns: 0
# Harness whitelist - comma-separated list of harness names/substrings to allow
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Use allowlist instead

Comment on lines +68 to +69
f""" Result: did_run={result.did_run()}, did_crash={result.did_crash()},
returncode={result.command_result.returncode if result.command_result else "N/A"}"""
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
f""" Result: did_run={result.did_run()}, did_crash={result.did_crash()},
returncode={result.command_result.returncode if result.command_result else "N/A"}"""
f"""Result: did_run={result.did_run()}, did_crash={result.did_crash()}, """
f"""returncode={result.command_result.returncode if result.command_result else "N/A"}"""

Make log one line instead of multiline with indentation

Comment on lines +51 to +52
f""" Build {i}: sanitizer={build.sanitizer}, engine={build.engine},
type={BuildType.Name(build.build_type)}, task_id={build.task_id}"""
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
f""" Build {i}: sanitizer={build.sanitizer}, engine={build.engine},
type={BuildType.Name(build.build_type)}, task_id={build.task_id}"""
f"""Build {i}: sanitizer={build.sanitizer}, engine={build.engine}, """
f"""type={BuildType.Name(build.build_type)}, task_id={build.task_id}"""

Make log one line without indentation

Comment on lines +55 to +57
for src, dst in mount_dirs.items():
docker_cmd += ["-v", f"{src.resolve().as_posix()}:{dst.as_posix()}"]
logger.debug("Mounting %s -> %s", src, dst)
Copy link
Contributor

@reytchison reytchison Jan 21, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could we make these volumes read-only? (At least the ones we can, I think scratchpad needs to be read/write).

"Initializing DockerInteractive session: container=%s, timeout=%.1fs", container_image, global_timeout
)

docker_cmd = ["docker", "run", "--privileged", "--shm-size=2g", "-i", "--name", self.container_name]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should it pass --rm to remove the container after it stops?

"Initializing DockerInteractive session: container=%s, timeout=%.1fs", container_image, global_timeout
)

docker_cmd = ["docker", "run", "--privileged", "--shm-size=2g", "-i", "--name", self.container_name]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could add --network none to isolate from network?

for line in next_command.split("\n")
if line.strip() and not line.strip().startswith("#")
]
logger.info(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It might be worth adding a TODO to sanitize higher risk gdb commands (e.g. shell, python). The main focus is isolating the container since the LLM can run gdb but gdb command sanitization might be useful as defense in depth.

Comment on lines +673 to +675
**Language Agnostic**: This method works for any language. For C++ projects,
it sets CFLAGS/CXXFLAGS to include full debug symbols. For Java or other languages,
setting CFLAGS/CXXFLAGS has no effect (which is fine).
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This comment is a bit unclear because it says language agnostic, but really the method is just for C/C++.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I modified the comment to make it clear that it will still build for java projects, and we don't have to know the language when we do this process, but it will only actually add debug symbols for C/C++ projects.

Comment on lines +773 to +803
# Verify all source files exist before mounting
logger.debug("Verifying source files before Docker mount:")
logger.debug(" Debug script:")
logger.debug(f" Path: {debug_script_path}")
logger.debug(f" Exists: {debug_script_path.exists()}")
logger.debug(f" Is file: {debug_script_path.is_file()}")
logger.debug(f" Is directory: {debug_script_path.is_dir()}")
if debug_script_path.exists():
logger.debug(f" Size: {debug_script_path.stat().st_size} bytes")
logger.debug(f" Absolute: {debug_script_path.resolve()}")

logger.info(" PoV input:")
logger.debug(f" Path: {pov_input_path}")
logger.debug(f" Exists: {pov_input_path.exists()}")
logger.debug(f" Is file: {pov_input_path.is_file()}")
logger.debug(f" Is directory: {pov_input_path.is_dir()}")
if pov_input_path.exists():
logger.debug(f" Size: {pov_input_path.stat().st_size} bytes")
logger.debug(f" Absolute: {pov_input_path.resolve()}")

logger.debug(" Build dir:")
logger.debug(f" Path: {build_dir}")
logger.debug(f" Exists: {build_dir.exists()}")
logger.debug(f" Is directory: {build_dir.is_dir()}")
logger.debug(f" Absolute: {build_dir.resolve()}")

logger.info(" Out dir (parent of build_dir):")
logger.debug(f" Path: {out_dir}")
logger.debug(f" Exists: {out_dir.exists()}")
logger.debug(f" Is directory: {out_dir.is_dir()}")
logger.debug(f" Absolute: {out_dir.resolve()}")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This logging might be too verbose

Copy link
Contributor

@reytchison reytchison left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm confirming I did my first pass of the PR, too.

@IsaacTrost IsaacTrost changed the base branch from main to winternship-debugger January 23, 2026 19:08
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants