@anpendyal

Draft: moving over citation_exists.py requirement and corresponding test; also added eyecite usage for LLM output parsing

@nrfulton self-requested a review November 4, 2025 15:41
Comment on lines +36 to +75
# might not be needed
# def ensure_list_of_dicts(obj: Any) -> list[dict]:
#     """
#     Normalize any JSON-like object into a list of dictionaries.

#     Accepts:
#         - A JSON string (object or array)
#         - A single dict
#         - A list of dicts

#     Args:
#         obj: Any data type, ideally something that can be unpacked into a dictionary

#     Returns:
#         The unpacked object in list of dictionary form or raises an error.
#     """
#     # JSON string
#     if isinstance(obj, str):
#         try:
#             obj = json.loads(obj)
#         except json.JSONDecodeError as e:
#             raise ValueError(f"Invalid JSON string: {e!s}")

#     # Single dict
#     if isinstance(obj, dict):
#         return [obj]

#     # List of dicts
#     if isinstance(obj, list):
#         if all(isinstance(item, dict) for item in obj):
#             return obj
#         else:
#             raise ValueError("List contains non-dictionary elements")

#     raise TypeError(f"Unsupported metadata format: {type(obj)}")

# alternatively:
# should this take in last_output instead of the whole context?
# get case name: take LLM output and extract case name --> a string which you get from ctx.last_output() is the input
# so the argument should be ctx.last_output.value: str
Contributor

  1. What was the purpose of this?
  2. Remove the commented-out code.

Comment on lines +89 to +96
# install hyperscan if not already installed
# !pip install hyperscan
# tokenizer = HyperscanTokenizer(cache_dir=".test_cache")
# citations = get_citations(cleaned_text, tokenizer=tokenizer)

# or this?
# cleaned_text = clean_text(text, ["html", "all_whitespace"])
# citations = get_citations(cleaned_text)
Contributor

What's going on here?

Comment on lines +103 to +108
plaintiff = citation.metadata.get("plaintiff")
defendant = citation.metadata.get("defendant")
if plaintiff and defendant:
    case_names.add(f"{plaintiff} v. {defendant}")
# name = citation.metadata['plaintiff'] + " v. " + citation.metadata['defendant']
# case_names.add(name)
Contributor

  1. What's the purpose of this?
  2. Are these actually canonical names/references?
  3. What happens if you don't have a plaintiff and defendant? Can that ever happen? If not -> assert. If yes -> handle exceptional cases.
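
A minimal sketch of the "handle exceptional cases" option in point 3. The metadata objects below are `SimpleNamespace` stand-ins, not eyecite's real class, and `case_name_from_metadata` and `collect_case_names` are hypothetical helper names. Whether eyecite exposes the party names through attribute access or `.get` should be checked against the installed version; the sketch assumes attributes.

```python
from types import SimpleNamespace
from typing import Optional


def case_name_from_metadata(metadata) -> Optional[str]:
    """Build "Plaintiff v. Defendant", or return None when a party is missing.

    Hypothetical helper: `metadata` is assumed to expose `plaintiff` and
    `defendant` attributes, either of which may be absent, None, or empty.
    A caption such as "In re Gault" has no plaintiff/defendant pair at all.
    """
    plaintiff = (getattr(metadata, "plaintiff", None) or "").strip()
    defendant = (getattr(metadata, "defendant", None) or "").strip()
    if plaintiff and defendant:
        return f"{plaintiff} v. {defendant}"
    # Exceptional case: report it to the caller instead of silently dropping it.
    return None


def collect_case_names(metadata_items):
    """Split items into usable case names and items that need other handling."""
    names, incomplete = set(), []
    for item in metadata_items:
        name = case_name_from_metadata(item)
        if name is None:
            incomplete.append(item)
        else:
            names.add(name)
    return names, incomplete


# Stand-in objects for illustration only.
full = SimpleNamespace(plaintiff="Brown", defendant="Board of Education")
partial = SimpleNamespace(plaintiff=None, defendant="Gault")
names, incomplete = collect_case_names([full, partial])
```

Returning the incomplete items separately keeps the decision (fail, warn, or fall back to the reporter citation) with the caller rather than hiding it inside the extraction loop.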

Comment on lines +163 to +173
# check if this code chunk is right later
# db_names = {normalize_case_name(c["name"]) for c in case_metadata if "name" in c}
# db_abbrevs = {
#     normalize_case_name(c["name_abbreviation"]) for c in case_metadata if "name_abbreviation" in c
# }

# for name in normalized_output_names:
#     if name not in db_names and name not in db_abbrevs:
#         return ValidationResult(False, reason=f"Case '{name}' not found in database")

# return ValidationResult(True, reason="All case names found in database")
Contributor

Remove the commented-out code.

case_names = extract_case_names(ctx)

if not case_names or not isinstance(case_names, list[str]):
    return ValidationResult(False, reason="No case names provided in output")
Contributor

I think this should probably return True. The reason is good.
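
A rough sketch of what that change could look like. `ValidationResult` here is a stand-in dataclass with assumed fields, and `is_list_of_str` and `check_case_names` are hypothetical names. The sketch also sidesteps a separate problem in the quoted check: `isinstance(case_names, list[str])` raises `TypeError`, because `isinstance` does not accept parameterized generics. Treating a malformed extractor result as a failure is an assumption of this sketch, not something stated in the review.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ValidationResult:
    """Stand-in for the project's ValidationResult; fields are assumed."""
    passed: bool
    reason: Optional[str] = None


def is_list_of_str(value) -> bool:
    """Runtime check for "list of str".

    isinstance(value, list[str]) raises TypeError, so the container and
    its elements are checked separately.
    """
    return isinstance(value, list) and all(isinstance(v, str) for v in value)


def check_case_names(case_names) -> ValidationResult:
    if not is_list_of_str(case_names):
        # Assumption: a malformed extractor result is a genuine failure.
        return ValidationResult(False, reason="Case names were not a list of strings")
    if not case_names:
        # Nothing was cited, so there is nothing that could be wrong.
        return ValidationResult(True, reason="No case names provided in output")
    # Placeholder: the database lookup would go here (omitted in this sketch).
    return ValidationResult(True, reason=f"Found {len(case_names)} case name(s) to check")
```

Separating "nothing to check" from "malformed input" keeps the passing case honest: an output without citations passes, while an extractor bug still fails loudly.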

Comment on lines +148 to +152
for case in case_metadata:
    if 'name' in case:
        case_names.add(normalize_case_name(case['name']))
    if 'name_abbreviation' in case:
        case_name_abb.add(normalize_case_name(case['name_abbreviation']))
Contributor

This approach seems likely to produce a lot of errors.

What about cases where the case name isn't verbatim? E.g., when there is a large set of parties on one side or the other, the cite will often abbreviate them. State names also often differ between the formal cite and how the case is cited inline.

Additionally, there is a lot of string manipulation here that I think can be streamlined or done in a more principled way. Should we implement this as something like likely_equivalent(c1, c2), where c1 and c2 are eyecite citation objects?
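
One possible shape for `likely_equivalent`, as a sketch only. `Cite` is a stand-in dataclass rather than an eyecite citation class, its field names are assumed, and the noise-word list is illustrative. The heuristic compares volume, reporter, and page when both sides have them, and otherwise falls back to a token-subset comparison of party names. It does not handle parallel citations in different reporters or state-name variants.

```python
import re
from dataclasses import dataclass
from typing import Optional

# Words that carry little identifying information in a party name (assumed list).
_NOISE = {"the", "of", "et", "al", "inc", "co", "corp", "llc", "ltd"}


@dataclass(frozen=True)
class Cite:
    """Stand-in for an eyecite citation object; all field names are assumed."""
    plaintiff: Optional[str] = None
    defendant: Optional[str] = None
    volume: Optional[str] = None
    reporter: Optional[str] = None
    page: Optional[str] = None


def _tokens(party: Optional[str]) -> frozenset:
    """Lowercase a party name, split on non-alphanumerics, drop noise words."""
    words = re.findall(r"[a-z0-9]+", (party or "").lower())
    return frozenset(w for w in words if w not in _NOISE)


def _party_match(a: Optional[str], b: Optional[str]) -> bool:
    ta, tb = _tokens(a), _tokens(b)
    if not ta or not tb:
        return False
    # One side may be a shortened form of the other ("Smith" vs "Smith et al.").
    return ta <= tb or tb <= ta


def likely_equivalent(c1: Cite, c2: Cite) -> bool:
    loc1 = (c1.volume, c1.reporter, c1.page)
    loc2 = (c2.volume, c2.reporter, c2.page)
    if all(loc1) and all(loc2):
        # When both locators are complete, they decide the comparison.
        return loc1 == loc2
    # Otherwise fall back to comparing both parties by name.
    return _party_match(c1.plaintiff, c2.plaintiff) and _party_match(
        c1.defendant, c2.defendant
    )
```

Keeping the comparison in one function gives the tests a single place to pin down hard cases (abbreviated parties, missing defendants) instead of spreading string normalization across the validator.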
