Draft: moving over citation_exists.py requirement #7
base: main
…est; also added eyecite useage for LLM output parsing
| # might not be needed | ||
| # def ensure_list_of_dicts(obj: Any) -> list[dict]: | ||
| # """ | ||
| # Normalize any JSON-like object into a list of dictionaries. | ||
|
|
||
| # Accepts: | ||
| # - A JSON string (object or array) | ||
| # - A single dict | ||
| # - A list of dicts | ||
|
|
||
| # Args: | ||
| # obj: Any data type, ideally something that can unpacked into a dictionary | ||
|
|
||
| # Returns: | ||
| # The unpacked object in list of dictionary form or raises an error. | ||
| # """ | ||
| # # JSON string | ||
| # if isinstance(obj, str): | ||
| # try: | ||
| # obj = json.loads(obj) | ||
| # except json.JSONDecodeError as e: | ||
| # raise ValueError(f"Invalid JSON string: {e!s}") | ||
|
|
||
| # # Single dict | ||
| # if isinstance(obj, dict): | ||
| # return [obj] | ||
|
|
||
| # # List of dicts | ||
| # if isinstance(obj, list): | ||
| # if all(isinstance(item, dict) for item in obj): | ||
| # return obj | ||
| # else: | ||
| # raise ValueError("List contains non-dictionary elements") | ||
|
|
||
| # raise TypeError(f"Unsupported metadata format: {type(obj)}") | ||
|
|
||
| # alternatively: | ||
| # should this take in last_output instead of the whole context? | ||
| # get case name: take LLM output and extract case name --> a string which you get from ctx.last_output() is the input | ||
| # so the argument should be ctx.last_output.value: str |
- What was the purpose of this?
- Remove commented-out code.
| # install hyperscan if not already installed | ||
| # !pip install hyperscan | ||
| # tokenizer = HyperscanTokenizer(cache_dir=".test_cache") | ||
| # citations = get_citations(cleaned_text, tokenizer=tokenizer) | ||
|
|
||
| # or this? | ||
| # cleaned_text = clean_text(text, ["html", "all_whitespace"]) | ||
| # citations = get_citations(cleaned_text) |
What's going on here?
| plaintiff = citation.metadata.get("plaintiff") | ||
| defendant = citation.metadata.get("defendant") | ||
| if plaintiff and defendant: | ||
| case_names.add(f"{plaintiff} v. {defendant}") | ||
| # name = citation.metadata['plaintiff'] + " v. " + citation.metadata['defendant'] | ||
| # case_names.add(name) |
- What's the purpose of this?
- Are these actually canonical names/references?
- What happens if you don't have a plaintiff and defendant? Can that ever happen? If not -> assert. If yes -> handle exceptional cases.
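On the last point, here is a minimal sketch of handling the missing-party case explicitly. The helper name `case_name_from_metadata` is hypothetical. It also assumes the metadata object exposes `plaintiff` and `defendant` as attributes (worth verifying against eyecite; if it is not a dict, `.get(...)` will not work), so it reads them with `getattr`. `SimpleNamespace` stands in for a real citation's metadata.

```python
from types import SimpleNamespace
from typing import Any, Optional


def case_name_from_metadata(metadata: Any) -> Optional[str]:
    """Build a display name, tolerating a missing plaintiff or defendant."""
    # getattr with a default works whether or not the attribute exists.
    plaintiff = (getattr(metadata, "plaintiff", None) or "").strip()
    defendant = (getattr(metadata, "defendant", None) or "").strip()
    if plaintiff and defendant:
        return f"{plaintiff} v. {defendant}"
    # Single-party caption: return whichever side is present.
    if plaintiff or defendant:
        return plaintiff or defendant
    # Nothing usable: let the caller decide whether to skip or fail.
    return None


# Stand-in objects (assumption: real metadata has the same attribute names).
full = case_name_from_metadata(SimpleNamespace(plaintiff="Roe", defendant="Wade"))
single = case_name_from_metadata(SimpleNamespace(plaintiff=None, defendant="Gault"))
empty = case_name_from_metadata(SimpleNamespace(plaintiff=None, defendant=None))
```

Returning `None` rather than silently skipping keeps the decision (assert, skip, or report) visible at the call site.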
| # check if this code chunk is right later | ||
| # db_names = {normalize_case_name(c["name"]) for c in case_metadata if "name" in c} | ||
| # db_abbrevs = { | ||
| # normalize_case_name(c["name_abbreviation"]) for c in case_metadata if "name_abbreviation" in c | ||
| # } | ||
|
|
||
| # for name in normalized_output_names: | ||
| # if name not in db_names and name not in db_abbrevs: | ||
| # return ValidationResult(False, reason=f"Case '{name}' not found in database") | ||
|
|
||
| # return ValidationResult(True, reason="All case names found in database") |
Remove commented-out code.
| case_names = extract_case_names(ctx) | ||
|
|
||
| if not case_names or not isinstance(case_names, list[str]): | ||
| return ValidationResult(False, reason="No case names provided in output") |
I think this should probably return True. The reason is good.
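A separate note on the check itself: `isinstance(case_names, list[str])` raises `TypeError` whenever `case_names` is non-empty, because `isinstance` does not accept parameterized generics. Below is a minimal sketch of an element-wise check. It assumes `extract_case_names` is meant to return a list of strings, and the helper name `is_list_of_str` is hypothetical.

```python
from typing import Any


def is_list_of_str(obj: Any) -> bool:
    """Return True only if obj is a list whose items are all strings."""
    return isinstance(obj, list) and all(isinstance(item, str) for item in obj)


# Demonstrate why the original check fails: isinstance() rejects list[str].
try:
    isinstance(["Roe v. Wade"], list[str])
    raised = False
except TypeError:
    raised = True
```

If `extract_case_names` actually returns a set (the diff above calls `case_names.add(...)`), the container type in this check would need to change accordingly.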
| for case in case_metadata: | ||
| if 'name' in case: | ||
| case_names.add(normalize_case_name(case['name'])) | ||
| if 'name_abbreviation' in case: | ||
| case_name_abb.add(normalize_case_name(case['name_abbreviation'])) |
This approach seems like it will produce a lot of errors.
What about cases where the case name isn't verbatim? E.g., if there is a large set of parties on one side or the other, the cite will sometimes abbreviate them. State names also often differ between the formal cite and how they're cited inline.
Additionally, there is a lot of string manipulation here that I think can be streamlined or done in a more principled way. Should we implement this as something like likely_equivalent(c1, c2), where c1 and c2 are eyecite citation objects?
Draft: moving over citation_exists.py requirement and corresponding test; also added eyecite useage for LLM output parsing