This challenge involves evaluating the correctness of gene-trait mappings extracted by large language models (LLMs), using data stored in a Trait Ontology (TO) vector database. The goal is to assess how well trait names align with their corresponding TO IDs and to propose a systematic approach for correcting errors.
You will analyse the outputs of three LLMs, stored in separate CSV files. Each row contains a `trait_name` and a `trait_id`. Use the accompanying `trait_ontology_details.txt` file to validate these mappings. Your submission should address the following:
- Goal: Define a quantitative metric that evaluates the correctness of each row in the CSV files.
- Considerations:
  - Does the `trait_name` correspond to the `trait_id` based on `trait_ontology_details.txt`?
  - Partial matches or synonyms (if applicable).
  - Confidence in mapping correctness based on string similarity or semantic context (e.g., Levenshtein distance, embeddings).
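As one illustration (a sketch, not the required metric), a per-row correctness score could combine an exact-match check against the ontology with a string-similarity fallback. The dict-based ontology lookup and the use of `difflib` in place of a dedicated Levenshtein library are assumptions:

```python
from difflib import SequenceMatcher

def row_correctness(trait_name: str, trait_id: str, ontology: dict) -> float:
    """Score one CSV row in [0, 1].

    `ontology` maps TO IDs to canonical trait names, as might be parsed
    from trait_ontology_details.txt (the parsing step is assumed done).
    """
    expected = ontology.get(trait_id)
    if expected is None:
        return 0.0  # trait_id not found in the ontology at all
    a, b = trait_name.lower().strip(), expected.lower().strip()
    if a == b:
        return 1.0  # exact match with the canonical term
    # Fallback: normalised string similarity as a proxy for partial matches
    return SequenceMatcher(None, a, b).ratio()

# Toy example using the spikelet-length term mentioned in the task
ontology = {"TO:0002768": "spikelet length"}
print(row_correctness("spike length", "TO:0002768", ontology))
```

A semantic variant would replace the `SequenceMatcher` fallback with cosine similarity between embeddings of the two strings.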
- Goal: Aggregate the correctness scores across rows and rank the 3 CSVs based on the overall quality of their mappings.
- Deliverable: A ranked list of the files, supported by a brief explanation of your scoring methodology.
- Goal: Outline a systematic approach to identify and correct mismatches.
- Key Requirements:
  - Use `trait_ontology_details.txt` to suggest correct `trait_id`s for mismapped terms.
  - Justify your proposed solution (e.g., by leveraging similarity measures, ontology hierarchies, or semantic embeddings).
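A minimal correction sketch, assuming the ontology file has already been parsed into a dict of TO ID to canonical name: search for the nearest ontology term by string similarity and only suggest it above a threshold (the 0.6 cut-off is an arbitrary assumption to be tuned):

```python
from difflib import SequenceMatcher

def suggest_correction(trait_name: str, ontology: dict, threshold: float = 0.6):
    """Propose the best-matching TO ID for a (possibly mismapped) term.

    `ontology` maps TO IDs to canonical names (parsed from
    trait_ontology_details.txt; parsing is assumed done elsewhere).
    Returns (to_id, to_term, similarity), or None if nothing clears
    the similarity threshold.
    """
    query = trait_name.lower().strip()
    best = max(
        ((to_id, term, SequenceMatcher(None, query, term.lower()).ratio())
         for to_id, term in ontology.items()),
        key=lambda x: x[2],
    )
    return best if best[2] >= threshold else None
```

Against a full ontology, an embedding index (e.g., the TO vector database mentioned in the brief) would scale better than a linear scan and catch synonyms that string similarity misses.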
- Goal: Apply your methods to predict TO mappings for a set of challenging trait terms that do not have exact matches in TO. The aim is to provide a well-structured output for a manual data curation step, helping to reduce the time required for manual review.
- Key Requirements:
  - Use `validation_trait_names.txt` to evaluate your model or approach.
  - Provide a confidence value for the mapping, where:
    - `0` indicates that the trait term does not exist in TO (e.g., vernalization requirement).
    - `1` indicates that the identified TO term is 100% correct (e.g., spike length -> `TO:0002768` -> spikelet length).
  - Generate explanations for low-confidence predictions using an LLM as the judge.
  - Submit a CSV with the following columns: `trait_term`, `to_id`, `to_term`, `confidence`, `explanation`, along with your code.
- Interactive Notebook:
- Include code, visualisations, and explanations of your approach.
- Document key functions and decisions thoroughly.
- Summary Presentation:
- Prepare a 5-slide deck explaining:
- Your methodology for each task.
- Results (e.g., correctness metric, ranked CSVs, error corrections).
- Insights or challenges encountered.
- Use visualisations to support your findings (e.g., charts, tables).
- Code Documentation:
- Clean, well-structured code with clear comments.
- Include instructions for running your notebook.
- Technical Proficiency:
- Innovative metric definition and accurate validation logic.
- Effective error correction methodology.
- Analytical Rigor:
- Sound justification for ranking and corrections.
- Clear and interpretable visualisations.
- Communication:
- Clarity in documentation and presentation.
- Ability to explain technical concepts to a non-technical audience.
- Submit the notebook and presentation in a compressed folder.
- Use the following structure:
```
submission/
├── analysis_notebook.ipynb
├── predicted_trait_mappings.csv
├── presentation.pdf
└── README.md
```