This directory contains tools for performing analogy reasoning on knowledge graph embeddings to analyze relationships between microbial taxa and their physical growth preferences.
The analogy reasoning follows the vector arithmetic pattern:
query_taxon - physical_preference + opposite_physical_preference = predicted_taxon
For example:
- E. coli - aerobe + anaerobe = ?
- Bacillus subtilis - high_pH + low_pH = ?
- Thermotoga maritima - high_temp + low_temp = ?
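The arithmetic above can be illustrated with a minimal sketch on toy vectors. The three-dimensional values and the `analogy_vector` helper are hypothetical and only show the pattern; the real embeddings are 500-dimensional and the actual script may organize this differently.

```python
import numpy as np

# Toy 3-dimensional vectors (hypothetical values, for illustration only).
query_taxon = np.array([0.9, 0.1, 0.4])   # e.g. the query taxon
aerobe = np.array([0.8, 0.0, 0.1])        # representative vector for "aerobe"
anaerobe = np.array([0.0, 0.7, 0.1])      # representative vector for "anaerobe"


def analogy_vector(query, trait, opposite_trait):
    """Return query - trait + opposite_trait."""
    return query - trait + opposite_trait


result = analogy_vector(query_taxon, aerobe, anaerobe)
print(result)
```

The nodes closest to `result` in the embedding space are then taken as the predictions for the `?` on the right-hand side.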
The analysis uses:
- Embeddings: output/DeepWalkSkipGramEnsmallen_degreenorm_embedding_500_2025-04-07_03_18_35.tsv.gz (500-dimensional node embeddings)
- Oxygen preferences: output/NCBITaxon_to_oxygen.tsv
- Salinity preferences: taxa_media/NCBITaxon_to_salinity_v3.tsv
- pH preferences: taxa_media/taxa_pH_opt_mapping_adjusted_v2.tsv
- Temperature preferences: taxa_media/NCBITaxon_to_temp_opt_v2.tsv
Opposite trait pairs used in the analogies:

Oxygen:
- aerobe ↔ anaerobe
- facultative_anaerobe ↔ aerobe
- microaerophile ↔ aerobe

Salinity:
- halophilic ↔ non_halophilic
- moderately_halophilic ↔ non_halophilic

pH:
- high ↔ low
- mid1 ↔ mid2

Temperature:
- high ↔ low
- mid1 ↔ mid4
- mid2 ↔ mid3
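One way to encode the pairs above is a lookup table keyed by trait type and trait value. This is a sketch, not the script's actual data structure: the name `OPPOSITE_TRAIT` and the helper `get_opposite` are hypothetical, the assignment of the last two groups to `ph_opt` and `temp_opt` is assumed from the order in which the trait types are listed, and the choice of `anaerobe` as the opposite of `aerobe` (and `halophilic` as the opposite of `non_halophilic`) is an assumption where a value appears in more than one pair.

```python
# Hypothetical lookup: (trait_type, trait_value) -> opposite trait value.
OPPOSITE_TRAIT = {
    # Oxygen
    ("oxygen", "aerobe"): "anaerobe",            # assumed primary opposite
    ("oxygen", "anaerobe"): "aerobe",
    ("oxygen", "facultative_anaerobe"): "aerobe",
    ("oxygen", "microaerophile"): "aerobe",
    # Salinity
    ("salinity", "halophilic"): "non_halophilic",
    ("salinity", "non_halophilic"): "halophilic",  # assumed primary opposite
    ("salinity", "moderately_halophilic"): "non_halophilic",
    # pH optimum (group assignment assumed from listing order)
    ("ph_opt", "high"): "low",
    ("ph_opt", "low"): "high",
    ("ph_opt", "mid1"): "mid2",
    ("ph_opt", "mid2"): "mid1",
    # Temperature optimum (group assignment assumed from listing order)
    ("temp_opt", "high"): "low",
    ("temp_opt", "low"): "high",
    ("temp_opt", "mid1"): "mid4",
    ("temp_opt", "mid4"): "mid1",
    ("temp_opt", "mid2"): "mid3",
    ("temp_opt", "mid3"): "mid2",
}


def get_opposite(trait_type, trait_value):
    """Return the opposite trait value, or None if no pair is defined."""
    return OPPOSITE_TRAIT.get((trait_type, trait_value))


print(get_opposite("oxygen", "microaerophile"))
```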
To run the analysis:

    cd neurosymbolreason
    python analogy_reasoning.py

The script performs the following steps:
- Loads Embeddings: Reads the 500-dimensional DeepWalk embeddings for all nodes
- Loads Trait Data: Loads physical growth preference data for oxygen, salinity, pH, and temperature
- Identifies Query Taxa: Finds all taxa with both embeddings and known traits
- Performs Analogies: For each taxon-trait combination:
  - Computes trait representative vectors (mean of all taxa with that trait)
  - Performs vector arithmetic: query - trait + opposite_trait
  - Finds top 10 closest NCBITaxon:, strain:, ph_, and nacl_ nodes using cosine similarity
  - Calculates self-match score for quality assessment
- Analyzes Results: Creates visualizations and statistics
- High-Quality Analysis: Identifies predictions with similarity scores above or statistically close to self-match
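The analogy step described above can be sketched as follows. This is an illustrative outline on toy 2-dimensional data, not the code in analogy_reasoning.py: the function names (`trait_centroid`, `cosine_to_all`, `run_analogy`) are hypothetical, and skipping the query node in the ranked predictions is an assumption made here because the query is scored separately as the self-match.

```python
import numpy as np

ALLOWED_PREFIXES = ("NCBITaxon:", "strain:", "ph_", "nacl_")


def trait_centroid(embeddings, node_ids, taxa_with_trait):
    """Mean embedding of all taxa annotated with a trait."""
    index = {node: i for i, node in enumerate(node_ids)}
    rows = [index[t] for t in taxa_with_trait if t in index]
    return embeddings[rows].mean(axis=0)


def cosine_to_all(target, embeddings):
    """Cosine similarity between one vector and every row of a matrix."""
    row_norms = np.linalg.norm(embeddings, axis=1)
    denom = np.clip(row_norms * np.linalg.norm(target), 1e-12, None)
    return (embeddings @ target) / denom


def run_analogy(query_id, trait_taxa, opposite_taxa, embeddings, node_ids, k=10):
    """Return (top-k predictions, self-match score) for one query."""
    index = {node: i for i, node in enumerate(node_ids)}
    query_vec = embeddings[index[query_id]]
    # query - trait + opposite_trait, using trait centroids
    target = (query_vec
              - trait_centroid(embeddings, node_ids, trait_taxa)
              + trait_centroid(embeddings, node_ids, opposite_taxa))
    sims = cosine_to_all(target, embeddings)
    self_match = float(sims[index[query_id]])
    predictions = []
    for i in np.argsort(-sims):  # most similar first
        node = node_ids[i]
        # Assumption: skip the query itself and nodes outside the allowed prefixes.
        if node == query_id or not node.startswith(ALLOWED_PREFIXES):
            continue
        predictions.append((node, float(sims[i])))
        if len(predictions) == k:
            break
    return predictions, self_match


# Toy data (hypothetical node IDs and vectors).
node_ids = ["NCBITaxon:1", "NCBITaxon:2", "NCBITaxon:3",
            "NCBITaxon:4", "ph_acidic", "other:x"]
embeddings = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0],
                       [0.1, 0.9], [0.5, 0.5], [0.0, 1.0]])
aerobes = ["NCBITaxon:1", "NCBITaxon:2"]
anaerobes = ["NCBITaxon:3", "NCBITaxon:4"]

predictions, self_match = run_analogy(
    "NCBITaxon:1", aerobes, anaerobes, embeddings, node_ids, k=3)
for node, score in predictions:
    print(node, round(score, 3), score > self_match)
```

For the full embedding matrix, normalizing the rows once and reusing them across queries would avoid recomputing the norms for every analogy.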
Output files:
- analogy_reasoning_results.csv: Complete results for all queries
- high_quality_matches_detailed.csv: Predictions above self-match threshold
- high_quality_matches_summary.json: Summary of high-quality matches
- analysis_stats.json: General summary statistics
- analogy_reasoning_analysis.png/pdf: Main visualization plots
- high_quality_matches_analysis.png/pdf: High-quality matches specific plots
Each result contains:
- query_taxon: Original taxon ID (e.g., NCBITaxon:562)
- trait_type: Type of trait (oxygen, salinity, ph_opt, temp_opt)
- trait_value: Current trait value (e.g., aerobe, high, low)
- opposite_trait: The opposite trait used in the analogy
- rank: Ranking of this prediction (1-10)
- predicted_taxon: The predicted node from analogy reasoning (NCBITaxon:, strain:, ph_, or nacl_)
- similarity_score: Cosine similarity between the predicted node and the analogy result vector (cosine similarity ranges from -1 to 1; higher is more similar)
- self_match_score: Similarity between the original query taxon and the analogy result vector (taxon - phenotype1 + phenotype2)
- above_self_match: Boolean indicating whether the prediction exceeds the self-match score
Query: NCBITaxon:562 (E. coli) with oxygen:aerobe
Analogy: E. coli - aerobe + anaerobe = ?
Top predictions might include:
- NCBITaxon:1234 (similarity: 0.89)
- strain:bacdive_5678 (similarity: 0.85)
- NCBITaxon:9012 (similarity: 0.82)
High-Quality Matches are those with similarity scores above or statistically close to the self-match score:
- Above self-match: The prediction is more similar to the analogy result vector than the original query taxon is (indicating a strong analogy)
- Statistically close: Within 95% of self-match score (indicating reasonable analogy)
Self-Match Score: Represents how similar the original query taxon is to the analogy result vector (query - phenotype1 + phenotype2). This serves as a baseline - predictions should ideally be more similar to the analogy result than the original query is.
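The two quality criteria can be expressed as a small classifier. This is a sketch under one stated assumption: "within 95% of self-match score" is read here as a similarity of at least 0.95 times the self-match score, for a positive self-match score. The function name `classify_prediction` and the `tolerance` parameter are hypothetical.

```python
def classify_prediction(similarity_score, self_match_score, tolerance=0.95):
    """Label a prediction relative to the self-match baseline.

    Assumes a positive self_match_score; `tolerance` is the assumed
    fraction of the self-match score that still counts as "close".
    """
    if similarity_score > self_match_score:
        return "above_self_match"
    if similarity_score >= tolerance * self_match_score:
        return "close_to_self_match"
    return "below_threshold"


# Hypothetical scores, for illustration only.
for sim in (0.90, 0.78, 0.50):
    print(sim, classify_prediction(sim, self_match_score=0.80))
```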
Node Type Distribution:
- NCBITaxon: and strain:: Biological organisms with similar properties
- ph_*: pH-related concept nodes (e.g., ph_acidic, ph_alkaline)
- nacl_*: Salinity-related concept nodes (e.g., nacl_high, nacl_tolerance)
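A node type distribution like the one above can be tallied from the identifier prefixes. The sketch below assumes that node types are distinguished purely by these four prefixes; the helper names `node_type` and `node_type_distribution` are hypothetical, and the example identifiers are the illustrative ones used in this README.

```python
from collections import Counter

NODE_PREFIXES = ("NCBITaxon:", "strain:", "ph_", "nacl_")


def node_type(node_id):
    """Return the matching prefix for a node ID, or "other" if none matches."""
    for prefix in NODE_PREFIXES:
        if node_id.startswith(prefix):
            return prefix
    return "other"


def node_type_distribution(predicted_nodes):
    """Count predictions per node type."""
    return Counter(node_type(n) for n in predicted_nodes)


example_predictions = [
    "NCBITaxon:1234", "strain:bacdive_5678", "NCBITaxon:9012",
    "ph_acidic", "nacl_high",
]
distribution = node_type_distribution(example_predictions)
print(dict(distribution))
```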
Lower quality predictions may indicate:
- Insufficient training data for that trait
- Complex biological relationships not captured in embeddings
- Need for more sophisticated analogy methods
The framework can be extended to:
- Include more physical traits (pH range, temperature range, etc.)
- Test different analogy formulations
- Validate predictions against known biology
- Compare with other embedding methods
Requirements:
- pandas
- numpy
- scikit-learn
- matplotlib
- seaborn

Performance notes:
- ~1.55M embeddings loaded
- Analysis covers taxa with known traits across 4 physical preference types
- Runtime: ~10-30 minutes depending on data size
- Memory usage: ~4-8GB for full embedding matrix

Future directions:
- Validation: Compare predictions against experimental data
- Multi-trait analogies: E.g., aerobe + high_temp - mesophile + anaerobe
- Semantic evaluation: Assess biological meaningfulness of predictions
- Comparison studies: Test against other knowledge graph embedding methods