This repository contains scripts and tools for training and evaluating a Sparse Autoencoder (SAE) that extracts genomic features from biological sequence data.
GOSAE (Genome Ocean Sparse Autoencoder) is designed to:
- Extract meaningful features from genomic sequences
- Interpret the functional significance of these features
- Support the GenomeOcean platform for biological data analysis
The repository includes scripts for:
- Removing duplicate sequences (scripts/remove_duplicates.py)
- Selecting representative sequences per species (scripts/select_one_per_species.py)
- Splitting data into training and validation sets (scripts/split_train_val.py)
- Remove duplicates from your FASTA files:
  python scripts/remove_duplicates.py input.fasta deduplicated.fasta
- Select one sequence per species (optional):
  python scripts/select_one_per_species.py deduplicated.fasta representative.fasta
- Split into training and validation sets:
  python scripts/split_train_val.py deduplicated.fasta train.fasta val.fasta --ratio 0.7
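
For readers who want a feel for what the deduplication and splitting steps do, here is a minimal sketch in plain Python. It is not the code in scripts/; the function names (parse_fasta, remove_duplicates, split_train_val) are hypothetical, and it rests on a few assumptions: duplicates are records with identical sequences (compared case-insensitively), the first occurrence is kept, and --ratio is the fraction of records that goes to the training set. It uses only the standard library to stay self-contained; the actual scripts may rely on BioPython and behave differently.

```python
import random
from typing import Iterable, Iterator, List, Tuple

Record = Tuple[str, str]  # (header without ">", sequence)


def parse_fasta(lines: Iterable[str]) -> Iterator[Record]:
    """Yield (header, sequence) pairs from FASTA-formatted lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def remove_duplicates(records: Iterable[Record]) -> List[Record]:
    """Keep the first record for each distinct sequence (assumed rule)."""
    seen = set()
    unique = []
    for header, seq in records:
        key = seq.upper()  # assumption: case does not matter
        if key not in seen:
            seen.add(key)
            unique.append((header, seq))
    return unique


def split_train_val(records: Iterable[Record], ratio: float = 0.7, seed: int = 0):
    """Shuffle reproducibly, then put the first `ratio` share in training."""
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * ratio)
    return shuffled[:cut], shuffled[cut:]


# Small in-memory example; seq2 repeats seq1's sequence in lowercase.
example = """>seq1
ACGT
ACGT
>seq2
acgtacgt
>seq3
GGCC
""".splitlines()

unique = remove_duplicates(parse_fasta(example))
train, val = split_train_val(unique, ratio=0.5)
```

Fixing the seed keeps the split reproducible between runs, which matters when validation results are compared across training experiments.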
[Training instructions to be added]
[Evaluation instructions to be added]
- Python 3.6+
- BioPython
- [Other dependencies]
If you use this code in your research, please cite: [Citation information to be added]