Skip to content
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension


Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
12 changes: 6 additions & 6 deletions .github/workflows/ci.yml
Original file line number Diff line number Diff line change
Expand Up @@ -32,7 +32,7 @@ jobs:
- uses: actions/checkout@v4
- uses: actions/setup-python@v5
with:
python-version: '3.12'
python-version: '3.14'
- name: Install Ruff
run: |
pip install ruff
Expand All @@ -50,7 +50,7 @@ jobs:
- uses: actions/checkout@v4
- uses: actions/setup-python@v5
with:
python-version: '3.12'
python-version: '3.14'
- name: Install Sphinx
run: |
pip install -r docs/requirements.txt
Expand All @@ -65,12 +65,12 @@ jobs:
strategy:
matrix:
os: [ubuntu-latest]
python-version: ['3.8', '3.9', '3.10', '3.11', '3.12']
python-version: ['3.9', '3.10', '3.11', '3.12', '3.13', '3.14']
include:
- os: macos-latest
python-version: '3.12'
python-version: '3.14'
- os: windows-latest
python-version: '3.12'
python-version: '3.14'

env:
JOB_ID: ${{ strategy.job-index }}
Expand Down Expand Up @@ -127,7 +127,7 @@ jobs:
strategy:
matrix:
os: [ubuntu-latest]
python-version: ['3.8', '3.9', '3.10', '3.11']
python-version: ['3.9', '3.10', '3.11']

steps:
- uses: actions/checkout@v4
Expand Down
2 changes: 2 additions & 0 deletions .gitignore
Original file line number Diff line number Diff line change
Expand Up @@ -103,10 +103,12 @@ ENV/

# development
.idea/
.DS_Store

# snps
output*/
resources/*
synthetic_23andme.txt.gz

# snps tests
tests/output/*
Expand Down
30 changes: 30 additions & 0 deletions .vscode/tasks.json
Original file line number Diff line number Diff line change
@@ -0,0 +1,30 @@
{
"version": "2.0.0",
"tasks": [
{
"label": "Run All Tests",
"type": "shell",
"command": "source .venv/bin/activate && python -m pytest tests README.rst",
"group": {
"kind": "test",
"isDefault": true
},
"presentation": {
"reveal": "always",
"panel": "dedicated"
},
"problemMatcher": []
},
{
"label": "Test README Examples",
"type": "shell",
"command": "source .venv/bin/activate && python -m doctest README.rst -v",
"group": "test",
"presentation": {
"reveal": "always",
"panel": "dedicated"
},
"problemMatcher": []
}
]
}
209 changes: 64 additions & 145 deletions README.rst
Original file line number Diff line number Diff line change
Expand Up @@ -4,7 +4,7 @@

snps
====
tools for reading, writing, merging, and remapping SNPs 🧬
tools for reading, writing, generating, merging, and remapping SNPs 🧬

``snps`` *strives to be an easy-to-use and accessible open-source library for working with
genotype data*
Expand All @@ -19,8 +19,9 @@ Input / Output
- Read and write VCF files (e.g., convert `23andMe <https://www.23andme.com>`_ to VCF)
- Merge raw data files from different DNA tests, identifying discrepant SNPs in the process
- Read data in a variety of formats (e.g., files, bytes, compressed with `gzip` or `zip`)
- Handle several variations of file types, validated via
`openSNP parsing analysis <https://github.com/apriha/snps/tree/main/analysis/parse-opensnp-files>`_
- Handle several variations of file types, historically validated using
data from `openSNP <https://opensnp.org>`_
- Generate synthetic genotype data for testing and examples

Build / Assembly Detection and Remapping
````````````````````````````````````````
Expand Down Expand Up @@ -68,7 +69,7 @@ Additionally, ``snps`` can read a variety of "generic" CSV and TSV files.

Dependencies
------------
``snps`` requires `Python <https://www.python.org>`_ 3.8+ and the following Python
``snps`` requires `Python <https://www.python.org>`_ 3.9+ and the following Python
packages:

- `numpy <http://www.numpy.org>`_
Expand All @@ -90,187 +91,105 @@ capability, ``snps`` can be installed with `ezancestry <https://github.com/arvke

Examples
--------
Download Example Data
`````````````````````
First, let's setup logging to get some helpful output:

>>> import logging, sys
>>> logger = logging.getLogger()
>>> logger.setLevel(logging.INFO)
>>> logger.addHandler(logging.StreamHandler(sys.stdout))

Now we're ready to download some example data from `openSNP <https://opensnp.org>`_:
To try these examples, first generate some sample data:

>>> from snps.resources import Resources
>>> r = Resources()
>>> paths = r.download_example_datasets()
Downloading resources/662.23andme.340.txt.gz
Downloading resources/662.ftdna-illumina.341.csv.gz
>>> paths = Resources().create_example_datasets()

Load Raw Data
`````````````
Load a `23andMe <https://www.23andme.com>`_ raw data file:
Load a Raw Data File
````````````````````
Load a raw data file exported from a DNA testing source (e.g.,
`23andMe <https://www.23andme.com>`_, `AncestryDNA <https://www.ancestry.com>`_,
`Family Tree DNA <https://www.familytreedna.com>`_):

>>> from snps import SNPs
>>> s = SNPs("resources/662.23andme.340.txt.gz")
>>> s = SNPs("resources/sample1.23andme.txt.gz")

``snps`` automatically detects the source format and `normalizes
<https://snps.readthedocs.io/en/stable/snps.html#snps.snps.SNPs.snps>`_ the data:

>>> s.source
'23andMe'
>>> s.count
991786
991767
>>> s.build
37
>>> s.assembly
'GRCh37'

The ``SNPs`` class accepts a path to a file or a bytes object. A ``Reader`` class attempts to
infer the data source and load the SNPs. The loaded SNPs are
`normalized <https://snps.readthedocs.io/en/stable/snps.html#snps.snps.SNPs.snps>`_ and
available via a ``pandas.DataFrame``:
The SNPs are available as a ``pandas.DataFrame``:

>>> df = s.snps
>>> df.columns.values
array(['chrom', 'pos', 'genotype'], dtype=object)
>>> df.index.name
'rsid'
>>> df.chrom.dtype.name
'object'
>>> df.pos.dtype.name
'uint32'
>>> df.genotype.dtype.name
'object'
>>> len(df)
991786

``snps`` also attempts to detect the build / assembly of the data:

>>> s.build
37
>>> s.build_detected
True
>>> s.assembly
'GRCh37'
991767

Merge Raw Data Files
````````````````````
The dataset consists of raw data files from two different DNA testing sources - let's combine
these files. Specifically, we'll update the ``SNPs`` object with SNPs from a
`Family Tree DNA <https://www.familytreedna.com>`_ file.

>>> merge_results = s.merge([SNPs("resources/662.ftdna-illumina.341.csv.gz")])
Merging SNPs('662.ftdna-illumina.341.csv.gz')
SNPs('662.ftdna-illumina.341.csv.gz') has Build 36; remapping to Build 37
Downloading resources/NCBI36_GRCh37.tar.gz
27 SNP positions were discrepant; keeping original positions
151 SNP genotypes were discrepant; marking those as null
>>> s.source
'23andMe, FTDNA'
>>> s.count
1006960
>>> s.build
37
>>> s.build_detected
True
Combine SNPs from multiple files (e.g., combine data from different testing companies):

If the SNPs being merged have a build that differs from the destination build, the SNPs to merge
will be remapped automatically. After this example merge, the build is still detected, since the
build was detected for all ``SNPs`` objects that were merged.
>>> results = s.merge([SNPs("resources/sample2.ftdna.csv.gz")])
>>> s.count
1006949

As the data gets added, it's compared to the existing data, and SNP position and genotype
discrepancies are identified. (The discrepancy thresholds can be tuned via parameters.) These
discrepant SNPs are available for inspection after the merge via properties of the ``SNPs`` object.
SNPs are compared during the merge. Position and genotype discrepancies are identified and
can be inspected via properties of the ``SNPs`` object:

>>> len(s.discrepant_merge_positions)
27
>>> len(s.discrepant_merge_genotypes)
151

Additionally, any non-called / null genotypes will be updated during the merge, if the file
being merged has a called genotype for the SNP.

Moreover, ``merge`` takes a ``chrom`` parameter - this enables merging of only SNPs associated
with the specified chromosome (e.g., "Y" or "MT").

Finally, ``merge`` returns a list of ``dict``, where each ``dict`` has information corresponding
to the results of each merge (e.g., SNPs in common).

>>> sorted(list(merge_results[0].keys()))
['common_rsids', 'discrepant_genotype_rsids', 'discrepant_position_rsids', 'merged']
>>> merge_results[0]["merged"]
True
>>> len(merge_results[0]["common_rsids"])
692918
156

Remap SNPs
``````````
Now, let's remap the merged SNPs to change the assembly / build:
Convert SNPs between genome assemblies (Build 36/NCBI36, Build 37/GRCh37, Build 38/GRCh38):

>>> s.snps.loc["rs3094315"].pos
752566
>>> chromosomes_remapped, chromosomes_not_remapped = s.remap(38)
Downloading resources/GRCh37_GRCh38.tar.gz
>>> s.build
38
>>> s.assembly
'GRCh38'
>>> s.snps.loc["rs3094315"].pos
817186

SNPs can be remapped between Build 36 (``NCBI36``), Build 37 (``GRCh37``), and Build 38
(``GRCh38``).

Save SNPs
`````````
Ok, so far we've merged the SNPs from two files (ensuring the same build in the process and
identifying discrepancies along the way). Then, we remapped the SNPs to Build 38. Now, let's save
the merged and remapped dataset consisting of 1M+ SNPs to a tab-separated values (TSV) file:

>>> saved_snps = s.to_tsv("out.txt")
Saving output/out.txt
>>> print(saved_snps)
output/out.txt

Moreover, let's get the reference sequences for this assembly and save the SNPs as a VCF file:

>>> saved_snps = s.to_vcf("out.vcf")
Downloading resources/fasta/GRCh38/Homo_sapiens.GRCh38.dna.chromosome.1.fa.gz
Downloading resources/fasta/GRCh38/Homo_sapiens.GRCh38.dna.chromosome.2.fa.gz
Downloading resources/fasta/GRCh38/Homo_sapiens.GRCh38.dna.chromosome.3.fa.gz
Downloading resources/fasta/GRCh38/Homo_sapiens.GRCh38.dna.chromosome.4.fa.gz
Downloading resources/fasta/GRCh38/Homo_sapiens.GRCh38.dna.chromosome.5.fa.gz
Downloading resources/fasta/GRCh38/Homo_sapiens.GRCh38.dna.chromosome.6.fa.gz
Downloading resources/fasta/GRCh38/Homo_sapiens.GRCh38.dna.chromosome.7.fa.gz
Downloading resources/fasta/GRCh38/Homo_sapiens.GRCh38.dna.chromosome.8.fa.gz
Downloading resources/fasta/GRCh38/Homo_sapiens.GRCh38.dna.chromosome.9.fa.gz
Downloading resources/fasta/GRCh38/Homo_sapiens.GRCh38.dna.chromosome.10.fa.gz
Downloading resources/fasta/GRCh38/Homo_sapiens.GRCh38.dna.chromosome.11.fa.gz
Downloading resources/fasta/GRCh38/Homo_sapiens.GRCh38.dna.chromosome.12.fa.gz
Downloading resources/fasta/GRCh38/Homo_sapiens.GRCh38.dna.chromosome.13.fa.gz
Downloading resources/fasta/GRCh38/Homo_sapiens.GRCh38.dna.chromosome.14.fa.gz
Downloading resources/fasta/GRCh38/Homo_sapiens.GRCh38.dna.chromosome.15.fa.gz
Downloading resources/fasta/GRCh38/Homo_sapiens.GRCh38.dna.chromosome.16.fa.gz
Downloading resources/fasta/GRCh38/Homo_sapiens.GRCh38.dna.chromosome.17.fa.gz
Downloading resources/fasta/GRCh38/Homo_sapiens.GRCh38.dna.chromosome.18.fa.gz
Downloading resources/fasta/GRCh38/Homo_sapiens.GRCh38.dna.chromosome.19.fa.gz
Downloading resources/fasta/GRCh38/Homo_sapiens.GRCh38.dna.chromosome.20.fa.gz
Downloading resources/fasta/GRCh38/Homo_sapiens.GRCh38.dna.chromosome.21.fa.gz
Downloading resources/fasta/GRCh38/Homo_sapiens.GRCh38.dna.chromosome.22.fa.gz
Downloading resources/fasta/GRCh38/Homo_sapiens.GRCh38.dna.chromosome.X.fa.gz
Downloading resources/fasta/GRCh38/Homo_sapiens.GRCh38.dna.chromosome.Y.fa.gz
Downloading resources/fasta/GRCh38/Homo_sapiens.GRCh38.dna.chromosome.MT.fa.gz
Saving output/out.vcf
1 SNP positions were found to be discrepant when saving VCF

When saving a VCF, if any SNPs have positions outside of the reference sequence, they are marked
as discrepant and are available via a property of the ``SNPs`` object.

All `output files <https://snps.readthedocs.io/en/stable/output_files.html>`_ are saved to the
output directory.
Save SNPs to common file formats:

>>> _ = s.to_tsv("output.txt")
>>> _ = s.to_csv("output.csv")

To save as VCF, ``snps`` automatically downloads the required reference sequences for the
assembly. This ensures the REF alleles in the VCF are accurate:

>>> _ = s.to_vcf("output.vcf") # doctest: +SKIP

All output files are saved to the `output directory
<https://snps.readthedocs.io/en/stable/output_files.html>`_.

Generate Synthetic Data
```````````````````````
Generate synthetic genotype data for testing, examples, or demonstrations:

>>> from snps.io import SyntheticSNPGenerator
>>> gen = SyntheticSNPGenerator(build=37, seed=123)
>>> gen.save_as_23andme("synthetic_23andme.txt.gz", num_snps=10000)
'synthetic_23andme.txt.gz'

The generator supports multiple output formats (23andMe, AncestryDNA, FTDNA) and
automatically injects build-specific marker SNPs to ensure accurate build detection.

Documentation
-------------
Documentation is available `here <https://snps.readthedocs.io/>`_.

Acknowledgements
----------------
Thanks to Mike Agostino, Padma Reddy, Kevin Arvai, `openSNP <https://opensnp.org>`_,
`Open Humans <https://www.openhumans.org>`_, and `Sano Genetics <https://sanogenetics.com>`_.
Thanks to Mike Agostino, Padma Reddy, Kevin Arvai, `Open Humans <https://www.openhumans.org>`_,
and `Sano Genetics <https://sanogenetics.com>`_. This project was historically validated using
data from `openSNP <https://opensnp.org>`_.

``snps`` incorporates code and concepts generated with the assistance of
`OpenAI's <https://openai.com>`_ `ChatGPT <https://chatgpt.com>`_. ✨
``snps`` incorporates code and concepts generated with the assistance of various
generative AI tools (including but not limited to `ChatGPT <https://chatgpt.com>`_,
`Grok <https://grok.com>`_, and `Claude <https://claude.ai>`_). ✨

License
-------
Expand Down
15 changes: 0 additions & 15 deletions analysis/parse-opensnp-files/README.rst

This file was deleted.

Loading