apriha · apriha · Jan 23, 2026 · Jan 10, 2026 · Jan 23, 2026
diff --git a/.github/workflows/ci.yml b/.github/workflows/ci.yml
@@ -32,7 +32,7 @@ jobs:
     - uses: actions/checkout@v4
     - uses: actions/setup-python@v5
       with:
-        python-version: '3.12'
+        python-version: '3.14'
     - name: Install Ruff
       run: |
         pip install ruff
@@ -50,7 +50,7 @@ jobs:
     - uses: actions/checkout@v4
     - uses: actions/setup-python@v5
       with:
-        python-version: '3.12'
+        python-version: '3.14'
     - name: Install Sphinx
       run: |
         pip install -r docs/requirements.txt
@@ -65,12 +65,12 @@ jobs:
     strategy:
       matrix:
         os: [ubuntu-latest]
-        python-version: ['3.8', '3.9', '3.10', '3.11', '3.12']
+        python-version: ['3.9', '3.10', '3.11', '3.12', '3.13', '3.14']
         include:
           - os: macos-latest
-            python-version: '3.12'
+            python-version: '3.14'
           - os: windows-latest
-            python-version: '3.12'
+            python-version: '3.14'
 
     env:
       JOB_ID: ${{ strategy.job-index }}
@@ -127,7 +127,7 @@ jobs:
     strategy:
       matrix:
         os: [ubuntu-latest]
-        python-version: ['3.8', '3.9', '3.10', '3.11']
+        python-version: ['3.9', '3.10', '3.11']
 
     steps:
     - uses: actions/checkout@v4

diff --git a/.gitignore b/.gitignore
@@ -103,10 +103,12 @@ ENV/
 
 # development
 .idea/
+.DS_Store
 
 # snps
 output*/
 resources/*
+synthetic_23andme.txt.gz
 
 # snps tests
 tests/output/*

diff --git a/.vscode/tasks.json b/.vscode/tasks.json
@@ -0,0 +1,30 @@
+{
+    "version": "2.0.0",
+    "tasks": [
+        {
+            "label": "Run All Tests",
+            "type": "shell",
+            "command": "source .venv/bin/activate && python -m pytest tests README.rst",
+            "group": {
+                "kind": "test",
+                "isDefault": true
+            },
+            "presentation": {
+                "reveal": "always",
+                "panel": "dedicated"
+            },
+            "problemMatcher": []
+        },
+        {
+            "label": "Test README Examples",
+            "type": "shell",
+            "command": "source .venv/bin/activate && python -m doctest README.rst -v",
+            "group": "test",
+            "presentation": {
+                "reveal": "always",
+                "panel": "dedicated"
+            },
+            "problemMatcher": []
+        }
+    ]
+}
diff --git a/README.rst b/README.rst
@@ -4,7 +4,7 @@
 
 snps
 ====
-tools for reading, writing, merging, and remapping SNPs 🧬
+tools for reading, writing, generating, merging, and remapping SNPs 🧬
 
 ``snps`` *strives to be an easy-to-use and accessible open-source library for working with
 genotype data*
@@ -19,8 +19,9 @@ Input / Output
 - Read and write VCF files (e.g., convert `23andMe <https://www.23andme.com>`_ to VCF)
 - Merge raw data files from different DNA tests, identifying discrepant SNPs in the process
 - Read data in a variety of formats (e.g., files, bytes, compressed with `gzip` or `zip`)
-- Handle several variations of file types, validated via
-  `openSNP parsing analysis <https://github.com/apriha/snps/tree/main/analysis/parse-opensnp-files>`_
+- Handle several variations of file types, historically validated using
+  data from `openSNP <https://opensnp.org>`_
+- Generate synthetic genotype data for testing and examples
 
 Build / Assembly Detection and Remapping
 ````````````````````````````````````````
@@ -68,7 +69,7 @@ Additionally, ``snps`` can read a variety of "generic" CSV and TSV files.
 
 Dependencies
 ------------
-``snps`` requires `Python <https://www.python.org>`_ 3.8+ and the following Python
+``snps`` requires `Python <https://www.python.org>`_ 3.9+ and the following Python
 packages:
 
 - `numpy <http://www.numpy.org>`_
@@ -90,187 +91,105 @@ capability, ``snps`` can be installed with `ezancestry <https://github.com/arvke
 
 Examples
 --------
-Download Example Data
-`````````````````````
-First, let's setup logging to get some helpful output:
 
->>> import logging, sys
->>> logger = logging.getLogger()
->>> logger.setLevel(logging.INFO)
->>> logger.addHandler(logging.StreamHandler(sys.stdout))
-
-Now we're ready to download some example data from `openSNP <https://opensnp.org>`_:
+To try these examples, first generate some sample data:
 
 >>> from snps.resources import Resources
->>> r = Resources()
->>> paths = r.download_example_datasets()
-Downloading resources/662.23andme.340.txt.gz
-Downloading resources/662.ftdna-illumina.341.csv.gz
+>>> paths = Resources().create_example_datasets()
 
-Load Raw Data
-`````````````
-Load a `23andMe <https://www.23andme.com>`_ raw data file:
+Load a Raw Data File
+````````````````````
+Load a raw data file exported from a DNA testing source (e.g.,
+`23andMe <https://www.23andme.com>`_, `AncestryDNA <https://www.ancestry.com>`_,
+`Family Tree DNA <https://www.familytreedna.com>`_):
 
 >>> from snps import SNPs
->>> s = SNPs("resources/662.23andme.340.txt.gz")
+>>> s = SNPs("resources/sample1.23andme.txt.gz")
+
+``snps`` automatically detects the source format and `normalizes
+<https://snps.readthedocs.io/en/stable/snps.html#snps.snps.SNPs.snps>`_ the data:
+
 >>> s.source
 '23andMe'
 >>> s.count
-991786
+991767
+>>> s.build
+37
+>>> s.assembly
+'GRCh37'
 
-The ``SNPs`` class accepts a path to a file or a bytes object. A ``Reader`` class attempts to
-infer the data source and load the SNPs. The loaded SNPs are
-`normalized <https://snps.readthedocs.io/en/stable/snps.html#snps.snps.SNPs.snps>`_ and
-available via a ``pandas.DataFrame``:
+The SNPs are available as a ``pandas.DataFrame``:
 
 >>> df = s.snps
 >>> df.columns.values
 array(['chrom', 'pos', 'genotype'], dtype=object)
->>> df.index.name
-'rsid'
->>> df.chrom.dtype.name
-'object'
->>> df.pos.dtype.name
-'uint32'
->>> df.genotype.dtype.name
-'object'
 >>> len(df)
-991786
-
-``snps`` also attempts to detect the build / assembly of the data:
-
->>> s.build
-37
->>> s.build_detected
-True
->>> s.assembly
-'GRCh37'
+991767
 
 Merge Raw Data Files
 ````````````````````
-The dataset consists of raw data files from two different DNA testing sources - let's combine
-these files. Specifically, we'll update the ``SNPs`` object with SNPs from a
-`Family Tree DNA <https://www.familytreedna.com>`_ file.
-
->>> merge_results = s.merge([SNPs("resources/662.ftdna-illumina.341.csv.gz")])
-Merging SNPs('662.ftdna-illumina.341.csv.gz')
-SNPs('662.ftdna-illumina.341.csv.gz') has Build 36; remapping to Build 37
-Downloading resources/NCBI36_GRCh37.tar.gz
-27 SNP positions were discrepant; keeping original positions
-151 SNP genotypes were discrepant; marking those as null
->>> s.source
-'23andMe, FTDNA'
->>> s.count
-1006960
->>> s.build
-37
->>> s.build_detected
-True
+Combine SNPs from multiple files (e.g., combine data from different testing companies):
 
-If the SNPs being merged have a build that differs from the destination build, the SNPs to merge
-will be remapped automatically. After this example merge, the build is still detected, since the
-build was detected for all ``SNPs`` objects that were merged.
+>>> results = s.merge([SNPs("resources/sample2.ftdna.csv.gz")])
+>>> s.count
+1006949
 
-As the data gets added, it's compared to the existing data, and SNP position and genotype
-discrepancies are identified. (The discrepancy thresholds can be tuned via parameters.) These
-discrepant SNPs are available for inspection after the merge via properties of the ``SNPs`` object.
+SNPs are compared during the merge. Position and genotype discrepancies are identified and
+can be inspected via properties of the ``SNPs`` object:
 
+>>> len(s.discrepant_merge_positions)
+27
 >>> len(s.discrepant_merge_genotypes)
-151
-
-Additionally, any non-called / null genotypes will be updated during the merge, if the file
-being merged has a called genotype for the SNP.
-
-Moreover, ``merge`` takes a ``chrom`` parameter - this enables merging of only SNPs associated
-with the specified chromosome (e.g., "Y" or "MT").
-
-Finally, ``merge`` returns a list of ``dict``, where each ``dict`` has information corresponding
-to the results of each merge (e.g., SNPs in common).
-
->>> sorted(list(merge_results[0].keys()))
-['common_rsids', 'discrepant_genotype_rsids', 'discrepant_position_rsids', 'merged']
->>> merge_results[0]["merged"]
-True
->>> len(merge_results[0]["common_rsids"])
-692918
+156
 
 Remap SNPs
 ``````````
-Now, let's remap the merged SNPs to change the assembly / build:
+Convert SNPs between genome assemblies (Build 36/NCBI36, Build 37/GRCh37, Build 38/GRCh38):
 
->>> s.snps.loc["rs3094315"].pos
-752566
 >>> chromosomes_remapped, chromosomes_not_remapped = s.remap(38)
-Downloading resources/GRCh37_GRCh38.tar.gz
->>> s.build
-38
 >>> s.assembly
 'GRCh38'
->>> s.snps.loc["rs3094315"].pos
-817186
-
-SNPs can be remapped between Build 36 (``NCBI36``), Build 37 (``GRCh37``), and Build 38
-(``GRCh38``).
 
 Save SNPs
 `````````
-Ok, so far we've merged the SNPs from two files (ensuring the same build in the process and
-identifying discrepancies along the way). Then, we remapped the SNPs to Build 38. Now, let's save
-the merged and remapped dataset consisting of 1M+ SNPs to a tab-separated values (TSV) file:
-
->>> saved_snps = s.to_tsv("out.txt")
-Saving output/out.txt
->>> print(saved_snps)
-output/out.txt
-
-Moreover, let's get the reference sequences for this assembly and save the SNPs as a VCF file:
-
->>> saved_snps = s.to_vcf("out.vcf")
-Downloading resources/fasta/GRCh38/Homo_sapiens.GRCh38.dna.chromosome.1.fa.gz
-Downloading resources/fasta/GRCh38/Homo_sapiens.GRCh38.dna.chromosome.2.fa.gz
-Downloading resources/fasta/GRCh38/Homo_sapiens.GRCh38.dna.chromosome.3.fa.gz
-Downloading resources/fasta/GRCh38/Homo_sapiens.GRCh38.dna.chromosome.4.fa.gz
-Downloading resources/fasta/GRCh38/Homo_sapiens.GRCh38.dna.chromosome.5.fa.gz
-Downloading resources/fasta/GRCh38/Homo_sapiens.GRCh38.dna.chromosome.6.fa.gz
-Downloading resources/fasta/GRCh38/Homo_sapiens.GRCh38.dna.chromosome.7.fa.gz
-Downloading resources/fasta/GRCh38/Homo_sapiens.GRCh38.dna.chromosome.8.fa.gz
-Downloading resources/fasta/GRCh38/Homo_sapiens.GRCh38.dna.chromosome.9.fa.gz
-Downloading resources/fasta/GRCh38/Homo_sapiens.GRCh38.dna.chromosome.10.fa.gz
-Downloading resources/fasta/GRCh38/Homo_sapiens.GRCh38.dna.chromosome.11.fa.gz
-Downloading resources/fasta/GRCh38/Homo_sapiens.GRCh38.dna.chromosome.12.fa.gz
-Downloading resources/fasta/GRCh38/Homo_sapiens.GRCh38.dna.chromosome.13.fa.gz
-Downloading resources/fasta/GRCh38/Homo_sapiens.GRCh38.dna.chromosome.14.fa.gz
-Downloading resources/fasta/GRCh38/Homo_sapiens.GRCh38.dna.chromosome.15.fa.gz
-Downloading resources/fasta/GRCh38/Homo_sapiens.GRCh38.dna.chromosome.16.fa.gz
-Downloading resources/fasta/GRCh38/Homo_sapiens.GRCh38.dna.chromosome.17.fa.gz
-Downloading resources/fasta/GRCh38/Homo_sapiens.GRCh38.dna.chromosome.18.fa.gz
-Downloading resources/fasta/GRCh38/Homo_sapiens.GRCh38.dna.chromosome.19.fa.gz
-Downloading resources/fasta/GRCh38/Homo_sapiens.GRCh38.dna.chromosome.20.fa.gz
-Downloading resources/fasta/GRCh38/Homo_sapiens.GRCh38.dna.chromosome.21.fa.gz
-Downloading resources/fasta/GRCh38/Homo_sapiens.GRCh38.dna.chromosome.22.fa.gz
-Downloading resources/fasta/GRCh38/Homo_sapiens.GRCh38.dna.chromosome.X.fa.gz
-Downloading resources/fasta/GRCh38/Homo_sapiens.GRCh38.dna.chromosome.Y.fa.gz
-Downloading resources/fasta/GRCh38/Homo_sapiens.GRCh38.dna.chromosome.MT.fa.gz
-Saving output/out.vcf
-1 SNP positions were found to be discrepant when saving VCF
-
-When saving a VCF, if any SNPs have positions outside of the reference sequence, they are marked
-as discrepant and are available via a property of the ``SNPs`` object.
-
-All `output files <https://snps.readthedocs.io/en/stable/output_files.html>`_ are saved to the
-output directory.
+Save SNPs to common file formats:
+
+>>> _ = s.to_tsv("output.txt")
+>>> _ = s.to_csv("output.csv")
+
+To save as VCF, ``snps`` automatically downloads the required reference sequences for the
+assembly. This ensures the REF alleles in the VCF are accurate:
+
+>>> _ = s.to_vcf("output.vcf")  # doctest: +SKIP
+
+All output files are saved to the `output directory
+<https://snps.readthedocs.io/en/stable/output_files.html>`_.
+
+Generate Synthetic Data
+```````````````````````
+Generate synthetic genotype data for testing, examples, or demonstrations:
+
+>>> from snps.io import SyntheticSNPGenerator
+>>> gen = SyntheticSNPGenerator(build=37, seed=123)
+>>> gen.save_as_23andme("synthetic_23andme.txt.gz", num_snps=10000)
+'synthetic_23andme.txt.gz'
+
+The generator supports multiple output formats (23andMe, AncestryDNA, FTDNA) and
+automatically injects build-specific marker SNPs to ensure accurate build detection.
 
 Documentation
 -------------
 Documentation is available `here <https://snps.readthedocs.io/>`_.
 
 Acknowledgements
 ----------------
-Thanks to Mike Agostino, Padma Reddy, Kevin Arvai, `openSNP <https://opensnp.org>`_,
-`Open Humans <https://www.openhumans.org>`_, and `Sano Genetics <https://sanogenetics.com>`_.
+Thanks to Mike Agostino, Padma Reddy, Kevin Arvai, `Open Humans <https://www.openhumans.org>`_,
+and `Sano Genetics <https://sanogenetics.com>`_. This project was historically validated using
+data from `openSNP <https://opensnp.org>`_.
 
-``snps`` incorporates code and concepts generated with the assistance of
-`OpenAI's <https://openai.com>`_ `ChatGPT <https://chatgpt.com>`_. ✨
+``snps`` incorporates code and concepts generated with the assistance of various
+generative AI tools (including but not limited to `ChatGPT <https://chatgpt.com>`_,
+`Grok <https://grok.com>`_, and `Claude <https://claude.ai>`_). ✨
 
 License
 -------

diff --git a/analysis/parse-opensnp-files/README.rst b/analysis/parse-opensnp-files/README.rst