This repository provides Nextflow workflows for analyzing PacBio Kinnex long-read RNA sequencing data from 206 individuals in the Human Pangenome Reference Consortium (HPRC) Release 2.
The workflows cover the full pipeline from preprocessing raw reads to downstream QTL mapping, enabling reproducible analysis of expression and regulatory variation.
We use GENCODE Release 48 as the reference gene annotation, specifically the GTF file with comprehensive annotation of the primary assembly.
To ensure consistency with the reference genome, scaffold names are converted from GenBank accession numbers to UCSC-style names using the assembly report. For example, KI270706.1 is renamed to chr1_KI270706v1_random.
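The renaming step can be sketched as follows. This is a minimal illustration, not the repository's implementation: the function names `load_ucsc_names` and `rename_gtf_line` are hypothetical, and the code assumes the standard tab-separated NCBI assembly report layout, in which the GenBank accession is the fifth column and the UCSC-style name is the tenth.

```python
def load_ucsc_names(report_lines):
    """Build a GenBank accession -> UCSC-style name map from assembly report lines.

    Assumes the NCBI assembly report column order: GenBank-Accn is column 5
    and UCSC-style-name is column 10 (1-based).
    """
    mapping = {}
    for line in report_lines:
        if line.startswith("#") or not line.strip():
            continue  # skip comment/header lines and blank lines
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 10:
            continue  # not a full data row
        genbank, ucsc = fields[4], fields[9]
        if genbank != "na" and ucsc != "na":
            mapping[genbank] = ucsc
    return mapping


def rename_gtf_line(line, mapping):
    """Replace the sequence name (first GTF column) if it appears in the map."""
    if line.startswith("#"):
        return line  # GTF header/comment lines are passed through unchanged
    seqname, sep, rest = line.partition("\t")
    return mapping.get(seqname, seqname) + sep + rest
```

Sequence names that are not in the map, such as chr1, are left unchanged, so the same function can be applied to every line of the GTF.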
For each of the 206 HPRC R2 samples, PacBio Kinnex cDNA libraries were sequenced in two runs. Depending on the sequencing facility, the resulting FLNC BAM files were either provided separately (two per sample) or already concatenated into one (see the index file for download links).
For consistency, we concatenate the two runs into a single FLNC BAM per sample. After concatenation, we update the SM field in the @RG header line to match the sample ID (instead of default values like BioSample_1) and generate the corresponding PBI index.
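The SM update amounts to rewriting one tag in the @RG line of the SAM header text. The sketch below shows that rewrite on a header string; the function name `set_rg_sample` is hypothetical. One possible way to apply the result, which is an assumption and not stated by the workflow description, is to write the rewritten header back with `samtools reheader` and then build the PBI index with PacBio's `pbindex`.

```python
def set_rg_sample(header_text, sample_id):
    """Return SAM header text with SM set to sample_id on every @RG line."""
    out = []
    for line in header_text.splitlines():
        if line.startswith("@RG"):
            fields = line.split("\t")
            replaced = False
            for i, field in enumerate(fields):
                if field.startswith("SM:"):
                    fields[i] = f"SM:{sample_id}"  # overwrite e.g. SM:BioSample_1
                    replaced = True
            if not replaced:
                fields.append(f"SM:{sample_id}")  # add SM if the tag was missing
            line = "\t".join(fields)
        out.append(line)
    return "".join(l + "\n" for l in out)
```

All other header lines and all other @RG tags are passed through unchanged, in their original order.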
To align FLNC reads, each BAM file is converted to FASTQ format. We add the sample ID as a prefix to each read name to ensure uniqueness after pooling across samples. Each FASTQ file is then aligned to the reference genome using minimap2 with GENCODE v48 annotations in BED12 format.
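The read-name prefixing can be illustrated with a small generator over FASTQ lines. This is a sketch under stated assumptions: the function name `prefix_read_names` is hypothetical, the underscore separator is an arbitrary choice for illustration, and the input is assumed to use four-line FASTQ records.

```python
def prefix_read_names(fastq_lines, sample_id, sep="_"):
    """Yield FASTQ lines with sample_id prepended to each read name.

    Assumes four-line records, so only every fourth line (the header) is
    modified. Quality lines that happen to start with '@' are left alone.
    """
    for i, line in enumerate(fastq_lines):
        if i % 4 == 0:
            if not line.startswith("@"):
                raise ValueError(f"expected FASTQ header at line {i + 1}")
            yield "@" + sample_id + sep + line[1:]
        else:
            yield line
```

Relying on the record position rather than on the leading `@` matters because `@` is also a valid quality character.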
Finally, the 206 aligned BAM files are merged into a single BAM file for unified transcript model construction across samples.
A unified transcript model across samples is built using the merged BAM file as input to IsoQuant. Because this step is computationally intensive, the merged BAM is split into 25 chromosome-level BAM files. For chromosome 14, only reads within positions 1–104,474,600 are used for model construction, because read depth in the IGH region is extremely high. The samples are lymphoblastoid cell lines (LCLs), which are derived from B cells that strongly express IGH genes. The current model construction algorithm cannot handle this region, so it is excluded at this step only; all reads on chromosome 14 are still included later during read assignment. IsoQuant is run on each chromosome, and the resulting extended GTFs are combined into a single extended GTF for downstream analysis.
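The per-chromosome split can be described as a list of region strings in the `chrom:start-end` form accepted by tools such as `samtools view`. The sketch below is illustrative only: the function name `model_construction_regions` is hypothetical, and it assumes the 25 chromosomes are chr1 to chr22, chrX, chrY, and chrM, which the description above does not spell out.

```python
# End of the chromosome 14 interval used for model construction (from the text).
CHR14_MODEL_END = 104_474_600


def model_construction_regions():
    """Map each chromosome to the region string used for model construction.

    Assumption: the 25 chromosomes are chr1-chr22, chrX, chrY, and chrM.
    """
    chroms = [f"chr{i}" for i in range(1, 23)] + ["chrX", "chrY", "chrM"]
    regions = {}
    for chrom in chroms:
        if chrom == "chr14":
            # Restrict chr14 to the interval before the high-depth IGH region.
            regions[chrom] = f"chr14:1-{CHR14_MODEL_END}"
        else:
            regions[chrom] = chrom
    return regions
```

Each value could then be passed as the region argument when extracting reads from the indexed merged BAM, one output BAM per key.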
The extended GTF is then used as the unified transcript model. Each per-sample BAM file is processed with IsoQuant to assign reads to known and novel transcripts for quantification.