Eun Ji Kim edited this page Jan 2, 2018 · 3 revisions

  1. Running PORT
    a. Recommended workflow
    b. Usage
    c. Monitoring a PORT run
  2. PORT Output Files
    a. Normalized feature count spreadsheet
    b. Normalized SAM/BAM
    c. Normalized Coverage/Junction
    d. Normalization Factors Statistics

1. Running PORT

PORT has two parts: PART1 and PART2.

  • In PART1, PORT preprocesses the data using the normalization factors.
  • In PART2, PORT performs normalization and quantification.

a. Recommended workflow

First, run run_normalization with no pipeline option.
If you do not provide any pipeline options, PORT will pause when all steps in PART1 complete.

Next, check the expected number of reads and the highly expressed features (exons, introns, and genes).
At this point you can review the expected number of reads after normalization, the list of highly expressed exons and introns (for Exon-Intron-Junction Normalization), and the list of highly expressed genes (for Gene Normalization). Samples that lower the normalized read depth can be removed at this point.

Finally, run run_normalization with the -part2 option.
Use the -cutoff_highexp option if you choose to filter the high expressers.
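
The two invocations in this workflow can be sketched as a dry run. This is a minimal illustration, not part of PORT: the file names, the study path, and the helper `port_cmd` are hypothetical, and the cutoff value 5 is only a placeholder. The helper prints each command instead of executing it, so you can inspect the command lines before running them.

```shell
# Hypothetical inputs; replace with your own files and paths.
SAMPLE_DIRS=sample_dirs.txt
LOC=/path/to/STUDY/READS
UNALIGNED=unaligned.txt
ALIGNED=Aligned.sam
CFG=study.cfg

# port_cmd [extra options]: print the run_normalization command line.
# Remove the leading "echo" to actually run PORT.
port_cmd() {
    echo run_normalization --sample_dirs "$SAMPLE_DIRS" --loc "$LOC" \
        --unaligned "$UNALIGNED" --alignedfilename "$ALIGNED" --cfg "$CFG" "$@"
}

# Step 1: no pipeline option, so PORT pauses after PART1.
port_cmd

# Step 2 is manual: review the expected read counts and high expressers.

# Step 3: resume at PART2, optionally filtering high expressers.
port_cmd -part2 -cutoff_highexp 5
```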

b. Usage

Run PORT from the command line as follows:

run_normalization --sample_dirs <file of sample_dirs> --loc <s> \
--unaligned <file of fa/fq files> --alignedfilename <s> --cfg <cfg file> [options]

Arguments and Options:

[Required Arguments]

--sample_dirs <file of sample dirs>
Create a text file listing the names of the sample directories (without path).
example:

sample1
sample2
sample3
sample4
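
A file like the one above could be generated rather than typed by hand. The sketch below assumes that every immediate subdirectory of the --loc directory is a sample directory; the function name `list_sample_dirs` is hypothetical and not part of PORT.

```shell
# list_sample_dirs LOC: print the name (without path) of every
# immediate subdirectory of LOC, one per line.
list_sample_dirs() {
    loc=$1
    for d in "$loc"/*/; do
        # Skip the unexpanded pattern when LOC has no subdirectories.
        [ -d "$d" ] || continue
        basename "$d"
    done
}

# Usage (hypothetical path):
#   list_sample_dirs /path/to/STUDY/READS > sample_dirs.txt
```

Review the resulting file before passing it to --sample_dirs, since any non-sample directory in that location would also be listed.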

--loc <s>
Provide the full path of the directory that contains the sample directories. See Input directory structure for details.

--unaligned <file of fa/fq files>
Create a text file listing the full paths of the input fasta/fastq files.
example:

/path/to/Sample_1.fwd.fq/fa
/path/to/Sample_1.rev.fq/fa
/path/to/Sample_2.fwd.fq/fa
/path/to/Sample_2.rev.fq/fa
/path/to/Sample_3.fwd.fq/fa
/path/to/Sample_3.rev.fq/fa
/path/to/Sample_4.fwd.fq/fa
/path/to/Sample_4.rev.fq/fa
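
The list of input files can also be built with find. This is a sketch under assumptions: it treats files ending in .fq or .fa as the input reads (adjust the patterns if your files use other extensions), and the function name `list_read_files` is hypothetical.

```shell
# list_read_files DIR: print the full path of every .fq/.fa file
# under DIR, sorted so that files of the same sample stay together.
list_read_files() {
    # Resolve DIR to an absolute path so the output contains full paths.
    reads_dir=$(cd "$1" && pwd) || return 1
    find "$reads_dir" -type f \( -name '*.fq' -o -name '*.fa' \) | LC_ALL=C sort
}

# Usage (hypothetical path):
#   list_read_files /path/to/fastq > unaligned.txt
```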

--alignedfilename <s>
Name of the aligned file. PORT expects all alignment files to have the same name across samples. See Alignment files for details.

--cfg <cfg file>
A configuration file for PORT. See Configuration file for details.

[Pipeline Options]

-part1_part2
Use this option if you want to run steps in PART1 and PART2 without pausing.

-part2
Use this option to resume the pipeline at PART2 after running PORT without any pipeline options.

-alt_out <s>
Use this option to redirect the normalized data to an alternate output directory (full path).
(Default: /path/to/studydir/NORMALIZED_DATA/)

-h
Get a summary of the arguments and options

-v
Print the version of PORT

[Resume Options]

You may not change the [Normalization Parameters] when using a resume option.

-resume
Use this if you have a job that crashed or stopped. This resumes a job that has already been initialized or partially run, starting after the last completed step. It may repeat the last completed step if necessary.

-resume_at "<step>"
Use this if you have a job that crashed or stopped. This resumes a job at "<step>". Make sure the full step name (found in the log file) is given in quotes (e.g. -resume_at "1 "STUDY.get_total_num_reads"").

[Normalization Parameters]

-cutoff_highexp <n>
This option sets the cutoff value (%) used to identify highly expressed genes/exons/introns.
The script will consider individual features (genes/exons/introns) accounting for more than n% of the total reads as high expressers. The pipeline will normalize the reads mapping to those features separately.
(Default = 100; with the default cutoff, features (genes/exons/introns) expressed at >3% will be reported, but no reads will be removed)
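
To make the percentage rule concrete, the sketch below flags features whose share of the total reads exceeds a cutoff. It illustrates the arithmetic only and is not PORT's implementation: the two-column, tab-separated input (feature name, read count) and the function name `flag_high_expressers` are assumptions made for this example.

```shell
# flag_high_expressers COUNTS_FILE CUTOFF
# COUNTS_FILE: hypothetical tab-separated file, "feature<TAB>count" per line.
# Prints "feature<TAB>percent" for features above CUTOFF percent of all reads.
flag_high_expressers() {
    counts_file=$1
    cutoff=$2
    # The file is read twice: first pass sums the counts,
    # second pass compares each feature's share with the cutoff.
    awk -F '\t' -v cutoff="$cutoff" '
        NR == FNR { total += $2; next }
        total > 0 && 100 * $2 / total > cutoff {
            printf "%s\t%.2f\n", $1, 100 * $2 / total
        }
    ' "$counts_file" "$counts_file"
}

# Usage (hypothetical file):
#   flag_high_expressers gene_counts.tsv 5
```

With a cutoff of 100, no feature can exceed the threshold, which is consistent with the default removing no reads.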

-cutoff_lowexp <n>
This option sets the cutoff count used to identify low expressers in the final spreadsheets (exon, intron, junction and gene). The script will remove features whose sum of counts is less than the set value from all samples.
(Default = 0; with the default cutoff, features with sum of counts = 0 will be removed from all samples)

[Exon-Intron-Junction normalization only]

-novel_off
Set this if you DO NOT want to use the inferred exons/introns for quantification
(By default, the pipeline will use inferred exons/introns)

-min <n>
Minimum size of an inferred exon for the get_novel_exons.pl script (Default = 10)

-max <n>
Maximum size of an inferred exon for the get_novel_exons.pl script (Default = 1200)

-depthExon <n>
The pipeline splits filtered sam files into reads mapping to 1,2,3,...,n exons and downsamples each separately (Default = 20)

-depthIntron <n>
The pipeline splits filtered sam files into reads mapping to 1,2,3,...,n introns and downsamples each separately (Default = 10)

-flanking_region <n>
This is used to generate the list of flanking regions. By default, 5000 bp upstream and downstream of each gene will be considered a flanking region.

c. Monitoring a PORT run

In addition to the STDOUT and STDERR files in STUDY/logs, PORT will create a log file called STUDY/logs/STUDY.run_normalization.log, which you can use to check the status.

All PORT job names begin with the unique STUDY name (e.g. "STUDY.get_total_num_reads"). You can stop/kill a PORT run by killing jobs with the names that begin with STUDY (e.g. bkill -J "STUDY*").
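
A quick way to check progress is to print the end of the run log. The sketch below assumes the study directory is named after the STUDY name, as in the log path above; the function name `port_log_tail` is hypothetical.

```shell
# port_log_tail STUDY_DIR [N]: print the last N lines (default 20)
# of STUDY_DIR/logs/<STUDY>.run_normalization.log.
port_log_tail() {
    study_dir=$1
    lines=${2:-20}
    # The log file is prefixed with the study name, taken here
    # from the last component of the study directory path.
    study_name=$(basename "$study_dir")
    tail -n "$lines" "$study_dir/logs/$study_name.run_normalization.log"
}

# Usage (hypothetical path):
#   port_log_tail /path/to/STUDY 50
```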

2. PORT Output Files

Once PORT finishes, your directory structure will look like this if you ran both Gene and Exon-Intron-Junction Normalization. (If your data are stranded, each FINAL_SAM directory will have sense and antisense directories inside.) You may use the '-alt_out' option to output NORMALIZED_DATA to an alternate location.

STUDY
├── READS
│   ├── sample1
│   ├── sample2
│   ├── sample3
│   └── sample4
├── STATS
│   ├── EXON_INTRON_JUNCTION
│   └── GENE
├── NORMALIZED_DATA
│   ├── EXON_INTRON_JUNCTION
│   │   ├── COV
│   │   ├── FINAL_SAM
│   │   │   ├── exonmappers
│   │   │   ├── intergenicmappers
│   │   │   ├── intronmappers
│   │   │   └── exon_inconsistent 
│   │   ├── JUNCTIONS
│   │   └── SPREADSHEETS
│   └── GENE
│       ├── COV
│       ├── FINAL_SAM
│       ├── JUNCTIONS
│       └── SPREADSHEETS
├── logs
└── shell_scripts
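
After a run, the layout above can be checked with a short script. This is a sketch that assumes both normalization types were run and that NORMALIZED_DATA is in its default location inside the study directory (not redirected with -alt_out); the function name `check_port_output` is hypothetical.

```shell
# check_port_output STUDY_DIR: report any expected output directory
# that is missing. Returns 0 if all are present, 1 otherwise.
check_port_output() {
    study=$1
    missing=0
    for norm in EXON_INTRON_JUNCTION GENE; do
        for sub in COV FINAL_SAM JUNCTIONS SPREADSHEETS; do
            dir="$study/NORMALIZED_DATA/$norm/$sub"
            if [ ! -d "$dir" ]; then
                echo "missing: $dir"
                missing=1
            fi
        done
    done
    return "$missing"
}

# Usage (hypothetical path):
#   check_port_output /path/to/STUDY && echo "all output directories present"
```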

a. Normalized feature count spreadsheet

[Exon-Intron-Junction]

  • PORT outputs feature (exon, intron, junction) count spreadsheets to STUDY/NORMALIZED_DATA/EXON_INTRON_JUNCTION/SPREADSHEETS.
  • The FINAL MIN spreadsheet has counts from Unique reads, and the FINAL MAX spreadsheet has counts from Unique+Non-Unique reads.
  • If the data are stranded, you will find sense and antisense spreadsheets for exon and intron counts.

[Gene]

  • PORT outputs gene count spreadsheets to STUDY/NORMALIZED_DATA/GENE/SPREADSHEETS.
  • The FINAL MIN spreadsheet has counts from Unique reads that map to only one gene, and the FINAL MAX spreadsheet has counts from Unique+Non-Unique reads/multiple gene mappers.
  • If the data are stranded, you will find sense and antisense spreadsheets.

b. Normalized SAM/BAM

[Exon-Intron-Junction]

  • PORT outputs normalized exonmappers, intronmappers, intergenicmappers and exon_inconsistent (exonmappers with inconsistent junctions) files to the STUDY/NORMALIZED_DATA/EXON_INTRON_JUNCTION/FINAL_SAM directory.
  • If the data are stranded, you will find sense and antisense exonmappers and intronmappers.

[Gene]

  • PORT outputs normalized genemappers to the STUDY/NORMALIZED_DATA/GENE/FINAL_SAM directory.
  • If the data are stranded, you will find sense and antisense genemappers.

c. Normalized Coverage/Junction

Coverage (STUDY/NORMALIZED_DATA/*/COV) and Junctions (STUDY/NORMALIZED_DATA/*/JUNCTIONS) files are generated from uniquely merged SAM files for each sample and can be used for data visualization.

d. Normalization Factors Statistics

  • The STUDY/STATS/*_normalization_factors.txt file provides summary statistics of the normalization factors used.
  • Percentage of reads mapping to each chromosome (STUDY/STATS/percent_reads_chr*txt) and percentage of highly expressed features (STUDY/STATS/*/percent_high_expresser_*.txt) are also provided for both normalization types.