
Error with AnchoredGTFDl  #102

@PelFritz


Hi,
I am using `AnchoredGTFDl` to extract promoter sequences. However, when I call `load_all()`, I get the following error:
`ValueError: all input arrays must have the same shape`

My assumption is that some genes lie too close to the end of a chromosome, so the extracted window is shorter than the requested 1500 bp (1000 upstream + 500 downstream). My code is below:

```python
import numpy as np
from kipoiseq.dataloaders import AnchoredGTFDl

fasta_path = 'Zea_mays.Zm-B73-REFERENCE-NAM-5.0.dna.toplevel.fa'
gtf_path = 'Zea_mays.Zm-B73-REFERENCE-NAM-5.0.51.gtf'

dl = AnchoredGTFDl(gtf_path, fasta_path, num_upstream=1000, num_downstream=500,
                   gtf_filter='gene_biotype == "protein_coding"')

data = dl.load_all()
```
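For reference, the same error message can be reproduced with NumPy alone, which is consistent with the assumption that a few extracted windows are shorter than the rest. This is only a sketch: it assumes the samples are combined with `np.stack` somewhere inside `load_all()`, which I have not verified, and the array shapes and the `try_stack` helper are made up for illustration.

```python
import numpy as np


def try_stack(arrays):
    """Return the stacked array, or the error message if stacking fails."""
    try:
        return np.stack(arrays)
    except ValueError as err:
        return str(err)


# Hypothetical shapes: a full 1500 bp window and a truncated one.
full = np.zeros((1500, 4))
truncated = np.zeros((1200, 4))

ok = try_stack([full, full])            # same shapes: stacking works
failed = try_stack([full, truncated])   # different shapes: ValueError

print(ok.shape)
print(failed)
```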

As a workaround I used the code below, but I don't know whether this is okay or whether there is a function that checks the extracted sequence length automatically.

```python
sequence = []
gene_id = []
for seq in dl:
    # Keep only windows with the full requested length (1000 + 500 = 1500)
    if len(seq['inputs']) == 1500:
        gene_id.append(seq['metadata']['gene_id'])
        sequence.append(seq['inputs'])

sequence = np.array(sequence)
print(sequence.shape)
```
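The same filtering can be wrapped in a small helper that also records which genes were dropped, so the skipped entries can be inspected afterwards. This is a sketch, not part of kipoiseq: `collect_fixed_length` is a hypothetical name, and it only assumes that each sample is a dict with `'inputs'` and `'metadata'['gene_id']`, as in the loop above. The toy samples stand in for the real dataloader.

```python
import numpy as np


def collect_fixed_length(samples, expected_len):
    """Split samples into full-length inputs and the IDs of skipped genes."""
    kept_inputs, kept_ids, skipped_ids = [], [], []
    for sample in samples:
        gene = sample['metadata']['gene_id']
        if len(sample['inputs']) == expected_len:
            kept_inputs.append(sample['inputs'])
            kept_ids.append(gene)
        else:
            skipped_ids.append(gene)
    return np.array(kept_inputs), kept_ids, skipped_ids


# Toy stand-in for the dataloader: geneB has a truncated window.
toy_samples = [
    {'inputs': np.zeros((6, 4)), 'metadata': {'gene_id': 'geneA'}},
    {'inputs': np.zeros((4, 4)), 'metadata': {'gene_id': 'geneB'}},
    {'inputs': np.zeros((6, 4)), 'metadata': {'gene_id': 'geneC'}},
]

sequences, kept_ids, skipped_ids = collect_fixed_length(toy_samples, expected_len=6)
print(sequences.shape)
print(skipped_ids)
```

With the real dataloader, the call would presumably look like `collect_fixed_length(dl, expected_len=1500)`.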

Is there a way to enforce that all extracted sequences have the same length?
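One alternative to dropping short windows would be to pad them to the expected length. The sketch below assumes each input is a 2-D one-hot array of shape `(length, 4)` and pads with all-zero rows at the end; `pad_to_length` is a hypothetical helper, and whether padding belongs at the start or the end depends on the strand and on which side was truncated, so this may not be the right choice for every gene.

```python
import numpy as np


def pad_to_length(one_hot, expected_len):
    """Pad a (length, 4) array with zero rows at the end up to expected_len."""
    missing = expected_len - one_hot.shape[0]
    if missing < 0:
        raise ValueError("input is longer than expected_len")
    # Pad only along the first (sequence) axis; leave the 4 channels untouched.
    return np.pad(one_hot, ((0, missing), (0, 0)), mode='constant')


# Toy example: a 4-row input padded to 6 rows.
short = np.ones((4, 4))
padded = pad_to_length(short, expected_len=6)
print(padded.shape)
```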
