-
Notifications
You must be signed in to change notification settings - Fork 7
Open
Description
Hi @hanyangii,
Could you share more details about the preparation of the DMR.csv? I tried to fine-tune based on a dorado called BAM file, but the fine-tune doesn't seem to work:
methylbert preprocess_finetune --methylcaller dorado --input_file with_cell_type.labeled.bam --f_dmr $BED --f_ref $REF --split_ratio 0.8 --n_cores 23 -o methylbert.test/
MethylBERT v2.0.2
Could not find any statistics to sort DMRs
Number of DMRs to extract sequence reads: 220
Collecting reads from .bam files: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:01<00:00, 1.53s/it]Fine-tuning data generated: name flag ref_name ref_pos map_quality cigar ... CT dna_seq methyl_seq dmr_ctype dmr_label ctype
0 23afdc5b-12f3-4e16-8ba1-bb8c42a21a51 0 chr6 50851183 60 713M ... prostate_epithelial AAA AAC ACG CGT GTT TTT TTC TCA CAA AAG AGG GG... 2202222222222222222222222222222222222220222222... T 170 NA
[1 rows x 46 columns]
Total sequences per cell type
ctype
NA 1
Name: count, dtype: int64
Traceback (most recent call last):
File "/home/4470655/.conda/envs/methylbert/bin/methylbert", line 8, in <module>
sys.exit(main())
^^^^^^
File "/home/4470655/.conda/envs/methylbert/lib/python3.11/site-packages/methylbert/cli.py", line 313, in main
run_preprocess(args)
File "/home/4470655/.conda/envs/methylbert/lib/python3.11/site-packages/methylbert/cli.py", line 238, in run_preprocess
finetune_data_generate(f_dmr=args.f_dmr,
File "/home/4470655/.conda/envs/methylbert/lib/python3.11/site-packages/methylbert/data/finetune_data_generate.py", line 384, in finetune_data_generate
train_files, test_files = train_test_split(
^^^^^^^^^^^^^^^^^
File "/home/4470655/.conda/envs/methylbert/lib/python3.11/site-packages/sklearn/utils/_param_validation.py", line 213, in wrapper
return func(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^
File "/home/4470655/.conda/envs/methylbert/lib/python3.11/site-packages/sklearn/model_selection/_split.py", line 2780, in train_test_split
n_train, n_test = _validate_shuffle_split(
^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/4470655/.conda/envs/methylbert/lib/python3.11/site-packages/sklearn/model_selection/_split.py", line 2410, in _validate_shuffle_split
raise ValueError(
ValueError: With n_samples=1, test_size=0.19999999999999996 and train_size=None, the resulting train set will be empty. Adjust any of the aforementioned parameters.
Collecting reads from .bam files: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:01<00:00, 1.66s/it]
Here is my $BED. It was a DMR based on tumor normal paired comparison. Does ctype means cell type as your example shows?
head $BED
chr start end ctype
chr1 828727 829648 T
chr1 19361106 19361591 T
chr1 37734952 37735434 T
chr1 74543166 74544241 T
chr1 87151545 87152349 T
chr1 91736103 91737407 T
chr1 106081022 106081218 T
chr1 121019929 121021641 T
chr1 156845668 156845999 T
Thanks!
Metadata
Metadata
Assignees
Labels
No labels