Specification about DMR.csv #20

@Yijun-Tian

Hi @hanyangii,
Could you share more details about how DMR.csv should be prepared? I tried to generate fine-tuning data from a BAM file whose methylation was called with dorado, but the preprocessing step fails:

 methylbert preprocess_finetune --methylcaller dorado --input_file  with_cell_type.labeled.bam --f_dmr $BED --f_ref $REF --split_ratio 0.8 --n_cores 23 -o methylbert.test/
MethylBERT v2.0.2
Could not find any statistics to sort DMRs
Number of DMRs to extract sequence reads: 220
Collecting reads from .bam files: 100%|██████████| 1/1 [00:01<00:00,  1.53s/it]
Fine-tuning data generated:
                                   name flag ref_name   ref_pos map_quality cigar  ...                   CT                                            dna_seq                                         methyl_seq dmr_ctype dmr_label ctype
0  23afdc5b-12f3-4e16-8ba1-bb8c42a21a51    0     chr6  50851183          60  713M  ...  prostate_epithelial  AAA AAC ACG CGT GTT TTT TTC TCA CAA AAG AGG GG...  2202222222222222222222222222222222222220222222...         T       170    NA

[1 rows x 46 columns]
Total sequences per cell type
ctype
NA    1
Name: count, dtype: int64
Traceback (most recent call last):
  File "/home/4470655/.conda/envs/methylbert/bin/methylbert", line 8, in <module>
    sys.exit(main())
             ^^^^^^
  File "/home/4470655/.conda/envs/methylbert/lib/python3.11/site-packages/methylbert/cli.py", line 313, in main
    run_preprocess(args)
  File "/home/4470655/.conda/envs/methylbert/lib/python3.11/site-packages/methylbert/cli.py", line 238, in run_preprocess
    finetune_data_generate(f_dmr=args.f_dmr,
  File "/home/4470655/.conda/envs/methylbert/lib/python3.11/site-packages/methylbert/data/finetune_data_generate.py", line 384, in finetune_data_generate
    train_files, test_files = train_test_split(
                              ^^^^^^^^^^^^^^^^^
  File "/home/4470655/.conda/envs/methylbert/lib/python3.11/site-packages/sklearn/utils/_param_validation.py", line 213, in wrapper
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/home/4470655/.conda/envs/methylbert/lib/python3.11/site-packages/sklearn/model_selection/_split.py", line 2780, in train_test_split
    n_train, n_test = _validate_shuffle_split(
                      ^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/4470655/.conda/envs/methylbert/lib/python3.11/site-packages/sklearn/model_selection/_split.py", line 2410, in _validate_shuffle_split
    raise ValueError(
ValueError: With n_samples=1, test_size=0.19999999999999996 and train_size=None, the resulting train set will be empty. Adjust any of the aforementioned parameters.
Collecting reads from .bam files: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:01<00:00,  1.66s/it]
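
As far as I can tell, the ValueError itself is just arithmetic: with `--split_ratio 0.8` the test fraction is `1 - 0.8`, and a single sample cannot be divided into a non-empty train set and test set. I am not sure whether `n_samples=1` counts reads or input files here. Below is a minimal sketch of that arithmetic, assuming the test count is the ceiling of `test_size * n_samples` and the train set gets the remainder. `split_sizes` is a hypothetical helper I wrote for illustration, not a function from MethylBERT or scikit-learn:

```python
import math


def split_sizes(n_samples, test_size):
    """Return (n_train, n_test) for a fractional test_size.

    Simplified sketch: the test set takes ceil(test_size * n_samples)
    samples and the train set takes whatever is left.
    """
    n_test = math.ceil(test_size * n_samples)
    n_train = n_samples - n_test
    return n_train, n_test


# 1 - 0.8 evaluates to 0.19999999999999996, the test_size shown in the error.
test_size = 1 - 0.8

for n_samples in (1, 5, 220):
    n_train, n_test = split_sizes(n_samples, test_size)
    status = "train set is empty" if n_train == 0 else "ok"
    print(f"n_samples={n_samples}: n_train={n_train}, n_test={n_test} ({status})")
```

So the split can only succeed once more than one sample reaches it, which makes me think the real problem is upstream: only one read was collected for 220 DMRs.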

Here is my $BED. The DMRs come from a paired tumor-normal comparison. Does ctype mean cell type, as in your example?

head $BED
chr     start   end     ctype
chr1    828727  829648  T
chr1    19361106        19361591        T
chr1    37734952        37735434        T
chr1    74543166        74544241        T
chr1    87151545        87152349        T
chr1    91736103        91737407        T
chr1    106081022       106081218       T
chr1    121019929       121021641       T
chr1    156845668       156845999       T

Thanks!
