Onboard Project fails when extract is empty #978

@TaperChipmunk32

Description

The most recent version raises an error when onboarding a project whose extract is empty. It would be reasonable to log a notification and stop in this case instead of continuing. I was extracting this project precisely because I suspected it was empty and shouldn't be activated for drafting. I can't think of a case where we'd need to do further processing on a truly empty project. Example below:

2026-03-24 11:01:54,855 - silnlp.common.onboard_project - INFO - Onboarding main project 'AAI_2026_03_24'
INFO:silnlp.common.onboard_project:Onboarding main project 'AAI_2026_03_24'
2026-03-24 11:01:54,936 - silnlp.common.onboard_project - INFO - Extracted corpus '/root/M/MT/scripture/anp-AAI_2026_03_24.txt' already exists. Skipping corpus extraction.
2026-03-24 11:01:54,998 - silnlp.common.onboard_project - INFO - Collecting verse counts for project 'AAI_2026_03_24'
  0%|                                                                          | 0/1 [00:00<?, ?it/s]2026-03-24 11:01:55,129 - silnlp.common.collect_verse_counts - INFO - Found verse counts for /root/M/MT/scripture/anp-AAI_2026_03_24.txt
100%|████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00, 1390.68it/s]


2026-03-24 11:01:54,736 - silnlp.common.collect_verse_counts - INFO - No files smaller than 41KB were found.
2026-03-24 11:01:54,736 - silnlp.common.collect_verse_counts - INFO - All files were found.
2026-03-24 11:01:54,738 - silnlp.common.onboard_project - INFO - Running Wildebeest analysis on /root/M/MT/scripture/anp-AAI_2026_03_24.txt.
2026-03-24 11:01:55,601 - silnlp.common.onboard_project - INFO - Calculating tokenization stats for project 'AAI_2026_03_24'
2026-03-24 11:01:59,445 - silnlp.nmt.config - INFO - Preprocessing anp-AAI_2026_03_24 -> anp-AAI_2026_03_24
2026-03-24 11:01:59,685 - silnlp.nmt.config - INFO - train size: 0, val size: 0, test size: 0,
2026-03-24 11:01:59,686 - silnlp.nmt.config - WARNING - Glosses could not be included. No source or target language matches any of the supported gloss language codes: fr, en, id, es, pt.
2026-03-24 11:01:59,686 - silnlp.nmt.config - INFO - terms train size: 0
2026-03-24 11:01:59,686 - silnlp.nmt.config - INFO - Calculating tokenization statistics
Traceback (most recent call last):
  File "/root/miniconda3/envs/silnlp/lib/python3.10/runpy.py", line 196, in _run_module_as_main      
    return _run_code(code, main_globals, None,
  File "/root/miniconda3/envs/silnlp/lib/python3.10/runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "/root/silnlp/silnlp/common/onboard_project.py", line 775, in <module>
    main()
  File "/root/silnlp/silnlp/common/onboard_project.py", line 771, in main
    onboarding_request.process_onboarding_request()
  File "/root/silnlp/silnlp/common/onboard_project.py", line 439, in process_onboarding_request      
    self.main_project.calculate_tokenization_stats(
  File "/root/silnlp/silnlp/common/onboard_project.py", line 209, in calculate_tokenization_stats    
    config.preprocess(stats=True, force_align=True)
  File "/root/silnlp/silnlp/nmt/config.py", line 261, in preprocess
    self._build_corpora(tokenizer, stats, force_align)
  File "/root/silnlp/silnlp/nmt/config.py", line 326, in _build_corpora
    self._calculate_tokenization_stats()
  File "/root/silnlp/silnlp/nmt/config.py", line 406, in _calculate_tokenization_stats
    tokens_verse_df = distribution_df(top_header, src_tokens_per_verse, trg_tokens_per_verse)        
  File "/root/silnlp/silnlp/nmt/config.py", line 389, in distribution_df
    min(src_data),
ValueError: min() arg is an empty sequence
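
A rough sketch of the kind of guard suggested above, not the actual silnlp code. It assumes the extract is a plain-text file in which an empty project yields no lines or only blank lines; `count_nonempty_lines`, `should_continue_onboarding`, and `summarize_token_counts` are hypothetical helper names, not existing functions in `onboard_project.py` or `config.py`.

```python
import logging
from pathlib import Path
from typing import Dict, Optional, Sequence

LOGGER = logging.getLogger(__name__)


def count_nonempty_lines(corpus_path: Path) -> int:
    """Count the lines in an extracted corpus that contain non-whitespace text."""
    with corpus_path.open("r", encoding="utf-8") as corpus_file:
        return sum(1 for line in corpus_file if line.strip())


def should_continue_onboarding(corpus_path: Path) -> bool:
    """Hypothetical early-exit check: log a notification and stop if the extract is empty."""
    if count_nonempty_lines(corpus_path) == 0:
        LOGGER.warning("Extract '%s' is empty. Skipping further processing.", corpus_path)
        return False
    return True


def summarize_token_counts(token_counts: Sequence[int]) -> Optional[Dict[str, float]]:
    """Return min/max/mean of the counts, or None when there is nothing to summarize."""
    if len(token_counts) == 0:
        # min() and max() raise ValueError on an empty sequence,
        # which is the failure shown in the traceback above.
        return None
    return {
        "min": min(token_counts),
        "max": max(token_counts),
        "mean": sum(token_counts) / len(token_counts),
    }
```

The first check would stop before Wildebeest analysis and tokenization stats are attempted; the second is only a fallback so that the stats step fails softly if an empty sequence still reaches it.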

Metadata

Labels

bug (Something isn't working), onboarding (Serval onboarding)

Projects

Status

🏗 In progress
