Implement the filters/heuristics from:
https://arxiv.org/pdf/2405.01582
| filter_name | heuristic | description |
| --- | --- | --- |
| has_first_letter_caps | First character capitalized | Check if the first character of each line is capitalized. |
| no_all_caps | All characters capitalized | Check if all the characters in the line are capitalized. |
| word_repetetion_ratio_ge_0_2 | Word repetition ratio | Check if the ratio of repeated words in the line is > 0.2. |
| digit_punctuation_ratio_0_25 | Digit/punctuation to word ratio | Identify lines where the ratio of digits/punctuation to words is > 0.25. |
| no_special_characters | Has { character | Curly braces are common in code. Since we are curating text-only content, this filter identifies text that might contain code. |
| terminal_punctuation | Has terminal punctuation | Check if the line ends with one of these punctuation marks: '.', '!', '?', '"'. |
| stop_word_match_2 | Has 2 stop words | Check if the line contains at least 2 stop words among 'the', 'be', 'to', 'of', 'and', 'that', 'have', 'with'. |
| javascript_flag | Contains special phrases | Check if the text contains the phrases 'javascript' or 'lorem ipsum' to identify docs with code. |
| token_count_ge_3 | Token count | Check if the token count is > 3. |
| word_count_3_256 | Word count range | Check if the line word count is > 3 and < 256. |
| has_object | Has object | Check if the parser identifies an object in the line. |
| has_noun | Has noun | Check if there is at least one noun in the line. |
| has_determiner | Has determiner | Check if the line contains a determiner, based on results from the text parser. |
| text_complexity_c1 | Text complexity | For this we use a setup similar to the CAT filter (Radenovic et al., 2023), where lines with at least one edge from an object are flagged as positive. |
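
As a starting point, the purely lexical heuristics could be sketched in Python roughly as follows. This is a minimal sketch, not the paper's implementation. It makes several assumptions: every function returns `True` when the line looks like clean text (so the "flagging" filters are inverted), words are whitespace-separated, stop words are counted by occurrence, and the repetition ratio is taken as `1 - unique_words / total_words`. The parser-based filters (`has_object`, `has_noun`, `has_determiner`, `text_complexity_c1`) and the tokenizer-based `token_count_ge_3` are left out because they depend on tooling the table does not name.

```python
import re
import string

STOP_WORDS = {"the", "be", "to", "of", "and", "that", "have", "with"}
TERMINAL_PUNCTUATION = (".", "!", "?", '"')


def has_first_letter_caps(line: str) -> bool:
    """True if the first non-space character is upper case."""
    stripped = line.lstrip()
    return bool(stripped) and stripped[0].isupper()


def no_all_caps(line: str) -> bool:
    """True unless every cased character in the line is upper case."""
    return not line.isupper()


def word_repetetion_ratio_ge_0_2(line: str) -> bool:
    """True if the share of repeated words is at most 0.2 (assumed definition)."""
    words = line.lower().split()
    if not words:
        return False
    return 1 - len(set(words)) / len(words) <= 0.2


def digit_punctuation_ratio_0_25(line: str) -> bool:
    """True if digit/punctuation characters per word is at most 0.25."""
    words = line.split()
    if not words:
        return False
    noisy = sum(ch.isdigit() or ch in string.punctuation for ch in line)
    return noisy / len(words) <= 0.25


def no_special_characters(line: str) -> bool:
    """True if the line has no opening curly brace."""
    return "{" not in line


def terminal_punctuation(line: str) -> bool:
    """True if the line ends with '.', '!', '?' or a double quote."""
    return line.rstrip().endswith(TERMINAL_PUNCTUATION)


def stop_word_match_2(line: str) -> bool:
    """True if at least two stop-word occurrences are present."""
    words = re.findall(r"[a-z']+", line.lower())
    return sum(word in STOP_WORDS for word in words) >= 2


def javascript_flag(line: str) -> bool:
    """True if neither 'javascript' nor 'lorem ipsum' appears."""
    lowered = line.lower()
    return "javascript" not in lowered and "lorem ipsum" not in lowered


def word_count_3_256(line: str) -> bool:
    """True if the whitespace word count is > 3 and < 256."""
    return 3 < len(line.split()) < 256
```

Keeping one function per `filter_name` makes it straightforward to pair each filter with a weight later. The polarity choice (`True` means "keep") is a design decision for this sketch and should be checked against the paper before use.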
Combine into the scores given:
$$\text{score}_\text{line} = \frac{\sum_{i=1}^{F} w_i I_i(\text{line})}{\sum_{i=1}^{F} w_i}$$
$$\text{score}_\text{doc} = \frac{\sum_{\text{line}=1}^{n} tc_\text{line} \, \text{score}_\text{line}}{\sum_{\text{line}=1}^{n} tc_\text{line}}$$
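
The two formulas could be combined in code along these lines. This is a hedged sketch that reads $I_i$ as a 0/1 indicator for filter $i$, $w_i$ as its weight, and $tc_\text{line}$ as the token count of a line. The names `score_line` and `score_doc` are hypothetical, and whitespace splitting stands in for whatever tokenizer the paper uses.

```python
from typing import Callable, Sequence

LineFilter = Callable[[str], bool]


def score_line(line: str, filters: Sequence[LineFilter],
               weights: Sequence[float]) -> float:
    """Weighted fraction of filters that fire on a single line."""
    if len(filters) != len(weights):
        raise ValueError("filters and weights must have the same length")
    total_weight = sum(weights)
    if total_weight == 0:
        raise ValueError("weights must not sum to zero")
    weighted_hits = sum(w * float(f(line)) for f, w in zip(filters, weights))
    return weighted_hits / total_weight


def score_doc(text: str, filters: Sequence[LineFilter],
              weights: Sequence[float]) -> float:
    """Token-count-weighted average of the line scores in a document."""
    # Assumption: blank lines are skipped and tokens are whitespace-separated.
    lines = [ln for ln in text.splitlines() if ln.strip()]
    token_counts = [len(ln.split()) for ln in lines]
    total_tokens = sum(token_counts)
    if total_tokens == 0:
        return 0.0
    weighted = sum(tc * score_line(ln, filters, weights)
                   for ln, tc in zip(lines, token_counts))
    return weighted / total_tokens
```

For example, with two toy filters (ends with a period, starts with a capital) weighted 1 and 3, the line `hello there.` scores 1/4, since only the first filter fires.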