Text quality heuristics #65

Implement the filters/heuristics from:
https://arxiv.org/pdf/2405.01582

| filter_name | heuristic | description |
| --- | --- | --- |
| has_first_letter_caps | First character capitalized | Check if the first character of each line is capitalized. |
| no_all_caps | All characters capitalized | Check if all the characters in the line are capitalized. |
| word_repetetion_ratio_ge_0_2 | Word repetition ratio | Check if the ratio of repeated words in the line is > 0.2. |
| digit_punctuation_ratio_0_25 | Digit/punctuation to word ratio | Identify lines where the ratio of digits/punctuation to words is > 0.25. |
| no_special_characters | Has { character | Curly braces ("flower brackets") are common in code. Since we are curating text-only content, this filter identifies text that might contain code. |
| terminal_punctuation | Has terminal punctuation | Check if the line ends with one of these punctuation marks: `.`, `!`, `?`, `"`. |
| stop_word_match_2 | Has 2 stop words | Check if the line contains at least 2 stop words among `the`, `be`, `to`, `of`, `and`, `that`, `have`, `with`. |
| javascript_flag | Contains special phrases | Check if the text contains the phrases `javascript` or `lorem ipsum`, to identify docs with code. |
| token_count_ge_3 | Token count | Check if the token count is > 3. |
| word_count_3_256 | Word count range | Check if the line word count is > 3 and < 256. |
| has_object | Has object | Check if there is an object identified by the parser. |
| has_noun | Has noun | Check if there is at least one noun in the line. |
| has_determiner | Has determiner | Check if the line contains a determiner, based on results from the text parser. |
| text_complexity_c1 | Text complexity | Uses a setup similar to the CAT filter (Radenovic et al., 2023), where lines with at least one edge from an object are flagged as positive. |
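
As a starting point, the purely lexical rows of the table could be sketched in Python as below. This is a minimal sketch under several assumptions that the issue does not state: every flag is `True` when the line looks like clean text (so the ratio filters pass at or below their thresholds), words are split on whitespace, the repetition ratio is taken as `1 - unique_words / total_words`, and stop words are counted by occurrence. The helper names (`_words`, `line_flags`, and so on) are hypothetical. The parser-based filters (`has_object`, `has_noun`, `has_determiner`, `text_complexity_c1`) and `token_count_ge_3` need a parser and a tokenizer and are left out.

```python
import string

STOP_WORDS = {"the", "be", "to", "of", "and", "that", "have", "with"}
TERMINAL_PUNCTUATION = (".", "!", "?", '"')
SPECIAL_PHRASES = ("javascript", "lorem ipsum")


def _words(line: str) -> list[str]:
    # Assumption: whitespace tokenization, lowercased, edge punctuation stripped.
    stripped = (w.strip(string.punctuation).lower() for w in line.split())
    return [w for w in stripped if w]


def word_repetition_ratio(line: str) -> float:
    # Assumption: share of words that repeat an earlier word in the line.
    words = _words(line)
    return 1.0 - len(set(words)) / len(words) if words else 0.0


def digit_punctuation_ratio(line: str) -> float:
    # Assumption: digit and punctuation characters per whitespace-separated word.
    n_words = len(line.split())
    n_marks = sum(ch.isdigit() or ch in string.punctuation for ch in line)
    return n_marks / n_words if n_words else 0.0


def line_flags(line: str) -> dict[str, bool]:
    """Return one boolean per lexical filter; True means the line passes."""
    text = line.strip()
    n_stop = sum(w in STOP_WORDS for w in _words(line))
    return {
        "has_first_letter_caps": bool(text) and text[0].isupper(),
        # str.isupper() is True only when every cased character is uppercase.
        "no_all_caps": not text.isupper(),
        "word_repetetion_ratio_ge_0_2": word_repetition_ratio(line) <= 0.2,
        "digit_punctuation_ratio_0_25": digit_punctuation_ratio(line) <= 0.25,
        "no_special_characters": "{" not in line,
        "terminal_punctuation": text.endswith(TERMINAL_PUNCTUATION),
        "stop_word_match_2": n_stop >= 2,
        "javascript_flag": not any(p in line.lower() for p in SPECIAL_PHRASES),
        "word_count_3_256": 3 < len(line.split()) < 256,
    }
```

The filter keys are copied verbatim from the table so they can be matched against per-filter weights later. Whether a flag should mean "passes" or "is flagged" is a design choice that needs to be checked against the paper.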



Combine the filter results into the line-level and document-level scores given in the paper:
$$\text{score}_\text{line} = \frac{\sum_{i=1}^{F} w_i I_i(\text{line})}{\sum_{i=1}^{F} w_i}$$

$$\text{score}_\text{doc} = \frac{\sum_{\text{line}=1}^{n} tc_\text{line} \, \text{score}_\text{line}}{\sum_{\text{line}=1}^{n} tc_\text{line}}$$
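
The two formulas are weighted averages and might be implemented roughly as follows. This sketch assumes that $I_i$ is the 0/1 result of filter $i$, that $w_i$ is a per-filter weight supplied by the caller, and that $tc_\text{line}$ is the token count of a line. The issue does not define these symbols or give weight values, so the function names and the numbers in the usage example are hypothetical.

```python
def line_score(flags: dict[str, bool], weights: dict[str, float]) -> float:
    """Weighted fraction of filters whose indicator is 1 for this line."""
    total_weight = sum(weights.values())
    if total_weight == 0:
        return 0.0  # Assumption: no weights means no evidence, so score 0.
    return sum(w * flags[name] for name, w in weights.items()) / total_weight


def doc_score(line_scores: list[float], token_counts: list[int]) -> float:
    """Average of line scores, weighted by each line's token count."""
    total_tokens = sum(token_counts)
    if total_tokens == 0:
        return 0.0  # Assumption: an empty document scores 0.
    weighted = sum(tc * s for tc, s in zip(token_counts, line_scores))
    return weighted / total_tokens


# Hypothetical usage with made-up weights and token counts.
weights = {"terminal_punctuation": 1.0, "stop_word_match_2": 3.0}
flags = {"terminal_punctuation": True, "stop_word_match_2": False}
print(line_score(flags, weights))         # 0.25
print(doc_score([1.0, 0.5], [10, 30]))    # 0.625
```

Weighting by token count means long lines dominate the document score, so a few short low-quality lines barely move it.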

