Text quality heuristics #65

@jpcompartir

Implement the filters/heuristics from:
https://arxiv.org/pdf/2405.01582

| filter_name | heuristic | description |
| --- | --- | --- |
| `has_first_letter_caps` | First character capitalized | Check if the first character of each line is capitalized. |
| `no_all_caps` | All characters capitalized | Check if all the characters in the line are capitalized. |
| `word_repetetion_ratio_ge_0_2` | Word repetition ratio | Check if the ratio of repeated words in the line is > 0.2. |
| `digit_punctuation_ratio_0_25` | Digit/punctuation to word ratio | Identify lines where the ratio of digits/punctuation to words is > 0.25. |
| `no_special_characters` | Has `{` character | Curly braces are common in code; as we are curating for text-only content, this filter identifies text that might contain code. |
| `terminal_punctuation` | Has terminal punctuation | Check if the line ends with one of these punctuation marks: `.`, `!`, `?`, `"`. |
| `stop_word_match_2` | Has 2 stop words | Check if the line contains at least 2 stop words among 'the', 'be', 'to', 'of', 'and', 'that', 'have', 'with'. |
| `javascript_flag` | Contains special phrases | Check if the text contains the phrases 'javascript' or 'lorem ipsum' to identify docs with code. |
| `token_count_ge_3` | Token count | Check if the token count is > 3. |
| `word_count_3_256` | Word count range | Check if the line word count is > 3 and < 256. |
| `has_object` | Has object | Check if there is an object identified by the parser. |
| `has_noun` | Has noun | Check if there is at least one noun in the line. |
| `has_determiner` | Has determiner | Check if the line contains a determiner, based on results from the text parser. |
| `text_complexity_c1` | Text complexity | Uses a setup similar to the CAT filter (Radenovic et al., 2023), where lines with at least one edge from the object are flagged as positive. |
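
A rough starting point for the parser-free filters, as a sketch only. Assumptions not stated in the issue: each function returns `True` when the line looks like good text (so `no_all_caps` is `True` when the line is *not* all caps), words are split on whitespace, and `stop_word_match_2` counts occurrences rather than distinct stop words. The function names mirror the `filter_name` column; the `FILTERS` dict is a hypothetical convenience.

```python
import re

STOP_WORDS = {"the", "be", "to", "of", "and", "that", "have", "with"}
TERMINAL_PUNCTUATION = (".", "!", "?", '"')


def has_first_letter_caps(line: str) -> bool:
    """True if the first non-whitespace character is uppercase."""
    stripped = line.lstrip()
    return bool(stripped) and stripped[0].isupper()


def no_all_caps(line: str) -> bool:
    """True unless every alphabetic character in the line is uppercase."""
    letters = [c for c in line if c.isalpha()]
    return not (letters and all(c.isupper() for c in letters))


def no_special_characters(line: str) -> bool:
    """True if the line contains no '{', which often signals code."""
    return "{" not in line


def terminal_punctuation(line: str) -> bool:
    """True if the line ends with '.', '!', '?' or '"'."""
    return line.rstrip().endswith(TERMINAL_PUNCTUATION)


def stop_word_match_2(line: str) -> bool:
    """True if at least two stop-word occurrences appear in the line."""
    words = re.findall(r"[a-z']+", line.lower())
    return sum(w in STOP_WORDS for w in words) >= 2


def word_count_3_256(line: str) -> bool:
    """True if the whitespace-split word count is > 3 and < 256."""
    n_words = len(line.split())
    return 3 < n_words < 256


# Hypothetical registry so the filters can be applied in a loop.
FILTERS = {
    "has_first_letter_caps": has_first_letter_caps,
    "no_all_caps": no_all_caps,
    "no_special_characters": no_special_characters,
    "terminal_punctuation": terminal_punctuation,
    "stop_word_match_2": stop_word_match_2,
    "word_count_3_256": word_count_3_256,
}
```

The remaining filters (`has_object`, `has_noun`, `has_determiner`, `text_complexity_c1`) need a parser/POS tagger and are left out of this sketch.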

Combine into the scores given:
$$\text{score}_\text{line} = \frac{\sum_{i=1}^{F} w_i I_i(\text{line})}{\sum_{i=1}^{F} w_i}$$

$$\text{score}_\text{doc} = \frac{\sum_{\text{line}=1}^{n} tc_\text{line}\,\text{score}_\text{line}}{\sum_{\text{line}=1}^{n} tc_\text{line}}$$
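
A minimal sketch of the two formulas, to pin down the arithmetic. Assumptions: $I_i$ is a 0/1 indicator for filter $i$, $w_i$ is its weight, and $tc_\text{line}$ is the token count of the line (read from the notation, so worth checking against the paper). The names `line_score` and `doc_score` are hypothetical, and how tokens are counted is left to the caller.

```python
from typing import Sequence


def line_score(indicators: Sequence[bool], weights: Sequence[float]) -> float:
    """Weighted fraction of filters that fired for one line."""
    if len(indicators) != len(weights):
        raise ValueError("indicators and weights must have the same length")
    total_weight = sum(weights)
    if total_weight == 0:
        raise ValueError("weights must not sum to zero")
    return sum(w * float(i) for w, i in zip(weights, indicators)) / total_weight


def doc_score(line_scores: Sequence[float], token_counts: Sequence[int]) -> float:
    """Token-count-weighted average of the line scores for one document."""
    if len(line_scores) != len(token_counts):
        raise ValueError("line_scores and token_counts must have the same length")
    total_tokens = sum(token_counts)
    if total_tokens == 0:
        # Assumption: an empty document scores 0 rather than raising.
        return 0.0
    return sum(tc * s for tc, s in zip(token_counts, line_scores)) / total_tokens
```

For example, with weights `[1.0, 2.0, 1.0]` and indicators `[True, False, True]`, the line score is `(1 + 0 + 1) / 4 = 0.5`; two lines scoring `0.5` and `1.0` with 10 and 30 tokens give a document score of `(5 + 30) / 40 = 0.875`.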
