TextProc

TextProc is a Polars plugin that exposes standard text processing algorithms as Polars expressions, including Gopher repetition signals, MinHash calculation, fastText classification, and tokenization.

polars_textproc.repetition_signals(expr) applies the Gopher repetition signals to each text in the given expr (e.g. a dataframe column). Returns a struct containing top_1_gram_char_ratio, ... top_4_gram_char_ratio, dup_5_gram_char_ratio ... dup_10_gram_char_ratio. The underlying tokenization can be controlled using the tokenizer_pattern keyword argument, a regex which by default is r"\w+". Note that the pattern is compiled by the Rust regex crate, whose syntax differs from Python's re module (for example, it does not support look-around or backreferences).
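To make the top_n_gram_char_ratio fields concrete, here is a minimal pure-Python sketch of one common reading of the Gopher definition: the share of token characters covered by the most frequent n-gram. The function name top_ngram_char_ratio is hypothetical, Python's re stands in for the Rust regex crate, and the plugin's exact counting rules may differ.

```python
import re
from collections import Counter


def top_ngram_char_ratio(text, n, tokenizer_pattern=r"\w+"):
    """Illustrative only: not the plugin's implementation."""
    tokens = re.findall(tokenizer_pattern, text)
    total_chars = sum(len(t) for t in tokens)
    # All runs of n consecutive tokens (empty if the text is too short).
    ngrams = [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
    if not ngrams or total_chars == 0:
        return 0.0
    gram, count = Counter(ngrams).most_common(1)[0]
    # Assumption: every occurrence of the top n-gram counts in full,
    # even when occurrences overlap.
    return count * sum(len(t) for t in gram) / total_chars


print(top_ngram_char_ratio("a a a b", 1))  # 0.75
```

Because overlapping occurrences are counted separately in this sketch, the value can exceed 1 on highly repetitive text.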

polars_textproc.fasttext(expr, path, labels) applies the fastText model at path to each text in the given expr (e.g. a column). By default it returns a struct with the fields top_label, top_score, and total_score. The returned fields can be controlled with output_aggregate (default: True) and output_scores (default: False). With output_scores=True, the score for each supplied label is also returned, with the label as the struct field name. With output_aggregate=False, top_label, top_score, and total_score are not returned.
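The interaction of the two flags can be summarised with a small helper that lists the struct fields each combination would produce. This is only a sketch of the rules described above: fasttext_output_fields is a hypothetical name, it does not call the plugin, and the field order is an assumption.

```python
def fasttext_output_fields(labels, output_aggregate=True, output_scores=False):
    """Illustrative only: field names implied by the two output flags."""
    fields = []
    if output_aggregate:
        fields += ["top_label", "top_score", "total_score"]
    if output_scores:
        # One field per supplied label, named after the label itself.
        fields += list(labels)
    return fields


labels = ["__label__en", "__label__de"]  # hypothetical label names
print(fasttext_output_fields(labels))
print(fasttext_output_fields(labels, output_scores=True))
print(fasttext_output_fields(labels, output_aggregate=False, output_scores=True))
```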

polars_textproc.minhash(expr, tokenizer_pattern=r"\w+", seed=SEED, buckets=14, bsize=8, window=5) constructs a hex-encoded minhash signature of each text given by expr. It builds shingles of window consecutive tokens, with tokens extracted by tokenizer_pattern, and computes buckets * bsize minhashes over those shingles. If bsize>1, each group of bsize minhashes is itself hashed into a 128-bit bucket hash, and the result is returned as a hex string of buckets*32 characters. If bsize=1, the raw 64-bit minhashes are returned as a hex string of buckets*16 characters.
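The shingle, hash, and bucket steps can be sketched in pure Python to show where the two output lengths come from. Everything below is an assumption made for illustration: minhash_sketch and _hash64 are hypothetical names, keyed BLAKE2b stands in for whatever hash functions the plugin uses, and the default seed of 0 is a placeholder for SEED. The signatures it produces will not match the plugin's.

```python
import hashlib
import re


def _hash64(data, key):
    # Keyed BLAKE2b truncated to 64 bits; one key per hash function.
    digest = hashlib.blake2b(data, digest_size=8, key=key.to_bytes(8, "little")).digest()
    return int.from_bytes(digest, "little")


def minhash_sketch(text, tokenizer_pattern=r"\w+", seed=0, buckets=14, bsize=8, window=5):
    """Illustrative only: same shape as the description, different hashes."""
    tokens = re.findall(tokenizer_pattern, text)
    # Shingles of `window` consecutive tokens; short texts yield one shingle.
    n_shingles = max(len(tokens) - window + 1, 1)
    shingles = [" ".join(tokens[i:i + window]).encode("utf-8") for i in range(n_shingles)]
    # One minimum per hash function, buckets * bsize functions in total.
    minhashes = [min(_hash64(s, seed + i) for s in shingles) for i in range(buckets * bsize)]
    if bsize == 1:
        # Raw 64-bit minhashes: 16 hex characters each.
        return "".join(f"{m:016x}" for m in minhashes)
    parts = []
    for bucket in range(buckets):
        group = minhashes[bucket * bsize:(bucket + 1) * bsize]
        packed = b"".join(m.to_bytes(8, "little") for m in group)
        # 128-bit bucket hash: 32 hex characters each.
        parts.append(hashlib.blake2b(packed, digest_size=16).hexdigest())
    return "".join(parts)


text = "the quick brown fox jumps over the lazy dog"
print(len(minhash_sketch(text)))           # 448 = 14 * 32
print(len(minhash_sketch(text, bsize=1)))  # 224 = 14 * 16
```

Two texts that share a bucket hash agree on all bsize minhashes in that bucket, which is why the bucketed form is convenient for near-duplicate grouping by exact match on a bucket.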

polars_textproc.scrub(expr, patterns, replacement="REDACTED") replaces all matches of the given regex patterns with the replacement string. Overlapping matches are merged. Regexes use the Rust regex crate.
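The merging of overlapping matches can be illustrated with a short pure-Python sketch: collect the match spans of every pattern, merge spans that overlap, and emit one replacement per merged span. scrub_sketch is a hypothetical name, Python's re stands in for the Rust regex crate, and details such as the handling of adjacent or empty matches are assumptions.

```python
import re


def scrub_sketch(text, patterns, replacement="REDACTED"):
    """Illustrative only: not the plugin's implementation."""
    # Collect non-empty match spans from all patterns, ordered by start.
    spans = sorted(
        m.span() for p in patterns for m in re.finditer(p, text) if m.end() > m.start()
    )
    merged = []
    for start, end in spans:
        if merged and start < merged[-1][1]:
            # Overlaps the previous span: extend it instead of adding a new one.
            merged[-1][1] = max(merged[-1][1], end)
        else:
            merged.append([start, end])
    out, pos = [], 0
    for start, end in merged:
        out.append(text[pos:start])
        out.append(replacement)
        pos = end
    out.append(text[pos:])
    return "".join(out)


# The second pattern matches inside the first match, so only one
# replacement is emitted.
print(scrub_sketch("call 555-1234 now", [r"\d{3}-\d{4}", r"\d{4}"]))
```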

polars_textproc.compression_ratio(expr, level=6) returns original_size / compressed_size using deflate compression at the given level.

polars_textproc.compressed_size(expr, level=6) returns the compressed size in bytes (deflate, excluding the 2-byte zlib header).
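For intuition about these two values, the same quantities can be approximated with Python's zlib module. This is a sketch under assumptions: the function names are hypothetical, text is taken as UTF-8 bytes, the ratio reuses the header-excluded size, and the source does not say whether the plugin counts the 4-byte zlib trailer, so exact numbers may differ.

```python
import zlib


def compressed_size_sketch(text, level=6):
    """Illustrative only: zlib output length minus the 2-byte header."""
    data = text.encode("utf-8")
    return len(zlib.compress(data, level)) - 2


def compression_ratio_sketch(text, level=6):
    """Illustrative only: original_size / compressed_size."""
    data = text.encode("utf-8")
    return len(data) / compressed_size_sketch(text, level)


repetitive = "abc" * 1000
print(compressed_size_sketch(repetitive))
print(compression_ratio_sketch(repetitive))
```

Highly repetitive text compresses well and so has a high ratio, which makes the ratio usable as a cheap repetition or boilerplate signal.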

polars_textproc.samplebyte(expr) returns a random UInt8 per row (derived from a random 64-bit sample), such that the probability of that byte being exactly x is 2^(-x) for x > 0.
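One construction that yields this distribution is to count the leading zero bits of a uniform 64-bit sample and add one: the top bit is set with probability 1/2, giving x = 1, the pattern 01 has probability 1/4, giving x = 2, and so on. The sketch below uses that construction; it is an assumption for illustration, not a statement of how the plugin derives the byte, and samplebyte_sketch is a hypothetical name.

```python
import random


def samplebyte_sketch(sample):
    """Illustrative only: leading zeros of a 64-bit sample, plus one."""
    # bit_length() is the index of the highest set bit, so
    # 64 - bit_length() is the number of leading zero bits.
    return 64 - sample.bit_length() + 1


rng = random.Random(0)
draws = [samplebyte_sketch(rng.getrandbits(64)) for _ in range(20000)]
# Empirical frequencies should be close to 1/2, 1/4, 1/8.
for x in (1, 2, 3):
    print(x, draws.count(x) / len(draws))
```

Under this distribution, keeping rows whose byte is greater than k selects each row with probability of about 2^(-k).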

polars_textproc.uuid4(expr) returns a random UUID v4 string per row.

polars_textproc.tokenize(expr, tokenizer) returns the tokenization of the text in expr, using the supplied tokenizer. The tokenizer can be supplied either as a path to a JSON dump of a tokenizers.Tokenizer, or as a tokenizers.Tokenizer instance.

The plugin can also be registered as a namespace using polars_textproc.register_namespace(name='textproc'), which registers the polars expression namespace textproc and enables calling the functions that way, e.g. lf.select(pl.col('text').str.to_lowercase().textproc.minhash()).
