feat: add bio-bait spam detection with profile bio scanning by rezhajulio · Pull Request #10 · rezhajulio/PythonID-bot

rezhajulio · 2026-05-01T15:59:36Z

Summary

Detects two related spam vectors that have been showing up in the Indonesian Telegram community:

1. Bait phrases in messages, e.g. "cek bio aku", "liat byoh", "open my bio". Spammers obfuscate the word "bio" with misspellings, separators, and Cyrillic look-alikes (b.i.o, b1o, bioohh, Ьіо). The handler normalizes the text (NFKC + lowercase + zero-width strip), canonicalizes obfuscated variants back to "bio", then matches a small set of imperative + bio + possessive patterns.
2. Promo/scam links inside the user's Telegram profile bio, e.g. private t.me/+<invite-hash> invite links combined with promo hint words (VIP, promo, open, ready, …) and/or non-whitelisted @mentions. Some spammers send innocuous group messages while their bio carries the actual links. The user's bio is fetched once per hour via bot.get_chat() and cached in bot_data.

On match the handler deletes the message, restricts the user, clears the cached bio, and posts a notification (separate templates for message-bait vs bio-link cases) to the warning topic, then raises ApplicationHandlerStop.

Detection logic

Message bait

NFKC normalize → lowercase → strip zero-width chars
Canonicalize bio/byo obfuscations to bio (handles Cyrillic look-alikes)
80-char length cap on normalized text (real bait is short)
4 narrow regex patterns gated on imperative cue + bio and/or first-person possessive
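
The message-bait steps above can be sketched roughly as follows. This is a minimal illustration, not the code in bio_bait.py: the look-alike table, the separator set, the shape of the obfuscation regex, and the single bait pattern are all assumptions made for the example (the PR uses four patterns that are not shown here). Only the pipeline order (NFKC, lowercase, zero-width strip, canonicalize, 80-char cap, match) comes from the description.

```python
import re
import unicodedata

# Zero-width / invisible characters to delete (str.translate maps them to None).
_ZERO_WIDTH = dict.fromkeys(map(ord, "\u200b\u200c\u200d\u2060\ufeff"))

# Assumed look-alike table: lowercase Cyrillic letters that resemble b, i, o.
_LOOKALIKES = str.maketrans({"\u044c": "b", "\u0456": "i", "\u043e": "o"})

# Assumed obfuscation shape: b, separators, i/1/y, separators, one or more o, optional h's.
_SEP = r"[._\-*]*"
_BIO_VARIANT = re.compile(rf"\bb{_SEP}[i1y]{_SEP}o+h*\b")

# Hypothetical single pattern: imperative cue, optional "my", then "bio".
_BAIT = re.compile(r"\b(?:cek|check|liat|lihat|open|buka)\s+(?:my\s+)?bio\b")

MAX_LEN = 80  # length cap on the normalized text, per the PR description


def normalize(text: str) -> str:
    """NFKC-normalize, lowercase, then strip zero-width characters."""
    text = unicodedata.normalize("NFKC", text).lower()
    return text.translate(_ZERO_WIDTH)


def canonicalize_bio(text: str) -> str:
    """Map look-alike letters to Latin, then collapse obfuscated variants to 'bio'."""
    return _BIO_VARIANT.sub("bio", text.translate(_LOOKALIKES))


def is_bio_bait(text: str) -> bool:
    normalized = normalize(text)
    if len(normalized) > MAX_LEN:
        return False  # long messages are assumed not to be bait
    return _BAIT.search(canonicalize_bio(normalized)) is not None
```

The trailing word boundary in the variant regex is what keeps words such as "biology" from being rewritten, since "bio" there is followed by another letter.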

Profile bio scan

Always flag t.me/+... private invite links
Flag non-whitelisted t.me/{username} links (reuses is_url_whitelisted)
Flag 2+ non-whitelisted @username mentions, OR 1 mention combined with a promo hint (vip, bcl, asp, open, ready, …)
Single bare @mention alone is not enough (avoids false positives)
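
As a rough sketch of the profile bio rules listed above: the regexes, the WHITELIST set, and the function name below are hypothetical stand-ins (the PR reuses is_url_whitelisted, whose behaviour is not shown here), and counting distinct mentions rather than raw occurrences is an assumption.

```python
import re

# Private invite links such as t.me/+<invite-hash>.
_PRIVATE_INVITE = re.compile(r"t\.me/\+[\w-]+", re.IGNORECASE)
# Public t.me/{username} links; usernames assumed to be 5-32 chars starting with a letter.
_PUBLIC_LINK = re.compile(r"t\.me/([a-z][a-z0-9_]{4,31})", re.IGNORECASE)
# @mentions not preceded by a word character or slash (skips e-mail addresses).
_MENTION = re.compile(r"(?<![\w/])@([a-z][a-z0-9_]{4,31})", re.IGNORECASE)
# Promo hint words named in the PR description.
_PROMO_HINT = re.compile(r"\b(?:vip|bcl|asp|open|ready|promo)\b", re.IGNORECASE)

# Hypothetical allowlist standing in for the project's real whitelist check.
WHITELIST = frozenset({"pythonid", "telegram"})


def bio_has_spam_links(bio: str) -> bool:
    # Rule 1: private invite links are always flagged.
    if _PRIVATE_INVITE.search(bio):
        return True
    # Rule 2: any non-whitelisted t.me/{username} link is flagged.
    for username in _PUBLIC_LINK.findall(bio):
        if username.lower() not in WHITELIST:
            return True
    # Rule 3: two or more distinct non-whitelisted mentions,
    # or exactly one mention combined with a promo hint word.
    mentions = {m.lower() for m in _MENTION.findall(bio)} - WHITELIST
    if len(mentions) >= 2:
        return True
    return len(mentions) == 1 and _PROMO_HINT.search(bio) is not None
```

With these rules a bio such as "contact @example_user" passes, while "VIP ready, contact @example_user" is flagged, which matches the intent of not treating a single bare mention as spam.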

Changes

New handler: src/bot/handlers/bio_bait.py (registered at group=2; contact_spam/new_user_spam/duplicate_spam/message_handler shifted to 3/4/5/6).
New config flag: bio_bait_enabled (Settings + GroupConfig, default True).
New Indonesian templates: BIO_BAIT_SPAM_NOTIFICATION (+ _NO_RESTRICT) and BIO_LINK_SPAM_NOTIFICATION (+ _NO_RESTRICT).
New tests: tests/test_bio_bait.py — 79 tests covering normalization, true positives (cek bio kak, lihat bio dong, bio aku update, Cyrillic/obfuscated forms), false positives (biology, bioinformatics, bio aku ada di README, thank you my bro), bio-link detection, per-user TTL cache, all handler branches (admin/bot/disabled/no-text/delete-fail/restrict-fail/notify-fail).
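
To illustrate why the group numbers matter, here is a toy model of group-ordered dispatch. It is not python-telegram-bot's implementation: ToyDispatcher and HandlerStop are hypothetical stand-ins for the Application and ApplicationHandlerStop, and it assumes one handler per group. The group numbers are the ones listed in this PR.

```python
from typing import Callable, Dict, List


class HandlerStop(Exception):
    """Stand-in for telegram.ext.ApplicationHandlerStop in this toy model."""


class ToyDispatcher:
    """Runs one callback per group, in ascending group order."""

    def __init__(self) -> None:
        self._groups: Dict[int, Callable[[str], None]] = {}

    def add_handler(self, callback: Callable[[str], None], group: int) -> None:
        self._groups[group] = callback

    def process(self, message: str) -> None:
        for group in sorted(self._groups):
            try:
                self._groups[group](message)
            except HandlerStop:
                break  # later groups never see the message


def run(message: str) -> List[str]:
    """Dispatch one message and return the names of the handlers that ran."""
    trace: List[str] = []

    def make(name: str, is_spam: Callable[[str], bool] = lambda m: False):
        def callback(m: str) -> None:
            trace.append(name)
            if is_spam(m):
                raise HandlerStop
        return callback

    dispatcher = ToyDispatcher()
    # Registration order is irrelevant; only the group number decides ordering.
    dispatcher.add_handler(make("message_handler"), group=6)
    dispatcher.add_handler(make("bio_bait", lambda m: "cek bio" in m), group=2)
    dispatcher.add_handler(make("contact_spam"), group=3)
    dispatcher.add_handler(make("new_user_spam"), group=4)
    dispatcher.add_handler(make("duplicate_spam"), group=5)
    dispatcher.process(message)
    return trace
```

In this model, run("halo semua") visits all five handlers in group order, while run("cek bio aku") stops after bio_bait.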

Verification

uv run pytest → 626 passed (was 547 → +79)
bio_bait.py at 100% coverage
Overall coverage: 99%
ruff check clean

Notes

Off-by-default lewd-keyword filter was not included in this PR (oracle recommended observing real samples first).
Handler ordering ensures bio-bait runs before contact/new-user/duplicate/profile checks; ApplicationHandlerStop short-circuits downstream when a match fires.
Bio fetch errors are swallowed and not cached, so transient API issues don't permanently mask a spam bio.
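
The caching behaviour described in these notes could look something like the sketch below. The function names, the cache layout, and the injected fetch_bio and now callables are assumptions for illustration; in the bot, the cache would presumably live in bot_data and fetch_bio would wrap bot.get_chat().

```python
import asyncio
import time
from typing import Awaitable, Callable, Dict, Optional, Tuple

BIO_TTL_SECONDS = 3600  # "once per hour", per the PR description

# Maps user_id -> (timestamp of fetch, bio text or None).
BioCache = Dict[int, Tuple[float, Optional[str]]]


async def get_cached_bio(
    cache: BioCache,
    user_id: int,
    fetch_bio: Callable[[int], Awaitable[Optional[str]]],
    now: Callable[[], float] = time.monotonic,
) -> Optional[str]:
    """Return the user's bio, fetching at most once per TTL window."""
    entry = cache.get(user_id)
    if entry is not None and now() - entry[0] < BIO_TTL_SECONDS:
        return entry[1]
    try:
        bio = await fetch_bio(user_id)
    except Exception:
        # Swallow the error and leave the cache untouched,
        # so the next message from this user triggers a retry.
        return None
    cache[user_id] = (now(), bio)
    return bio


def clear_cached_bio(cache: BioCache, user_id: int) -> None:
    """Drop the cached bio, e.g. after the user has been restricted."""
    cache.pop(user_id, None)
```

Caching a failed fetch as "no bio" would hide a spam bio for the rest of the TTL window, which is why the error path returns without writing to the cache.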

Detects two related spam vectors common in Indonesian Telegram groups:

1. Bait phrases in messages (e.g. "cek bio aku", "liat byoh", "open my bio"). Spammers obfuscate the word "bio" with misspellings, separators (b.i.o, b1o), and Cyrillic look-alikes (Ьіо). The handler normalizes (NFKC + lowercase + zero-width strip) and canonicalizes obfuscated variants back to "bio" before matching a small set of imperative + bio + possessive patterns.
2. Promo/scam links inside the user's Telegram profile bio. Some spammers send innocuous group messages while their bio carries t.me/+invite links, non-whitelisted t.me/{user} links, or multiple non-whitelisted @mentions (sometimes paired with promo hint words like VIP, BCL, ASP, open). The user's bio is fetched once per hour via bot.get_chat() and cached in bot_data.

On match the handler deletes the message, restricts the user, clears the cached bio, and posts a notification (separate templates for message-bait vs bio-link cases) to the warning topic.

- New handler: src/bot/handlers/bio_bait.py (registered at group=2, shifts contact/new_user/duplicate/message handlers to 3/4/5/6).
- New config: bio_bait_enabled (Settings + GroupConfig, default True).
- New templates: BIO_BAIT_SPAM_NOTIFICATION (+ NO_RESTRICT) and BIO_LINK_SPAM_NOTIFICATION (+ NO_RESTRICT) in constants.py.
- Tests: tests/test_bio_bait.py covers normalization, true positives (incl. Cyrillic / obfuscated forms), false positives (biology, bioinformatics, "bio aku ada di README"), bio-link detection, per-user TTL cache, all handler branches.

626 tests pass, bio_bait.py at 100% coverage, ruff clean.

Replace real-looking Telegram invite hashes and @username from spam examples in code comments and tests with obvious placeholders so the repository does not propagate (or appear to endorse) actual scam links.

rezhajulio mentioned this pull request May 1, 2026

chore: sanitize spam example references #11

Closed

chore: sanitize spam example references

a6b0329


rezhajulio wants to merge 2 commits into main from feat/bio-bait-spam-detection
