feat: add bio-bait spam detection with profile bio scanning#10

Open
rezhajulio wants to merge 2 commits into main from feat/bio-bait-spam-detection
rezhajulio (Owner) commented May 1, 2026

Summary

Detects two related spam vectors that have been showing up in the Indonesian Telegram community:

  1. Bait phrases in messages — e.g. cek bio aku, liat byoh, open my bio. Spammers obfuscate bio with misspellings, separators, and Cyrillic look-alikes (b.i.o, b1o, bioohh, Ьіо). The handler normalizes the text (NFKC + lowercase + zero-width strip), canonicalizes obfuscated variants back to bio, then matches a small set of imperative + bio + possessive patterns.

  2. Promo/scam links inside the user's Telegram profile bio — e.g. private t.me/+<invite-hash> invite links combined with promo hint words (VIP, promo, open, ready, …) and/or non-whitelisted @mentions. Some spammers send innocuous group messages while their bio carries the actual links. The user's bio is fetched once per hour via bot.get_chat() and cached in bot_data.

On match the handler deletes the message, restricts the user, clears the cached bio, and posts a notification (separate templates for message-bait vs bio-link cases) to the warning topic, then raises ApplicationHandlerStop.
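The short-circuit behaviour can be illustrated without the Telegram library. The sketch below is a simplified simulation, not the PR's code: `HandlerStop` is a stand-in for `ApplicationHandlerStop`, `dispatch` is a hypothetical helper that runs one handler per group in ascending group order, and the substring check stands in for the real detection logic.

```python
class HandlerStop(Exception):
    """Stand-in for telegram.ext.ApplicationHandlerStop (assumption for this sketch)."""


def dispatch(handlers_by_group, message):
    """Run one handler per group in ascending group order; stop when one raises."""
    ran = []
    for group in sorted(handlers_by_group):
        handler = handlers_by_group[group]
        ran.append(handler.__name__)
        try:
            handler(message)
        except HandlerStop:
            break  # downstream groups never see the message
    return ran


def bio_bait(message):
    # Placeholder check; the real handler would delete the message, restrict
    # the user, clear the cached bio, and notify before raising.
    if "cek bio" in message:
        raise HandlerStop


def contact_spam(message):
    pass


def new_user_spam(message):
    pass


# Group numbers follow the PR description: bio_bait at 2, the rest shifted up.
HANDLERS = {2: bio_bait, 3: contact_spam, 4: new_user_spam}
```

With this setup, a bait message is seen only by `bio_bait`, while a clean message flows through every group.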

Detection logic

Message bait

  • NFKC normalize → lowercase → strip zero-width chars
  • Canonicalize bio/byo obfuscations to bio (handles Cyrillic look-alikes)
  • 80-char length cap on normalized text (real bait is short)
  • 4 narrow regex patterns gated on imperative cue + bio and/or first-person possessive
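A minimal sketch of the pipeline above, under stated assumptions: the look-alike sets, the separator set, and the two patterns are illustrative guesses rather than the four patterns in `bio_bait.py`, and the names `normalize`, `canonicalize_bio`, and `is_bio_bait` are hypothetical.

```python
import re
import unicodedata

MAX_LEN = 80  # length cap from the PR description

# Zero-width and BOM characters, mapped to None so str.translate drops them.
_ZERO_WIDTH = dict.fromkeys(map(ord, "\u200b\u200c\u200d\u2060\ufeff"))

# Assumed look-alike sets: Latin letter, digit stand-in, Cyrillic stand-in.
_B = "b\u044c"        # b, Cyrillic soft sign
_I = "iy1\u0456"      # i, y, 1, Cyrillic i
_O = "o0\u043e"       # o, 0, Cyrillic o
_SEP = r"[._\-*]*"    # optional separators between letters (assumed set)

# One whole "word" shaped like b-i-o, allowing stretched vowels and trailing h's.
_BIO_VARIANT = re.compile(rf"(?<!\w)[{_B}]{_SEP}[{_I}]+{_SEP}[{_O}]+h*(?!\w)")

# Illustrative patterns only: imperative cue + bio, or bio + possessive at the end.
_PATTERNS = (
    re.compile(r"\b(?:cek|check|lihat|liat|open|buka)(?:\s+\w+){0,2}?\s+bio\b"),
    re.compile(r"\bbio\s+(?:aku|ku|saya|gue|gw)\s*$"),
)


def normalize(text: str) -> str:
    """NFKC normalize, lowercase, then strip zero-width characters."""
    text = unicodedata.normalize("NFKC", text).lower()
    return text.translate(_ZERO_WIDTH)


def canonicalize_bio(text: str) -> str:
    """Rewrite obfuscated bio/byo variants to the plain word 'bio'."""
    return _BIO_VARIANT.sub("bio", text)


def is_bio_bait(text: str) -> bool:
    normalized = normalize(text)
    if len(normalized) > MAX_LEN:
        return False  # real bait is short
    canonical = canonicalize_bio(normalized)
    return any(p.search(canonical) for p in _PATTERNS)
```

Because the variant regex must consume a whole word, `biology` and `bioinformatics` are left untouched; the trade-off is that any standalone token shaped like `b1o` is rewritten, which is why the match patterns still require an imperative or possessive cue.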

Profile bio scan

  • Always flag t.me/+... private invite links
  • Flag non-whitelisted t.me/{username} links (reuses is_url_whitelisted)
  • Flag 2+ non-whitelisted @username mentions, OR 1 mention combined with a promo hint (vip, bcl, asp, open, ready, …)
  • Single bare @mention alone is not enough (avoids false positives)
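The rules above might be sketched as follows. This is an approximation under assumptions: the username regexes are a rough fit for Telegram usernames, the hint list covers only the words named in this description (the full list is elided), and a plain `whitelist` set of lowercase usernames stands in for `is_url_whitelisted`, whose signature is not shown here. The function name `bio_has_spam_links` is hypothetical.

```python
import re

# Private invite links: t.me/+<hash>
_INVITE_LINK = re.compile(r"t\.me/\+[\w-]+", re.IGNORECASE)
# Public links: t.me/<username> (rough approximation of username rules)
_USER_LINK = re.compile(r"t\.me/([a-z][a-z0-9_]{3,31})\b", re.IGNORECASE)
# @mentions not preceded by a word character or slash (skips e-mail addresses)
_MENTION = re.compile(r"(?<![\w/])@([a-z][a-z0-9_]{3,31})\b", re.IGNORECASE)
# Subset of promo hints named in the PR description
_PROMO_HINT = re.compile(r"\b(?:vip|bcl|asp|open|ready|promo)\b", re.IGNORECASE)


def bio_has_spam_links(bio: str, whitelist: frozenset = frozenset()) -> bool:
    """Return True if the profile bio matches any of the link/mention rules."""
    # Rule 1: private invite links are always flagged.
    if _INVITE_LINK.search(bio):
        return True
    # Rule 2: any non-whitelisted t.me/<username> link.
    if any(name.lower() not in whitelist for name in _USER_LINK.findall(bio)):
        return True
    # Rule 3: two or more distinct non-whitelisted mentions, or one plus a promo hint.
    mentions = {m.lower() for m in _MENTION.findall(bio)} - whitelist
    if len(mentions) >= 2:
        return True
    return len(mentions) == 1 and bool(_PROMO_HINT.search(bio))
```

The examples in the tests for this sketch use obvious placeholders such as `@example_user`, in line with the second commit's goal of keeping real scam links out of the repository.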

Changes

  • New handler: src/bot/handlers/bio_bait.py (registered at group=2; contact_spam/new_user_spam/duplicate_spam/message_handler shifted to 3/4/5/6).
  • New config flag: bio_bait_enabled (Settings + GroupConfig, default True).
  • New Indonesian templates: BIO_BAIT_SPAM_NOTIFICATION (+ _NO_RESTRICT) and BIO_LINK_SPAM_NOTIFICATION (+ _NO_RESTRICT).
  • New tests: tests/test_bio_bait.py — 79 tests covering normalization, true positives (cek bio kak, lihat bio dong, bio aku update, Cyrillic/obfuscated forms), false positives (biology, bioinformatics, bio aku ada di README, thank you my bro), bio-link detection, per-user TTL cache, all handler branches (admin/bot/disabled/no-text/delete-fail/restrict-fail/notify-fail).

Verification

  • uv run pytest: 626 passed (was 547 → +79)
  • bio_bait.py at 100% coverage
  • Overall coverage: 99%
  • ruff check clean

Notes

  • Off-by-default lewd-keyword filter was not included in this PR (oracle recommended observing real samples first).
  • Handler ordering ensures bio-bait runs before contact/new-user/duplicate/profile checks; ApplicationHandlerStop short-circuits downstream when a match fires.
  • Bio fetch errors are swallowed and not cached, so transient API issues don't permanently mask a spam bio.
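The last note, on fetch errors, is easiest to see in code. The sketch below is a synchronous simplification with hypothetical names (`get_cached_bio`, `fetch`, `now`): in the actual handler the fetch is an awaited `bot.get_chat()` call and the cache lives in `bot_data`. The key point is that a failed fetch returns early without writing a cache entry, so the next message from the same user triggers a retry.

```python
import time
from typing import Callable, Dict, Optional, Tuple

BIO_TTL_SECONDS = 3600  # "once per hour" from the PR description

# user_id -> (time the bio was fetched, bio text or None)
BioCache = Dict[int, Tuple[float, Optional[str]]]


def get_cached_bio(
    cache: BioCache,
    user_id: int,
    fetch: Callable[[int], Optional[str]],
    now: Optional[float] = None,
) -> Optional[str]:
    """Return the user's bio, fetching at most once per TTL window."""
    now = time.monotonic() if now is None else now
    entry = cache.get(user_id)
    if entry is not None and now - entry[0] < BIO_TTL_SECONDS:
        return entry[1]  # still fresh, skip the API call
    try:
        bio = fetch(user_id)
    except Exception:
        # Swallow the error and do NOT cache: a transient failure must not
        # hide a spam bio for the rest of the TTL window.
        return None
    cache[user_id] = (now, bio)
    return bio
```

Clearing the cached bio after a match, as the handler does, would then be a plain `cache.pop(user_id, None)`.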

Detects two related spam vectors common in Indonesian Telegram groups:

1. Bait phrases in messages (e.g. "cek bio aku", "liat byoh",
   "open my bio"). Spammers obfuscate the word "bio" with
   misspellings, separators (b.i.o, b1o), and Cyrillic look-alikes
   (Ьіо). The handler normalizes (NFKC + lowercase + zero-width strip)
   and canonicalizes obfuscated variants back to "bio" before matching
   a small set of imperative + bio + possessive patterns.

2. Promo/scam links inside the user's Telegram profile bio. Some
   spammers send innocuous group messages while their bio carries
   t.me/+invite links, non-whitelisted t.me/{user} links, or multiple
   non-whitelisted @mentions (sometimes paired with promo hint words
   like VIP, BCL, ASP, open). The user's bio is fetched once per hour
   via bot.get_chat() and cached in bot_data.

On match the handler deletes the message, restricts the user, clears
the cached bio, and posts a notification (separate templates for
message-bait vs bio-link cases) to the warning topic.

- New handler: src/bot/handlers/bio_bait.py (registered at group=2,
  shifts contact/new_user/duplicate/message handlers to 3/4/5/6).
- New config: bio_bait_enabled (Settings + GroupConfig, default True).
- New templates: BIO_BAIT_SPAM_NOTIFICATION (+ NO_RESTRICT) and
  BIO_LINK_SPAM_NOTIFICATION (+ NO_RESTRICT) in constants.py.
- Tests: tests/test_bio_bait.py covers normalization, true positives
  (incl. Cyrillic / obfuscated forms), false positives (biology,
  bioinformatics, "bio aku ada di README"), bio-link detection,
  per-user TTL cache, all handler branches.

626 tests pass, bio_bait.py at 100% coverage, ruff clean.

Replace real-looking Telegram invite hashes and @username from spam
examples in code comments and tests with obvious placeholders so the
repository does not propagate (or appear to endorse) actual scam links.