A Telegram chatbot for recommending tourist destinations in Yogyakarta.
This repository contains an intent-classification chatbot and utilities to enrich place data using the Google Places API. The development notebook holds the research experiments (a hybrid TF-IDF + SVM pipeline alongside an LSTM); the production bot uses the SVM hybrid pipeline by default and can be configured to use the LSTM instead.
- Bot (recommended): `src/telegram_bot_v3.py` — latest SVM-based hybrid pipeline (greeting detector + 88-class SVM). Use `src/run_telegram_v3.py` to run this version (it reads `TELEGRAM_TOKEN`).
- Bot (alternate): `src/telegram_bot.py` — supports both SVM and LSTM via the `CLASSIFIER_TYPE` environment variable (default `svm`); in LSTM mode it loads a pre-trained LSTM intent classifier, Word2Vec embeddings, and a label encoder. Both modes classify user messages and reply using canned responses in `data/intents_diy_full.json`. Kept for compatibility.
- Scraper: `scripts/fetch_places_details.py` — uses the Google Places API to retrieve details (ratings, opening hours, address components) for places parsed from the intents file.
- Notebook: `YOGA_Chatbot_Complete.ipynb` — EDA, preprocessing, feature extraction, and the training pipeline (dual TF-IDF hybrid pipeline and experiments). Can be used to re-train or reproduce model artifacts.
- Intent classification for ~88 classes (greeting, goodbye, and many `kecamatan_` intents for district-level recommendations).
- Response mapping and preview generation.
- Prediction & conversation logging for monitoring and retraining (`logs/predictions.jsonl`, `logs/conversations.jsonl`).
- Google Places scraper to enrich datasets from the intents' "Top 5" lists.
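The logs are newline-delimited JSON. A minimal sketch of how a prediction entry might be appended (the field names here are illustrative assumptions, not the bot's exact schema):

```python
import json
import os
from datetime import datetime, timezone

def log_prediction(log_path, text, intent, confidence):
    """Append one newline-delimited JSON record to the prediction log."""
    os.makedirs(os.path.dirname(log_path), exist_ok=True)
    entry = {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "text": text,              # raw user message
        "intent": intent,          # predicted intent label
        "confidence": confidence,  # classifier probability/score
    }
    with open(log_path, "a", encoding="utf-8") as f:
        f.write(json.dumps(entry, ensure_ascii=False) + "\n")

log_prediction("logs/predictions.jsonl",
               "Rekomendasi wisata di Bantul", "kecamatan_bantul", 0.91)
```

Appending one JSON object per line keeps the file safe to tail and easy to re-parse for retraining.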
├─ data/
│ ├─ intents_diy_full.json # intent definitions & responses
│ └─ features/ # expected: label_encoder.pickle, tfidf vectorizers
├─ models/
│ └─ greeting_detector.pkl, svm_model.pkl # SVM artifacts expected by v3
├─ scripts/
│ └─ fetch_places_details.py # Google Places scraping utility
├─ src/
│ ├─ telegram_bot_v3.py # recommended SVM-based bot
│ ├─ telegram_bot.py # alternate bot (svm/lstm)
│ └─ run_telegram_v3.py # runner for v3
├─ logs/ # auto-created: predictions.jsonl, conversations.jsonl
├─ requirements.txt
└─ YOGA_Chatbot_Complete.ipynb
See requirements.txt for pinned dependencies. Minimum / tested versions include:
- Python 3.10+ (recommended)
- tensorflow==2.15.0
- python-telegram-bot==20.7
- sastrawi, numpy, pandas, scikit-learn
- python-dotenv
Install requirements (PowerShell):

```powershell
python -m pip install -r requirements.txt
```

- Create a `.env` file at the repository root (optional) OR set environment variables directly.

Examples (PowerShell):

```powershell
# For v3 runner (recommended):
$env:TELEGRAM_TOKEN = "your_token_here"

# If using src/telegram_bot.py directly (alternate):
$env:TELEGRAM_BOT_TOKEN = "your_token_here"

# Optional: choose classifier type (svm or lstm)
$env:CLASSIFIER_TYPE = "svm"
```

Required model artifacts (v3 / SVM pipeline):
- In `models/`:
  - `greeting_detector.pkl` — binary greeting detector (pickled sklearn model)
  - `svm_model.pkl` — main 88-class SVM classifier (pickled sklearn model)
- In `data/features/`:
  - `tfidf_greeting.pickle` — TF-IDF vectorizer for the greeting detector
  - `tfidf_vectorizer.pickle` — TF-IDF vectorizer for the main classifier
  - `label_encoder.pickle` — sklearn `LabelEncoder`
Optional LSTM artifacts (only if `CLASSIFIER_TYPE=lstm`):

- `models/yoga_lstm_best.h5`
- `data/features/word2vec.model`
- `data/features/feature_extraction_info.json` (with `max_length` and `vector_size`)
If you don't have the SVM artifacts, follow the notebook section below to export them from training outputs.
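To illustrate how the greeting detector and the main SVM fit together, here is a self-contained toy sketch of such a two-stage pipeline. The real artifacts are trained in the notebook and the exact code in `telegram_bot_v3.py` may differ; the tiny in-line training data below is purely illustrative, and variable names mirror the artifact list above:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.preprocessing import LabelEncoder
from sklearn.svm import LinearSVC

# Toy training data standing in for the real notebook outputs
greetings = ["hai", "halo", "selamat pagi"]
queries = ["rekomendasi wisata di bantul", "wisata di sleman"]
labels = ["kecamatan_bantul", "kecamatan_sleman"]

# Stage 1: binary greeting detector with its own TF-IDF vectorizer
tfidf_greeting = TfidfVectorizer().fit(greetings + queries)
greeting_detector = LinearSVC().fit(
    tfidf_greeting.transform(greetings + queries),
    [1] * len(greetings) + [0] * len(queries))

# Stage 2: multi-class SVM over the main TF-IDF features
tfidf_main = TfidfVectorizer().fit(queries)
label_encoder = LabelEncoder().fit(labels)
svm_model = LinearSVC().fit(tfidf_main.transform(queries),
                            label_encoder.transform(labels))

def classify(text):
    # Cheap binary check first, then the full 88-class SVM
    if greeting_detector.predict(tfidf_greeting.transform([text]))[0] == 1:
        return "greeting"
    pred = svm_model.predict(tfidf_main.transform([text]))[0]
    return label_encoder.inverse_transform([pred])[0]
```

In the real pipeline the five objects above come from the pickled artifact files rather than being trained in-line.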
- Set the token in PowerShell for the current session (v3 uses `TELEGRAM_TOKEN`):

```powershell
$env:TELEGRAM_TOKEN = "<your_token_here>"
```

- Run the v3 runner (recommended):

```powershell
python src/run_telegram_v3.py
```

Notes:

- `run_telegram_v3.py` loads `TELEGRAM_TOKEN` and starts `src/telegram_bot_v3.py` (the SVM hybrid pipeline).
- If you prefer the alternate bot entrypoint, set `TELEGRAM_BOT_TOKEN` and use `src/telegram_bot.py` directly; that file supports `CLASSIFIER_TYPE=svm|lstm` via the `CLASSIFIER_TYPE` environment variable.
Usage (PowerShell):

```powershell
python scripts/fetch_places_details.py --api-key "YOUR_GOOGLE_PLACES_API_KEY" --intents "data/intents_diy_full.json" --output-csv "data/tourism_places_details.csv" --output-json "data/tourism_places_details.json"
```

Options:

- `--max-places N` to run on a smaller subset (useful for testing).

The script parses "Top 5" lists from `kecamatan_` intents and queries Google Places for details. Outputs: CSV and JSON files with columns such as place_name, nama, kecamatan, kabupaten, provinsi, harga, jam_buka, rating, reviews_count, phone, website, address, google_maps_url, place_id, types.
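The exact parsing logic lives in `scripts/fetch_places_details.py`; as a simplified sketch, extracting place names from a numbered "Top 5" response might look like the following (the line format assumed here, `1. Name - description`, is an illustration and may differ from the real intents file):

```python
import re

def parse_top5(response_text):
    """Pull place names from numbered lines like '1. Pantai Parangtritis - ...'."""
    places = []
    for line in response_text.splitlines():
        # Match a leading number, then capture the name up to the first dash
        m = re.match(r"\s*\d+\.\s*([^-\n]+)", line)
        if m:
            places.append(m.group(1).strip())
    return places

sample = ("Top 5 wisata di Kecamatan Kretek:\n"
          "1. Pantai Parangtritis - pantai ikonik\n"
          "2. Pantai Depok - kuliner seafood")
print(parse_top5(sample))  # → ['Pantai Parangtritis', 'Pantai Depok']
```

Each extracted name would then be sent to the Places API for detail lookup.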
The training & evaluation pipeline is in YOGA_Chatbot_Complete.ipynb.
Summary steps to produce artifacts used by the bot:
- Run notebook cells (or run parts in a Python script) to:
- Preprocess and build TF-IDF / Word2Vec / LSTM pipeline
- Train and evaluate models
- Export/save artifacts with these filenames/paths expected by the bot:
```python
# Example snippets to save SVM artifacts from notebook training
import json
import pickle

with open('data/features/label_encoder.pickle', 'wb') as f:
    pickle.dump(label_encoder, f)
with open('data/features/tfidf_vectorizer.pickle', 'wb') as f:
    pickle.dump(tfidf_main, f)
with open('data/features/tfidf_greeting.pickle', 'wb') as f:
    pickle.dump(tfidf_greeting, f)
with open('models/svm_model.pkl', 'wb') as f:
    pickle.dump(main_classifier, f)
with open('models/greeting_detector.pkl', 'wb') as f:
    pickle.dump(greeting_detector, f)

# Optional: save LSTM artifacts (if you train an LSTM)
model.save('models/yoga_lstm_best.h5')
word2vec_model.save('data/features/word2vec.model')
with open('data/features/feature_extraction_info.json', 'w', encoding='utf-8') as f:
    json.dump({'max_length': max_len, 'vector_size': vector_size}, f, ensure_ascii=False)
```

Make sure `vector_size` matches the Word2Vec vectors used when training the LSTM and `max_length` is the same padding length used in preprocessing.
- Predictions & model output: `logs/predictions.jsonl`
- Full conversations: `logs/conversations.jsonl`

Each entry is a JSON object (newline-delimited), useful for error analysis and re-training.
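Because both logs are newline-delimited JSON, a quick error-analysis pass needs only the standard library. A minimal sketch, assuming each record carries an `intent` field (adjust the field name to the actual log schema):

```python
import json
from collections import Counter

def intent_counts(path):
    """Count predicted intents in a JSONL log for a quick overview."""
    counts = Counter()
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line:  # skip blank lines
                counts[json.loads(line)["intent"]] += 1
    return counts
```

Sorting the resulting counter (`intent_counts(...).most_common()`) surfaces the intents the bot predicts most often, a useful starting point for spotting over-triggered classes.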
- Missing model files: the bot raises errors if any expected artifact is absent; check the filenames and paths listed above.
- Token issues: the v3 runner reads `TELEGRAM_TOKEN`, while `src/telegram_bot.py` reads `TELEGRAM_BOT_TOKEN`; set the variable that matches your entrypoint in `.env` or the environment.
- Gensim Word2Vec versions must be compatible between training and inference (same vector sizes).
- TensorFlow may need a specific Python version and CPU/GPU configuration depending on the environment.
Add images to an images/ directory and reference them here for the README. Example categories and filenames:
- [Screenshot: Telegram chat] `images/telegram_chat_screenshot.png` — example bot interaction in Telegram ✅
- [Screenshot: training loss & accuracy] `images/notebook_training_plots.png` — plots from the notebook during training 📈
- [Screenshot: scraper CSV preview] `images/scraper_csv_preview.png` — sample rows of scraped place data 📋
- [Screenshot: logs preview] `images/logs_preview.png` — example entries from `logs/predictions.jsonl` and `logs/conversations.jsonl` 📝

When adding screenshots, include concise captions and descriptive alt text.
- Send: "Rekomendasi wisata di Bantul" → expect an intent for general tourism and a response from the corresponding intent.
- Send short greetings: "Hai" / "Selamat pagi" → greeting intents should be detected.
- If the bot replies with a clarification message, check `logs/predictions.jsonl` to see the predicted probabilities.
- Add new intents to `data/intents_diy_full.json` (keep the structure consistent).
- If you add "Top 5" lists in kecamatan responses, re-run the scraper to collect place metadata.
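The authoritative schema for `data/intents_diy_full.json` is the existing file itself; assuming a conventional intents layout (`tag` / `patterns` / `responses`, which is an assumption here, so mirror the real file when editing), a quick structural check before committing a new intent might look like:

```python
import json

def check_intents(path):
    """Fail fast if any intent is missing the assumed required keys."""
    with open(path, encoding="utf-8") as f:
        data = json.load(f)
    for intent in data["intents"]:
        for key in ("tag", "patterns", "responses"):
            if key not in intent:
                raise ValueError(f"intent missing '{key}': {intent}")
    return len(data["intents"])
```

Running such a check after edits catches malformed entries before they break training or the bot's response lookup.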
- Muhammad Akbar Pradana (Devaaldo)
- Repository: github.com/Devaaldo/yoga