A personalized news-article recommendation Android app, backed by a Django server that crawls Naver News, extracts keyword vectors, and ranks articles by cosine similarity against a per-user keyword profile built from reading history.
Built originally as a college capstone; modernized in 2026 with current tooling (AGP 8.7, Gradle 8.9, Realm 10.19, Django 4.2+) and security hardening (HMAC request signing, HTTPS-only, path-traversal defenses).
- Architecture
- Data policy
- How the recommender works
- Request lifecycle
- Data schemas
- Server setup
- Android setup
- Security model
- File map
- Known limitations
┌──────────────────────────┐ ┌────────────────────────┐
│ Android client │ HTTPS + HMAC-SHA256 │ Django server │
│ (Java) │ ◀───────────────────▶ │ │
│ │ │ ┌──────────────────┐ │
│ ┌────────────────────┐ │ GET recommenddb │ │ views.py │ │
│ │ StartScreen │ │ ─────────────────────▶│ │ (HTTP gateway) │ │
│ │ └─▶ MainActivity │ │ ◀─────────────────── │ └────────┬─────────┘ │
│ │ ├ Articles │ │ POST recorddb │ │ │
│ │ ├ Bookmarks │ │ ─────────────────────▶│ ▼ │
│ │ └ User │ │ │ ┌──────────────────┐ │
│ └────────────────────┘ │ │ │ Celery worker │ │
│ │ │ │ (User_update + │ │
│ Local storage: │ │ │ DB_similarity) │ │
│ • SQLite recorddb │ │ └────────┬─────────┘ │
│ • SQLite recommenddb │ │ │ │
│ • Realm bookmarks │ │ ▼ │
└──────────────────────────┘ │ ┌──────────────────┐ │
│ │ Naver News │ │
│ │ crawler │ │
│ │ (offline cron) │ │
│ └──────────────────┘ │
└────────────────────────┘
The client uploads a SQLite file of reading history when the app closes; the server triggers a Celery task that updates the user's keyword profile and recomputes their personal article ranking, then the client downloads the resulting SQLite on next launch.
This repository keeps crawler and recommender code, not crawled article corpora. Generated files under `server/crawler/crawlling/articleInfo/`, `server/crawler/crawlling/articleDB/`, and `server/recommender/py/articleDB/` are intentionally ignored because they can contain copied article text, publisher metadata, and reporter contact details.
To recreate the data locally, run the crawler and pivot step from the server package; the resulting files stay on your machine and should not be committed:
```bash
cd server
python -m crawler.crawlling.crolling
python -m crawler.crawlling.Make_DB
```

A pure content-based recommender: each article is reduced to a sparse keyword-weight vector at crawl time, each user is represented as a single keyword-weight vector built from the articles they've read, and ranking is the cosine similarity between the user vector and every article vector.
┌──────────────┐ ┌──────────────┐ ┌──────────────┐ ┌──────────────┐
│ 1. Crawl │──▶│ 2. Pivot & │──▶│ 3. Update │──▶│ 4. Score & │
│ Naver News │ │ normalise │ │ user vec │ │ rank │
│ │ │ keywords │ │ │ │ │
│ crolling │ │ Make_DB │ │ User_update │ │ DB_similarity│
└──────────────┘ └──────────────┘ └──────────────┘ └──────────────┘
per-article articleDB UserData recommenddb
CSV per pivoted CSV CSV per SQLite per
sub-category per day user user
Step 1 — Crawl (crawler/crawlling/crolling.py)
Walks Naver News across six top-level categories (politics, economy, society,
life/culture, world, IT/science) and ~50 sub-categories. For each article it
uses newspaper3k for body extraction
and gensim for TextRank-based keyword
extraction. Output: one CSV per (date, category, sub-category) with one row per article and columns `aid`, `oid`, `sid1`, `sid2`, `body`, `summary`, `keyword 0..4`, and the matching `keyword N weight` columns.
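A minimal sketch of that per-article step, assuming gensim 3.x (whose `summarization` module still ships TextRank `keywords`; it was removed in gensim 4); the ID arguments are hypothetical stand-ins for what the category walker supplies, not the literal `crolling.py` code:

```python
from gensim.summarization import keywords  # gensim 3.x only
from newspaper import Article

def extract_article(url, aid, oid, sid1, sid2):
    art = Article(url, language="ko")  # Naver News bodies are Korean
    art.download()
    art.parse()
    # TextRank over the body; scores=True yields (keyword, weight) pairs
    ranked = keywords(art.text, words=5, scores=True)
    row = {"aid": aid, "oid": oid, "sid1": sid1, "sid2": sid2,
           "body": art.text, "summary": art.meta_description}
    for i, (kw, weight) in enumerate(ranked):
        row[f"keyword {i}"] = kw
        row[f"keyword {i} weight"] = weight
    return row  # one CSV row in the layout described above
```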
Step 2 — Pivot & normalise (crawler/crawlling/Make_DB.py)
Per day, melts every per-sub-category CSV into a single matrix:
- Rows: articles (uniquely identified by `aid`)
- Columns: keywords, namespaced as `sid1_sid2_<keyword>` so the same surface string in different categories doesn't collide
- Cells: TextRank weight (0..1), or 0 if the article doesn't mention that keyword
This is the article × keyword sparse matrix the rest of the pipeline uses.
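In pandas terms the pivot is roughly the following, assuming the per-sub-category CSVs have already been melted into long `(aid, sid1, sid2, keyword, weight)` rows; column names here are illustrative, not the exact `Make_DB.py` identifiers:

```python
import pandas as pd

def pivot_day(long_df: pd.DataFrame) -> pd.DataFrame:
    # Namespace each keyword with its category so identical surface
    # strings from different sections become distinct columns.
    long_df["col"] = (long_df["sid1"].astype(str) + "_" +
                      long_df["sid2"].astype(str) + "_" +
                      long_df["keyword"])
    # Rows: articles keyed by aid; cells: TextRank weight, 0 elsewhere.
    return long_df.pivot_table(index="aid", columns="col",
                               values="weight", fill_value=0)
```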
Step 3 — Update the user vector (recommender/py/User_update.py)
For each newly-read article in the user's recorddb:
- Look up that article's keyword vector in the article DB.
- Re-weight: `new = old × (article × 100 + 1)` — articles you've read amplify the keywords they contain in your profile, with a `+1` floor so any keyword you've encountered survives even if its weight was tiny.
- Renormalise so the user's keyword weights sum to 1.
- Drop the lowest-weight keyword to keep the vector bounded.
The result is a user "interest" vector that drifts toward the keywords of articles the user actually opens.
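As a sketch, with both vectors held as pandas Series keyed by namespaced keyword (an illustration of the update rule above, not the literal `User_update.py` code):

```python
import pandas as pd

def drift(user: pd.Series, article: pd.Series) -> pd.Series:
    # Align on the union of keywords; anything absent gets weight 0.
    user, article = user.align(article, fill_value=0.0)
    # new = old × (article × 100 + 1): reading an article amplifies its
    # keywords; the +1 floor keeps every previously-seen keyword alive.
    user = user * (article * 100 + 1)
    user = user / user.sum()           # renormalise to sum to 1
    return user.drop(user.idxmin())    # drop the lowest-weight keyword
```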
Step 4 — Score & rank (recommender/py/DB_similarity.py)
- Stack the user vector on top of the article matrix (so they share columns).
- `sklearn.metrics.pairwise.cosine_similarity` produces the user-vs-articles row of the similarity matrix.
- Sort articles by similarity, take the top 100.
- Write `(link, similarity)` rows into the user's `recommenddb` SQLite, which the client downloads on next launch.
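A sketch of the whole scoring step, assuming the article matrix is indexed by article link (or that an `aid` → link mapping has already been applied); names are illustrative rather than the literal `DB_similarity.py` code:

```python
import sqlite3

import pandas as pd
from sklearn.metrics.pairwise import cosine_similarity

def rank(user_vec: pd.Series, articles: pd.DataFrame, db_path: str) -> None:
    # Reindex the user vector to the article matrix's columns so the
    # two share a keyword space, then take one row of similarities.
    user_row = user_vec.reindex(articles.columns, fill_value=0.0)
    sims = cosine_similarity([user_row.values], articles.values)[0]
    top = (pd.DataFrame({"link": articles.index, "similarity": sims})
           .nlargest(100, "similarity")
           .reset_index(drop=True))
    with sqlite3.connect(db_path) as conn:
        # index 0 = highest similarity, matching the tblink schema below.
        top.rename_axis("index").to_sql("tblink", conn, if_exists="replace")
```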
Content-based filtering buys three properties:

- No cold start across users. A brand-new user can be served the global default vector immediately; their personal vector emerges as they read.
- Explainable. The user's profile is a literal list of weighted keywords — trivial to debug and to expose in a "why was this recommended?" UI.
- Self-contained. No need to share read events between users, which would raise privacy and consent questions for a hobby project.
The trade-off is a smaller diversity surface than collaborative filtering would give — see known limitations.
Android Server Celery worker
─────── ────── ─────────────
App launch
│
│ GET /recommender/SendUserFile/<uuid>
│ X-Timestamp, X-Signature
│ ───────────────────────────────────▶│
│ │ validate ID + HMAC
│ │ if first-time user:
│ │ copy default recommenddb
│ ◀──────────────────────── recommenddb (SQLite, ~tens of KB)
│
│ scrape article metadata via Jsoup, render list
│
▼
User reads articles
│
│ insert (readdate, articledate, link) into local recorddb
│ toggle bookmarks → Realm
│
▼
App close (onDestroy or onTaskRemoved)
│
│ POST /recommender/GetUserFile/<uuid>
│ multipart upload of recorddb
│ X-Timestamp, X-Signature
│ ───────────────────────────────────▶│
│ │ validate ID + HMAC + size cap
│ │ stream upload to disk
│ │ enqueue Run_User_Update.delay(uuid)
│ │ │
│ ◀────────────────────── 200 success │
│ │
│ User_update.main(uuid)
│ → drift the user vector
│ │
│ DB_similarity.main(uuid)
│ → rewrite recommenddb
│
▼
Next launch — fresh recommendations
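For concreteness, here is what the signed download amounts to, written in Python rather than the app's Java; it assumes the canonical string is `METHOD\nPATH\nTIMESTAMP` exactly as described in the security model, and the helper name is hypothetical:

```python
import hashlib
import hmac
import time

import requests

def fetch_recommenddb(base_url: str, uuid: str, secret: bytes) -> bytes:
    path = f"/recommender/SendUserFile/{uuid}"
    ts = str(int(time.time()))
    # Canonical string: METHOD, PATH, TIMESTAMP, newline-separated.
    msg = f"GET\n{path}\n{ts}".encode()
    sig = hmac.new(secret, msg, hashlib.sha256).hexdigest()
    resp = requests.get(base_url + path, timeout=30,
                        headers={"X-Timestamp": ts, "X-Signature": sig})
    resp.raise_for_status()
    return resp.content  # the recommenddb SQLite bytes
```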
server/recommender/userprofile/
├── default/
│ └── recommenddb # bootstrapped for new users
└── <user-uuid>/
├── recommenddb # SQLite — what the client downloads
├── recorddb # SQLite — what the client uploads
└── UserData.csv # 1×N keyword-weight vector for this user
server/crawler/crawlling/
├── articleInfo/<date>/<sid1>/sid2_<sid2>.csv # raw crawl output
└── articleDB/<date>_DB.csv # pivoted keyword matrix
| table | column | type | purpose |
|---|---|---|---|
| tblink | index | INT | sort order (highest similarity first) |
| tblink | link | TEXT | article URL on news.naver.com |
| tblink | similarity | REAL | cosine similarity to the user vector |
The client reads the top 50 of these on startup, scrapes each article's
title/date/publisher/thumbnail, and renders them in ArticleFragment.
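The equivalent of that startup read, sketched in Python against the schema above (the app itself does this in Java via `RecommendRequester` and the list UI):

```python
import sqlite3

def top_links(db_path: str, n: int = 50):
    # "index" is pre-sorted by descending similarity server-side.
    with sqlite3.connect(db_path) as conn:
        return conn.execute(
            'SELECT link, similarity FROM tblink '
            'ORDER BY "index" LIMIT ?', (n,)).fetchall()
```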
| table | column | type | purpose |
|---|---|---|---|
| tb_record | _id | INT | autoincrement PK |
| tb_record | readdate | TEXT | when the user opened the article |
| tb_record | articledate | TEXT | publish date of the article |
| tb_record | link | TEXT | article URL |
Inserted in ReadMode.onCreate every time the user opens an article.
BookmarkVO rows hold the full snapshot of a bookmarked article (title,
link, publisher, date, scraped body, image URI). Lives in the app's private
Realm file; never leaves the device.
```bash
cd server
python -m venv .venv && source .venv/bin/activate
pip install -r requirements.txt
cp .env.example .env
# Edit .env and fill in at minimum:
# DJANGO_SECRET_KEY=$(python -c "import secrets; print(secrets.token_urlsafe(50))")
# REQUEST_SIGNING_SECRET=$(python -c "import secrets; print(secrets.token_urlsafe(32))")
# DJANGO_ALLOWED_HOSTS=your.domain.tld
# DJANGO_DEBUG=False   # True for local dev
```

Bootstrap and run:
```bash
python manage.py migrate
python manage.py check
set -a; source .env; set +a   # export every var defined in .env
python manage.py runserver 0.0.0.0:8080
# In a second shell:
celery -A django_project worker -l info
```

For production, put it behind a TLS-terminating reverse proxy (nginx + Let's Encrypt or similar) — the Android client refuses cleartext.
To run the crawler manually (the date is currently hard-coded to 2020-09-05; adjust in the source):
```bash
python -m crawler.crawlling.crolling   # raw crawl
python -m crawler.crawlling.Make_DB    # pivot into articleDB/
```

Toolchain: AGP 8.7.3, Gradle 8.9, compileSdk 36, JDK 17+ (the bundled JBR 21 in Android Studio works out of the box).
Configure `androidApp/gradle.properties` (or pass via `-P` on the gradle command line):

```properties
SERVER_BASE_URL=https://your-server.example.com
REQUEST_SIGNING_SECRET=<must-match-server>
```

Build:
```bash
cd androidApp
./gradlew assembleDebug
# APK at app/build/outputs/apk/debug/app-debug.apk
```

…or open the project in Android Studio. If gradle sync complains about the JDK version, set Settings → Build, Execution, Deployment → Build Tools → Gradle → Gradle JDK to the bundled `jbr-21`.
The signing secret is exposed via `BuildConfig.REQUEST_SIGNING_SECRET` at compile time — never hard-coded into sources. Release builds with an empty secret fail at gradle time, so APKs that can't sign requests never ship.
Min SDK 23, target SDK 36.
| Threat | Mitigation |
|---|---|
| Reading history sniffed in transit | HTTPS-only network policy; `usesCleartextTraffic` removed |
| Cross-app tracking via `ANDROID_ID` | Per-install v4 UUID in private `SharedPreferences` |
| Anyone with a UUID impersonates a user | HMAC-SHA256 signature over `METHOD\nPATH\nTIMESTAMP`, 5-min skew |
| Path traversal via crafted user ID | Regex `^[A-Za-z0-9_-]{1,64}$` + `Path.is_relative_to(USERPROFILE_ROOT)` |
| Disk-fill via huge upload | 5 MB streaming cap; partial files unlinked on overflow |
| `adb backup` siphoning local SQLite | `allowBackup="false"` + `fullBackupContent="false"` |
| XSS via scraped article HTML | WebView JavaScript explicitly disabled |
| CSRF on web routes | `CsrfViewMiddleware` re-enabled; mobile endpoints `csrf_exempt` |
| Secrets in source | All env-driven; `SECRET_KEY` refuses to boot prod with default |
The HMAC secret is in the APK, so a sufficiently determined attacker can extract it. This is documented intentionally — for stronger guarantees, the right next steps are OAuth2 or device-attested tokens, not more obfuscation.
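The server-side checks from the table compose roughly like this; a sketch of the shape, not the literal `auth.py` or `views.py` code:

```python
import hashlib
import hmac
import re
import time

ID_RE = re.compile(r"^[A-Za-z0-9_-]{1,64}$")  # user-ID allowlist

def verify(method: str, path: str, user_id: str,
           ts_header: str, sig_header: str, secret: bytes) -> bool:
    if not ID_RE.fullmatch(user_id):
        return False                      # rejects path-traversal inputs
    try:
        ts = int(ts_header)
    except (TypeError, ValueError):
        return False
    if abs(time.time() - ts) > 300:       # 5-minute replay window
        return False
    msg = f"{method}\n{path}\n{ts_header}".encode()
    expected = hmac.new(secret, msg, hashlib.sha256).hexdigest()
    # Constant-time compare so signatures can't be guessed byte-by-byte.
    return hmac.compare_digest(expected, sig_header)
```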
| Path | Role |
|---|---|
| `server/django_project/settings.py` | Env-driven Django config; HSTS / cookies in prod |
| `server/django_project/celery.py` | Celery app definition |
| `server/recommender/views.py` | HTTP gateway: validation, signing, file I/O |
| `server/recommender/auth.py` | HMAC request-signing decorator |
| `server/recommender/forms.py` | Multipart upload validation |
| `server/recommender/tasks.py` | Celery task wiring |
| `server/recommender/py/User_update.py` | Drift the user keyword vector from new reads |
| `server/recommender/py/DB_similarity.py` | Cosine-similarity ranking → recommenddb |
| `server/crawler/crawlling/crolling.py` | Naver News crawler + keyword extraction |
| `server/crawler/crawlling/Make_DB.py` | Pivot crawl output into the article × keyword CSV |
| `server/.env.example` | Template for required environment variables |
| `server/requirements.txt` | Pinned Python dependencies |
| Path | Role |
|---|---|
| `androidApp/app/src/main/java/com/example/project/` | |
| `Config.java` | Server URL, paths, per-install user ID |
| `RequestSigner.java` | HMAC client-side signing |
| `StartScreen.java` | Splash: download recommendations + scrape meta |
| `MainActivity.java` | Tab host (Articles / Bookmarks / User) |
| `ReadMode.java` | Article reader; bookmark toggle; record read |
| `ArticleFragment.java` / `ArticleAdapter.java` | Recommended-list UI |
| `BookmarkFragment.java` / `BookmarkAdapter.java` | Bookmarks UI (Realm-backed) |
| `UserFragment.java` / `UserRecordView.java` | User info + reading history viewer |
| `RecommendRequester.java` | GET the recommend SQLite from the server |
| `RecordSender.java` | POST the reading-history SQLite to the server |
| `ForecdTerminationService.java` | Catch swipe-kill so the upload still happens |
| `RecordDBHelper.java` | `SQLiteOpenHelper` for the local recorddb |
| `*VO.java` | Plain data holders |
| `res/xml/networkset.xml` | HTTPS-only network policy |
- Generated data is local-only. Crawl output and article matrix CSVs are not committed; run the crawler locally or provide your own dataset before executing the recommender pipeline.
- Hard-coded crawl date. `crolling.py`, `Make_DB.py`, `User_update.py`, and `DB_similarity.py` all reference `2020-09-05`. To run live, swap those for `datetime.date.today()` (or wire a CLI arg) and schedule the crawler on a daily cron.
- Single source. The crawler is hard-coded to news.naver.com.
- Content-based only. No collaborative signal, no diversity re-ranker, so the top of the list can become an echo of recent reads. A simple MMR pass after `DB_similarity` would help.
- Keyword extraction is TextRank. Surface-form matching means near-synonyms ("서울", "서울시") don't share weight. A switch to multilingual sentence embeddings (e.g. KoSimCSE) would lift recommendation quality noticeably.
- SQLite-as-API. The client and server exchange `.sqlite` files instead of JSON. It works, but a normal REST API would be easier to evolve.
- No real auth. HMAC stops impersonation by anyone who only knows a UUID, but the shared secret is in the APK. A live deployment should add OAuth2 or device attestation.
-First and Last CIY Project-