Article Recommender

A personalized news-article recommendation Android app, backed by a Django server that crawls Naver News, extracts keyword vectors, and ranks articles by cosine similarity against a per-user keyword profile built from reading history.

Built originally as a college capstone; modernized in 2026 with current tooling (AGP 8.7, Gradle 8.9, Realm 10.19, Django 4.2+) and security hardening (HMAC request signing, HTTPS-only, path-traversal defenses).


Table of contents

  1. Architecture
  2. Data policy
  3. How the recommender works
  4. Request lifecycle
  5. Data schemas
  6. Server setup
  7. Android setup
  8. Security model
  9. File map
  10. Known limitations

Architecture

┌──────────────────────────┐                       ┌────────────────────────┐
│      Android client      │   HTTPS + HMAC-SHA256 │     Django server      │
│         (Java)           │ ◀───────────────────▶ │                        │
│                          │                       │  ┌──────────────────┐  │
│  ┌────────────────────┐  │   GET  recommenddb    │  │  views.py        │  │
│  │ StartScreen        │  │ ─────────────────────▶│  │  (HTTP gateway)  │  │
│  │  └─▶ MainActivity  │  │ ◀───────────────────  │  └────────┬─────────┘  │
│  │       ├ Articles   │  │   POST recorddb       │           │            │
│  │       ├ Bookmarks  │  │ ─────────────────────▶│           ▼            │
│  │       └ User       │  │                       │  ┌──────────────────┐  │
│  └────────────────────┘  │                       │  │  Celery worker   │  │
│                          │                       │  │  (User_update +  │  │
│  Local storage:          │                       │  │   DB_similarity) │  │
│   • SQLite recorddb      │                       │  └────────┬─────────┘  │
│   • SQLite recommenddb   │                       │           │            │
│   • Realm  bookmarks     │                       │           ▼            │
└──────────────────────────┘                       │  ┌──────────────────┐  │
                                                   │  │   Naver News     │  │
                                                   │  │   crawler        │  │
                                                   │  │   (offline cron) │  │
                                                   │  └──────────────────┘  │
                                                   └────────────────────────┘

The client uploads a SQLite file of reading history when the app closes; the server triggers a Celery task that updates the user's keyword profile and recomputes their personal article ranking, then the client downloads the resulting SQLite on next launch.
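
That handoff is small enough to sketch. A minimal version of the Celery wiring, reusing the Run_User_Update task and module names that appear elsewhere in this README; the actual code in tasks.py may differ:

# Sketch of recommender/tasks.py; names follow this README, wiring is illustrative.
from celery import shared_task

from recommender.py import DB_similarity, User_update

@shared_task
def Run_User_Update(user_id):
    User_update.main(user_id)      # step 3: drift the user keyword vector
    DB_similarity.main(user_id)    # step 4: recompute the top-100 recommenddb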


Data policy

This repository keeps crawler and recommender code, not crawled article corpora. Generated files under server/crawler/crawlling/articleInfo/, server/crawler/crawlling/articleDB/, and server/recommender/py/articleDB/ are intentionally ignored because they can contain copied article text, publisher metadata, and reporter contact details.

To recreate the data locally, run the crawler and pivot step from the server package; the resulting files stay on your machine and should not be committed:

cd server
python -m crawler.crawlling.crolling
python -m crawler.crawlling.Make_DB

How the recommender works

A pure content-based recommender: each article is reduced to a sparse keyword-weight vector at crawl time, each user is represented as a single keyword-weight vector built from the articles they've read, and ranking is the cosine similarity between the user vector and every article vector.

Pipeline

┌──────────────┐   ┌──────────────┐   ┌──────────────┐   ┌──────────────┐
│ 1. Crawl     │──▶│ 2. Pivot &   │──▶│ 3. Update    │──▶│ 4. Score &   │
│   Naver News │   │   normalise  │   │   user vec   │   │   rank       │
│              │   │   keywords   │   │              │   │              │
│   crolling   │   │   Make_DB    │   │  User_update │   │ DB_similarity│
└──────────────┘   └──────────────┘   └──────────────┘   └──────────────┘
   per-article         articleDB          UserData         recommenddb
   CSV per             pivoted CSV        CSV per          SQLite per
   sub-category        per day            user             user

Step 1 — Crawl (crawler/crawlling/crolling.py) Walks Naver News across six top-level categories (politics, economy, society, life/culture, world, IT/science) and ~50 sub-categories. For each article it uses newspaper3k for body extraction and gensim for TextRank-based keyword extraction. Output: one CSV per (date, category, sub-category) with one row per article and columns aid, oid, sid1, sid2, body, summary, plus keyword 0..4 and each keyword's TextRank weight.
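
A minimal sketch of the per-article extraction, assuming gensim < 4.0 (its summarization module, removed in 4.0, provides TextRank keywords) and Korean-language parsing in newspaper3k; the wiring is illustrative, not the repo's code:

from gensim.summarization import keywords   # gensim < 4.0 only
from newspaper import Article

article = Article(url, language='ko')        # url: a news.naver.com article link
article.download()
article.parse()

body = article.text                          # extracted article body
top5 = keywords(body, scores=True)[:5]       # [(keyword, TextRank weight), ...]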

Step 2 — Pivot & normalise (crawler/crawlling/Make_DB.py) Per day, melts every per-sub-category CSV into a single matrix:

  • Rows: articles (uniquely identified by aid)
  • Columns: keywords, namespaced as sid1_sid2_<keyword> so the same surface string in different categories doesn't collide
  • Cells: TextRank weight (0..1), or 0 if the article doesn't mention that keyword

This is the article × keyword sparse matrix the rest of the pipeline uses.
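
A minimal pandas sketch of the pivot; the per-keyword column names are illustrative, following the CSV schema from step 1:

import pandas as pd

# raw: the day's per-sub-category CSVs concatenated into one DataFrame
rows = []
for _, r in raw.iterrows():
    for i in range(5):
        kw_col = f'keyword {i}'
        wt_col = f'keyword {i} weight'
        namespaced = f"{r['sid1']}_{r['sid2']}_{r[kw_col]}"
        rows.append((r['aid'], namespaced, r[wt_col]))

long = pd.DataFrame(rows, columns=['aid', 'keyword', 'weight'])
matrix = long.pivot_table(index='aid', columns='keyword',
                          values='weight', fill_value=0)   # article × keyword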

Step 3 — Update the user vector (recommender/py/User_update.py) For each newly read article in the user's recorddb:

  1. Look up that article's keyword vector in the article DB.
  2. Re-weight: new = old × (article × 100 + 1) — articles you've read amplify the keywords they contain in your profile, with a +1 floor so any keyword you've encountered survives even if its weight was tiny.
  3. Renormalise so the user's keyword weights sum to 1.
  4. Drop the lowest-weight keyword to keep the vector bounded.

The result is a user "interest" vector that drifts toward the keywords of articles the user actually opens.
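
Numerically, the per-article update is only a few lines. A NumPy sketch with illustrative variable names; the formula itself is the one above:

import numpy as np

user = np.asarray(user)            # current profile, weights sum to 1
article = np.asarray(article)      # the read article's weights, same columns

user = user * (article * 100 + 1)  # amplify shared keywords; +1 floor for the rest
user = user / user.sum()           # renormalise to sum to 1
user[user.argmin()] = 0.0          # drop the lowest-weight keyword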

Step 4 — Score & rank (recommender/py/DB_similarity.py)

  1. Stack the user vector on top of the article matrix (so they share columns).
  2. sklearn.metrics.pairwise.cosine_similarity produces the user-vs-articles row of the similarity matrix (see the sketch after this list).
  3. Sort articles by similarity, take the top 100.
  4. Write (link, similarity) rows into the user's recommenddb SQLite, which the client downloads on next launch.
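
A minimal sketch of the scoring pass, assuming the user vector and article matrix are pandas DataFrames that share keyword columns; variable names are illustrative:

import sqlite3

import pandas as pd
from sklearn.metrics.pairwise import cosine_similarity

# user_vec: 1 × keywords DataFrame; matrix: articles × keywords; links: URL per row
stacked = pd.concat([user_vec, matrix]).fillna(0)   # user row on top
sims = cosine_similarity(stacked)[0, 1:]            # user vs every article

top = (pd.DataFrame({'link': links, 'similarity': sims})
         .nlargest(100, 'similarity')
         .reset_index(drop=True))

with sqlite3.connect('recommenddb') as db:
    top.to_sql('tblink', db, if_exists='replace', index=True)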

Why content-based?

  • No cold start across users. A brand-new user can be served the global default vector immediately; their personal vector emerges as they read.
  • Explainable. The user's profile is a literal list of weighted keywords — trivial to debug and to expose in a "why was this recommended?" UI.
  • Self-contained. No need to share read events between users, which would raise privacy and consent questions for a hobby project.

The trade-off is a smaller diversity surface than collaborative filtering would give — see known limitations.


Request lifecycle

Android                                 Server                  Celery worker
───────                                 ──────                  ─────────────

App launch
  │
  │  GET /recommender/SendUserFile/<uuid>
  │  X-Timestamp, X-Signature
  │ ───────────────────────────────────▶│
  │                                     │ validate ID + HMAC
  │                                     │ if first-time user:
  │                                     │   copy default recommenddb
  │  ◀──────────────────────── recommenddb (SQLite, ~tens of KB)
  │
  │  scrape article metadata via Jsoup, render list
  │
  ▼
User reads articles
  │
  │  insert (readdate, articledate, link) into local recorddb
  │  toggle bookmarks → Realm
  │
  ▼
App close (onDestroy or onTaskRemoved)
  │
  │  POST /recommender/GetUserFile/<uuid>
  │  multipart upload of recorddb
  │  X-Timestamp, X-Signature
  │ ───────────────────────────────────▶│
  │                                     │ validate ID + HMAC + size cap
  │                                     │ stream upload to disk
  │                                     │ enqueue Run_User_Update.delay(uuid)
  │                                     │                              │
  │  ◀────────────────────── 200 success                              │
  │                                                                    │
  │                                                  User_update.main(uuid)
  │                                                  → drift the user vector
  │                                                                    │
  │                                                  DB_similarity.main(uuid)
  │                                                  → rewrite recommenddb
  │
  ▼
Next launch — fresh recommendations

Data schemas

Server-side, on disk

server/recommender/userprofile/
├── default/
│   └── recommenddb         # bootstrapped for new users
└── <user-uuid>/
    ├── recommenddb          # SQLite — what the client downloads
    ├── recorddb             # SQLite — what the client uploads
    └── UserData.csv         # 1×N keyword-weight vector for this user
server/crawler/crawlling/
├── articleInfo/<date>/<sid1>/sid2_<sid2>.csv   # raw crawl output
└── articleDB/<date>_DB.csv                     # pivoted keyword matrix

recommenddb (server → client)

table   column      type  purpose
tblink  index       INT   sort order (highest similarity first)
tblink  link        TEXT  article URL on news.naver.com
tblink  similarity  REAL  cosine similarity to the user vector

The client reads the top 50 of these on startup, scrapes each article's title/date/publisher/thumbnail, and renders them in ArticleFragment.
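
What that read amounts to, as a Python sqlite3 sketch (the actual client code is Java); note that index is a reserved word in SQLite, so it needs quoting:

import sqlite3

with sqlite3.connect('recommenddb') as db:
    top50 = db.execute(
        'SELECT link, similarity FROM tblink '
        'ORDER BY "index" LIMIT 50'          # "index" is reserved in SQLite
    ).fetchall()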

recorddb (client → server)

table      column       type  purpose
tb_record  _id          INT   autoincrement PK
tb_record  readdate     TEXT  when the user opened the article
tb_record  articledate  TEXT  publish date of the article
tb_record  link         TEXT  article URL

Inserted in ReadMode.onCreate every time the user opens an article.

Realm (Android, local)

BookmarkVO rows hold the full snapshot of a bookmarked article (title, link, publisher, date, scraped body, image URI). Lives in the app's private Realm file; never leaves the device.


Server setup

cd server
python -m venv .venv && source .venv/bin/activate
pip install -r requirements.txt

cp .env.example .env
# Edit .env and fill in at minimum:
#   DJANGO_SECRET_KEY=$(python -c "import secrets; print(secrets.token_urlsafe(50))")
#   REQUEST_SIGNING_SECRET=$(python -c "import secrets; print(secrets.token_urlsafe(32))")
#   DJANGO_ALLOWED_HOSTS=your.domain.tld
#   DJANGO_DEBUG=False           # True for local dev

Bootstrap and run:

python manage.py migrate
python manage.py check

set -a; source .env; set +a            # export every var defined in .env
python manage.py runserver 0.0.0.0:8080

# In a second shell:
celery -A django_project worker -l info

For production, put it behind a TLS-terminating reverse proxy (nginx + Let's Encrypt or similar) — the Android client refuses cleartext.

To run the crawler manually (the date is currently hard-coded to 2020-09-05; adjust in the source):

python -m crawler.crawlling.crolling   # raw crawl
python -m crawler.crawlling.Make_DB    # pivot into articleDB/

Android setup

Toolchain: AGP 8.7.3, Gradle 8.9, compileSdk 36, JDK 17+ (the bundled JBR 21 in Android Studio works out of the box).

Configure androidApp/gradle.properties (or pass via -P on the gradle command line):

SERVER_BASE_URL=https://your-server.example.com
REQUEST_SIGNING_SECRET=<must-match-server>

Build:

cd androidApp
./gradlew assembleDebug
# APK at app/build/outputs/apk/debug/app-debug.apk

…or open the project in Android Studio. If gradle sync complains about the JDK version, set Settings → Build, Execution, Deployment → Build Tools → Gradle → Gradle JDK to the bundled jbr-21.

The signing secret is injected via BuildConfig.REQUEST_SIGNING_SECRET at compile time — never hard-coded into sources. Release builds with an empty secret fail at Gradle time, so unsigned APKs can't ship.

Min SDK 23, target SDK 36.


Security model

Threat                                  Mitigation
Reading history sniffed in transit      HTTPS-only network policy; usesCleartextTraffic removed
Cross-app tracking via ANDROID_ID       Per-install v4 UUID in private SharedPreferences
Anyone with a UUID impersonates a user  HMAC-SHA256 signature over METHOD\nPATH\nTIMESTAMP, 5-min skew
Path traversal via crafted user ID      Regex ^[A-Za-z0-9_-]{1,64}$ + Path.is_relative_to(USERPROFILE_ROOT)
Disk-fill via huge upload               5 MB streaming cap; partial files unlinked on overflow
adb backup siphoning local SQLite       allowBackup="false" + fullBackupContent="false"
XSS via scraped article HTML            WebView JavaScript explicitly disabled
CSRF on web routes                      CsrfViewMiddleware re-enabled; mobile endpoints csrf_exempt
Secrets in source                       All env-driven; SECRET_KEY refuses to boot prod with default

The HMAC secret is in the APK, so a sufficiently determined attacker can extract it. This is documented intentionally — for stronger guarantees, the right next steps are OAuth2 or device-attested tokens, not more obfuscation.
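
A minimal sketch of the scheme: the canonical string METHOD\nPATH\nTIMESTAMP and the X-Timestamp / X-Signature headers are as documented above, while the hex digest encoding is an assumption:

import hashlib
import hmac
import time

def sign(method: str, path: str, secret: str) -> dict:
    timestamp = str(int(time.time()))
    message = f"{method}\n{path}\n{timestamp}".encode()
    digest = hmac.new(secret.encode(), message, hashlib.sha256).hexdigest()
    return {"X-Timestamp": timestamp, "X-Signature": digest}

# Server side: recompute the digest, compare with hmac.compare_digest(),
# and reject if the timestamp falls outside the 5-minute skew window.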


File map

Server

Path                                     Role
server/django_project/settings.py        Env-driven Django config; HSTS / cookies in prod
server/django_project/celery.py          Celery app definition
server/recommender/views.py              HTTP gateway: validation, signing, file I/O
server/recommender/auth.py               HMAC request-signing decorator
server/recommender/forms.py              Multipart upload validation
server/recommender/tasks.py              Celery task wiring
server/recommender/py/User_update.py     Drift the user keyword vector from new reads
server/recommender/py/DB_similarity.py   Cosine-similarity ranking → recommenddb
server/crawler/crawlling/crolling.py     Naver News crawler + keyword extraction
server/crawler/crawlling/Make_DB.py      Pivot crawl output into the article × keyword CSV
server/.env.example                      Template for required environment variables
server/requirements.txt                  Pinned Python dependencies

Android

Path                                          Role
androidApp/app/src/main/java/com/example/project/
  Config.java                                 Server URL, paths, per-install user ID
  RequestSigner.java                          HMAC client-side signing
  StartScreen.java                            Splash: download recommendations + scrape meta
  MainActivity.java                           Tab host (Articles / Bookmarks / User)
  ReadMode.java                               Article reader; bookmark toggle; record read
  ArticleFragment.java / ArticleAdapter.java  Recommended-list UI
  BookmarkFragment.java / BookmarkAdapter.java  Bookmarks UI (Realm-backed)
  UserFragment.java / UserRecordView.java     User info + reading history viewer
  RecommendRequester.java                     GET the recommend SQLite from the server
  RecordSender.java                           POST the reading-history SQLite to the server
  ForecdTerminationService.java               Catch swipe-kill so the upload still happens
  RecordDBHelper.java                         SQLiteOpenHelper for the local recorddb
  *VO.java                                    Plain data holders
res/xml/networkset.xml                        HTTPS-only network policy

Known limitations

  • Generated data is local-only. Crawl output and article matrix CSVs are not committed; run the crawler locally or provide your own dataset before executing the recommender pipeline.
  • Hard-coded crawl date. crolling.py, Make_DB.py, User_update.py, and DB_similarity.py all reference 2020-09-05. To run live, swap those for datetime.date.today() (or wire a CLI arg; a sketch follows this list) and schedule the crawler on a daily cron.
  • Single source. The crawler is hard-coded to news.naver.com.
  • Content-based only. No collaborative signal, no diversity re-ranker, so the top of the list can become an echo of recent reads. A simple MMR pass after DB_similarity would help.
  • Keyword extraction is TextRank. Surface-form matching means near-synonyms ("서울", "서울시") don't share weight. A switch to multilingual sentence embeddings (e.g. KoSimCSE) would lift recommendation quality noticeably.
  • SQLite-as-API. The client and server exchange .sqlite files instead of JSON. It works, but a normal REST API would be easier to evolve.
  • No real auth. HMAC stops impersonation by anyone who only knows a UUID, but the shared secret is in the APK. A live deployment should add OAuth2 or device attestation.
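
A minimal sketch of the CLI-arg approach mentioned in the second item (the flag name is hypothetical):

import argparse
import datetime

parser = argparse.ArgumentParser()
parser.add_argument('--date',
                    default=datetime.date.today().isoformat(),
                    help='pipeline date, YYYY-MM-DD (default: today)')
args = parser.parse_args()         # replaces the hard-coded 2020-09-05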


-First and Last CIY Project-
