Commit 49cf14e

Greg Bell committed "Cleanup for public"
1 parent 243e8d6, commit 49cf14e

48 files changed, +32 -398 lines changed

README.md

Lines changed: 28 additions & 0 deletions
@@ -0,0 +1,28 @@
+# Bayesian RSS feed classifier
+Downloads news articles from an RSS feed, categorises them based on naive Bayesian classification, then prints to screen and emails.
+
+Only articles not seen before are shown.
+
+This project makes use of the awesome [toolz](https://github.com/pytoolz/toolz)
+library to organize everything as a pipeline of functions.
+
+## Suitability
+This is a personal project developed to process the Port News and Wauchope
+Gazette newspapers in Australia. XML/HTML tags and properties will be specific
+to those feeds, so you'll have to modify code for your particular feed. If
+there's interest, complain about this in an issue, and I'll make everything
+configurable.
+
+It also only classifies articles as "news" or "sports".
+
+## Training
+Use `download_article.py` to download a particular article, based on URL, and
+manually put the resulting file in either `training_data/news` or
+`training_data/sports` dir.
+
+## Other setup
+Add your feed URLs and email to `feed_classifier.py`
+Configure your email server in `pipeline/mail_sender.py`
+
+## Running
+Run `feed_classifier.py`, no arguments required.
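
To make the README's "naive Bayesian classification" concrete, here is a minimal sketch of a two-class (news/sports) multinomial naive Bayes classifier with add-one smoothing. It is not the repository's `NB_classifier` module, whose internals this commit does not show. The `tokenize`, `train` and `classify` names, the whitespace tokeniser, and the toy training sentences are all assumptions made for illustration.

```python
import math
from collections import Counter


def tokenize(text):
    """Lowercase and split on whitespace (a deliberately crude tokeniser)."""
    return text.lower().split()


def train(labelled_docs):
    """Build per-label word counts and document counts from (label, text) pairs."""
    word_counts = {}        # label -> Counter of word frequencies
    doc_counts = Counter()  # label -> number of training documents
    for label, text in labelled_docs:
        doc_counts[label] += 1
        word_counts.setdefault(label, Counter()).update(tokenize(text))
    vocabulary = set()
    for counts in word_counts.values():
        vocabulary.update(counts)
    return word_counts, doc_counts, vocabulary


def classify(model, text):
    """Return the label with the highest log posterior under naive Bayes."""
    word_counts, doc_counts, vocabulary = model
    total_docs = sum(doc_counts.values())
    best_label, best_score = None, -math.inf
    for label in doc_counts:
        # Log prior: fraction of training documents carrying this label.
        score = math.log(doc_counts[label] / total_docs)
        total_words = sum(word_counts[label].values())
        for word in tokenize(text):
            # Add-one (Laplace) smoothing so unseen words do not zero the product.
            count = word_counts[label][word] + 1
            score += math.log(count / (total_words + len(vocabulary)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Hypothetical stand-ins for files under training_data/news and training_data/sports.
TRAINING = [
    ("news", "council approves new hospital funding"),
    ("news", "police investigate highway crash"),
    ("sports", "sharks win rugby league grand final"),
    ("sports", "local cricket team wins final match"),
]

model = train(TRAINING)
label = classify(model, "rugby team wins the final")
print(label)
```

Working in log space avoids numeric underflow when articles are long, and the smoothing term means a word that never appeared in one category's training files lowers that category's score instead of ruling it out entirely.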

app/feed_classifier.py

Lines changed: 2 additions & 5 deletions
@@ -40,17 +40,14 @@
 # Dependencies injected for testing purposes
 # Note there's a 'dummy' module which has versions of the functions in
 # `internet` that don't connect to the internet. For end-to-end testing.
-# TODO:
-# implement dummy versions of the persistence functions, also for testing.
 from feed_classifier.internet import feed_fetcher, body_fetcher
 from feed_classifier.NB_classifier import NB_classifier
 from feed_classifier.persistence import have_seen
 
 
 def main():
+    # URL to rss.xml's
     URLS = [
-        "http://www.portnews.com.au/rss.xml",
-        "https://www.wauchopegazette.com.au/rss.xml",
     ]
 
     title_hashes = load_title_hashes()
@@ -74,7 +71,7 @@ def main():
             mark_as_seen(title_hashes),
             format_as_text,
             toolz.juxt(
-                print, sendmail(["greg@bitwombat.com.au", "jillbell2@yahoo.com"])
+                print, sendmail([])
             ),
         ),
     )
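
The second hunk ends a function pipeline by handing the formatted text to both `print` and `sendmail` through `toolz.juxt`. The sketch below shows that fan-out pattern together with a seen-title filter, using only the standard library so it runs without toolz installed. The local `juxt` and `pipe` helpers are stand-ins that mimic the toolz functions of the same names. `title_hash`, `drop_seen`, `mark_seen`, this `format_as_text`, and the `outbox` list are hypothetical, since the repository's `have_seen` and `mark_as_seen` signatures are not visible in this diff.

```python
import hashlib
from functools import reduce


def juxt(*funcs):
    """Stand-in for toolz.juxt: call every function with the same arguments."""
    return lambda *args, **kwargs: tuple(f(*args, **kwargs) for f in funcs)


def pipe(data, *funcs):
    """Stand-in for toolz.pipe: thread data through funcs left to right."""
    return reduce(lambda value, func: func(value), funcs, data)


def title_hash(title):
    """Hash a title so the seen-set stores fixed-size digests, not raw text."""
    return hashlib.sha256(title.encode("utf-8")).hexdigest()


def drop_seen(seen_hashes):
    """Return a pipeline stage that keeps only titles not already seen."""
    def stage(titles):
        return [t for t in titles if title_hash(t) not in seen_hashes]
    return stage


def mark_seen(seen_hashes):
    """Return a stage that records each title's hash and passes titles through."""
    def stage(titles):
        seen_hashes.update(title_hash(t) for t in titles)
        return titles
    return stage


def format_as_text(titles):
    """Join titles into one block of text, one title per line."""
    return "\n".join(titles)


outbox = []  # hypothetical stand-in for an email sink, so nothing is sent
seen = {title_hash("Old story")}

results = pipe(
    ["Old story", "New bridge opens"],
    drop_seen(seen),
    mark_seen(seen),
    format_as_text,
    juxt(print, outbox.append),  # fan the same text out to two sinks
)
```

Because both sinks return `None`, `results` is a tuple of two `None` values; the useful effects are the printed text and the entry appended to `outbox`. Running the same titles through `drop_seen(seen)` a second time yields an empty list, which is the "only articles not seen before" behaviour the README describes.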

app/pipeline/mail_sender.py

Lines changed: 2 additions & 2 deletions
@@ -6,8 +6,8 @@
 
 @autocurry
 def sendmail(recipients, text):
-    SERVER = "server"
-    FROM = "greg@gbell.bitwombat.com.au"
+    SERVER = ""
+    FROM = ""
     SUBJECT = "New Local Stories"
 
     email = """\
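
`sendmail` is decorated with `@autocurry`, which is why `sendmail([])` in `feed_classifier.py` can be passed around as a function that still expects the text. The sketch below illustrates that currying idea with a small hand-written decorator. It is an assumption about the behaviour, not the repository's `autocurry` implementation, and `build_email` plus the `example.com` addresses are hypothetical placeholders. The message is only built, never sent.

```python
import inspect
from email.message import EmailMessage


def autocurry(func):
    """Hypothetical decorator: defer the call until all positional args arrive."""
    arity = len(inspect.signature(func).parameters)

    def curried(*args):
        if len(args) >= arity:
            return func(*args)
        # Not enough arguments yet: remember them and wait for the rest.
        return lambda *more: curried(*args, *more)

    return curried


@autocurry
def build_email(recipients, text):
    """Build (but do not send) a message; the sender address is a placeholder."""
    message = EmailMessage()
    message["From"] = "sender@example.com"  # assumed placeholder address
    message["To"] = ", ".join(recipients)
    message["Subject"] = "New Local Stories"
    message.set_content(text)
    return message


to_reader = build_email(["reader@example.com"])  # partially applied, awaits text
message = to_reader("Headline one\nHeadline two")
print(message["To"])
```

To actually deliver such a message, one option is the standard library's `smtplib.SMTP(...).send_message(message)`, with the server name supplied where the diff now leaves `SERVER` blank.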

app/training_data/news/5205862.art

Lines changed: 0 additions & 1 deletion
This file was deleted.

app/training_data/news/5210015.art

Lines changed: 0 additions & 3 deletions
This file was deleted.

app/training_data/news/5215314.art

Lines changed: 0 additions & 1 deletion
This file was deleted.

app/training_data/news/5215906.art

Lines changed: 0 additions & 1 deletion
This file was deleted.

app/training_data/news/5215956.art

Lines changed: 0 additions & 1 deletion
This file was deleted.

app/training_data/news/5216155.art

Lines changed: 0 additions & 1 deletion
This file was deleted.

app/training_data/news/5216163.art

Lines changed: 0 additions & 1 deletion
This file was deleted.
