Neoky/smart_scrapy
This is just some demo code for learning and testing.

The source of the code is credited at the top of each source file.

This was something I whipped together in a weekend, so it's very rough.

Go to smart_scrapy/smart_scrapy/spiders/smart_spider.py and fill in the allowed_domains and start_urls for the site you want to crawl.
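
For reference, here is a minimal sketch of what that might look like. The domain and start URL are placeholders, and the parse logic is illustrative, not necessarily what the repo actually does:

# smart_scrapy/smart_scrapy/spiders/smart_spider.py (sketch)
import scrapy

class SmartSpider(scrapy.Spider):
    name = "Smart"                         # matches the `scrapy crawl Smart` command below
    allowed_domains = ["example.com"]      # placeholder: the domain you want to crawl
    start_urls = ["https://example.com/"]  # placeholder: where the crawl begins

    def parse(self, response):
        # hand each raw page to the pipeline; downstream steps clean it up
        yield {"url": response.url, "html": response.text}

        # follow in-domain links so the rest of the site gets crawled
        for href in response.css("a::attr(href)").getall():
            yield response.follow(href, callback=self.parse)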

# create and activate a Python 3 virtual environment
virtualenv -p python3 venv
source venv/bin/activate

# install the dependencies
pip install -r requirements.txt

# run the crawl, then the processing pipeline, then the whoosh demo
scrapy crawl Smart
python main.py
python whoost_tutorial.py

What does all of this do?

  • Scrapes data from a website domain.
  • Sanitizes each page down to its main content using dragnet (see the sketch after this list).
  • Summarizes each article.
  • Grabs the top 10 topics of each article using gensim/nltk/spacy (also in the sketch below).
  • Creates a processed output CSV using pandas.
  • Uses whoosh to index and search the processed CSV (see the search sketch below).
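
A minimal sketch of the clean/summarize/topics/CSV steps, assuming the crawl wrote one JSON object per page. The file names here (pages.jsonl, processed.csv) are made up, and the gensim.summarization module only exists in gensim versions before 4.0:

# process_pages.py (hypothetical name)
import json

import pandas as pd
from dragnet import extract_content                   # boilerplate removal: keeps the article text
from gensim.summarization import summarize, keywords  # removed in gensim 4.0; pin gensim<4

rows = []
with open("pages.jsonl") as f:                        # hypothetical crawl output
    for line in f:
        page = json.loads(line)
        text = extract_content(page["html"])          # dragnet: main content only
        rows.append({
            "url": page["url"],
            "summary": summarize(text, word_count=50),                  # extractive TextRank summary
            "topics": ", ".join(keywords(text, words=10, split=True)),  # top 10 keywords
        })

# write the processed output CSV with pandas
pd.DataFrame(rows).to_csv("processed.csv", index=False)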
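
And a minimal sketch of the whoosh index/search step over that CSV; the index directory and the query string are placeholders:

# search_processed.py (hypothetical name)
import os

import pandas as pd
from whoosh import index
from whoosh.fields import ID, TEXT, Schema
from whoosh.qparser import QueryParser

# one stored field per CSV column we want back in search results
schema = Schema(url=ID(stored=True), summary=TEXT(stored=True), topics=TEXT(stored=True))
os.makedirs("indexdir", exist_ok=True)
ix = index.create_in("indexdir", schema)

# add one whoosh document per processed row
writer = ix.writer()
for row in pd.read_csv("processed.csv").itertuples():
    writer.add_document(url=row.url, summary=row.summary, topics=row.topics)
writer.commit()

# search the summaries for a term and print the matching pages
with ix.searcher() as searcher:
    query = QueryParser("summary", ix.schema).parse("kubernetes")
    for hit in searcher.search(query, limit=10):
        print(hit["url"], "--", hit["topics"])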

What do I hope to get out of this?

Maybe I can eventually turn it into a personalized RSS feed for myself and others. It's also good practice for keeping up with modern ML tutorials, since ML work has been slow for me lately while Kubernetes/DevOps work has picked up.
