diff --git a/README.md b/README.md
index 3d9ab37..f48aa96 100644
--- a/README.md
+++ b/README.md
@@ -13,8 +13,8 @@ Pretty simple!


-
-
+
+
[](https://github.com/rivermont/spidy/graphs/punch-card)
[](https://travis-ci.com/github/rivermont/spidy)
[](https://pypi.org/project/spidy-web-crawler/)
@@ -101,6 +101,7 @@ Here are some features we figure are worth noting.
- Cross-Platform compatibility: spidy will work on all three major operating systems, Windows, Mac OS/X, and Linux!
- Frequent Timestamp Logging: Spidy logs almost every action it takes to both the console and one of two log files.
- Browser Spoofing: Make requests using User Agents from 4 popular web browsers, use a custom spidy bot one, or create your own!
+ - Headless Browser Support: Render full webpages with a headless browser to capture dynamically loaded content.
- Portability: Move spidy's folder and its contents somewhere else and it will run right where it left off. *Note*: This only works if you run it from source code.
- User-Friendly Logs: Both the console and log file messages are simple and easy to interpret, but packed with information.
- Webpage saving: Spidy downloads each page that it runs into, regardless of file type. The crawler uses the HTTP `Content-Type` header returned with most files to determine the file type.
@@ -225,6 +226,7 @@ See the [`CONTRIBUTING.md`](https://github.com/rivermont/spidy/blob/master/spidy
* [quatroka](https://github.com/quatroka) - Fixed testing bugs.
* [stevelle](https://github.com/stevelle) - Respect robots.txt.
* [thatguywiththatname](https://github.com/thatguywiththatname) - README link corrections.
+* [lkotlus](https://github.com/lkotlus) - Optimizations, out-of-scope URL filtering, and headless browser support.
# License
We used the [Gnu General Public License](https://www.gnu.org/licenses/gpl-3.0.en.html) (see [`LICENSE`](https://github.com/rivermont/spidy/blob/master/LICENSE)) as it was the license that best suited our needs.
diff --git a/requirements.txt b/requirements.txt
index 8028352..de5777a 100644
--- a/requirements.txt
+++ b/requirements.txt
@@ -2,3 +2,6 @@ requests
lxml
flake8
reppy
+selenium
+selenium-wire
+blinker==1.7.0
\ No newline at end of file
diff --git a/spidy/config/blank.cfg b/spidy/config/blank.cfg
index 91c8f37..ed55fa9 100644
--- a/spidy/config/blank.cfg
+++ b/spidy/config/blank.cfg
@@ -28,6 +28,9 @@ RESTRICT =
# The domain within which to restrict crawling.
DOMAIN = ''
+# Domains, subdomains, and paths that are out of scope for the crawl
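+# (Illustrative example: OUT_OF_SCOPE = ['example.com/admin', 'static.example.com'])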
+OUT_OF_SCOPE = ['', '']
+
# Whether to respect sites' robots.txt or not
RESPECT_ROBOTS =
@@ -48,11 +51,17 @@ HEADER = HEADERS['']
# Or if you want to use custom headers:
HEADER = {'': '', '': ''}
+# Whether to render pages with a headless browser (more thorough, but slower)
+USE_BROWSER =
+
# Amount of errors allowed to happen before automatic shutdown.
MAX_NEW_ERRORS =
MAX_KNOWN_ERRORS =
MAX_HTTP_ERRORS =
MAX_NEW_MIMES =
+# Maximum time (in seconds) the crawl is allowed to run (set to float('inf') to run forever)
+MAX_TIME =
+
# Pages to start crawling on in case TODO is empty at start.
START = ['', '']
diff --git a/spidy/config/default.cfg b/spidy/config/default.cfg
index c02de63..fa2afc0 100644
--- a/spidy/config/default.cfg
+++ b/spidy/config/default.cfg
@@ -7,14 +7,17 @@ ZIP_FILES = True
OVERRIDE_SIZE = False
RESTRICT = False
DOMAIN = ''
+OUT_OF_SCOPE = []
RESPECT_ROBOTS = True
TODO_FILE = 'crawler_todo.txt'
DONE_FILE = 'crawler_done.txt'
WORD_FILE = 'crawler_words.txt'
SAVE_COUNT = 100
HEADER = HEADERS['spidy']
+USE_BROWSER = False
MAX_NEW_ERRORS = 5
MAX_KNOWN_ERRORS = 10
MAX_HTTP_ERRORS = 20
MAX_NEW_MIMES = 10
+MAX_TIME = float('inf')
START = ['https://en.wikipedia.org/wiki/Main_Page']
\ No newline at end of file
diff --git a/spidy/config/docker.cfg b/spidy/config/docker.cfg
index 9e52af7..3a546ca 100644
--- a/spidy/config/docker.cfg
+++ b/spidy/config/docker.cfg
@@ -7,14 +7,17 @@ ZIP_FILES = True
OVERRIDE_SIZE = False
RESTRICT = False
DOMAIN = ''
+OUT_OF_SCOPE = []
RESPECT_ROBOTS = True
TODO_FILE = '/data/crawler_todo.txt'
DONE_FILE = '/data/crawler_done.txt'
WORD_FILE = '/data/crawler_words.txt'
SAVE_COUNT = 100
HEADER = HEADERS['spidy']
+USE_BROWSER = False
MAX_NEW_ERRORS = 5
MAX_KNOWN_ERRORS = 10
MAX_HTTP_ERRORS = 20
MAX_NEW_MIMES = 10
+MAX_TIME = float('inf')
START = ['https://en.wikipedia.org/wiki/Main_Page']
diff --git a/spidy/config/heavy.cfg b/spidy/config/heavy.cfg
index 1c41c91..4e2f0ea 100644
--- a/spidy/config/heavy.cfg
+++ b/spidy/config/heavy.cfg
@@ -7,14 +7,17 @@ ZIP_FILES = True
OVERRIDE_SIZE = True
RESTRICT = False
DOMAIN = ''
+OUT_OF_SCOPE = []
RESPECT_ROBOTS = False
TODO_FILE = 'crawler_todo.txt'
DONE_FILE = 'crawler_done.txt'
WORD_FILE = 'crawler_words.txt'
SAVE_COUNT = 100
HEADER = HEADERS['spidy']
+USE_BROWSER = True
MAX_NEW_ERRORS = 5
MAX_KNOWN_ERRORS = 10
MAX_HTTP_ERRORS = 20
MAX_NEW_MIMES = 10
+MAX_TIME = float('inf')
START = ['https://en.wikipedia.org/wiki/Main_Page']
\ No newline at end of file
diff --git a/spidy/config/infinite.cfg b/spidy/config/infinite.cfg
index bcf11bc..1c41881 100644
--- a/spidy/config/infinite.cfg
+++ b/spidy/config/infinite.cfg
@@ -7,14 +7,17 @@ ZIP_FILES = True
OVERRIDE_SIZE = False
RESTRICT = False
DOMAIN = ''
+OUT_OF_SCOPE = []
RESPECT_ROBOTS = True
TODO_FILE = 'crawler_todo.txt'
DONE_FILE = 'crawler_done.txt'
WORD_FILE = 'crawler_words.txt'
SAVE_COUNT = 250
HEADER = HEADERS['spidy']
+USE_BROWSER = False
MAX_NEW_ERRORS = 1000000
MAX_KNOWN_ERRORS = 1000000
MAX_HTTP_ERRORS = 1000000
MAX_NEW_MIMES = 1000000
+MAX_TIME = float('inf')
START = ['https://en.wikipedia.org/wiki/Main_Page']
\ No newline at end of file
diff --git a/spidy/config/light.cfg b/spidy/config/light.cfg
index 9a916c9..7a11da4 100644
--- a/spidy/config/light.cfg
+++ b/spidy/config/light.cfg
@@ -7,14 +7,17 @@ OVERRIDE_SIZE = False
SAVE_WORDS = False
RESTRICT = False
DOMAIN = ''
+OUT_OF_SCOPE = []
RESPECT_ROBOTS = True
TODO_FILE = 'crawler_todo.txt'
DONE_FILE = 'crawler_done.txt'
WORD_FILE = 'crawler_words.txt'
SAVE_COUNT = 150
HEADER = HEADERS['spidy']
+USE_BROWSER = False
MAX_NEW_ERRORS = 5
MAX_KNOWN_ERRORS = 10
MAX_HTTP_ERRORS = 20
MAX_NEW_MIMES = 10
+MAX_TIME = 600
START = ['https://en.wikipedia.org/wiki/Main_Page']
\ No newline at end of file
diff --git a/spidy/config/multithreaded.cfg b/spidy/config/multithreaded.cfg
index 1af0311..17daafa 100644
--- a/spidy/config/multithreaded.cfg
+++ b/spidy/config/multithreaded.cfg
@@ -7,14 +7,17 @@ ZIP_FILES = True
OVERRIDE_SIZE = False
RESTRICT = False
DOMAIN = ''
+OUT_OF_SCOPE = []
RESPECT_ROBOTS = False
TODO_FILE = 'crawler_todo.txt'
DONE_FILE = 'crawler_done.txt'
WORD_FILE = 'crawler_words.txt'
SAVE_COUNT = 100
HEADER = HEADERS['spidy']
+USE_BROWSER = False
MAX_NEW_ERRORS = 5
MAX_KNOWN_ERRORS = 10
MAX_HTTP_ERRORS = 20
MAX_NEW_MIMES = 10
+MAX_TIME = float('inf')
START = ['https://en.wikipedia.org/wiki/Main_Page']
\ No newline at end of file
diff --git a/spidy/config/rivermont-infinite.cfg b/spidy/config/rivermont-infinite.cfg
deleted file mode 100644
index 7682ae0..0000000
--- a/spidy/config/rivermont-infinite.cfg
+++ /dev/null
@@ -1,21 +0,0 @@
-THREAD_COUNT = 8
-OVERWRITE = False
-THREAD_COUNT = 8
-RAISE_ERRORS = False
-SAVE_PAGES = True
-SAVE_WORDS = False
-ZIP_FILES = False
-OVERRIDE_SIZE = False
-RESTRICT = False
-DOMAIN = ''
-RESPECT_ROBOTS = False
-TODO_FILE = 'crawler_todo.txt'
-DONE_FILE = 'crawler_done.txt'
-WORD_FILE = 'crawler_words.txt'
-SAVE_COUNT = 100
-HEADER = HEADERS['spidy']
-MAX_NEW_ERRORS = 1000000
-MAX_KNOWN_ERRORS = 1000000
-MAX_HTTP_ERRORS = 1000000
-MAX_NEW_MIMES = 1000000
-START = ['http://24.40.136.85/']
\ No newline at end of file
diff --git a/spidy/config/rivermont.cfg b/spidy/config/rivermont.cfg
deleted file mode 100644
index b942436..0000000
--- a/spidy/config/rivermont.cfg
+++ /dev/null
@@ -1,20 +0,0 @@
-THREAD_COUNT = 8
-OVERWRITE = False
-RAISE_ERRORS = False
-SAVE_PAGES = True
-ZIP_FILES = False
-OVERRIDE_SIZE = False
-SAVE_WORDS = False
-RESTRICT = False
-DOMAIN = ''
-RESPECT_ROBOTS = False
-TODO_FILE = 'crawler_todo.txt'
-DONE_FILE = 'crawler_done.txt'
-WORD_FILE = 'crawler_words.txt'
-SAVE_COUNT = 100
-HEADER = HEADERS['spidy']
-MAX_NEW_ERRORS = 5
-MAX_KNOWN_ERRORS = 20
-MAX_HTTP_ERRORS = 20
-MAX_NEW_MIMES = 10
-START = ['http://24.40.136.85/']
diff --git a/spidy/config/wsj.cfg b/spidy/config/wsj.cfg
index 5a5ed40..d03ad06 100644
--- a/spidy/config/wsj.cfg
+++ b/spidy/config/wsj.cfg
@@ -12,14 +12,19 @@ RESTRICT = True
# The domain within which to restrict crawling.
DOMAIN = 'wsj.com/'
+# Pages and subdomains that are out of scope for this crawl
+OUT_OF_SCOPE = ['wsj.com/business/airlines', 'africa.wsj.com']
+
RESPECT_ROBOTS = True
TODO_FILE = 'wsj_todo.txt'
DONE_FILE = 'wsj_done.txt'
WORD_FILE = 'wsj_words.txt'
SAVE_COUNT = 60
HEADER = HEADERS['spidy']
+USE_BROWSER = False
MAX_NEW_ERRORS = 100
MAX_KNOWN_ERRORS = 100
MAX_HTTP_ERRORS = 100
MAX_NEW_MIMES = 5
+MAX_TIME = float('inf')
START = ['https://www.wsj.com/']
diff --git a/spidy/crawler.py b/spidy/crawler.py
index 19547c8..9075848 100755
--- a/spidy/crawler.py
+++ b/spidy/crawler.py
@@ -3,6 +3,7 @@
spidy Web Crawler
Built by rivermont and FalconWarriorr
"""
+import argparse
import time
import shutil
import requests
@@ -16,6 +17,10 @@
from lxml import etree
from lxml.html import iterlinks, resolve_base_href, make_links_absolute
from reppy.robots import Robots
+from seleniumwire import webdriver
+from selenium.webdriver.common.alert import Alert
+from selenium.common.exceptions import TimeoutException, UnexpectedAlertPresentException, WebDriverException
+from types import SimpleNamespace
try:
from spidy import __version__
@@ -50,13 +55,13 @@ def get_full_time():
except OSError:
pass # Assumes only OSError will complain if /logs already exists
-LOG_FILE = open(path.join(WORKING_DIR, 'logs', 'spidy_log_{0}.txt'.format(START_TIME)),
+LOG_FILE = open(path.join(WORKING_DIR, 'logs', f'spidy_log_{START_TIME}.txt'),
'w+', encoding='utf-8', errors='ignore')
-LOG_FILE_NAME = path.join('logs', 'spidy_log_{0}'.format(START_TIME))
+LOG_FILE_NAME = path.join('logs', f'spidy_log_{START_TIME}')
# Error log location
-ERR_LOG_FILE = path.join(WORKING_DIR, 'logs', 'spidy_error_log_{0}.txt'.format(START_TIME))
-ERR_LOG_FILE_NAME = path.join('logs', 'spidy_error_log_{0}.txt'.format(START_TIME))
+ERR_LOG_FILE = path.join(WORKING_DIR, 'logs', f'spidy_error_log_{START_TIME}.txt')
+ERR_LOG_FILE_NAME = path.join('logs', f'spidy_error_log_{START_TIME}.txt')
LOGGER = logging.getLogger('SPIDY')
LOGGER.setLevel(logging.DEBUG)
@@ -101,15 +106,14 @@ def write_log(operation, message, package='spidy', status='INFO', worker=0):
"""
global LOG_FILE, log_mutex
with log_mutex:
- message = '[{0}] [{1}] [WORKER #{2}] [{3}] [{4}]: {5}'\
- .format(get_time(), package, str(worker), operation, status, message)
+ message = f'[{get_time()}] [{package}] [WORKER #{str(worker)}] [{operation}] [{status}]: {message}'
print(message)
if not LOG_FILE.closed:
LOG_FILE.write('\n' + message)
-write_log('INIT', 'Starting spidy Web Crawler version {0}'.format(VERSION))
-write_log('INIT', 'Report any problems to GitHub at https://github.com/rivermont/spidy')
+write_log('INIT', f'Starting spidy Web Crawler version {VERSION}')
+write_log('INIT', 'Report any problems on GitHub at https://github.com/rivermont/spidy/issues')
###########
@@ -214,8 +218,7 @@ def _lookup(self, url):
def _remember(self, url):
urlparsed = urllib.parse.urlparse(url)
robots_url = urlparsed.scheme + '://' + urlparsed.netloc + '/robots.txt'
- write_log('ROBOTS',
- 'Reading robots.txt file at: {0}'.format(robots_url),
+ write_log('ROBOTS', f'Reading robots.txt file at: {robots_url}',
package='reppy')
robots = Robots.fetch(robots_url)
checker = robots.agent(self.user_agent)
@@ -228,9 +231,8 @@ def _remember(self, url):
write_log('INIT', 'Creating functions...')
-
-def crawl(url, thread_id=0):
- global WORDS, OVERRIDE_SIZE, HEADER, SAVE_PAGES, SAVE_WORDS
+def crawl(url, browser, thread_id=0):
+ global WORDS, OVERRIDE_SIZE, HEADER, SAVE_PAGES, SAVE_WORDS, KNOWN_ERROR_COUNT
if not OVERRIDE_SIZE:
try:
# Attempt to get the size in bytes of the document
@@ -241,7 +243,27 @@ def crawl(url, thread_id=0):
raise SizeError
# If the SizeError is raised it will be caught in the except block in the run section,
# and the following code will not be run.
- page = requests.get(url, headers=HEADER) # Get page
+    r = requests.get(url, headers=HEADER)  # Get page; also supplies the HTTP headers when a browser renders it
+
+    if browser is None:
+        page = r  # Get page
+    else:
+        try:
+            # Render the page in the headless browser so dynamically loaded content is captured
+            browser.get(url)
+            page = SimpleNamespace(text=browser.page_source, content=browser.page_source.encode('utf-8'), headers=r.headers)
+        except TimeoutException:
+            KNOWN_ERROR_COUNT.increment()
+            return []
+        except UnexpectedAlertPresentException:
+            # Reload the page, accept the alert, and use whatever source is then available
+            browser.get(url)
+            alert = Alert(browser)
+            alert.accept()
+            page = SimpleNamespace(text=browser.page_source, content=browser.page_source.encode('utf-8'), headers=r.headers)
+            KNOWN_ERROR_COUNT.increment()
+        except WebDriverException:
+            KNOWN_ERROR_COUNT.increment()
+            return []
+
word_list = []
doctype = get_mime_type(page)
if doctype.find('image') < 0 and doctype.find('video') < 0:
@@ -262,12 +284,11 @@ def crawl(url, thread_id=0):
save_page(url, page)
if SAVE_WORDS:
# Announce which link was crawled
- write_log('CRAWL', 'Found {0} links and {1} words on {2}'.format(len(links), len(word_list), url),
+ write_log('CRAWL', f'Found {len(links)} links and {len(word_list)} words on {url}',
worker=thread_id)
else:
# Announce which link was crawled
- write_log('CRAWL', 'Found {0} links on {1}'.format(len(links), url),
- worker=thread_id)
+ write_log('CRAWL', f'Found {len(links)} links on {url}', worker=thread_id)
return links
@@ -277,15 +298,28 @@ def crawl_worker(thread_id, robots_index):
"""
# Declare global variables
- global VERSION, START_TIME, START_TIME_LONG
+ global VERSION, START_TIME, START_TIME_LONG, MAX_TIME
global LOG_FILE, LOG_FILE_NAME, ERR_LOG_FILE_NAME
- global HEADER, WORKING_DIR, KILL_LIST
+ global HEADER, USE_BROWSER, WORKING_DIR, KILL_LIST
global COUNTER, NEW_ERROR_COUNT, KNOWN_ERROR_COUNT, HTTP_ERROR_COUNT, NEW_MIME_COUNT
global MAX_NEW_ERRORS, MAX_KNOWN_ERRORS, MAX_HTTP_ERRORS, MAX_NEW_MIMES
global USE_CONFIG, OVERWRITE, RAISE_ERRORS, ZIP_FILES, OVERRIDE_SIZE, SAVE_WORDS, SAVE_PAGES, SAVE_COUNT
global TODO_FILE, DONE_FILE, ERR_LOG_FILE, WORD_FILE
- global RESPECT_ROBOTS, RESTRICT, DOMAIN
- global WORDS, TODO, DONE, THREAD_RUNNING
+ global RESPECT_ROBOTS, RESTRICT, DOMAIN, OUT_OF_SCOPE
+ global WORDS, TODO, DONE
+ global FOUND_URLS
+
+    browser = None
+    if USE_BROWSER:
+        # Give this worker its own headless Firefox instance, driven through selenium-wire
+        browser_options = webdriver.FirefoxOptions()
+        browser_options.add_argument('--headless')
+
+        browser = webdriver.Firefox(options=browser_options)
+
+        browser.request_interceptor = interceptor  # Rewrite request headers to the user-selected HEADER
+        browser.implicitly_wait(10)
+        browser.set_page_load_timeout(10)
+        webdriver.DesiredCapabilities.FIREFOX["unexpectedAlertBehaviour"] = "accept"
while THREAD_RUNNING:
# Check if there are more urls to crawl
@@ -314,12 +348,16 @@ def crawl_worker(thread_id, robots_index):
write_log('CRAWL', 'Too many errors have accumulated; stopping crawler.')
done_crawling()
break
+ elif time.time() - START_TIME >= MAX_TIME: # If too much time has passed
+                write_log('CRAWL', 'Maximum crawl time has been exceeded; stopping crawler.')
+ done_crawling()
+ break
elif COUNTER.val >= SAVE_COUNT: # If it's time for an autosave
# Make sure only one thread saves files
with save_mutex:
if COUNTER.val > 0:
try:
- write_log('CRAWL', 'Queried {0} links.'.format(str(COUNTER.val)), worker=thread_id)
+ write_log('CRAWL', f'Queried {str(COUNTER.val)} links.', worker=thread_id)
info_log()
write_log('SAVE', 'Saving files...')
save_files()
@@ -338,7 +376,7 @@ def crawl_worker(thread_id, robots_index):
else:
if check_link(url, robots_index): # If the link is invalid
continue
- links = crawl(url, thread_id)
+ links = crawl(url, browser, thread_id)
for link in links:
# Skip empty links
if len(link) <= 0 or link == "/":
@@ -356,8 +394,8 @@ def crawl_worker(thread_id, robots_index):
except Exception as e:
link = url
- write_log('CRAWL', 'An error was raised trying to process {0}'
- .format(link), status='ERROR', worker=thread_id)
+ write_log('CRAWL', f'An error was raised trying to process {link}',
+ status='ERROR', worker=thread_id)
err_mro = type(e).mro()
if SizeError in err_mro:
@@ -406,7 +444,7 @@ def crawl_worker(thread_id, robots_index):
elif 'Unknown MIME type' in str(e):
NEW_MIME_COUNT.increment()
- write_log('ERROR', 'Unknown MIME type: {0}'.format(str(e)[18:]), worker=thread_id)
+ write_log('ERROR', f'Unknown MIME type: {str(e)[18:]}', worker=thread_id)
err_log(link, 'Unknown MIME', e)
else: # Any other error
@@ -434,7 +472,8 @@ def check_link(item, robots_index=None):
if robots_index and not robots_index.is_allowed(item):
return True
if RESTRICT:
- if DOMAIN not in item:
+        # Splitting an absolute URL on '/' gives ['http(s):', '', '[sub.]domain', 'dir', ...], so index 2 is the host
+        parts = item.split('/')
+        if len(parts) < 3 or DOMAIN.strip('/') not in parts[2]:
return True
if len(item) < 10 or len(item) > 255:
return True
@@ -443,6 +482,17 @@ def check_link(item, robots_index=None):
return True
elif item in copy(DONE.queue):
return True
+
+    # Prune links that match any entry in the out-of-scope list (domains, subdomains, or paths)
+    for scope in OUT_OF_SCOPE:
+        if scope in item:
+            return True
+
+    # Prune links that have already been seen
+    if item in FOUND_URLS:
+        return True
+
+    FOUND_URLS.add(item)  # Remember this URL so it is only queued once
return False
@@ -498,7 +548,7 @@ def save_files():
todoList.write(site + '\n') # Save TODO list
except UnicodeError:
continue
- write_log('SAVE', 'Saved TODO list to {0}'.format(TODO_FILE))
+ write_log('SAVE', f'Saved TODO list to {TODO_FILE}')
with open(DONE_FILE, 'w', encoding='utf-8', errors='ignore') as done_list:
for site in copy(DONE.queue):
@@ -506,7 +556,7 @@ def save_files():
done_list.write(site + '\n') # Save done list
except UnicodeError:
continue
- write_log('SAVE', 'Saved DONE list to {0}'.format(TODO_FILE))
+ write_log('SAVE', f'Saved DONE list to {TODO_FILE}')
if SAVE_WORDS:
update_file(WORD_FILE, WORDS.get_all(), 'words')
@@ -549,7 +599,7 @@ def mime_lookup(value):
elif value == '':
return '.html'
else:
- raise HeaderError('Unknown MIME type: {0}'.format(value))
+ raise HeaderError(f'Unknown MIME type: {value}')
def save_page(url, page):
@@ -559,15 +609,15 @@ def save_page(url, page):
# Make file path
ext = mime_lookup(get_mime_type(page))
cropped_url = make_file_path(url, ext)
- file_path = path.join(WORKING_DIR, 'saved', '{0}'.format(cropped_url))
+ file_path = path.join(WORKING_DIR, 'saved', cropped_url)
# Save file
with open(file_path, 'w', encoding='utf-8', errors='ignore') as file:
if ext == '.html':
- file.write('''
+ file.write(f'''
-'''.format(url))
+''')
file.write(page.text)
@@ -583,7 +633,7 @@ def update_file(file, content, file_type):
for item in content:
open_file.write('\n' + str(item)) # Write all words to file
open_file.truncate() # Delete everything in file beyond what has been written (old stuff)
- write_log('SAVE', 'Saved {0} {1} to {2}'.format(len(content), file_type, file))
+ write_log('SAVE', f'Saved {len(content)} {file_type} to {file}')
def info_log():
@@ -591,16 +641,16 @@ def info_log():
Logs important information to the console and log file.
"""
# Print to console
- write_log('LOG', 'Started at {0}'.format(START_TIME_LONG))
- write_log('LOG', 'Log location: {0}'.format(LOG_FILE_NAME))
- write_log('LOG', 'Error log location: {0}'.format(ERR_LOG_FILE_NAME))
- write_log('LOG', '{0} links in TODO'.format(TODO.qsize()))
- write_log('LOG', '{0} links in DONE'.format(DONE.qsize()))
- write_log('LOG', 'TODO/DONE: {0}'.format(TODO.qsize() / DONE.qsize()))
- write_log('LOG', '{0}/{1} new errors caught.'.format(NEW_ERROR_COUNT.val, MAX_NEW_ERRORS))
- write_log('LOG', '{0}/{1} HTTP errors encountered.'.format(HTTP_ERROR_COUNT.val, MAX_HTTP_ERRORS))
- write_log('LOG', '{0}/{1} new MIMEs found.'.format(NEW_MIME_COUNT.val, MAX_NEW_MIMES))
- write_log('LOG', '{0}/{1} known errors caught.'.format(KNOWN_ERROR_COUNT.val, MAX_KNOWN_ERRORS))
+ write_log('LOG', f'Started at {START_TIME_LONG}')
+ write_log('LOG', f'Log location: {LOG_FILE_NAME}')
+ write_log('LOG', f'Error log location: {ERR_LOG_FILE_NAME}')
+ write_log('LOG', f'{TODO.qsize()} links in TODO')
+ write_log('LOG', f'{DONE.qsize()} links in DONE')
+ write_log('LOG', f'TODO/DONE: {TODO.qsize() / DONE.qsize()}')
+ write_log('LOG', f'{NEW_ERROR_COUNT.val}/{MAX_NEW_ERRORS} new errors caught.')
+ write_log('LOG', f'{HTTP_ERROR_COUNT.val}/{MAX_HTTP_ERRORS} HTTP errors encountered.')
+ write_log('LOG', f'{NEW_MIME_COUNT.val}/{MAX_NEW_MIMES} new MIMEs found.')
+ write_log('LOG', f'{KNOWN_ERROR_COUNT.val}/{MAX_KNOWN_ERRORS} known errors caught.')
def log(message, level=logging.DEBUG):
@@ -622,7 +672,7 @@ def handle_invalid_input(type_='input. (yes/no)'):
"""
Handles an invalid user input, usually from the input() function.
"""
- write_log('INIT', 'Please enter a valid {0}'.format(type_), status='ERROR')
+ write_log('INIT', f'Please enter a valid {type_}', status='ERROR')
# could raise InputError but this means the user must go through the whole init process again
@@ -632,7 +682,7 @@ def err_log(url, error1, error2):
error1 is the trimmed error source.
error2 is the extended text of the error.
"""
- LOGGER.error("\nURL: {0}\nERROR: {1}\nEXT: {2}\n\n".format(url, error1, str(error2)))
+ LOGGER.error(f"\nURL: {url}\nERROR: {error1}\nEXT: {str(error2)}\n\n")
def zip_saved_files(out_file_name, directory):
@@ -642,7 +692,7 @@ def zip_saved_files(out_file_name, directory):
shutil.make_archive(str(out_file_name), 'zip', directory) # Zips files
shutil.rmtree(directory) # Deletes folder
makedirs(directory) # Creates empty folder of same name
- write_log('SAVE', 'Zipped documents to {0}.zip'.format(out_file_name))
+ write_log('SAVE', f'Zipped documents to {out_file_name}.zip')
########
@@ -818,9 +868,10 @@ def zip_saved_files(out_file_name, directory):
# Initialize variables as empty that will be needed in the global scope
HEADER = {}
-SAVE_COUNT, MAX_NEW_ERRORS, MAX_KNOWN_ERRORS, MAX_HTTP_ERRORS = 0, 0, 0, 0
+USE_BROWSER = False
+SAVE_COUNT, MAX_NEW_ERRORS, MAX_KNOWN_ERRORS, MAX_HTTP_ERRORS, MAX_TIME = 0, 0, 0, 0, float('inf')
MAX_NEW_MIMES = 0
-RESPECT_ROBOTS, RESTRICT, DOMAIN = False, False, ''
+RESPECT_ROBOTS, RESTRICT, DOMAIN, OUT_OF_SCOPE = False, False, '', []
USE_CONFIG, OVERWRITE, RAISE_ERRORS, ZIP_FILES, OVERRIDE_SIZE = False, False, False, False, False
SAVE_PAGES, SAVE_WORDS = False, False
TODO_FILE, DONE_FILE, WORD_FILE = '', '', ''
@@ -830,9 +881,10 @@ def zip_saved_files(out_file_name, directory):
save_mutex = threading.Lock()
FINISHED = False
THREAD_RUNNING = True
+FOUND_URLS = set()
-def init():
+def init(arg_file=None):
"""
Sets all of the variables for spidy,
and as a result can be used for effectively resetting the crawler.
@@ -840,44 +892,28 @@ def init():
# Declare global variables
global VERSION, START_TIME, START_TIME_LONG
global LOG_FILE, LOG_FILE_NAME, ERR_LOG_FILE_NAME
- global HEADER, PACKAGE_DIR, WORKING_DIR, KILL_LIST
+ global HEADER, USE_BROWSER, WORKING_DIR, KILL_LIST
global COUNTER, NEW_ERROR_COUNT, KNOWN_ERROR_COUNT, HTTP_ERROR_COUNT, NEW_MIME_COUNT
global MAX_NEW_ERRORS, MAX_KNOWN_ERRORS, MAX_HTTP_ERRORS, MAX_NEW_MIMES
global USE_CONFIG, OVERWRITE, RAISE_ERRORS, ZIP_FILES, OVERRIDE_SIZE, SAVE_WORDS, SAVE_PAGES, SAVE_COUNT
global TODO_FILE, DONE_FILE, ERR_LOG_FILE, WORD_FILE
- global RESPECT_ROBOTS, RESTRICT, DOMAIN
- global WORDS, TODO, DONE, THREAD_COUNT
+ global RESPECT_ROBOTS, RESTRICT, DOMAIN, OUT_OF_SCOPE
+    global WORDS, TODO, DONE, THREAD_COUNT
+ global FOUND_URLS
# Getting Arguments
if not path.exists(path.join(PACKAGE_DIR, 'config')):
write_log('INIT', 'No config folder available.')
USE_CONFIG = False
- else:
- write_log('INIT', 'Should spidy load settings from an available config file? (y/n):', status='INPUT')
- while True:
- input_ = input()
- if not bool(input_): # Use default value
- USE_CONFIG = False
- break
- elif input_ in yes:
- USE_CONFIG = True
- break
- elif input_ in no:
- USE_CONFIG = False
- break
- else:
- handle_invalid_input()
-
- if USE_CONFIG:
+ elif arg_file:
write_log('INIT', 'Config file name:', status='INPUT')
while True:
- input_ = input()
try:
- if input_[-4:] == '.cfg':
- file_path = path.join(PACKAGE_DIR, 'config', input_)
+ if arg_file[-4:] == '.cfg':
+ file_path = path.join(PACKAGE_DIR, 'config', arg_file)
else:
- file_path = path.join(PACKAGE_DIR, 'config', '{0}.cfg'.format(input_))
+ file_path = path.join(PACKAGE_DIR, 'config', '{0}.cfg'.format(arg_file))
write_log('INIT', 'Loading configuration settings from {0}'.format(file_path))
with open(file_path, 'r', encoding='utf-8', errors='ignore') as file:
for line in file.readlines():
@@ -888,266 +924,317 @@ def init():
# raise FileNotFoundError()
write_log('INIT', 'Please name a valid .cfg file.')
-
else:
- write_log('INIT', 'Please enter the following arguments. Leave blank to use the default values.')
-
- write_log('INIT', 'How many parallel threads should be used for crawler? (Default: 1):', status='INPUT')
- while True:
- input_ = input()
- if not bool(input_):
- THREAD_COUNT = 1
- break
- elif input_.isdigit():
- THREAD_COUNT = int(input_)
- break
- else:
- handle_invalid_input('integer.')
-
- write_log('INIT', 'Should spidy load from existing save files? (y/n) (Default: Yes):', status='INPUT')
+ write_log('INIT', 'Should spidy load settings from an available config file? (y/n):', status='INPUT')
while True:
input_ = input()
- if not bool(input_):
- OVERWRITE = False
+ if not bool(input_): # Use default value
+ USE_CONFIG = False
break
elif input_ in yes:
- OVERWRITE = False
+ USE_CONFIG = True
break
elif input_ in no:
- OVERWRITE = True
+ USE_CONFIG = False
break
else:
handle_invalid_input()
- write_log('INIT', 'Should spidy raise NEW errors and stop crawling? (y/n) (Default: No):', status='INPUT')
- while True:
- input_ = input()
- if not bool(input_):
- RAISE_ERRORS = False
- break
- elif input_ in yes:
- RAISE_ERRORS = True
- break
- elif input_ in no:
- RAISE_ERRORS = False
- break
- else:
- handle_invalid_input()
+ if arg_file is None:
+ if USE_CONFIG:
+ write_log('INIT', 'Config file name:', status='INPUT')
+ while True:
+ input_ = input()
+ try:
+ if input_[-4:] == '.cfg':
+ file_path = path.join(PACKAGE_DIR, 'config', input_)
+ else:
+ file_path = path.join(PACKAGE_DIR, 'config', '{0}.cfg'.format(input_))
+ write_log('INIT', 'Loading configuration settings from {0}'.format(file_path))
+ with open(file_path, 'r', encoding='utf-8', errors='ignore') as file:
+ for line in file.readlines():
+ exec(line, globals())
+ break
+ except FileNotFoundError:
+ write_log('INIT', 'Config file not found.', status='ERROR')
+ # raise FileNotFoundError()
+
+ write_log('INIT', 'Please name a valid .cfg file.')
- write_log('INIT', 'Should spidy save the pages it scrapes to the saved folder? (y/n) (Default: Yes):', status='INPUT')
- while True:
- input_ = input()
- if not bool(input_):
- SAVE_PAGES = True
- break
- elif input_ in yes:
- SAVE_PAGES = True
- break
- elif input_ in no:
- SAVE_PAGES = False
- break
- else:
- handle_invalid_input()
+ else:
+ write_log('INIT', 'Please enter the following arguments. Leave blank to use the default values.')
+
+ write_log('INIT', 'How many parallel threads should be used for crawler? (Default: 1):', status='INPUT')
+ while True:
+ input_ = input()
+ if not bool(input_):
+ THREAD_COUNT = 1
+ break
+ elif input_.isdigit():
+ THREAD_COUNT = int(input_)
+ break
+ else:
+ handle_invalid_input('integer.')
- if SAVE_PAGES:
- write_log('INIT', 'Should spidy zip saved documents when autosaving? (y/n) (Default: No):', status='INPUT')
+ write_log('INIT', 'Should spidy load from existing save files? (y/n) (Default: Yes):', status='INPUT')
while True:
input_ = input()
if not bool(input_):
- ZIP_FILES = False
+ OVERWRITE = False
break
elif input_ in yes:
- ZIP_FILES = True
+ OVERWRITE = False
break
elif input_ in no:
- ZIP_FILES = False
+ OVERWRITE = True
break
else:
handle_invalid_input()
- else:
- ZIP_FILES = False
- write_log('INIT', 'Should spidy download documents larger than 500 MB? (y/n) (Default: No):', status='INPUT')
- while True:
- input_ = input()
- if not bool(input_):
- OVERRIDE_SIZE = False
- break
- elif input_ in yes:
- OVERRIDE_SIZE = True
- break
- elif input_ in no:
- OVERRIDE_SIZE = False
- break
- else:
- handle_invalid_input()
+ write_log('INIT', 'Should spidy raise NEW errors and stop crawling? (y/n) (Default: No):', status='INPUT')
+ while True:
+ input_ = input()
+ if not bool(input_):
+ RAISE_ERRORS = False
+ break
+ elif input_ in yes:
+ RAISE_ERRORS = True
+ break
+ elif input_ in no:
+ RAISE_ERRORS = False
+ break
+ else:
+ handle_invalid_input()
- write_log('INIT', 'Should spidy scrape words and save them? (y/n) (Default: Yes):', status='INPUT')
- while True:
- input_ = input()
- if not bool(input_):
- SAVE_WORDS = True
- break
- elif input_ in yes:
- SAVE_WORDS = True
- break
- elif input_ in no:
- SAVE_WORDS = False
- break
- else:
- handle_invalid_input()
+ write_log('INIT', 'Should spidy save the pages it scrapes to the saved folder? (y/n) (Default: Yes):', status='INPUT')
+ while True:
+ input_ = input()
+ if not bool(input_):
+ SAVE_PAGES = True
+ break
+ elif input_ in yes:
+ SAVE_PAGES = True
+ break
+ elif input_ in no:
+ SAVE_PAGES = False
+ break
+ else:
+ handle_invalid_input()
- write_log('INIT', 'Should spidy restrict crawling to a specific domain only? (y/n) (Default: No):',
- status='INPUT')
- while True:
- input_ = input()
- if not bool(input_):
- RESTRICT = False
- break
- elif input_ in yes:
- RESTRICT = True
- break
- elif input_ in no:
- RESTRICT = False
- break
+ if SAVE_PAGES:
+ write_log('INIT', 'Should spidy zip saved documents when autosaving? (y/n) (Default: No):', status='INPUT')
+ while True:
+ input_ = input()
+ if not bool(input_):
+ ZIP_FILES = False
+ break
+ elif input_ in yes:
+ ZIP_FILES = True
+ break
+ elif input_ in no:
+ ZIP_FILES = False
+ break
+ else:
+ handle_invalid_input()
else:
- handle_invalid_input()
+ ZIP_FILES = False
- if RESTRICT:
- write_log('INIT', 'What domain should crawling be limited to? Can be subdomains, http/https, etc.',
- status='INPUT')
+ write_log('INIT', 'Should spidy download documents larger than 500 MB? (y/n) (Default: No):', status='INPUT')
while True:
input_ = input()
- try:
- DOMAIN = input_
+ if not bool(input_):
+ OVERRIDE_SIZE = False
break
- except KeyError:
- handle_invalid_input('string.')
+ elif input_ in yes:
+ OVERRIDE_SIZE = True
+ break
+ elif input_ in no:
+ OVERRIDE_SIZE = False
+ break
+ else:
+ handle_invalid_input()
- write_log('INIT', 'Should spidy respect sites\' robots.txt? (y/n) (Default: Yes):', status='INPUT')
- while True:
- input_ = input()
- if not bool(input_):
- RESPECT_ROBOTS = True
- break
- elif input_ in yes:
- RESPECT_ROBOTS = True
- break
- elif input_ in no:
- RESPECT_ROBOTS = False
- break
- else:
- handle_invalid_input()
+ write_log('INIT', 'Should spidy scrape words and save them? (y/n) (Default: Yes):', status='INPUT')
+ while True:
+ input_ = input()
+ if not bool(input_):
+ SAVE_WORDS = True
+ break
+ elif input_ in yes:
+ SAVE_WORDS = True
+ break
+ elif input_ in no:
+ SAVE_WORDS = False
+ break
+ else:
+ handle_invalid_input()
- write_log('INIT', 'What HTTP browser headers should spidy imitate?', status='INPUT')
- write_log('INIT', 'Choices: spidy (default), Chrome, Firefox, IE, Edge, Custom:', status='INPUT')
- while True:
- input_ = input()
- if not bool(input_):
- HEADER = HEADERS['spidy']
- break
- elif input_.lower() == 'custom':
- # Here we just trust that the user is inputting valid headers...
- write_log('INIT', 'Valid HTTP headers:', status='INPUT')
- HEADER = input()
- break
- else:
- try:
- HEADER = HEADERS[input_]
+ write_log('INIT', 'Should spidy restrict crawling to a specific domain only? (y/n) (Default: No):',
+ status='INPUT')
+ while True:
+ input_ = input()
+ if not bool(input_):
+ RESTRICT = False
+ break
+ elif input_ in yes:
+ RESTRICT = True
+ break
+ elif input_ in no:
+ RESTRICT = False
break
- except KeyError:
- handle_invalid_input('browser name.')
+ else:
+ handle_invalid_input()
- write_log('INIT', 'Location of the TODO save file (Default: crawler_todo.txt):', status='INPUT')
- input_ = input()
- if not bool(input_):
- TODO_FILE = 'crawler_todo.txt'
- else:
- TODO_FILE = input_
+ if RESTRICT:
+ write_log('INIT', 'What domain should crawling be limited to? Can be subdomains, http/https, etc.',
+ status='INPUT')
+ while True:
+ input_ = input()
+ try:
+ DOMAIN = input_
+ break
+ except KeyError:
+ handle_invalid_input('string.')
+
+ write_log('INIT', 'Should spidy respect sites\' robots.txt? (y/n) (Default: Yes):', status='INPUT')
+ while True:
+ input_ = input()
+ if not bool(input_):
+ RESPECT_ROBOTS = True
+ break
+ elif input_ in yes:
+ RESPECT_ROBOTS = True
+ break
+ elif input_ in no:
+ RESPECT_ROBOTS = False
+ break
+ else:
+ handle_invalid_input()
- write_log('INIT', 'Location of the DONE save file (Default: crawler_done.txt):', status='INPUT')
- input_ = input()
- if not bool(input_):
- DONE_FILE = 'crawler_done.txt'
- else:
- DONE_FILE = input_
+ write_log('INIT', 'What HTTP browser headers should spidy imitate?', status='INPUT')
+ write_log('INIT', 'Choices: spidy (default), Chrome, Firefox, IE, Edge, Custom:', status='INPUT')
+ while True:
+ input_ = input()
+ if not bool(input_):
+ HEADER = HEADERS['spidy']
+ break
+ elif input_.lower() == 'custom':
+ # Here we just trust that the user is inputting valid headers...
+ write_log('INIT', 'Valid HTTP headers:', status='INPUT')
+ HEADER = input()
+ break
+ else:
+ try:
+ HEADER = HEADERS[input_]
+ break
+ except KeyError:
+ handle_invalid_input('browser name.')
- if SAVE_WORDS:
- write_log('INIT', 'Location of the words save file (Default: crawler_words.txt):', status='INPUT')
+            write_log('INIT', 'Should spidy use a headless browser? (y/n) (Default: No):', status='INPUT')
+            while True:
+                input_ = input()
+                if not bool(input_):
+                    USE_BROWSER = False
+                    break
+ elif input_ in yes:
+ USE_BROWSER = True
+ break
+ elif input_ in no:
+ USE_BROWSER = False
+ break
+ else:
+ handle_invalid_input()
+
+ write_log('INIT', 'Location of the TODO save file (Default: crawler_todo.txt):', status='INPUT')
input_ = input()
if not bool(input_):
- WORD_FILE = 'crawler_words.txt'
+ TODO_FILE = 'crawler_todo.txt'
else:
- WORD_FILE = input_
- else:
- WORD_FILE = 'None'
+ TODO_FILE = input_
- write_log('INIT', 'After how many queried links should the crawler autosave? (Default: 100):', status='INPUT')
- while True:
+ write_log('INIT', 'Location of the DONE save file (Default: crawler_done.txt):', status='INPUT')
input_ = input()
if not bool(input_):
- SAVE_COUNT = 100
- break
- elif input_.isdigit():
- SAVE_COUNT = int(input_)
- break
+ DONE_FILE = 'crawler_done.txt'
else:
- handle_invalid_input('integer.')
+ DONE_FILE = input_
- if not RAISE_ERRORS:
- write_log('INIT', 'After how many new errors should spidy stop? (Default: 5):', status='INPUT')
+ if SAVE_WORDS:
+ write_log('INIT', 'Location of the words save file (Default: crawler_words.txt):', status='INPUT')
+ input_ = input()
+ if not bool(input_):
+ WORD_FILE = 'crawler_words.txt'
+ else:
+ WORD_FILE = input_
+ else:
+ WORD_FILE = 'None'
+
+ write_log('INIT', 'After how many queried links should the crawler autosave? (Default: 100):', status='INPUT')
while True:
input_ = input()
if not bool(input_):
- MAX_NEW_ERRORS = 5
+ SAVE_COUNT = 100
break
elif input_.isdigit():
- MAX_NEW_ERRORS = int(input_)
+ SAVE_COUNT = int(input_)
break
else:
handle_invalid_input('integer.')
- else:
- MAX_NEW_ERRORS = 1
- write_log('INIT', 'After how many known errors should spidy stop? (Default: 10):', status='INPUT')
- while True:
- input_ = input()
- if not bool(input_):
- MAX_KNOWN_ERRORS = 20
- break
- elif input_.isdigit():
- MAX_KNOWN_ERRORS = int(input_)
- break
+ if not RAISE_ERRORS:
+ write_log('INIT', 'After how many new errors should spidy stop? (Default: 5):', status='INPUT')
+ while True:
+ input_ = input()
+ if not bool(input_):
+ MAX_NEW_ERRORS = 5
+ break
+ elif input_.isdigit():
+ MAX_NEW_ERRORS = int(input_)
+ break
+ else:
+ handle_invalid_input('integer.')
else:
- handle_invalid_input('integer.')
+ MAX_NEW_ERRORS = 1
- write_log('INIT', 'After how many HTTP errors should spidy stop? (Default: 20):', status='INPUT')
- while True:
- input_ = input()
- if not bool(input_):
- MAX_HTTP_ERRORS = 50
- break
- elif not input_.isdigit():
- MAX_HTTP_ERRORS = int(input_)
- break
- else:
- handle_invalid_input('integer.')
+ write_log('INIT', 'After how many known errors should spidy stop? (Default: 10):', status='INPUT')
+ while True:
+ input_ = input()
+ if not bool(input_):
+                    MAX_KNOWN_ERRORS = 10
+ break
+ elif input_.isdigit():
+ MAX_KNOWN_ERRORS = int(input_)
+ break
+ else:
+ handle_invalid_input('integer.')
- write_log('INIT', 'After encountering how many new MIME types should spidy stop? (Default: 20):',
- status='INPUT')
- while True:
- input_ = input()
- if not bool(input_):
- MAX_NEW_MIMES = 10
- break
- elif input_.isdigit():
- MAX_NEW_MIMES = int(input_)
- break
- else:
- handle_invalid_input('integer')
+ write_log('INIT', 'After how many HTTP errors should spidy stop? (Default: 20):', status='INPUT')
+ while True:
+ input_ = input()
+ if not bool(input_):
+                    MAX_HTTP_ERRORS = 20
+ break
+                elif input_.isdigit():
+ MAX_HTTP_ERRORS = int(input_)
+ break
+ else:
+ handle_invalid_input('integer.')
+
+            write_log('INIT', 'After encountering how many new MIME types should spidy stop? (Default: 10):',
+ status='INPUT')
+ while True:
+ input_ = input()
+ if not bool(input_):
+ MAX_NEW_MIMES = 10
+ break
+ elif input_.isdigit():
+ MAX_NEW_MIMES = int(input_)
+ break
+ else:
+ handle_invalid_input('integer')
- # Remove INPUT variable from memory
- del input_
+ # Remove INPUT variable from memory
+ del input_
if OVERWRITE:
write_log('INIT', 'Creating save files...')
@@ -1230,6 +1317,12 @@ def handle_keyboard_interrupt():
done_crawling(True)
+# Used by the headless browser (via selenium-wire) to apply the user-selected headers to every request
+def interceptor(request):
+    for key in HEADER:
+        # Replace rather than append: selenium-wire headers can hold duplicates
+        if key in request.headers:
+            del request.headers[key]
+        request.headers[key] = HEADER[key]
+
+
def main():
"""
The main function of spidy.
@@ -1237,16 +1330,24 @@ def main():
# Declare global variables
global VERSION, START_TIME, START_TIME_LONG
global LOG_FILE, LOG_FILE_NAME, ERR_LOG_FILE_NAME
- global HEADER, WORKING_DIR, KILL_LIST
+ global HEADER, USE_BROWSER, WORKING_DIR, KILL_LIST
global COUNTER, NEW_ERROR_COUNT, KNOWN_ERROR_COUNT, HTTP_ERROR_COUNT, NEW_MIME_COUNT
global MAX_NEW_ERRORS, MAX_KNOWN_ERRORS, MAX_HTTP_ERRORS, MAX_NEW_MIMES
global USE_CONFIG, OVERWRITE, RAISE_ERRORS, ZIP_FILES, OVERRIDE_SIZE, SAVE_WORDS, SAVE_PAGES, SAVE_COUNT
global TODO_FILE, DONE_FILE, ERR_LOG_FILE, WORD_FILE
- global RESPECT_ROBOTS, RESTRICT, DOMAIN
+ global RESPECT_ROBOTS, RESTRICT, DOMAIN, OUT_OF_SCOPE
global WORDS, TODO, DONE
+ global FOUND_URLS
try:
- init()
+        parser = argparse.ArgumentParser(prog="spidy", description="spidy Web Crawler")
+        parser.add_argument("-f", "--config-file", type=str, help="Name of the config file in spidy/config to load.", required=False)
+ args = parser.parse_args()
+
+ if args.config_file is not None:
+ init(args.config_file)
+ else:
+ init()
except KeyboardInterrupt:
handle_keyboard_interrupt()
@@ -1260,10 +1361,10 @@ def main():
with open(WORD_FILE, 'w', encoding='utf-8', errors='ignore'):
pass
- write_log('INIT', 'Successfully started spidy Web Crawler version {0}...'.format(VERSION))
+ write_log('INIT', f'Successfully started spidy Web Crawler version {VERSION}...')
LOGGER.log(logging.INFO, 'Successfully started crawler.')
- write_log('INIT', 'Using headers: {0}'.format(HEADER))
+ write_log('INIT', f'Using headers: {HEADER}')
robots_index = RobotsIndex(RESPECT_ROBOTS, HEADER['User-Agent'])
@@ -1274,6 +1375,6 @@ def main():
if __name__ == '__main__':
main()
else:
- write_log('INIT', 'Successfully imported spidy Web Crawler version {0}.'.format(VERSION))
+ write_log('INIT', f'Successfully imported spidy Web Crawler version {VERSION}.')
write_log('INIT',
'Call `crawler.main()` to start crawling, or refer to DOCS.md to see use of specific functions.')
diff --git a/spidy/docs/DOCS.md b/spidy/docs/DOCS.md
index 4e0c570..43c9939 100644
--- a/spidy/docs/DOCS.md
+++ b/spidy/docs/DOCS.md
@@ -99,17 +99,17 @@ Everything that follows is intended to be detailed information on each piece in
This section lists the custom classes in `crawler.py`.
Most are Errors or Exceptions that may be raised throughout the code.
-## `HeaderError` - ([Source](https://github.com/rivermont/spidy/blob/master/spidy/crawler.py#L120))
+## `HeaderError` - ([Source](https://github.com/rivermont/spidy/blob/master/spidy/crawler.py#L126))
Raised when there is a problem deciphering HTTP headers returned from a website.
-## `SizeError` - ([Source](https://github.com/rivermont/spidy/blob/master/spidy/crawler.py#L127))
+## `SizeError` - ([Source](https://github.com/rivermont/spidy/blob/master/spidy/crawler.py#L133))
Raised when a file is too large to download in an acceptable time.
# Functions
This section lists the functions in `crawler.py` that are used throughout the code.
-## `check_link` - ([Source](https://github.com/rivermont/spidy/blob/master/spidy/crawler.py#L399))
+## `check_link` - ([Source](https://github.com/rivermont/spidy/blob/master/spidy/crawler.py#L464))
Determines whether links should be crawled.
Types of links that will be pruned:
@@ -118,34 +118,34 @@ Types of links that will be pruned:
- Links that have already been crawled.
- Links in [`KILL_LIST`](https://github.com/rivermont/spidy/blob/master/spidy/docs/DOCS.md#kill_list--source).
-## `check_path` - ([Source](https://github.com/rivermont/spidy/blob/master/spidy/crawler.py#L433))
+## `check_path` - ([Source](https://github.com/rivermont/spidy/blob/master/spidy/crawler.py#L510))
Checks whether a file path will cause errors when saving.
Paths longer than 256 characters cannot be saved (Windows).
-## `check_word` - ([Source](https://github.com/rivermont/spidy/blob/master/spidy/crawler.py#L421))
+## `check_word` - ([Source](https://github.com/rivermont/spidy/blob/master/spidy/crawler.py#L498))
Checks whether a word is valid.
The word-saving feature was originally added to be used for password cracking with hashcat, which is why `check_word` checks for length of less than 16 characters.
The average password length is around 8 characters.
-## `crawl` - ([Source](https://github.com/rivermont/spidy/blob/master/spidy/crawler.py#L190))
+## `crawl` - ([Source](https://github.com/rivermont/spidy/blob/master/spidy/crawler.py#L253))
Does all of the crawling, scraping, scraping of a single document.
-## `err_log` - ([Source](https://github.com/rivermont/spidy/blob/master/spidy/crawler.py#L601))
+## `err_log` - ([Source](https://github.com/rivermont/spidy/blob/master/spidy/crawler.py#L678))
Saves the triggering error to the log file.
-## `get_mime_type` - ([Source](https://github.com/rivermont/spidy/blob/master/spidy/crawler.py#L500))
+## `get_mime_type` - ([Source](https://github.com/rivermont/spidy/blob/master/spidy/crawler.py#L577))
Extracts the Content-Type header from the headers returned by page.
-## `get_time` - ([Source](https://github.com/rivermont/spidy/blobl/master/spidy/crawler.py#L29))
+## `get_time` - ([Source](https://github.com/rivermont/spidy/blob/master/spidy/crawler.py#L36))
Returns the current time in the format `HH:MM::SS`.
-## `get_full_time` - ([Source](https://github.com/rivermont/spidy/blob/master/spidy/crawler.py#L33))
+## `get_full_time` - ([Source](https://github.com/rivermont/spidy/blob/master/spidy/crawler.py#L40))
Returns the current time in the format `HH:MM:SS, Day, Mon, YYYY`.
-## `handle_keyboard_interrupt` - ([Source](https://github.com/rivermont/spidy/blob/master/spidy/crawler.py#L1137))
+## `handle_keyboard_interrupt` - ([Source](https://github.com/rivermont/spidy/blob/master/spidy/crawler.py#L1314))
Shuts down the crawler when a `KeyboardInterrupt` is performed.
-## `info_log` - ([Source](https://github.com/rivermont/spidy/blob/master/spidy/crawler.py#L561))
+## `info_log` - ([Source](https://github.com/rivermont/spidy/blob/master/spidy/crawler.py#L638))
Logs important information to the console and log file.
Example log:
@@ -164,17 +164,17 @@ Example log:
[23:17:06] [spidy] [LOG]: Saved done list to crawler_done.txt
[23:17:06] [spidy] [LOG]: Saved 90 bad links to crawler_bad.txt
-## `log` - ([Source](https://github.com/rivermont/spidy/blob/master/spidy/crawler.py#L578))
+## `log` - ([Source](https://github.com/rivermont/spidy/blob/master/spidy/crawler.py#L655))
Logs a single message to the error log file.
Prints message verbatim, so message must be formatted correctly in the function call.
-## `make_file_path` - ([Source](https://github.com/rivermont/spidy/blob/master/spidy/crawler.py#L487))
+## `make_file_path` - ([Source](https://github.com/rivermont/spidy/blob/master/spidy/crawler.py#L564))
Makes a valid Windows file path for a given url.
-## `make_words` - ([Source](https://github.com/rivermont/spidy/blob/master/spidy/crawler.py#L166))
+## `make_words` - ([Source](https://github.com/rivermont/spidy/blob/master/spidy/crawler.py#L522))
Returns a list of all the valid words (determined using [`check_word`](https://github.com/rivermont/spidy/blob/master/spidy/docs/DOCS.md#check_word--source)) on a given page.
-## `mime_lookup` - ([Source](https://github.com/rivermont/spidy/blob/master/spidy/crawler.py#L511))
+## `mime_lookup` - ([Source](https://github.com/rivermont/spidy/blob/master/spidy/crawler.py#L588))
This finds the correct file extension for a MIME type using the [`MIME_TYPES`](https://github.com/rivermont/spidy/blob/master/spidy/docs/DOCS.md#mime_types--source) dictionary.
If the MIME type is blank it defaults to `.html`, and if the MIME type is not in the dictionary a [`HeaderError`](https://github.com/rivermont/spidy/blob/master/spidy/docs/DOCS.md#headererror--source) is raised.
Usage:
@@ -183,59 +183,62 @@ Usage:
Where `value` is the MIME type.
-## `save_files` - ([Source](https://github.com/rivermont/spidy/blob/master/spidy/crawler.py#L459))
+## `save_files` - ([Source](https://github.com/rivermont/spidy/blob/master/spidy/crawler.py#L536))
Saves the TODO, DONE, word, and bad lists to their respective files.
The word and bad link lists use the same function to save space.
-## `save_page` - ([Source](https://github.com/rivermont/spidy/blob/master/spidy/crawler.py#L527))
+## `save_page` - ([Source](https://github.com/rivermont/spidy/blob/master/spidy/crawler.py#L604))
Download content of url and save to the `save` folder.
-## `update_file` - ([Source](https://github.com/rivermont/spidy/blob/master/spidy/crawler.py#L546))
+## `update_file` - ([Source](https://github.com/rivermont/spidy/blob/master/spidy/crawler.py#L623))
TODO
-## `write_log` - ([Source](https://github.com/rivermont/spidy/blob/master/spidy/crawler.py#L78)
+## `write_log` - ([Source](https://github.com/rivermont/spidy/blob/master/spidy/crawler.py#L85))
Writes message to both the console and the log file.
NOTE: Automatically adds timestamp and `[spidy]` to message, and formats message for log appropriately.
-## `zip_saved_files` - ([Source](https://github.com/rivermont/spidy/blob/master/spidy/crawler.py#L610))
+## `zip_saved_files` - ([Source](https://github.com/rivermont/spidy/blob/master/spidy/crawler.py#L687))
Zips the contents of `saved/` to a `.zip` file.
Each archive is unique, with names generated from the current time.
+## `interceptor` - ([Source](https://github.com/rivermont/spidy/blob/master/spidy/crawler.py#L1320))
+Intercepts each request made by the headless browser (via Selenium Wire) and rewrites its headers to match those selected by the user.
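+
+A minimal, self-contained sketch of how such an interceptor is wired up with Selenium Wire; the `HEADER` value here is illustrative, and `crawl_worker` in `crawler.py` performs the equivalent setup:
+
+```python
+from seleniumwire import webdriver
+
+HEADER = {'User-Agent': 'spidy'}  # illustrative; spidy uses whichever headers the user selected
+
+def interceptor(request):
+    for key in HEADER:
+        # Replace rather than append, since Selenium Wire headers may hold duplicates
+        if key in request.headers:
+            del request.headers[key]
+        request.headers[key] = HEADER[key]
+
+browser_options = webdriver.FirefoxOptions()
+browser_options.add_argument('--headless')
+browser = webdriver.Firefox(options=browser_options)
+browser.request_interceptor = interceptor  # every outgoing request now carries HEADER
+```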
+
# Global Variables
This section lists the variables in [`crawler.py`](https://github.om/rivermont/spidy/blob/master/spidy/crawler.py) that are used throughout the code.
-## `COUNTER` - ([Source](https://github.com/rivermont/spidy/blob/master/spidy/crawler.py#L774))
+## `COUNTER` - ([Source](https://github.com/rivermont/spidy/blob/master/spidy/crawler.py#L852))
Incremented each time a link is crawled.
-## `CRAWLER_DIR` - ([Source](https://github.com/rivermont/spidy/blob/master/spidy/crawler.py#L30))
+## `WORKING_DIR` - ([Source](https://github.com/rivermont/spidy/blob/master/spidy/crawler.py#L48))
The directory that `crawler.py` is located in.
-## `DOMAIN` - ([Source](https://github.com/rivermont/spidy/blob/master/spidy/crawler.py#L794))
+## `DOMAIN` - ([Source](https://github.com/rivermont/spidy/blob/master/spidy/crawler.py#L873))
The domain that crawling is restricted to if [`RESTRICT`](#restrict--source) is `True`.
-## `DONE` - ([Source](https://github.com/rivermont/spidy/blob/master/spidy/crawler.py#L798))
+## `DONE` - ([Source](https://github.com/rivermont/spidy/blob/master/spidy/crawler.py#L877))
TODO
-## `DONE_FILE` - ([Source](https://github.com/rivermont/spidy/blob/master/spidy/crawler.py#L797))
+## `DONE_FILE` - ([Source](https://github.com/rivermont/spidy/blob/master/spidy/crawler.py#L876))
TODO
-## `ERR_LOG_FILE` - ([Source](https://github.com/rivermont/spidy/blob/master/spidy/crawler.py#L56))
+## `ERR_LOG_FILE` - ([Source](https://github.com/rivermont/spidy/blob/master/spidy/crawler.py#L63))
TODO
-## `ERR_LOG_FILE_NAME` - ([Source](https://github.com/rivermont/spidy/blob/master/spidy/crawler.py#L57))
+## `ERR_LOG_FILE_NAME` - ([Source](https://github.com/rivermont/spidy/blob/master/spidy/crawler.py#L64))
TODO
-## `HEADER` - ([Source](https://github.com/rivermont/spidy/blob/master/spidy/crawler.py#L791))
+## `HEADER` - ([Source](https://github.com/rivermont/spidy/blob/master/spidy/crawler.py#L869))
TODO
-## `HEADERS` - ([Source](https://github.com/rivermont/spidy/blob/master/spidy/crawler.py#L727))
+## `HEADERS` - ([Source](https://github.com/rivermont/spidy/blob/master/spidy/crawler.py#L805))
TODO
-## `HTTP_ERROR_COUNT` - ([Source](https://github.com/rivermont/spidy/blob/master/spidy/crawler.py#L777))
+## `HTTP_ERROR_COUNT` - ([Source](https://github.com/rivermont/spidy/blob/master/spidy/crawler.py#L895))
TODO
-## `KILL_LIST` - ([Source](https://github.com/rivermont/spidy/blob/master/spidy/crawler.py#L762))
+## `KILL_LIST` - ([Source](https://github.com/rivermont/spidy/blob/master/spidy/crawler.py#L894))
A list of pages that are known to cause problems with the crawler.
- `bhphotovideo.com/c/search`
@@ -243,33 +246,36 @@ A list of pages that are known to cause problems with the crawler.
- `w3.org`: I have never been able to access W3, although it never says it's down. If someone knows of this problem, please let me know.
- `web.archive.org/web/`: While there is some good content, there are sometimes thousands of copies of the same exact page. Not good for web crawling.
-## `KNOWN_ERROR_COUNT` - ([Source](https://github.com/rivermont/spidy/blob/master/spidy/crawler.py#L776))
+## `KNOWN_ERROR_COUNT` - ([Source](https://github.com/rivermont/spidy/blob/master/spidy/crawler.py#L895))
TODO
-## `LOG_END` - ([Source](https://github.com/rivermont/spidy/blob/master/spidy/crawler.py#L504))
+## `LOG_END` - ([Source](https://github.com/rivermont/spidy/blob/master/spidy/crawler.py#L))
Line to print at the end of each `logFile` log
-## `LOG_FILE` - ([Source](https://github.com/rivermont/spidy/blob/master/spidy/crawler.py#L51))
+## `LOG_FILE` - ([Source](https://github.com/rivermont/spidy/blob/master/spidy/crawler.py#L893))
The file that the command line logs are written to.
Kept open until the crawler stops for whatever reason so that it can be written to.
-## `LOG_FILE_NAME` - ([Source](https://github.com/rivermont/spidy/blob/master/spidy/crawler.py#L53))
+## `LOG_FILE_NAME` - ([Source](https://github.com/rivermont/spidy/blob/master/spidy/crawler.py#L893))
The actual file name of [`LOG_FILE`](#log_file--source).
Used in [`info_log`](#info_log--source).
-## `MAX_HTTP_ERRORS` - ([Source](https://github.com/rivermont/spidy/blob/master/spidy/crawler.py#L792))
+## `MAX_HTTP_ERRORS` - ([Source](https://github.com/rivermont/spidy/blob/master/spidy/crawler.py#L896))
TODO
-## `MAX_KNOWN_ERRORS` - ([Source](https://github.com/rivermont/spidy/blob/master/spidy/crawler.py#L792))
+## `MAX_KNOWN_ERRORS` - ([Source](https://github.com/rivermont/spidy/blob/master/spidy/crawler.py#L896))
TODO
-## `MAX_NEW_ERRORS` - ([Source](https://github.com/rivermont/spidy/blob/master/spidy/crawler.py#L792))
+## `MAX_NEW_ERRORS` - ([Source](https://github.com/rivermont/spidy/blob/master/spidy/crawler.py#L896))
TODO
-## `MAX_NEW_MIMES` - ([Source](https://github.com/rivermont/spidy/blob/master/spidy/crawler.py#L793))
+## `MAX_NEW_MIMES` - ([Source](https://github.com/rivermont/spidy/blob/master/spidy/crawler.py#L896))
TODO
-## `MIME_TYPES` - ([Source](https://github.com/rivermont/spidy/blob/master/spidy/crawler.py#L628))
+## `MAX_TIME` - ([Source](https://github.com/rivermont/spidy/blob/master/spidy/crawler.py#L896))
+Maximum amount of time (in seconds) that a crawl is allowed to run. Defaults to `float('inf')`, allowing it to run indefinitely.
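+
+For example, the bundled `light.cfg` caps a crawl at ten minutes:
+
+```python
+MAX_TIME = 600  # stop crawling after 600 seconds
+```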
+
+## `MIME_TYPES` - ([Source](https://github.com/rivermont/spidy/blob/master/spidy/crawler.py#L705))
A dictionary of [MIME types](https://developer.mozilla.org/en-US/docs/Web/HTTP/Basics_of_HTTP/MIME_types) encountered by the crawler.
While there are [thousands of other types](https://www.iana.org/assignments/media-types/media-types.xhtml) that are not listed, to list them all would be impractical:
- The size of the list would be huge, using memory, space, etc.
@@ -288,63 +294,63 @@ Where `value` is the MIME type.
This will return the extension associated with the MIME type if it exists, however this will throw an [`IndexError`](https://docs.python.org/2/library/exceptions.html#exceptions.IndexError) if the MIME type is not in the dictionary.
Because of this, it is recommended to use the [`mime_lookup`](#mime_lookup--source) function.
-## `NEW_ERROR_COUNT` - ([Source](https://github.com/rivermont/spidy/blob/master/spidy/crawler.py#L775))
+## `NEW_ERROR_COUNT` - ([Source](https://github.com/rivermont/spidy/blob/master/spidy/crawler.py#L853))
TODO
-## `NEW_MIME_COUNT` - ([Source](https://github.com/rivermont/spidy/blob/master/spidy/crawler.py#L778))
+## `NEW_MIME_COUNT` - ([Source](https://github.com/rivermont/spidy/blob/master/spidy/crawler.py#L856))
TODO
-## `OVERRIDE_SIZE` - ([Source](https://github.com/rivermont/spidy/blob/master/spidy/crawler.py#L795))
+## `OVERRIDE_SIZE` - ([Source](https://github.com/rivermont/spidy/blob/master/spidy/crawler.py#L874))
TODO
-## `OVERWRITE` - ([Source](https://github.com/rivermont/spidy/blob/master/spidy/crawler.py#L795))
+## `OVERWRITE` - ([Source](https://github.com/rivermont/spidy/blob/master/spidy/crawler.py#L874))
TODO
-## `RAISE_ERRORS` - ([Source](https://github.com/rivermont/spidy/blob/master/spidy/crawler.py#L795))
+## `RAISE_ERRORS` - ([Source](https://github.com/rivermont/spidy/blob/master/spidy/crawler.py#L874))
TODO
-## `RESTRICT` - ([Source](https://github.com/rivermont/spidy/blob/master/spidy/crawler.py#L794))
+## `RESTRICT` - ([Source](https://github.com/rivermont/spidy/blob/master/spidy/crawler.py#L873))
Whether to restrict crawling to [`DOMAIN`](#domain--source) or not.
-## `SAVE_COUNT` - ([Source](https://github.com/rivermont/spidy/blob/master/spidy/crawler.py#L792))
+## `SAVE_COUNT` - ([Source](https://github.com/rivermont/spidy/blob/master/spidy/crawler.py#L871))
TODO
-## `SAVE_PAGES` - ([Source](https://github.com/rivermont/spidy/blob/master/spidy/crawler.py#L796))
+## `SAVE_PAGES` - ([Source](https://github.com/rivermont/spidy/blob/master/spidy/crawler.py#L875))
TODO
-## `SAVE_WORDS` - ([Source](https://github.com/rivermont/spidy/blob/master/spidy/crawler.py#L796))
+## `SAVE_WORDS` - ([Source](https://github.com/rivermont/spidy/blob/master/spidy/crawler.py#L875))
TODO
-## `START` - ([Source](https://github.com/rivermont/spidy/blob/master/spidy/crawler.py#L771))
+## `START` - ([Source](https://github.com/rivermont/spidy/blob/master/spidy/crawler.py#L849))
Links to start crawling if the TODO list is empty
-## `START_TIME` - ([Source](https://github.com/rivermont/spidy/blob/master/spidy/crawler.py#L37))
+## `START_TIME` - ([Source](https://github.com/rivermont/spidy/blob/master/spidy/crawler.py#L44))
The time that `crawler.py` was started, in seconds from the epoch.
More information can be found on the page for the Python [time](https://docs.python.org/3/library/time.html) library.
-## `START_TIME_LONG` - ([Source](https://github.com/rivermont/spidy/blob/master/spidy/crawler.py#L38))
+## `START_TIME_LONG` - ([Source](https://github.com/rivermont/spidy/blob/master/spidy/crawler.py#L45))
The time that `crawler.py` was started, in the format `HH:MM:SS, Date Month Year`.
Used in `info_log`.
-## `TODO` - ([Source](https://github.com/rivermont/spidy/blob/master/spidy/crawler.py#L798))
+## `TODO` - ([Source](https://github.com/rivermont/spidy/blob/master/spidy/crawler.py#L877))
The list containing all links that are yet to be crawled.
-## `TODO_FILE` - ([Source](https://github.com/rivermont/spidy/blob/master/spidy/crawler.py#L797))
+## `TODO_FILE` - ([Source](https://github.com/rivermont/spidy/blob/master/spidy/crawler.py#L876))
TODO
-## `USE_CONFIG` - ([Source](https://github.com/rivermont/spidy/blob/master/spidy/crawler.py#L795))
+## `USE_CONFIG` - ([Source](https://github.com/rivermont/spidy/blob/master/spidy/crawler.py#L874))
TODO
-## `VERSION` - ([Source](https://github.com/rivermont/spidy/blob/master/spidy/crawler.py#L24))
+## `VERSION` - ([Source](https://github.com/rivermont/spidy/blob/master/spidy/crawler.py#L31))
The current version of the crawler.
-## `WORD_FILE` - ([Source](https://github.com/rivermont/spidy/blob/master/spidy/crawler.py#L797))
+## `WORD_FILE` - ([Source](https://github.com/rivermont/spidy/blob/master/spidy/crawler.py#L876))
TODO
-## `WORDS` - ([Source](https://github.com/rivermont/spidy/blob/master/spidy/crawler.py#L782))
+## `WORDS` - ([Source](https://github.com/rivermont/spidy/blob/master/spidy/crawler.py#L860))
TODO
-## `ZIP_FILES` - ([Source](https://github.com/rivermont/spidy/blob/master/spidy/crawler.py#L795))
+## `ZIP_FILES` - ([Source](https://github.com/rivermont/spidy/blob/master/spidy/crawler.py#L874))
TODO
***