# Py Link Crawler

Py Link Crawler is a Python-based web crawler that uses Playwright to extract and filter links from web pages. Starting from a given URL, it collects all links within the same base domain, saves them to a JSON file, and removes duplicates.
## Requirements

- Python 3.7+
- Playwright
## Installation

1. Clone the repository:

   ```sh
   git clone https://github.com/mahdizakery/py-link-crawler.git
   cd py-link-crawler
   ```

2. Install the required packages:

   ```sh
   pip install -r requirements.txt
   ```

3. Install the Playwright browsers:

   ```sh
   playwright install
   ```
## Usage

1. Update the `start_url` variable in `link_crawler.py` with the URL you want to start crawling from.

2. Run the crawler:

   ```sh
   python link_crawler.py
   ```

3. The collected links will be saved to `all_links.json`.
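The crawl behind these steps can be pictured as a breadth-first traversal that visits each page once and keeps only same-domain links. The sketch below is a hypothetical illustration, not the code in `link_crawler.py`: the injected `fetch_links` callable stands in for the Playwright-backed link extractor so the traversal logic can be shown (and run) without a browser.

```python
from collections import deque
from urllib.parse import urlparse

def crawl_same_domain(start_url, fetch_links):
    """Breadth-first crawl starting at start_url.

    fetch_links(url, base_domain) is a stand-in for the Playwright-backed
    extractor; injecting it keeps this sketch runnable offline.
    Returns the sorted set of same-domain URLs discovered.
    """
    base_domain = urlparse(start_url).netloc
    seen = {start_url}
    queue = deque([start_url])
    while queue:
        url = queue.popleft()
        for link in fetch_links(url, base_domain):
            # Keep only links on the same base domain, each visited once.
            if urlparse(link).netloc == base_domain and link not in seen:
                seen.add(link)
                queue.append(link)
    return sorted(seen)
```

With a fake site graph in place of `fetch_links`, off-domain links are dropped and revisits are avoided, which is the behavior the steps above describe.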
## Functions

- `get_base_domain(url)`: Extracts the base domain from a URL.
- `get_all_links(url, base_domain)`: Retrieves all links from a page and filters them by the base domain.
- `find_all_pages(start_url)`: Crawls the web starting from the given URL and collects all links within the same base domain.
- `save_links_to_json(links, filename)`: Saves the collected links to a JSON file.
- `remove_duplicates_from_json(filename)`: Removes duplicate links from the JSON file.
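For the browser-free helpers, minimal sketches might look like the following. These are illustrative implementations assuming standard-library behavior (`urllib.parse`, `json`); the actual bodies in `link_crawler.py` may differ.

```python
import json
from urllib.parse import urlparse

def get_base_domain(url):
    """Extract the base domain (network location) from a URL."""
    return urlparse(url).netloc

def save_links_to_json(links, filename):
    """Save the collected links to a JSON file."""
    with open(filename, "w") as f:
        json.dump(list(links), f, indent=2)

def remove_duplicates_from_json(filename):
    """Rewrite the JSON file with duplicate links removed, preserving order."""
    with open(filename) as f:
        links = json.load(f)
    # dict.fromkeys keeps the first occurrence of each link, in order.
    deduped = list(dict.fromkeys(links))
    with open(filename, "w") as f:
        json.dump(deduped, f, indent=2)
```

For example, `get_base_domain("https://example.com/page")` returns `"example.com"`, and running `remove_duplicates_from_json` on a file containing repeated URLs leaves one copy of each.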
## License

This project is licensed under the MIT License.