ZeroXClem/enhanced-web-data-extractor

🕸️ Enhanced Web Data Extractor 🔍

A powerful and user-friendly web scraping tool built with Python and Streamlit.

🌟 Features

  • 🚀 Asynchronous web scraping for faster data collection
  • 🌐 Depth-limited crawling to control the scope of extraction
  • 🔑 Keyword filtering to focus on relevant content
  • 📊 Multiple export formats: CSV, Markdown, JSON, and XML
  • 🖥️ Interactive Streamlit UI for easy operation
  • 🛡️ Rate limiting to respect server resources
  • 📈 Real-time progress tracking
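
To make the first three features concrete, here is a minimal sketch of an asynchronous, depth-limited crawl with keyword filtering. It is not taken from this repository's `main.py`: the names `crawl`, `fake_fetch`, and `FAKE_SITE` are hypothetical, the page fetcher is injected so the example runs without network access, and it assumes that a depth of 1 means "the base URL only".

```python
import asyncio


async def crawl(start_url, fetch, max_depth=2, max_pages=10, keywords=()):
    """Crawl level by level, fetching each level's pages concurrently.

    `fetch` is a hypothetical async callable returning (text, links) for a URL.
    Depth 1 is assumed to mean the start page only.
    """
    seen = {start_url}          # URLs already queued, to avoid revisiting
    frontier = [start_url]      # URLs to fetch at the current depth
    matches = []
    fetched = 0
    wanted = [k.lower() for k in keywords]

    for depth in range(1, max_depth + 1):
        # Respect the page budget by trimming the current level.
        batch = frontier[: max_pages - fetched]
        if not batch:
            break
        pages = await asyncio.gather(*(fetch(url) for url in batch))
        fetched += len(batch)

        frontier = []
        for url, (text, links) in zip(batch, pages):
            # Keep the page if no keywords were given or any keyword occurs.
            if not wanted or any(k in text.lower() for k in wanted):
                matches.append({"url": url, "depth": depth, "text": text})
            for link in links:
                if link not in seen:
                    seen.add(link)
                    frontier.append(link)
    return matches


# A tiny in-memory "website" so the sketch runs offline (illustrative data).
FAKE_SITE = {
    "https://example.com/": ("Home page about python",
                             ["https://example.com/a", "https://example.com/b"]),
    "https://example.com/a": ("Streamlit tutorial", ["https://example.com/c"]),
    "https://example.com/b": ("Python scraping notes", ["https://example.com/"]),
    "https://example.com/c": ("Deep python page", []),
}


async def fake_fetch(url):
    await asyncio.sleep(0)  # stand-in for real network latency
    return FAKE_SITE[url]


if __name__ == "__main__":
    found = asyncio.run(
        crawl("https://example.com/", fake_fetch, max_depth=2, keywords=["python"])
    )
    print([page["url"] for page in found])
    # ['https://example.com/', 'https://example.com/b']
```

In a real crawler, `fake_fetch` would be replaced by an HTTP client call plus link extraction; the level-by-level loop is what keeps the crawl within the chosen depth.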

🛠️ Installation

  1. Clone this repository:

    git clone https://github.com/ZeroXClem/enhanced-web-data-extractor.git
    cd enhanced-web-data-extractor
    
  2. Create a virtual environment (optional but recommended):

    python -m venv venv
    source venv/bin/activate  # On Windows, use `venv\Scripts\activate`
    
  3. Install the required packages:

    pip install -r requirements.txt
    

🚀 Usage

  1. Run the Streamlit app:

    streamlit run main.py
    
  2. Open your web browser and navigate to the URL provided by Streamlit (usually http://localhost:8501).

  3. In the Streamlit interface:

    • Enter the base URL you want to scrape
    • Set the maximum number of pages to scrape (1-100)
    • Set the maximum depth for crawling (1-10)
    • (Optional) Enter keywords to filter content
    • Set the rate limit (requests per second)
    • Choose the desired export format(s)
    • Click "Start Scraping"
  4. Monitor the progress and download the extracted data when complete.
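
For readers curious how one set of scraped records might be written to each of the four export formats, the sketch below uses only the Python standard library. It is an illustration under assumptions, not the app's actual export code: the record fields (`url`, `title`) and the function names are hypothetical.

```python
import csv
import io
import json
import xml.etree.ElementTree as ET

FIELDS = ["url", "title"]  # assumed record shape, for illustration only


def to_csv(rows):
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=FIELDS, lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


def to_json(rows):
    return json.dumps(rows, indent=2, ensure_ascii=False)


def to_markdown(rows):
    def cell(value):
        # Escape pipes so a value cannot break the table layout.
        return str(value).replace("|", "\\|")

    lines = ["| URL | Title |", "| --- | --- |"]
    lines += [f"| {cell(row['url'])} | {cell(row['title'])} |" for row in rows]
    return "\n".join(lines) + "\n"


def to_xml(rows):
    root = ET.Element("pages")
    for row in rows:
        page = ET.SubElement(root, "page")
        for field in FIELDS:
            ET.SubElement(page, field).text = str(row[field])
    return ET.tostring(root, encoding="unicode")


# Map a format name (as a UI might offer it) to its exporter.
EXPORTERS = {"CSV": to_csv, "JSON": to_json, "Markdown": to_markdown, "XML": to_xml}

if __name__ == "__main__":
    sample = [{"url": "https://example.com/", "title": "Home"}]
    for name, export in EXPORTERS.items():
        print(f"--- {name} ---")
        print(export(sample))
```

Each exporter returns a string, which is convenient for handing to a download button; writing directly to files would work equally well.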

🎯 Use Cases

  • 📚 Research: Gather data from academic websites or online journals
  • 💼 Business Intelligence: Collect product information from e-commerce sites
  • 📰 News Aggregation: Compile articles from various news sources
  • 🏢 Competitive Analysis: Extract data from competitor websites
  • 📊 Market Research: Gather consumer reviews and opinions

⚠️ Important Notes

  • This tool is for educational purposes only.
  • Always respect websites' terms of service and robots.txt files.
  • Be mindful of rate limiting and don't overload servers with requests.
  • Some websites may have measures in place to prevent scraping.
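
As one way to act on the notes above, the following sketch checks URLs against a site's robots.txt rules and derives a polite delay between requests, using the standard library's `urllib.robotparser`. Whether this tool performs such a check itself is not stated here, so treat it as a separate illustration; the helper names, the `example-bot` user agent, and the sample rules are assumptions.

```python
from urllib.robotparser import RobotFileParser

# Sample rules for illustration; a real crawler would download
# https://<site>/robots.txt instead of using a hard-coded string.
SAMPLE_ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Crawl-delay: 2
"""


def load_rules(robots_txt):
    """Parse robots.txt text into a RobotFileParser."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


def request_delay(parser, user_agent, requests_per_second):
    """Seconds to wait between requests: the slower of our own rate
    limit and any Crawl-delay the site asks for."""
    own_delay = 1.0 / requests_per_second
    site_delay = parser.crawl_delay(user_agent) or 0
    return max(own_delay, site_delay)


if __name__ == "__main__":
    rules = load_rules(SAMPLE_ROBOTS_TXT)
    agent = "example-bot"  # hypothetical user agent
    for url in ("https://example.com/index.html",
                "https://example.com/private/data"):
        print(url, "allowed:", rules.can_fetch(agent, url))
    print("delay:", request_delay(rules, agent, requests_per_second=2.0))
```

Taking the maximum of the two delays means a user-chosen rate limit can slow the crawl further but never push it faster than the site requests.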

📄 License

This project is licensed under the MIT License - see the LICENSE file for details.

🀝 Contributing

Contributions, issues, and feature requests are welcome! Feel free to check the issues page.

👨‍💻 Author

ZeroXClem


Happy Scraping! 🎉🕷️
