A Python-based web scraper that extracts content from websites and creates vector embeddings using OpenAI's text-embedding-3-large model. The embeddings are stored in a local directory for efficient similarity search and retrieval.
- Web scraping with automatic content extraction
- Vector embedding generation using OpenAI's text-embedding-3-large
- Local vector database storage
- Command-line interface for easy usage
- Configurable storage location
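The local storage and similarity search listed above could be sketched roughly as follows. This README does not describe the project's actual vector database format, so the JSON file layout and the helper names (`save_vectors`, `load_vectors`, `search`) are hypothetical stand-ins. The toy two-dimensional vectors in the usage note take the place of real text-embedding-3-large embeddings.

```python
import json
import math
from pathlib import Path


def save_vectors(store_dir, records):
    """Write records of the form {"text": str, "embedding": list[float]} to disk.

    Assumption: a single JSON file named vectors.json; the real project may
    use a different on-disk format.
    """
    path = Path(store_dir)
    path.mkdir(parents=True, exist_ok=True)
    (path / "vectors.json").write_text(json.dumps(records), encoding="utf-8")


def load_vectors(store_dir):
    """Read back the records written by save_vectors."""
    return json.loads((Path(store_dir) / "vectors.json").read_text(encoding="utf-8"))


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def search(store_dir, query_embedding, top_k=3):
    """Return the top_k (score, text) pairs, most similar first."""
    records = load_vectors(store_dir)
    scored = [
        (cosine_similarity(query_embedding, record["embedding"]), record["text"])
        for record in records
    ]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:top_k]
```

For example, after `save_vectors("vector_store", [{"text": "about us", "embedding": [1.0, 0.0]}])`, calling `search("vector_store", [1.0, 0.0], top_k=1)` returns that record with a similarity of 1.0. A brute-force scan like this is only practical for small stores.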
- Python 3.12 or higher
- OpenAI API key
- Clone the repository:
cd web_scraper
- Install required dependencies:
pip install -r requirements.txt
- Set up your environment variables:
# Create a .env file and add your OpenAI API key
echo "OPENAI_API_KEY=your-api-key-here" > .env

Run the scraper from the command line:
python main.py https://www.rsystems.com/ --save-dir=C:\learn\vector_store
- The target URL to scrape (required)
- --output: Directory path where the vector database will be stored (required)
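A command-line interface with these two arguments could be declared along the following lines. This is a minimal sketch using `argparse`, not the contents of the project's `main.py`: the function name `build_parser` is hypothetical, and the option name follows the `--output` entry in the argument list, so the flag accepted by the real script may differ.

```python
import argparse


def build_parser():
    """Build a parser with one required positional URL and one required option."""
    parser = argparse.ArgumentParser(
        description="Scrape a website and store vector embeddings locally."
    )
    # Positional arguments are required by default in argparse.
    parser.add_argument("url", help="The target URL to scrape")
    # Assumption: the option is named --output, as in the argument list.
    parser.add_argument(
        "--output",
        required=True,
        help="Directory path where the vector database will be stored",
    )
    return parser


# Parse an explicit argument list instead of sys.argv, for illustration.
args = build_parser().parse_args(["https://www.rsystems.com/", "--output", "vector_store"])
print(args.url, args.output)
```

Marking `--output` as `required=True` makes `argparse` exit with a usage message when the storage directory is omitted, which matches the "(required)" note in the argument list.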