Collecting revenue opportunities with tax-efficient precision! This AWS Lambda service is the financial intelligence powerhouse of our tender scraping fleet: the fifth and final specialized crawler, which captures opportunities from the South African Revenue Service (SARS), South Africa's tax collection agency. From data analytics platforms to digital transformation projects, we audit every opportunity!
- Overview
- Lambda Function (lambda_function.py)
- Data Model (models.py)
- AI Tagging Initialization
- Example Tender Data
- Getting Started
- Deployment
- Troubleshooting
Welcome to the treasury of digital opportunities! This service is your direct access to SARS's sophisticated procurement ecosystem, capturing cutting-edge technology projects, data analytics solutions, digital transformation initiatives, and specialized consulting services that power South Africa's tax collection and revenue administration!
What makes it tax-efficiently excellent?
- Financial Sector Expertise: Specialized in tax technology, data analytics, and revenue administration systems
- Advanced Web Intelligence: Pure HTML scraping mastery, with no APIs, just surgical precision web extraction
- Dual-Phase Investigation: Two-stage scraping process for comprehensive tender intelligence
- Digital Focus: Captures high-tech opportunities in fintech, data science, and digital government services
The financial forensics brain of our operation! The lambda_handler orchestrates our sophisticated dual-audit extraction process:
- Initial Tax Assessment: Connects to the SARS procurement webpage, the official treasury for all technology and consulting opportunities across South Africa's revenue administration.
- Audit-Grade Error Handling: Built like a tax compliance system! Handles network failures, website maintenance periods, and unexpected responses with financial-grade precision. Every transaction is tracked!
- Comprehensive Audit Process: Here's where our tax investigation expertise shines! Unlike other agencies, SARS requires pure forensic web scraping:
  - Phase 1: Audit the main "Published Tenders" page to identify all active procurement opportunities
  - Phase 2: Conduct detailed investigation of each tender's individual page for comprehensive data extraction
  - Follow-up: Cross-reference and validate the financial and technical specifications gathered in the two scraping phases
- Tax Code Validation: Each tender undergoes rigorous SarsTender model processing with specialized logic for HTML parsing, document extraction, and briefing session identification, because tax matters require precision!
- Compliance Inspector: Our validation process ensures only regulation-compliant tenders make it through. Failed assessments get logged for review. No tax loopholes in our pipeline!
- Revenue Batching: Valid tenders are organized into fiscal batches of 10 messages, the maximum a single SQS batch send accepts, like a well-structured tax return.
- Treasury Express: Each batch flows to the central AIQueue.fifo SQS queue with the unique MessageGroupId of SARSTenderScrape. This keeps our revenue administration tenders organized and maintains a perfect audit trail.
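The batching and queueing steps above can be sketched roughly as follows. This is a minimal illustration, not the code in lambda_function.py: the helper names (chunk, build_entries, send_in_batches) are hypothetical, the client is assumed to be a boto3 SQS client passed in by the caller, and the deduplication ID is only needed if content-based deduplication is not enabled on the queue.

```python
import hashlib
import json
from typing import Any, Dict, Iterator, List

BATCH_SIZE = 10  # a single SQS SendMessageBatch call accepts at most 10 entries
MESSAGE_GROUP_ID = "SARSTenderScrape"


def chunk(items: List[Any], size: int = BATCH_SIZE) -> Iterator[List[Any]]:
    """Yield consecutive slices of at most `size` items."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def build_entries(tenders: List[Dict[str, Any]]) -> List[Dict[str, str]]:
    """Turn tender dicts into SendMessageBatch entries for a FIFO queue."""
    entries = []
    for index, tender in enumerate(tenders):
        body = json.dumps(tender, sort_keys=True)
        entries.append({
            "Id": f"tender-{index}",  # must be unique within one batch
            "MessageBody": body,
            "MessageGroupId": MESSAGE_GROUP_ID,
            # Assumption: content-based deduplication is off, so we derive
            # a deduplication ID from the message body ourselves.
            "MessageDeduplicationId": hashlib.sha256(body.encode("utf-8")).hexdigest(),
        })
    return entries


def send_in_batches(sqs_client: Any, queue_url: str, tenders: List[Dict[str, Any]]) -> int:
    """Send tenders in batches of 10; return how many SQS reported as sent."""
    sent = 0
    for batch in chunk(tenders):
        response = sqs_client.send_message_batch(
            QueueUrl=queue_url, Entries=build_entries(batch)
        )
        sent += len(response.get("Successful", []))
    return sent
```

In the Lambda itself, sqs_client would typically be boto3.client("sqs") and queue_url the value of SQS_QUEUE_URL. Checking the "Failed" list in the response as well would show which tenders need a retry.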
Our data architecture is engineered for tax-grade accuracy!
The solid fiscal foundation that supports all our tender accounting! This abstract class, TenderBase, defines the core ledger that records all revenue opportunities:
Core Attributes:
- title: The procurement specification - what technology is being acquired?
- description: Detailed technical requirements and compliance specifications
- source: Always "SARS" for this revenue administration specialist
- published_date: When this opportunity entered our fiscal records (special handling - see below)
- closing_date: Submission deadline - when the tax window closes!
- supporting_docs: Critical procurement documents and briefing materials
- tags: Keywords for AI intelligence (starts empty, gets assessed by our AI service)

The SarsTender model inherits all the foundational strength from TenderBase and adds SARS's unique revenue administration features:
SARS-Specific Attributes:
- tender_number: Official SARS procurement code (e.g., "RFP18/2025")
- briefing_session: Details about compulsory briefing sessions and presentations
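To make the attribute lists above concrete, here is a minimal sketch of how the two classes could be laid out. It is an assumption-laden illustration rather than the contents of models.py: the use of dataclasses, the default values, and the to_dict method are all hypothetical, and the output keys simply mirror the example tender shown later in this README.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass, field
from datetime import datetime
from typing import Any, Dict, List


@dataclass
class TenderBase(ABC):
    """Fields shared by every tender source."""
    title: str
    description: str
    source: str
    published_date: datetime
    closing_date: datetime
    supporting_docs: List[Dict[str, str]] = field(default_factory=list)
    tags: List[str] = field(default_factory=list)  # filled in later by the AI service

    @abstractmethod
    def to_dict(self) -> Dict[str, Any]:
        """Serialize the tender for the queue (hypothetical helper)."""


@dataclass
class SarsTender(TenderBase):
    """SARS-specific additions on top of TenderBase."""
    tender_number: str = ""
    briefing_session: str = ""

    def to_dict(self) -> Dict[str, Any]:
        # Key names follow the example tender in this README.
        return {
            "title": self.title,
            "description": self.description,
            "source": self.source,
            "publishedDate": self.published_date.isoformat(),
            "closingDate": self.closing_date.isoformat(),
            "supporting_docs": self.supporting_docs,
            "tags": self.tags,
            "tenderNumber": self.tender_number,
            "briefingSession": self.briefing_session,
        }
```

Because to_dict is abstract, TenderBase cannot be instantiated directly, which matches its role as a shared foundation.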
Advanced Revenue Processing:
The from_api_response method is our master tax auditor! It performs:
- HTML Audit: BeautifulSoup-powered deep analysis of tender pages
- Document Forensics: Extraction of supporting documents, Q&A sessions, and briefing materials
- Compliance Verification: Validation of procurement timelines and requirements
Important Tax Note: SARS operates differently from other agencies. Their website shows closing dates but doesn't say when a tender was published! Our solution:
# From models.py - Tax-efficient timestamp management!
# As a fallback, we use the current timestamp of when the scraper is run.
published_date = datetime.now()

We use the exact moment of discovery as the published date, providing consistent, auditable timestamps for when our system first identified each opportunity. It's like a tax assessment date!
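If you would like the discovery timestamp to carry an explicit UTC offset rather than a naive local time, a minimal variation could look like this. The helper name discovery_timestamp is hypothetical and not part of models.py; the optional now argument exists only to make the function easy to test.

```python
from datetime import datetime, timezone
from typing import Optional


def discovery_timestamp(now: Optional[datetime] = None) -> str:
    """Return the moment a tender was first seen, as an ISO 8601 string in UTC."""
    if now is None:
        now = datetime.now(timezone.utc)
    return now.isoformat()
```

An offset-aware value removes any doubt about which time zone the "published" date refers to when downstream services compare it with closing dates.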
We're all about intelligent revenue optimization! Every tender that processes through our system is perfectly prepared for downstream AI enhancement:
# From models.py - Preparing for AI revenue classification!
return cls(
    # ... other fields
    tags=[],  # Initialize tags as an empty list, ready for the AI service.
    # ... other fields
)

This ensures seamless treasury integration with our AI pipeline: every tender object arrives with a clean, empty tags field just waiting to be assessed with intelligent categorizations!
Here's what a real SARS technology project looks like after our scraper works its financial magic!
{
  "title": "Rfp18/2025: The Procurement Of Third-Party Data And Related Services",
  "description": "Rfp18/2025: The Procurement Of Third-Party Data And Related Services",
  "source": "SARS",
  "publishedDate": "2025-10-16T19:34:05.725453",
  "closingDate": "2025-10-22T11:00:00",
  "supporting_docs": [
    {
      "name": "The procurement of third-party data and related services",
      "url": "https://www.sars.gov.za/sars-rfp-18-2025-tender-pack/"
    },
    {
      "name": "Briefing session presentation",
      "url": "https://www.sars.gov.za/non-compulsary-briefing-for-rfp-18-2025/"
    },
    {
      "name": "Questions and answers",
      "url": "https://www.sars.gov.za/sars-rfp-18-2025-communication-1/"
    }
  ],
  "tags": [],
  "tenderNumber": "RFP18/2025",
  "briefingSession": "(Non-Compulsory) 2025/09/30 at 10:00"
}

What this revenue opportunity delivers:
- Data Analytics Focus: Third-party data procurement for advanced tax analytics
- Technology Integration: Modern data services for revenue administration
- Comprehensive Documentation: Full tender pack, briefing presentations, and Q&A sessions
- Tight Timeline: Closes on October 22, 2025, six days after our scraper discovered it on October 16 (the discovery date, not SARS's own publication date)
- Professional Briefing: Non-compulsory but valuable briefing session opportunity
- Transparent Process: Multiple communication channels and Q&A support
Ready to calculate your way to success? Let's prepare your tax return of opportunities!
- AWS CLI configured with appropriate credentials
- Python 3.9+ with pip
- BeautifulSoup4 for advanced web scraping
- Access to AWS Lambda and SQS services
- Understanding of revenue administration and financial technology

- Clone the repository
- Install dependencies: pip install -r requirements.txt
- Run tests: python -m pytest
- Test locally: Use AWS SAM for local Lambda simulation
This section covers three deployment methods for the SARS Tender Processing Lambda Service. Choose the method that best fits your workflow and infrastructure preferences.
Before deploying, ensure you have:
- AWS CLI configured with appropriate credentials
- AWS SAM CLI installed (pip install aws-sam-cli)
- Python 3.13 runtime support in your target region
- Access to AWS Lambda, SQS, and CloudWatch Logs services
- Required Python dependencies: beautifulsoup4 and requests
Deploy directly through your IDE using the AWS Toolkit extension.
- Install AWS Toolkit in your IDE (VS Code, IntelliJ, etc.)
- Configure AWS Profile with your credentials
- Open Project containing lambda_function.py and models.py

- Right-click on lambda_function.py in your IDE
- Select "Deploy Lambda Function" from AWS Toolkit menu
- Configure Deployment:
  - Function Name: SarsLambda
  - Runtime: python3.13
  - Handler: lambda_function.lambda_handler
  - Memory: 128 MB
  - Timeout: 120 seconds

- Add Layers manually after deployment:
- beautifulsoup4-library layer
- requests-library layer
- Set Environment Variables as needed
- Configure IAM Permissions for SQS, Logs, and EC2 (for VPC if needed)
- Test the function using the AWS Toolkit test feature
- Monitor logs through CloudWatch integration
- Update function code directly from IDE for quick iterations
Use AWS SAM for infrastructure-as-code deployment with the provided template.
# Install AWS SAM CLI
pip install aws-sam-cli
# Verify installation
sam --version

Since the template references layers not included in the repository, create them:

# Create layer directories
mkdir -p beautifulsoup4-library/python
mkdir -p requests-library/python
# Install beautifulsoup4 layer
pip install beautifulsoup4 -t beautifulsoup4-library/python/
# Install requests layer
pip install requests -t requests-library/python/

# Build the SAM application
sam build
# Deploy with guided configuration (first time)
sam deploy --guided
# Follow the prompts:
#   Stack Name: sars-lambda-stack
#   AWS Region: us-east-1 (or your preferred region)
#   Confirm changes before deploy: Y
#   Allow SAM to create IAM roles: Y
#   Save parameters to samconfig.toml: Y

# Quick deployment after initial setup
sam build && sam deploy

# Test function locally
sam local invoke SarsLambda
# Start local API Gateway (if needed)
sam local start-api

- Complete infrastructure management
- Automatic layer creation and management
- IAM permissions defined in template
- Easy rollback capabilities
- CloudFormation integration
Automated deployment using GitHub Actions workflow for production environments.
- GitHub Repository Secrets:
  - AWS_ACCESS_KEY_ID: Your AWS access key
  - AWS_SECRET_ACCESS_KEY: Your AWS secret key
  - AWS_REGION: us-east-1 (or your target region)
- Pre-existing Lambda Function: The workflow updates an existing function, so deploy initially using Method 1 or 2.

- Create Release Branch:
  # Create and switch to release branch
  git checkout -b release
  # Make your changes to lambda_function.py or models.py
  # Commit changes
  git add .
  git commit -m "feat: update SARS tender processing logic"
  # Push to trigger deployment
  git push origin release
- Automatic Deployment: The workflow will:
  - Check out the code
  - Configure AWS credentials
  - Create deployment zip with lambda_function.py and models.py
  - Update the existing Lambda function code
  - Maintain existing configuration (layers, environment variables, etc.)
You can also trigger deployment manually:
- Go to Actions tab in your GitHub repository
- Select "Deploy Python Scraper to AWS" workflow
- Click "Run workflow"
- Choose the release branch
- Click "Run workflow" button

- Automated CI/CD pipeline
- Consistent deployment process
- Audit trail of deployments
- Easy rollback to previous commits
- No local environment dependencies
Regardless of deployment method, configure the following:
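SQS_QUEUE_URL=https://sqs.us-east-1.amazonaws.com/211635102441/AIQueue.fifo
SCRAPING_TIMEOUT=30
BATCH_SIZE=10
USER_AGENT=Mozilla/5.0 (compatible; SARS-Tender-Bot/1.0)

Set up scheduled execution:

# Create CloudWatch Events rule for daily execution (cron expressions are evaluated in UTC)
aws events put-rule \
  --name "SarsLambdaSchedule" \
  --schedule-expression "cron(0 9 * * ? *)" \
  --description "Daily SARS tender scraping"
# Allow the rule to invoke the function (without this the schedule fires but the Lambda is never called)
aws lambda add-permission \
  --function-name SarsLambda \
  --statement-id SarsLambdaSchedule \
  --action lambda:InvokeFunction \
  --principal events.amazonaws.com \
  --source-arn arn:aws:events:us-east-1:211635102441:rule/SarsLambdaSchedule
# Add Lambda as target
aws events put-targets \
  --rule "SarsLambdaSchedule" \
  --targets "Id"="1","Arn"="arn:aws:lambda:us-east-1:211635102441:function:SarsLambda"

After deployment, test the function: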
# Test via AWS CLI
# (AWS CLI v2 expects base64 payloads by default; the extra flag lets it accept raw JSON)
aws lambda invoke \
  --function-name SarsLambda \
  --cli-binary-format raw-in-base64-out \
  --payload '{}' \
  response.json
# Check the response
cat response.json

- Function executes without errors
- CloudWatch logs show successful scraping activity
- SQS queue receives tender messages
- No timeout or memory errors
- Valid JSON tender data in queue messages
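For the last item on the checklist, a quick way to confirm that a queue message holds valid tender JSON is sketched below. The helper missing_keys is hypothetical, and the required key names are taken from the example tender in this README rather than from any schema in the repository.

```python
import json
from typing import List

# Key names as they appear in the example tender in this README.
REQUIRED_KEYS = (
    "title", "description", "source", "publishedDate", "closingDate",
    "supporting_docs", "tags", "tenderNumber", "briefingSession",
)


def missing_keys(message_body: str) -> List[str]:
    """Return the required keys absent from a queue message.

    Raises ValueError if the body is not valid JSON or not a JSON object.
    """
    payload = json.loads(message_body)  # json.JSONDecodeError is a ValueError
    if not isinstance(payload, dict):
        raise ValueError("tender message must be a JSON object")
    return [key for key in REQUIRED_KEYS if key not in payload]
```

An empty list means the message has every expected field; anything else names exactly what is missing.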
- Duration: Function execution time
- Error Rate: Failed invocations
- Memory Utilization: RAM usage patterns
- Throttles: Concurrent execution limits
# View recent logs
aws logs tail /aws/lambda/SarsLambda --follow
# Search for errors
aws logs filter-log-events \
  --log-group-name /aws/lambda/SarsLambda \
  --filter-pattern "ERROR"

Layer Dependencies Missing
Issue: beautifulsoup4 or requests import errors
Solution: Ensure layers are properly created and attached:
# For SAM: Verify layer directories exist and contain packages
ls -la beautifulsoup4-library/python/
ls -la requests-library/python/
# For manual deployment: Create and upload layers separately

IAM Permission Errors
Issue: Access denied for SQS or CloudWatch operations
Solution: Verify the Lambda execution role has required permissions:
- sqs:SendMessage
- sqs:GetQueueUrl
- sqs:GetQueueAttributes
- logs:CreateLogGroup
- logs:CreateLogStream
- logs:PutLogEvents

Workflow Deployment Fails
Issue: GitHub Actions workflow errors
Solution: Check repository secrets are correctly configured and the target Lambda function exists in AWS.
Choose the deployment method that best fits your development workflow and infrastructure requirements. SAM deployment is recommended for development environments, while workflow deployment excels for production CI/CD pipelines.
Pure HTML Scraping Complexity
Issue: No API available - everything requires surgical HTML extraction.
Solution: SARS is a pure web scraping challenge! Maintain robust HTML parsing with fallback selectors and regular expression patterns. Tax websites require forensic precision!
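As one illustration of the fallback idea, the sketch below tries a strict regular expression first and a looser one second when pulling a tender number out of page text or a URL. The function name and both patterns are assumptions based only on the example tender in this README ("Rfp18/2025" in the title, "rfp-18-2025" in the document URLs); the real pages may need different or additional patterns.

```python
import re
from typing import Optional

# Ordered from strictest to loosest; the first pattern that matches wins.
TENDER_NUMBER_PATTERNS = (
    # Title style, e.g. "Rfp18/2025" or "RFP 18/2025"
    re.compile(r"\bRFP\s*(\d+)\s*/\s*(\d{4})\b", re.IGNORECASE),
    # URL slug style, e.g. "rfp-18-2025"
    re.compile(r"\bRFP[\s-]*(\d+)[\s-]+(\d{4})\b", re.IGNORECASE),
)


def extract_tender_number(text: str) -> Optional[str]:
    """Return a normalized tender number such as "RFP18/2025", or None."""
    for pattern in TENDER_NUMBER_PATTERNS:
        match = pattern.search(text)
        if match:
            return f"RFP{int(match.group(1))}/{match.group(2)}"
    return None
```

Keeping the patterns in an ordered tuple means a site change can often be handled by appending one more pattern instead of rewriting the extraction logic.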
Website Structure Updates
Issue: SARS website redesigns breaking the scraping logic.
Solution: Government websites evolve like tax regulations! Monitor for structural changes and maintain flexible selectors. Keep your scraping code as current as tax law!
Dual-Phase Scraping Timeouts
Issue: Main page loads but individual tender pages timeout.
Solution: SARS tender pages can be document-heavy! Implement intelligent timeout handling and retry logic for individual page scraping. Sometimes tax documents take time to load!
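A minimal sketch of such retry logic with exponential backoff is shown below. The helper fetch_with_retry and its parameters are hypothetical and not taken from lambda_function.py. It defaults to Python's built-in TimeoutError and ConnectionError; when fetching with requests you would likely pass requests.exceptions.Timeout and requests.exceptions.ConnectionError via retry_on instead.

```python
import time
from typing import Callable, Optional, Tuple, Type, TypeVar

T = TypeVar("T")


def fetch_with_retry(
    fetch: Callable[[], T],
    attempts: int = 3,
    base_delay: float = 1.0,
    retry_on: Tuple[Type[Exception], ...] = (TimeoutError, ConnectionError),
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call `fetch` until it succeeds, waiting base_delay * 2**n between tries."""
    last_error: Optional[Exception] = None
    for attempt in range(attempts):
        try:
            return fetch()
        except retry_on as error:
            last_error = error
            if attempt < attempts - 1:  # no point sleeping after the final try
                sleep(base_delay * (2 ** attempt))
    raise RuntimeError(f"gave up after {attempts} attempts") from last_error
```

Wrapping each individual tender page fetch this way lets one slow page fail on its own without aborting the whole run. Keep attempts and base_delay small enough to fit inside the Lambda timeout.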
Missing Published Dates
Issue: SARS doesn't provide published dates for tenders.
Solution: We use discovery timestamps! This provides consistent, auditable dates for when our system first identified each opportunity. Document your methodology like a tax audit!
Complex Document Structures
Issue: SARS tenders often have multiple supporting documents and briefing materials.
Solution: Implement comprehensive document extraction logic that captures tender packs, briefing presentations, Q&A sessions, and communication updates. Treat each document like a tax form - every detail matters!
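To show the shape of that extraction, here is a dependency-free sketch that collects every link on a tender page as a name/url pair, matching the supporting_docs entries in the example tender. Note the swap: the service itself uses BeautifulSoup, while this illustration uses Python's standard html.parser so it runs without extra packages. The class and function names are hypothetical.

```python
from html.parser import HTMLParser
from typing import Dict, List, Optional
from urllib.parse import urljoin


class LinkCollector(HTMLParser):
    """Collect the text and absolute URL of every <a href="..."> element."""

    def __init__(self, base_url: str) -> None:
        super().__init__()
        self.base_url = base_url
        self.documents: List[Dict[str, str]] = []
        self._href: Optional[str] = None
        self._text: List[str] = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self._href = urljoin(self.base_url, href)  # resolve relative links
                self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            name = " ".join("".join(self._text).split())  # collapse whitespace
            if name:
                self.documents.append({"name": name, "url": self._href})
            self._href = None


def extract_documents(html: str, base_url: str) -> List[Dict[str, str]]:
    """Return supporting documents found in `html` as name/url dicts."""
    collector = LinkCollector(base_url)
    collector.feed(html)
    collector.close()
    return collector.documents
```

In practice you would narrow this to the part of the page that lists tender documents, since a full page also contains navigation and footer links.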
Built with love, bread, and code by Bread Corporation