AlphaEngine is an end-to-end algorithmic trading system that combines quantitative research, deep learning, and cloud infrastructure to generate, validate, and explain trading signals.
The project spans the full ML engineering lifecycle, from raw market data ingestion to a deployed, queryable intelligence layer, and is deliberately built from the ground up rather than assembled from high-level abstractions.
Why this project? It is a hands-on deep dive into three domains at once: quantitative finance methodology, production ML engineering with PyTorch, and cloud-native deployment on Microsoft Azure.
```
Alpaca Markets API
        ↓
Azure Blob Storage (raw & processed data)
        ↓
Feature Engineering Pipeline (log-returns, rolling volatility, normalization)
        ↓
PyTorch LSTM Model (sequence modeling on time-series data)
        ↓
Azure ML (experiment tracking & model versioning)
        ↓
Backtesting Engine (custom-built, bias-free)
        ↓
LangChain RAG Agent (natural language strategy Q&A)
        ↓
Azure Container App (REST API + Dashboard)
```
| Layer | Technology | Purpose |
|---|---|---|
| Data | Alpaca Markets API | Historical & live market data + paper trading |
| Storage | Azure Blob Storage | Raw and processed data persistence |
| Modeling | PyTorch | LSTM architecture for time-series forecasting |
| Experiment Tracking | Azure ML | Model versioning, training runs, metrics |
| Intelligence Layer | LangChain + RAG | Natural language querying of strategy & documents |
| Vector Search | Azure AI Search | Embedding store for financial documents |
| Deployment | Azure Container Apps | Scalable API hosting |
| Backend | FastAPI | REST API layer |
```
alphaengine/
│
├── data/
│   ├── raw/                  # Raw market data from Alpaca API
│   └── processed/            # Cleaned, normalized, feature-engineered data
│
├── ingestion/
│   ├── api_loader.py         # Abstracted market data loader (Alpaca / extensible)
│   ├── preprocessor.py       # Feature engineering: log-returns, rolling features
│   └── blob_upload.py        # Azure Blob Storage read/write
│
├── models/
│   ├── dataset.py            # PyTorch Dataset & DataLoader
│   ├── lstm.py               # LSTM model architecture
│   └── train.py              # Training loop with experiment tracking
│
├── strategy/
│   ├── signals.py            # Signal generation from model outputs
│   └── backtester.py         # Custom backtesting engine
│
├── agent/
│   ├── tools.py              # Custom LangChain tools (backtesting, metrics)
│   ├── chain.py              # RAG chain + agent orchestration
│   ├── retriever.py          # Vector search over financial documents
│   └── ingestion/
│       ├── doc_loader.py     # PDF / filing ingestion
│       └── embeddings.py     # Embedding generation & storage
│
├── api/
│   └── app.py                # FastAPI application
│
├── evaluation/
│   └── metrics.py            # Sharpe ratio, max drawdown, hit rate
│
└── memo/
    └── business_case.md      # McKinsey-style business case write-up
```
- Project structure & environment setup
- Azure account & resource group
- Alpaca Markets API integration
- Feature engineering pipeline (log-returns, rolling volatility)
- Azure Blob Storage persistence
- End-to-end data pipeline
- PyTorch Dataset & DataLoader for time-series
- LSTM architecture implementation
- Training loop with Azure ML experiment tracking
- Overfitting analysis & regularization
- Signal generation from model outputs
- Custom backtesting engine (no look-ahead bias)
- Performance metrics: Sharpe ratio, max drawdown, hit rate
- LangChain RAG agent with financial document retrieval
- Paper trading integration via Alpaca
- FastAPI REST endpoint
- Azure Container App deployment
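As a rough sketch of what the LSTM in `models/lstm.py` might look like (the hidden size, layer count, dropout value, and linear head here are illustrative assumptions, not the project's actual architecture):

```python
import torch
import torch.nn as nn

class LSTMForecaster(nn.Module):
    """Many-to-one LSTM: a window of features in, one forecast out."""

    def __init__(self, n_features: int, hidden_size: int = 64, num_layers: int = 2):
        super().__init__()
        self.lstm = nn.LSTM(
            input_size=n_features,
            hidden_size=hidden_size,
            num_layers=num_layers,
            batch_first=True,   # input shape: (batch, seq_len, n_features)
            dropout=0.2,        # regularization between stacked layers
        )
        self.head = nn.Linear(hidden_size, 1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        out, _ = self.lstm(x)           # out: (batch, seq_len, hidden_size)
        return self.head(out[:, -1])    # forecast from the last time step only

model = LSTMForecaster(n_features=3)
batch = torch.randn(8, 30, 3)           # 8 samples, 30-step windows, 3 features
print(model(batch).shape)  # torch.Size([8, 1])
```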
The system is grounded in core quant concepts applied deliberately throughout:
- Log-returns over simple returns for time-series stationarity and additivity
- Rolling volatility as a key input feature capturing regime changes
- Sharpe Ratio as the primary strategy evaluation metric
- Look-ahead bias prevention as a first-class concern in the backtesting engine
- Train/validation/test splits respecting temporal ordering — no random shuffling
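A minimal pandas sketch of the first three concepts; the toy prices, the 3-bar window, and the 252-day annualization convention are assumptions for illustration:

```python
import numpy as np
import pandas as pd

# Toy price series standing in for real Alpaca bars
prices = pd.Series([100.0, 101.0, 99.5, 102.0, 103.0, 101.5])

# Log-returns: additive across time and closer to stationary than raw prices
log_ret = np.log(prices / prices.shift(1)).dropna()

# Rolling volatility over a short window (a real feature might use 20 bars)
rolling_vol = log_ret.rolling(window=3).std()

# Annualized Sharpe ratio, assuming daily bars and a zero risk-free rate
sharpe = np.sqrt(252) * log_ret.mean() / log_ret.std()
print(round(float(sharpe), 2))
```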
Abstracted data layer — api_loader.py exposes a standardized DataFrame interface regardless of the underlying data source. Switching from Alpaca to any other provider requires changes in one file only.
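The interface idea behind `api_loader.py` can be sketched as follows; the `MarketDataSource` protocol, method names, and the in-memory stand-in provider are illustrative, not the project's actual API:

```python
from typing import Protocol
import pandas as pd

class MarketDataSource(Protocol):
    """Anything that can return OHLCV bars as a standardized DataFrame."""
    def fetch_bars(self, symbol: str, start: str, end: str) -> pd.DataFrame: ...

class InMemorySource:
    """Stand-in provider; a real Alpaca-backed source would call its API."""
    def __init__(self, frames: dict[str, pd.DataFrame]):
        self._frames = frames

    def fetch_bars(self, symbol: str, start: str, end: str) -> pd.DataFrame:
        df = self._frames[symbol]
        # Every provider returns the same columns, in the same order
        return df.loc[start:end, ["open", "high", "low", "close", "volume"]]

# Downstream code depends only on the DataFrame contract, not the provider
def load(source: MarketDataSource, symbol: str) -> pd.DataFrame:
    return source.fetch_bars(symbol, "2024-01-01", "2024-01-31")

idx = pd.date_range("2024-01-02", periods=3, freq="D")
bars = pd.DataFrame(
    {"open": 1.0, "high": 2.0, "low": 0.5, "close": 1.5, "volume": 100},
    index=idx,
)
source = InMemorySource({"AAPL": bars})
print(load(source, "AAPL").shape)  # (3, 5)
```

Swapping providers then means writing one new class that satisfies the protocol; nothing downstream changes.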
Custom backtester — built from scratch rather than using off-the-shelf libraries, to ensure full understanding of bias sources and edge cases.
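The core anti-look-ahead idea (shift signals so the position held during a bar was decided from the previous bar's information) can be sketched like this; the frictionless, single-asset setup is an illustrative assumption:

```python
import numpy as np
import pandas as pd

def backtest(prices: pd.Series, signals: pd.Series) -> pd.Series:
    """Return per-bar strategy log-returns for +1 / 0 / -1 signals.

    The shift(1) is the crucial step: the position held during bar t is
    decided from information available at bar t-1, so the strategy can
    never trade on the same bar it predicted from (no look-ahead bias).
    """
    log_ret = np.log(prices / prices.shift(1))
    position = signals.shift(1).fillna(0)   # act one bar after the signal
    return (position * log_ret).fillna(0)

prices = pd.Series([100.0, 102.0, 101.0, 104.0])
signals = pd.Series([1, 1, 0, 1])           # model says long / long / flat / long
strat = backtest(prices, signals)
print(strat.sum())  # cumulative log-return of the strategy
```

Here the strategy is long during bars 1 and 2 and flat during bar 3, so its cumulative log-return equals ln(101/100): exactly the buy-and-hold return over the bars it was actually invested.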
RAG over structured + unstructured data — the LangChain agent combines quantitative backtesting results with unstructured financial documents (SEC filings, research papers) for context-aware strategy explanations.
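Independent of LangChain specifics, the retrieval step reduces to nearest-neighbour search over embeddings. A toy cosine-similarity version, with made-up 2-D vectors standing in for real embedding output:

```python
import numpy as np

def top_k(query_vec, doc_vecs, docs, k=2):
    """Rank documents by cosine similarity to the query embedding."""
    q = query_vec / np.linalg.norm(query_vec)
    d = doc_vecs / np.linalg.norm(doc_vecs, axis=1, keepdims=True)
    scores = d @ q                           # cosine similarity per document
    order = np.argsort(scores)[::-1][:k]     # highest similarity first
    return [(docs[i], float(scores[i])) for i in order]

docs = ["10-K risk factors", "LSTM research note", "backtest summary"]
doc_vecs = np.array([[1.0, 0.1], [0.2, 1.0], [0.9, 0.3]])
print(top_k(np.array([1.0, 0.2]), doc_vecs, docs, k=1))
```

In the real system, Azure AI Search plays the role of this similarity scan at scale, and the retrieved chunks are passed to the agent as context.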
```bash
# Clone the repository
git clone https://github.com/yourusername/alphaengine.git
cd alphaengine

# Create and activate conda environment
conda create -n alphaengine python=3.11
conda activate alphaengine

# Install dependencies
pip install -r requirements.txt

# Configure environment variables
cp .env.example .env
# Add your Alpaca API keys and Azure credentials to .env
```

This project is actively under development. It is being built step by step, prioritizing a genuine understanding of each technical concept and workflow over speed. Code is written from scratch where possible, and every architectural decision is made consciously.
Current phase: Phase 1 — Data Foundation
Built by a Bioinformatics MSc graduate transitioning into ML Engineering / Quant Research. The project deliberately combines quantitative finance, deep learning, and cloud infrastructure to mirror the profile of modern data science and ML engineering roles.