Official implementation of the paper: "Stochastic Covariance-Based Initialization (SCBI): A Scalable Warm-Start Strategy for High-Dimensional Linear Models"
A novel neural network weight initialization method that achieves an 87× lower initial MSE on regression tasks and a 33% lower initial loss on classification tasks. SCBI is a GPU-accelerated initialization method that calculates the optimal starting weights for linear layers using covariance statistics. It effectively "solves" linear regression tasks before the first epoch of training, replacing random initialization with a statistically grounded "warm start."
SCBI significantly outperforms standard Random Initialization (He/Xavier) by approximating the closed-form solution via stochastic bagging.
| Dataset | Task | Improvement (Epoch 0) |
|---|---|---|
| Synthetic High-Dim | Binary Classification | 61.6% reduction in loss |
| California Housing | Regression | 90.8% reduction in MSE |
| Forest Cover Type | Multi-Class Classification | 26.2% reduction in loss |
Figure 1: Convergence comparison on Synthetic High-Dimensional Data.
Figure 2: California Housing Regression: SCBI vs Random.
Figure 3: Multi-Class Classification: SCBI vs Random.
| Task | Metric | Standard Init | SCBI Init | Improvement |
|---|---|---|---|---|
| Regression | Initial MSE | 26,000 | 300 | 87× |
| Regression | Final MSE | 22,000 | ~0 | >1000× |
| Classification | Initial Loss | 1.18 | 0.79 | 33% |
| Classification | Final Loss | 1.14 | 0.77 | 32% |
SCBI (Stochastic Covariance-Based Initialization) is a data-driven initialization method that computes optimal linear weights by solving the Normal Equation on stochastic subsets of your training data.
- Universal Formulation: Works for regression, binary, multi-class, and multi-label classification
- Stochastic Bagging: Prevents overfitting by averaging solutions across random data subsets
- Ridge Regularization: Ensures numerical stability even with correlated features
- Fast Approximation: Linear-complexity variant for high-dimensional problems (D > 10,000)
For each stochastic subset, SCBI solves the ridge-regularized Normal Equation:
W = (X^T X + λI)^{-1} X^T y
Final weights are obtained by ensemble averaging across all subsets.
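As a minimal sketch (illustrative only, not the library's own implementation), the per-subset solve maps directly onto PyTorch linear algebra:

```python
import torch

def ridge_normal_equation(X, y, ridge_alpha=1.0):
    """Solve (X^T X + λI)^{-1} X^T y for a single stochastic subset."""
    D = X.shape[1]
    gram = X.T @ X + ridge_alpha * torch.eye(D)   # X^T X + λI
    corr = X.T @ y.reshape(X.shape[0], -1)        # X^T y
    return torch.linalg.solve(gram, corr)         # weights, shape [D, output_dim]
```

SCBI repeats this solve on several random subsets and averages the resulting weight matrices.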
Dependencies are minimal. You only need PyTorch and standard data libraries.
```bash
pip install torch numpy scikit-learn matplotlib

# Download scbi.py
wget https://github.com/fares3010/SCBI/blob/main/scbi.py
```

```python
from scbi import SCBIInitializer, scbi_init
```
Regression:

```python
import torch
import torch.nn as nn
from scbi import scbi_init

# Your data
X_train = torch.randn(1000, 50)  # [N, D]
y_train = torch.randn(1000)      # [N]

# Compute SCBI weights
weights, bias = scbi_init(X_train, y_train, n_samples=10)

# Use in your model
model = nn.Linear(50, 1)
with torch.no_grad():
    model.weight.data = weights.T
    model.bias.data = bias
```

Multi-class classification:

```python
import torch
import torch.nn as nn
from scbi import SCBIInitializer
# Your data (one-hot encoded targets)
X_train = torch.randn(1000, 50)
y_onehot = torch.zeros(1000, 10)
y_onehot.scatter_(1, torch.randint(0, 10, (1000, 1)), 1)
# Initialize with SCBI
model = nn.Linear(50, 10)
initializer = SCBIInitializer(n_samples=15, ridge_alpha=1.5)
initializer.initialize_layer(model, X_train, y_onehot)
```

High-dimensional data (fast approximation):

```python
import torch
from scbi import fast_damping_init
X_train = torch.randn(500, 15000) # High-dimensional
y_train = torch.randn(500)
# Use fast approximation (O(N×D²) instead of O(N×D³))
weights, bias = fast_damping_init(X_train, y_train)
```

SCBIInitializer: main class for SCBI initialization.

```python
SCBIInitializer(
    n_samples=10,       # Number of stochastic subsets
    sample_ratio=0.5,   # Fraction of data per subset
    ridge_alpha=1.0,    # Regularization strength
    verbose=True        # Print progress
)
```

Methods:
- `compute_weights(X_data, y_data)` → Returns `(weights, bias)`
- `initialize_layer(layer, X_data, y_data)` → Initializes an `nn.Linear` layer in-place
`scbi_init`: functional interface.

```python
weights, bias = scbi_init(
    X_data,            # Input features [N, D]
    y_data,            # Targets [N] or [N, Output_Dim]
    n_samples=10,
    sample_ratio=0.5,
    ridge_alpha=1.0,
    verbose=True
)
```

`FastDampingInitializer`: fast approximation for very high-dimensional data.
```python
FastDampingInitializer(eps=1e-8, verbose=True)
```

Methods:
- `compute_weights(X, y)` → Returns `(weights, bias)`
- `initialize_layer(layer, X, y)` → Initializes an `nn.Linear` layer

```python
weights, bias = fast_damping_init(X_data, y_data, verbose=True)
```

`n_samples`
- Default: 10
- Range: 5-20
- Higher: More stable, slower
- Lower: Faster, more variance
`sample_ratio`
- Default: 0.5 (50% of data)
- Range: 0.3-1.0
- Higher: Less stochastic, may overfit
- Lower: More stochastic, more robust
`ridge_alpha`
- Default: 1.0
- Range: 0.5-5.0
- Higher: More regularization (for noisy/high-dim data)
- Lower: Less regularization (for clean/low-dim data)
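Combining these guidelines, one reasonable (illustrative, not official) configuration for each regime might look like:

```python
from scbi import SCBIInitializer

# Noisy or high-dimensional data: more subsets, smaller subsets, stronger ridge penalty.
noisy_init = SCBIInitializer(n_samples=20, sample_ratio=0.3, ridge_alpha=5.0)

# Clean, low-dimensional data: fewer subsets, larger subsets, lighter regularization.
clean_init = SCBIInitializer(n_samples=5, sample_ratio=0.8, ridge_alpha=0.5)
```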
Complete training example:

```python
import torch
import torch.nn as nn
import torch.optim as optim
from sklearn.datasets import make_regression
from sklearn.preprocessing import StandardScaler
from scbi import SCBIInitializer
# 1. Generate synthetic data
X, y = make_regression(n_samples=2000, n_features=50, noise=10.0)
# 2. Standardize features (important!)
scaler = StandardScaler()
X = scaler.fit_transform(X)
# 3. Convert to tensors
X_train = torch.tensor(X[:1600], dtype=torch.float32)
y_train = torch.tensor(y[:1600], dtype=torch.float32)
# 4. Create model with SCBI initialization
model = nn.Linear(50, 1)
initializer = SCBIInitializer(n_samples=10, ridge_alpha=1.0)
initializer.initialize_layer(model, X_train, y_train)
# 5. Train normally
optimizer = optim.SGD(model.parameters(), lr=0.01)
criterion = nn.MSELoss()
for epoch in range(5):
    optimizer.zero_grad()
    pred = model(X_train)
    loss = criterion(pred, y_train.unsqueeze(1))
    loss.backward()
    optimizer.step()
    print(f"Epoch {epoch+1}: Loss = {loss.item():.4f}")
```

Using the computed weights directly in an existing training script:

```python
import torch
import torch.nn as nn
from scbi import compute_scbi_weights # Assuming you save the algorithm in scbi.py
# 1. Define your model
model = nn.Linear(input_dim, 1)
# 2. Calculate SCBI Weights (GPU Accelerated)
# Note: X_train and y_train must be Tensors
print("Calculating Warm Start...")
w_init, b_init = compute_scbi_weights(X_train, y_train, n_samples=10, ridge_alpha=1.0)
# 3. Assign Weights to Model
with torch.no_grad():
    model.weight.data = w_init.T
    model.bias.data = b_init
# 4. Train as normal (Adam/SGD)...
# You will see the loss start near 0.
```

For classification, ensure your target y is one-hot encoded before passing it to SCBI.

```python
from scbi import compute_scbi_classification
# ... Load Data ...
# Calculate Weights for 7 Classes
w_init, b_init = compute_scbi_classification(X_train, y_one_hot, n_samples=10)
with torch.no_grad():
    model.weight.data = w_init.T
    model.bias.data = b_init.squeeze()
```

Standard initialization strategies (Xavier, He) are semantically blind: they initialize weights based on architecture dimensions, ignoring data statistics.
SCBI leverages the Normal Equation approximation:
W* = (X^T X + λI)^{-1} X^T y
By computing this on random subsets (bagging) with GPU matrix operations, we obtain a robust estimator of the global minimum without the overfitting and instability that can come from a single exact fit.
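As a minimal, self-contained illustration of that difference (synthetic data and a plain ridge solve standing in for SCBI; not a benchmark from the paper):

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
X = torch.randn(2000, 50)
y = X @ torch.randn(50, 1) + 0.1 * torch.randn(2000, 1)

# Architecture-only initialization ignores the data.
xavier_layer = nn.Linear(50, 1)
nn.init.xavier_uniform_(xavier_layer.weight)

# Data-driven warm start: ridge-regularized Normal Equation on the training data.
w_hat = torch.linalg.solve(X.T @ X + 1.0 * torch.eye(50), X.T @ y)  # (X^T X + λI)^{-1} X^T y, λ = 1.0
warm_layer = nn.Linear(50, 1)
with torch.no_grad():
    warm_layer.weight.copy_(w_hat.T)
    warm_layer.bias.zero_()

mse = nn.MSELoss()
print("Epoch-0 MSE, Xavier init:", mse(xavier_layer(X), y).item())
print("Epoch-0 MSE, warm start :", mse(warm_layer(X), y).item())
```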
SCBI works best for:
- Small to medium datasets (N < 100,000)
- First layer initialization in deep networks
- Linear/weakly non-linear problems
- Tasks where training data is expensive
- Fast prototyping and experimentation
SCBI is less suitable for:
- Very large datasets (N > 1,000,000) - computational cost may outweigh benefits
- Highly non-linear problems - SCBI is fundamentally linear
- Pre-trained models - Transfer learning may be more effective
| Method | Time Complexity | Space Complexity |
|---|---|---|
| Xavier/He Init | O(D) | O(D) |
| SCBI | O(n_samples × N × D²) | O(D²) |
| Fast Damping | O(N × D²) | O(D²) |
Rules of thumb by feature dimension D (a small selection sketch follows this list):
- D < 1,000: Use standard SCBI (n_samples=10-20)
- 1,000 < D < 10,000: Use standard SCBI (n_samples=5-10)
- D > 10,000: Use Fast Damping approximation
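A tiny helper reflecting these rules of thumb; `choose_initializer` is a hypothetical name, not part of the library:

```python
from scbi import SCBIInitializer, FastDampingInitializer

def choose_initializer(D):
    # Pick an initializer variant based on feature dimensionality, per the guidance above.
    if D > 10_000:
        return FastDampingInitializer()
    n_samples = 15 if D < 1_000 else 8
    return SCBIInitializer(n_samples=n_samples)
```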
Algorithm steps:
- Stochastic Sampling: Randomly sample a `sample_ratio` fraction of the training data
- Augmentation: Add a bias column to the feature matrix
- Covariance Matrix: Compute X^T @ X
- Regularization: Add the ridge penalty λI
- Correlation: Compute X^T @ y
- Solve: Use linear algebra to solve (X^T X + λI)^{-1} X^T y
- Repeat: Do this for `n_samples` different subsets
- Average: Ensemble-average all solutions (see the sketch after this list)
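A compact sketch of these steps; illustrative only, and `scbi_weights_sketch` is not a name from the library:

```python
import torch

def scbi_weights_sketch(X, y, n_samples=10, sample_ratio=0.5, ridge_alpha=1.0):
    N = X.shape[0]
    y2d = y.reshape(N, -1)                             # [N, Output_Dim]
    Xb = torch.cat([X, torch.ones(N, 1)], dim=1)       # Augmentation: bias column
    subset_size = int(N * sample_ratio)
    solutions = []
    for _ in range(n_samples):                         # Repeat for n_samples subsets
        idx = torch.randperm(N)[:subset_size]          # Stochastic sampling
        Xs, ys = Xb[idx], y2d[idx]
        gram = Xs.T @ Xs + ridge_alpha * torch.eye(Xs.shape[1])  # X^T X + λI
        corr = Xs.T @ ys                               # X^T y
        solutions.append(torch.linalg.solve(gram, corr))  # (X^T X + λI)^{-1} X^T y
    W = torch.stack(solutions).mean(dim=0)             # Ensemble average
    return W[:-1], W[-1]                               # weights [D, Output_Dim], bias
```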
Averaging across multiple random subsets provides:
- Robustness: Less sensitive to outliers
- Regularization: Implicit regularization effect
- Better generalization: Prevents overfitting to specific data patterns
If you use SCBI in your research, please cite:
Fares, A. (2026). SCBI: Stochastic Covariance-Based Initialization (1.0.0). Zenodo. https://doi.org/10.5281/zenodo.18576203

This project is licensed under the MIT License - see the LICENSE file for details.


