
❓ Frequently Asked Questions (FAQ)

Common questions and answers about SecMask.


Table of Contents

  • General Questions
  • Technical Questions
  • Usage Questions
  • Deployment Questions
  • Privacy & Security
  • Performance
  • Troubleshooting
General Questions

What is SecMask?

SecMask is a Mixture of Experts (MoE) system for detecting and masking secrets (API keys, tokens, credentials) in text. It uses two specialized NER models:

  • Fast Expert: DistilBERT-based, handles 92.7% of cases in ~6ms
  • Long Expert: Longformer-based, handles complex cases requiring up to 2048 tokens

Why use SecMask instead of regex-based tools?

Regex limitations:

  • Brittle patterns that break with minor variations
  • High false positive rates
  • Can't handle context-dependent secrets
  • Requires constant maintenance

SecMask advantages:

  • ML-based detection learns patterns from data
  • Low false positive rate (82% precision, production-safe)
  • Handles context (distinguishes real secrets from examples)
  • Automatically adapts to new secret formats
  • Multi-stage pipeline (NER + deterministic filters)
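
Conceptually, the last point means each input goes through two passes (deterministic regex filters and the NER model), and the detected spans are merged before masking. A minimal sketch of that flow, with illustrative stand-in functions (run_filters, run_ner, and the sk- pattern are not SecMask's actual API):

import re

def run_filters(text):
    """Deterministic pass: spans from high-confidence regex patterns."""
    return [m.span() for m in re.finditer(r"sk-[A-Za-z0-9]{10,}", text)]

def run_ner(text):
    """ML pass: spans from the NER experts (stubbed out here)."""
    return []

def mask_pipeline(text):
    # Union of spans from both stages, masked right-to-left so
    # earlier offsets stay valid
    spans = sorted(set(run_filters(text)) | set(run_ner(text)), reverse=True)
    for start, end in spans:
        text = text[:start] + "[SECRET]" + text[end:]
    return text

print(mask_pipeline("key: sk-1234567890abc"))  # key: [SECRET]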

What types of secrets does SecMask detect?

SecMask detects:

  • API keys (OpenAI, Stripe, SendGrid, etc.)
  • Cloud credentials (AWS, Azure, GCP)
  • GitHub tokens (classic, fine-grained, PATs)
  • JWT tokens
  • SSH/PEM keys
  • Database connection strings
  • Kubernetes secrets
  • And more...

See BENCHMARKS.md for detailed detection rates.

Is SecMask open source?

Yes! SecMask is released under dual licensing:

  • SecMask codebase: MIT License (training scripts, inference code, documentation)
  • Fine-tuned models: Apache 2.0 (inherited from DistilBERT and Longformer base models)

You can:

  • ✅ Use freely in commercial projects
  • ✅ Modify and redistribute
  • ✅ Contribute improvements
  • ✅ Fine-tune for your own use cases

Attribution required for:

  • DistilBERT base model (© Hugging Face, Apache 2.0)
  • Longformer base model (© Allen Institute for AI, Apache 2.0)

See LICENSE and NOTICE for full details.

How accurate is SecMask?

On our test set (600 examples) at τ=0.80 threshold:

  • F1 Score: 0.52 (NER model only)
  • Precision: 82% (low false positives, production-safe)
  • Recall: 38% (NER component)
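
For reference, these figures are mutually consistent: F1 = 2 × precision × recall / (precision + recall) = 2 × 0.82 × 0.38 / (0.82 + 0.38) ≈ 0.52.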

Production Note: These metrics represent the NER models alone. Production deployments combine NER detection with deterministic filters (PEM blocks, K8s secrets, pattern-based matching) for comprehensive secret coverage while maintaining the high precision guarantee.

See BENCHMARKS.md for detailed metrics and evaluation methodology.


Technical Questions

What is Mixture of Experts (MoE)?

MoE is an architecture where multiple specialized models ("experts") handle different types of inputs:

  1. Router decides which expert to use based on input characteristics
  2. Fast Expert (DistilBERT, 512 tokens) handles most cases quickly
  3. Long Expert (Longformer, 2048 tokens) handles complex cases

This gives us both speed (most inputs finish in ~6ms via the fast expert) and accuracy on the complex cases.
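
The dispatch itself is only a few lines; a minimal sketch of the routing flow (mask_fast and mask_long are illustrative names, not SecMask's actual API; should_escalate is shown below):

def mask_moe(text):
    if should_escalate(text):    # heuristic router, see below
        return mask_long(text)   # Longformer expert, up to 2048 tokens
    return mask_fast(text)       # DistilBERT expert, ~6ms typical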

Why two models instead of one?

Trade-off between speed and capacity:

Aspect        Fast Expert       Long Expert
Latency       6ms               12ms
Max tokens    512               2048
Model size    268MB             592MB
Use case      Single secrets    Long configs

Using both gives us the best of both worlds:

  • 92.7% of texts processed in 6ms (fast expert)
  • Complex cases escalated to long expert automatically

How does the router work?

The router uses heuristics to decide which expert to use:

def should_escalate(text):
    """Decide whether the input needs the long expert."""

    # Short texts stay with the fast expert
    if len(text.split()) < 100:
        return False

    # No multi-line/config-like structure: the fast expert is enough
    # (has_multi_line_structure is a helper defined in router.py)
    if not has_multi_line_structure(text):
        return False

    # Long, structurally complex input: escalate
    return True

See router.py for full implementation.
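
The has_multi_line_structure helper isn't reproduced above; one plausible heuristic, shown as an illustration only (not necessarily what router.py actually does):

def has_multi_line_structure(text):
    # Treat inputs with many lines, or many config-like
    # "key: value" / "key=value" lines, as structurally complex
    lines = text.splitlines()
    config_like = sum(1 for line in lines if ":" in line or "=" in line)
    return len(lines) > 20 or config_like > 5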

Can I train my own models?

Yes! See train_ner_masker.py for the training script.

Requirements:

  • Labeled dataset (BIO-tagged secrets)
  • GPU (NVIDIA T4 or better)
  • ~2 hours training time

Steps:

# Prepare data (see data/README.md)
python data/make_v2_data.py

# Train fast expert
python train_ner_masker.py \
  --model-name distilbert-base-uncased \
  --train-file data/v2_train.jsonl \
  --val-file data/v2_val.jsonl

# Train long expert
python train_longformer_expert.py \
  --model-name allenai/longformer-base-4096 \
  --train-file data/long_context_train.jsonl \
  --val-file data/long_context_val.jsonl

What is the model architecture?

Fast Expert:

  • Base: distilbert-base-uncased (66M parameters)
  • Task: Token classification (NER)
  • Labels: O (non-secret), B-SECRET, I-SECRET
  • Context window: 512 tokens

Long Expert:

  • Base: allenai/longformer-base-4096 (149M parameters)
  • Task: Token classification (NER)
  • Labels: Same as fast expert
  • Context window: 2048 tokens (4096 max)

Both use standard HuggingFace transformers architecture.
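
Since both are standard transformers checkpoints, you can inspect the label set directly (assuming the checkpoints expose the usual id2label mapping; the exact index order is an assumption here):

from transformers import AutoModelForTokenClassification

model = AutoModelForTokenClassification.from_pretrained(
    "andrewandrewsen/distilbert-secret-masker")
print(model.config.id2label)  # expected: {0: 'O', 1: 'B-SECRET', 2: 'I-SECRET'}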

How do I add support for new secret types?

Option 1: Retrain with new data

# Add labeled examples to data/v2_train.jsonl
echo '{"text": "New secret: xyz-123", "labels": ["O", "O", "B-SECRET"]}' >> data/v2_train.jsonl

# Retrain model
python train_ner_masker.py --train-file data/v2_train.jsonl

Option 2: Add regex filter

# Add to filters.json
{
  "name": "custom_secret",
  "pattern": "xyz-[0-9]{3}",
  "confidence": 0.95
}

# Apply filter
from filters import apply_filters
masked = apply_filters(text, filters)
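
The idea behind apply_filters is easy to sketch by hand; this re-implementation is illustrative only (the real logic and signature live in the filters module, and the tau parameter here is an assumption):

import json
import re

def apply_filters(text, filters, tau=0.80):
    """Mask every match of each filter whose confidence clears tau."""
    for f in filters:
        if f["confidence"] >= tau:
            text = re.sub(f["pattern"], "[SECRET]", text)
    return text

filters = json.loads(
    '[{"name": "custom_secret", "pattern": "xyz-[0-9]{3}", "confidence": 0.95}]')
print(apply_filters("token: xyz-123", filters))  # token: [SECRET]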

Usage Questions

How do I install SecMask?

Quick install:

# Install dependencies
pip install transformers torch

# Download code
git clone https://github.com/andrewandrewsen/secmask.git
cd secmask

# Run
python infer_moe.py --in file.txt \
  --fast-model andrewandrewsen/distilbert-secret-masker

See README.md for detailed instructions.

How do I use SecMask from Python?

from infer_moe import mask_text_moe

# Basic usage
masked = mask_text_moe(
    "My API key is sk-1234567890",
    fast_model_dir="andrewandrewsen/distilbert-secret-masker"
)

print(masked)  # "My API key is [SECRET]"

See EXAMPLES.md for more examples.

How do I adjust sensitivity?

Use the --tau parameter (threshold):

# More sensitive (more detections, more false positives)
python infer_moe.py --in file.txt --tau 0.50

# Less sensitive (fewer false positives, may miss some secrets)
python infer_moe.py --in file.txt --tau 0.90

# Default (balanced)
python infer_moe.py --in file.txt --tau 0.80

Recommended thresholds:

  • Production logs: tau=0.85 (minimize false positives)
  • Pre-commit hooks: tau=0.75 (catch more secrets)
  • Security audits: tau=0.70 (be extra cautious)
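
Under the hood, tau acts as a per-token probability cutoff: a token is masked only when the model's combined secret-class probability clears it. A simplified sketch of that step (not the actual inference code in infer_moe.py):

import torch

def secret_token_ids(logits, tau=0.80):
    """Indices of tokens treated as secrets at threshold tau.

    logits: tensor of shape (seq_len, 3) for labels [O, B-SECRET, I-SECRET].
    """
    probs = torch.softmax(logits, dim=-1)
    secret_prob = probs[:, 1:].sum(dim=-1)  # P(B-SECRET) + P(I-SECRET)
    return (secret_prob >= tau).nonzero(as_tuple=True)[0].tolist()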

Can I process multiple files?

Yes! Use a simple loop:

# Bash
for file in *.py; do
    python infer_moe.py --in "$file" --out "${file}.masked"
done

# Python
from pathlib import Path
from infer_moe import mask_text_moe

for file in Path('.').glob('*.py'):
    with open(file, 'r') as f:
        content = f.read()

    masked = mask_text_moe(content,
        fast_model_dir="andrewandrewsen/distilbert-secret-masker")

    with open(f"{file}.masked", 'w') as f:
        f.write(masked)

How do I use private HuggingFace models?

Option 1: Login via CLI

huggingface-cli login
# Enter your token when prompted

Option 2: Environment variable

export HF_TOKEN="hf_xxxxxxxxxxxxx"
python infer_moe.py --in file.txt --fast-model my-org/private-model

Option 3: Pass token directly

python infer_moe.py --in file.txt \
  --fast-model my-org/private-model \
  --token hf_xxxxxxxxxxxxx
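
Option 4: Log in from Python code

If you're calling SecMask as a library, the huggingface_hub package also offers a programmatic login:

from huggingface_hub import login

login(token="hf_xxxxxxxxxxxxx")  # omit the argument for an interactive prompt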

Deployment Questions

Can I deploy SecMask in production?

Yes! SecMask is production-ready. See DEPLOYMENT.md for:

  • Docker deployment
  • Kubernetes
  • AWS Lambda
  • Azure Functions

What are the hardware requirements?

Minimum (CPU-only):

  • 2 CPU cores
  • 4GB RAM
  • ~6-10ms latency per request

Recommended (GPU):

  • NVIDIA T4 or better
  • 8GB RAM
  • ~3-5ms latency per request

See BENCHMARKS.md for details.

How do I scale SecMask?

Horizontal scaling (multiple instances):

# Kubernetes
kubectl scale deployment secmask --replicas=10

# Docker Swarm
docker service scale secmask=10

Vertical scaling (more resources):

resources:
  requests:
    memory: "8Gi"
    cpu: "4000m"

See DEPLOYMENT.md for auto-scaling setup.

Can I use SecMask in a Lambda function?

Yes! See DEPLOYMENT.md for setup.

Key considerations:

  • Use container image deployment (not zip)
  • Set timeout to 30s
  • Set memory to 2048MB
  • Pre-download models at build time
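
Putting those together, a handler can load the model once at module import so warm invocations skip the 2-5s load. A minimal sketch (the event shape, model path, and handler wiring are assumptions for illustration, not a drop-in implementation):

from infer_moe import load_model, mask_text_moe

# Module scope: runs once per container, so warm invocations reuse the model
MODEL_DIR = "/opt/models/distilbert-secret-masker"  # baked in at build time
load_model(MODEL_DIR)

def handler(event, context):
    masked = mask_text_moe(event["text"], fast_model_dir=MODEL_DIR)
    return {"statusCode": 200, "body": masked}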

Does SecMask work offline?

Yes, once models are downloaded:

# Download model and tokenizer once (requires network)
python -c "from transformers import AutoModelForTokenClassification, AutoTokenizer; \
    AutoModelForTokenClassification.from_pretrained('andrewandrewsen/distilbert-secret-masker'); \
    AutoTokenizer.from_pretrained('andrewandrewsen/distilbert-secret-masker')"

# Subsequent runs resolve the model from the local cache
export HF_HUB_OFFLINE=1
python infer_moe.py --in file.txt \
  --fast-model andrewandrewsen/distilbert-secret-masker

Privacy & Security

Does SecMask send data to external servers?

No. SecMask runs entirely locally. Your data never leaves your machine unless you:

  • Use HuggingFace Inference API (not recommended)
  • Deploy SecMask as a remote service

Is my data safe?

Yes! SecMask:

  • Processes data in-memory only
  • Doesn't log secrets (only metadata)
  • Doesn't send telemetry

Best practices:

  • Run SecMask on-premises
  • Use local model storage (not HF cache)
  • Review masked output before sharing

Can SecMask leak secrets in logs?

SecMask is designed to prevent secret leakage:

  • Only logs masked text (not original)
  • Doesn't log model predictions
  • Safe to enable debug logging

Example log output:

INFO: Masking text (length: 245 chars)
INFO: Masked in 6.3ms, found 2 secrets
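
The same pattern applies to your own wrapper code: log lengths, timings, and counts, never the text itself. A hedged sketch:

import logging
import time

from infer_moe import mask_text_moe

log = logging.getLogger("secmask")

def mask_and_log(text):
    start = time.perf_counter()
    masked = mask_text_moe(
        text, fast_model_dir="andrewandrewsen/distilbert-secret-masker")
    elapsed_ms = (time.perf_counter() - start) * 1000
    # Metadata only: never log the raw input, which may contain secrets
    log.info("Masked %d chars in %.1fms, %d secrets found",
             len(text), elapsed_ms, masked.count("[SECRET]"))
    return masked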

How do I report security vulnerabilities?

DO NOT open public issues for security vulnerabilities.

Instead:

  1. Email: [security@example.com]
  2. Include: Description, impact, steps to reproduce
  3. We'll respond within 48 hours

See SECURITY.md for details.


Performance

How fast is SecMask?

Latency:

  • Fast expert: 6ms (median), 12ms (P99)
  • Long expert: 12ms (median), 25ms (P99)
  • MoE average: 6.8ms (92.7% use fast expert)

Throughput:

  • CPU: ~50 requests/second (single core)
  • GPU (T4): ~300 requests/second
  • GPU (A100): ~1200 requests/second

See BENCHMARKS.md for detailed metrics.

Why is the first request slow?

Model loading overhead:

  • First request loads model into memory (~2-5s)
  • Subsequent requests reuse loaded model (fast)

Solutions:

  • Keep service running (don't restart per request)
  • Use model caching
  • Pre-load models at startup

# Pre-load models
from infer_moe import load_model
model, tokenizer = load_model("andrewandrewsen/distilbert-secret-masker")

# Now fast for all requests

How can I make SecMask faster?

1. Use fast-only mode (no escalation):

python infer_moe.py --in file.txt --no-escalate
# 2x faster, slight accuracy loss

2. Use GPU:

# Automatic if available
python infer_moe.py --in file.txt

3. Batch processing:

from transformers import pipeline

pipe = pipeline("token-classification",
    model="andrewandrewsen/distilbert-secret-masker",
    batch_size=16)  # Process 16 at once

results = pipe(texts)  # Much faster

4. ONNX conversion:

# Convert to ONNX with Hugging Face Optimum (typically 2-3x faster on CPU)
pip install "optimum[onnxruntime]"
optimum-cli export onnx \
    --model andrewandrewsen/distilbert-secret-masker onnx_model/

See DEPLOYMENT.md for more.

How much memory does SecMask use?

Model sizes:

  • Fast expert: 268MB
  • Long expert: 592MB
  • Runtime: +500MB (tokenizer, inference)

Total:

  • Fast-only: ~800MB
  • MoE (both): ~1.4GB

GPU adds VRAM overhead (~1GB).


Troubleshooting

Error: "Model not found"

Cause: Model not downloaded or incorrect path.

Solution:

# Download model and tokenizer
python -c "from transformers import AutoModelForTokenClassification, AutoTokenizer; \
    AutoModelForTokenClassification.from_pretrained('andrewandrewsen/distilbert-secret-masker'); \
    AutoTokenizer.from_pretrained('andrewandrewsen/distilbert-secret-masker')"

# Use full HuggingFace ID
python infer_moe.py --fast-model andrewandrewsen/distilbert-secret-masker

Error: "CUDA out of memory"

Cause: GPU VRAM insufficient.

Solution:

# Use CPU
export CUDA_VISIBLE_DEVICES=""
python infer_moe.py --in file.txt

# Or reduce batch size (if using batching)
pipe = pipeline(..., batch_size=1)

Why are there false positives?

Common causes:

  • Hex strings (e.g., #1a2b3c, git commits)
  • UUIDs (e.g., 123e4567-e89b-12d3-a456-426614174000)
  • Hashes (e.g., MD5, SHA256)

Solutions:

# Increase threshold (fewer false positives)
python infer_moe.py --tau 0.90

# Add custom filters to whitelist patterns
# Edit filters.json

Why are secrets missed (false negatives)?

Common causes:

  • New secret format not in training data
  • Obfuscated secrets (e.g., base64 encoded)
  • Very long secrets (>512 tokens for fast expert)

Solutions:

# Decrease threshold (more detections)
python infer_moe.py --tau 0.70

# Enable long expert
python infer_moe.py \
  --fast-model andrewandrewsen/distilbert-secret-masker \
  --long-model andrewandrewsen/longformer-secret-masker

# Retrain with new examples
python train_ner_masker.py --train-file data/custom_train.jsonl

SecMask is too slow. What can I do?

Quick wins:

  1. Use fast-only mode: --no-escalate
  2. Use GPU if available
  3. Increase threshold: --tau 0.85 (fewer detections = faster)

Advanced:

  1. ONNX conversion (2-3x speedup)
  2. Quantization (smaller model, faster)
  3. Batch processing (higher throughput)

See Performance section above.

How do I debug issues?

Enable debug logging:

import logging
logging.basicConfig(level=logging.DEBUG)

from infer_moe import mask_text_moe
masked = mask_text_moe(...)

Check model loading:

from transformers import AutoModel

try:
    model = AutoModel.from_pretrained("andrewandrewsen/distilbert-secret-masker")
    print("✅ Model loaded successfully")
except Exception as e:
    print(f"❌ Error: {e}")

Test inference:

from infer_moe import mask_text_moe

text = "Test: sk-1234567890"
masked = mask_text_moe(text, fast_model_dir="andrewandrewsen/distilbert-secret-masker")
print(f"Input: {text}")
print(f"Output: {masked}")
assert '[SECRET]' in masked, "Secret not detected!"

Still Have Questions?

If your question isn't answered here, check the linked docs (README.md, EXAMPLES.md, DEPLOYMENT.md, BENCHMARKS.md) or open an issue on the repository.

Last Updated: 2024-11