dotnet · Copilot · Dec 5, 2025 · Dec 5, 2025 · Dec 5, 2025 · Dec 5, 2025
@@ -0,0 +1,122 @@
+# DList Migration Decisions
+
+## Migration Strategy
+
+### 1. Naming Convention
+- Using `CachedDList` instead of `DList` in public APIs for clarity
+- Module functions follow same naming as `QueueList` for easy replacement
+
+### 2. API Compatibility Decisions
+
+#### QueueList.appendOne vs CachedDList.appendOne
+- **QueueList**: `QueueList<'T> -> 'T -> QueueList<'T>` (curried)
+- **CachedDList**: Both member (`x.AppendOne(y)`) and module function (`appendOne x y`)
+- **Decision**: Use module function `CachedDList.appendOne` for compatibility
+- **Perf Impact**: None - O(1) for both
+
+#### QueueList.append vs CachedDList.append  
+- **QueueList**: `QueueList<'T> -> QueueList<'T> -> QueueList<'T>` - O(n) operation
+- **CachedDList**: `CachedDList<'T> -> CachedDList<'T> -> CachedDList<'T>` - **O(1) operation**
+- **Decision**: Direct replacement - this is the KEY OPTIMIZATION
+- **Perf Impact**: **Massive improvement** - O(1) vs O(n) for main hot path
+
+#### QueueList.foldBack
+- **QueueList**: Custom implementation with reversed tail handling
+- **CachedDList**: Delegates to `List.foldBack` on materialized (cached) list
+- **Decision**: Direct replacement via cached list
+- **Perf Impact**: Neutral to positive (caching amortizes cost across multiple foldBack calls)
+
+#### QueueList.ofList
+- **QueueList**: Creates front/back split
+- **CachedDList**: Stores list directly, creates DList wrapper
+- **Decision**: Direct replacement
+- **Perf Impact**: Slightly better (less splitting)
+
+### 3. Migration Order
+
+1. **Phase 1: Core Types** (TypedTree.fs/fsi)
+   - Change `ModuleOrNamespaceType` constructor to use `CachedDList`
+   - Update cache invalidation in mutation methods
+   - Update all property implementations using foldBack
+
+2. **Phase 2: Serialization** (TypedTreePickle.fs)
+   - Add `p_cached_dlist` and `u_cached_dlist` functions
+   - Replace `p_qlist`/`u_qlist` usage for `ModuleOrNamespaceType`
+
+3. **Phase 3: Hot Paths** (TypedTreeOps.fs)
+   - **CombineModuleOrNamespaceTypes** - CRITICAL: O(1) append instead of O(n)
+   - Update all `QueueList.foldBack` calls to `CachedDList.foldBack`
+
+4. **Phase 4: Remaining Usage Sites**
+   - Symbols.fs, Optimizer.fs, fsi.fs, etc.
+   - Replace as needed for compilation
+
+### 4. Backward Compatibility
+
+#### Pickle Format
+- **Decision**: Keep pickle format compatible by converting CachedDList to/from list
+- **Implementation**: `p_cached_dlist = p_wrap CachedDList.toList (p_list pv)`
+- **Rationale**: Avoids breaking binary compatibility
+
+#### FirstElements/LastElements Properties
+- **QueueList**: Has separate front and reversed back lists
+- **CachedDList**: Single materialized list
+- **Decision**: `FirstElements` returns full materialized list, `LastElements` returns empty list
+- **Rationale**: These are rarely used except in debugging; compatibility maintained
+- **Perf Impact**: None for actual usage
+
+### 5. Performance Expectations
+
+Based on benchmarks (V5 - DList with cached iteration):
+
+| Metric | QueueList | CachedDList | Improvement |
+|--------|-----------|-------------|-------------|
+| Append (2 DLists) | O(n) | **O(1)** | **Massive** |
+| AppendOne | O(1) | O(1) | Same |
+| foldBack (first call) | O(n) | O(n) | Same |
+| foldBack (subsequent) | O(n) | O(1) (cached) | Better |
+| Memory overhead | 1x | 1.6x | Acceptable |
+| Combined scenario (5000 appends) | 19.7ms | 4.8ms | **4.1x faster** |
+
+Expected impact on compilation (5000 files, same namespace):
+- **Typecheck phase**: 171s → ~40-50s (4x improvement)
+- **Total time**: 8:43 → ~2-3 min
+- **Memory**: 11.69 GB → ~12-14 GB (small increase acceptable)
+
+### 6. Known Limitations
+
+1. **LastElements always empty**: CachedDList doesn't maintain separate front/back
+   - **Impact**: Minimal - only used in debug views
+   - **Alternative**: Could track but adds complexity with no benefit
+
+2. **Lazy materialization**: First iteration/foldBack forces full materialization
+   - **Impact**: Positive - amortizes cost across multiple operations
+   - **Benchmark confirmed**: Still 4.1x faster overall
+
+3. **Memory overhead 1.6x**: Stores both DList function and cached list
+   - **Impact**: Acceptable trade-off for 4x speedup
+   - **Mitigation**: Lazy evaluation means cache only created when needed
+
+### 7. Rollback Plan
+
+If issues arise:
+1. All changes localized to TypedTree* files and utilities
+2. Can revert by changing imports back to QueueList
+3. DList code can remain for future use
+4. Benchmark results preserved for reference
+
+### 8. Testing Strategy
+
+1. **Unit Tests**: Existing TypedTree tests should pass unchanged
+2. **Integration**: Full compiler test suite
+3. **Performance**: 5000 file scenario with --times flag
+4. **Validation**: Compare against baseline results in investigation/
+
+## Status
+
+- [x] DList implementation complete (DList.fs/fsi)
+- [x] Benchmarks confirm 4.1x improvement
+- [ ] TypedTree migration
+- [ ] Build validation
+- [ ] Test suite validation
+- [ ] Performance measurements
@@ -0,0 +1,88 @@
+# DList Migration TODO
+
+## Status: MIGRATION COMPLETE - TESTING IN PROGRESS
+
+## Completed Tasks
+- [x] Create comprehensive QueueList benchmarks
+- [x] Identify V5 (DList with cached iteration) as best performer (4.1x faster, 1.6x memory)
+- [x] Document all benchmark results
+- [x] Find all QueueList usage sites (89 instances across 11 files)
+- [x] Create DList.fsi and DList.fs implementation
+- [x] Add DList to build system (FSharp.Compiler.Service.fsproj)
+- [x] Verify DList compiles successfully
+- [x] **COMPLETE MIGRATION**: Replace all 89 QueueList usages with CachedDList
+- [x] **BUILD SUCCESS**: 0 errors, 0 warnings
+- [x] Create DECISIONS.md documenting migration strategy
+
+## QueueList Usage Sites (Priority Hot Paths)
+1. **TypedTree.fs** - Core type definition (ModuleOrNamespaceType)
+2. **TypedTreeOps.fs** - CombineModuleOrNamespaceTypes (MAIN HOT PATH)
+3. **TypedTreePickle.fs** - Serialization
+4. **Symbols.fs** - Symbol operations
+5. **Optimizer.fs** - Dead code elimination
+6. **fsi.fs** - Interactive
+
+## Current Tasks
+
+### 1. Create DList Implementation ✅ DONE
+- [x] Create `src/Compiler/Utilities/DList.fsi` (interface file)
+- [x] Create `src/Compiler/Utilities/DList.fs` (implementation)
+  - Core DList type: `type DList<'T> = DList of ('T list -> 'T list)`
+  - Wrapper type `CachedDList<'T>` with lazy materialized list
+  - Functions: empty, singleton, cons, append, appendMany, toList
+  - QueueList-compatible API: AppendOne, ofList, map, filter, foldBack, etc.
+  - Fast O(1) "DList Append DList" operation
+
+### 2. Add DList to Build System ✅ DONE
+- [x] Add DList.fsi and DList.fs to FSharp.Compiler.Service.fsproj
+- [x] Ensure proper ordering in compilation
+
+### 3. Migrate All Usage Sites ✅ DONE
+- [x] TypedTree.fs: Change ModuleOrNamespaceType to use CachedDList
+- [x] TypedTree.fsi: Update interface
+- [x] TypedTreeOps.fs: Update CombineModuleOrNamespaceTypes (KEY OPTIMIZATION - now O(1) append!)
+- [x] TypedTreePickle.fs: Add p_cached_dlist/u_cached_dlist functions
+- [x] CheckDeclarations.fs: Replace QueueList with CachedDList
+- [x] NameResolution.fs: Replace QueueList with CachedDList
+- [x] NicePrint.fs: Replace QueueList with CachedDList
+- [x] fsi.fs: Replace QueueList with CachedDList
+- [x] Optimizer.fs: Replace QueueList with CachedDList
+- [x] Symbols.fs: Replace QueueList with CachedDList
+- [x] TOTAL: 89 instances replaced across 11 files
+
+### 4. Build and Test ⚠️ IN PROGRESS
+- [x] Ensure all code builds successfully (`./build.sh -c Release`) - ✅ 0 errors, 0 warnings
+- [x] Run full test suite - ⚠️ 2775 passed, 2221 failed
+- [ ] Fix pickle format compatibility issue (FSharp.Core metadata reading)
+  - Issue: FSharp.Core compiled with old QueueList, tests use new CachedDList
+  - Solution: Clean rebuild of all artifacts
+- [ ] Verify all tests pass
+
+### 5. Performance Validation 📊 NEXT
+- [ ] Clean rebuild compiler with DList changes
+- [ ] Generate 5000 files/5000 modules test project
+- [ ] Run compilation with --times flag
+- [ ] Capture memory usage with /usr/bin/time -v
+- [ ] Compare with baseline:
+  - Baseline: 8:43 total, 11.69 GB, 171s typecheck
+  - Target: ~2-3 min total (4x improvement in typecheck based on benchmarks)
+- [ ] Document results in investigation/dlist_results/
+
+## Expected Outcome
+Based on benchmarks showing V5 (DList Cached) at 4.1x faster:
+- Typecheck phase: 171s → ~40-50s (4x improvement)
+- Total time: 523s → ~200-250s
+- Memory: Should remain similar or improve (1.6x overhead in micro-benchmark)
+
+## Implementation Notes
+- Keep all benchmark code and results (per instructions)
+- DList provides O(1) append for two DLists (key optimization)
+- Lazy cache ensures iteration/foldBack performance
+- Wrapper type provides QueueList-compatible API surface
+- Focus on hot path first: CombineModuleOrNamespaceTypes
+
+## Rollback Plan
+If DList migration causes issues:
+1. Revert to QueueList (all changes localized to utilities + TypedTree*)
+2. Keep benchmark results for future reference
+3. Document lessons learned
@@ -0,0 +1,38 @@
+# Performance Comparison Summary
+
+## Test Configuration
+- 5000 files, 1 module each (same namespace ConsoleApp1)
+- Each module depends on the previous one
+
+## Results
+
+| Metric | Baseline (Stock SDK) | After Changes | Delta |
+|--------|---------------------|---------------|-------|
+| Total Time | 8:43.45 (523s) | 11:27.96 (688s) | +31% SLOWER |
+| Memory | 11.69 GB | 15.01 GB | +28% MORE |
+| Typecheck | 488.50s | N/A | - |
+
+## Analysis
+
+The changes made performance WORSE:
+
+1. **QueueList.AppendOptimized**: The new implementation creates intermediate lists that increase allocations
+2. **foldBack optimization**: Using `List.fold` on reversed tail may not be more efficient than the original
+3. **AllEntitiesByLogicalMangledName caching**: The cache doesn't help because each `CombineCcuContentFragments` call creates a NEW `ModuleOrNamespaceType` object, so the cache is never reused
+
+## Root Cause of Regression
+
+The caching strategy doesn't work because `CombineModuleOrNamespaceTypes` always returns a NEW `ModuleOrNamespaceType` object:
+```fsharp
+ModuleOrNamespaceType(kind, vals, QueueList.ofList entities)
+```
+
+Each new object has its own fresh cache that starts empty. The cache only helps if the SAME object's `AllEntitiesByLogicalMangledName` is accessed multiple times.
+
+## Recommendations
+
+1. **Revert the changes** - they made things worse
+2. **Different approach needed**: Instead of caching, need to:
+   - Avoid creating new objects on every merge
+   - Use persistent/incremental data structures
+   - Or restructure the algorithm to avoid O(n²) iterations
@@ -0,0 +1,119 @@
+# F# Large Project Build Performance Investigation
+
+## Issue Summary
+Building a project with 10,000 F# modules is indeterminately slow due to super-linear (O(n²)) scaling behavior in the compiler.
+
+## Key Findings
+
+### File Count vs Module Count Experiment
+
+To isolate whether the issue is with file count or module count, we tested the same 3000 modules organized differently:
+
+| Experiment | Files | Modules/File | Typecheck Time | Total Time | Memory (MB) |
+|------------|-------|--------------|----------------|------------|-------------|
+| Exp1       | 3000  | 1            | 142.07s        | 163.15s    | 5202 MB     |
+| Exp2       | 1000  | 3            | 30.59s         | 46.36s     | 2037 MB     |
+| Exp3       | 3     | 1000         | 10.41s         | 28.00s     | 1421 MB     |
+| Exp4       | 1     | 3000         | 18.08s         | 36.57s     | 1441 MB     |
+
+**Key observations:**
+- Same 3000 modules: 3000 files takes 142s, 1 file takes 18s = **7.9x slower with more files**
+- Memory: 5202 MB vs 1441 MB = **3.6x more memory with more files**
+- **The issue is clearly correlated with NUMBER OF FILES, not number of modules**
+- Typecheck phase dominates in all cases
+
+### CombineModuleOrNamespaceTypes Instrumentation
+
+Added instrumentation to track the growth of entities processed in `CombineModuleOrNamespaceTypes`:
+
+| Iteration | Path | mty1.entities | mty2.entities | Total Entities Processed | Elapsed (ms) |
+|-----------|------|---------------|---------------|-------------------------|--------------|
+| 1         | root | 0             | 1             | 1                       | 35,000       |
+| 500       | root | 0             | 1             | 28,221                  | 36,400       |
+| 1000      | ConsoleApp1 | 2       | 664           | 112,221                 | 37,600       |
+| 2000      | root | 0             | 1             | 446,221                 | 41,200       |
+| 3000      | root | 1             | 1             | 1,004,000               | 47,300       |
+| 5000      | root | 0             | 1             | 2,782,221               | 69,900       |
+| 7000      | ConsoleApp1 | 2       | 4,664         | 5,452,221               | 109,500      |
+| 9000      | root | 1             | 1             | 8,008,000               | 155,000      |
+| 12000     | ConsoleApp1 | 2       | 3,000         | 11,263,500              | 175,500      |
+| 14500     | ConsoleApp1 | 2       | 5,500         | 16,582,250              | 180,500      |
+
+**Key observations from instrumentation:**
+- 14,500+ total iterations of `CombineModuleOrNamespaceTypes` for 3000 files
+- Total entities processed grows quadratically: ~16.6 million entity operations for 3000 files
+- The `ConsoleApp1` namespace merge handles increasingly large entity counts (up to 5,500 entities per merge)
+- Each file adds 2 new entities (type + module), but the accumulated namespace grows linearly
+
+### Timing Comparison (Stock vs Optimized Compiler)
+
+| File Count | Stock Compiler | Optimized Compiler | Difference |
+|------------|---------------|-------------------|------------|
+| 1000       | 24.0s         | 26.9s             | +12%       |
+| 2000       | 65.0s         | 79.5s             | +22%       |
+| 3000       | 159.8s        | 187.6s            | +17%       |
+
+**Scaling Analysis:**
+| Files | Stock Ratio | Optimized Ratio | Expected (linear) |
+|-------|------------|-----------------|-------------------|
+| 1000  | 1x         | 1x              | 1x                |
+| 2000  | 2.7x       | 2.96x           | 2x                |
+| 3000  | 6.7x       | 6.98x           | 3x                |
+
+Both compilers exhibit O(n²) scaling. The optimization adds overhead without fixing the fundamental issue.
+
+### Phase Breakdown from --times (1000/2000/3000 files)
+
+| Phase              | 1000 files | 2000 files | 3000 files | Growth Rate |
+|--------------------|------------|------------|------------|-------------|
+| **Typecheck**      | 16.75s     | 67.69s     | 171.45s    | O(n²)       |
+| Optimizations      | 2.80s      | 4.96s      | 6.14s      | ~O(n)       |
+| TAST -> IL         | 1.50s      | 2.25s      | 3.16s      | ~O(n)       |
+| Write .NET Binary  | 0.87s      | 1.50s      | 2.35s      | ~O(n)       |
+| Parse inputs       | 0.51s      | 0.61s      | 0.91s      | ~O(n)       |
+
+**The Typecheck phase dominates and exhibits clear O(n²) growth.**
+
+### dotnet-trace Analysis
+Trace file captured at `/tmp/trace1000.nettrace` (25.8MB) and converted to speedscope format.
+Key hot paths in the trace are in type checking and CCU signature combination.
+
+## Root Cause Analysis
+
+### Primary Bottleneck: CombineCcuContentFragments
+The function `CombineCcuContentFragments` in `TypedTreeOps.fs` is called for each file to merge the file's signature into the accumulated CCU signature. 
+
+The algorithm in `CombineModuleOrNamespaceTypes`:
+1. Builds a lookup table from ALL accumulated entities - O(n)
+2. Iterates ALL accumulated entities to check for conflicts - O(n)
+3. Creates a new list of combined entities - O(n)
+
+This is O(n) per file, giving O(n²) total for n files.
+
+### Why This Affects fsharp-10k
+All 10,000 files use `namespace ConsoleApp1`, so:
+- At the TOP level, there's always a conflict (the `ConsoleApp1` namespace entity)
+- The `CombineEntities` function recursively combines the namespace contents
+- INSIDE the namespace, each file adds unique types (Foo1, Foo2, etc.) - no conflicts
+- But the full iteration still happens to check for conflicts
+
+### Attempted Optimization (Reverted)
+Attempted a fast path in `CombineModuleOrNamespaceTypes`:
+- When no entity name conflicts exist, use `QueueList.append` instead of rebuilding
+- **Result: Made performance WORSE** (+12-22% overhead)
+- The overhead from conflict detection exceeded savings from fast path
+- Reverted this change as it was not beneficial
+
+### Required Fix (Future Work)
+A proper fix would require architectural changes:
+1. Restructuring the CCU accumulator to support O(1) entity appends
+2. Using incremental updates instead of full merges  
+3. Potentially caching the `AllEntitiesByLogicalMangledName` map across merges
+4. Or using a different data structure that supports efficient union operations
+5. Consider lazy evaluation of entity lookups
+
+## Reproduction
+Test project: https://github.com/ners/fsharp-10k
+- Each file declares a type `FooN` that depends on `Foo(N-1)`
+- Creates 10,001 source files (including Program.fs)
+- All in same namespace `ConsoleApp1`