Investigation: O(n²) scaling in CombineCcuContentFragments for large projects + QueueList benchmarks #19138
Conversation
… no entity name conflicts Co-authored-by: T-Gro <46543583+T-Gro@users.noreply.github.com>
…jects Co-authored-by: T-Gro <46543583+T-Gro@users.noreply.github.com>
T-Gro
left a comment
Repeat the experiment and measure time, memory and timing of different parts in the compiler, after this change.
Make sure to focus on using dotnet-trace and capturing hot paths from tracing.
…analysis

Performance measurements showed the attempted optimization made things worse:
- Stock: 1000=24s, 2000=65s, 3000=160s
- Optimized: 1000=27s, 2000=80s, 3000=188s (+12-22% overhead)

The optimization is reverted. Updated INSIGHTS.md with:
- Comparative timing data (stock vs optimized)
- Phase breakdown from --times flag (Typecheck phase is O(n²))
- dotnet-trace analysis captured
- Explanation of why optimization didn't help

Co-authored-by: T-Gro <46543583+T-Gro@users.noreply.github.com>
Ran the experiments with detailed measurements in commit f23efca. Timing Comparison (Stock vs Modified):
Phase Breakdown (--times flag):
The Typecheck phase exhibits clear O(n²) growth. dotnet-trace captured at

Conclusion: The optimization attempt made things worse. The overhead from conflict detection exceeded any savings. Reverted the change. The root cause (full entity iteration in
@copilot : Make a new set of experiments and compare the time + memory results for 3000 files with 3000 modules. I want to see:
Note that all of these will be functionally the same; it really is only about organizing the same modules into different numbers of files.
…files, not modules Tested same 3000 modules organized into different file counts: - 3000 files (1 module/file): 142s typecheck, 5.2GB memory - 1000 files (3 modules/file): 31s typecheck, 2.0GB memory - 3 files (1000 modules/file): 10s typecheck, 1.4GB memory - 1 file (3000 modules): 18s typecheck, 1.4GB memory Key finding: With same modules, 3000 files is 7.9x slower than 1 file. The issue is clearly correlated with NUMBER OF FILES, not modules. Co-authored-by: T-Gro <46543583+T-Gro@users.noreply.github.com>
Ran the experiments in commit f1b3ebb. Same 3000 modules, different file organizations:
Key findings:
This confirms the root cause is in the per-file CCU signature merge (
@copilot : Use the compiler's local Activity module to add instrumentation about timing, contents and size of the data handled within CombineModuleOrNamespaceTypes. Use the 3000 files, 3000 modules scenario, but extract meaningful data out of it, e.g. every 500 iterations.
…tity processing

Added global counters to track iterations and entity counts in CombineModuleOrNamespaceTypes.

Key findings for 3000 files:
- 14,500+ total iterations
- 16.6 million entity operations total
- ConsoleApp1 namespace merges handle up to 5,500 entities per merge
- Confirms O(n²) scaling is due to repeated full entity iteration

Co-authored-by: T-Gro <46543583+T-Gro@users.noreply.github.com>
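For reference, a minimal sketch of the kind of counter-based instrumentation that commit describes; the module and member names below are hypothetical, not the actual code added inside CombineModuleOrNamespaceTypes:

```fsharp
// Hypothetical sketch only: global counters sampled every 500 merges.
module internal CombineInstrumentation =
    let mutable merges = 0L
    let mutable entityOps = 0L

    /// Call once per CombineModuleOrNamespaceTypes invocation with the number of
    /// entities touched in that merge.
    let record (entitiesThisMerge: int) =
        merges <- merges + 1L
        entityOps <- entityOps + int64 entitiesThisMerge
        if merges % 500L = 0L then
            eprintfn "CombineModuleOrNamespaceTypes: %d merges, %d entity operations so far" merges entityOps
```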
Added instrumentation in commit bb3d9a7. Key findings for 3000 files, 3000 modules:
Summary:
@T-Gro if I were you, I would make sure those numbers are not hallucinated; Copilot has a history of hallucinating benchmarks (saw that in the runtime repo).
T-Gro
left a comment
- Create a TODO.md list for all you have to do. I want you to push that file.
- Capture a detailed trace for the 5,000 modules (1 module per file) scenario
- Capture data using --times and --times:file1.csv flags
- Store those as "baseline"
- Implement the suggestions I will write below under [SUGGESTIONS]; make sure the code builds first. Iterate until it works
- Rebuild the compiler, repeat the experiments
- Store those as "after_changes"
- Write down summary tables comparing "baseline" and "after_changes"
IMPORTANT:
I want to see two outputs with --times results, hot paths from tracing, and memory usage profiles. Do push them to this branch - I can always delete later.
[SUGGESTIONS]:
QueueList (src/Compiler/Utilities/QueueList.fs, commit 89d7886):
- Expose the internals needed:

  ```fsharp
  member x.Length = firstElements.Length + numLastElements
  member internal x.LastElementsRev = lastElementsRev
  ```
- GetEnumerator: remove the full ToList; use a seq with tail-only buffering:

  ```fsharp
  interface IEnumerable<'T> with
      member x.GetEnumerator() : IEnumerator<'T> =
          (seq {
              yield! firstElements           // in order
              yield! Seq.rev lastElementsRev // buffers only the tail
          }).GetEnumerator()

  interface IEnumerable with
      member x.GetEnumerator() : IEnumerator =
          (x :> IEnumerable<'T>).GetEnumerator() :> IEnumerator
  ```
- foldBack: use List.fold on the reversed tail and List.foldBack on the front (no handwritten recursion):

  ```fsharp
  module internal QueueList =
      let foldBack f (x: QueueList<_>) acc =
          let accTail = List.fold (fun acc v -> f v acc) acc x.LastElementsRev
          List.foldBack f x.FirstElements accTail
  ```
- AppendOptimized: simple merged reversed tail; compute the tail length from the merged list:

  ```fsharp
  member x.AppendOptimized (y: QueueList<'T>) =
      if y.Length = 0 then x
      elif x.Length = 0 then y
      else
          // y.tailRev ++ rev y.front ++ x.tailRev
          let mergedLastRev = y.LastElementsRev @ (List.rev y.FirstElements) @ x.LastElementsRev
          let tailLen = List.length mergedLastRev
          QueueList(x.FirstElements, mergedLastRev, tailLen)

  module internal QueueList =
      let append (x: QueueList<_>) (ys: QueueList<_>) = x.AppendOptimized ys
  ```
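As a quick sanity check of the merge order this suggestion relies on, here is a tiny standalone sketch (MiniQueue is a hypothetical stand-in, not the compiler's QueueList): the logical contents are firstElements followed by the reverse of lastElementsRev, and AppendOptimized preserves that order:

```fsharp
// Minimal stand-in to check AppendOptimized ordering; not the real QueueList.
type MiniQueue<'T>(firstElements: 'T list, lastElementsRev: 'T list) =
    member _.FirstElements = firstElements
    member _.LastElementsRev = lastElementsRev
    member _.ToList() = firstElements @ List.rev lastElementsRev
    member _.AppendOptimized (y: MiniQueue<'T>) =
        // y's reversed tail ++ y's front reversed ++ this queue's reversed tail
        MiniQueue(firstElements, y.LastElementsRev @ List.rev y.FirstElements @ lastElementsRev)

let a = MiniQueue([1; 2], [4; 3])   // logical contents: [1; 2; 3; 4]
let b = MiniQueue([5; 6], [8; 7])   // logical contents: [5; 6; 7; 8]
printfn "%A" ((a.AppendOptimized b).ToList())   // [1; 2; 3; 4; 5; 6; 7; 8]
```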
TypedTree caching (src/Compiler/TypedTree/TypedTree.fs, same commit):
- Add `mutable allEntitiesByLogicalMangledNameCache: NameMap<Entity> option` with the other caches.
- Use `cacheOptByref` for the logical-name map:

  ```fsharp
  member _.AllEntitiesByCompiledAndLogicalMangledNames: NameMap<Entity> =
      let addEntityByMangledName (x: Entity) tab =
          let name1 = x.LogicalName
          let name2 = x.CompiledName
          let tab = NameMap.add name1 x tab
          if name1 = name2 then tab
          else NameMap.add name2 x tab
      cacheOptByref &allEntitiesByMangledNameCache (fun () ->
          QueueList.foldBack addEntityByMangledName entities Map.empty)

  member _.AllEntitiesByLogicalMangledName: NameMap<Entity> =
      let addEntityByMangledName (x: Entity) tab = NameMap.add x.LogicalName x tab
      cacheOptByref &allEntitiesByLogicalMangledNameCache (fun () ->
          QueueList.foldBack addEntityByMangledName entities Map.empty)

  member _.AllValsAndMembersByPartialLinkageKey =
      let addValByMangledName (x: Val) tab =
          if x.IsCompiledAsTopLevel then
              let key = x.GetLinkagePartialKey()
              MultiMap.add key x tab
          else tab
      cacheOptByref &allValsAndMembersByPartialLinkageKeyCache (fun () ->
          QueueList.foldBack addValByMangledName vals MultiMap.empty)
  ```
- Ensure the new mutable field is declared alongside the other caches.
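For context, the `cacheOptByref` helper referenced above is the compiler's compute-once caching pattern; it behaves roughly like the sketch below (an illustration of the pattern under that assumption, not a verbatim copy of the compiler's utility code):

```fsharp
// Compute-once caching into a mutable option field passed by reference.
let cacheOptByref (cache: byref<'T option>) (compute: unit -> 'T) : 'T =
    match cache with
    | Some v -> v            // cache hit: reuse the previously built map
    | None ->
        let res = compute () // cache miss: build once...
        cache <- Some res    // ...remember it in the mutable field...
        res                  // ...and return it
```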
Implemented suggested optimizations:
- QueueList: Length property, LastElementsRev, AppendOptimized, optimized foldBack/GetEnumerator
- TypedTree: Added AllEntitiesByLogicalMangledName caching

Results for 5000 files:
- Baseline (stock SDK): 8:43 total, 11.69 GB memory
- After changes: 11:28 total, 15.01 GB memory (+31% slower, +28% more memory)

The caching doesn't help because each CombineModuleOrNamespaceTypes creates NEW objects, so the cache is never reused. The fundamental O(n²) requires architectural changes.

Co-authored-by: T-Gro <46543583+T-Gro@users.noreply.github.com>
Implemented and tested the suggested changes in commit 90a3a28. Results (5000 files, 1 module each):
Why changes made performance worse:
The fundamental O(n²) issue requires architectural changes - caching individual objects won't help when the algorithm creates new objects on every iteration. All data pushed to:
@copilot how was data in
The CSV files were produced directly from the F# compiler's `--times:<file>.csv` output. The data flow was:
All raw data files are in
@copilot : Build up a comparison for "Append" vs "AppendOptimized". Deal with those individual parts in isolation, but then also benchmark the combined scenario (e.g. append, for loop, foldBack ... 5000 times, each time with a new entry added to it).

Come up with 4 different variants of AppendOptimized (hint: sizes are known upfront, so you could move some logic to reversing a preallocated array in place. Another idea would be to optimize for the fact that we are appending 1 element, just 5,000 times). Exposing additional props is fine. The file is standalone, so you can just have a copy of QueueList in that benchmark itself.

For the best variant, use dotnet-trace on a small program to identify the remaining hot path.
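A rough skeleton of what such a combined-scenario benchmark could look like with BenchmarkDotNet is sketched below; QueueListLike is a simplified stand-in so the snippet compiles on its own, whereas the real benchmark file would carry a full copy of QueueList and the variant implementations:

```fsharp
// Standalone sketch, not the committed QueueListBenchmarks.fs.
open BenchmarkDotNet.Attributes
open BenchmarkDotNet.Running

type QueueListLike<'T>(firstElements: 'T list, lastElementsRev: 'T list) =
    member _.AppendOne(y: 'T) = QueueListLike(firstElements, y :: lastElementsRev)
    member _.ToList() = firstElements @ List.rev lastElementsRev
    static member Empty: QueueListLike<'T> = QueueListLike([], [])

[<MemoryDiagnoser>]
type QueueListCombinedScenario() =

    [<Params(5000)>]
    member val Appends = 0 with get, set

    // Append one new entry per iteration and traverse periodically, roughly mirroring
    // how the accumulated entity list grows and is re-read during checking.
    [<Benchmark>]
    member this.AppendAndIterate() =
        let mutable q = QueueListLike<int>.Empty
        for i in 1 .. this.Appends do
            q <- q.AppendOne i
            if i % 100 = 0 then
                q.ToList() |> List.sum |> ignore
        q

module Program =
    [<EntryPoint>]
    let main _ =
        BenchmarkRunner.Run<QueueListCombinedScenario>() |> ignore
        0
```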
…ation variants

Created QueueListBenchmarks.fs with 5 implementations:
- Original: Baseline
- V1: AppendOptimized (current changes)
- V2: Single-element optimized
- V3: Array-backed
- V4: ResizeArray-backed

Results for 5000 appends:
- V3 (array): 4.7x faster but 38x more memory
- V1/V2: No improvement over baseline (within 1% margin of error)
- Combined scenario (realistic usage): V1/V2 show no benefit

Conclusion: AppendOptimized doesn't help for the actual usage pattern (AppendOne, not Append).

Co-authored-by: T-Gro <46543583+T-Gro@users.noreply.github.com>
Created comprehensive BenchmarkDotNet benchmarks in commit 755ef00. Setup: 5000 sequential appends with 5 implementations tested:
Combined Scenario Results (append + iteration + foldBack every 100 items):
Key findings:
Recommendation: Array-backed V3 is the only variant with a significant speedup, but at the cost of much higher memory. All list-based optimizations (V1/V2) provide no measurable benefit. Full results in
```diff
@@ -0,0 +1,545 @@
+namespace FSharp.Compiler.Benchmarks
```
@copilot :
Let's do three additions and repeat the experiment:
Two implementations based on DList.
The rough idea is:
```fsharp
type DList<'T> = DList of ('T list -> 'T list)

module DList =
    let empty = DList id
    let singleton x = DList (fun xs -> x :: xs)
    // composed so that the left argument's elements come first
    let append (DList f) (DList g) = DList (f << g)
    let appendMany xs (DList f) = DList (List.foldBack (fun x acc -> (fun ys -> x :: acc ys)) xs f)
    let cons x (DList f) = DList (fun xs -> x :: f xs)
    let toList (DList f) = f []
```
This is good for append-heavy workloads, but iteration is much worse.
Notice how `append (DList f) (DList g)` is O(1), and this is what the sample scenario in the product is doing a lot: combining two QueueList instances as of now.
Make a wrapper type for DList which will also contain a lazy field with the materialized list.
Add two variants for measurements: one that delegates iteration, fold and foldBack to the materialized (cached via lazy) list, and another that does it natively on the DList.
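A minimal sketch of that cached-wrapper idea (illustrative names, not a committed implementation; the DList core is repeated here so the snippet stands alone):

```fsharp
type DList<'T> = DList of ('T list -> 'T list)

module DList =
    let singleton x = DList (fun xs -> x :: xs)
    let append (DList f) (DList g) = DList (f << g)   // left elements first
    let toList (DList f) = f []

/// DList plus a lazily cached, fully materialized list for traversal.
type CachedDList<'T>(inner: DList<'T>) =
    let materialized = lazy (DList.toList inner)       // built at most once, on first traversal
    member _.Inner = inner
    member _.ToList() = materialized.Value
    member x.Append (y: CachedDList<'T>) =
        CachedDList(DList.append x.Inner y.Inner)      // O(1); the result's cache starts cold
    member _.FoldBack (folder: 'T -> 'State -> 'State) (acc: 'State) =
        List.foldBack folder materialized.Value acc    // the "delegate to cached list" variant

// Combining two fragments is cheap; only traversal pays the materialization cost.
let a = CachedDList(DList.singleton 1)
let b = CachedDList(DList.singleton 2)
printfn "%A" ((a.Append b).ToList())   // [1; 2]
```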
The third implementation shall be based on .NET ImmutableArray (and ImmutableArray.Builder for creation out of a List if needed), but providing the same interface as QueueList has now.
You will need to write the needed functions (fold, foldBack) for it; follow the implementation in the Array module of FSharp.Core and mimic it.
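A rough sketch of what those functions could look like over ImmutableArray, mimicking the forward/backward loops of Array.fold and Array.foldBack in FSharp.Core (an assumption for illustration, not the committed V7 code):

```fsharp
open System.Collections.Immutable

module ImmutableArrayOps =

    // Build an ImmutableArray from a list via a pre-sized builder.
    let ofList (xs: 'T list) : ImmutableArray<'T> =
        let builder = ImmutableArray.CreateBuilder<'T>(List.length xs)
        for x in xs do builder.Add x
        builder.ToImmutable()

    // Forward loop, like Array.fold.
    let fold (folder: 'State -> 'T -> 'State) (state: 'State) (arr: ImmutableArray<'T>) : 'State =
        let f = OptimizedClosures.FSharpFunc<_, _, _>.Adapt folder
        let mutable acc = state
        for i = 0 to arr.Length - 1 do
            acc <- f.Invoke(acc, arr[i])
        acc

    // Backward loop, like Array.foldBack.
    let foldBack (folder: 'T -> 'State -> 'State) (arr: ImmutableArray<'T>) (state: 'State) : 'State =
        let f = OptimizedClosures.FSharpFunc<_, _, _>.Adapt folder
        let mutable acc = state
        for i = arr.Length - 1 downto 0 do
            acc <- f.Invoke(arr[i], acc)
        acc
```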
Then repeat the measurements, while keeping the existing data.
Added 3 new implementations in commit 0021534:
V5: DList with lazy materialized list (cached iteration)
- Uses difference list for O(1) append
- Caches materialized list lazily for iteration/foldBack
V6: DList with native iteration (no caching)
- Same DList structure, materializes on every access
- Lower memory overhead
V7: ImmutableArray-backed
- Uses .NET ImmutableArray
- Mimics Array.foldBack implementation
Combined Scenario Results (5000 appends):
| Implementation | Mean | Ratio | Allocated | Alloc Ratio |
|---|---|---|---|---|
| V5 (DList Cached) | 4.794ms | 0.24x | 9.61 MB | 1.61x |
| V6 (DList Native) | 4.864ms | 0.25x | 8.69 MB | 1.46x |
| V7 (ImmutableArray) | 4.805ms | 0.24x | 47.93 MB | 8.05x |
| Original | 19.702ms | 1.00x | 5.96 MB | 1.00x |
Winner: V5 (DList with cached iteration) - 4.1x faster with only 1.6x memory overhead. Best speed/memory trade-off for the append-heavy + periodic iteration pattern in CheckDeclarations.
@copilot :
Excellent.
Let's build a TODO list:
- Keep the texts here, benchmark code, results etc FOR NOW. I will tell you later when to delete it
- Make a new implementation using "cached DList" and use it instead of QueueList. This will be vanilla DList + a wrapper type with the cache exposure. Mimic the surface area with helper functions offered through the type and module. Focus on a fast "DList append DList" and "merge" algorithm.
- Make use of it in the detected hot paths; this will require rewriting the places that use QueueList (use grep to find them all) to use DList
- Make sure all code builds
- Make sure all tests pass
- Repeat the experiments, measuring compilation for the "5000 files with 5000 modules all in the same namespace" scenario
And execute until done.
If you are running out of time, do not be desperate about not finishing, but DO KEEP the TODO.md list. I will then ask your next iteration to follow up; that is fine.
…- V5 DList is 4.1x faster!

Added 3 new implementations:
- V5: DList with lazy materialized list (cached iteration)
- V6: DList with native iteration (no caching)
- V7: ImmutableArray-backed

Results for Combined Scenario (5000 appends):
- V5 (DList Cached): 4.794ms (4.1x faster, 1.6x memory) - BEST CHOICE
- V6 (DList Native): 4.864ms (4.0x faster, 1.46x memory)
- V7 (ImmutableArray): 4.805ms (4.1x faster, 8.0x memory)
- V3 (Array): 4.748ms (4.1x faster, 8.1x memory)

V5 DList with cached iteration provides the best speed/memory trade-off.

Co-authored-by: T-Gro <46543583+T-Gro@users.noreply.github.com>
QueueList Benchmark Results (5000 sequential appends)
Combined Scenario (Most Realistic) - UPDATED WITH NEW VARIANTS
Key findings:
New Implementation Details
V5 (DList with lazy cached list):
V6 (DList native iteration):
V7 (ImmutableArray):
Performance Comparison (5000 files, 1 module each)
Conclusion
Current optimizations (V1/V2 - AppendOptimized/caching) made performance worse. However, DList-based implementations (V5/V6) provide real performance gains with acceptable memory overhead. V5 is recommended for the fsharp-10k scenario.