From a73001347a811e72b1d7f4f9f6ee869fa2236e61 Mon Sep 17 00:00:00 2001 From: Michael Simons Date: Mon, 24 Nov 2025 21:05:38 +0000 Subject: [PATCH 1/5] Add 'Eliminate Duplicate SDK Files' one-pager --- .../general/eliminate-duplicate-file.md | 302 ++++++++++++++++++ 1 file changed, 302 insertions(+) create mode 100644 documentation/general/eliminate-duplicate-file.md diff --git a/documentation/general/eliminate-duplicate-file.md b/documentation/general/eliminate-duplicate-file.md new file mode 100644 index 000000000000..9db7cc028137 --- /dev/null +++ b/documentation/general/eliminate-duplicate-file.md @@ -0,0 +1,302 @@ +# Eliminate Duplicate SDK Files + +## Overview + +This proposal focuses on eliminating duplicated files within the .NET SDK to reduce installation size and improve disk footprint. +By removing duplicate assemblies, we can reduce the SDK size by **35% (53 MB compressed, 140 MB on disk)** on Linux x64. + +**Tracking issue:** [dotnet/sdk#41128](https://github.com/dotnet/sdk/issues/41128) + +## Customer Impact: Why SDK Size Matters + +While we often envision the .NET SDK as something installed once on a developer's machine, the reality is that most SDK installations occur in ephemeral, high-volume scenarios where the SDK is repeatedly downloaded and extracted. +SDK size directly affects both network costs (download/wire transfer) and time costs (extraction overhead). + +Containers represent one of the largest and most measurable areas of impact. Official .NET SDK container images are pulled approximately [750,000 times per week](https://msit.powerbi.com/groups/6b5ffb99-5fd3-492b-bd02-724f09fe9eff/reports/7e5d7fef-a86c-4f94-8aa3-d356c3125ee0?ctid=72f988bf-86f1-41af-91ab-2d7cd011db47&pbi_source=linkShare&bookmarkGuid=f44da1fd-c619-4158-aa51-f050b379a2b3). +When developers build within containers—whether for local development or CI/CD pipelines—they're pulling that full SDK image. 
+A 50 MB reduction in compressed size translates to 37.5 TB per week in bandwidth saved from container pulls alone. +Beyond containers, SDK installations also happen at high volume in CI/CD pipelines, cloud development environments, and through various tooling extensions. + +The [Native AOT SDK epic](https://github.com/dotnet/sdk/issues/40931) represents a significant investment in improving performance, but it comes at the cost of increased SDK size. +Without addressing duplication first, Native AOT will exacerbate the size problem and likely make customers in CI and container scenarios unhappy. +This deduplication work gives us the headroom needed to absorb the Native AOT expansion while still delivering a better overall experience. + +## By the Numbers + +A duplicate file analysis was performed on the SDK layout of .NET 10.0.100, where file uniqueness is defined by *name*, *TFM* (Target Framework Moniker), and *culture*. +The following data is the result of this analysis using the [SdkLayoutAnalyzer](https://github.com/MichaelSimons/SdkLayoutAnalyzer) tool. + +The goal is for the vast mojority if not all of the components within the SDK to depend on and use the same version of their dependencies. + +**Note:** The baseline measurements in the following tables compare the complete .NET installation (including runtimes, packs, shared frameworks, host, etc.) versus just the SDK directory contents. 
+ +### Linux x64 + +| Metric | Baseline 10.0 (dotnet / SDK only) | Detected SDK Duplicates | % Duplicates (dotnet / SDK only) | +|--------------|-----------------------------------|-------------------------|----------------------------------| +| Archive Size | 229 MB / 150 MB | 53 MB | 23.1% / 35.3% | +| Disk Size | 623 MB / 402 MB | 140 MB | 22.5% / 34.8% | +| File Count | 4,915 / 3,667 | 816 files | 16.6% / 22.2% | + +### Windows x64 + +| Metric | Baseline 10.0 (dotnet / SDK only) | Detected SDK Duplicates | % Duplicates (dotnet / SDK only) | +|--------------|-----------------------------------|-------------------------|----------------------------------| +| Disk Size | 789 MB / 411 MB | 148 MB | 18.8% / 36.0% | +| File Count | 5,610 / 3,791 | 908 files | 16.2% / 24.0% | + +**Note:** The WindowsDestop runtime and target pack, along with .NET Framework support are the major reasons for the differences between Windows and Linux. + +### Download impact of removing duplicates + +The following table shows download times for the .NET Linux x64 archive at various network speeds, comparing the baseline 229 MB archive against the optimized 176 MB archive (53 MB reduction). +These times represent pure wire transfer costs and do not include connection overhead such as DNS resolution, TLS handshake, or other protocol negotiations. + +| Network Speed | Before (229 MB) | After (176 MB) | Time Saved | +| -------------------- | --------------: | -------------: | ---------: | +| 100 Mbps | 18.32s | 14.08s | 4.24s | +| 500 Mbps | 3.66s | 2.82s | 0.85s | +| 1 Gbps (1000 Mbps) | 1.83s | 1.41s | 0.42s | +| 10 Gbps (10,000 Mbps)| 0.18s | 0.14s | 0.04s | + +The deduplicated archive downloads **23% faster** on average across all network speeds. + +### Extraction impact of removing duplicates + +The following table shows .NET archive extraction times measured on a Linux x64 development environment with SSD storage. 
+ +| Metric | Before (228.05 MB) | After (175.73 MB) | Time Saved | +| ------- | -----------------: | ----------------: | ---------: | +| Mean | 3.81s | 2.91s | 0.90s | +| Median | 3.78s | 2.92s | 0.86s | +| Min | 3.77s | 2.88s | 0.89s | +| Max | 3.92s | 2.96s | 0.96s | + +The deduplicated archive extracts **23.5% faster** on average, saving approximately 0.90 seconds per extraction. + +### Duplicate categorization (relative to lowest version file to keep) + +- Duplicates with same hash as file to keep: 663 (100.3 MB) +- Duplicates with different version: 40 (5.9 MB) +- Duplicates with same version but different hash: 113 (33.8 MB) + - Of which, same version but different arch: 89 (31.4 MB) + +### Top 10 Largest Duplicates + +| Filename | Culture | TFM | Duplicate Count | Duplicate Size (MB) | +|-------------------------------------------|---------|--------------- |-----------------|---------------------| +| Microsoft.CodeAnalysis.CSharp.dll | neutral | net9.0 | 3 | 24.7 | +| Microsoft.CodeAnalysis.dll | neutral | net9.0 | 3 | 10.8 | +| Microsoft.CodeAnalysis.Features.dll | neutral | net9.0 | 2 | 5.3 | +| Newtonsoft.Json.dll | neutral | net6.0 | 7 | 5.2 | +| Microsoft.CodeAnalysis.VisualBasic.dll | neutral | net9.0 | 2 | 4.6 | +| Microsoft.CodeAnalysis.Workspaces.dll | neutral | net9.0 | 2 | 4.0 | +| Microsoft.CodeAnalysis.Razor.Compiler.dll | neutral | netstandard2.0 | 2 | 3.6 | +| Microsoft.Build.Tasks.Core.dll | neutral | net10.0 | 2 | 2.6 | +| Microsoft.Build.dll | neutral | net10.0 | 2 | 2.5 | +| System.Diagnostics.EventLog.Messages.dll | neutral | netstandard2.0 | 4 | 2.3 | +| **Total** | | | **29** | **65.6** | + +### Trends + +An analysis of the current in-support versions of .NET (Linux x64) illustrates that this problem is not trending towards a desirable end state. +Notice the 33% increase in duplicate file size in 10.0 compared to 9.0. 
+ +| .NET Version | SDK Only File Count | SDK Only Disk Size | # Duplicate Files | Duplicate Size | +|--------------|---------------------|--------------------|-------------------|----------------| +| 8.0 | 3499 | 394 MB | 772 | 114 MB | +| 9.0 | 3619 | 380 MB | 811 | 107 MB | +| 10.0 | 3667 | 402 MB | 816 | 140 MB | + +#### Top 4 Duplicate File Size Increases in 10.0 + +| File Name | Source | Change | Size Increase (MB) | +|---------------------------------------|------------------------------|----------|--------------------| +| Microsoft.CodeAnalysis.CSharp.dll | sdk/ | New Copy | 18.2 | +| Microsoft.CodeAnalysis.dll | sdk/ | New Copy | 7.9 | +| Microsoft.CodeAnalysis.Features.dll | sdk/DotnetTools/dotnet-watch | New Copy | 5.3 | +| Microsoft.CodeAnalysis.Workspaces.dll | sdk/DotnetTools/dotnet-watch | New Copy | 4.0 | +| **Total** | | | **35.4** | + +## Proposed Approach + +The overall direction of this effort is to eliminate the vast majority of duplicate assemblies within the .NET SDK so that each shared dependency is carried only once. +There may be a few special cases where different versions, etc. need to be retained. +Achieving this requires solving two distinct but related problems. +First, from a runtime and execution perspective, SDK components must be able to reliably load a single shared copy of each assembly from a common location. +Second, from a build and production perspective, SDK components would ideally be compiled against the same shared set of assemblies, and the SDK layout and packaging process must ensure that only one copy of each shared assembly is included in the final SDK distribution. +The proposed approach addresses both sides of this problem by defining a shared assembly location, enabling components to load from it, and restructuring the SDK build to populate and enforce this unified dependency model. 
+ +### Define a Common Assembly Location + +A well-defined common assembly location is essential for shared assemblies to be loaded by SDK components, particularly out-of-process components like global tools. All SDK components should depend on a single version of shared dependencies as much as possible. + +The root SDK directory is a natural location for this common assembly cache. However, the root SDK folder is already quite large, and consolidating shared assemblies there would add over 100 files to the root directory. This level of clutter doesn't seem desirable and would negatively impact readability. + +A dedicated subdirectory for shared assemblies is preferred. This approach keeps the root directory clean and makes it clear which assemblies are part of the shared cache. Since we're centralizing on a single version of each shared dependency, there's no need for version-specific subdirectories within the common location. + +**Directory Structure Considerations:** + +While version folders aren't needed, we must account for framework versus core components. Some framework-specific assemblies are shared and would need to be placed in a subdirectory to distinguish them from core assemblies (e.g., `shared/net472/` for framework components, with core assemblies directly in `shared/`). + +**Possible names for the common assembly location:** + +- `shared` — simple, clear, and consistent with .NET conventions (like the existing `shared` directory for runtimes). +- `common` — widely recognized across ecosystems for shared dependencies and utilities. +- `dependencies` / `deps` — descriptive and clear. +- `libs` — short and familiar in many build systems. + +#### Side Effects of a Common Assembly Location + +Consolidating assemblies into a common location introduces potential side effects that must be carefully considered. +Discussions with compiler experts have noted that the compiler toolchain is particularly sensitive to assembly availability and resolution paths. 
+The presence of assemblies in a shared location can affect behavior. + +Beyond the compiler, other SDK components may have similar sensitivities to assembly placement and availability. +These areas require careful analysis to identify potential behavioral changes when moving to a shared assembly model. +Further ivestigation in these areas will be required. + +### Load from the Common Assembly Location + +With the introduction of a formalized common assembly location, SDK components must be able to load assemblies from it. +The approach varies based on the type of component: + +**Out-of-Process Components:** Components that run in their own process, such as global tools, will use **AssemblyLoadContext** to load shared assemblies from the common location. +This approach is already used today by [dotnet-watch](https://github.com/dotnet/sdk/blob/26bbbd92e5a3cc58037e696147fa25e03e68e3a8/src/BuiltInTools/dotnet-watch/Program.cs#L288). + +**In-Process Framework Components:** These components are hosted within the SDK's framework context and define the assembly resolution paths. + +**In-Process Core Components:** These components are hosted within the SDK's core context and define the assembly resolution paths. + +**Performance:** Performance is a concern that must be validated. +We need to ensure that any changes to assembly loading do not regress performance in any way. + +### SDK Layout Adjustments + +The SDK layout needs to be updated to: + +1. Populate the common assembly location. +1. Exclude common assemblies from individual components. + +The strategy for achieving this varies based on the type of component: + +**Global Tools:** Global tools have special layout targets today. +A good strategy would be for these targets to list out the assemblies to keep in the tool's directory, and the rest would be added to a shared assemblies list that gets copied to the shared assembly location. 
+ +**In-Process Components:** Other components that load in the same process would just need to ensure that the shared assemblies are defined and copied to the shared assembly location. +This can be done via various copying logic or using the `ExcludeAssets` mechanism. +The approach will vary based on the component type, but whenever possible we want to use a systematic approach. + +**Version Control:** It's important to define which component controls what version is placed in the shared assembly cache. +This is critical for resolving version conflicts. +Higher version references may cause load failures, while lower version references could trigger security alerts. +Compile-time checks can be added to enforce the desired version rules. The VMR has helped reduce version conflict occurrences. + +The upcoming [NuGet Vision 2027 work](https://microsoft-my.sharepoint.com/:w:/r/personal/aortiz_microsoft_com/Documents/NuGet%20Vision%202027.docx?d=w9f413c3dc36a4e7d887fe007071e10c0&csf=1&web=1&e=MD76Wr&nav=eyJoIjoiMjE2NDM5NDIwIn0) for "better supporting apps running in hosted environments or with specific composition/deployment patterns" will help facilitate version conflict resolution in this area. This work will introduce the capability to declare dependencies as "provided by the hosting environment," allowing components to use the host environment's version rather than carrying their own copy. This approach will make it much easier to eliminate version differences and consolidate on shared dependency versions within the SDK. + +Similar patterns will be used for architecture differences as discussed in the [Architecture Differences](#architecture-differences) section. + +### Testing Strategy + +Once duplicated assemblies are removed, a regression test should: + +- Detect duplicate files in the SDK layout. +- Fail the build if duplicates are found. +- Prevent regressions. 
+ +## Proof of Concept Results + +A proof of concept was implemented for `dotnet-watch` and `dotnet-format`, two of the largest sources of duplication. +The POC used the **AssemblyLoadContext** approach with a shared assembly location and yielded the following results: + +| Metric | Baseline 11.0 (SDK only) | Size Reduction | % Reduction | +|--------------|--------------------------|----------------|-------------| +| Archive Size | 100 MB | 21 MB | 21.0% | +| Disk Size | 296 MB | 62 MB | 20.8% | +| File Count | 3,957 | 384 files | 9.7% | + +**Note:** These numbers are from a Linux development build, which differs significantly from official signed/optimized builds included in the [By the Numbers](#by-the-numbers) section. + +## Other Concerns + +### Architecture Differences + +As noted in the [duplicate categorization](#linux-x64), a portion of duplicates with the same version but different hash are due to architecture differences (AnyCPU vs x64). +The plan is to eliminate these differences by standardizing on CPU-specific versions. + +Initial analysis indicates these differences stem from AnyCPU builds coexisting with CPU-specific builds of the same assembly. +The general approach will be to prefer the CPU-specific version over the AnyCPU version when eliminating these duplicates, as CPU-specific builds can offer better performance characteristics for the target platform. + +### Non-Assembly Duplicates + +The data above only covers assemblies. +There are also duplicated non-assembly files such as `msdia140.dll.manifest`, `Microsoft.TestPlatform.targets`, `Microsoft.TemplateEngine.Cli.xml`, and `dotnet.runtimeconfig.json`. + +- Duplicate non-assembly files (same hash): 65 +- Total size of duplicates: 0.45 MB +- Largest duplicated file: `Microsoft.TemplateEngine.Cli.xml` — 0.18 MB + +Non-assembly files are not directly targeted in this work. +They will be removed when the cost is low; otherwise, they remain out of scope due to limited ROI. 
+ +### Different TFM Duplicates + +Beyond files duplicated with the same TFM (the primary focus of this work), some files with the same name and culture target different TFMs. +Eliminating these duplicates is outside the scope of the planned work. +Once same-TFM duplicates are addressed, additional analysis can be performed to evaluate the ROI of consolidating files across different TFMs. + +The analysis categorizes these groups as follows: + +**Linux x64 (10.0.100):** + +Group Categorization: +- Groups differing by Core vs FX: 228 +- Groups with different FX versions: 3 +- Groups with different Core versions: 1 +- Groups with multiple NetStandard versions: 0 +- Groups with NetStandard + NetFx: 5 +- Groups with NetStandard + Core: 30 + +Potential Savings (if duplicates were eliminated): +- Different FX versions (keep lowest): 1.9 MB +- Different Core versions (keep lowest): 0.1 MB +- NetStandard + NetFx (keep NetStandard): 0.9 MB +- NetStandard + Core (keep NetStandard): 5.9 MB + +Total potential savings: 8.7 MB + +**Key Observations:** + +- **Core vs FX differences** are generally expected and necessary to support both frameworks. +- **Different Core versions** (e.g., net8.0 and net10.0) represent potential consolidation opportunities where the SDK could standardize on newer TFMs. +- **Different FX versions** (e.g., net472 and net48) may be necessary for backward compatibility but should be reviewed. +- **NetStandard combinations** indicate multi-targeting strategies that may be optimizable in some cases. + +These groups are not mutually exclusive—a single file group can appear in multiple categories. +For example, a file with `netstandard2.0`, `net472`, and `net8.0` would be counted in both "NetStandard + NetFx" and "NetStandard + Core" categories. + +Addressing different TFM duplicates requires different strategies than same-TFM duplicates and may involve API surface area analysis and compatibility considerations. 
+ +### Mixed RID Content + +The SDK currently ships with some content placement issues related to Runtime Identifier (RID) specificity that contribute to unnecessary bloat: + +#### RID-Specific Content in Inappropriate RIDs +In some cases, we ship RID-specific content to runtime identifiers where it's not applicable. This represents content that should be trimmed out entirely as it serves no purpose on the target platform. + +**Example:** [dotnet/sdk#51743](https://github.com/dotnet/sdk/issues/51743) - Windows-specific assemblies shipped in Linux distributions. + +#### Cross-Platform Support Content +In other cases, we ship content to support cross-platform development scenarios—for example, Windows-specific assemblies included in Linux SDKs to enable cross-compilation or multi-targeting scenarios. While this content does serve a purpose, it should be analyzed case-by-case to determine whether it should: +- Ship in-box as part of the core SDK experience +- Be available as optional packages that can be dynamically acquired when needed + +**Example:** [dotnet/sdk#51835](https://github.com/dotnet/sdk/issues/51835) - Cross-platform development tooling dependencies. + +#### Scope and Next Steps +Both of these content placement issues are outside the scope of the duplicate elimination work. However, they will likely be surfaced and made more visible as part of this effort. When identified, independent issues will be logged to address these concerns separately. 
+ +## Related + +- [Visual Studio de-duplication effort](https://microsoft.sharepoint.com/:w:/s/b3f10b15-fb59-4650-957a-2c632aa943ba/IQBR2aXv7jC8RatIyxJJNPCeAQxjDmLii-R65o0yUvOJatk?e=IcnsAG) +- [NuGet Vision 2027 - Better supporting apps running in hosted environments or with specific composition/deployment patterns](https://microsoft-my.sharepoint.com/:w:/r/personal/aortiz_microsoft_com/Documents/NuGet%20Vision%202027.docx?d=w9f413c3dc36a4e7d887fe007071e10c0&csf=1&web=1&e=MD76Wr&nav=eyJoIjoiMjE2NDM5NDIwIn0) From ded43e2b686d9097fc016985c78b6f6cff4a79c5 Mon Sep 17 00:00:00 2001 From: Michael Simons Date: Mon, 24 Nov 2025 15:09:24 -0600 Subject: [PATCH 2/5] Update documentation/general/eliminate-duplicate-file.md Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com> --- documentation/general/eliminate-duplicate-file.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/documentation/general/eliminate-duplicate-file.md b/documentation/general/eliminate-duplicate-file.md index 9db7cc028137..3c22f39aa31f 100644 --- a/documentation/general/eliminate-duplicate-file.md +++ b/documentation/general/eliminate-duplicate-file.md @@ -26,7 +26,7 @@ This deduplication work gives us the headroom needed to absorb the Native AOT ex A duplicate file analysis was performed on the SDK layout of .NET 10.0.100, where file uniqueness is defined by *name*, *TFM* (Target Framework Moniker), and *culture*. The following data is the result of this analysis using the [SdkLayoutAnalyzer](https://github.com/MichaelSimons/SdkLayoutAnalyzer) tool. -The goal is for the vast mojority if not all of the components within the SDK to depend on and use the same version of their dependencies. +The goal is for the vast majority if not all of the components within the SDK to depend on and use the same version of their dependencies. 
**Note:** The baseline measurements in the following tables compare the complete .NET installation (including runtimes, packs, shared frameworks, host, etc.) versus just the SDK directory contents. From d167fef729cf4370fe0639f2657d202aaada8446 Mon Sep 17 00:00:00 2001 From: Michael Simons Date: Mon, 24 Nov 2025 15:09:34 -0600 Subject: [PATCH 3/5] Update documentation/general/eliminate-duplicate-file.md Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com> --- documentation/general/eliminate-duplicate-file.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/documentation/general/eliminate-duplicate-file.md b/documentation/general/eliminate-duplicate-file.md index 3c22f39aa31f..43627c12ebfe 100644 --- a/documentation/general/eliminate-duplicate-file.md +++ b/documentation/general/eliminate-duplicate-file.md @@ -45,7 +45,7 @@ The goal is for the vast majority if not all of the components within the SDK to | Disk Size | 789 MB / 411 MB | 148 MB | 18.8% / 36.0% | | File Count | 5,610 / 3,791 | 908 files | 16.2% / 24.0% | -**Note:** The WindowsDestop runtime and target pack, along with .NET Framework support are the major reasons for the differences between Windows and Linux. +**Note:** The WindowsDesktop runtime and target pack, along with .NET Framework support are the major reasons for the differences between Windows and Linux. 
### Download impact of removing duplicates From 92078ebd2e09dd920df0d0b0fa3e5ab5b0d79cf3 Mon Sep 17 00:00:00 2001 From: Michael Simons Date: Mon, 24 Nov 2025 15:09:41 -0600 Subject: [PATCH 4/5] Update documentation/general/eliminate-duplicate-file.md Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com> --- documentation/general/eliminate-duplicate-file.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/documentation/general/eliminate-duplicate-file.md b/documentation/general/eliminate-duplicate-file.md index 43627c12ebfe..512825cdf3dd 100644 --- a/documentation/general/eliminate-duplicate-file.md +++ b/documentation/general/eliminate-duplicate-file.md @@ -154,7 +154,7 @@ The presence of assemblies in a shared location can affect behavior. Beyond the compiler, other SDK components may have similar sensitivities to assembly placement and availability. These areas require careful analysis to identify potential behavioral changes when moving to a shared assembly model. -Further ivestigation in these areas will be required. +Further investigation in these areas will be required. ### Load from the Common Assembly Location From 2597f2aaedc9ab0e30e6efc57b2731e4be8180be Mon Sep 17 00:00:00 2001 From: Michael Simons Date: Tue, 9 Dec 2025 20:16:12 +0000 Subject: [PATCH 5/5] Pivot approach to utilize hard links --- .../general/eliminate-duplicate-file.md | 277 +++++++++++++----- 1 file changed, 204 insertions(+), 73 deletions(-) diff --git a/documentation/general/eliminate-duplicate-file.md b/documentation/general/eliminate-duplicate-file.md index 512825cdf3dd..a4d236e2c293 100644 --- a/documentation/general/eliminate-duplicate-file.md +++ b/documentation/general/eliminate-duplicate-file.md @@ -3,7 +3,7 @@ ## Overview This proposal focuses on eliminating duplicated files within the .NET SDK to reduce installation size and improve disk footprint. 
-By removing duplicate assemblies, we can reduce the SDK size by **35% (53 MB compressed, 140 MB on disk)** on Linux x64. +By removing duplicate assemblies, we can reduce the SDK size by 35% (53 MB compressed, 140 MB on disk) on Linux x64. **Tracking issue:** [dotnet/sdk#41128](https://github.com/dotnet/sdk/issues/41128) @@ -12,13 +12,14 @@ By removing duplicate assemblies, we can reduce the SDK size by **35% (53 MB com While we often envision the .NET SDK as something installed once on a developer's machine, the reality is that most SDK installations occur in ephemeral, high-volume scenarios where the SDK is repeatedly downloaded and extracted. SDK size directly affects both network costs (download/wire transfer) and time costs (extraction overhead). -Containers represent one of the largest and most measurable areas of impact. Official .NET SDK container images are pulled approximately [750,000 times per week](https://msit.powerbi.com/groups/6b5ffb99-5fd3-492b-bd02-724f09fe9eff/reports/7e5d7fef-a86c-4f94-8aa3-d356c3125ee0?ctid=72f988bf-86f1-41af-91ab-2d7cd011db47&pbi_source=linkShare&bookmarkGuid=f44da1fd-c619-4158-aa51-f050b379a2b3). -When developers build within containers—whether for local development or CI/CD pipelines—they're pulling that full SDK image. -A 50 MB reduction in compressed size translates to 37.5 TB per week in bandwidth saved from container pulls alone. +Containers represent one of the largest and most measurable areas of impact. +Official .NET SDK container images are pulled approximately [750,000 times per week](https://msit.powerbi.com/groups/6b5ffb99-5fd3-492b-bd02-724f09fe9eff/reports/7e5d7fef-a86c-4f94-8aa3-d356c3125ee0?ctid=72f988bf-86f1-41af-91ab-2d7cd011db47&pbi_source=linkShare&bookmarkGuid=f44da1fd-c619-4158-aa51-f050b379a2b3). +When developers build within containers, whether for local development or CI/CD pipelines, they're pulling that full SDK image. 
+A 50 MB reduction in compressed size translates to 37.5 TB per week in bandwidth saved from official .NET container pulls alone. Beyond containers, SDK installations also happen at high volume in CI/CD pipelines, cloud development environments, and through various tooling extensions. The [Native AOT SDK epic](https://github.com/dotnet/sdk/issues/40931) represents a significant investment in improving performance, but it comes at the cost of increased SDK size. -Without addressing duplication first, Native AOT will exacerbate the size problem and likely make customers in CI and container scenarios unhappy. +Without addressing duplication, Native AOT will exacerbate the size problem and likely make customers in CI and container scenarios unhappy. This deduplication work gives us the headroom needed to absorb the Native AOT expansion while still delivering a better overall experience. ## By the Numbers @@ -28,7 +29,10 @@ The following data is the result of this analysis using the [SdkLayoutAnalyzer]( The goal is for the vast majority if not all of the components within the SDK to depend on and use the same version of their dependencies. -**Note:** The baseline measurements in the following tables compare the complete .NET installation (including runtimes, packs, shared frameworks, host, etc.) versus just the SDK directory contents. +**Notes:** +1. The baseline measurements in the following tables compare the complete .NET installation (including runtimes, packs, shared frameworks, host, etc.) versus just the SDK directory contents. +1. Standard archive formats (zip and tar) do not automatically deduplicate identical files; each file is stored independently, even when multiple files have identical content. +This means that duplicate files consume space both on disk after extraction and within the compressed archive itself. 
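The note about archives can be checked with a small experiment. The following is a minimal sketch, not part of the SDK tooling: it builds two in-memory zip archives using Python's standard `zipfile` module and compares their sizes. The entry names and the 256 KiB pseudo-random payload are hypothetical stand-ins for a duplicated assembly.

```python
import io
import random
import zipfile


def zip_size(entries: dict[str, bytes]) -> int:
    """Return the size in bytes of an in-memory zip holding the given entries."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w", compression=zipfile.ZIP_DEFLATED) as archive:
        for name, data in entries.items():
            # Each entry is compressed on its own; zip has no cross-entry sharing.
            archive.writestr(name, data)
    return len(buffer.getvalue())


# Hypothetical stand-in for an assembly: 256 KiB of pseudo-random bytes
# (seeded so the experiment is repeatable).
payload = random.Random(0).randbytes(256 * 1024)

single = zip_size({"sdk/Example.dll": payload})
duplicated = zip_size({
    "sdk/Example.dll": payload,
    "sdk/DotnetTools/tool/Example.dll": payload,
})

print(f"one copy: {single} bytes, two copies: {duplicated} bytes")
```

Under these assumptions the second archive comes out at roughly twice the size of the first, which is the behavior the note describes: the second copy is paid for in full even though its content is identical.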
### Linux x64 @@ -59,7 +63,7 @@ These times represent pure wire transfer costs and do not include connection ove | 1 Gbps (1000 Mbps) | 1.83s | 1.41s | 0.42s | | 10 Gbps (10,000 Mbps)| 0.18s | 0.14s | 0.04s | -The deduplicated archive downloads **23% faster** on average across all network speeds. +The deduplicated archive downloads 23% faster on average across all network speeds. ### Extraction impact of removing duplicates @@ -72,7 +76,7 @@ The following table shows .NET archive extraction times measured on a Linux x64 | Min | 3.77s | 2.88s | 0.89s | | Max | 3.92s | 2.96s | 0.96s | -The deduplicated archive extracts **23.5% faster** on average, saving approximately 0.90 seconds per extraction. +The deduplicated archive extracts 23.5% faster on average, saving approximately 0.90 seconds per extraction. ### Duplicate categorization (relative to lowest version file to keep) @@ -120,102 +124,120 @@ Notice the 33% increase in duplicate file size in 10.0 compared to 9.0. ## Proposed Approach -The overall direction of this effort is to eliminate the vast majority of duplicate assemblies within the .NET SDK so that each shared dependency is carried only once. -There may be a few special cases where different versions, etc. need to be retained. -Achieving this requires solving two distinct but related problems. -First, from a runtime and execution perspective, SDK components must be able to reliably load a single shared copy of each assembly from a common location. -Second, from a build and production perspective, SDK components would ideally be compiled against the same shared set of assemblies, and the SDK layout and packaging process must ensure that only one copy of each shared assembly is included in the final SDK distribution. -The proposed approach addresses both sides of this problem by defining a shared assembly location, enabling components to load from it, and restructuring the SDK build to populate and enforce this unified dependency model. 
+The proposed approach leverages hard links to eliminate duplicate files within the SDK layout, reducing both disk footprint and archive size without requiring changes to component assembly loading logic or directory structures. -### Define a Common Assembly Location +### Hard Link Deduplication -A well-defined common assembly location is essential for shared assemblies to be loaded by SDK components, particularly out-of-process components like global tools. All SDK components should depend on a single version of shared dependencies as much as possible. +Hard links provide a filesystem-level solution to file deduplication by allowing multiple directory entries to reference the same physical file on disk. +When files with identical content exist in multiple locations within the SDK layout, we can replace duplicates with hard links to a single physical copy. -The root SDK directory is a natural location for this common assembly cache. However, the root SDK folder is already quite large, and consolidating shared assemblies there would add over 100 files to the root directory. This level of clutter doesn't seem desirable and would negatively impact readability. +**Key Benefits:** +- **No runtime changes required** — Components continue to reference files using their existing paths. Hard links are transparent to applications. +- **Immediate disk savings** — Multiple directory entries pointing to the same inode reduce disk usage by the size of duplicate files. +- **Tarball compatibility** — Tar format natively supports hard links, preserving space savings in compressed archives. +- **Simple implementation** — Identification and replacement can be implemented entirely during the layout generation phase. -A dedicated subdirectory for shared assemblies is preferred. This approach keeps the root directory clean and makes it clear which assemblies are part of the shared cache. 
Since we're centralizing on a single version of each shared dependency, there's no need for version-specific subdirectories within the common location. +### Implementation Strategy -**Directory Structure Considerations:** +Files will be identified as duplicates by comparing SHA256 hashes of their content. +When multiple files within the SDK layout share identical hashes, all copies except one will be replaced with hard links pointing to a single retained instance. -While version folders aren't needed, we must account for framework versus core components. Some framework-specific assemblies are shared and would need to be placed in a subdirectory to distinguish them from core assemblies (e.g., `shared/net472/` for framework components, with core assemblies directly in `shared/`). +Creating hard links is trivial now that the .NET runtime has added native support for creating hard links across all supported platforms in .NET 11. +The SDK build infrastructure can leverage `File.CreateHardLink` APIs during layout generation to replace duplicate files with hard links. -**Possible names for the common assembly location:** +Archive formats like tar.gz natively support hard links, preserving the space savings when the SDK is distributed in compressed form. -- `shared` — simple, clear, and consistent with .NET conventions (like the existing `shared` directory for runtimes). -- `common` — widely recognized across ecosystems for shared dependencies and utilities. -- `dependencies` / `deps` — descriptive and clear. -- `libs` — short and familiar in many build systems. +### Windows Archive Format Considerations -#### Side Effects of a Common Assembly Location +Zip archives do not support hard links. +To solve this problem, we will add tarball support for Windows distributions. -Consolidating assemblies into a common location introduces potential side effects that must be carefully considered. 
-Discussions with compiler experts have noted that the compiler toolchain is particularly sensitive to assembly availability and resolution paths. -The presence of assemblies in a shared location can affect behavior. +Windows 11 and recent versions of Windows 10 include native support for extracting tar.gz archives through Windows Explorer and the `tar` command-line utility. +This native support makes tarballs a viable distribution format for Windows users without requiring third-party tools. -Beyond the compiler, other SDK components may have similar sensitivities to assembly placement and availability. -These areas require careful analysis to identify potential behavioral changes when moving to a shared assembly model. -Further investigation in these areas will be required. +### Windows Tarball Adoption Plan -### Load from the Common Assembly Location +The transition to tarball-based distributions for Windows will be implemented in phases: -With the introduction of a formalized common assembly location, SDK components must be able to load assemblies from it. -The approach varies based on the type of component: +**Phase 1: Enable Hard Links in SDK Layout** -**Out-of-Process Components:** Components that run in their own process, such as global tools, will use **AssemblyLoadContext** to load shared assemblies from the common location. -This approach is already used today by [dotnet-watch](https://github.com/dotnet/sdk/blob/26bbbd92e5a3cc58037e696147fa25e03e68e3a8/src/BuiltInTools/dotnet-watch/Program.cs#L288). +In the initial preview release, hard links will be added to the SDK layout during the build process. +This change enables space savings in tarball distributions immediately. +Tarballs created from this layout will preserve hard links and deliver the full size reduction benefits. +The existing zip archives will continue to be published with duplicates during this phase. 
-**In-Process Framework Components:** These components are hosted within the SDK's framework context and define the assembly resolution paths. +**Phase 2: Migrate Acquisition Channels** -**In-Process Core Components:** These components are hosted within the SDK's core context and define the assembly resolution paths. +In subsequent preview releases, we will migrate the acquisition channels under our control to use Windows tarballs: +- **dotnet-install scripts** ([dotnet/install-scripts](https://github.com/dotnet/install-scripts)) — Update to prefer tarballs on Windows +- **Azure DevOps .NET install task** ([microsoft/azure-pipelines-tasks](https://github.com/microsoft/azure-pipelines-tasks)) — Add tarball support for Windows agents +- **GitHub Actions setup-dotnet** ([actions/setup-dotnet](https://github.com/actions/setup-dotnet)) — Add tarball extraction support for Windows +- **Official .NET Docker images** ([dotnet/dotnet-docker](https://github.com/dotnet/dotnet-docker)) — Migrate Windows containers to use tarballs -**Performance:** Performance is a concern that must be validated. -We need to ensure that any changes to assembly loading do not regress performance in any way. +This phased migration allows us to validate the tarball experience with our own tooling before encouraging broader adoption, and enables us to gather telemetry data on zip usage to inform the decisions about retaining zip archive support. -### SDK Layout Adjustments +### Retaining Zip Archive Support -The SDK layout needs to be updated to: +Two key factors impact the decision of whether to retain zip archive support: -1. Populate the common assembly location. -1. Exclude common assemblies from individual components. +**1. Support Matrix Considerations:** -The strategy for achieving this varies based on the type of component: +The .NET 11 support matrix has not yet been finalized. 
+The [.NET 10 support matrix](https://github.com/dotnet/core/blob/main/release-notes/10.0/supported-os.md) includes three Windows versions that lack native tarball support: Windows Server 2016, Windows Server 2012 R2, and Windows Server 2012. +These versions appear to reach EOL before .NET 11, meaning they are candidates for removal from the support matrix, pending their Windows extended support policies. +If these versions remain in the .NET 11 support matrix, we may need to retain zip support to provide a first-class installation experience for those platforms. -**Global Tools:** Global tools have special layout targets today. -A good strategy would be for these targets to list out the assemblies to keep in the tool's directory, and the rest would be added to a shared assemblies list that gets copied to the shared assembly location. +**2. Existing Zip Usage:** -**In-Process Components:** Other components that load in the same process would just need to ensure that the shared assemblies are defined and copied to the shared assembly location. -This can be done via various copying logic or using the `ExcludeAssets` mechanism. -The approach will vary based on the component type, but whenever possible we want to use a systematic approach. +After migrating acquisition channels such as the dotnet-install scripts, Azure DevOps install tasks, GitHub Actions setup-dotnet, and official .NET Docker images to the Windows tarball, we can measure the remaining zip usage through .NET telemetry. +This data will inform whether there is a significant long tail of users who depend on zip archives. +It will allow us to make an informed decision about whether the zip archives should be retained for .NET 11. -**Version Control:** It's important to define which component controls what version is placed in the shared assembly cache. -This is critical for resolving version conflicts. 
-Higher version references may cause load failures, while lower version references could trigger security alerts. -Compile-time checks can be added to enforce the desired version rules. The VMR has helped reduce version conflict occurrences. +### Installer Considerations -The upcoming [NuGet Vision 2027 work](https://microsoft-my.sharepoint.com/:w:/r/personal/aortiz_microsoft_com/Documents/NuGet%20Vision%202027.docx?d=w9f413c3dc36a4e7d887fe007071e10c0&csf=1&web=1&e=MD76Wr&nav=eyJoIjoiMjE2NDM5NDIwIn0) for "better supporting apps running in hosted environments or with specific composition/deployment patterns" will help facilitate version conflict resolution in this area. This work will introduce the capability to declare dependencies as "provided by the hosting environment," allowing components to use the host environment's version rather than carrying their own copy. This approach will make it much easier to eliminate version differences and consolidate on shared dependency versions within the SDK. +Although installers are not the primary audience for this deduplication work, they should ideally benefit from hard link support to preserve disk space savings. -Similar patterns will be used for architecture differences as discussed in the [Architecture Differences](#architecture-differences) section. +**Linux Packages (deb, rpm):** + +Linux package formats support hard links natively. +Package managers like `dpkg` and `rpm` preserve hard links during installation, maintaining the disk space benefits when installing from packages. + +**macOS Installer (pkg):** + +The macOS pkg installer format supports hard links and will preserve them during installation, maintaining disk space savings on macOS systems. + +**Windows Installer (MSI):** + +MSI installers do not natively support hard links. +Supporting hard links in MSI would require custom logic that Windows installer subject matter experts have identified as problematic and do not recommend pursuing. 
+Further investigation is needed to confirm this. +As a result, Windows MSI installations will likely not benefit from hard link deduplication and will consume additional disk space compared to tarball installations on Windows. +The Windows installers have historically taken a more 'maximal' approach to what they include, so there is precedent for differences between the installation methods. ### Testing Strategy -Once duplicated assemblies are removed, a regression test should: +Tests will be created to ensure: -- Detect duplicate files in the SDK layout. -- Fail the build if duplicates are found. -- Prevent regressions. +- **Layout validation:** Verify that no duplicate files remain other than hard links +- **Archive Extraction testing:** Verify that tarball extraction preserves hard links on all supported platforms +- **Installer testing:** Verify that the non-Windows installers preserve hard links ## Proof of Concept Results -A proof of concept was implemented for `dotnet-watch` and `dotnet-format`, two of the largest sources of duplication. 
-The POC used the **AssemblyLoadContext** approach with a shared assembly location and yielded the following results: +A proof of concept for hard link deduplication was implemented, demonstrating the viability and effectiveness of the approach: + +**Implementation Details:** +- A new MSBuild task `DeduplicateFilesWithHardLinks` was created to identify and replace duplicate files with hard links during SDK layout generation +- Files are hashed using SHA256 to identify duplicates + +**Space Savings:** -| Metric | Baseline 11.0 (SDK only) | Size Reduction | % Reduction | -|--------------|--------------------------|----------------|-------------| -| Archive Size | 100 MB | 21 MB | 21.0% | -| Disk Size | 296 MB | 62 MB | 20.8% | -| File Count | 3,957 | 384 files | 9.7% | +In a Linux x64 development build of .NET 11.0 SDK, the hard link approach achieved: +- 131 MB reduction in disk size +- 61 MB reduction in archive size -**Note:** These numbers are from a Linux development build, which differs significantly from official signed/optimized builds included in the [By the Numbers](#by-the-numbers) section. +**Note:** Development build measurements differ from production builds due to signing, optimization, and release packaging differences. +Production measurements will be captured during full implementation. ## Other Concerns @@ -225,7 +247,7 @@ As noted in the [duplicate categorization](#linux-x64), a portion of duplicates The plan is to eliminate these differences by standardizing on CPU-specific versions. Initial analysis indicates these differences stem from AnyCPU builds coexisting with CPU-specific builds of the same assembly. -The general approach will be to prefer the CPU-specific version over the AnyCPU version when eliminating these duplicates, as CPU-specific builds can offer better performance characteristics for the target platform. +The general approach will be to prefer the CPU-specific version over the AnyCPU version when eliminating these duplicates. 
### Non-Assembly Duplicates @@ -282,21 +304,130 @@ Addressing different TFM duplicates requires different strategies than same-TFM The SDK currently ships with some content placement issues related to Runtime Identifier (RID) specificity that contribute to unnecessary bloat: #### RID-Specific Content in Inappropriate RIDs -In some cases, we ship RID-specific content to runtime identifiers where it's not applicable. This represents content that should be trimmed out entirely as it serves no purpose on the target platform. + +In some cases, we ship RID-specific content to runtime identifiers where it's not applicable. +This represents content that should be trimmed out entirely as it serves no purpose on the target platform. **Example:** [dotnet/sdk#51743](https://github.com/dotnet/sdk/issues/51743) - Windows-specific assemblies shipped in Linux distributions. #### Cross-Platform Support Content -In other cases, we ship content to support cross-platform development scenarios—for example, Windows-specific assemblies included in Linux SDKs to enable cross-compilation or multi-targeting scenarios. While this content does serve a purpose, it should be analyzed case-by-case to determine whether it should: + +In other cases, we ship content to support cross-platform development scenarios—for example, Windows-specific assemblies included in Linux SDKs to enable cross-compilation or multi-targeting scenarios. +While this content does serve a purpose, it should be analyzed case-by-case to determine whether it should: - Ship in-box as part of the core SDK experience - Be available as optional packages that can be dynamically acquired when needed **Example:** [dotnet/sdk#51835](https://github.com/dotnet/sdk/issues/51835) - Cross-platform development tooling dependencies. #### Scope and Next Steps -Both of these content placement issues are outside the scope of the duplicate elimination work. However, they will likely be surfaced and made more visible as part of this effort. 
When identified, independent issues will be logged to address these concerns separately. + +Both of these content placement issues are outside the scope of the duplicate elimination work. +However, they will likely be surfaced and made more visible as part of this effort. +When identified, independent issues will be logged to address these concerns separately. ## Related - [Visual Studio de-duplication effort](https://microsoft.sharepoint.com/:w:/s/b3f10b15-fb59-4650-957a-2c632aa943ba/IQBR2aXv7jC8RatIyxJJNPCeAQxjDmLii-R65o0yUvOJatk?e=IcnsAG) - [NuGet Vision 2027 - Better supporting apps running in hosted environments or with specific composition/deployment patterns](https://microsoft-my.sharepoint.com/:w:/r/personal/aortiz_microsoft_com/Documents/NuGet%20Vision%202027.docx?d=w9f413c3dc36a4e7d887fe007071e10c0&csf=1&web=1&e=MD76Wr&nav=eyJoIjoiMjE2NDM5NDIwIn0) + +## Appendix: Alternative Approach - Shared Assembly Location + +This appendix documents an alternative approach that was initially considered for eliminating duplicate assemblies. +While this approach was ultimately not selected in favor of hard link deduplication, it remains valuable as a reference for understanding different strategies and their trade-offs. + +### Overview + +The alternative approach focused on eliminating the vast majority of duplicate assemblies within the .NET SDK by having each shared dependency carried only once in a common location. +This required solving two distinct but related problems: +First, from a runtime and execution perspective, SDK components must be able to reliably load a single shared copy of each assembly from a common location. +Second, from a build and production perspective, SDK components would ideally be compiled against the same shared set of assemblies, and the SDK layout and packaging process must ensure that only one copy of each shared assembly is included in the final SDK distribution. 
+ +### Define a Common Assembly Location + +A well-defined common assembly location is essential for shared assemblies to be loaded by SDK components, particularly out-of-process components like global tools. +All SDK components should depend on a single version of shared dependencies as much as possible. + +The root SDK directory is a natural location for this common assembly cache. +However, the root SDK folder is already quite large, and consolidating shared assemblies there would add over 100 files to the root directory. +This level of clutter doesn't seem desirable and would negatively impact readability. + +A dedicated subdirectory for shared assemblies is preferred. +This approach keeps the root directory clean and makes it clear which assemblies are part of the shared cache. +Since we're centralizing on a single version of each shared dependency, there's no need for version-specific subdirectories within the common location. + +**Directory Structure Considerations:** + +While version folders aren't needed, we must account for framework versus core components. +Some framework-specific assemblies are shared and would need to be placed in a subdirectory to distinguish them from core assemblies (e.g., `shared/net472/` for framework components, with core assemblies directly in `shared/`). + +**Possible names for the common assembly location:** + +- `shared` — simple, clear, and consistent with .NET conventions (like the existing `shared` directory for runtimes). +- `common` — widely recognized across ecosystems for shared dependencies and utilities. +- `dependencies` / `deps` — descriptive and clear. +- `libs` — short and familiar in many build systems. + +#### Side Effects of a Common Assembly Location + +Consolidating assemblies into a common location introduces potential side effects that must be carefully considered. 
+Discussions with compiler experts have noted that the compiler toolchain is particularly sensitive to assembly availability and resolution paths. +The presence of assemblies in a shared location can affect behavior. + +Beyond the compiler, other SDK components may have similar sensitivities to assembly placement and availability. +These areas require careful analysis to identify potential behavioral changes when moving to a shared assembly model. +Further investigation in these areas will be required. + +### Load from the Common Assembly Location + +With the introduction of a formalized common assembly location, SDK components must be able to load assemblies from it. +The approach varies based on the type of component: + +**Out-of-Process Components:** Components that run in their own process, such as global tools, will use **AssemblyLoadContext** to load shared assemblies from the common location. +This approach is already used today by [dotnet-watch](https://github.com/dotnet/sdk/blob/26bbbd92e5a3cc58037e696147fa25e03e68e3a8/src/BuiltInTools/dotnet-watch/Program.cs#L288). + +**In-Process Framework Components:** These components are hosted within the SDK's framework context and define the assembly resolution paths. + +**In-Process Core Components:** These components are hosted within the SDK's core context and define the assembly resolution paths. + +**Performance:** Performance is a concern that must be validated. +We need to ensure that any changes to assembly loading do not regress performance in any way. + +### SDK Layout Adjustments + +The SDK layout needs to be updated to: + +1. Populate the common assembly location. +1. Exclude common assemblies from individual components. + +The strategy for achieving this varies based on the type of component: + +**Global Tools:** Global tools have special layout targets today. 
+A good strategy would be for these targets to list out the assemblies to keep in the tool's directory, and the rest would be added to a shared assemblies list that gets copied to the shared assembly location. + +**In-Process Components:** Other components that load in the same process would just need to ensure that the shared assemblies are defined and copied to the shared assembly location. +This can be done via various copying logic or using the `ExcludeAssets` mechanism. +The approach will vary based on the component type, but whenever possible we want to use a systematic approach. + +**Version Control:** It's important to define which component controls what version is placed in the shared assembly cache. +This is critical for resolving version conflicts. +Higher version references may cause load failures, while lower version references could trigger security alerts. +Compile-time checks can be added to enforce the desired version rules. +The VMR has helped reduce version conflict occurrences. + +The upcoming [NuGet Vision 2027 work](https://microsoft-my.sharepoint.com/:w:/r/personal/aortiz_microsoft_com/Documents/NuGet%20Vision%202027.docx?d=w9f413c3dc36a4e7d887fe007071e10c0&csf=1&web=1&e=MD76Wr&nav=eyJoIjoiMjE2NDM5NDIwIn0) for "better supporting apps running in hosted environments or with specific composition/deployment patterns" will help facilitate version conflict resolution in this area. +This work will introduce the capability to declare dependencies as "provided by the hosting environment," allowing components to use the host environment's version rather than carrying their own copy. +This approach will make it much easier to eliminate version differences and consolidate on shared dependency versions within the SDK. + +Similar patterns will be used for architecture differences as discussed in the [Architecture Differences](#architecture-differences) section. 
+ +### Why This Approach Was Not Selected + +While the shared assembly location approach offers theoretical benefits in terms of explicit dependency management, it was ultimately not selected due to several significant challenges: + +1. **Complexity:** Requires extensive changes to component loading logic, assembly resolution paths, and SDK infrastructure across numerous components. +2. **Runtime Impact:** Potential for performance regressions and behavioral changes due to modified assembly loading patterns. +3. **Side Effects:** Compiler toolchain and other components are sensitive to assembly placement, requiring extensive testing and validation. +4. **Version Conflicts:** Managing version conflicts across components requires sophisticated coordination and tooling. +5. **Implementation Cost:** High development and maintenance burden compared to filesystem-level deduplication. + +The hard link approach was selected because it achieves the same disk and archive size benefits without requiring changes to component loading logic or risking behavioral changes in sensitive toolchain components.
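
The hard link strategy described under Implementation Strategy (hash every file with SHA256, keep one copy per hash, replace the rest with hard links) can be sketched in a few lines. This is an illustration only: it is written in Python rather than as the `DeduplicateFilesWithHardLinks` MSBuild task, and the function names `sha256_of`, `deduplicate_with_hard_links`, and `hard_link_members` are hypothetical, not taken from the SDK build. It also assumes every file lives on one filesystem and that losing per-file metadata differences (timestamps, permissions) between identical copies is acceptable.

```python
import hashlib
import os
import tarfile
from collections import defaultdict
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Return the SHA256 hex digest of a file, read in 1 MiB chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as stream:
        for chunk in iter(lambda: stream.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def deduplicate_with_hard_links(layout_root: Path) -> int:
    """Replace byte-identical files under layout_root with hard links.

    Returns the number of bytes that no longer need their own copy on disk.
    Hypothetical helper: the real SDK task may pick the retained copy
    differently.
    """
    by_hash = defaultdict(list)
    for path in sorted(layout_root.rglob("*")):
        if path.is_file() and not path.is_symlink():
            by_hash[sha256_of(path)].append(path)

    saved = 0
    for paths in by_hash.values():
        keeper = paths[0]  # first path in sorted order is retained
        for duplicate in paths[1:]:
            if os.path.samefile(keeper, duplicate):
                continue  # already a hard link to the retained copy
            size = duplicate.stat().st_size
            # Link under a temporary name, then rename over the duplicate,
            # so the duplicate's path never disappears from the layout.
            temp = duplicate.with_name(duplicate.name + ".dedup-tmp")
            os.link(keeper, temp)
            os.replace(temp, duplicate)
            saved += size
    return saved


def hard_link_members(archive: Path) -> list:
    """List the archive members that tar stored as hard links."""
    with tarfile.open(archive) as tar:
        return [member.name for member in tar.getmembers() if member.islnk()]
```

After `deduplicate_with_hard_links` runs, every original path still exists and resolves to the same content, which is why no component loading logic has to change. Packing the layout with `tarfile` (or the `tar` CLI) stores the second and later paths of each inode as link entries rather than file data, and `hard_link_members` gives a quick way to confirm that for a produced archive.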