forked from WireGuard/wireguard-go
change the name of default branch to "main" #45
Open: Need-an-AwP wants to merge 39 commits into tailscale:main from Need-an-AwP:main
Conversation
Signed-off-by: Jordan Whited <jordan@tailscale.com>
Signed-off-by: James Tucker <james@tailscale.com>
StdNetBind probes for UDP GSO and GRO support at runtime. UDP GSO is dependent on checksum offload support on the egress netdev. UDP GSO will be disabled in the event sendmmsg() returns EIO, which is a strong signal that the egress netdev does not support checksum offload.

The iperf3 results below demonstrate the effect of this commit between two Linux computers with i5-12400 CPUs. There is roughly ~13us of round trip latency between them.

The first result is from commit 052af4a without UDP GSO or GRO.

Starting Test: protocol: TCP, 1 streams, 131072 byte blocks
[ ID] Interval           Transfer     Bitrate         Retr  Cwnd
[  5]   0.00-10.00  sec  9.85 GBytes  8.46 Gbits/sec  1139  3.01 MBytes
- - - - - - - - - - - - - - - - - - - - - - - - -
Test Complete. Summary Results:
[ ID] Interval           Transfer     Bitrate         Retr
[  5]   0.00-10.00  sec  9.85 GBytes  8.46 Gbits/sec  1139  sender
[  5]   0.00-10.04  sec  9.85 GBytes  8.42 Gbits/sec        receiver

The second result is with UDP GSO and GRO.

Starting Test: protocol: TCP, 1 streams, 131072 byte blocks
[ ID] Interval           Transfer     Bitrate         Retr  Cwnd
[  5]   0.00-10.00  sec  12.3 GBytes  10.6 Gbits/sec  232   3.15 MBytes
- - - - - - - - - - - - - - - - - - - - - - - - -
Test Complete. Summary Results:
[ ID] Interval           Transfer     Bitrate         Retr
[  5]   0.00-10.00  sec  12.3 GBytes  10.6 Gbits/sec  232   sender
[  5]   0.00-10.04  sec  12.3 GBytes  10.6 Gbits/sec        receiver

Reviewed-by: Adrian Dewhurst <adrian@tailscale.com>
Signed-off-by: Jordan Whited <jordan@tailscale.com>
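A minimal Go sketch (Linux-oriented, standard library only) of the fallback behavior described above; the type and method names are invented for illustration and are not the actual StdNetBind implementation.

```go
package sketch

import (
	"errors"
	"sync/atomic"
	"syscall"
)

// hypotheticalBind stands in for a UDP bind whose egress netdev may lack
// checksum offload. None of these names come from the real StdNetBind code.
type hypotheticalBind struct {
	udpGSO atomic.Bool // true while the runtime probe says UDP GSO is usable
}

// shouldRetryWithoutGSO inspects a send error: EIO from sendmmsg() is a
// strong signal that the egress netdev cannot do checksum offload, so UDP
// GSO is disabled for all future sends and the caller retries unbatched.
func (b *hypotheticalBind) shouldRetryWithoutGSO(err error) bool {
	if err == nil {
		return false
	}
	if errors.Is(err, syscall.EIO) && b.udpGSO.Swap(false) {
		return true // GSO was enabled; retry this batch without it
	}
	return false
}
```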
After reducing UDP stack traversal overhead via GSO and GRO, runtime.chanrecv() began to account for a high percentage (20% in one environment) of perf samples during a throughput benchmark. The individual packet channel ops with the crypto goroutines were the primary contributor to this overhead. Updating these channels to pass vectors, which the device package already handles at its ends, reduced this overhead substantially, and improved throughput.

The iperf3 results below demonstrate the effect of this commit between two Linux computers with i5-12400 CPUs. There is roughly ~13us of round trip latency between them.

The first result is with UDP GSO and GRO, and with single element channels.

Starting Test: protocol: TCP, 1 streams, 131072 byte blocks
[ ID] Interval           Transfer     Bitrate         Retr  Cwnd
[  5]   0.00-10.00  sec  12.3 GBytes  10.6 Gbits/sec  232   3.15 MBytes
- - - - - - - - - - - - - - - - - - - - - - - - -
Test Complete. Summary Results:
[ ID] Interval           Transfer     Bitrate         Retr
[  5]   0.00-10.00  sec  12.3 GBytes  10.6 Gbits/sec  232   sender
[  5]   0.00-10.04  sec  12.3 GBytes  10.6 Gbits/sec        receiver

The second result is with channels updated to pass a slice of elements.

Starting Test: protocol: TCP, 1 streams, 131072 byte blocks
[ ID] Interval           Transfer     Bitrate         Retr  Cwnd
[  5]   0.00-10.00  sec  13.2 GBytes  11.3 Gbits/sec  182   3.15 MBytes
- - - - - - - - - - - - - - - - - - - - - - - - -
Test Complete. Summary Results:
[ ID] Interval           Transfer     Bitrate         Retr
[  5]   0.00-10.00  sec  13.2 GBytes  11.3 Gbits/sec  182   sender
[  5]   0.00-10.04  sec  13.2 GBytes  11.3 Gbits/sec        receiver

Reviewed-by: Adrian Dewhurst <adrian@tailscale.com>
Signed-off-by: Jordan Whited <jordan@tailscale.com>
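A toy sketch of the shape of this change, with invented type names: one channel operation now carries a whole vector of elements rather than a single element, amortizing the cost of runtime.chanrecv() across the batch.

```go
package sketch

// element stands in for one queued packet element (a hypothetical name).
type element struct{ packet []byte }

// Before: one runtime.chansend()/chanrecv() per packet.
var perPacket = make(chan *element, 1024)

// After: one channel operation per vector, so the channel overhead is
// amortized across every packet in the batch. The device package already
// assembles and consumes such batches at both ends of the queue.
var perVector = make(chan []*element, 1024)

// sendBatch shows the amortization: a single send covers len(batch) packets.
func sendBatch(batch []*element) {
	perVector <- batch
}
```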
$ benchstat old.txt new.txt
goos: linux
goarch: amd64
pkg: golang.zx2c4.com/wireguard/tun
cpu: 12th Gen Intel(R) Core(TM) i5-12400
                   │   old.txt    │               new.txt                │
                   │    sec/op    │    sec/op     vs base                │
Checksum/64-12       10.670n ± 2%    4.769n ± 0%  -55.30% (p=0.000 n=10)
Checksum/128-12      19.665n ± 2%    8.032n ± 0%  -59.16% (p=0.000 n=10)
Checksum/256-12       37.68n ± 1%    16.06n ± 0%  -57.37% (p=0.000 n=10)
Checksum/512-12       76.61n ± 3%    32.13n ± 0%  -58.06% (p=0.000 n=10)
Checksum/1024-12     160.55n ± 4%    64.25n ± 0%  -59.98% (p=0.000 n=10)
Checksum/1500-12     231.05n ± 7%    94.12n ± 0%  -59.26% (p=0.000 n=10)
Checksum/2048-12      309.5n ± 3%    128.5n ± 0%  -58.48% (p=0.000 n=10)
Checksum/4096-12      603.8n ± 4%    257.2n ± 0%  -57.41% (p=0.000 n=10)
Checksum/8192-12     1185.0n ± 3%    515.5n ± 0%  -56.50% (p=0.000 n=10)
Checksum/9000-12     1328.5n ± 5%    564.8n ± 0%  -57.49% (p=0.000 n=10)
Checksum/9001-12     1340.5n ± 3%    564.8n ± 0%  -57.87% (p=0.000 n=10)
geomean               185.3n          77.99n      -57.92%

Reviewed-by: Adrian Dewhurst <adrian@tailscale.com>
Signed-off-by: Jordan Whited <jordan@tailscale.com>
IPv4 header and pseudo header checksums were being computed on every merge operation. Additionally, virtioNetHdr was being written at the same time. This delays those operations until after all coalescing has occurred. Reviewed-by: Adrian Dewhurst <adrian@tailscale.com> Signed-off-by: Jordan Whited <jordan@tailscale.com>
Queue{In,Out}boundElement locking can contribute to significant overhead via sync.Mutex.lockSlow() in some environments. These types are passed throughout the device package as elements in a slice, so move the per-element Mutex to a container around the slice. Signed-off-by: Jordan Whited <jordan@tailscale.com>
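A hedged sketch of the locking change, with invented names standing in for Queue{In,Out}boundElement: the Mutex moves from each element to a container wrapping the slice, so a batch is locked once instead of once per element.

```go
package sketch

import "sync"

// elemBefore models the old layout: every element carried its own lock.
type elemBefore struct {
	sync.Mutex
	packet []byte
}

// elemAfter models the new layout: the element itself is lock-free.
type elemAfter struct {
	packet []byte
}

// elemsContainer wraps the slice that is passed around the device package
// and holds the single Mutex for the whole batch.
type elemsContainer struct {
	sync.Mutex
	elems []*elemAfter
}

func process(c *elemsContainer) {
	c.Lock() // one lock acquisition covers every element in the batch
	defer c.Unlock()
	for _, e := range c.elems {
		_ = e.packet // ... encrypt/decrypt, write, etc.
	}
}
```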
Signed-off-by: Adrian Dewhurst <adrian@tailscale.com>
This adds AMD64 assembly implementations of IP checksum computation, one for baseline AMD64 and the other for v3 AMD64 (AVX2 and BMI2). All performance numbers reported are from a Ryzen 7 4750U but similar improvements are expected for a wide range of processors.

The generic IP checksum implementation has also been further improved to be significantly faster using bits.Add64 (for a 64KiB buffer the throughput improves from 15,000MiB/s to 27,600MiB/s; similar gains are also reported on ARM64 but I do not have specific numbers). The baseline AMD64 implementation for a 64KiB buffer reports 32,700MiB/s and the AVX2 implementation is slightly over 107,000MiB/s.

Unfortunately, for very small sizes (e.g. the expected size for an IPv4 header) setting up SIMD computation involves some overhead that makes computing a checksum for small buffers slower than a non-SIMD implementation. Even more unfortunately, testing for this at runtime in Go and calling a func optimized for small buffers mitigates most of the improvement due to call overhead. The break-even point is around 256-byte buffers; IPv4 headers are no more than 60 bytes including extensions. IPv6 headers do not have a checksum but are a fixed size of 40 bytes. As a result, the generated assembly code uses an alternate approach for buffers of less than 256 bytes. Additionally, buffers of less than 32 bytes need to be handled specially because the strategy for reading buffers that are not a multiple of 8 bytes fails when the buffer is too small.

As suggested by additional benchmarking, pseudo header computation has been rewritten to be faster (benchmark time reduced by 1/2 to 1/4).

Updates tailscale/corp#9755

Signed-off-by: Adrian Dewhurst <adrian@tailscale.com>
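For illustration only, a plain-Go sketch of the carry-folding technique using the standard-library math/bits.Add64; this is not the optimized generic or assembly implementation from the commit, just a correct rendering of the idea (sum big-endian chunks with end-around carry, then fold to 16 bits).

```go
package sketch

import (
	"encoding/binary"
	"math/bits"
)

// internetChecksum illustrates the carry-folding idea: sum 64-bit big-endian
// chunks with bits.Add64, feed each carry back in (end-around carry), then
// fold the 64-bit sum down to 16 bits and complement it.
func internetChecksum(b []byte) uint16 {
	var sum, carry uint64
	for len(b) >= 8 {
		sum, carry = bits.Add64(sum, binary.BigEndian.Uint64(b), 0)
		sum += carry
		b = b[8:]
	}
	for len(b) >= 2 {
		sum, carry = bits.Add64(sum, uint64(b[0])<<8|uint64(b[1]), 0)
		sum += carry
		b = b[2:]
	}
	if len(b) == 1 { // an odd trailing byte is padded with a zero on the right
		sum, carry = bits.Add64(sum, uint64(b[0])<<8, 0)
		sum += carry
	}
	for sum>>16 != 0 {
		sum = (sum & 0xffff) + (sum >> 16)
	}
	return ^uint16(sum)
}
```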
Signed-off-by: Brad Fitzpatrick <bradfitz@tailscale.com>
Close closes the events channel, resulting in a panic from send on closed channel. Reported-By: Brad Fitzpatrick <brad@tailscale.com> Link: tailscale/tailscale#9555 Signed-off-by: James Tucker <james@tailscale.com>
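A general Go sketch of one common way to avoid this class of panic, guarding sends and close with a shared mutex and a closed flag; it illustrates the failure mode being fixed, not the exact change applied to the tun device here.

```go
package sketch

import "sync"

// eventSource shows the pattern: close() and every send share a mutex and a
// closed flag, so a late event is dropped instead of panicking with
// "send on closed channel".
type eventSource struct {
	mu     sync.Mutex
	closed bool
	events chan int
}

func (s *eventSource) emit(ev int) {
	s.mu.Lock()
	defer s.mu.Unlock()
	if s.closed {
		return // drop events after Close instead of panicking
	}
	s.events <- ev
}

func (s *eventSource) Close() {
	s.mu.Lock()
	defer s.mu.Unlock()
	if !s.closed {
		s.closed = true
		close(s.events)
	}
}
```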
Signed-off-by: Jordan Whited <jordan@tailscale.com>
Signed-off-by: Jordan Whited <jordan@tailscale.com>
Access to Peer.endpoint was previously synchronized by Peer.RWMutex. This has now moved to Peer.endpoint.Mutex.

Peer.SendBuffers() is now the sole caller of Endpoint.ClearSrc(), which is signaled via a new bool, Peer.endpoint.clearSrcOnTx. Previous callers of Endpoint.ClearSrc() now set this bool, primarily via peer.markEndpointSrcForClearing(). Peer.SetEndpointFromPacket() clears Peer.endpoint.clearSrcOnTx when an updated conn.Endpoint is stored. This maintains the same event order as before, i.e. a conn.Endpoint received after peer.endpoint.clearSrcOnTx is set, but before the next Peer.SendBuffers() call, results in the latest conn.Endpoint source being used for the next packet transmission.

These changes result in throughput improvements for single flow, parallel (-P n) flow, and bidirectional (--bidir) flow iperf3 TCP/UDP tests as measured on both Linux and Windows. Latency under load improves especially for high throughput Linux scenarios. These improvements are likely realized on all platforms to some degree, as the changes are not platform-specific.

Co-authored-by: James Tucker <james@tailscale.com>
Signed-off-by: Jordan Whited <jordan@tailscale.com>
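A rough sketch of the structure this describes, with an invented layout; only the relationships (an endpoint-local Mutex, a clearSrcOnTx bool, and SendBuffers as the sole ClearSrc caller) are taken from the commit message.

```go
package sketch

import "sync"

// endpoint stands in for conn.Endpoint; only the method relevant here is shown.
type endpoint interface{ ClearSrc() }

type peer struct {
	endpoint struct {
		sync.Mutex
		val          endpoint
		clearSrcOnTx bool // set instead of calling ClearSrc() directly
	}
}

// markEndpointSrcForClearing replaces the old direct ClearSrc() calls.
func (p *peer) markEndpointSrcForClearing() {
	p.endpoint.Lock()
	defer p.endpoint.Unlock()
	p.endpoint.clearSrcOnTx = true
}

// SendBuffers is now the only place ClearSrc() runs, just before transmit.
func (p *peer) SendBuffers() {
	p.endpoint.Lock()
	if p.endpoint.clearSrcOnTx && p.endpoint.val != nil {
		p.endpoint.val.ClearSrc()
		p.endpoint.clearSrcOnTx = false
	}
	ep := p.endpoint.val
	p.endpoint.Unlock()
	_ = ep // ... transmit using ep ...
}
```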
Peer.RoutineSequentialReceiver() deals with packet vectors and does not need to perform timer and endpoint operations for every packet in a given vector. Changing these per-packet operations to per-vector improves throughput by as much as 10% in some environments. Signed-off-by: Jordan Whited <jordan@tailscale.com>
Certain device drivers (e.g. vxlan, geneve) do not properly handle coalesced UDP packets later in the stack, resulting in packet loss. Signed-off-by: Jordan Whited <jordan@tailscale.com>
The sync.Locker used with a sync.Cond must be acquired when changing the associated condition, otherwise there is a window within sync.Cond.Wait() where a wake-up may be missed. Fixes: 4846070 ("device: use a waiting sync.Pool instead of a channel") Signed-off-by: Jordan Whited <jordan@tailscale.com>
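A self-contained sketch of the rule being fixed: the Cond's locker is held while the condition changes, so a goroutine inside Wait() cannot miss the wake-up. The pool below is illustrative, not the actual WaitPool.

```go
package sketch

import "sync"

// waitPool limits the number of outstanding items; count is the condition
// associated with cond, and it is only ever changed while mu is held.
type waitPool struct {
	mu    sync.Mutex
	cond  *sync.Cond
	count int
	max   int
}

func newWaitPool(max int) *waitPool {
	p := &waitPool{max: max}
	p.cond = sync.NewCond(&p.mu)
	return p
}

func (p *waitPool) get() {
	p.mu.Lock()
	for p.count >= p.max {
		p.cond.Wait() // releases mu while waiting, reacquires before returning
	}
	p.count++
	p.mu.Unlock()
}

func (p *waitPool) put() {
	p.mu.Lock() // holding mu while changing count closes the missed-wakeup window
	p.count--
	p.mu.Unlock()
	p.cond.Signal()
}
```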
…itPool Fixes: 3bb8fec ("conn, device, tun: implement vectorized I/O plumbing") Signed-off-by: Jordan Whited <jordan@tailscale.com>
Introduce an optional extension point for Endpoint that enables a path for WireGuard to inform an integration about the peer public key that is associated with an Endpoint. The API is expected to return either the same or a new Endpoint in response to this function. A future version of this patch could potentially remove the returned Endpoint, but would require larger integrator changes downstream. This adds a small per-packet cost that could later be removed with a larger refactor of the wireguard-go interface and Tailscale magicsock code, as well as introducing a generic bound for Endpoint in a device & bind instance. Updates tailscale/corp#20732
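A hedged sketch of what such an extension point could look like. The interface and method names below are guesses for illustration; only the shape (receive the peer public key, return the same or a new Endpoint) is taken from the description above.

```go
package sketch

// Endpoint stands in for conn.Endpoint; only what the sketch needs is shown.
type Endpoint interface {
	ClearSrc()
}

// NoisePublicKey stands in for the device package's peer public key type.
type NoisePublicKey [32]byte

// peerAwareEndpoint sketches the optional extension point: an Endpoint that
// wants to learn which peer it belongs to implements this extra method.
type peerAwareEndpoint interface {
	Endpoint
	FromPeer(peerPublicKey NoisePublicKey) Endpoint
}

// onPeerIdentified shows where such a hook would run: once the peer is known,
// the bind-provided Endpoint may be exchanged for a peer-specific one.
func onPeerIdentified(ep Endpoint, pk NoisePublicKey) Endpoint {
	if pae, ok := ep.(peerAwareEndpoint); ok {
		return pae.FromPeer(pk)
	}
	return ep
}
```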
External implementers of tun.Device may support GSO, and may also be platform-agnostic, e.g. gVisor. Signed-off-by: Jordan Whited <jordan@tailscale.com>
External implementers of tun.Device may support GRO, requiring checksum offload. Signed-off-by: Jordan Whited <jordan@tailscale.com>
When generating page-aligned random bytes, the random data started at the beginning of the buffer, including the prefix that gets chopped off to achieve alignment. When the page size differs, the start of the returned slice is therefore different from what the expected checksums assume, causing the tests to fail.
torvalds/linux@e269d79 broke virtio_net TCP & UDP GRO causing GRO writes to return EINVAL. The bug was then resolved later in torvalds/linux@89add40. The offending commit was pulled into various LTS releases. Updates tailscale/tailscale#13041 Signed-off-by: Jordan Whited <jordan@tailscale.com>
The manual struct packing was suspect (tailscale/tailscale#11899), and there's no need to do it manually when there's an API for it already. Updates tailscale/tailscale#11899 Signed-off-by: Brad Fitzpatrick <bradfitz@tailscale.com>
Reviewed-by: James Tucker <james@tailscale.com>
Upstream doesn't use GitHub Actions for CI, as GitHub is simply a mirror. Our workflows involve GitHub, so establish some basic CI jobs. Updates tailscale/corp#28877 Signed-off-by: Jordan Whited <jordan@tailscale.com>
Only bother updating the rxBytes counter once we've processed a whole vector, since additions are atomic. cherry picked from commit WireGuard/wireguard-go@542e565 Updates tailscale/corp#28879 Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
There is a possible deadlock in `device.Close()` when you try to close the device very soon after its start. The problem is that two different methods acquire the same locks in different order:

1. device.Close()
   - device.ipcMutex.Lock()
   - device.state.Lock()

2. device.changeState(deviceState)
   - device.state.Lock()
   - device.ipcMutex.Lock()

Reproducer:

    func TestDevice_deadlock(t *testing.T) {
        d := randDevice(t)
        d.Close()
    }

Problem:

    $ go clean -testcache && go test -race -timeout 3s -run TestDevice_deadlock ./device | grep -A 10 sync.runtime_SemacquireMutex
    sync.runtime_SemacquireMutex(0xc000117d20?, 0x94?, 0x0?)
            /usr/local/opt/go/libexec/src/runtime/sema.go:77 +0x25
    sync.(*Mutex).lockSlow(0xc000130518)
            /usr/local/opt/go/libexec/src/sync/mutex.go:171 +0x213
    sync.(*Mutex).Lock(0xc000130518)
            /usr/local/opt/go/libexec/src/sync/mutex.go:90 +0x55
    golang.zx2c4.com/wireguard/device.(*Device).Close(0xc000130500)
            /Users/martin.basovnik/git/basovnik/wireguard-go/device/device.go:373 +0xb6
    golang.zx2c4.com/wireguard/device.TestDevice_deadlock(0x0?)
            /Users/martin.basovnik/git/basovnik/wireguard-go/device/device_test.go:480 +0x2c
    testing.tRunner(0xc00014c000, 0x131d7b0)
    --
    sync.runtime_SemacquireMutex(0xc000130564?, 0x60?, 0xc000130548?)
            /usr/local/opt/go/libexec/src/runtime/sema.go:77 +0x25
    sync.(*Mutex).lockSlow(0xc000130750)
            /usr/local/opt/go/libexec/src/sync/mutex.go:171 +0x213
    sync.(*Mutex).Lock(0xc000130750)
            /usr/local/opt/go/libexec/src/sync/mutex.go:90 +0x55
    sync.(*RWMutex).Lock(0xc000130750)
            /usr/local/opt/go/libexec/src/sync/rwmutex.go:147 +0x45
    golang.zx2c4.com/wireguard/device.(*Device).upLocked(0xc000130500)
            /Users/martin.basovnik/git/basovnik/wireguard-go/device/device.go:179 +0x72
    golang.zx2c4.com/wireguard/device.(*Device).changeState(0xc000130500, 0x1)

cherry picked from commit WireGuard/wireguard-go@12269c2

Updates tailscale/corp#28879

Signed-off-by: Martin Basovnik <martin.basovnik@gmail.com>
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
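The usual remedy for this kind of cycle is a single global lock order. A minimal sketch follows, with the order chosen arbitrarily for illustration; it is not necessarily the order used in the actual fix.

```go
package sketch

import "sync"

// dev holds two locks that are always taken in the same order (state before
// ipcMutex here). The deadlock above happens when one code path takes
// ipcMutex then state while another takes state then ipcMutex.
type dev struct {
	state    sync.Mutex
	ipcMutex sync.RWMutex
}

func (d *dev) close() {
	d.state.Lock()
	defer d.state.Unlock()
	d.ipcMutex.Lock()
	defer d.ipcMutex.Unlock()
	// ... tear down ...
}

func (d *dev) changeState() {
	d.state.Lock()
	defer d.state.Unlock()
	d.ipcMutex.Lock()
	defer d.ipcMutex.Unlock()
	// ... transition ...
}
```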
Reduce allocations by eliminating byte reader, hand-rolled decoding and reusing message structs.

Synthetic benchmark:

    var msgSink MessageInitiation

    func BenchmarkMessageInitiationUnmarshal(b *testing.B) {
        packet := make([]byte, MessageInitiationSize)
        reader := bytes.NewReader(packet)
        err := binary.Read(reader, binary.LittleEndian, &msgSink)
        if err != nil {
            b.Fatal(err)
        }
        b.Run("binary.Read", func(b *testing.B) {
            b.ReportAllocs()
            for range b.N {
                reader := bytes.NewReader(packet)
                _ = binary.Read(reader, binary.LittleEndian, &msgSink)
            }
        })
        b.Run("unmarshal", func(b *testing.B) {
            b.ReportAllocs()
            for range b.N {
                _ = msgSink.unmarshal(packet)
            }
        })
    }

Results:

                                                │      -      │
                                                │   sec/op    │
    MessageInitiationUnmarshal/binary.Read-8      1.508µ ± 2%
    MessageInitiationUnmarshal/unmarshal-8        12.66n ± 2%

                                                │     -      │
                                                │    B/op    │
    MessageInitiationUnmarshal/binary.Read-8      208.0 ± 0%
    MessageInitiationUnmarshal/unmarshal-8        0.000 ± 0%

                                                │     -      │
                                                │ allocs/op  │
    MessageInitiationUnmarshal/binary.Read-8      2.000 ± 0%
    MessageInitiationUnmarshal/unmarshal-8        0.000 ± 0%

cherry picked from commit WireGuard/wireguard-go@9e7529c

Updates tailscale/corp#28879

Signed-off-by: Alexander Yastrebov <yastrebov.alex@gmail.com>
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
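A small stand-in example of the hand-rolled decoding pattern; the message and its field set below are invented and much smaller than the real MessageInitiation, but the technique (direct offset reads, no reader, no reflection) is the same.

```go
package sketch

import (
	"encoding/binary"
	"errors"
)

// msgDemo is a stand-in wire message with a fixed little-endian layout.
type msgDemo struct {
	Type      uint32
	Sender    uint32
	Ephemeral [32]byte
}

const msgDemoSize = 4 + 4 + 32

// unmarshal decodes packet into m without any intermediate reader or
// reflection, which is what removes the allocations binary.Read incurs.
func (m *msgDemo) unmarshal(packet []byte) error {
	if len(packet) != msgDemoSize {
		return errors.New("unexpected message size")
	}
	m.Type = binary.LittleEndian.Uint32(packet[0:4])
	m.Sender = binary.LittleEndian.Uint32(packet[4:8])
	copy(m.Ephemeral[:], packet[8:40])
	return nil
}
```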
This is already enforced in receive.go, but if these unmarshallers are to have error return values anyway, make them as explicit as possible. cherry picked from commit WireGuard/wireguard-go@842888a Updates tailscale/corp#28879 Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
Optimize message encoding by eliminating binary.Write (which internally uses reflection) in favour of hand-rolled encoding. This is companion to 9e7529c.

Synthetic benchmark:

    var packetSink []byte

    func BenchmarkMessageInitiationMarshal(b *testing.B) {
        var msg MessageInitiation
        b.Run("binary.Write", func(b *testing.B) {
            b.ReportAllocs()
            for range b.N {
                var buf [MessageInitiationSize]byte
                writer := bytes.NewBuffer(buf[:0])
                _ = binary.Write(writer, binary.LittleEndian, msg)
                packetSink = writer.Bytes()
            }
        })
        b.Run("binary.Encode", func(b *testing.B) {
            b.ReportAllocs()
            for range b.N {
                packet := make([]byte, MessageInitiationSize)
                _, _ = binary.Encode(packet, binary.LittleEndian, msg)
                packetSink = packet
            }
        })
        b.Run("marshal", func(b *testing.B) {
            b.ReportAllocs()
            for range b.N {
                packet := make([]byte, MessageInitiationSize)
                _ = msg.marshal(packet)
                packetSink = packet
            }
        })
    }

Results:

                                                 │      -      │
                                                 │   sec/op    │
    MessageInitiationMarshal/binary.Write-8        1.337µ ± 0%
    MessageInitiationMarshal/binary.Encode-8       1.242µ ± 0%
    MessageInitiationMarshal/marshal-8             53.05n ± 1%

                                                 │     -     │
                                                 │   B/op    │
    MessageInitiationMarshal/binary.Write-8        368.0 ± 0%
    MessageInitiationMarshal/binary.Encode-8       160.0 ± 0%
    MessageInitiationMarshal/marshal-8             160.0 ± 0%

                                                 │     -      │
                                                 │ allocs/op  │
    MessageInitiationMarshal/binary.Write-8        3.000 ± 0%
    MessageInitiationMarshal/binary.Encode-8       1.000 ± 0%
    MessageInitiationMarshal/marshal-8             1.000 ± 0%

cherry picked from commit WireGuard/wireguard-go@264889f

Updates tailscale/corp#28879

Signed-off-by: Alexander Yastrebov <yastrebov.alex@gmail.com>
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
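And the encoding counterpart of the same invented stand-in message: direct PutUint32/copy calls into the caller's buffer instead of binary.Write, so there is no reflection and only one allocation (the packet itself).

```go
package sketch

import (
	"encoding/binary"
	"errors"
)

// msgDemo mirrors the stand-in message used in the decoding sketch above.
type msgDemo struct {
	Type      uint32
	Sender    uint32
	Ephemeral [32]byte
}

const msgDemoSize = 4 + 4 + 32

// marshal writes m into packet with fixed offsets and little-endian order.
func (m *msgDemo) marshal(packet []byte) error {
	if len(packet) != msgDemoSize {
		return errors.New("unexpected buffer size")
	}
	binary.LittleEndian.PutUint32(packet[0:4], m.Type)
	binary.LittleEndian.PutUint32(packet[4:8], m.Sender)
	copy(packet[8:40], m.Ephemeral[:])
	return nil
}
```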
This enables a conn.Bind to bring its own encapsulating transport, e.g. VXLAN/Geneve. Updates tailscale/corp#27502 Signed-off-by: Jordan Whited <jordan@tailscale.com>
It was previously suppressed if roaming was disabled for the peer. Tailscale always disables roaming, as we explicitly configure a conn.Endpoint for every peer. This commit also modifies PeerAwareEndpoint usage such that wireguard-go never uses/sets it as a Peer Endpoint value. In theory we (Tailscale) always disable roaming, so we should always return early from SetEndpointFromPacket(), but this acts as an extra footgun guard and improves clarity around intended usage. Updates tailscale/corp#27502 Updates tailscale/corp#29422 Updates tailscale/corp#30042 Signed-off-by: Jordan Whited <jordan@tailscale.com>
To be implemented by [magicsock.lazyEndpoint], which is responsible for triggering JIT peer configuration. Updates tailscale/corp#20732 Updates tailscale/corp#30042 Signed-off-by: Jordan Whited <jordan@tailscale.com>
Updates tailscale/corp#30364 Signed-off-by: Jordan Whited <jordan@tailscale.com>
Peer.SetEndpointFromPacket is not called per-packet. It is guaranteed to be called at least once per packet batch. Updates tailscale/corp#30042 Updates tailscale/corp#20732 Signed-off-by: Jordan Whited <jordan@tailscale.com>
This should probably be an issue, but there's no issue option in this repo.

I use "github.com/tailscale/wireguard-go/tun" to customize the properties of the tun device, in order to avoid conflicts with the official client. But go mod tidy can only find the main branch, and the main branch's go.mod declares the module as "golang.zx2c4.com/wireguard", so go mod tidy stops working. I worked around this by installing tstun first and then importing this package. I also noticed that in tstun this dependency is written as:

github.com/tailscale/wireguard-go v0.0.0-20250716170648-1d0488a3d7da

Of course there are ways to pin it explicitly, such as a require or a replace directive; a sketch follows below.
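For example, a hypothetical go.mod (the module path and Go version are invented) pinning the fork with the pseudo-version quoted above:

```
module example.com/yourapp // hypothetical consumer module

go 1.22

// Pin the fork explicitly, using the pseudo-version quoted above:
require github.com/tailscale/wireguard-go v0.0.0-20250716170648-1d0488a3d7da

// Or, if existing code imports the upstream path, redirect it to the fork:
// replace golang.zx2c4.com/wireguard => github.com/tailscale/wireguard-go v0.0.0-20250716170648-1d0488a3d7da
```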
Still, this confused me for a while, because at first I used golang.zx2c4.com/wireguard in the replace directive and my tun settings never took effect (I was configuring the wrong package, but had no idea at the time 😅). I know renaming the default branch would cause a lot of problems, so if that's what you meant, feel free to close this PR. 😊