Conversation

Need-an-AwP

This should be an issue, but there's no issue tab in this repo.

I use "github.com/tailscale/wireguard-go/tun" to customize the properties of the TUN device,
which I do to avoid conflicts with the official client.
But go mod tidy can only find the main branch, and the main branch's go.mod declares itself as "module golang.zx2c4.com/wireguard",
so go mod tidy refuses to resolve it:

github.com/tailscale/wireguard-go/tun: github.com/tailscale/wireguard-go@v0.0.20201118: parsing go.mod:
        module declares its path as: golang.zx2c4.com/wireguard   
                but was required as: github.com/tailscale/wireguard-go

I worked around this by installing tstun first and then importing this package.
I also noticed that in tstun this dependency is written as:
github.com/tailscale/wireguard-go v0.0.0-20250716170648-1d0488a3d7da

Of course there are ways to pin it explicitly, like a require or replace directive,
but this confused me for a while: at first I pointed the replace at golang.zx2c4.com/wireguard, so my TUN settings never took effect (I was configuring the wrong package without realizing it 😅).
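
For reference, this is the shape of go.mod that should avoid the mismatch — a minimal sketch, where example.com/mytun and the go directive are placeholders and the pseudo-version is the one tstun pins above:

    // go.mod of a hypothetical consumer module
    module example.com/mytun

    go 1.22

    // Pin the fork by pseudo-version (the same commit tstun uses). Requiring
    // the main branch resolves to a go.mod that declares
    // golang.zx2c4.com/wireguard and fails with the error shown above.
    require github.com/tailscale/wireguard-go v0.0.0-20250716170648-1d0488a3d7da
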
I know renaming the default branch would cause a lot of problems, so if that's what you intended,
feel free to close this PR.
😊

jwhited and others added 30 commits September 27, 2023 15:03
Signed-off-by: Jordan Whited <jordan@tailscale.com>
Signed-off-by: James Tucker <james@tailscale.com>
StdNetBind probes for UDP GSO and GRO support at runtime. UDP GSO is
dependent on checksum offload support on the egress netdev. UDP GSO
will be disabled in the event sendmmsg() returns EIO, which is a strong
signal that the egress netdev does not support checksum offload.
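
In sketch form, the disable-on-EIO fallback described above looks roughly like this; the type, field, and method names are illustrative stand-ins, not the actual StdNetBind implementation:

    package bindsketch

    import (
        "errors"
        "net"
        "net/netip"
        "sync/atomic"

        "golang.org/x/sys/unix"
    )

    // bind is a stand-in for StdNetBind; the fields and methods are illustrative.
    type bind struct {
        gsoSupported atomic.Bool // set true when the runtime probe succeeds
    }

    // sendSegmentsGSO is a placeholder for a coalesced sendmmsg()-based send.
    func (b *bind) sendSegmentsGSO(conn *net.UDPConn, segs [][]byte, ep netip.AddrPort) error {
        return unix.EIO // pretend the egress netdev lacks checksum offload
    }

    // send tries a GSO send first and permanently falls back to per-packet
    // writes when the kernel reports EIO.
    func (b *bind) send(conn *net.UDPConn, segs [][]byte, ep netip.AddrPort) error {
        if b.gsoSupported.Load() {
            err := b.sendSegmentsGSO(conn, segs, ep)
            if !errors.Is(err, unix.EIO) {
                return err
            }
            // EIO is a strong signal that the egress netdev does not support
            // checksum offload, so disable UDP GSO for this bind.
            b.gsoSupported.Store(false)
        }
        for _, seg := range segs {
            if _, err := conn.WriteToUDPAddrPort(seg, ep); err != nil {
                return err
            }
        }
        return nil
    }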

The iperf3 results below demonstrate the effect of this commit between
two Linux computers with i5-12400 CPUs. There is roughly 13us of
round-trip latency between them.

The first result is from commit 052af4a without UDP GSO or GRO.

Starting Test: protocol: TCP, 1 streams, 131072 byte blocks
[ ID] Interval           Transfer     Bitrate         Retr  Cwnd
[  5]   0.00-10.00  sec  9.85 GBytes  8.46 Gbits/sec  1139   3.01 MBytes
- - - - - - - - - - - - - - - - - - - - - - - - -
Test Complete. Summary Results:
[ ID] Interval           Transfer     Bitrate         Retr
[  5]   0.00-10.00  sec  9.85 GBytes  8.46 Gbits/sec  1139  sender
[  5]   0.00-10.04  sec  9.85 GBytes  8.42 Gbits/sec        receiver

The second result is with UDP GSO and GRO.

Starting Test: protocol: TCP, 1 streams, 131072 byte blocks
[ ID] Interval           Transfer     Bitrate         Retr  Cwnd
[  5]   0.00-10.00  sec  12.3 GBytes  10.6 Gbits/sec  232   3.15 MBytes
- - - - - - - - - - - - - - - - - - - - - - - - -
Test Complete. Summary Results:
[ ID] Interval           Transfer     Bitrate         Retr
[  5]   0.00-10.00  sec  12.3 GBytes  10.6 Gbits/sec  232   sender
[  5]   0.00-10.04  sec  12.3 GBytes  10.6 Gbits/sec        receiver

Reviewed-by: Adrian Dewhurst <adrian@tailscale.com>
Signed-off-by: Jordan Whited <jordan@tailscale.com>
After reducing UDP stack traversal overhead via GSO and GRO,
runtime.chanrecv() began to account for a high percentage (20% in one
environment) of perf samples during a throughput benchmark. The
individual, per-packet channel ops with the crypto goroutines were the
primary contributor to this overhead.

Updating these channels to pass vectors, which the device package
already handles at its ends, reduced this overhead substantially, and
improved throughput.
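
Schematically, the change is from per-element to per-vector channels between the crypto goroutines and the serial routines; the types below are trimmed-down stand-ins, not the actual device package declarations:

    package chansketch

    // QueueOutboundElement stands in for the device package's element type.
    type QueueOutboundElement struct {
        packet []byte
        nonce  uint64
    }

    // Before: one channel send/receive (and one runtime.chanrecv) per packet.
    type perPacketQueues struct {
        encryption chan *QueueOutboundElement
    }

    // After: one channel send/receive per vector, amortizing channel overhead
    // across the batch the device package already produces and consumes.
    type perVectorQueues struct {
        encryption chan []*QueueOutboundElement
    }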

The iperf3 results below demonstrate the effect of this commit between
two Linux computers with i5-12400 CPUs. There is roughly 13us of
round-trip latency between them.

The first result is with UDP GSO and GRO, and with single element
channels.

Starting Test: protocol: TCP, 1 streams, 131072 byte blocks
[ ID] Interval           Transfer     Bitrate         Retr  Cwnd
[  5]   0.00-10.00  sec  12.3 GBytes  10.6 Gbits/sec  232   3.15 MBytes
- - - - - - - - - - - - - - - - - - - - - - - - -
Test Complete. Summary Results:
[ ID] Interval           Transfer     Bitrate         Retr
[  5]   0.00-10.00  sec  12.3 GBytes  10.6 Gbits/sec  232   sender
[  5]   0.00-10.04  sec  12.3 GBytes  10.6 Gbits/sec        receiver

The second result is with channels updated to pass a slice of
elements.

Starting Test: protocol: TCP, 1 streams, 131072 byte blocks
[ ID] Interval           Transfer     Bitrate         Retr  Cwnd
[  5]   0.00-10.00  sec  13.2 GBytes  11.3 Gbits/sec  182   3.15 MBytes
- - - - - - - - - - - - - - - - - - - - - - - - -
Test Complete. Summary Results:
[ ID] Interval           Transfer     Bitrate         Retr
[  5]   0.00-10.00  sec  13.2 GBytes  11.3 Gbits/sec  182   sender
[  5]   0.00-10.04  sec  13.2 GBytes  11.3 Gbits/sec        receiver

Reviewed-by: Adrian Dewhurst <adrian@tailscale.com>
Signed-off-by: Jordan Whited <jordan@tailscale.com>
$ benchstat old.txt new.txt
goos: linux
goarch: amd64
pkg: golang.zx2c4.com/wireguard/tun
cpu: 12th Gen Intel(R) Core(TM) i5-12400
                 │   old.txt    │               new.txt               │
                 │    sec/op    │   sec/op     vs base                │
Checksum/64-12     10.670n ± 2%   4.769n ± 0%  -55.30% (p=0.000 n=10)
Checksum/128-12    19.665n ± 2%   8.032n ± 0%  -59.16% (p=0.000 n=10)
Checksum/256-12     37.68n ± 1%   16.06n ± 0%  -57.37% (p=0.000 n=10)
Checksum/512-12     76.61n ± 3%   32.13n ± 0%  -58.06% (p=0.000 n=10)
Checksum/1024-12   160.55n ± 4%   64.25n ± 0%  -59.98% (p=0.000 n=10)
Checksum/1500-12   231.05n ± 7%   94.12n ± 0%  -59.26% (p=0.000 n=10)
Checksum/2048-12    309.5n ± 3%   128.5n ± 0%  -58.48% (p=0.000 n=10)
Checksum/4096-12    603.8n ± 4%   257.2n ± 0%  -57.41% (p=0.000 n=10)
Checksum/8192-12   1185.0n ± 3%   515.5n ± 0%  -56.50% (p=0.000 n=10)
Checksum/9000-12   1328.5n ± 5%   564.8n ± 0%  -57.49% (p=0.000 n=10)
Checksum/9001-12   1340.5n ± 3%   564.8n ± 0%  -57.87% (p=0.000 n=10)
geomean             185.3n        77.99n       -57.92%

Reviewed-by: Adrian Dewhurst <adrian@tailscale.com>
Signed-off-by: Jordan Whited <jordan@tailscale.com>
IPv4 header and pseudo header checksums were being computed on every
merge operation. Additionally, virtioNetHdr was being written at the
same time. This delays those operations until after all coalescing has
occurred.

Reviewed-by: Adrian Dewhurst <adrian@tailscale.com>
Signed-off-by: Jordan Whited <jordan@tailscale.com>
Queue{In,Out}boundElement locking can contribute to significant
overhead via sync.Mutex.lockSlow() in some environments. These types
are passed throughout the device package as elements in a slice, so
move the per-element Mutex to a container around the slice.
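
A minimal sketch of the container pattern, simplified from what the real element types carry:

    package locksketch

    import "sync"

    // QueueInboundElement stands in for the real element type.
    type QueueInboundElement struct {
        buffer  []byte
        counter uint64
    }

    // Before: a sync.Mutex embedded in every element meant a lock/unlock pair
    // per packet, and sync.Mutex.lockSlow() showed up in profiles.
    //
    // After: elements already travel as a slice, so a single Mutex on a
    // container around the slice serializes access to the whole batch.
    type QueueInboundElementsContainer struct {
        sync.Mutex
        elems []*QueueInboundElement
    }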

Signed-off-by: Jordan Whited <jordan@tailscale.com>
Signed-off-by: Adrian Dewhurst <adrian@tailscale.com>
This adds AMD64 assembly implementations of IP checksum computation, one
for baseline AMD64 and the other for v3 AMD64 (AVX2 and BMI2).

All performance numbers reported are from a Ryzen 7 4750U but similar
improvements are expected for a wide range of processors.

The generic IP checksum implementation has also been further improved to
be significantly faster using bits.Add64 (for a 64KiB buffer the
throughput improves from 15,000MiB/s to 27,600MiB/s; similar gains are
also reported on ARM64 but I do not have specific numbers).
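
The carry-propagating trick in the generic path looks roughly like this — a simplified sketch, not the actual implementation, which also unrolls the loop and is careful about alignment:

    package csumsketch

    import (
        "encoding/binary"
        "math/bits"
    )

    // onesComplementSum returns the 16-bit one's-complement sum of b,
    // accumulating eight bytes at a time with bits.Add64 and counting the
    // carries so they can be folded back in (2^64 mod (2^16-1) == 1).
    func onesComplementSum(b []byte) uint16 {
        var sum, carries uint64
        for len(b) >= 8 {
            var c uint64
            sum, c = bits.Add64(sum, binary.BigEndian.Uint64(b), 0)
            carries += c
            b = b[8:]
        }
        // Fold to 32 bits before handling the tail so the additions below
        // cannot overflow a uint64.
        sum = (sum >> 32) + (sum & 0xffffffff) + carries
        if len(b) >= 4 {
            sum += uint64(binary.BigEndian.Uint32(b))
            b = b[4:]
        }
        if len(b) >= 2 {
            sum += uint64(binary.BigEndian.Uint16(b))
            b = b[2:]
        }
        if len(b) == 1 {
            sum += uint64(b[0]) << 8 // odd trailing byte pads the low half
        }
        for sum > 0xffff {
            sum = (sum >> 16) + (sum & 0xffff)
        }
        return uint16(sum)
    }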

The baseline AMD64 implementation for a 64KiB buffer reports 32,700MiB/s
and the AVX2 implementation is slightly over 107,000MiB/s.

Unfortunately, for very small sizes (e.g. the expected size for an IPv4
header) setting up SIMD computation involves some overhead that makes
computing a checksum for small buffers slower than a non-SIMD
implementation. Even more unfortunately, testing for this at runtime in
Go and calling a func optimized for small buffers mitigates most of the
improvement due to call overhead. The break even point is around 256
byte buffers; IPv4 headers are no more than 60 bytes including
extensions. IPv6 headers do not have a checksum but are a fixed size of
40 bytes. As a result, the generated assembly code uses an alternate
approach for buffers of less than 256 bytes. Additionally, buffers of
less than 32 bytes need to be handled specially because the strategy for
reading buffers that are not a multiple of 8 bytes fails when the buffer
is too small.

As suggested by additional benchmarking, pseudo header computation has
been rewritten to be faster (benchmark time reduced by 1/2 to 1/4).

Updates tailscale/corp#9755

Signed-off-by: Adrian Dewhurst <adrian@tailscale.com>
Signed-off-by: Brad Fitzpatrick <bradfitz@tailscale.com>
Close closes the events channel, resulting in a panic from send on
closed channel.

Reported-By: Brad Fitzpatrick <brad@tailscale.com>
Link: tailscale/tailscale#9555
Signed-off-by: James Tucker <james@tailscale.com>
Signed-off-by: Jordan Whited <jordan@tailscale.com>
Signed-off-by: Jordan Whited <jordan@tailscale.com>
Access to Peer.endpoint was previously synchronized by Peer.RWMutex.
This has now moved to Peer.endpoint.Mutex. Peer.SendBuffers() is now the
sole caller of Endpoint.ClearSrc(), which is signaled via a new bool,
Peer.endpoint.clearSrcOnTx. Previous callers of Endpoint.ClearSrc() now
set this bool, primarily via peer.markEndpointSrcForClearing().
Peer.SetEndpointFromPacket() clears Peer.endpoint.clearSrcOnTx when an
updated conn.Endpoint is stored. This maintains the same event order as
before, i.e. a conn.Endpoint received after peer.endpoint.clearSrcOnTx
is set, but before the next Peer.SendBuffers() call, still results in the
latest conn.Endpoint source being used for the next packet transmission.
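
In rough outline, the new synchronization around the endpoint looks like the sketch below; the type and method names approximate the description above and are not the actual code:

    package endpointlocksketch

    import "sync"

    // Endpoint stands in for conn.Endpoint.
    type Endpoint interface {
        ClearSrc()
    }

    // peerEndpoint approximates the container that replaces Peer.RWMutex as
    // the synchronizer for the peer's endpoint.
    type peerEndpoint struct {
        sync.Mutex
        val          Endpoint
        clearSrcOnTx bool // set instead of calling ClearSrc() directly
    }

    // markEndpointSrcForClearing replaces direct ClearSrc() calls.
    func (pe *peerEndpoint) markEndpointSrcForClearing() {
        pe.Lock()
        pe.clearSrcOnTx = true
        pe.Unlock()
    }

    // storeFromPacket mirrors SetEndpointFromPacket: store the updated
    // endpoint and drop any pending clear.
    func (pe *peerEndpoint) storeFromPacket(ep Endpoint) {
        pe.Lock()
        pe.val = ep
        pe.clearSrcOnTx = false
        pe.Unlock()
    }

    // prepareForSend mirrors the SendBuffers path, now the sole place where
    // ClearSrc() actually runs.
    func (pe *peerEndpoint) prepareForSend() Endpoint {
        pe.Lock()
        defer pe.Unlock()
        if pe.clearSrcOnTx && pe.val != nil {
            pe.val.ClearSrc()
            pe.clearSrcOnTx = false
        }
        return pe.val
    }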

These changes result in throughput improvements for single flow,
parallel (-P n) flow, and bidirectional (--bidir) flow iperf3 TCP/UDP
tests as measured on both Linux and Windows. Latency under load improves
especially for high throughput Linux scenarios. These improvements are
likely realized on all platforms to some degree, as the changes are not
platform-specific.

Co-authored-by: James Tucker <james@tailscale.com>
Signed-off-by: Jordan Whited <jordan@tailscale.com>
Peer.RoutineSequentialReceiver() deals with packet vectors and does not
need to perform timer and endpoint operations for every packet in a
given vector. Changing these per-packet operations to per-vector
improves throughput by as much as 10% in some environments.

Signed-off-by: Jordan Whited <jordan@tailscale.com>
Certain device drivers (e.g. vxlan, geneve) do not properly handle
coalesced UDP packets later in the stack, resulting in packet loss.

Signed-off-by: Jordan Whited <jordan@tailscale.com>
The sync.Locker used with a sync.Cond must be acquired when changing
the associated condition, otherwise there is a window within
sync.Cond.Wait() where a wake-up may be missed.
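
The general sync.Cond rule being applied, in schematic form (this is not the actual WaitPool code):

    package condsketch

    import "sync"

    type pool struct {
        mu    sync.Mutex
        cond  *sync.Cond
        count int
        max   int
    }

    func newPool(max int) *pool {
        p := &pool{max: max}
        p.cond = sync.NewCond(&p.mu)
        return p
    }

    // get blocks while the pool is at capacity.
    func (p *pool) get() {
        p.mu.Lock()
        for p.count >= p.max {
            p.cond.Wait() // atomically releases mu and re-acquires it on wake-up
        }
        p.count++
        p.mu.Unlock()
    }

    // put changes the condition (count) while holding the same Locker passed
    // to sync.NewCond. Decrementing count without holding mu can race with
    // the condition check in get: the waiter sees the old count, the signal
    // fires before it reaches cond.Wait(), and the wake-up is lost.
    func (p *pool) put() {
        p.mu.Lock()
        p.count--
        p.cond.Signal()
        p.mu.Unlock()
    }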

Fixes: 4846070 ("device: use a waiting sync.Pool instead of a channel")
Signed-off-by: Jordan Whited <jordan@tailscale.com>
…itPool

Fixes: 3bb8fec ("conn, device, tun: implement vectorized I/O plumbing")
Signed-off-by: Jordan Whited <jordan@tailscale.com>
Introduce an optional extension point for Endpoint that enables a path
for WireGuard to inform an integration about the peer public key that is
associated with an Endpoint.

The API is expected to return either the same or a new Endpoint in
response to this function. A future version of this patch could
potentially remove the returned Endpoint, but would require larger
integrator changes downstream.

This adds a small per-packet cost that could later be removed with a
larger refactor of the wireguard-go interface and Tailscale magicsock
code, as well as introducing a generic bound for Endpoint in a device &
bind instance.
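
A hypothetical shape for such an extension point, purely to illustrate the idea; the interface and method names here are invented, not the actual API:

    package peerawaresketch

    // NoisePublicKey and Endpoint stand in for the device and conn types.
    type NoisePublicKey [32]byte

    type Endpoint interface {
        DstToString() string
    }

    // peerAwareEndpoint is an optional interface an Endpoint may implement to
    // learn which peer public key it became associated with; it may return
    // either the same or a replacement Endpoint.
    type peerAwareEndpoint interface {
        Endpoint
        FromPeer(publicKey NoisePublicKey) Endpoint
    }

    // informEndpoint is the per-packet hook; the type assertion is the small
    // per-packet cost mentioned above.
    func informEndpoint(ep Endpoint, pk NoisePublicKey) Endpoint {
        if pae, ok := ep.(peerAwareEndpoint); ok {
            if replacement := pae.FromPeer(pk); replacement != nil {
                return replacement
            }
        }
        return ep
    }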

Updates tailscale/corp#20732
External implementers of tun.Device may support GSO, and may also be
platform-agnostic, e.g. gVisor.

Signed-off-by: Jordan Whited <jordan@tailscale.com>
External implementers of tun.Device may support GRO, requiring checksum
offload.

Signed-off-by: Jordan Whited <jordan@tailscale.com>
When generating page-aligned random bytes, the random data started at the
beginning of the buffer, in the portion that gets chopped off for
alignment. When the page size differs, the start of the returned slice no
longer matches what the expected checksums assume, causing the tests to
fail.
torvalds/linux@e269d79 broke virtio_net TCP & UDP GRO, causing GRO writes
to return EINVAL. The bug was later resolved in torvalds/linux@89add40.
The offending commit was pulled into various LTS releases.

Updates tailscale/tailscale#13041

Signed-off-by: Jordan Whited <jordan@tailscale.com>
The manual struct packing was suspect:
tailscale/tailscale#11899

And there's no need to do it manually when an API for it already exists.

Updates tailscale/tailscale#11899

Signed-off-by: Brad Fitzpatrick <bradfitz@tailscale.com>
Reviewed-by: James Tucker <james@tailscale.com>
Upstream doesn't use GitHub Actions for CI, as GitHub is simply a mirror.
Our workflows involve GitHub, so establish some basic CI jobs.

Updates tailscale/corp#28877

Signed-off-by: Jordan Whited <jordan@tailscale.com>
Only bother updating the rxBytes counter once we've processed a whole
vector, since additions are atomic.
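
In sketch form, with illustrative field and method names:

    package rxsketch

    import "sync/atomic"

    type peer struct {
        rxBytes atomic.Uint64
    }

    // Before: one atomic add per packet in the vector.
    func (p *peer) countPerPacket(packets [][]byte) {
        for _, pkt := range packets {
            p.rxBytes.Add(uint64(len(pkt)))
        }
    }

    // After: accumulate locally and issue a single atomic add per vector; the
    // addition is atomic either way, so readers see the same totals with far
    // fewer contended atomic operations.
    func (p *peer) countPerVector(packets [][]byte) {
        var n uint64
        for _, pkt := range packets {
            n += uint64(len(pkt))
        }
        p.rxBytes.Add(n)
    }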

cherry picked from commit WireGuard/wireguard-go@542e565

Updates tailscale/corp#28879

Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
basovnik and others added 9 commits May 29, 2025 10:43
There is a possible deadlock in `device.Close()` when you try to close
the device very soon after its start. The problem is that two different
methods acquire the same locks in different order:

1. device.Close()
 - device.ipcMutex.Lock()
 - device.state.Lock()

2. device.changeState(deviceState)
 - device.state.Lock()
 - device.ipcMutex.Lock()

Reproducer:

    func TestDevice_deadlock(t *testing.T) {
    	d := randDevice(t)
    	d.Close()
    }

Problem:

    $ go clean -testcache && go test -race -timeout 3s -run TestDevice_deadlock ./device | grep -A 10 sync.runtime_SemacquireMutex
    sync.runtime_SemacquireMutex(0xc000117d20?, 0x94?, 0x0?)
            /usr/local/opt/go/libexec/src/runtime/sema.go:77 +0x25
    sync.(*Mutex).lockSlow(0xc000130518)
            /usr/local/opt/go/libexec/src/sync/mutex.go:171 +0x213
    sync.(*Mutex).Lock(0xc000130518)
            /usr/local/opt/go/libexec/src/sync/mutex.go:90 +0x55
    golang.zx2c4.com/wireguard/device.(*Device).Close(0xc000130500)
            /Users/martin.basovnik/git/basovnik/wireguard-go/device/device.go:373 +0xb6
    golang.zx2c4.com/wireguard/device.TestDevice_deadlock(0x0?)
            /Users/martin.basovnik/git/basovnik/wireguard-go/device/device_test.go:480 +0x2c
    testing.tRunner(0xc00014c000, 0x131d7b0)
    --
    sync.runtime_SemacquireMutex(0xc000130564?, 0x60?, 0xc000130548?)
            /usr/local/opt/go/libexec/src/runtime/sema.go:77 +0x25
    sync.(*Mutex).lockSlow(0xc000130750)
            /usr/local/opt/go/libexec/src/sync/mutex.go:171 +0x213
    sync.(*Mutex).Lock(0xc000130750)
            /usr/local/opt/go/libexec/src/sync/mutex.go:90 +0x55
    sync.(*RWMutex).Lock(0xc000130750)
            /usr/local/opt/go/libexec/src/sync/rwmutex.go:147 +0x45
    golang.zx2c4.com/wireguard/device.(*Device).upLocked(0xc000130500)
            /Users/martin.basovnik/git/basovnik/wireguard-go/device/device.go:179 +0x72
    golang.zx2c4.com/wireguard/device.(*Device).changeState(0xc000130500, 0x1)

cherry picked from commit WireGuard/wireguard-go@12269c2

Updates tailscale/corp#28879

Signed-off-by: Martin Basovnik <martin.basovnik@gmail.com>
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
Reduce allocations by eliminating the byte reader in favour of
hand-rolled decoding, and by reusing message structs.
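
Roughly what "hand-rolled decoding" means here, shown with a toy fixed-size message rather than the real MessageInitiation layout:

    package msgsketch

    import (
        "encoding/binary"
        "errors"
    )

    // toyMessage is a stand-in for a fixed-size wire message; the fields and
    // size are illustrative only.
    type toyMessage struct {
        Type      uint32
        Sender    uint32
        Ephemeral [32]byte
    }

    const toyMessageSize = 4 + 4 + 32

    var errMessageLength = errors.New("unexpected message length")

    // unmarshal decodes in place: no reflection, no intermediate reader, no
    // allocations, which is where the savings over binary.Read come from.
    func (m *toyMessage) unmarshal(b []byte) error {
        if len(b) != toyMessageSize {
            return errMessageLength
        }
        m.Type = binary.LittleEndian.Uint32(b[0:4])
        m.Sender = binary.LittleEndian.Uint32(b[4:8])
        copy(m.Ephemeral[:], b[8:40])
        return nil
    }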

Synthetic benchmark:

    var msgSink MessageInitiation
    func BenchmarkMessageInitiationUnmarshal(b *testing.B) {
        packet := make([]byte, MessageInitiationSize)
        reader := bytes.NewReader(packet)
        err := binary.Read(reader, binary.LittleEndian, &msgSink)
        if err != nil {
            b.Fatal(err)
        }
        b.Run("binary.Read", func(b *testing.B) {
            b.ReportAllocs()
            for range b.N {
                reader := bytes.NewReader(packet)
                _ = binary.Read(reader, binary.LittleEndian, &msgSink)
            }
        })
        b.Run("unmarshal", func(b *testing.B) {
            b.ReportAllocs()
            for range b.N {
                _ = msgSink.unmarshal(packet)
            }
        })
    }

Results:
                                         │      -      │
                                         │   sec/op    │
MessageInitiationUnmarshal/binary.Read-8   1.508µ ± 2%
MessageInitiationUnmarshal/unmarshal-8     12.66n ± 2%

                                         │      -       │
                                         │     B/op     │
MessageInitiationUnmarshal/binary.Read-8   208.0 ± 0%
MessageInitiationUnmarshal/unmarshal-8     0.000 ± 0%

                                         │      -       │
                                         │  allocs/op   │
MessageInitiationUnmarshal/binary.Read-8   2.000 ± 0%
MessageInitiationUnmarshal/unmarshal-8     0.000 ± 0%

cherry picked from commit WireGuard/wireguard-go@9e7529c

Updates tailscale/corp#28879

Signed-off-by: Alexander Yastrebov <yastrebov.alex@gmail.com>
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
This is already enforced in receive.go, but if these unmarshallers are
to have error return values anyway, make them as explicit as possible.

cherry picked from commit WireGuard/wireguard-go@842888a

Updates tailscale/corp#28879

Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
Optimize message encoding by eliminating binary.Write (which internally
uses reflection) in favour of hand-rolled encoding.

This is a companion to 9e7529c.

Synthetic benchmark:

    var packetSink []byte
    func BenchmarkMessageInitiationMarshal(b *testing.B) {
        var msg MessageInitiation
        b.Run("binary.Write", func(b *testing.B) {
            b.ReportAllocs()
            for range b.N {
                var buf [MessageInitiationSize]byte
                writer := bytes.NewBuffer(buf[:0])
                _ = binary.Write(writer, binary.LittleEndian, msg)
                packetSink = writer.Bytes()
            }
        })
        b.Run("binary.Encode", func(b *testing.B) {
            b.ReportAllocs()
            for range b.N {
                packet := make([]byte, MessageInitiationSize)
                _, _ = binary.Encode(packet, binary.LittleEndian, msg)
                packetSink = packet
            }
        })
        b.Run("marshal", func(b *testing.B) {
            b.ReportAllocs()
            for range b.N {
                packet := make([]byte, MessageInitiationSize)
                _ = msg.marshal(packet)
                packetSink = packet
            }
        })
    }

Results:
                                             │      -      │
                                             │   sec/op    │
    MessageInitiationMarshal/binary.Write-8    1.337µ ± 0%
    MessageInitiationMarshal/binary.Encode-8   1.242µ ± 0%
    MessageInitiationMarshal/marshal-8         53.05n ± 1%

                                             │     -      │
                                             │    B/op    │
    MessageInitiationMarshal/binary.Write-8    368.0 ± 0%
    MessageInitiationMarshal/binary.Encode-8   160.0 ± 0%
    MessageInitiationMarshal/marshal-8         160.0 ± 0%

                                             │     -      │
                                             │ allocs/op  │
    MessageInitiationMarshal/binary.Write-8    3.000 ± 0%
    MessageInitiationMarshal/binary.Encode-8   1.000 ± 0%
    MessageInitiationMarshal/marshal-8         1.000 ± 0%

cherry picked from commit WireGuard/wireguard-go@264889f

Updates tailscale/corp#28879

Signed-off-by: Alexander Yastrebov <yastrebov.alex@gmail.com>
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
This enables a conn.Bind to bring its own encapsulating transport, e.g.
VXLAN/Geneve.

Updates tailscale/corp#27502

Signed-off-by: Jordan Whited <jordan@tailscale.com>
It was previously suppressed if roaming was disabled for the peer.
Tailscale always disables roaming, as we explicitly configure
conn.Endpoints for all peers.

This commit also modifies PeerAwareEndpoint usage such that wireguard-go
never uses/sets it as a Peer Endpoint value. In theory we (Tailscale)
always disable roaming, so we should always return early from
SetEndpointFromPacket(), but this acts as an extra footgun guard and
improves clarity around intended usage.

Updates tailscale/corp#27502
Updates tailscale/corp#29422
Updates tailscale/corp#30042

Signed-off-by: Jordan Whited <jordan@tailscale.com>
To be implemented by [magicsock.lazyEndpoint], which is responsible for
triggering JIT peer configuration.

Updates tailscale/corp#20732
Updates tailscale/corp#30042

Signed-off-by: Jordan Whited <jordan@tailscale.com>
Updates tailscale/corp#30364

Signed-off-by: Jordan Whited <jordan@tailscale.com>
Peer.SetEndpointFromPacket is not called per-packet. It is
guaranteed to be called at least once per packet batch.

Updates tailscale/corp#30042
Updates tailscale/corp#20732

Signed-off-by: Jordan Whited <jordan@tailscale.com>