ipv6: Support new TSO without HBH #1329

gentoo-root · 2025-06-17T16:03:25Z

Currently, BIG TCP IPv6 inserts a hop-by-hop extension header with a jumbo payload option to reflect the real length of the packet bigger than 65535 bytes. New kernels will drop this extension header and just calculate the packet length from skb->len, like it's currently done for BIG TCP IPv4.

Reflect the future kernel change in tcpdump and support parsing such packets.

Kernel ref: https://lore.kernel.org/netdev/20250617144017.82931-1-maxim@isovalent.com/

Signed-off-by: Maxim Mikityanskiy <maxim@isovalent.com>

borkmann · 2025-06-25T02:59:51Z

Cc @fxlb if you have a chance to review, that would be awesome. tl;dr: from the kernel side we plan to align big tcp ipv6 to the way big tcp ipv4 is handled today.

borkmann · 2025-08-12T14:16:02Z

Hi @guyharris do you have a chance to take a look? Thanks so much

borkmann · 2025-09-02T07:27:04Z

Cc @infrastation do you have a chance to take a look please? Thx

borkmann · 2025-09-16T17:36:50Z

ping, anyone? Maybe @fenner?
fwiw, https://www.tcpdump.org/index.html#source says opening a PR is the right approach here for contributors, pls let us know if this is not the case.

infrastation · 2025-09-29T17:36:22Z

Thank you for waiting. Yes, this is the correct approach, but the maintainers are usually limited in the amount of time they can spend on the project, so the queues progress slowly. I currently have time to contribute, but recently my focus has been on infrastructure, testing, code clean-ups and documentation rather than new features. In other words, one of my current priorities is to reduce the technical debt, and changes that have the potential to increase the technical debt understandably do not receive as much attention.

Anyway, I had a look. This change touches only a few lines of code, but I cannot validate it yet. As far as I could tell, a formal specification of BIG TCP does not seem to exist (please correct me if this is not the case). Could you clarify which effects of BIG TCP concern packets on the wire as captured by a transit host, and which concern packets captured by a host that terminates the connection? What would BIG TCP encoding mean if it appears in a packet on a host that does not implement BIG TCP? Please keep in mind ip6_print() executes on non-Linux hosts as well.

borkmann · 2025-09-29T19:38:52Z

Thanks for taking a look, very much appreciated @infrastation! I'll answer the questions below in the meantime, let me know if you have any follow-up questions, happy to help!

As far as I could tell, a formal specification of BIG TCP does not seem to exist (please correct me if this is not the case).

There is no particular spec or RFC in this case, given that the effects of enabling BIG TCP do not leave the node. BIG TCP is basically just more aggressive node-local GRO/GSO batching on Linux. Normally, a TCP data stream is batched into super-packets of up to 64k, which travel up or down the stack. On receive, the GRO engine of Linux aggregates the packets in software. On transmit, the stack sends 64k-sized super-packets down to the NIC, which segments them in hardware (TSO); if there is no hardware support, the stack can do the segmentation in software (GSO).
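To make the batching arithmetic concrete, here is a minimal sketch of how many wire-sized segments one super-packet is cut into on transmit. The function name `segment_count` and the MSS value in the usage note are assumptions for illustration, not kernel identifiers or settings.

```c
#include <stdint.h>

/*
 * Hypothetical helper: number of wire-sized TCP segments that a
 * super-packet carrying payload_bytes of TCP data is split into when
 * each segment carries at most mss bytes (ceiling division).
 * Returns 0 for mss == 0 to avoid dividing by zero.
 */
static uint32_t
segment_count(uint32_t payload_bytes, uint32_t mss)
{
	if (mss == 0)
		return 0;
	return payload_bytes / mss + (payload_bytes % mss != 0);
}
```

With an assumed MSS of 1448 bytes, a super-packet with 65160 bytes of payload is split into 45 segments, and one more byte of payload adds a 46th.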

BIG TCP was merged into the Linux kernel some time ago through these two patch series:

It's an opt-in that a user configures; when enabled, the GRO/GSO batching will try to aggregate beyond the 64k limit. This limit is basically due to the length encoding in the IP header (a 16-bit field). To go beyond it, the current convention for IPv6 is to insert a hop-by-hop extension header when the packet goes up or down the stack. For IPv4, the solution was to encode a length of 0 into the packet header and use skb->len (which is 32 bit) as the concrete length. Note that none of this leaves the host onto the wire. It does not require MTU changes either; it's just related to host-local batching.
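The three ways of conveying the length described above can be sketched as follows for the IPv6 case. This is an illustration only: `resolve_payload_len`, `struct resolved_len` and the enum values are hypothetical names, not kernel or tcpdump identifiers, and `buffer_len` merely stands in for a length derived from skb->len.

```c
#include <stdint.h>

/* Hypothetical names; they exist neither in the kernel nor in tcpdump. */
enum len_source {
	LEN_FROM_HEADER,	/* ordinary packet: 16-bit header field is used */
	LEN_FROM_JUMBO,		/* hop-by-hop jumbo payload option is present */
	LEN_FROM_BUFFER		/* header field is 0, no option: use buffer size */
};

struct resolved_len {
	uint32_t len;
	enum len_source source;
};

/*
 * header_len: 16-bit payload length field of the IPv6 header
 * jumbo_len:  value of the jumbo payload option, 0 if the option is absent
 * buffer_len: bytes following the fixed IPv6 header in the buffer
 */
static struct resolved_len
resolve_payload_len(uint16_t header_len, uint32_t jumbo_len, uint32_t buffer_len)
{
	struct resolved_len r;

	if (header_len != 0) {
		r.len = header_len;
		r.source = LEN_FROM_HEADER;
	} else if (jumbo_len != 0) {
		r.len = jumbo_len;
		r.source = LEN_FROM_JUMBO;
	} else {
		r.len = buffer_len;
		r.source = LEN_FROM_BUFFER;
	}
	return r;
}
```

The last branch is the one that cannot distinguish an aggregated packet from a malformed packet whose length field happens to be 0.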

There has been a desire in the Linux networking community to make the IPv6 implementation work like the BIG TCP IPv4 one, that is, to encode a length of 0 into the packet header and use skb->len as the length indicator. This also opens up the possibility of BIG TCP batching for vxlan/geneve tunnels, which is under review in the kernel community.

This small PR basically makes it so that tcpdump can print the correct packet size in such a case. Non-BIGTCP traffic is not affected.

Could you clarify which effects of BIG TCP concern packets on the wire as captured by a transit host, and which concern packets captured by a host that terminates the connection?

These aggregated packets never go onto the wire. So this is all just local on the host when packets traverse up or down the stack.

What would BIG TCP encoding mean if it appears in a packet on a host that does not implement BIG TCP? Please keep in mind ip6_print() executes on non-Linux hosts as well.

When it is not enabled, or for non-TCP or TCP control packets, the code works fine as is, since the length in the packet header is correct in this case. Please note that tcpdump can handle all this for IPv4 already; this PR just fixes the IPv6 printer/dissector side to display the equivalent "presumed TSO" length.

The IPv4 equivalent in tcpdump's print-ip.c is:

	if (len == 0) {
		/* we guess that it is a TSO send */
		len = length;
		presumed_tso = 1;
	} else

[...]
	if (ndo->ndo_vflag) {
[...]
		if (presumed_tso)
			ND_PRINT(", length %u [was 0, presumed TSO]", length);
		else
			ND_PRINT(", length %u", GET_BE_U_2(ip->ip_len));
[...]
	}
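For comparison, a minimal sketch of what an IPv6 counterpart of the IPv4 logic above could look like. This is not the code from this PR: `ip6_effective_len` and its parameters are hypothetical, and the annotation text is modeled on the `[real length ..., presumed TSO]` string visible in the test output quoted elsewhere in this thread. How `available` relates to the fixed IPv6 header is deliberately left open.

```c
#include <stdint.h>
#include <stdio.h>

/*
 * Hypothetical helper, not part of tcpdump.
 * payload_len: 16-bit payload length field from the IPv6 header
 * available:   byte count the caller believes belongs to this packet
 * note:        receives an annotation, or "" when the header field is used
 * Returns the length the printer would go on to use.
 */
static uint32_t
ip6_effective_len(uint16_t payload_len, uint32_t available,
    char *note, size_t note_size)
{
	if (payload_len != 0) {
		/* Header field is authoritative; nothing to annotate. */
		if (note_size > 0)
			note[0] = '\0';
		return payload_len;
	}
	/* Field is 0: guess that this is an aggregated (TSO) packet. */
	snprintf(note, note_size, "[real length %u, presumed TSO]",
	    (unsigned)available);
	return available;
}
```

Like the IPv4 code, this sketch applies no extra validation, so any packet with a zero length field is reported as presumed TSO.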

fxlb · 2025-09-29T20:37:05Z

tests/icmpv6-length-zero.out

@@ -1 +1 @@
-    1  2010-05-20 04:24:49.656077 IP6 fe80::25a:28ff:fe08:f150 > 6e02::41: ICMP6, length 0 (invalid)
+    1  2010-05-20 04:24:49.656077 IP6 fe80::25a:28ff:fe08:f150 > 6e02::41: ICMP6, neighbor advertisement, tgt is fe80:0:aa:aaaa:aaaa:aaaa:aaaa:aaaa, length 32


Not a big TCP here.

infrastation · 2025-09-29T20:38:32Z

Thank you for explaining. Please note the "none of this leaves the host onto the wire" remark no longer holds as soon as the packet is captured and saved into a file. In that case the file can be decoded on a different host with a different OS, or even blindly replayed to the network. Also, packets can be generated, and the length argument of ip6_print() can come from a network protocol that encapsulates IPv6; GRE would be one example. The decoder must handle all possible input safely, not just the happy path (this is not a problem of Linux kernel developers, but still). Hence my attempts to understand it completely before merging.

@fxlb, would it make sense to extend the logic in your commit 3465ec4 onto IPv6?

fxlb · 2025-09-30T07:55:09Z

Should the change only process "packet bigger than 65535 bytes"?
Should the change check it is a TCP packet?

fxlb · 2025-09-30T08:18:09Z

Thinking back to the IPv4 code, it should probably be:

        if (len == 0) {
                uint8_t nh = GET_U_1(ip->ip_p);

                if (nh == IPPROTO_TCP) {
                        /* we guess that it is a TSO send */
                        len = length;
                        presumed_tso = 1;
                }
        } else

To better show invalid non-TCP packets of length 0.

borkmann · 2025-09-30T08:41:13Z

Thinking back to the IPv4 code, it should probably be:

        if (len == 0) {
                uint8_t nh = GET_U_1(ip->ip_p);

                if (nh == IPPROTO_TCP) {
                        /* we guess that it is a TSO send */
                        len = length;
                        presumed_tso = 1;
                }
        } else

To better show invalid non-TCP packets of length 0.

That would break the goal of having BIG TCP for encapsulation (vxlan, geneve). The IPv4 side is fine as it is. This is really only about making the IPv6 side behave like the IPv4 dissector, which prints the real length when it presumes TSO.

gentoo-root · 2025-09-30T08:41:24Z

Thank you for your comments and having a look!

Addressing the questions about distinguishing BIG TCP packets from invalid packets with zero length:

1. My change for IPv6 basically mimics what we already have for IPv4. Currently, we don't have any extra checks in the IPv4 path that validate that the real length is > 64k or that look at the transport protocol.
2. An explicit check for TCP would falsely mark tunneled BIG TCP packets as invalid. E.g., my kernel patchset enables BIG TCP for VXLAN tunnels, in which case we'll see the outer IPv6 header with payload_len = 0, but nexthdr will be UDP, not TCP.
3. I agree that an invalid ICMPv6 packet shouldn't be treated as BIG TCP. I believe, however, that we currently have the same behavior for ICMP/IPv4; we just don't have a test for such a packet.
4. Given 2 and 3, I'm not sure which way would work best: check that nexthdr is either TCP or UDP? Check that nexthdr is not ICMP or some other known protocol? Keep it as is?
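The candidate checks from the list above can be sketched as three predicates. The function names are hypothetical and none of this is code from the PR; the `NH_*` constants are the IANA protocol numbers.

```c
#include <stdbool.h>
#include <stdint.h>

/* IANA protocol numbers used as Next Header values. */
#define NH_TCP		6
#define NH_UDP		17
#define NH_ICMPV6	58
#define NH_PIM		103

/* "Keep it as is": any zero length field is presumed TSO. */
static bool
presume_tso_any(uint16_t len_field, uint8_t nexthdr)
{
	(void)nexthdr;	/* deliberately ignored */
	return len_field == 0;
}

/* TCP only: misses VXLAN/Geneve, whose outer header carries UDP. */
static bool
presume_tso_tcp_only(uint16_t len_field, uint8_t nexthdr)
{
	return len_field == 0 && nexthdr == NH_TCP;
}

/* TCP or UDP: covers plain and tunneled cases, rejects ICMPv6 and PIM. */
static bool
presume_tso_tcp_or_udp(uint16_t len_field, uint8_t nexthdr)
{
	return len_field == 0 && (nexthdr == NH_TCP || nexthdr == NH_UDP);
}
```

Only the third variant both accepts an outer UDP header and leaves zero-length ICMPv6 or PIM packets to be reported as invalid, but it would still accept any malformed UDP packet with a zero length field.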

Please note the "none of this leaves the host onto the wire" remark no longer holds as soon as the packet is captured and saved into a file. In that case the file can be decoded on a different host with a different OS, or even blindly replayed to the network.

Indeed, one can capture a BIG TCP packet on a Linux host into a pcap file, then try to replay it on a non-Linux machine. However, it can be done regardless of my change, because it just sends the bytes stored in the file. My change only affects the text representation of the packet when it's printed to stdout.

fxlb · 2025-09-30T08:46:18Z

tests/pim-packet-assortment-v.out

  178  2019-07-05 17:27:40.810753 IP6 (hlim 64, next-header PIM (103), payload length 48) 10::2 > 10::1: PIMv2, length 48
 	Register, cksum 0xcc3c (correct), Flags [ Null ]
-	IP6 (class 0xc0, hlim 1, next-header PIM (103), payload length 0) 1::2 > ff02::1:  [|pim]
+	IP6 (class 0xc0, hlim 1, next-header PIM (103), payload length 0) 1::2 > ff02::1: [real length 40, presumed TSO]  [|pim]


Same: Not a big TCP here.

infrastation · 2025-09-30T10:53:12Z

Would it be correct to say that you are on the way to generalize BIG TCP into BIG IP, and so far this mechanism covers TCP, VXLAN and GENEVE only?

borkmann · 2025-09-30T11:16:21Z

Would it be correct to say that you are on the way to generalize BIG TCP into BIG IP, and so far this mechanism covers TCP, VXLAN and GENEVE only?

Not quite, the mechanism here is still mainly for TCP flows. Today, the Linux kernel networking stack also does GRO/GSO/TSO for TCP traffic over vxlan/geneve tunnels (TSO in this case only on most modern 100G NICs). The aggregation works similarly to what is described above (aggregating into super-packets of at most 64k). The kernel patch series we'd like to merge lets TCP traffic over tunnels benefit from the same BIG TCP optimization (aggregating beyond 64k) that already exists for non-tunneled traffic. So in that sense it's not a generalization into BIG IP; it's still targeted at TCP, thus BIG TCP (just that in this case the outer header is UDP and the inner one is TCP). So the "presumed TSO" case in the packet printer/dissector needs to be handled for TCP and UDP. With this PR as-is, the handling is similar to what we already have in IPv4 and would cover the needed scenarios.

gentoo-root · 2025-09-30T12:30:24Z

Given 2 and 3, I'm not sure which way would work the best: check that nexthdr is either TCP or UDP? check that nexthdr is not ICMP or some other known protocols? keep it as is?

One more idea: check that len == 0 && length > 65535. This will be enough to prevent false positives on ICMP packets while still handling BIG TCP packets correctly, regardless of whether it's plain TCP or TCP inside VXLAN. It also allows filtering out bad packets that have len = 0 but are not BIG TCP (smaller than 64k).
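A minimal sketch of that size-based check, with hypothetical names (`presume_big_tcp`, `IP_LEN_FIELD_MAX`) that are not taken from tcpdump:

```c
#include <stdbool.h>
#include <stdint.h>

/* Largest value a 16-bit IP length field can carry. */
#define IP_LEN_FIELD_MAX 65535u

/*
 * Hypothetical predicate: treat a packet as BIG TCP only when the
 * header length field is 0 and the real length could not have been
 * encoded in 16 bits anyway.
 */
static bool
presume_big_tcp(uint16_t len_field, uint32_t real_len)
{
	return len_field == 0 && real_len > IP_LEN_FIELD_MAX;
}
```

Under this rule a zero length field combined with a real length of 32 bytes, or of about 2k, is not presumed to be BIG TCP, because both lengths would have fit in the header field.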

fxlb · 2025-10-01T07:38:58Z

Do you have pcap files containing:
IPv4/(BIG)TCP
IPv4/UDP/Geneve/.../(BIG)TCP
IPv4/UDP/VXLAN/.../(BIG)TCP

IPv6/HBH_jumbo/(BIG)TCP
IPv6/(BIG)TCP
IPv6/UDP/Geneve/.../(BIG)TCP
IPv6/UDP/VXLAN/.../(BIG)TCP

gentoo-root · 2025-10-01T09:38:58Z

Do you have pcap files containing

I'll collect them and add them as tests.

One more idea: check that len == 0 && length > 65535.

This doesn't seem to work well. While it successfully filters out bad ICMP packets, we have one test in which ip_len = 0 but the entire length is 2k, way below 64k. I'm kind of confused about the origins of this capture, because in all my tests TSO fills in ip_len properly if the packet is smaller than 64k (i.e. not BIG TCP).

fxlb · 2025-10-01T14:28:43Z

I'm kind of confused about the origins of this capture,

Origin explained here: 3465ec4.
I don't know where this was captured.
