ipv6: Support new TSO without HBH #1329
Currently, BIG TCP IPv6 inserts a hop-by-hop extension header with a jumbo payload option to carry the real length of packets larger than 65535 bytes. New kernels will drop this extension header and just calculate the packet length from skb->len, as is currently done for BIG TCP IPv4. Reflect the future kernel change in tcpdump and support parsing such packets.
Signed-off-by: Maxim Mikityanskiy <maxim@isovalent.com>
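For readers unfamiliar with the header being dropped, here is a minimal sketch of how the jumbo payload option can be read out of a hop-by-hop header. It follows the RFC 2675 layout (option type 0xC2, option data length 4, 32-bit length in network byte order). The function name `jumbo_payload_length` and the sample bytes are made up for illustration; this is not tcpdump's or the kernel's code.

```python
import struct

JUMBO_OPT_TYPE = 0xC2  # Jumbo Payload option type (RFC 2675)
PAD1_OPT_TYPE = 0x00   # Pad1 is a single byte with no length field


def jumbo_payload_length(hbh: bytes):
    """Return the 32-bit jumbo payload length from a hop-by-hop header, or None.

    `hbh` starts at the hop-by-hop header: next header (1 byte),
    header extension length (1 byte, in 8-byte units minus one), then options.
    """
    if len(hbh) < 8:
        return None
    total = (hbh[1] + 1) * 8  # full header size in bytes
    if len(hbh) < total:
        return None
    i = 2
    while i < total:
        opt_type = hbh[i]
        if opt_type == PAD1_OPT_TYPE:
            i += 1
            continue
        if i + 2 > total:
            return None  # truncated option header
        opt_len = hbh[i + 1]
        if i + 2 + opt_len > total:
            return None  # option data runs past the header
        if opt_type == JUMBO_OPT_TYPE and opt_len == 4:
            return struct.unpack_from("!I", hbh, i + 2)[0]
        i += 2 + opt_len
    return None


# Hypothetical sample: next header 6 (TCP), 8-byte header, jumbo length 100000.
example = bytes([6, 0, JUMBO_OPT_TYPE, 4]) + struct.pack("!I", 100000)
print(jumbo_payload_length(example))
```

With the planned kernel change this header is simply absent, so a parser like the one above finds nothing and the length has to come from elsewhere.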
Cc @fxlb if you have a chance to review, that would be awesome. tl;dr: from the kernel side we plan to align BIG TCP IPv6 with the way BIG TCP IPv4 is handled today.
Hi @guyharris do you have a chance to take a look? Thanks so much
Cc @infrastation do you have a chance to take a look please? Thx
ping, anyone? Maybe @fenner?
Thank you for waiting. Yes, this is the correct approach, but the maintainers are usually limited in the amount of time they can spend on the project, so the queues progress slowly. I currently have time to contribute, but recently my focus has been on infrastructure, testing, code clean-ups and documentation rather than new features. In other words, one of my current priorities is to reduce the technical debt, and changes that have the potential to increase the technical debt understandably do not receive as much attention. Anyway, I had a look. This change touches only a few lines of code, but I cannot validate it yet. As far as I could tell, a formal specification of BIG TCP does not seem to exist (please correct me if this is not the case). Could you clarify which effects of BIG TCP concern packets on the wire as captured by a transit host, and which concern packets captured by a host that terminates the connection? What would BIG TCP encoding mean if it appears in a packet on a host that does not implement BIG TCP? Please keep in mind
Thanks for taking a look, very much appreciated @infrastation! I'll answer the questions below in the meantime, let me know if you have any follow-up questions, happy to help!
There is no particular spec or RFC in this case, given that BIG TCP packets do not leave the node. BIG TCP is basically just more aggressive node-local GRO/GSO batching on Linux. Normally, a TCP data stream can batch up to 64k-sized super-packets which travel up or down the stack. On receive, the GRO engine of Linux aggregates/batches the packets in software. On transmit, the stack sends 64k-sized super-packets down to the NIC, which segments them in hardware (TSO), or if there is no hardware support, the stack can do this in software as well (aka GSO). BIG TCP was merged into the Linux kernel some time ago through these two patch series:
It's an opt-in that a user configures, and when enabled, the GRO/GSO batching will try to aggregate beyond the 64k limit. This limit is basically due to the IP header length encoding (a 16-bit field). To go beyond that, the current convention for IPv6 is to insert a hop-by-hop extension header when going up or down the stack. For IPv4, the solution was to encode a length of 0 into the packet header and use skb->len (which is 32 bits) as the actual length. Note that none of this leaves the host onto the wire. Neither does it require MTU changes; it's just related to host-local batching. There has been a desire in the Linux networking community to change the IPv6 implementation to match the BIG TCP IPv4 one, meaning: encode a length of 0 into the packet header and use skb->len as the length indicator. This also opens up the possibility of BIG TCP batching for vxlan/geneve tunnels, which is under review in the kernel community. This small PR basically makes tcpdump print the correct packet size in such a case. Non-BIG TCP traffic is not affected.
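The length conventions described above can be summarised in a small sketch. It is a simplified model only, not the PR's code: the function name `effective_payload_length`, its parameters and the returned labels are hypothetical, with `available_len` standing in for whatever length the capture reports for the data after the fixed IPv6 header (the role skb->len plays in the kernel).

```python
def effective_payload_length(payload_len_field, available_len, jumbo_len=None):
    """Pick the length to report for an IPv6 packet (illustrative model).

    payload_len_field: the 16-bit payload length from the IPv6 header.
    available_len:     assumed length of the data following the IPv6 header,
                       as reported by the capture.
    jumbo_len:         value of a hop-by-hop jumbo payload option, if present.
    """
    if payload_len_field != 0:
        # Ordinary packet: the header field fits in 16 bits and is trusted.
        return payload_len_field, "header field"
    if jumbo_len is not None:
        # Current BIG TCP IPv6 convention: real length in the jumbo option.
        return jumbo_len, "jumbo option"
    # Planned convention (as for IPv4): field is 0, fall back to the
    # length known from outside the header.
    return available_len, "presumed TSO"


print(effective_payload_length(48, 48))
print(effective_payload_length(0, 100008, jumbo_len=100008))
print(effective_payload_length(0, 100000))
```

The third case is the one this PR is about: without a jumbo option, a zero length field is the only hint that the packet is an aggregated one.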
These aggregated packets never go onto the wire. So this is all just local on the host when packets traverse up or down the stack.
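To make the "never go onto the wire" point concrete, here is a rough sketch of what segmentation (TSO in hardware, GSO in software) does to a super-packet before transmission. It only models payload sizes; the function name `segment_sizes` and the MSS value of 1400 are arbitrary choices for illustration, and real segmentation also rewrites headers, sequence numbers and checksums.

```python
def segment_sizes(payload_len, mss):
    """Split an aggregated TCP payload into wire-sized chunks of at most `mss` bytes."""
    if mss <= 0:
        raise ValueError("mss must be positive")
    full, rest = divmod(payload_len, mss)
    return [mss] * full + ([rest] if rest else [])


# A 200000-byte super-packet (only possible with BIG TCP) and an assumed MSS.
sizes = segment_sizes(200_000, 1400)
print(len(sizes), max(sizes), sizes[-1])
```

Every resulting segment fits comfortably in a 16-bit length field, which is why only a capture taken inside the host stack ever sees the oversized form.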
When it is not enabled, or for non-TCP or TCP control packets, the code works fine as is, given that the packet header carries the correct length in this case. Please note that tcpdump can handle all this for IPv4 already; this PR is just fixing the IPv6 printer/dissector side to display the equivalent "presumed TSO" length. The IPv4 equivalent in tcpdump's print-ip.c is:
@@ -1 +1 @@
-    1  2010-05-20 04:24:49.656077 IP6 fe80::25a:28ff:fe08:f150 > 6e02::41: ICMP6, length 0 (invalid)
+    1  2010-05-20 04:24:49.656077 IP6 fe80::25a:28ff:fe08:f150 > 6e02::41: ICMP6, neighbor advertisement, tgt is fe80:0:aa:aaaa:aaaa:aaaa:aaaa:aaaa, length 32
Not a big TCP here.
Thank you for explaining. Please note the "none of this leaves the host onto the wire" remark no longer holds as soon as the packet is captured and saved into a file. In that case the file can be decoded on a different host with a different OS, or even blindly replayed to network. Also packets can be generated, and the
@fxlb, would it make sense to extend the logic in your commit 3465ec4 onto IPv6?
Should the change process only "packet bigger than 65535 bytes"?
Thinking back to the IPv4 code, it should probably be:
To better show invalid non-TCP packets of length 0.
That would break the goal of having BIG TCP for encapsulation (vxlan, geneve). The IPv4 side is fine as it is; this is really only about making the IPv6 side similar to how the IPv4 dissector prints the length when it presumes TSO.
Thank you for your comments and having a look! Addressing the questions about distinguishing BIG TCP packets from invalid packets with zero length:
Indeed, one can capture a BIG TCP packet on a Linux host into a pcap file, then try to replay it on a non-Linux machine. However, that can be done regardless of my change, because replaying just sends the bytes stored in the file. My change only affects the text representation of the packet when it's printed to stdout.
   178  2019-07-05 17:27:40.810753 IP6 (hlim 64, next-header PIM (103), payload length 48) 10::2 > 10::1: PIMv2, length 48
  	Register, cksum 0xcc3c (correct), Flags [ Null ]
-	IP6 (class 0xc0, hlim 1, next-header PIM (103), payload length 0) 1::2 > ff02::1: [|pim]
+	IP6 (class 0xc0, hlim 1, next-header PIM (103), payload length 0) 1::2 > ff02::1: [real length 40, presumed TSO] [|pim]
Same: Not a big TCP here.
Would it be correct to say that you are on the way to generalize BIG TCP into BIG IP, and so far this mechanism covers TCP, VXLAN and GENEVE only?
Not quite, the mechanism here is still mainly for TCP flows. Today, the Linux kernel networking stack also does GRO/GSO/TSO for TCP traffic over vxlan/geneve tunnels (TSO in this case only on the most modern 100G NICs). The aggregation works similarly to what is described above (aggregating into a super-sized packet of at most 64k). The kernel patch series we'd like to merge lets TCP traffic over tunnels benefit from the same BIG TCP optimization (aka aggregating into super-sized packets beyond 64k) that already exists for non-tunneled traffic. So in that sense it's not a generalization into BIG IP; it's still targeted at TCP, thus BIG TCP (just that in this case the outer header has UDP, the inner TCP). So the "presumed TSO" case in the packet printer/dissector needs to be handled for TCP & UDP. With this PR as-is, the handling is then similar to what we already have in IPv4 and would cover the needed scenarios.
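As a way to picture the "TCP & UDP" scoping, here is a hypothetical predicate that would presume TSO only when the upper-layer protocol is TCP or UDP. This is purely an illustration of the idea under discussion and is not what the PR or tcpdump implements; the name `may_presume_tso` and its parameters are invented, and it assumes the caller has already walked past any extension headers to find the upper-layer protocol number.

```python
IPPROTO_TCP = 6       # plain BIG TCP
IPPROTO_UDP = 17      # outer header of vxlan/geneve-encapsulated TCP
IPPROTO_ICMPV6 = 58
IPPROTO_PIM = 103


def may_presume_tso(upper_proto, payload_len_field, real_len):
    """Hypothetical check: a zero length field is treated as TSO only for TCP/UDP."""
    return (
        payload_len_field == 0
        and real_len > 0
        and upper_proto in (IPPROTO_TCP, IPPROTO_UDP)
    )


print(may_presume_tso(IPPROTO_TCP, 0, 70000))
print(may_presume_tso(IPPROTO_ICMPV6, 0, 32))
```

Under such a check, the ICMP6 and PIM packets with payload length 0 from the test outputs quoted in the review comments would keep being reported as invalid or truncated rather than as presumed TSO.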
One more idea: check that
Do you have pcap files containing: IPv6/HBH_jumbo/(BIG)TCP
I'll collect them and add them as tests.
This doesn't seem to have worked well. While it successfully filters out bad ICMP packets, we have one test in which ip_len = 0 but the entire length is 2k, way below 64k. I'm kind of confused about the origins of this capture, because in all my tests TSO fills out ip_len properly if the packet is smaller than 64k (i.e. not BIG TCP).
Origin explained here: 3465ec4.
Kernel ref: https://lore.kernel.org/netdev/20250617144017.82931-1-maxim@isovalent.com/