Add the Tofino model source code as a package in the open P4Studio by fruffy · Pull Request #65 · p4lang/open-p4studio

fruffy · 2025-02-03T22:01:03Z

Compile on Ubuntu 20.04 and 22.04.
Use the generated binary instead of the available binary.
Add an option to disable model building and just use the latest binary.
Polish (Remove unnecessary code)

jafingerhut · 2025-02-03T23:00:12Z

"Files changed 5000+" Now there is a thing you don't see in a Github PR every day :-)

jafingerhut · 2025-02-04T13:22:24Z

pkgsrc/switch-p4-16/api/common/pal.cpp

+    //            __LINE__,
+    //            port_handle,
+    //            port_lane_list,
+    //            status);


Do you want to merge this PR with this code commented out? Or perhaps wrap them all in an #ifdef SOME_NEW_SYMBOL_NAME with the same symbol name as each other, to make them easier to grep and find later when someone wants to update all of these calls? I guess marking them all with an identical unique string in a comment works for that, too.

I will remove these changes later. But I need them to build on Ubuntu 24.04, which is the distribution I am currently using.

The problem is that this code doesn't compile with newer version of GCC/CLang, which enforce the C++ standard correctly.

fruffy · 2025-02-18T01:31:28Z

@jafingerhut Since you have been on a license hunt recently I wanted to point out that the model also uses pcap: https://github.com/p4lang/open-p4studio/blob/prathima/include_bf_model/pkgsrc/tofino-model/runner/bmi_interface.c#L24

Although I think the usage is fairly limited, I only see one file. And I do not know how important that file is.

jafingerhut · 2025-02-18T01:51:42Z

@jafingerhut Since you have been on a license hunt recently I wanted to point out that the model also uses pcap: https://github.com/p4lang/open-p4studio/blob/prathima/include_bf_model/pkgsrc/tofino-model/runner/bmi_interface.c#L24

Although I think the usage is fairly limited, I only see one file. And I do not know how important that file is.

Thanks for the heads up.

The C libpcap library is BSD licensed, https://github.com/the-tcpdump-group/libpcap/blob/master/LICENSE, which is compatible with using in an Apache-2.0 project.

In general, BSD, MIT, and libboost's BSL-1.0 license are all very compatible with just about any other common software license. It is Apache-2.0 and GPL licenses that are very questionable together, or in general any non-GPL license project using anything GPL'd.

scapy is the one that is causing us the current headaches.

fruffy · 2025-02-18T01:56:35Z

The C libpcap library is BSD licensed, https://github.com/the-tcpdump-group/libpcap/blob/master/LICENSE, which is compatible with using in an Apache-2.0 project.

Whoops! I mixed libpcap with scapy. For some reason I assumed these were part of the same project. Nevermind then!

jafingerhut · 2025-02-18T02:00:17Z

The C libpcap library is BSD licensed, https://github.com/the-tcpdump-group/libpcap/blob/master/LICENSE, which is compatible with using in an Apache-2.0 project.

Whoops! I mixed libpcap with scapy. For some reason I assumed these were part of the same project. Nevermind then!

Easy to do, given that our p4c test code uses small parts of Scapy to read and write pcap files.

I know one way to eliminate those uses of Scapy in p4c test code now, as over the weekend I found some pcap writer code in the ptf library (NOT in a GPL'd file), and am able to get most tests to pass using that: p4lang/ptf#214

fruffy · 2025-02-20T19:33:40Z

This is basically there but I am noticing some strange behavior in CI. It takes 5 minutes until the switch can connect. No idea why. I can not build the model yet because it does not compile on Ubuntu 24.04.

@jafingerhut if you have some time could you run a test and see whether things work for you?

jafingerhut · 2025-02-20T21:28:52Z

I tried following the top level README.md instructions that use the testing.yaml profile on a freshly install Ubuntu 20.04 x86_64 VM, and it got errors during the build. Some error about not being able to find a luaconf.h file while running cmake on some part of the project. I am attaching a zip file with 3 of the log files produced.
ubuntu20.04-errors.zip

jafingerhut · 2025-02-20T23:14:13Z

I tried again after your last commit "Fix CMake error.", and got a different error during the build. Log attached.
ubuntu20.04-try2-error.zip

jafingerhut · 2025-02-20T23:15:48Z

I guess there isn't an option to have CI run with "minimal install Ubuntu" images? That is pretty close to what I am running (plus with GUI desktop installed, for my convenience, which is a bunch of extra software that the CI images likely do not have).

fruffy · 2025-02-20T23:17:09Z

This seems to be the bf-asm code. So let's get #70 working before trying with the model I guess.

fruffy · 2025-04-02T15:22:51Z

@jafingerhut Once again, can I ask you to try this locally? I am seeing some strange behavior in CI I can not quite explain:

...
switchd: connect failed. Error: Connection refused
switchd: connect failed. Error: Connection refused
switchd: connect failed. Error: Connection refused
switchd: connect failed. Error: Connection refused
...

This goes on for ~5 minutes until the switch finally connects. I do not understand how the model built from source can influence this.

jafingerhut · 2025-04-02T16:11:35Z

@jafingerhut Once again, can I ask you to try this locally? I am seeing some strange behavior in CI I can not quite explain:
...
switchd: connect failed. Error: Connection refused
switchd: connect failed. Error: Connection refused
switchd: connect failed. Error: Connection refused
switchd: connect failed. Error: Connection refused
...
This goes on for ~5 minutes until the switch finally connects. I do not understand how the model built from source can influence this.

I will try it out on a local freshly installed Ubuntu 22.04 x86_64 VM, and let you know what I find.

I do not know the root cause, but I have found that on my local VM, when I configure it with 4 VCPUs and run the tests, it often takes a long time for some connection to be established between processes, but when I run that same test on the very same VM configured with only 1 VCPU, it almost always connects quickly (within a few seconds).

My only guess at this time is that there might be some kind of livelock between processes running actually in parallel on multiple CPU cores, e.g. something that causes connection request messages/packets between processes to be lost, that take a long time to time out and recover, and sometimes fail without recovering. But if you run with 1 VCPU, the fact that at most one process is physically running at a time somehow makes that less likely. Again, take that is conjecture, based on only a tiny bit of observed data.

fruffy · 2025-04-02T18:04:24Z

@jafingerhut Once again, can I ask you to try this locally? I am seeing some strange behavior in CI I can not quite explain:
...
switchd: connect failed. Error: Connection refused
switchd: connect failed. Error: Connection refused
switchd: connect failed. Error: Connection refused
switchd: connect failed. Error: Connection refused
...
This goes on for ~5 minutes until the switch finally connects. I do not understand how the model built from source can influence this.
I will try it out on a local freshly installed Ubuntu 22.04 x86_64 VM, and let you know what I find.

I do not know the root cause, but I have found that on my local VM, when I configure it with 4 VCPUs and run the tests, it often takes a long time for some connection to be established between processes, but when I run that same test on the very same VM configured with only 1 VCPU, it almost always connects quickly (within a few seconds).

My only guess at this time is that there might be some kind of livelock between processes running actually in parallel on multiple CPU cores, e.g. something that causes connection request messages/packets between processes to be lost, that take a long time to time out and recover, and sometimes fail without recovering. But if you run with 1 VCPU, the fact that at most one process is physically running at a time somehow makes that less likely. Again, take that is conjecture, based on only a tiny bit of observed data.

Well, the main branch doesn't have this problem, but this branch does. Which confuses me a bit.

jafingerhut · 2025-04-02T19:21:46Z

On personal freshly installed Ubuntu 22.04 x86_64 VM, I was able to build with the changes in this PR with no errors.

But I get a very quick error and exit when I try to run the model using the command that the CI test is currently using, and that I have been using successfully without the changes in this PR:

$ ./run_tofino_model.sh --arch tofino -p tna_counter
Using SDE /home/p4/open-p4studio
Using SDE_INSTALL /home/p4/open-p4studio/install
Model using test directory: /home/p4/open-p4studio/pkgsrc/p4-examples/p4_16_programs/tna_counter
Model using port info file: None
Using SDE_DEPENDENCIES /home/p4/open-p4studio/install
Using PATH /home/p4/open-p4studio/install/bin:/home/p4/open-p4studio/install/bin:/home/p4/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/games:/usr/local/games:/snap/bin:/snap/bin
Using LD_LIBRARY_PATH /home/p4/open-p4studio/install/lib:/home/p4/open-p4studio/install/lib:/home/p4/open-p4studio/install/lib:/usr/local/lib
 PRIVS: Eff=0x00000000 Perm=0x00002006 Inh=0x00000000
chip_type 2 unavailable (Available types: Tofino=1,JBay=4)

If you look at the shell script run_tofino_model.sh here: https://github.com/p4lang/open-p4studio/blob/main/run_tofino_model.sh#L115-L125

You can see that when you use tofino as the value of CHIP_ARCH, it sets CHIPTYPE to 2, which it is then passing on as some kind of parameter when running the model, which complains that it doesn't support chip_type=2, but the error message implies that it does support chip_type 1 for Tofino, and 4 for JBay (Tofino2).

I hacked my local copy of the run_tofino_model.sh script to change those assignments of CHIPTYPE=2 to be CHIPTYPE=1 instead, and it does seem to start the model then, so it gets a bit farther. I haven't got a successful run with that change yet on a 4 VCPU VM, but I will try it with a 1 VCPU VM soon to see if that makes a difference.

I don't know where this off-by-one on the CHIPTYPE values is coming from.

jafingerhut · 2025-04-02T19:30:40Z

Although the CI tests are passing right now, I think it is incorrectly reporting a pass. I bet the CI test is also starting the model, but it is exiting quickly with the error of unsupported chip type, but the CI tests are written in such a way that it does not detect this and fail like we wish it would. The repeated switchd: connect failed. Error: Connection refused messages are the bf_switchd process trying but failing to connect to the model process, which has already exited.

In my local copy, I modified run_tofino_model.sh to change all CHIPTYPE=2 assignments to CHIPTYPE=1, and the Tofino1 test of tna_counter test program passed, and I modified all CHIPTYPE=5 assignments to CHIPTYPE=4, and the Tofino2 test of tna_counter passed.

I see the same odd behavior in my local testing as I have seen before: the test starts and finishes quite quickly when I configure the VM with 1 VCPU -- it hangs for a long time with 4 VCPU.

fruffy · 2025-04-02T20:34:30Z

Although the CI tests are passing right now, I think it is incorrectly reporting a pass. I bet the CI test is also starting the model, but it is exiting quickly with the error of unsupported chip type, but the CI tests are written in such a way that it does not detect this and fail like we wish it would. The repeated switchd: connect failed. Error: Connection refused messages are the bf_switchd process trying but failing to connect to the model process, which has already exited.

In my local copy, I modified run_tofino_model.sh to change all CHIPTYPE=2 assignments to CHIPTYPE=1, and the Tofino1 test of tna_counter test program passed, and I modified all CHIPTYPE=5 assignments to CHIPTYPE=4, and the Tofino2 test of tna_counter passed.

I see the same odd behavior in my local testing as I have seen before: the test starts and finishes quite quickly when I configure the VM with 1 VCPU -- it hangs for a long time with 4 VCPU.

Thanks, that is really good to know since I can not build things locally on Ubuntu 24.04 right now.

fruffy · 2025-04-03T06:40:22Z

The other problem is that the tests are passing. I am not sure why...

fruffy · 2025-04-09T06:26:10Z

@jafingerhut I got a little carried away with the testing script and generate a version using Gemini 2.5. This particular LLM works really well for boilerplate tasks like this one because its context window is actually large enough to generate lots of code. I will edit the script down and likely move it to a separate PR. I can also try to combine it with the ideas in #39.

The last two commits show a failing and a passing test for Ubuntu 22.04. Let me know whether the output works for you.

jafingerhut · 2025-04-09T11:52:00Z

@jafingerhut I got a little carried away with the testing script and generate a version using Gemini 2.5. This particular LLM works really well for boilerplate tasks like this one because its context window is actually large enough to generate lots of code. I will edit the script down and likely move it to a separate PR. I can also try to combine it with the ideas in #39.

The last two commits show a failing and a passing test for Ubuntu 22.04. Let me know whether the output works for you.

I will create a freshly installed Ubuntu 22.04 VM and try a build and test there, and let you know the results.

Unless you are strongly motivated to get tests passing for Ubuntu 20.04, too, it appears we have until only 2025-Apr-15 before Github CI disables that.

jafingerhut · 2025-04-09T11:52:11Z

FYI, DCO check is failing on this PR.

fruffy · 2025-04-09T11:59:03Z

FYI, DCO check is failing on this PR.

Fixed 🪄

Unless you are strongly motivated to get tests passing for Ubuntu 20.04

This already seems to be the brownout.

jafingerhut · 2025-04-09T15:53:48Z

With the latest version of this PR as of commit 18 on 2025-Apr-09, the build succeeds on an x86_64 Ubuntu 22.04 VM, as it has with earlier versions. After setting up SDE and SDE_INSTALL env variables as recommended in the README, the following command also successfully runs one test:

./ci/run-test.py tna_counter

However, if I then attempt to run a second test, it consistently fails, and I am pretty sure it is because after the first test completes, there is still a bf_switchd and tofino-model process running left over from the first test, which interferes with the second test passing.

I would recommend modifying run-test.py so that it ensures these processes are stopped before run-test.py exits, so that run-test.py can be used to run a sequence of many tests.

jafingerhut · 2025-04-09T16:16:07Z

With the latest version of this PR as of commit 18 on 2025-Apr-09, the build succeeds on an x86_64 Ubuntu 22.04 VM, as it has with earlier versions. After setting up SDE and SDE_INSTALL env variables as recommended in the README, the following command also successfully runs one test:
./ci/run-test.py tna_counter
However, if I then attempt to run a second test, it consistently fails, and I am pretty sure it is because after the first test completes, there is still a bf_switchd and tofino-model process running left over from the first test, which interferes with the second test passing.

I would recommend modifying run-test.py so that it ensures these processes are stopped before run-test.py exits, so that run-test.py can be used to run a sequence of many tests.

FYI, if you would prefer to merge this PR as is, and improve run-test.py's implementation with later PRs, I am OK with that. Approving this PR in case that is your preference.

jafingerhut

LGTM

fruffy · 2025-04-10T11:46:35Z

I would recommend modifying run-test.py so that it ensures these processes are stopped before run-test.py exits, so that run-test.py can be used to run a sequence of many tests.

It looks like the problem is that the processes of bf_switchd and tofino-model are run as sudo so they can not be killed easily after running the script. Need to kill the entire process group.

jafingerhut · 2025-04-10T12:12:44Z

Hence my use of these lines of code to clean up in my other proposed Bash script for running one test:

# This script only supports running one test on a system at a time,
# not multiple in parallel.  Thus killing all of these processes is
# OK.
sudo killall bf_switchd tofino-model

Signed-off-by: fruffy <fruffy@nyu.edu>

…4lang#65) Signed-off-by: fruffy <fruffy@nyu.edu> Co-authored-by: Prathima Kotikalapudi <prathima.kotikalapudi@intel.com>

…4lang#65) Signed-off-by: fruffy <fruffy@nyu.edu> Co-authored-by: Prathima Kotikalapudi <prathima.kotikalapudi@intel.com> Signed-off-by: Andy Fingerhut <andy_fingerhut@alum.wustl.edu>

jafingerhut reviewed Feb 4, 2025

View reviewed changes

fruffy force-pushed the prathima/include_bf_model branch from bdbc37f to d9bd9b9 Compare February 16, 2025 18:50

fruffy mentioned this pull request Feb 16, 2025

Force CPP++17 installation for Thrift and Boost. #66

Merged

fruffy force-pushed the prathima/include_bf_model branch 5 times, most recently from 2b668ab to 9ab5e07 Compare February 17, 2025 01:38

fruffy force-pushed the prathima/include_bf_model branch 3 times, most recently from 668d652 to f210503 Compare February 20, 2025 15:39

fruffy mentioned this pull request Mar 18, 2025

Move the assembler to P4C. #70

Merged

fruffy force-pushed the prathima/include_bf_model branch from cabed4a to 2475c42 Compare April 2, 2025 13:36

fruffy force-pushed the prathima/include_bf_model branch 2 times, most recently from db02cd9 to bf287b2 Compare April 8, 2025 14:08

jafingerhut approved these changes Apr 9, 2025

View reviewed changes

jafingerhut mentioned this pull request Apr 10, 2025

Add a test runner that can run tests in parallel and in network namespaces with temporary directories. #94

Merged

pkotikal and others added 13 commits April 10, 2025 20:09

Integrate tofino-model

0aa679a

Signed-off-by: fruffy <fruffy@nyu.edu>

libcrafter dependency

f0776e7

Signed-off-by: fruffy <fruffy@nyu.edu>

Fix librafter setup.

0f9dd29

Signed-off-by: fruffy <fruffy@nyu.edu>

Model version?!

2e43f81

Signed-off-by: fruffy <fruffy@nyu.edu>

Cmake cleanups.

49c86ef

Signed-off-by: fruffy <fruffy@nyu.edu>

More cleanup.

7cb70a4

Signed-off-by: fruffy <fruffy@nyu.edu>

Fix compilation failure because of unsanitized input.

e206995

Signed-off-by: fruffy <fruffy@nyu.edu>

Try to clean up some settings.

91ba605

Signed-off-by: fruffy <fruffy@nyu.edu>

Only use dynamic libpcap and try to set multiple definitions to extern?

0f888a4

Signed-off-by: fruffy <fruffy@nyu.edu>

Try some simplifications.

d9d914f

Signed-off-by: fruffy <fruffy@nyu.edu>

Try to get rid of some unnecessary boost.

285817a

Signed-off-by: fruffy <fruffy@nyu.edu>

Fix CMake error.

6ca3346

Signed-off-by: fruffy <fruffy@nyu.edu>

Enable B0 support for JBay and Tofino.

5908edc

Signed-off-by: fruffy <fruffy@nyu.edu>

fruffy force-pushed the prathima/include_bf_model branch from 7eae1b1 to 5908edc Compare April 10, 2025 18:12

fruffy merged commit fc66158 into main Apr 10, 2025
3 checks passed

fruffy deleted the prathima/include_bf_model branch April 10, 2025 20:50

Conversation

fruffy commented Feb 3, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

jafingerhut commented Feb 3, 2025

Uh oh!

jafingerhut Feb 4, 2025

Choose a reason for hiding this comment

Uh oh!

fruffy Feb 4, 2025

Choose a reason for hiding this comment

Uh oh!

fruffy commented Feb 18, 2025

Uh oh!

jafingerhut commented Feb 18, 2025

Uh oh!

fruffy commented Feb 18, 2025

Uh oh!

jafingerhut commented Feb 18, 2025

Uh oh!

fruffy commented Feb 20, 2025

Uh oh!

jafingerhut commented Feb 20, 2025

Uh oh!

jafingerhut commented Feb 20, 2025

Uh oh!

jafingerhut commented Feb 20, 2025

Uh oh!

fruffy commented Feb 20, 2025

Uh oh!

fruffy commented Apr 2, 2025

Uh oh!

jafingerhut commented Apr 2, 2025

Uh oh!

fruffy commented Apr 2, 2025

Uh oh!

jafingerhut commented Apr 2, 2025

Uh oh!

jafingerhut commented Apr 2, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

fruffy commented Apr 2, 2025

Uh oh!

fruffy commented Apr 3, 2025

Uh oh!

fruffy commented Apr 9, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

jafingerhut commented Apr 9, 2025

Uh oh!

jafingerhut commented Apr 9, 2025

Uh oh!

fruffy commented Apr 9, 2025

Uh oh!

jafingerhut commented Apr 9, 2025

Uh oh!

jafingerhut commented Apr 9, 2025

Uh oh!

jafingerhut left a comment

Choose a reason for hiding this comment

Uh oh!

fruffy commented Apr 10, 2025

Uh oh!

jafingerhut commented Apr 10, 2025

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

fruffy commented Feb 3, 2025 •

edited

Loading

jafingerhut commented Apr 2, 2025 •

edited

Loading

fruffy commented Apr 9, 2025 •

edited

Loading