Skip to content

Add the Tofino model source code as a package in the open P4Studio#65

Merged
fruffy merged 13 commits intomainfrom
prathima/include_bf_model
Apr 10, 2025
Merged

Add the Tofino model source code as a package in the open P4Studio#65
fruffy merged 13 commits intomainfrom
prathima/include_bf_model

Conversation

@fruffy
Copy link
Contributor

@fruffy fruffy commented Feb 3, 2025

  • Compile on Ubuntu 20.04 and 22.04.
  • Use the generated binary instead of the available binary.
  • Add an option to disable model building and just use the latest binary.
  • Polish (Remove unnecessary code)

@jafingerhut
Copy link
Contributor

"Files changed 5000+" Now there is a thing you don't see in a Github PR every day :-)

// __LINE__,
// port_handle,
// port_lane_list,
// status);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you want to merge this PR with this code commented out? Or perhaps wrap them all in an #ifdef SOME_NEW_SYMBOL_NAME with the same symbol name as each other, to make them easier to grep and find later when someone wants to update all of these calls? I guess marking them all with an identical unique string in a comment works for that, too.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I will remove these changes later. But I need them to build on Ubuntu 24.04, which is the distribution I am currently using.

The problem is that this code doesn't compile with newer version of GCC/CLang, which enforce the C++ standard correctly.

@fruffy fruffy force-pushed the prathima/include_bf_model branch from bdbc37f to d9bd9b9 Compare February 16, 2025 18:50
@fruffy fruffy force-pushed the prathima/include_bf_model branch 5 times, most recently from 2b668ab to 9ab5e07 Compare February 17, 2025 01:38
@fruffy
Copy link
Contributor Author

fruffy commented Feb 18, 2025

@jafingerhut Since you have been on a license hunt recently I wanted to point out that the model also uses pcap: https://github.com/p4lang/open-p4studio/blob/prathima/include_bf_model/pkgsrc/tofino-model/runner/bmi_interface.c#L24

Although I think the usage is fairly limited, I only see one file. And I do not know how important that file is.

@jafingerhut
Copy link
Contributor

@jafingerhut Since you have been on a license hunt recently I wanted to point out that the model also uses pcap: https://github.com/p4lang/open-p4studio/blob/prathima/include_bf_model/pkgsrc/tofino-model/runner/bmi_interface.c#L24

Although I think the usage is fairly limited, I only see one file. And I do not know how important that file is.

Thanks for the heads up.

The C libpcap library is BSD licensed, https://github.com/the-tcpdump-group/libpcap/blob/master/LICENSE, which is compatible with using in an Apache-2.0 project.

In general, BSD, MIT, and libboost's BSL-1.0 license are all very compatible with just about any other common software license. It is Apache-2.0 and GPL licenses that are very questionable together, or in general any non-GPL license project using anything GPL'd.

scapy is the one that is causing us the current headaches.

@fruffy
Copy link
Contributor Author

fruffy commented Feb 18, 2025

The C libpcap library is BSD licensed, https://github.com/the-tcpdump-group/libpcap/blob/master/LICENSE, which is compatible with using in an Apache-2.0 project.

Whoops! I mixed libpcap with scapy. For some reason I assumed these were part of the same project. Nevermind then!

@jafingerhut
Copy link
Contributor

The C libpcap library is BSD licensed, https://github.com/the-tcpdump-group/libpcap/blob/master/LICENSE, which is compatible with using in an Apache-2.0 project.

Whoops! I mixed libpcap with scapy. For some reason I assumed these were part of the same project. Nevermind then!

Easy to do, given that our p4c test code uses small parts of Scapy to read and write pcap files.

I know one way to eliminate those uses of Scapy in p4c test code now, as over the weekend I found some pcap writer code in the ptf library (NOT in a GPL'd file), and am able to get most tests to pass using that: p4lang/ptf#214

@fruffy fruffy force-pushed the prathima/include_bf_model branch 3 times, most recently from 668d652 to f210503 Compare February 20, 2025 15:39
@fruffy
Copy link
Contributor Author

fruffy commented Feb 20, 2025

This is basically there but I am noticing some strange behavior in CI. It takes 5 minutes until the switch can connect. No idea why. I can not build the model yet because it does not compile on Ubuntu 24.04.

@jafingerhut if you have some time could you run a test and see whether things work for you?

@jafingerhut
Copy link
Contributor

I tried following the top level README.md instructions that use the testing.yaml profile on a freshly install Ubuntu 20.04 x86_64 VM, and it got errors during the build. Some error about not being able to find a luaconf.h file while running cmake on some part of the project. I am attaching a zip file with 3 of the log files produced.
ubuntu20.04-errors.zip

@jafingerhut
Copy link
Contributor

I tried again after your last commit "Fix CMake error.", and got a different error during the build. Log attached.
ubuntu20.04-try2-error.zip

@jafingerhut
Copy link
Contributor

I guess there isn't an option to have CI run with "minimal install Ubuntu" images? That is pretty close to what I am running (plus with GUI desktop installed, for my convenience, which is a bunch of extra software that the CI images likely do not have).

@fruffy
Copy link
Contributor Author

fruffy commented Feb 20, 2025

This seems to be the bf-asm code. So let's get #70 working before trying with the model I guess.

@fruffy fruffy mentioned this pull request Mar 18, 2025
@fruffy fruffy force-pushed the prathima/include_bf_model branch from cabed4a to 2475c42 Compare April 2, 2025 13:36
@fruffy
Copy link
Contributor Author

fruffy commented Apr 2, 2025

@jafingerhut Once again, can I ask you to try this locally? I am seeing some strange behavior in CI I can not quite explain:

...
switchd: connect failed. Error: Connection refused
switchd: connect failed. Error: Connection refused
switchd: connect failed. Error: Connection refused
switchd: connect failed. Error: Connection refused
...

This goes on for ~5 minutes until the switch finally connects. I do not understand how the model built from source can influence this.

@jafingerhut
Copy link
Contributor

@jafingerhut Once again, can I ask you to try this locally? I am seeing some strange behavior in CI I can not quite explain:

...
switchd: connect failed. Error: Connection refused
switchd: connect failed. Error: Connection refused
switchd: connect failed. Error: Connection refused
switchd: connect failed. Error: Connection refused
...

This goes on for ~5 minutes until the switch finally connects. I do not understand how the model built from source can influence this.

I will try it out on a local freshly installed Ubuntu 22.04 x86_64 VM, and let you know what I find.

I do not know the root cause, but I have found that on my local VM, when I configure it with 4 VCPUs and run the tests, it often takes a long time for some connection to be established between processes, but when I run that same test on the very same VM configured with only 1 VCPU, it almost always connects quickly (within a few seconds).

My only guess at this time is that there might be some kind of livelock between processes running actually in parallel on multiple CPU cores, e.g. something that causes connection request messages/packets between processes to be lost, that take a long time to time out and recover, and sometimes fail without recovering. But if you run with 1 VCPU, the fact that at most one process is physically running at a time somehow makes that less likely. Again, take that is conjecture, based on only a tiny bit of observed data.

@fruffy
Copy link
Contributor Author

fruffy commented Apr 2, 2025

@jafingerhut Once again, can I ask you to try this locally? I am seeing some strange behavior in CI I can not quite explain:

...
switchd: connect failed. Error: Connection refused
switchd: connect failed. Error: Connection refused
switchd: connect failed. Error: Connection refused
switchd: connect failed. Error: Connection refused
...

This goes on for ~5 minutes until the switch finally connects. I do not understand how the model built from source can influence this.

I will try it out on a local freshly installed Ubuntu 22.04 x86_64 VM, and let you know what I find.

I do not know the root cause, but I have found that on my local VM, when I configure it with 4 VCPUs and run the tests, it often takes a long time for some connection to be established between processes, but when I run that same test on the very same VM configured with only 1 VCPU, it almost always connects quickly (within a few seconds).

My only guess at this time is that there might be some kind of livelock between processes running actually in parallel on multiple CPU cores, e.g. something that causes connection request messages/packets between processes to be lost, that take a long time to time out and recover, and sometimes fail without recovering. But if you run with 1 VCPU, the fact that at most one process is physically running at a time somehow makes that less likely. Again, take that is conjecture, based on only a tiny bit of observed data.

Well, the main branch doesn't have this problem, but this branch does. Which confuses me a bit.

@jafingerhut
Copy link
Contributor

On personal freshly installed Ubuntu 22.04 x86_64 VM, I was able to build with the changes in this PR with no errors.

But I get a very quick error and exit when I try to run the model using the command that the CI test is currently using, and that I have been using successfully without the changes in this PR:

$ ./run_tofino_model.sh --arch tofino -p tna_counter
Using SDE /home/p4/open-p4studio
Using SDE_INSTALL /home/p4/open-p4studio/install
Model using test directory: /home/p4/open-p4studio/pkgsrc/p4-examples/p4_16_programs/tna_counter
Model using port info file: None
Using SDE_DEPENDENCIES /home/p4/open-p4studio/install
Using PATH /home/p4/open-p4studio/install/bin:/home/p4/open-p4studio/install/bin:/home/p4/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/games:/usr/local/games:/snap/bin:/snap/bin
Using LD_LIBRARY_PATH /home/p4/open-p4studio/install/lib:/home/p4/open-p4studio/install/lib:/home/p4/open-p4studio/install/lib:/usr/local/lib
 PRIVS: Eff=0x00000000 Perm=0x00002006 Inh=0x00000000
chip_type 2 unavailable (Available types: Tofino=1,JBay=4)

If you look at the shell script run_tofino_model.sh here: https://github.com/p4lang/open-p4studio/blob/main/run_tofino_model.sh#L115-L125

You can see that when you use tofino as the value of CHIP_ARCH, it sets CHIPTYPE to 2, which it is then passing on as some kind of parameter when running the model, which complains that it doesn't support chip_type=2, but the error message implies that it does support chip_type 1 for Tofino, and 4 for JBay (Tofino2).

I hacked my local copy of the run_tofino_model.sh script to change those assignments of CHIPTYPE=2 to be CHIPTYPE=1 instead, and it does seem to start the model then, so it gets a bit farther. I haven't got a successful run with that change yet on a 4 VCPU VM, but I will try it with a 1 VCPU VM soon to see if that makes a difference.

I don't know where this off-by-one on the CHIPTYPE values is coming from.

@jafingerhut
Copy link
Contributor

jafingerhut commented Apr 2, 2025

Although the CI tests are passing right now, I think it is incorrectly reporting a pass. I bet the CI test is also starting the model, but it is exiting quickly with the error of unsupported chip type, but the CI tests are written in such a way that it does not detect this and fail like we wish it would. The repeated switchd: connect failed. Error: Connection refused messages are the bf_switchd process trying but failing to connect to the model process, which has already exited.

In my local copy, I modified run_tofino_model.sh to change all CHIPTYPE=2 assignments to CHIPTYPE=1, and the Tofino1 test of tna_counter test program passed, and I modified all CHIPTYPE=5 assignments to CHIPTYPE=4, and the Tofino2 test of tna_counter passed.

I see the same odd behavior in my local testing as I have seen before: the test starts and finishes quite quickly when I configure the VM with 1 VCPU -- it hangs for a long time with 4 VCPU.

@fruffy
Copy link
Contributor Author

fruffy commented Apr 2, 2025

Although the CI tests are passing right now, I think it is incorrectly reporting a pass. I bet the CI test is also starting the model, but it is exiting quickly with the error of unsupported chip type, but the CI tests are written in such a way that it does not detect this and fail like we wish it would. The repeated switchd: connect failed. Error: Connection refused messages are the bf_switchd process trying but failing to connect to the model process, which has already exited.

In my local copy, I modified run_tofino_model.sh to change all CHIPTYPE=2 assignments to CHIPTYPE=1, and the Tofino1 test of tna_counter test program passed, and I modified all CHIPTYPE=5 assignments to CHIPTYPE=4, and the Tofino2 test of tna_counter passed.

I see the same odd behavior in my local testing as I have seen before: the test starts and finishes quite quickly when I configure the VM with 1 VCPU -- it hangs for a long time with 4 VCPU.

Thanks, that is really good to know since I can not build things locally on Ubuntu 24.04 right now.

@fruffy
Copy link
Contributor Author

fruffy commented Apr 3, 2025

The other problem is that the tests are passing. I am not sure why...

@fruffy fruffy force-pushed the prathima/include_bf_model branch 2 times, most recently from db02cd9 to bf287b2 Compare April 8, 2025 14:08
@fruffy
Copy link
Contributor Author

fruffy commented Apr 9, 2025

@jafingerhut I got a little carried away with the testing script and generate a version using Gemini 2.5. This particular LLM works really well for boilerplate tasks like this one because its context window is actually large enough to generate lots of code. I will edit the script down and likely move it to a separate PR. I can also try to combine it with the ideas in #39.

The last two commits show a failing and a passing test for Ubuntu 22.04. Let me know whether the output works for you.

@jafingerhut
Copy link
Contributor

@jafingerhut I got a little carried away with the testing script and generate a version using Gemini 2.5. This particular LLM works really well for boilerplate tasks like this one because its context window is actually large enough to generate lots of code. I will edit the script down and likely move it to a separate PR. I can also try to combine it with the ideas in #39.

The last two commits show a failing and a passing test for Ubuntu 22.04. Let me know whether the output works for you.

I will create a freshly installed Ubuntu 22.04 VM and try a build and test there, and let you know the results.

Unless you are strongly motivated to get tests passing for Ubuntu 20.04, too, it appears we have until only 2025-Apr-15 before Github CI disables that.

@jafingerhut
Copy link
Contributor

FYI, DCO check is failing on this PR.

@fruffy
Copy link
Contributor Author

fruffy commented Apr 9, 2025

FYI, DCO check is failing on this PR.

Fixed 🪄

Unless you are strongly motivated to get tests passing for Ubuntu 20.04

This already seems to be the brownout.

@jafingerhut
Copy link
Contributor

With the latest version of this PR as of commit 18 on 2025-Apr-09, the build succeeds on an x86_64 Ubuntu 22.04 VM, as it has with earlier versions. After setting up SDE and SDE_INSTALL env variables as recommended in the README, the following command also successfully runs one test:

./ci/run-test.py tna_counter

However, if I then attempt to run a second test, it consistently fails, and I am pretty sure it is because after the first test completes, there is still a bf_switchd and tofino-model process running left over from the first test, which interferes with the second test passing.

I would recommend modifying run-test.py so that it ensures these processes are stopped before run-test.py exits, so that run-test.py can be used to run a sequence of many tests.

@jafingerhut
Copy link
Contributor

With the latest version of this PR as of commit 18 on 2025-Apr-09, the build succeeds on an x86_64 Ubuntu 22.04 VM, as it has with earlier versions. After setting up SDE and SDE_INSTALL env variables as recommended in the README, the following command also successfully runs one test:

./ci/run-test.py tna_counter

However, if I then attempt to run a second test, it consistently fails, and I am pretty sure it is because after the first test completes, there is still a bf_switchd and tofino-model process running left over from the first test, which interferes with the second test passing.

I would recommend modifying run-test.py so that it ensures these processes are stopped before run-test.py exits, so that run-test.py can be used to run a sequence of many tests.

FYI, if you would prefer to merge this PR as is, and improve run-test.py's implementation with later PRs, I am OK with that. Approving this PR in case that is your preference.

Copy link
Contributor

@jafingerhut jafingerhut left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@fruffy
Copy link
Contributor Author

fruffy commented Apr 10, 2025

I would recommend modifying run-test.py so that it ensures these processes are stopped before run-test.py exits, so that run-test.py can be used to run a sequence of many tests.

It looks like the problem is that the processes of bf_switchd and tofino-model are run as sudo so they can not be killed easily after running the script. Need to kill the entire process group.

@jafingerhut
Copy link
Contributor

Hence my use of these lines of code to clean up in my other proposed Bash script for running one test:

# This script only supports running one test on a system at a time,
# not multiple in parallel.  Thus killing all of these processes is
# OK.
sudo killall bf_switchd tofino-model

pkotikal and others added 13 commits April 10, 2025 20:09
Signed-off-by: fruffy <fruffy@nyu.edu>
Signed-off-by: fruffy <fruffy@nyu.edu>
Signed-off-by: fruffy <fruffy@nyu.edu>
Signed-off-by: fruffy <fruffy@nyu.edu>
Signed-off-by: fruffy <fruffy@nyu.edu>
Signed-off-by: fruffy <fruffy@nyu.edu>
Signed-off-by: fruffy <fruffy@nyu.edu>
Signed-off-by: fruffy <fruffy@nyu.edu>
Signed-off-by: fruffy <fruffy@nyu.edu>
Signed-off-by: fruffy <fruffy@nyu.edu>
Signed-off-by: fruffy <fruffy@nyu.edu>
Signed-off-by: fruffy <fruffy@nyu.edu>
@fruffy fruffy force-pushed the prathima/include_bf_model branch from 7eae1b1 to 5908edc Compare April 10, 2025 18:12
@fruffy fruffy merged commit fc66158 into main Apr 10, 2025
3 checks passed
@fruffy fruffy deleted the prathima/include_bf_model branch April 10, 2025 20:50
rcgoodfellow pushed a commit to oxidecomputer/tofino-sde that referenced this pull request Feb 21, 2026
…4lang#65)

Signed-off-by: fruffy <fruffy@nyu.edu>
Co-authored-by: Prathima Kotikalapudi <prathima.kotikalapudi@intel.com>
jafingerhut pushed a commit to jafingerhut/open-p4studio that referenced this pull request Feb 26, 2026
…4lang#65)

Signed-off-by: fruffy <fruffy@nyu.edu>
Co-authored-by: Prathima Kotikalapudi <prathima.kotikalapudi@intel.com>
Signed-off-by: Andy Fingerhut <andy_fingerhut@alum.wustl.edu>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants