Add the Tofino model source code as a package in the open P4Studio #65
Conversation
"Files changed 5000+": now there is a thing you don't see in a GitHub PR every day :-)

    // __LINE__,
    // port_handle,
    // port_lane_list,
    // status);
Do you want to merge this PR with this code commented out? Or perhaps wrap these calls in an #ifdef SOME_NEW_SYMBOL_NAME, using the same symbol name everywhere, to make them easier to grep for and find later when someone wants to update all of them? I guess marking them all with an identical unique string in a comment would work for that, too.
I will remove these changes later, but I need them to build on Ubuntu 24.04, which is the distribution I am currently using.
The problem is that this code doesn't compile with newer versions of GCC/Clang, which enforce the C++ standard more strictly.
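The #ifdef suggestion above can be sketched like this. Note that the guard symbol DISABLED_PENDING_CXX_FIXES and the function legacy_port_call() are made up for illustration; the point is that one shared symbol makes every disabled call site findable with a single grep.

```shell
# Sketch of the #ifdef idea from the review comment above. The symbol
# name and the guarded function are hypothetical stand-ins.
cat > guarded_example.c <<'EOF'
#include <stdio.h>

int main(void) {
#ifdef DISABLED_PENDING_CXX_FIXES
    /* calls that do not compile under newer GCC/Clang would live here */
    legacy_port_call();
#endif
    puts("built cleanly without the guarded calls");
    return 0;
}
EOF
cc -Wall -o guarded_example guarded_example.c
./guarded_example
grep -c DISABLED_PENDING_CXX_FIXES guarded_example.c  # one grep finds them all
```

Because the symbol is never defined, the guarded calls are invisible to the compiler, and re-enabling them later is a matter of defining the symbol (or deleting the guards) at every grep hit.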
@jafingerhut Since you have been on a license hunt recently, I wanted to point out that the model also uses pcap: https://github.com/p4lang/open-p4studio/blob/prathima/include_bf_model/pkgsrc/tofino-model/runner/bmi_interface.c#L24 The usage seems fairly limited; I only see one file, and I do not know how important that file is.
Thanks for the heads up. The C libpcap library is BSD licensed (https://github.com/the-tcpdump-group/libpcap/blob/master/LICENSE), which is compatible with use in an Apache-2.0 project. In general, BSD, MIT, and Boost's BSL-1.0 license are all very compatible with just about any other common software license. It is Apache-2.0 and the GPL licenses that are very questionable together, or in general any non-GPL project using anything GPL'd. scapy is the one causing us the current headaches.
Whoops! I mixed up libpcap and scapy; for some reason I assumed they were part of the same project. Never mind then!
Easy to do, given that our p4c test code uses small parts of Scapy to read and write pcap files. I now know one way to eliminate those uses of Scapy in the p4c test code: over the weekend I found some pcap writer code in the ptf library (NOT in a GPL'd file), and I am able to get most tests to pass using it: p4lang/ptf#214
This is basically there, but I am noticing some strange behavior in CI: it takes 5 minutes until the switch can connect, and I have no idea why. I cannot build the model myself because it does not compile on Ubuntu 24.04. @jafingerhut, if you have some time, could you run a test and see whether things work for you?
I tried following the top-level README.md instructions that use the testing.yaml profile on a freshly installed Ubuntu 20.04 x86_64 VM, and it got errors during the build: some error about not being able to find a luaconf.h file while running cmake on part of the project. I am attaching a zip file with 3 of the log files produced.
I tried again after your last commit "Fix CMake error.", and got a different error during the build. Log attached.
I guess there isn't an option to have CI run with "minimal install Ubuntu" images? That is pretty close to what I am running (plus a GUI desktop installed for my convenience, which is a bunch of extra software that the CI images likely do not have).
This seems to be the bf-asm code, so let's get #70 working before trying with the model, I guess.
@jafingerhut Once again, can I ask you to try this locally? I am seeing some strange behavior in CI that I cannot quite explain: this goes on for ~5 minutes until the switch finally connects. I do not understand how the model built from source can influence this.
I will try it out on a local freshly installed Ubuntu 22.04 x86_64 VM and let you know what I find. I do not know the root cause, but I have found that on my local VM, when I configure it with 4 VCPUs and run the tests, it often takes a long time for some connection to be established between processes; when I run the same test on the very same VM configured with only 1 VCPU, it almost always connects quickly (within a few seconds). My only guess at this time is that there might be some kind of livelock between processes running truly in parallel on multiple CPU cores, e.g. something that causes connection request messages/packets between processes to be lost, which take a long time to time out and recover, and sometimes fail without recovering. If you run with 1 VCPU, the fact that at most one process is physically running at a time somehow makes that less likely. Again, take that as conjecture, based on only a tiny bit of observed data.
Well, the main branch doesn't have this problem, but this branch does, which confuses me a bit.
On a freshly installed Ubuntu 22.04 x86_64 VM, I was able to build with the changes in this PR with no errors. But I get a very quick error and exit when I try to run the model using the command that the CI test currently uses, and that I have been using successfully without the changes in this PR. If you look at the shell script run_tofino_model.sh here: https://github.com/p4lang/open-p4studio/blob/main/run_tofino_model.sh#L115-L125 you can see the CHIPTYPE assignments it makes. I hacked my local copy of run_tofino_model.sh to change those assignments of CHIPTYPE=2 to CHIPTYPE=1 instead, and the model does seem to start then, so it gets a bit farther. I haven't gotten a successful run with that change yet on a 4 VCPU VM, but I will try it with a 1 VCPU VM soon to see if that makes a difference. I don't know where this off-by-one in the CHIPTYPE values is coming from.
Although the CI tests are passing right now, I think they are incorrectly reporting a pass. I bet the CI test is also starting the model, and it is exiting quickly with the unsupported-chip-type error, but the CI tests are written in such a way that they do not detect this and fail like we wish they would. In my local copy, I modified run_tofino_model.sh to change all CHIPTYPE=2 assignments to CHIPTYPE=1, and the Tofino1 test of the tna_counter test program passed; I modified all CHIPTYPE=5 assignments to CHIPTYPE=4, and the Tofino2 test of tna_counter passed. I see the same odd behavior in my local testing as before: the test starts and finishes quite quickly when I configure the VM with 1 VCPU, but it hangs for a long time with 4 VCPUs.
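The local workaround described above (shifting each CHIPTYPE assignment down by one) can be applied in one step. A sketch, using a stand-in file with the relevant assignments; in practice you would point sed at your local copy of run_tofino_model.sh instead:

```shell
# Sketch of the CHIPTYPE workaround discussed above. chiptype_demo.sh
# is a stand-in for a local copy of run_tofino_model.sh.
script=chiptype_demo.sh
printf 'CHIPTYPE=2\nCHIPTYPE=5\n' > "$script"
# shift each chip-type value down by one (2 -> 1 for Tofino1, 5 -> 4 for Tofino2)
sed -i -e 's/^CHIPTYPE=2$/CHIPTYPE=1/' -e 's/^CHIPTYPE=5$/CHIPTYPE=4/' "$script"
cat "$script"
```

This is only the diagnostic hack from the thread, not a fix; the root cause of the off-by-one is still unknown at this point in the conversation.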
Thanks, that is really good to know, since I cannot build things locally on Ubuntu 24.04 right now.
The other problem is that the tests are passing. I am not sure why...
@jafingerhut I got a little carried away with the testing script and generated a version using Gemini 2.5. This particular LLM works really well for boilerplate tasks like this one because its context window is actually large enough to generate lots of code. I will edit the script down and likely move it to a separate PR. I can also try to combine it with the ideas in #39. The last two commits show a failing and a passing test for Ubuntu 22.04. Let me know whether the output works for you.
I will create a freshly installed Ubuntu 22.04 VM and try a build and test there, and let you know the results. Unless you are strongly motivated to get tests passing for Ubuntu 20.04 too, it appears we have only until 2025-Apr-15 before GitHub CI disables that image.
FYI, the DCO check is failing on this PR.
Fixed 🪄
This already seems to be the brownout.
With the latest version of this PR as of commit 18 on 2025-Apr-09, the build succeeds on an x86_64 Ubuntu 22.04 VM, as it has with earlier versions. After setting up the SDE and SDE_INSTALL env variables as recommended in the README, a single test also runs successfully. However, if I then attempt to run a second test, it consistently fails, and I am pretty sure it is because after the first test completes there are still a bf_switchd and a tofino-model process left over from the first test, which interfere with the second test passing. I would recommend modifying run-test.py so that it ensures these processes are stopped before run-test.py exits, so that run-test.py can be used to run a sequence of many tests.
FYI, if you would prefer to merge this PR as is, and improve run-test.py's implementation with later PRs, I am OK with that. Approving this PR in case that is your preference. |
It looks like the problem is that the bf_switchd and tofino-model processes are run with sudo, so they cannot be killed easily after the script runs. We need to kill the entire process group.
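Killing the whole process group can be sketched as follows. Here sleep stands in for bf_switchd/tofino-model so the example is self-contained; for the sudo-launched processes from the thread, the kill itself would also need sudo.

```shell
# Sketch: launch a long-running process in its own process group, then
# kill the entire group during cleanup. setsid makes the child its own
# session and process-group leader, so its PID doubles as the PGID.
setsid sleep 300 &
pgid=$!
kill -TERM -- "-$pgid"       # a negative target kills the whole group
wait "$pgid" 2>/dev/null || true
echo "process group $pgid cleaned up"
```

Any children the process forks inherit the group, so one kill reaches all of them, which is exactly what a sequential test runner needs between tests.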
Hence my use of these lines of code to clean up in my other proposed Bash script for running one test:
Signed-off-by: fruffy <fruffy@nyu.edu>
…4lang#65)
Signed-off-by: fruffy <fruffy@nyu.edu>
Co-authored-by: Prathima Kotikalapudi <prathima.kotikalapudi@intel.com>

…4lang#65)
Signed-off-by: fruffy <fruffy@nyu.edu>
Co-authored-by: Prathima Kotikalapudi <prathima.kotikalapudi@intel.com>
Signed-off-by: Andy Fingerhut <andy_fingerhut@alum.wustl.edu>