[Tools] Improving integration test system #314
Replies: 8 comments 19 replies
-
I need to metabolize this (unfortunately, my mind is somewhere else right now). However, I have two small comments about the runner. These comments are outside the scope of the configuration file, but are somewhat related to the development of the CI/CD.
-
I haven't actually read @shundroid's new speculation integration test approach yet, but I think this is probably a very nice example for you guys to road-test this proposal?
-
I would suggest a few enhancements to how things are captured in the configuration file. In the following, I assume Dynamatic is invoked like dynamatic %option% (as in the last example below). I would also like to clearly separate, in this discussion, capability from policy: the existence of a feature (a capability) does not imply that we will use it for standard testing (a policy); yet, the absence of one of the fundamental and typical features of such a "language" may mean that tomorrow it will be abandoned when someone needs it or when a new policy is required. This post is only about capability, not policy.
"gcd": {
"option": "--buffer-algorithm fpga20",
},
"binary_search": {
"option": "--timeout 1000",
},
"*": {
"option": "--buffer-algorithm fpl22",
"option": "--timeout 200",
},
"memory/*": {
"option": "--timeout 30",
},
I guess in JSON there is an issue with two keys named "option" in the same object, as in:
"option": "--buffer-algorithm fpga20",
"option": "--timeout 200",
"__DEFAULT": {
"option": "--buffer-algorithm fpl22",
"option": "--timeout 200",
},
"__MEMORY_DEFAULT": {
"option": "--timeout 30",
},
"gcd": {
"option": "--buffer-algorithm fpga20",
},
"binary_search": {
"option": "--timeout 1000",
},
"*": {
"macro": "__DEFAULT",
},
"memory/*": {
"macro": "__DEFAULT",
"macro": "__MEMORY_DEFAULT",
},
Note that entries whose name starts with "__" are not tests themselves but macros meant to be referenced by other entries.
Whether this is a good policy or not is open for discussion. I think it may be handy that a missing include (a reference to a macro that does not exist) is reported as a warning but not considered an error.
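To make the capability concrete, here is a minimal sketch (Python, hypothetical helper names, not part of the current runner) of how such macro expansion could work; it assumes the duplicate-key problem discussed further down is avoided by storing options and macro references as lists:
def expand(entry, config, seen=()):
    # Recursively replace "macros" references with the options they carry,
    # then append the entry's own options.
    options = []
    for name in entry.get("macros", []):
        if name in seen:
            raise ValueError(f"cyclic macro reference: {name}")
        options += expand(config[name], config, seen + (name,))
    return options + entry.get("options", [])

config = {
    "__DEFAULT": {"options": ["--buffer-algorithm fpl22", "--timeout 200"]},
    "__MEMORY_DEFAULT": {"options": ["--timeout 30"]},
    "memory/*": {"macros": ["__DEFAULT", "__MEMORY_DEFAULT"]},
}
print(expand(config["memory/*"], config))
# ['--buffer-algorithm fpl22', '--timeout 200', '--timeout 30']
Whether the later --timeout 30 overrides the earlier --timeout 200, or is flagged as a conflict, is again a policy question rather than a capability one. One could also attach labels to entries, to group tests flexibly: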
"gcd": {
...
"label": "pull",
"label": "performance",
"label": "nightly",
"label": "lsq",
},
"binary_search": {
...
"label": "performance",
"label": "nightly",
},
"memory/*": {
...
"label": "pull",
"label": "nightly",
"label": "lsq",
},
Then one could run only the tests carrying a given label (for instance, all "pull" tests on every pull request).
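As a small illustration (Python, hypothetical names, assuming labels are stored as a list per entry), selecting tests by label is then trivial for the runner:
def tests_with_label(config, label):
    # Return the names of all entries carrying the given label.
    return [name for name, entry in config.items()
            if label in entry.get("labels", [])]

config = {
    "gcd": {"labels": ["pull", "performance", "nightly", "lsq"]},
    "binary_search": {"labels": ["performance", "nightly"]},
    "memory/*": {"labels": ["pull", "nightly", "lsq"]},
}
print(tests_with_label(config, "pull"))  # ['gcd', 'memory/*']
Macros and labels could then be combined, for example: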
"__BUFFER_LANA": {
"option": "--buffer-algorithm fpga20",
},
"__BUFFER_CARMINE": {
"option": "--buffer-algorithm fpl22",
},
"__CONVERSION_LANA": {
"option": "--conversion cfg",
},
"__CONVERSION_AYA": {
"option": "--conversion ftd",
},
"__DEFAULT_PULL": {
"macro": "__BUFFER_LANA",
"macro": "__CONVERSION_LANA",
"label": "pull",
},
"gcd": {
"macro": "__DEFAULT_PULL",
"label": "gcd_basic",
},
"gcd": {
"macro": "__BUFFER_CARMINE",
"macro": "__CONVERSION_AYA",
"label": "gcd_advanced",
"label": "test_ftd",
},
Since these entries have no distinct name and the source files are the same, one needs to rely on the existence of labels to identify them. Is that wise? I guess it is debatable. One could also allow an entry to remove labels or options it would otherwise inherit:
"gcd": {
"macro": "__DEFAULT_PULL",
"macro": "__CONVERSION_AYA",
"label": "gcd_paolo_stuff",
"nolabel": "pull",
"nooption": "--timeout",
},
The semantics should be defined properly, but at worst it could be "if there is a label pull or an option --timeout, remove it".
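A sketch of that removal semantics (Python, hypothetical helper, assuming the accumulated labels and options are lists) could be:
def apply_removals(entry, labels, options):
    # Drop labels listed under "nolabel" and options whose flag matches "nooption".
    labels = [l for l in labels if l not in entry.get("nolabel", [])]
    options = [o for o in options
               if not any(o.startswith(flag) for flag in entry.get("nooption", []))]
    return labels, options

labels, options = apply_removals(
    {"nolabel": ["pull"], "nooption": ["--timeout"]},
    labels=["pull", "gcd_paolo_stuff"],
    options=["--buffer-algorithm fpga20", "--timeout 200"],
)
print(labels, options)
# ['gcd_paolo_stuff'] ['--buffer-algorithm fpga20']
Finally, one could allow a test to override the command used to run it: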
"*": {
"command": "dynamatic %option%",
},
"__SPECULATION": {
"command": "my_script %option%",
"label": "speculation",
},
"single_loop": {
"macro": "__SPECULATION",
"label": "shun_private",
}
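For illustration, the runner-side capability could be as simple as substituting the accumulated options into the per-test command template (Python sketch, hypothetical names):
import shlex
import subprocess

def run_test(entry, options):
    # Expand %option% in the command template and run the resulting command.
    command = entry.get("command", "dynamatic %option%")
    command = command.replace("%option%", " ".join(options))
    return subprocess.run(shlex.split(command)).returncode

# e.g. for the speculation entry above:
# run_test({"command": "my_script %option%"}, ["--timeout 200"])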
-
Hi all! I have a few points; I am splitting them out to make replies easier.
It is my understanding that, currently, this is not the case. The closest thing currently possible to the above is to write all the commands that the Dynamatic frontend should execute into a .dyn script, which the runner then invokes:
# Write .dyn script with appropriate source file name
dyn_file = DYNAMATIC_ROOT / "build" / f"test_{id}.dyn"
write_string_to_file(SCRIPT_CONTENT.format(src_path=c_file), dyn_file)
# ... snip ...
with open(Path(out_dir) / "dynamatic_out.txt", "w") as stdout, \
     open(Path(out_dir) / "dynamatic_err.txt", "w") as stderr:
    # Run test and output result
    exit_code = run_command_with_timeout(
        DYNAMATIC_COMMAND.format(script_path=dyn_file),
        timeout=timeout,
        stdout=stdout,
        stderr=stderr
    )
I agree that this is not optimal.
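This is not the author's code, but one way the snippet above could be loosened without a rewrite is to take the script and command templates from the per-test configuration instead of module-level constants; a sketch reusing the helpers shown above:
from pathlib import Path

def run_one_test(c_file, out_dir, script_template=SCRIPT_CONTENT,
                 command_template=DYNAMATIC_COMMAND, timeout=600):
    # Same flow as above, but each test may supply its own .dyn script
    # template and frontend command. c_file is a Path to the test's C source.
    dyn_file = DYNAMATIC_ROOT / "build" / f"test_{c_file.stem}.dyn"
    write_string_to_file(script_template.format(src_path=c_file), dyn_file)
    with open(Path(out_dir) / "dynamatic_out.txt", "w") as stdout, \
         open(Path(out_dir) / "dynamatic_err.txt", "w") as stderr:
        return run_command_with_timeout(
            command_template.format(script_path=dyn_file),
            timeout=timeout, stdout=stdout, stderr=stderr)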
-
To be fully JSON "compliant", the top level needs to be either an object or an array. Still, many JSON parsers would allow a bare list of key/value pairs and simply add the enclosing braces implicitly. I personally would not rely on it.
Yes, this is also unfortunately not allowed in JSON. It would have to be an array:
"*": {
"options": [
"--buffer-algorithm fpl22",
"--timeout 200"
]
},
For something that primarily needs to be hand-written, I personally prefer YAML, but deciding on a serialization/config format tends to be as divisive as picking between a certain two text editors, so I am sure that others will have other preferences.
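Coming back to the duplicate-key point above: Python's standard json module, like most parsers, silently keeps only the last occurrence, which is a quick way to see why relying on duplicate keys would be fragile:
import json

text = '{"option": "--buffer-algorithm fpl22", "option": "--timeout 200"}'
print(json.loads(text))  # {'option': '--timeout 200'} -- the first option is lost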
-
I am slightly worried that forcing everything into a declarative format (like a JSON file) is not very future-proof: what if we want to add a style of integration test in the future that does not fit into the data model we define for this setup? We would have to re-write the integration setup from scratch. I also worry that it would effectively amount to writing a custom JSON DSL, which makes the setup very hard to learn for new contributors. While I appreciate the simplicity of adding a new integration test by writing a few lines of JSON, I would prefer a setup that makes adding basic integration tests simple while not preventing more complicated setups.
I get the feeling that we are re-inventing the wheel here. There are existing test frameworks out there that have seen a lot of development. Would it be possible to just use one of them? We could simply define each integration test as a small function that calls the relevant tools with the relevant inputs in one such framework. To make this simple, we could define the parameters as classes/types and provide utility functions that abstract away the repeated code. Typical integration test frameworks already provide the benefits discussed below: test selection and filtering, parametric sweeps, and the ability to skip or disable individual tests.
As a rough example, here is what I imagine this could look like. Note that I used GoogleTest because I am familiar with it from XLS, but it is not the only choice. Still, I really would push for a C++ testing framework. Everything that follows is rough pseudo-code to give you an idea; I did not test it.
TEST(BasicIntegrationTests, GCD) {
DynamaticFlags flags = DynamaticFlags(); // Use default flags
RunIntegrationTest("gcd.c", flags);
}
Here is how some of the points mentioned above could map to this framework:
Tests can pull in the default flags and override them at will, or even define their own flags from scratch:
TEST(BasicIntegrationTests, gcd) {
DynamaticFlags flags = DynamaticFlags(); // Use default flags
flags.buffer_algo = "fpga20";
RunIntegrationTest("gcd.c", flags);
}
TEST(BasicIntegrationTests, binary_search) {
DynamaticFlags flags = DynamaticFlags(); // Use default flags
flags.timeout = 1000;
RunIntegrationTest("binary_search.c", flags);
}
TEST(BasicIntegrationTests, schilk_special) {
DynamaticFlags flags = {
.buffer_algo = "schilks_secret_buffer_algo",
.timeout = -1,
...
};
RunIntegrationTest("top_secret.c", flags);
}
The simple approach is a loop:
TEST(BasicIntegrationTests, gcd_sweep) {
for (auto buffer_algo: { "fpga20", "fpl22" }) {
DynamaticFlags flags = DynamaticFlags(); // Use default flags
flags.buffer_algo = buffer_algo;
RunIntegrationTest("gcd.c", flags);
}
}
However, this would run sequentially. Google Test provides a framework for parametric tests that allows a set of parameters to be applied to multiple test suites, and the individual test/parameter pairs can run in parallel. The following sweeps both gcd and binary_search over both buffer algorithms:
class SweepBufferAlgoFixture : public testing::TestWithParam<std::string> {
};
TEST_P(SweepBufferAlgoFixture, gcd) {
std::string buffer_algo = GetParam();
DynamaticFlags flags = DynamaticFlags();
flags.buffer_algo = buffer_algo;
RunIntegrationTest("gcd.c", flags);
}
TEST_P(SweepBufferAlgoFixture, binary_search) {
std::string buffer_algo = GetParam();
DynamaticFlags flags = DynamaticFlags();
flags.buffer_algo = buffer_algo;
RunIntegrationTest("binary_search.c", flags);
}
INSTANTIATE_TEST_SUITE_P(
BufferAlgoSweep,
SweepBufferAlgoFixture,
testing::Values("fpga20", "fpl22")
);
GTest organizes tests into test suites that can be run separately. By default all tests are run, but if we have a known-broken test, it can be excluded (for example by prefixing its name with DISABLED_). It is also possible to skip tests dynamically (with GTEST_SKIP()): for example, a test relying on Gurobi could be skipped if Gurobi is not available.
This maps nicely to C++ inheritance. Imagine the following structure, which roughly mimics the tests you have described:
└── integration_tests
├── binary_test.c
├── gcd.c
├── memory
│ ├── some_mem_related_test.c
│ └── TEST_SUITE.cc
└── TEST_SUITE.cc
(The TEST_SUITE.cc files are where the test definitions live.) In integration_tests/TEST_SUITE.cc:
TEST(BasicIntegrationTests, gcd) {
DynamaticFlags flags = DynamaticFlags(); // Use default flags
flags.buffer_algo = "fpga20";
RunIntegrationTest("gcd.c", flags);
}
In integration_tests/memory/TEST_SUITE.cc:
class MemoryDynamaticFlags : public DynamaticFlags {
public:
MemoryDynamaticFlags() {
// All memory tests use timeout 300
timeout = 300;
}
virtual ~MemoryDynamaticFlags() = default;
};
TEST(MemIntegrationTests, some_mem_related_test) {
DynamaticFlags flags = MemoryDynamaticFlags(); // use memory-specific flags
RunIntegrationTest("some_mem_related_test.c", flags);
}
Google Test provides a test filter flag (--gtest_filter) that selects tests by suite and test name, for example:
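On the CI/runner side, selecting a group is then just an extra argument when invoking the test binary; a minimal Python sketch (the binary name and the group-encoding suite names are hypothetical):
import subprocess

# Run only the tests whose suite name starts with "PullIntegrationTests".
subprocess.run(
    ["./integration_tests", "--gtest_filter=PullIntegrationTests.*"],
    check=True,
)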
See loops/parametrics above. Regarding benchmarking: I don't know the specifics well enough to provide a quick example, but I hope you could imagine how defining such test flows can easily be done in gtest. Let me know what you think :)
-
@schilkp Thank you for your great idea! I have a quick question about tagging/grouping tests in GoogleTest. I know that --gtest_filter can select tests by suite/test name. For example, is there a way for each test to have multiple labels if we use --gtest_filter or other possible methods? Just for reference for other people: people here seem to appreciate flexible tagging/grouping.
-
Thank you for your comments! I am glad to see that a great discussion came out of this proposal. First of all, I would say that the most important point given here is to stop reinventing the wheel and use an existing framework. This is my oversight, since I have not had experience with such tools in the past. Hence, I cannot really comment on the concrete details of moving to such a framework (although some were mentioned here), but I understand at a high level that it is obviously a better solution. My main concern with the previous approach was the potential feature creep making the tool a nightmare to use. Also, JSON configurations are a disaster waiting to happen (apparently it wouldn't be the first time, as @murphe67 mentioned). Another thing that worried me is the fact that @shundroid's testing requires a completely different compilation flow, i.e. overriding the default sequence of commands that the frontend runs.
However, I think that this other point given by @schilkp is much more critical:
I will put it this way: currently I have a Python script (run_integration.py).
Finally, I would like to say something about "groups of tests", or whatever you want to call them. My problem is related to the following point by @schilkp:
So, if I am developing a feature and I make a set of tests for that purpose ("my tests"), it does indeed make sense that I would only want to run these tests (to save time while testing locally). Then, when I am satisfied, I would also need to make sure that all other (let's call them "regular") tests still pass. I believe that we all understand that this is the purpose of testing. And now, when everything works and I merge my feature to main, I would say that it makes sense that "my tests" are added to the set of "regular" tests, since my feature is now a part of main. This is because these tests will need to be "regular" for everyone else, i.e. they will need to be run by everyone else developing other features, to make sure that their feature does not break my feature. For this reason, I do not understand the need to have many different groups of tests. I understand that this is a matter of policy and not capability, as @paolo-ienne said, but I find it hard to justify going out of our way to add capabilities that enable policies that defeat the purpose of (integration) testing.
-
Hello everyone!
We recently introduced an Actions CI workflow and, along with it, a Python script for running integration tests in a nicer way than before. However, this is not enough, because the new integration script only runs integration tests using "vanilla" Dynamatic, i.e. using only the default buffering algorithm when compiling.
This begs the question of how to make the script more flexible without making it complicated and tedious to use. For example, if I am developing some feature and (for whatever reason) I need to run a set of 10 tests with --buffer-algorithm fpga20, I now need to run both the "official" set of tests (to make sure that I did not break anything) and my "custom" set of tests. So, not only does there need to be a way to include/exclude tests (that is better than the one we have right now with --list and --ignore), but it also needs to allow having various options for each test. However, this needs to be done in a way that would not ruin the usability of the integration system by making it horribly convoluted and requiring lots of manual intervention.
Below I will give my suggestion, so please comment on it and propose your own if you have any. My absolute priority is usability, so I would be really happy if you suggested something that better minimizes the amount of manual labor required for using this system. So, after reading this, I would like you to answer the question: would you use this in your day-to-day work, or would you consider it tedious and write your own specific shell script instead?
Proposal
In order to allow running tests with custom settings, there would be a way to define them in a configuration file. For example, having a file integration-test/config.json with the following content:
{
  "gcd": { "buffer-algorithm": "fpga20" },
  "binary_search": { "timeout": 1000 },
  "*": { "buffer-algorithm": "fpl22", "timeout": 200 },
  "memory/*": { "timeout": 300 }
}
I hope that the meaning is self-explanatory; the point is that individual tests could be made to have custom properties, and wildcards would be supported to avoid unnecessary repetition. This would also allow managing tests that are grouped in a subfolder, as shown in the last example. Also, at some point in the future, we plan to add a way to consistently benchmark tests in the Actions workflow (so you would get feedback that your PR ruined/improved the performance), so this would be easily extended to define benchmarking-related settings (e.g. "run test binary_search with all buffering algorithms, but only compare performance using --buffer-algorithm fpga20"; this is just an example that I randomly thought about, please forgive me if it is nonsense).
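To make the intended lookup concrete, here is a rough sketch (hypothetical helper, not committing to a precedence policy) of how the script could resolve the settings for one test, using fnmatch for the wildcards:
import json
from fnmatch import fnmatch

def settings_for(test_name, config):
    # Apply wildcard entries first, then let an exact match override them.
    settings = {}
    for pattern, values in config.items():
        if "*" in pattern and fnmatch(test_name, pattern):
            settings.update(values)
    settings.update(config.get(test_name, {}))
    return settings

config = json.loads("""{
  "gcd": { "buffer-algorithm": "fpga20" },
  "*": { "buffer-algorithm": "fpl22", "timeout": 200 },
  "memory/*": { "timeout": 300 }
}""")
print(settings_for("gcd", config))
# {'buffer-algorithm': 'fpga20', 'timeout': 200}
print(settings_for("memory/test_memory_1", config))
# {'buffer-algorithm': 'fpl22', 'timeout': 300}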
integration-testfolder and the script would read it when running and apply the correct settings. However, one concern is that this file may grow too large, and a proposed solution is to have configuration files per-test, for example, a fileintegration-test/gcd/config.jsonthat only contains settings for thegcdtest. On the other hand, my opinion is that opening different folders and editing different files is more tedious than scrolling through a single file, so I am not a big fan of this.Another goal would be to allow defining different sets of tests, in a more flexible way than the current run/ignore lists. For example, having a file
integration-test/sets.jsonwith the following contents:{ "ignore": ["gcd", "memory/test_memory_1"], "default": ["*", "!ignore"], "paolos_tests": ["kernel_3mm_float", "binary_search"] }With this, sets of tests could be run using:
python run_integration.pywould run thedefaultgrouppython run_integration.py paolos_testswould run thepaolos_testsgrouppython run_integration.py ignorewould run theignoregroup (i.e. the "ignored" tests can also be run as any other set).A problem that arises is how to reconcile the proposed
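A rough sketch of how such a set definition could be resolved (hypothetical helper; it assumes an entry is either a glob pattern, the name of another set, or a "!set" exclusion):
from fnmatch import fnmatch

def resolve_set(name, sets, all_tests):
    # Expand globs, recurse into referenced sets, and apply "!" exclusions in order.
    selected = []
    for item in sets[name]:
        if item.startswith("!"):
            excluded = resolve_set(item[1:], sets, all_tests)
            selected = [t for t in selected if t not in excluded]
        elif item in sets:
            selected += resolve_set(item, sets, all_tests)
        else:
            selected += [t for t in all_tests if fnmatch(t, item)]
    return selected

sets = {
    "ignore": ["gcd", "memory/test_memory_1"],
    "default": ["*", "!ignore"],
    "paolos_tests": ["kernel_3mm_float", "binary_search"],
}
all_tests = ["gcd", "binary_search", "kernel_3mm_float", "memory/test_memory_1"]
print(resolve_set("default", sets, all_tests))
# ['binary_search', 'kernel_3mm_float']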
config.jsonandsets.json, since, for example, what if Paolo wanted to runbinary_searchwithtimeout=80, buffer-algorithm=fpga20when running the setpaolos_tests? One thing that comes to mind is to allow aliases to be defined for tests inconfig.json, e.g.{ "kernel_3mm_float-paolo": { "actual-path": "kernel_3mm_float", "timeout": 80, "buffer-algorithm": "fpga20" } }and then add
kernel_3mm_float-paoloto the list insets.json.Finally, there is a problem of the consistency of these configuration files between main and the current branch. If I change them because I want to run a bunch of tests, what happens when I want to merge? Do I have to just never commit the changes? This is also something that was brought up in a discussion with Paolo: with these integration tests, there should be a policy on what happens when someone adds new tests on their branch. In my opinion, if my feature required additional tests, it would make sense that these tests are added to the "default" set of tests and as such pushed to main. This is because these tests would make sure that no one breaks my feature later.
Conclusion
I understand that the implementation of these wildcards and everything else I mentioned above might be a bit tedious and not pretty, but this is just a suggestion that showcases the intended user experience, since that is my priority, as I said before. So finally, I would like to ask you one more time to be very critical of my idea and of any other proposal given by other people, since the goal is that this system can be used by even the laziest programmer, without him or her getting discouraged due to complexity. Honestly, being a lazy programmer myself, this seems like a bit too much manual labor, so I am looking forward to suggestions and improvements :)