Skip to content

Conversation

rocallahan
Copy link
Contributor

What are the reasons/motivation for this change?

For large circuits AbcPass can run ABC hundreds or thousands of times, once per unique clkdomain_t. Some of those ABC runs take a while. Running those ABCs in parallel is possible, because the cells assigned to each ABC run are disjoint. This PR improves the runtime of AbcPass on one of our large circuits by 5x (which translates into a 3x speedup of synthesis end-to-end).

Explain how this is achieved.

This builds on PR #5239.

Reading and writing RTLIL are not thread-safe for various reasons, and fixing that would be difficult. So for now we stick with reading and writing RTLIL on the main thread. We split up the per-clkdomain work into a "prepare" phase which builds the gate netlist and removes the corresponding cells from the module, a "run ABC" phase which actually runs ABC, and an "extract" phase which processes the results of an ABC run to create fresh cells in the module. Everything happens on the main thread other than the "run ABC" phases. For parallelism, we create a set of worker threads which pull work from a concurrent queue fed by the main thread.

To simplify things and also provide a small performance boost, the writing of lutdefs.txt and stdcells.genlib is factored out so it happens just once per pass instead of once per ABC run.

Writing thread-safe code in C++ is very scary, especially in a large existing project like Yosys not designed for multithreading. Fortunately the "run ABC" phase is not large and mostly self-contained so the risk of this PR may be acceptable.

One thread safety problem I had to tackle was logging. Yosys::log() is not thread-safe and making it thread-safe would be very invasive. The obvious approach of putting locks around everything would slow down the single-threaded case and not scale well for parallel threads, plus the desired behavior of some of the logging functions w.r.t. concurrent logging calls is not clear. So I've created a DeferredLogs class which exposes a log() function which simply captures the logs for a particular work item into a buffer. Eventually, back on the main thread, those deferred logs are printed via the standard log() function. If log timing is enabled then the timestamps are not meaningful; we can fix that by extending the logging API so callers can pass in previously captured timestamps, but I'd prefer to do that after my logging PR #5243 has been merged.

If applicable, please suggest to reviewers how they can test the change.

The existing Yosys test suite exercises this code fairly well. If we take this PR and especially if we carry on down the road of adding more parallelism, it would be good to run the Yosys test suite with TSAN regularly.

@KrystalDelusion
Copy link
Member

WASI doesn't have threading support, so you'll need to add a way to downgrade to not using threads. There is already a DISABLE_ABC_THREADS make option that disables pthread when building ABC, which may still make sense here, but it may also be better to have an ENABLE_THREADS feature option which can be shared with anything else that ends up implementing threads.

@phsauter
Copy link
Contributor

phsauter commented Aug 4, 2025

To simplify things and also provide a small performance boost, the writing of lutdefs.txt and stdcells.genlib is factored out so it happens just once per pass instead of once per ABC run.

If you want to go even further, ABC will always convert liberty libraries to internal genlib, it does not work with full liberty data. So you could actually even factor this part out so liberty files are also only proceeded once. This provides very significant time savings if you have a larger (commercial) library.

@phsauter
Copy link
Contributor

phsauter commented Aug 4, 2025

I will test this on a few larger designs to see what it does to memory usage but I actually think this should be fine, in my experience with more 'trivial' multithreading of ABC using xargs it doesn't noticeably increase peak memory usage.

Another thing to consider is that it might be interesting to sort the extracted netlists by size and always start with the largest once as they will likely take the longest in ABC which could then limit time spent in multithreaded mode if they are queued too late.

@rocallahan
Copy link
Contributor Author

Another thing to consider is that it might be interesting to sort the extracted netlists by size and always start with the largest once as they will likely take the longest in ABC which could then limit time spent in multithreaded mode if they are queued too late.

That's a great idea. Could and probably should be done as a separate PR after this lands.

@whitequark
Copy link
Member

whitequark commented Aug 4, 2025

WASI doesn't have threading support,

WASI does: you need to build for the wasm32-wasip1-threads target instead of wasm32-wasip1. Since Yosys is single-threaded I didn't bother adding a compile-time option, but it may be worth it checking if threads are supported using #ifdef _REENTRANT.

There is a good reason to keep threading support optional: it requires more hostcalls from the runtime, and at least in the browser, it requires SharedArrayBuffer support, which means you need to have some quite annoying workarounds, and deploying the Wasm build from e.g. GitHub Pages becomes very tricky as it doesn't send the right CORS headers.

@ShinyKate ShinyKate requested a review from widlarizer August 4, 2025 18:49
@rocallahan rocallahan force-pushed the abc-parallel branch 3 times, most recently from 705cf74 to 7367ef3 Compare August 5, 2025 09:45
@rocallahan rocallahan marked this pull request as draft August 5, 2025 09:45
@rocallahan
Copy link
Contributor Author

There is a performance issue I need to investigate so it's not ready for review right now.

I'm not sure how to set YOSYS_ENABLE_THREADS in the Windows build. It should build OK on Windows with threads disabled.

@KrystalDelusion
Copy link
Member

I'm not sure how to set YOSYS_ENABLE_THREADS in the Windows build. It should build OK on Windows with threads disabled.

I think you need to modify the .vcxproj file, similar to how the cpp standard is overridden:

sed -i 's,</AdditionalIncludeDirectories>,</AdditionalIncludeDirectories>\n <LanguageStandard>stdcpp17</LanguageStandard>\n <AdditionalOptions>/Zc:__cplusplus %(AdditionalOptions)</AdditionalOptions>,g' "$vcxsrc"/YosysVS/YosysVS.vcxproj.new

I'm not sure exactly how you tell visual studio how to use the pthread lib, but there is a field in there for preprocessor definitions:

<PreprocessorDefinitions>_YOSYS_;_CRT_SECURE_NO_WARNINGS;WIN32;_DEBUG;_CONSOLE;_LIB;%(PreprocessorDefinitions)</PreprocessorDefinitions>

It's probably fine to leave it as-is though.

@rocallahan rocallahan force-pushed the abc-parallel branch 5 times, most recently from 4fd01fa to f71e9e3 Compare August 13, 2025 19:33
@rocallahan rocallahan marked this pull request as ready for review August 13, 2025 21:30
@rocallahan
Copy link
Contributor Author

rocallahan commented Aug 13, 2025

I've updated the PR. Mainly I've added another commit that uses a pool of ABC processes and reuses ABC processes instead of always spawning a new one for every ABC run. This avoids some situations where doing parallel ABC runs could actually be a regression.

@phsauter
Copy link
Contributor

Now that there is a threadpool I think it makes sense to have a scratchpad (Yosys' internal config system) value to set a maximum number of threads.
Its not a must have but a nice to have imo.

@rocallahan
Copy link
Contributor Author

Now that there is a threadpool I think it makes sense to have a scratchpad (Yosys' internal config system) value to set a maximum number of threads. Its not a must have but a nice to have imo.

Do you want it in this PR or a separate PR?

@clairexen
Copy link
Member

Can we get a seperate PR with everything from this PR except the changes to abc.cc? I think I can easily greenlight that part. I might have some questions regarding abc.cc itself. But I think need to look at it a bit longer first...

@rocallahan
Copy link
Contributor Author

Ok I put the non-abc.cc code in PR #5331.

Comment on lines 276 to 297
struct AbcModuleState {
const AbcConfig &config;

int map_autoidx = 0;
std::vector<gate_t> signal_list;
dict<RTLIL::SigBit, int> signal_map;
FfInitVals &initvals;
bool had_init = false;
bool did_run_abc = false;
bool run_abc_err = false;

bool clk_polarity = false;
bool en_polarity = false;
bool arst_polarity = false;
bool srst_polarity = false;
RTLIL::SigSpec clk_sig, en_sig, arst_sig, srst_sig;
dict<int, std::string> pi_map, po_map;

int undef_bits_lost = 0;

std::string tempdir_name;
DeferredLogs run_abc_logs;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see that AbcModuleState contains various RTLIL structures, which, primarily due to the IdString interning, are not even safe to read from threads other than the main thread. To me this means that AbcModuleState itself should only be used on the main thread. (Continued below in this file.)

Comment on lines 2540 to 2602
int num_worker_threads = ThreadPool::pool_size(1, max_threads);
ConcurrentQueue<std::unique_ptr<AbcModuleState>> work_queue(num_worker_threads);
ConcurrentQueue<std::unique_ptr<AbcModuleState>> work_finished_queue;
int work_finished_count = 0;
ConcurrentStack<AbcProcess> process_pool;
ThreadPool worker_threads(num_worker_threads, [&](int){
while (std::optional<std::unique_ptr<AbcModuleState>> work =
work_queue.pop_front()) {
(*work)->run_abc(process_pool);
work_finished_queue.push_back(std::move(*work));
}
});
for (auto &it : assigned_cells) {
clk_polarity = std::get<0>(it.first);
clk_sig = assign_map(std::get<1>(it.first));
en_polarity = std::get<2>(it.first);
en_sig = assign_map(std::get<3>(it.first));
arst_polarity = std::get<4>(it.first);
arst_sig = assign_map(std::get<5>(it.first));
srst_polarity = std::get<6>(it.first);
srst_sig = assign_map(std::get<7>(it.first));
abc_module(design, mod, script_file, exe_file, liberty_files, genlib_files, constr_file, cleanup, lut_costs, !clk_sig.empty(), "$",
keepff, delay_target, sop_inputs, sop_products, lutin_shared, fast_mode, it.second, show_tempdir, sop_mode, abc_dress, dont_use_cells);
assign_map.set(mod);
// Process ABC results that have already finished before queueing another ABC.
// This should keep our memory usage down.
while (std::optional<std::unique_ptr<AbcModuleState>> work =
work_finished_queue.try_pop_front()) {
(*work)->extract(assign_map, design, mod);
++work_finished_count;
}
std::unique_ptr<AbcModuleState> state = std::make_unique<AbcModuleState>(config, initvals);
state->clk_polarity = std::get<0>(it.first);
state->clk_sig = assign_map(std::get<1>(it.first));
state->en_polarity = std::get<2>(it.first);
state->en_sig = assign_map(std::get<3>(it.first));
state->arst_polarity = std::get<4>(it.first);
state->arst_sig = assign_map(std::get<5>(it.first));
state->srst_polarity = std::get<6>(it.first);
state->srst_sig = assign_map(std::get<7>(it.first));
state->prepare_module(design, mod, assign_map, it.second, !state->clk_sig.empty(), "$");
if (num_worker_threads > 0) {
work_queue.push_back(std::move(state));
} else {
// Just run everything on the main thread.
state->run_abc(process_pool);
work_finished_queue.push_back(std::move(state));
}
}
work_queue.close();
while (work_finished_count < static_cast<int>(assigned_cells.size())) {
std::optional<std::unique_ptr<AbcModuleState>> work =
work_finished_queue.pop_front();
(*work)->extract(assign_map, design, mod);
++work_finished_count;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here AbcModuleState is enqueued and passed to the worker threads. This seems really dangerous because even running the destructors of anything containing an IdString would be a data race leading to undefined behavior and also corrupt refcounts in practice. The workers do make sure to return it back to the main thread, so it's not clear whether something like that is happening here, but at the point of writing this, I can't tell whether run_abc will access or destruct any of the contained main-thread-only RTLIL objects. (Update: After spending too much time reviewing this, my verdict is that this code does have data races but that those might very well be impossible to hit during testing without dedicated stress testing trying to hit those specifically. More importantly, my verdict changed multiple times while I kept looking at more and more code to figure out whether certain scenarios are possible or not.)

Independent of whether this code has any actual race conditions, I'd like to avoid passing ownership or sharing references across threads if the receiving thread isn't actually free to make normal use of the data it receives. If that's not guaranteed, reviewing isolated changes just becomes too involved. Even assuming run_abc wouldn't use any RTLIL objects, every change to run_abc or any other AbcModuleState methods called by run_abc has to be carefully checked for the absence of any accesses to its RTLIL(-containing) members, but there's not necessarily any indication of that being necessary when looking at the diff of such a future change in isolation.

In contrast, ensuring that we don't give worker threads unguarded access to RTLIL objects in the first place is something that has to be uphold at the point where we enqueue work for the workers, so the code that says "danger: threads ahead" would be the same code that mentions the data that says "danger: main thread only", making it much easier to spot issues during review. This is not a complete or perfect solution, there are some hypothetical sequences of changes that look reasonable when viewing each diff in isolation, but that still would end up tunneling RTLIL data into a worker thread, I just think those would be much less likely to occur naturally.

In this case I think the main-thread-only-data could either be kept separate from the enqueued work-item or if we want to store it within the work item, we should have some main_thread_only<T> wrapper that has destructor and accessor that log_error when invoked from a worker thread.

If all that is too restrictive, we first need to come up with a story to at least have data-race-free shared read-only worker-thread RTLIL access concurrent to the main-thread mutating different RTLIL objects (not shared with workers), at which point we could add and use wrapper types for read only RTLIL references that only expose the API that is safe in such a setting.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, I somewhat blundered here, sorry. run_abc() is the only code that runs on multiple threads, and my intent was that it would only read signal_list and associated non-RTLIL data structures from AbcModuleState, and that's almost true ... except where we call signal_str(), which does dive into some nontrivial RTLIL code outside this module. As it happens I'm pretty sure signal_str() doesn't bump IdString refcounts (and it certainly shouldn't; since it doesn't need to so, that would be inefficient), so I don't see any races here myself --- but you're right, making that a hard requirement is too fragile and not maintainable.

So one way forward here is to move BLIF emission back to the main thread and restructure AbcModuleState so the off-main threads have no direct or indirect references to RTLIL data.

Another way forward is to introduce wrapper types for relevant RTLIL types that are thread-safe and read-only, and require that off-main threads use those types instead of having direct access to RTLIL. I want something like this pretty soon anyway to parallelize other passes, so I think this is probably the way to go here, even though it widens the scope of the work.

I will move this back to draft status for now, come up with a prototype, and we can revisit this in a Discourse thread.

Copy link
Member

@jix jix Sep 10, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sounds good, looking forward to the Discourse thread.

In case you're wondering what potential race I thought I spotted, it was between a) the uses of signal_str() on a worker thread, which will reach IdString::c_str executing global_id_storage_.at(index_), and b) the calls of AbcModuleState::prepare_module on the main thread which contains code like if (module->wire(RTLIL::escape_id(clk_str)) != nullptr) which constructs a temporary IdString from the std::string clk_str to look up a wire by a name. That check only makes sense when that wire might potentially be absent, in which case the IdString itself may also be absent. That could then cause the global_id_storage_ to reallocate a larger buffer to store the new name and that races with the worker thread access even though that is entirely read-only.

While checking the source again to copy out the exact snippet, I did just realize that this isn't reachable from the concurrently running prepare_module call because that passes "$" as clk_str and in that case the wire lookups are skipped.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great catch. We really can't allow main-thread writing to RTLIL in parallel with off-main-thread reading of RTLIL. So this is not the place to start adding off-main-thread reading of RTLIL.

Instead I've refactored the code a bit more in commit "Remove direct RTLIL access from gate_t" so that we don't need to call log_str()/signal_str() off the main thread, or indeed access any RTLIL at all off the main thread, and I've put the off-main-thread state in its own struct so that code clearly has no access to RTLIL.

@rocallahan rocallahan marked this pull request as draft September 9, 2025 22:57
@rocallahan rocallahan marked this pull request as ready for review September 10, 2025 23:59
Copy link
Member

@clairexen clairexen left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Finally finished reviewing the entire PR. Some minor things I might have done differently, but that just means it took me a bit longer to understand the code as it is. So no additional notes from me; LGTM.

Copy link
Member

@clairexen clairexen left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I thought abc_output_filter would use some locking to ensure no two threads try to write to RTIL in parallel, but it seems like it doesn't? So I have to un-approve for now (sorry for the confusion).

@jix jix assigned jix and unassigned widlarizer Sep 15, 2025
Copy link
Member

@jix jix left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The multi-threading story looks good to me now. All cross-thread synchronization is done via the higher level primitives in threading.cc/threading.h, there should be no races for the per worker data since the ownership of AbcModuleState moves to the worker threads and back to the main thread via the queues and there should be no races due to the global RTLIL state since workers now only work with RunAbcState from which no global RTLIL state is reachable.

@jix jix requested a review from clairexen September 15, 2025 14:33
`prepare_module()` will have to run on the main thread.
Large circuits can run hundreds or thousands of ABCs in a single AbcPass.
For some circuits, some of those ABC runs can run for hundreds of seconds.
Running ABCs in parallel with each other and in parallel with main-thread
processing (reading and writing BLIF files, copying ABC BLIF output into
the design) can give large speedups.
Doing ABC runs in parallel can actually make things slower when every ABC run requires
spawning an ABC subprocess --- especially when using popen(), which on glibc does not
use vfork(). What seems to happen is that constant fork()ing keeps making the main
process data pages copy-on-write, so the main process code that is setting up each ABC
call takes a lot of minor page-faults, slowing it down.

The solution is pretty straightforward although a little tricky to implement.
We just reuse ABC subprocesses. Instead of passing the ABC script name on the command
line, we spawn an ABC REPL and pipe a command into it to source the script. When that's
done we echo an `ABC_DONE` token instead of exiting. Yosys then puts the ABC process
onto a stack which we can pull from the next time we do an ABC run.

For one of our large designs, this is an additional 5x speedup of the primary AbcPass.
It does 5155 ABC runs, all very small; runtime of the AbcPass goes from 760s to 149s
(not very scientific benchmarking but the effect size is large).
@jix jix dismissed clairexen’s stale review September 15, 2025 15:43

From the brief discussion at the Dev JF, I'm fairly confident that Claire intended to just undo the approval, not actually request changes, but GitHub doesn't make it obvious how to do that.

@jix
Copy link
Member

jix commented Sep 15, 2025

I rebased on main, there were some conflicts with the logging changes merged in the meantime. If that passes CI, we're ready to merge this.

This still includes #5331, but I don't mind merging the thread support code as part of this PR and closing the then redundant #5331.

@jix jix merged commit 09742e2 into YosysHQ:main Sep 15, 2025
30 checks passed
@rocallahan rocallahan deleted the abc-parallel branch September 16, 2025 10:17
@KrystalDelusion KrystalDelusion mentioned this pull request Sep 16, 2025
4 tasks
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

7 participants