Run ABC passes in parallel #5266

rocallahan · 2025-08-04T04:06:51Z

What are the reasons/motivation for this change?

For large circuits AbcPass can run ABC hundreds or thousands of times, once per unique clkdomain_t. Some of those ABC runs take a while. Running those ABCs in parallel is possible, because the cells assigned to each ABC run are disjoint. This PR improves the runtime of AbcPass on one of our large circuits by 5x (which translates into a 3x speedup of synthesis end-to-end).

Explain how this is achieved.

This builds on PR #5239.

Reading and writing RTLIL are not thread-safe for various reasons, and fixing that would be difficult. So for now we stick with reading and writing RTLIL on the main thread. We split up the per-clkdomain work into a "prepare" phase which builds the gate netlist and removes the corresponding cells from the module, a "run ABC" phase which actually runs ABC, and an "extract" phase which processes the results of an ABC run to create fresh cells in the module. Everything happens on the main thread other than the "run ABC" phases. For parallelism, we create a set of worker threads which pull work from a concurrent queue fed by the main thread.

To simplify things and also provide a small performance boost, the writing of lutdefs.txt and stdcells.genlib is factored out so it happens just once per pass instead of once per ABC run.

Writing thread-safe code in C++ is very scary, especially in a large existing project like Yosys not designed for multithreading. Fortunately the "run ABC" phase is not large and mostly self-contained so the risk of this PR may be acceptable.

One thread safety problem I had to tackle was logging. Yosys::log() is not thread-safe and making it thread-safe would be very invasive. The obvious approach of putting locks around everything would slow down the single-threaded case and not scale well for parallel threads, plus the desired behavior of some of the logging functions w.r.t. concurrent logging calls is not clear. So I've created a DeferredLogs class which exposes a log() function which simply captures the logs for a particular work item into a buffer. Eventually, back on the main thread, those deferred logs are printed via the standard log() function. If log timing is enabled then the timestamps are not meaningful; we can fix that by extending the logging API so callers can pass in previously captured timestamps, but I'd prefer to do that after my logging PR #5243 has been merged.

If applicable, please suggest to reviewers how they can test the change.

The existing Yosys test suite exercises this code fairly well. If we take this PR and especially if we carry on down the road of adding more parallelism, it would be good to run the Yosys test suite with TSAN regularly.

KrystalDelusion · 2025-08-04T08:19:00Z

WASI doesn't have threading support, so you'll need to add a way to downgrade to not using threads. There is already a DISABLE_ABC_THREADS make option that disables pthread when building ABC, which may still make sense here, but it may also be better to have an ENABLE_THREADS feature option which can be shared with anything else that ends up implementing threads.

phsauter · 2025-08-04T10:45:08Z

To simplify things and also provide a small performance boost, the writing of lutdefs.txt and stdcells.genlib is factored out so it happens just once per pass instead of once per ABC run.

If you want to go even further, ABC will always convert liberty libraries to internal genlib, it does not work with full liberty data. So you could actually even factor this part out so liberty files are also only proceeded once. This provides very significant time savings if you have a larger (commercial) library.

phsauter · 2025-08-04T10:56:21Z

I will test this on a few larger designs to see what it does to memory usage but I actually think this should be fine, in my experience with more 'trivial' multithreading of ABC using xargs it doesn't noticeably increase peak memory usage.

Another thing to consider is that it might be interesting to sort the extracted netlists by size and always start with the largest once as they will likely take the longest in ABC which could then limit time spent in multithreaded mode if they are queued too late.

rocallahan · 2025-08-04T12:07:55Z

Another thing to consider is that it might be interesting to sort the extracted netlists by size and always start with the largest once as they will likely take the longest in ABC which could then limit time spent in multithreaded mode if they are queued too late.

That's a great idea. Could and probably should be done as a separate PR after this lands.

whitequark · 2025-08-04T15:18:15Z

WASI doesn't have threading support,

WASI does: you need to build for the wasm32-wasip1-threads target instead of wasm32-wasip1. Since Yosys is single-threaded I didn't bother adding a compile-time option, but it may be worth it checking if threads are supported using #ifdef _REENTRANT.

There is a good reason to keep threading support optional: it requires more hostcalls from the runtime, and at least in the browser, it requires SharedArrayBuffer support, which means you need to have some quite annoying workarounds, and deploying the Wasm build from e.g. GitHub Pages becomes very tricky as it doesn't send the right CORS headers.

rocallahan · 2025-08-05T09:46:19Z

There is a performance issue I need to investigate so it's not ready for review right now.

I'm not sure how to set YOSYS_ENABLE_THREADS in the Windows build. It should build OK on Windows with threads disabled.

KrystalDelusion · 2025-08-06T00:12:43Z

I'm not sure how to set YOSYS_ENABLE_THREADS in the Windows build. It should build OK on Windows with threads disabled.

I think you need to modify the .vcxproj file, similar to how the cpp standard is overridden:

yosys/misc/create_vcxsrc.sh

Line 33 in da01e17

    
           sed -i 's,</AdditionalIncludeDirectories>,</AdditionalIncludeDirectories>\n      <LanguageStandard>stdcpp17</LanguageStandard>\n      <AdditionalOptions>/Zc:__cplusplus %(AdditionalOptions)</AdditionalOptions>,g' "$vcxsrc"/YosysVS/YosysVS.vcxproj.new

I'm not sure exactly how you tell visual studio how to use the pthread lib, but there is a field in there for preprocessor definitions:

<PreprocessorDefinitions>_YOSYS_;_CRT_SECURE_NO_WARNINGS;WIN32;_DEBUG;_CONSOLE;_LIB;%(PreprocessorDefinitions)</PreprocessorDefinitions>

It's probably fine to leave it as-is though.

rocallahan · 2025-08-13T21:32:28Z

I've updated the PR. Mainly I've added another commit that uses a pool of ABC processes and reuses ABC processes instead of always spawning a new one for every ABC run. This avoids some situations where doing parallel ABC runs could actually be a regression.

phsauter · 2025-08-13T22:11:18Z

Now that there is a threadpool I think it makes sense to have a scratchpad (Yosys' internal config system) value to set a maximum number of threads.
Its not a must have but a nice to have imo.

rocallahan · 2025-08-14T22:07:04Z

Now that there is a threadpool I think it makes sense to have a scratchpad (Yosys' internal config system) value to set a maximum number of threads. Its not a must have but a nice to have imo.

Do you want it in this PR or a separate PR?

clairexen · 2025-09-04T14:53:14Z

Can we get a seperate PR with everything from this PR except the changes to abc.cc? I think I can easily greenlight that part. I might have some questions regarding abc.cc itself. But I think need to look at it a bit longer first...

rocallahan · 2025-09-06T04:26:30Z

Ok I put the non-abc.cc code in PR #5331.

jix · 2025-09-09T13:59:18Z

passes/techmap/abc.cc

+struct AbcModuleState {
+	const AbcConfig &config;
+
+	int map_autoidx = 0;
+	std::vector<gate_t> signal_list;
+	dict<RTLIL::SigBit, int> signal_map;
+	FfInitVals &initvals;
+	bool had_init = false;
+	bool did_run_abc = false;
+	bool run_abc_err = false;
+
+	bool clk_polarity = false;
+	bool en_polarity = false;
+	bool arst_polarity = false;
+	bool srst_polarity = false;
+	RTLIL::SigSpec clk_sig, en_sig, arst_sig, srst_sig;
+	dict<int, std::string> pi_map, po_map;
+
+	int undef_bits_lost = 0;
+
+	std::string tempdir_name;
+	DeferredLogs run_abc_logs;


I see that AbcModuleState contains various RTLIL structures, which, primarily due to the IdString interning, are not even safe to read from threads other than the main thread. To me this means that AbcModuleState itself should only be used on the main thread. (Continued below in this file.)

jix · 2025-09-09T17:53:49Z

passes/techmap/abc.cc

+			int num_worker_threads = ThreadPool::pool_size(1, max_threads);
+			ConcurrentQueue<std::unique_ptr<AbcModuleState>> work_queue(num_worker_threads);
+			ConcurrentQueue<std::unique_ptr<AbcModuleState>> work_finished_queue;
+			int work_finished_count = 0;
+			ConcurrentStack<AbcProcess> process_pool;
+			ThreadPool worker_threads(num_worker_threads, [&](int){
+					while (std::optional<std::unique_ptr<AbcModuleState>> work =
+							work_queue.pop_front()) {
+						(*work)->run_abc(process_pool);
+						work_finished_queue.push_back(std::move(*work));
+					}
+				});
 			for (auto &it : assigned_cells) {
-				clk_polarity = std::get<0>(it.first);
-				clk_sig = assign_map(std::get<1>(it.first));
-				en_polarity = std::get<2>(it.first);
-				en_sig = assign_map(std::get<3>(it.first));
-				arst_polarity = std::get<4>(it.first);
-				arst_sig = assign_map(std::get<5>(it.first));
-				srst_polarity = std::get<6>(it.first);
-				srst_sig = assign_map(std::get<7>(it.first));
-				abc_module(design, mod, script_file, exe_file, liberty_files, genlib_files, constr_file, cleanup, lut_costs, !clk_sig.empty(), "$",
-						keepff, delay_target, sop_inputs, sop_products, lutin_shared, fast_mode, it.second, show_tempdir, sop_mode, abc_dress, dont_use_cells);
-				assign_map.set(mod);
+				// Process ABC results that have already finished before queueing another ABC.
+				// This should keep our memory usage down.
+				while (std::optional<std::unique_ptr<AbcModuleState>> work =
+						work_finished_queue.try_pop_front()) {
+					(*work)->extract(assign_map, design, mod);
+					++work_finished_count;
+				}
+				std::unique_ptr<AbcModuleState> state = std::make_unique<AbcModuleState>(config, initvals);
+				state->clk_polarity = std::get<0>(it.first);
+				state->clk_sig = assign_map(std::get<1>(it.first));
+				state->en_polarity = std::get<2>(it.first);
+				state->en_sig = assign_map(std::get<3>(it.first));
+				state->arst_polarity = std::get<4>(it.first);
+				state->arst_sig = assign_map(std::get<5>(it.first));
+				state->srst_polarity = std::get<6>(it.first);
+				state->srst_sig = assign_map(std::get<7>(it.first));
+				state->prepare_module(design, mod, assign_map, it.second, !state->clk_sig.empty(), "$");
+				if (num_worker_threads > 0) {
+					work_queue.push_back(std::move(state));
+				} else {
+					// Just run everything on the main thread.
+					state->run_abc(process_pool);
+					work_finished_queue.push_back(std::move(state));
+				}
+			}
+			work_queue.close();
+			while (work_finished_count < static_cast<int>(assigned_cells.size())) {
+				std::optional<std::unique_ptr<AbcModuleState>> work =
+					work_finished_queue.pop_front();
+				(*work)->extract(assign_map, design, mod);
+				++work_finished_count;


Here AbcModuleState is enqueued and passed to the worker threads. This seems really dangerous because even running the destructors of anything containing an IdString would be a data race leading to undefined behavior and also corrupt refcounts in practice. The workers do make sure to return it back to the main thread, so it's not clear whether something like that is happening here, but at the point of writing this, I can't tell whether run_abc will access or destruct any of the contained main-thread-only RTLIL objects. (Update: After spending too much time reviewing this, my verdict is that this code does have data races but that those might very well be impossible to hit during testing without dedicated stress testing trying to hit those specifically. More importantly, my verdict changed multiple times while I kept looking at more and more code to figure out whether certain scenarios are possible or not.)

Independent of whether this code has any actual race conditions, I'd like to avoid passing ownership or sharing references across threads if the receiving thread isn't actually free to make normal use of the data it receives. If that's not guaranteed, reviewing isolated changes just becomes too involved. Even assuming run_abc wouldn't use any RTLIL objects, every change to run_abc or any other AbcModuleState methods called by run_abc has to be carefully checked for the absence of any accesses to its RTLIL(-containing) members, but there's not necessarily any indication of that being necessary when looking at the diff of such a future change in isolation.

In contrast, ensuring that we don't give worker threads unguarded access to RTLIL objects in the first place is something that has to be uphold at the point where we enqueue work for the workers, so the code that says "danger: threads ahead" would be the same code that mentions the data that says "danger: main thread only", making it much easier to spot issues during review. This is not a complete or perfect solution, there are some hypothetical sequences of changes that look reasonable when viewing each diff in isolation, but that still would end up tunneling RTLIL data into a worker thread, I just think those would be much less likely to occur naturally.

In this case I think the main-thread-only-data could either be kept separate from the enqueued work-item or if we want to store it within the work item, we should have some main_thread_only<T> wrapper that has destructor and accessor that log_error when invoked from a worker thread.

If all that is too restrictive, we first need to come up with a story to at least have data-race-free shared read-only worker-thread RTLIL access concurrent to the main-thread mutating different RTLIL objects (not shared with workers), at which point we could add and use wrapper types for read only RTLIL references that only expose the API that is safe in such a setting.

Yes, I somewhat blundered here, sorry. run_abc() is the only code that runs on multiple threads, and my intent was that it would only read signal_list and associated non-RTLIL data structures from AbcModuleState, and that's almost true ... except where we call signal_str(), which does dive into some nontrivial RTLIL code outside this module. As it happens I'm pretty sure signal_str() doesn't bump IdString refcounts (and it certainly shouldn't; since it doesn't need to so, that would be inefficient), so I don't see any races here myself --- but you're right, making that a hard requirement is too fragile and not maintainable.

So one way forward here is to move BLIF emission back to the main thread and restructure AbcModuleState so the off-main threads have no direct or indirect references to RTLIL data.

Another way forward is to introduce wrapper types for relevant RTLIL types that are thread-safe and read-only, and require that off-main threads use those types instead of having direct access to RTLIL. I want something like this pretty soon anyway to parallelize other passes, so I think this is probably the way to go here, even though it widens the scope of the work.

I will move this back to draft status for now, come up with a prototype, and we can revisit this in a Discourse thread.

Sounds good, looking forward to the Discourse thread.

In case you're wondering what potential race I thought I spotted, it was between a) the uses of signal_str() on a worker thread, which will reach IdString::c_str executing global_id_storage_.at(index_), and b) the calls of AbcModuleState::prepare_module on the main thread which contains code like if (module->wire(RTLIL::escape_id(clk_str)) != nullptr) which constructs a temporary IdString from the std::string clk_str to look up a wire by a name. That check only makes sense when that wire might potentially be absent, in which case the IdString itself may also be absent. That could then cause the global_id_storage_ to reallocate a larger buffer to store the new name and that races with the worker thread access even though that is entirely read-only.

While checking the source again to copy out the exact snippet, I did just realize that this isn't reachable from the concurrently running prepare_module call because that passes "$" as clk_str and in that case the wire lookups are skipped.

Great catch. We really can't allow main-thread writing to RTLIL in parallel with off-main-thread reading of RTLIL. So this is not the place to start adding off-main-thread reading of RTLIL.

Instead I've refactored the code a bit more in commit "Remove direct RTLIL access from gate_t" so that we don't need to call log_str()/signal_str() off the main thread, or indeed access any RTLIL at all off the main thread, and I've put the off-main-thread state in its own struct so that code clearly has no access to RTLIL.

clairexen

Finally finished reviewing the entire PR. Some minor things I might have done differently, but that just means it took me a bit longer to understand the code as it is. So no additional notes from me; LGTM.

clairexen

I thought abc_output_filter would use some locking to ensure no two threads try to write to RTIL in parallel, but it seems like it doesn't? So I have to un-approve for now (sorry for the confusion).

jix

The multi-threading story looks good to me now. All cross-thread synchronization is done via the higher level primitives in threading.cc/threading.h, there should be no races for the per worker data since the ownership of AbcModuleState moves to the worker threads and back to the main thread via the queues and there should be no races due to the global RTLIL state since workers now only work with RunAbcState from which no global RTLIL state is reachable.

`prepare_module()` will have to run on the main thread.

Large circuits can run hundreds or thousands of ABCs in a single AbcPass. For some circuits, some of those ABC runs can run for hundreds of seconds. Running ABCs in parallel with each other and in parallel with main-thread processing (reading and writing BLIF files, copying ABC BLIF output into the design) can give large speedups.

Doing ABC runs in parallel can actually make things slower when every ABC run requires spawning an ABC subprocess --- especially when using popen(), which on glibc does not use vfork(). What seems to happen is that constant fork()ing keeps making the main process data pages copy-on-write, so the main process code that is setting up each ABC call takes a lot of minor page-faults, slowing it down. The solution is pretty straightforward although a little tricky to implement. We just reuse ABC subprocesses. Instead of passing the ABC script name on the command line, we spawn an ABC REPL and pipe a command into it to source the script. When that's done we echo an `ABC_DONE` token instead of exiting. Yosys then puts the ABC process onto a stack which we can pull from the next time we do an ABC run. For one of our large designs, this is an additional 5x speedup of the primary AbcPass. It does 5155 ABC runs, all very small; runtime of the AbcPass goes from 760s to 149s (not very scientific benchmarking but the effect size is large).

From the brief discussion at the Dev JF, I'm fairly confident that Claire intended to just undo the approval, not actually request changes, but GitHub doesn't make it obvious how to do that.

jix · 2025-09-15T15:45:32Z

I rebased on main, there were some conflicts with the logging changes merged in the meantime. If that passes CI, we're ready to merge this.

This still includes #5331, but I don't mind merging the thread support code as part of this PR and closing the then redundant #5331.

ShinyKate requested a review from widlarizer August 4, 2025 18:49

rocallahan force-pushed the abc-parallel branch 3 times, most recently from 705cf74 to 7367ef3 Compare August 5, 2025 09:45

rocallahan marked this pull request as draft August 5, 2025 09:45

rocallahan force-pushed the abc-parallel branch from 7367ef3 to 52556cf Compare August 11, 2025 22:43

ShinyKate assigned widlarizer Aug 12, 2025

rocallahan force-pushed the abc-parallel branch 5 times, most recently from 4fd01fa to f71e9e3 Compare August 13, 2025 19:33

rocallahan marked this pull request as ready for review August 13, 2025 21:30

rocallahan mentioned this pull request Sep 6, 2025

Add thread support code #5331

Closed

jix requested changes Sep 9, 2025

View reviewed changes

rocallahan marked this pull request as draft September 9, 2025 22:57

rocallahan force-pushed the abc-parallel branch from f71e9e3 to 19a05fe Compare September 10, 2025 23:58

rocallahan marked this pull request as ready for review September 10, 2025 23:59

rocallahan force-pushed the abc-parallel branch from 19a05fe to 4b0b385 Compare September 11, 2025 06:15

clairexen approved these changes Sep 15, 2025

View reviewed changes

clairexen previously requested changes Sep 15, 2025

View reviewed changes

jix assigned jix and unassigned widlarizer Sep 15, 2025

jix approved these changes Sep 15, 2025

View reviewed changes

jix requested a review from clairexen September 15, 2025 14:33

rocallahan added 5 commits September 15, 2025 17:01

Split abc_module() into prepare_module() and run_abc()

13b3418

`prepare_module()` will have to run on the main thread.

Only write out stdcells/lutcosts once for all ABC runs

222f457

Remove direct RTLIL access from gate_t

38f8165

jix force-pushed the abc-parallel branch from 4b0b385 to ae0ca75 Compare September 15, 2025 15:39

jix merged commit 09742e2 into YosysHQ:main Sep 15, 2025
30 checks passed

rocallahan deleted the abc-parallel branch September 16, 2025 10:17

KrystalDelusion mentioned this pull request Sep 16, 2025

ABC full custom flow #4592

Draft

4 tasks

jix mentioned this pull request Sep 22, 2025

extract_fa: Add parallel processing of partitions #5373

Open

Run ABC passes in parallel #5266

Run ABC passes in parallel #5266

Uh oh!

Conversation

rocallahan commented Aug 4, 2025

Uh oh!

KrystalDelusion commented Aug 4, 2025

Uh oh!

phsauter commented Aug 4, 2025

Uh oh!

phsauter commented Aug 4, 2025

Uh oh!

rocallahan commented Aug 4, 2025

Uh oh!

whitequark commented Aug 4, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

rocallahan commented Aug 5, 2025

Uh oh!

KrystalDelusion commented Aug 6, 2025

Uh oh!

rocallahan commented Aug 13, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

phsauter commented Aug 13, 2025

Uh oh!

rocallahan commented Aug 14, 2025

Uh oh!

clairexen commented Sep 4, 2025

Uh oh!

rocallahan commented Sep 6, 2025

Uh oh!

jix Sep 9, 2025

Choose a reason for hiding this comment

Uh oh!

jix Sep 9, 2025

Choose a reason for hiding this comment

Uh oh!

rocallahan Sep 9, 2025

Choose a reason for hiding this comment

Uh oh!

jix Sep 10, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

rocallahan Sep 10, 2025

Choose a reason for hiding this comment

Uh oh!

clairexen left a comment

Choose a reason for hiding this comment

Uh oh!

clairexen left a comment

Choose a reason for hiding this comment

Uh oh!

jix left a comment

Choose a reason for hiding this comment

Uh oh!

jix commented Sep 15, 2025

Uh oh!

Uh oh!

Uh oh!

whitequark commented Aug 4, 2025 •

edited

Loading

rocallahan commented Aug 13, 2025 •

edited

Loading

jix Sep 10, 2025 •

edited

Loading