-
Notifications
You must be signed in to change notification settings - Fork 990
Run ABC passes in parallel #5266
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
WASI doesn't have threading support, so you'll need to add a way to downgrade to not using threads. There is already a |
If you want to go even further, ABC will always convert liberty libraries to internal genlib, it does not work with full liberty data. So you could actually even factor this part out so liberty files are also only proceeded once. This provides very significant time savings if you have a larger (commercial) library. |
I will test this on a few larger designs to see what it does to memory usage but I actually think this should be fine, in my experience with more 'trivial' multithreading of ABC using xargs it doesn't noticeably increase peak memory usage. Another thing to consider is that it might be interesting to sort the extracted netlists by size and always start with the largest once as they will likely take the longest in ABC which could then limit time spent in multithreaded mode if they are queued too late. |
That's a great idea. Could and probably should be done as a separate PR after this lands. |
WASI does: you need to build for the There is a good reason to keep threading support optional: it requires more hostcalls from the runtime, and at least in the browser, it requires |
705cf74
to
7367ef3
Compare
There is a performance issue I need to investigate so it's not ready for review right now. I'm not sure how to set |
I think you need to modify the Line 33 in da01e17
I'm not sure exactly how you tell visual studio how to use the pthread lib, but there is a field in there for preprocessor definitions:
It's probably fine to leave it as-is though. |
7367ef3
to
52556cf
Compare
4fd01fa
to
f71e9e3
Compare
I've updated the PR. Mainly I've added another commit that uses a pool of ABC processes and reuses ABC processes instead of always spawning a new one for every ABC run. This avoids some situations where doing parallel ABC runs could actually be a regression. |
Now that there is a threadpool I think it makes sense to have a scratchpad (Yosys' internal config system) value to set a maximum number of threads. |
Do you want it in this PR or a separate PR? |
Can we get a seperate PR with everything from this PR except the changes to abc.cc? I think I can easily greenlight that part. I might have some questions regarding abc.cc itself. But I think need to look at it a bit longer first... |
Ok I put the non- |
passes/techmap/abc.cc
Outdated
struct AbcModuleState { | ||
const AbcConfig &config; | ||
|
||
int map_autoidx = 0; | ||
std::vector<gate_t> signal_list; | ||
dict<RTLIL::SigBit, int> signal_map; | ||
FfInitVals &initvals; | ||
bool had_init = false; | ||
bool did_run_abc = false; | ||
bool run_abc_err = false; | ||
|
||
bool clk_polarity = false; | ||
bool en_polarity = false; | ||
bool arst_polarity = false; | ||
bool srst_polarity = false; | ||
RTLIL::SigSpec clk_sig, en_sig, arst_sig, srst_sig; | ||
dict<int, std::string> pi_map, po_map; | ||
|
||
int undef_bits_lost = 0; | ||
|
||
std::string tempdir_name; | ||
DeferredLogs run_abc_logs; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I see that AbcModuleState
contains various RTLIL structures, which, primarily due to the IdString interning, are not even safe to read from threads other than the main thread. To me this means that AbcModuleState
itself should only be used on the main thread. (Continued below in this file.)
int num_worker_threads = ThreadPool::pool_size(1, max_threads); | ||
ConcurrentQueue<std::unique_ptr<AbcModuleState>> work_queue(num_worker_threads); | ||
ConcurrentQueue<std::unique_ptr<AbcModuleState>> work_finished_queue; | ||
int work_finished_count = 0; | ||
ConcurrentStack<AbcProcess> process_pool; | ||
ThreadPool worker_threads(num_worker_threads, [&](int){ | ||
while (std::optional<std::unique_ptr<AbcModuleState>> work = | ||
work_queue.pop_front()) { | ||
(*work)->run_abc(process_pool); | ||
work_finished_queue.push_back(std::move(*work)); | ||
} | ||
}); | ||
for (auto &it : assigned_cells) { | ||
clk_polarity = std::get<0>(it.first); | ||
clk_sig = assign_map(std::get<1>(it.first)); | ||
en_polarity = std::get<2>(it.first); | ||
en_sig = assign_map(std::get<3>(it.first)); | ||
arst_polarity = std::get<4>(it.first); | ||
arst_sig = assign_map(std::get<5>(it.first)); | ||
srst_polarity = std::get<6>(it.first); | ||
srst_sig = assign_map(std::get<7>(it.first)); | ||
abc_module(design, mod, script_file, exe_file, liberty_files, genlib_files, constr_file, cleanup, lut_costs, !clk_sig.empty(), "$", | ||
keepff, delay_target, sop_inputs, sop_products, lutin_shared, fast_mode, it.second, show_tempdir, sop_mode, abc_dress, dont_use_cells); | ||
assign_map.set(mod); | ||
// Process ABC results that have already finished before queueing another ABC. | ||
// This should keep our memory usage down. | ||
while (std::optional<std::unique_ptr<AbcModuleState>> work = | ||
work_finished_queue.try_pop_front()) { | ||
(*work)->extract(assign_map, design, mod); | ||
++work_finished_count; | ||
} | ||
std::unique_ptr<AbcModuleState> state = std::make_unique<AbcModuleState>(config, initvals); | ||
state->clk_polarity = std::get<0>(it.first); | ||
state->clk_sig = assign_map(std::get<1>(it.first)); | ||
state->en_polarity = std::get<2>(it.first); | ||
state->en_sig = assign_map(std::get<3>(it.first)); | ||
state->arst_polarity = std::get<4>(it.first); | ||
state->arst_sig = assign_map(std::get<5>(it.first)); | ||
state->srst_polarity = std::get<6>(it.first); | ||
state->srst_sig = assign_map(std::get<7>(it.first)); | ||
state->prepare_module(design, mod, assign_map, it.second, !state->clk_sig.empty(), "$"); | ||
if (num_worker_threads > 0) { | ||
work_queue.push_back(std::move(state)); | ||
} else { | ||
// Just run everything on the main thread. | ||
state->run_abc(process_pool); | ||
work_finished_queue.push_back(std::move(state)); | ||
} | ||
} | ||
work_queue.close(); | ||
while (work_finished_count < static_cast<int>(assigned_cells.size())) { | ||
std::optional<std::unique_ptr<AbcModuleState>> work = | ||
work_finished_queue.pop_front(); | ||
(*work)->extract(assign_map, design, mod); | ||
++work_finished_count; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Here AbcModuleState
is enqueued and passed to the worker threads. This seems really dangerous because even running the destructors of anything containing an IdString would be a data race leading to undefined behavior and also corrupt refcounts in practice. The workers do make sure to return it back to the main thread, so it's not clear whether something like that is happening here, but at the point of writing this, I can't tell whether run_abc
will access or destruct any of the contained main-thread-only RTLIL objects. (Update: After spending too much time reviewing this, my verdict is that this code does have data races but that those might very well be impossible to hit during testing without dedicated stress testing trying to hit those specifically. More importantly, my verdict changed multiple times while I kept looking at more and more code to figure out whether certain scenarios are possible or not.)
Independent of whether this code has any actual race conditions, I'd like to avoid passing ownership or sharing references across threads if the receiving thread isn't actually free to make normal use of the data it receives. If that's not guaranteed, reviewing isolated changes just becomes too involved. Even assuming run_abc
wouldn't use any RTLIL objects, every change to run_abc
or any other AbcModuleState
methods called by run_abc
has to be carefully checked for the absence of any accesses to its RTLIL(-containing) members, but there's not necessarily any indication of that being necessary when looking at the diff of such a future change in isolation.
In contrast, ensuring that we don't give worker threads unguarded access to RTLIL objects in the first place is something that has to be uphold at the point where we enqueue work for the workers, so the code that says "danger: threads ahead" would be the same code that mentions the data that says "danger: main thread only", making it much easier to spot issues during review. This is not a complete or perfect solution, there are some hypothetical sequences of changes that look reasonable when viewing each diff in isolation, but that still would end up tunneling RTLIL data into a worker thread, I just think those would be much less likely to occur naturally.
In this case I think the main-thread-only-data could either be kept separate from the enqueued work-item or if we want to store it within the work item, we should have some main_thread_only<T>
wrapper that has destructor and accessor that log_error
when invoked from a worker thread.
If all that is too restrictive, we first need to come up with a story to at least have data-race-free shared read-only worker-thread RTLIL access concurrent to the main-thread mutating different RTLIL objects (not shared with workers), at which point we could add and use wrapper types for read only RTLIL references that only expose the API that is safe in such a setting.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, I somewhat blundered here, sorry. run_abc()
is the only code that runs on multiple threads, and my intent was that it would only read signal_list
and associated non-RTLIL data structures from AbcModuleState
, and that's almost true ... except where we call signal_str()
, which does dive into some nontrivial RTLIL code outside this module. As it happens I'm pretty sure signal_str()
doesn't bump IdString
refcounts (and it certainly shouldn't; since it doesn't need to so, that would be inefficient), so I don't see any races here myself --- but you're right, making that a hard requirement is too fragile and not maintainable.
So one way forward here is to move BLIF emission back to the main thread and restructure AbcModuleState
so the off-main threads have no direct or indirect references to RTLIL data.
Another way forward is to introduce wrapper types for relevant RTLIL types that are thread-safe and read-only, and require that off-main threads use those types instead of having direct access to RTLIL. I want something like this pretty soon anyway to parallelize other passes, so I think this is probably the way to go here, even though it widens the scope of the work.
I will move this back to draft status for now, come up with a prototype, and we can revisit this in a Discourse thread.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sounds good, looking forward to the Discourse thread.
In case you're wondering what potential race I thought I spotted, it was between a) the uses of signal_str()
on a worker thread, which will reach IdString::c_str
executing global_id_storage_.at(index_)
, and b) the calls of AbcModuleState::prepare_module
on the main thread which contains code like if (module->wire(RTLIL::escape_id(clk_str)) != nullptr)
which constructs a temporary IdString
from the std::string clk_str
to look up a wire by a name. That check only makes sense when that wire might potentially be absent, in which case the IdString itself may also be absent. That could then cause the global_id_storage_
to reallocate a larger buffer to store the new name and that races with the worker thread access even though that is entirely read-only.
While checking the source again to copy out the exact snippet, I did just realize that this isn't reachable from the concurrently running prepare_module
call because that passes "$"
as clk_str and in that case the wire lookups are skipped.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Great catch. We really can't allow main-thread writing to RTLIL in parallel with off-main-thread reading of RTLIL. So this is not the place to start adding off-main-thread reading of RTLIL.
Instead I've refactored the code a bit more in commit "Remove direct RTLIL access from gate_t" so that we don't need to call log_str()
/signal_str()
off the main thread, or indeed access any RTLIL at all off the main thread, and I've put the off-main-thread state in its own struct so that code clearly has no access to RTLIL.
f71e9e3
to
19a05fe
Compare
19a05fe
to
4b0b385
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Finally finished reviewing the entire PR. Some minor things I might have done differently, but that just means it took me a bit longer to understand the code as it is. So no additional notes from me; LGTM.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I thought abc_output_filter
would use some locking to ensure no two threads try to write to RTIL in parallel, but it seems like it doesn't? So I have to un-approve for now (sorry for the confusion).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The multi-threading story looks good to me now. All cross-thread synchronization is done via the higher level primitives in threading.cc/threading.h, there should be no races for the per worker data since the ownership of AbcModuleState
moves to the worker threads and back to the main thread via the queues and there should be no races due to the global RTLIL state since workers now only work with RunAbcState
from which no global RTLIL state is reachable.
`prepare_module()` will have to run on the main thread.
Large circuits can run hundreds or thousands of ABCs in a single AbcPass. For some circuits, some of those ABC runs can run for hundreds of seconds. Running ABCs in parallel with each other and in parallel with main-thread processing (reading and writing BLIF files, copying ABC BLIF output into the design) can give large speedups.
Doing ABC runs in parallel can actually make things slower when every ABC run requires spawning an ABC subprocess --- especially when using popen(), which on glibc does not use vfork(). What seems to happen is that constant fork()ing keeps making the main process data pages copy-on-write, so the main process code that is setting up each ABC call takes a lot of minor page-faults, slowing it down. The solution is pretty straightforward although a little tricky to implement. We just reuse ABC subprocesses. Instead of passing the ABC script name on the command line, we spawn an ABC REPL and pipe a command into it to source the script. When that's done we echo an `ABC_DONE` token instead of exiting. Yosys then puts the ABC process onto a stack which we can pull from the next time we do an ABC run. For one of our large designs, this is an additional 5x speedup of the primary AbcPass. It does 5155 ABC runs, all very small; runtime of the AbcPass goes from 760s to 149s (not very scientific benchmarking but the effect size is large).
From the brief discussion at the Dev JF, I'm fairly confident that Claire intended to just undo the approval, not actually request changes, but GitHub doesn't make it obvious how to do that.
What are the reasons/motivation for this change?
For large circuits
AbcPass
can run ABC hundreds or thousands of times, once per uniqueclkdomain_t
. Some of those ABC runs take a while. Running those ABCs in parallel is possible, because the cells assigned to each ABC run are disjoint. This PR improves the runtime ofAbcPass
on one of our large circuits by 5x (which translates into a 3x speedup of synthesis end-to-end).Explain how this is achieved.
This builds on PR #5239.
Reading and writing RTLIL are not thread-safe for various reasons, and fixing that would be difficult. So for now we stick with reading and writing RTLIL on the main thread. We split up the per-clkdomain work into a "prepare" phase which builds the gate netlist and removes the corresponding cells from the module, a "run ABC" phase which actually runs ABC, and an "extract" phase which processes the results of an ABC run to create fresh cells in the module. Everything happens on the main thread other than the "run ABC" phases. For parallelism, we create a set of worker threads which pull work from a concurrent queue fed by the main thread.
To simplify things and also provide a small performance boost, the writing of
lutdefs.txt
andstdcells.genlib
is factored out so it happens just once per pass instead of once per ABC run.Writing thread-safe code in C++ is very scary, especially in a large existing project like Yosys not designed for multithreading. Fortunately the "run ABC" phase is not large and mostly self-contained so the risk of this PR may be acceptable.
One thread safety problem I had to tackle was logging.
Yosys::log()
is not thread-safe and making it thread-safe would be very invasive. The obvious approach of putting locks around everything would slow down the single-threaded case and not scale well for parallel threads, plus the desired behavior of some of the logging functions w.r.t. concurrent logging calls is not clear. So I've created aDeferredLogs
class which exposes alog()
function which simply captures the logs for a particular work item into a buffer. Eventually, back on the main thread, those deferred logs are printed via the standardlog()
function. If log timing is enabled then the timestamps are not meaningful; we can fix that by extending the logging API so callers can pass in previously captured timestamps, but I'd prefer to do that after my logging PR #5243 has been merged.If applicable, please suggest to reviewers how they can test the change.
The existing Yosys test suite exercises this code fairly well. If we take this PR and especially if we carry on down the road of adding more parallelism, it would be good to run the Yosys test suite with TSAN regularly.