-
Notifications
You must be signed in to change notification settings - Fork 13.9k
common : add minimalist multi-thread progress bar #17602
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: master
Are you sure you want to change the base?
common : add minimalist multi-thread progress bar #17602
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Unless I missed something, I can't find any references to std::thread in download.cpp - do we actually using more than 1 thread here?
And AFAIK, httplib::Client is single-threaded.
Edit: nevermind, see next comment
common/download.cpp
Outdated
| static std::mutex mutex; | ||
| static std::map<std::thread::id, int> lines; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This looks like a local variable at first glance, but because it's static, that effectively make it a global variable.
IMO we should avoid global variable if possible. Probably a better way is to make a dedicated class to manage threads and implement the mutex as a member of the class.
|
Hmm, nevermind, I found it: the multi-threads is launched by In this case, I think it's better to introduce a mutex at This is the code where threads are created: std::vector<std::future<bool>> futures_download;
for (auto const & item : urls) {
futures_download.push_back(std::async(std::launch::async, [bearer_token, offline](const std::pair<std::string, std::string> & it) -> bool {
return common_download_file_single(it.first, it.second, bearer_token, offline);
}, item));
}The cleaner idea looks like this (pseudo-code): std::mutex mutex_progress;
std::vector<float> progress(urls.size(), 0); // from 0.0 to 100.0
auto update_progress = [&mutex_progress, &progress](size_t index, float value) {
std::lock_guard<std::mutex> lock(mutex_progress);
progress[index] = value;
// print all progress here
};
std::vector<std::future<bool>> futures_download;
for (auto const & item : urls) {
futures_download.push_back(std::async(std::launch::async, [bearer_token, offline, update_progress](const std::pair<std::string, std::string> & it) -> bool {
return common_download_file_single(it.first, it.second, bearer_token, offline, update_progress);
}, item));
} |
I disagree on that, moving the lock to the caller is error-prone. Statics are generally avoided for good reasons, but in this case it fits perfectly as there is physically only one terminal to share. The only bad thing is the static map, but it keeps the code extremely simple and works for the current usage (avoiding the need to change all prototypes). When wefully remove curl, we can refactor the entire download code for better structure and address this properly. |
Which kind of error can we get?
You can't assume this function will always be called in terminal application unless you explicit specify it in the function name or documentation. Currently, there is nothing prevent someone from calling Edit: the given example does not directly related to the placement of mutex, but I just want to highlight the fact that functions should be well-documented especially in term of thread safety |
We need a function to print the progress bar, as it could be called from many places. By moving the mutex outside of the function, you are literally asking for a non thread-safe function to be synchronized by the caller, which is error-prone by design. As this function shares a unique tty for the whole lifetime of the process, a static mutex is the perfect scenario here. There are no instances to handle.
The function is designed to be safe regardless of the context, whether it's running in a terminal or not.
It will work in that case as well, you just call the function from anywhere. If multiple threads call it, you will see multiple progress bars, which is the expected behavior :) |
|
Since #17427 is merged, I'll update the implementation to properly cleanup the map in all cases. |
In server's use case, I would expect to have an abstracted implementation of the progress bar (whose the state can be reported into webui). Each progress instance of the download should have their own progress bar and they none depends on the other. Probably this implementation is OK for just terminal and we can improve it later. But still, having a static mutex inside a function is obviously an anti-pattern (it should ideally in global scope) Probably better to move it to |
c180d51 to
09717b6
Compare
|
Totally agree, for the WebUI we'll need a better abstraction of the downloader, to completly remove all the tty code.
That wouldn't be good. The progress bar needs a tty fd, while the log goes to stderr (for a modern process) or a dedicated fd. They shouldn't be mixed (you don't want a progress bar in your logs anyway). |
Signed-off-by: Adrien Gallouët <angt@huggingface.co>
09717b6 to
4387ab2
Compare
|
The new implementation works like the old one but cleanup the map in all cases thanks to RAII. Also, instead of the thread id, we now use the instance itself to track the bar, which is more flexible. |
I don't mean the progress bar must use the same tty as log, but it should use the same mutex as log. In theory, that should be the only mutex controlling output to console/tty/file |
Actually, this is a known anti-pattern we need to avoid. Since the download is a long process exposed to interrupts (signals, network loss, disk full...) sharing the mutex creates a guaranteed deadlock if we need to log something. The mutex is absolutely not here to protect the fd, it's purely cosmetic to ensure a clean display when everything goes well as we share the same tty. |
|
I cannot think of any cases that it can result in a deadlock:
At least, that's how I envision the system to work. Also, I think we try to reduce the number of |
Just to note that, download is a long process, but Edit: even better, I think we can just implement a new log macro: |
I intentionally kept the bar simple without specifying part numbers (which ultimately don't matter much) the only thing we care about is tracking progress