New llama-run #17554
base: master
Conversation
Force-pushed from 94a98cf to 500a90c
@angt @ggerganov @CISC @ngxson PTAL
Considering these points:
So, I think it's better to repurpose one of the 3 binaries mentioned in my first point instead.
Force-pushed from f839cfb to 5b0c817
I think what would be better to do at this point is improve the usability of
Force-pushed from 5b0c817 to fddbeed
I’m on board with the idea, but can we push ahead with this PR and plan the refactor for a later stage? To be honest, I think you’re the best person to handle the refactoring. You know exactly how you want the architecture to look, and it’s hard for others to guess those specifics. If someone else tries, we risk getting stuck in a loop of corrections, so it’s probably more efficient if you drive that part.
For the most part, the server code has already been refactored so it can be reused in other tools/examples. I'm not sure what kind of refactoring you're talking about.
Moving the relevant code to llama-cli, llama-run, etc. or some other place.
I think your
This way, you simply make
In the future, we can make server a static (or maybe shared) lib target and reuse it in both server and run.
Force-pushed from 0e3f326 to f7a7775
Force-pushed from f347cb1 to 2619c11
People should give this a shot, the UX is quite neat:
Seems to be a better direction now. I'll see if it's worth splitting some changes specific to
One question though: why do we need to split
Btw, maybe a TODO: we can now add multimodal support to the CLI too.
Force-pushed from 2619c11 to f9ae221
I've received mixed feedback in the past regarding granular file separation versus consolidation, so I am unsure of the preferred direction here. run-chat.cpp and run.cpp seem like a reasonable split between chat-focused activities and the other code required to get the tool running.
Force-pushed from f9ae221 to a449065
If somebody could restart the failing builds I'd appreciate it; I don't have any sort of maintainer access anymore, so I'm as limited as a random contributor.
The failed builds are not related to the server, so we can ignore them anyway. Besides, I usually open a mirror PR on my fork to skip the long CI queue on the main repo; you can give that a try.
@ericcurtin I'm about to merge a refactoring that will greatly reduce the complexity of using the server code as a CLI. Here is an example of a very simple CLI implementation via the OAI-compat schema (no HTTP server, the schema is sent directly to server_context): https://gist.github.com/ngxson/f3e18888e88d87184f785bf0d4458bda

I think I can go ahead and iterate from the draft version above (so it will supersede your PR). I think it would be useful to bring

So I'm wondering if it's OK with you if I first release a CLI without

Otherwise, you can also continue the current PR with the new server API (again, based on my gist above).
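As a rough illustration of the idea (not the gist's actual code): the CLI builds the same JSON body a client would POST to /v1/chat/completions and hands it to the in-process server_context instead of going over HTTP. The submit_oai_request() call in the comment below is a hypothetical placeholder for the real task-submission API in tools/server.

    // Sketch only: construct the OAI-compat chat-completion JSON in-process
    // and hand it to the server context instead of POSTing it over HTTP.
    // submit_oai_request() is a hypothetical placeholder, not the real API.
    #include <nlohmann/json.hpp>
    #include <iostream>

    using json = nlohmann::json;

    int main() {
        // Same body a client would send to /v1/chat/completions.
        json request = {
            {"messages", json::array({
                {{"role", "user"}, {"content", "Hello!"}}
            })},
            {"stream", false},
        };

        // Hypothetical direct call, bypassing HTTP entirely:
        //   server_context ctx;                            // placeholder
        //   json result = ctx.submit_oai_request(request); // placeholder
        //   std::cout << result["choices"][0]["message"]["content"] << "\n";

        std::cout << request.dump(2) << std::endl;  // just print the schema here
        return 0;
    }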
@ngxson I'll try to apply your gist here; it might make the build green at least...
Force-pushed from d522e11 to cded4e3
@ngxson changes made, PTAL
ngxson left a comment
At a high level this looks good (it is mergeable if CI is green).
An issue regarding logs should be addressed if you can, but it's not very important; otherwise I will address it when I move the llama-run code to llama-cli. There will be some additional work to allow features like multimodal or speculative decoding to work without having to go through the OAI API (which is inefficient for large input).
tools/run/run.cpp (outdated)
        return 1;
    // Set default parallel processing for better performance
    // This is the same logic as in server.cpp
    if (params.n_parallel == 1 && params.kv_unified == false && !params.has_speculative()) {
we don't need n_parallel here; this should be removed
tools/run/run.cpp (outdated)
    // Set minimal output by default for llama-run (suppress INFO/WARN logs)
    // This must be set before parsing arguments because Docker model resolution
    // triggers logging during the parse phase
    common_log_set_verbosity_thold(-1);
this seems to suppress all error logs too, including the messages printed when there is an invalid argument or when the -h option is used
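One way to soften this while keeping the quiet-by-default behaviour, sketched below under some assumptions (the common_params_parse, common_log_set_verbosity_thold, and LOG_ERR helpers from common/ with their usual signatures, and LLAMA_EXAMPLE_SERVER as the example id; this is not the PR's actual fix): skip the suppression when -h/--help is passed, and re-enable logging if parsing fails so at least a clear error is reported.

    // Sketch: only lower the log threshold when the user did not ask for help,
    // and restore it if argument parsing fails so the failure is reported.
    #include <cstring>
    #include "arg.h"     // common_params_parse (assumed location)
    #include "common.h"  // common_params, LLAMA_EXAMPLE_*
    #include "log.h"     // common_log_set_verbosity_thold, LOG_ERR

    static bool wants_help(int argc, char ** argv) {
        for (int i = 1; i < argc; i++) {
            if (!std::strcmp(argv[i], "-h") || !std::strcmp(argv[i], "--help")) {
                return true;
            }
        }
        return false;
    }

    int main(int argc, char ** argv) {
        common_params params;

        if (!wants_help(argc, argv)) {
            // Quiet logs during parsing (Docker model resolution logs here).
            common_log_set_verbosity_thold(-1);
        }

        if (!common_params_parse(argc, argv, params, LLAMA_EXAMPLE_SERVER)) {
            common_log_set_verbosity_thold(0);  // re-enable so the error shows
            LOG_ERR("%s: failed to parse arguments\n", __func__);
            return 1;
        }

        // ... rest of llama-run ...
        return 0;
    }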
tools/run/CMakeLists.txt (outdated)
    if (NOT LLAMA_HTTPLIB)
        message(FATAL_ERROR "LLAMA_HTTPLIB is OFF, cannot build llama-run. Hint: to skip building run, set -DLLAMA_BUILD_RUN=OFF")
    endif()
unused
tools/run/CMakeLists.txt (outdated)
    ${SERVER_DIR}/server-http.cpp
    ${SERVER_DIR}/server-http.h
unused source files (server-http)
if readline.cpp is an external dependency, it's probably better to explicitly include it in scripts/sync_vendor.py and place it under the vendor directory instead
- Added readline.cpp include
- Created run_chat_mode():
  - Initializes readline with command history
  - Maintains conversation history
  - Applies chat templates to format messages
  - Submits completion tasks to the server queue
  - Displays assistant responses interactively

Signed-off-by: Eric Curtin <eric.curtin@docker.com>
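A stripped-down sketch of that chat loop, for orientation only: it uses the GNU readline headers for illustration (the PR vendors its own readline.cpp), and generate_reply() is a placeholder for the chat-template formatting and server-queue submission described in the commit message.

    // Sketch of the interactive chat loop: read a line, keep history,
    // generate a reply, print it.
    #include <readline/readline.h>
    #include <readline/history.h>
    #include <cstdlib>
    #include <iostream>
    #include <string>
    #include <vector>

    struct chat_msg { std::string role, content; };

    static std::string generate_reply(const std::vector<chat_msg> & history) {
        // Placeholder: the real code formats `history` with the model's chat
        // template and pushes a completion task onto the server queue.
        return "(reply to: " + history.back().content + ")";
    }

    int main() {
        std::vector<chat_msg> history;            // conversation across turns
        while (char * line = readline("> ")) {    // line editing via readline
            std::string user_input(line);
            std::free(line);
            if (user_input.empty()) {
                continue;
            }
            add_history(user_input.c_str());      // up-arrow recall

            history.push_back({"user", user_input});
            const std::string reply = generate_reply(history);
            history.push_back({"assistant", reply});

            std::cout << reply << "\n" << std::flush;
        }
        return 0;
    }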
Force-pushed from cded4e3 to 79069a0