streaming-llm-api

Step 0: reopen in devcontainer

Get it up and running:

  • Start the Flask websocket server with gunicorn: gunicorn -b 0.0.0.0:5000 --workers 4 --threads 100 app.flask_socket:app. This is the service the browser establishes a websocket with, and it listens to RabbitMQ.

  • Start the other backend, which actually receives the question: python another_backend.py.

  • Open localhost:5000 in your browser (please watch this page; the essence of this repo is showing the response to the user phrase by phrase...)

  • Use Postman or curl to send in a question like the ones below; the localhost:5000 page should then show the answer

    curl --location 'localhost:5001/questions' \
    --header 'Content-Type: application/json' \
    --data '{"question_id": "question1", "question": "how to play bass"}'
    
    curl --location 'localhost:5001/questions' \
    --header 'Content-Type: application/json' \
    --data '{"question_id": "question2", "question": "how to play piano"}'
    

How it works

  • There are 2 backends in this repo.
  • The first backend is app/flask_socket.py. This is the backend the browser establishes a websocket with, and it listens to RabbitMQ.
  • The second backend is another_backend.py. This is the backend that answers the user's question (e.g. how to play piano) and streams OpenAI's streamed response into RabbitMQ (app/llm.py), where it is consumed by the flask socket backend (which listens to RabbitMQ).
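The flow above can be sketched with stdlib stand-ins. Here queue.Queue plays the role of RabbitMQ and fake_openai_stream plays the role of OpenAI's streamed response; both are assumptions for illustration, not the repo's actual code:

```python
import queue

# Toy stand-in for RabbitMQ (an assumption for illustration; the
# real app publishes to an actual RabbitMQ queue).
rabbit = queue.Queue()

def fake_openai_stream():
    # Stand-in for OpenAI's streamed response (hypothetical phrases).
    yield from ["Here are ", "the basic ", "steps."]

def another_backend():
    # Like app/llm.py: publish each phrase as soon as it arrives,
    # instead of waiting for the full answer.
    for phrase in fake_openai_stream():
        rabbit.put(phrase)
    rabbit.put(None)  # sentinel: the answer is complete

def flask_socket_backend():
    # Like app/flask_socket.py: consume phrases and forward each one
    # to the browser over the websocket immediately.
    received = []
    while (phrase := rabbit.get()) is not None:
        received.append(phrase)  # socketio.emit(...) in the real app
    return received

another_backend()
print("".join(flask_socket_backend()))  # → Here are the basic steps.
```

The browser sees each phrase as it is consumed, rather than the joined string all at once.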

Why do this

  • The user does not want to wait for the whole response to be generated before it is sent to them.
  • OpenAI starts to respond after about 1 second and finishes at about 3 seconds for a question like how to play piano.
  • So if we don't stream, the user has to wait the full 3 seconds! (You can observe this in the log when you trigger app/llm.py.)

How it can be implemented in a production environment

  • Suppose we have a chat app that involves some personal data, done the RAG way.
  • The user posts a question like "how to play bass". Here the frontend needs to send the question_id to the backend, instead of the backend generating the id.
  • The backend receives the question and saves it into the db with the frontend-generated id, along with admin info such as who posted the question and when.
  • After the save (which is super fast), it can start using app/llm.py to put the response into RabbitMQ word by word, but each message must carry the question_id. Unlike this demo, which opens 2 websockets for 2 questions, an actual implementation gives 1 user 1 websocket for all chats, so we need to know which conversation each incoming word/phrase belongs to: each one must be labeled with its question_id. The message actually put into RabbitMQ will therefore be something like json.dumps({"question_id": "question1", "phrase": "Here are the basic steps to start learning how to play bass: "}), so the frontend knows where to append the results.
  • After all the words/phrases are streamed out, another_backend finally returns a 200 response, so the frontend knows the question is fully answered!
  • another_backend knows both the user input and the AI output, so it is responsible for saving the conversation into the database for fetching later (when the user refreshes the page, they need to see the past conversation!)
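The labeling described above can be sketched like this. make_phrase_message, route_phrase, and the "phrase" field name are hypothetical helpers; the README only fixes the idea that every message carries its question_id:

```python
import json

def make_phrase_message(question_id: str, phrase: str) -> str:
    # Hypothetical helper: wrap each streamed phrase in an envelope
    # carrying its question_id before publishing to RabbitMQ.
    return json.dumps({"question_id": question_id, "phrase": phrase})

def route_phrase(message: str, conversations: dict) -> None:
    # Hypothetical frontend-side routing: look up the conversation
    # via question_id and append the phrase to it.
    body = json.loads(message)
    conversations.setdefault(body["question_id"], []).append(body["phrase"])

conversations = {}
route_phrase(make_phrase_message("question1", "Here are the basic steps "), conversations)
route_phrase(make_phrase_message("question1", "to start learning bass."), conversations)
print("".join(conversations["question1"]))
```

With this envelope, phrases from different questions can safely share one websocket, because the receiver always knows which conversation to append to.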

About

Try streaming, so the response looks faster.
