streaming-llm-api

Step 0: reopen in devcontainer

Get it up and running:

  • Start the Flask websocket server with gunicorn: gunicorn -b 0.0.0.0:5000 --workers 4 --threads 100 app.flask_socket:app. This is the service the browser establishes a websocket with, and it listens to RabbitMQ.

  • Start the other backend, which actually receives the question: python another_backend.py.

  • Open localhost:5000 in your browser (please watch this page; the essence of this repo is showing the response to the user phrase by phrase...)

  • Use Postman or curl to send in a question like the ones below; the localhost:5000 page should then show the answer

    curl --location 'localhost:5001/questions' \
    --header 'Content-Type: application/json' \
    --data '{"question_id": "question1", "question": "how to play bass"}'
    
    curl --location 'localhost:5001/questions' \
    --header 'Content-Type: application/json' \
    --data '{"question_id": "question2", "question": "how to play piano"}'
    

How it works

  • There are 2 backends in this repo.
  • The first backend is app/flask_socket.py. This is the backend the browser establishes a websocket with, and it listens to RabbitMQ.
  • The second backend is another_backend.py. This is the backend that answers the user's question (e.g. how to play piano) and streams OpenAI's streamed response into RabbitMQ (app/llm.py), where it is consumed by the flask socket backend (which listens to RabbitMQ).
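The flow above can be sketched with stdlib stand-ins. Here queue.Queue plays the role of RabbitMQ and fake_openai_stream plays the role of OpenAI's streamed response; both are assumptions for illustration, not the repo's actual code:

```python
import queue

# Toy stand-in for RabbitMQ (an assumption for illustration; the
# real app publishes to an actual RabbitMQ queue).
rabbit = queue.Queue()

def fake_openai_stream():
    # Stand-in for OpenAI's streamed response (hypothetical phrases).
    yield from ["Here are ", "the basic ", "steps."]

def another_backend():
    # Like app/llm.py: publish each phrase as soon as it arrives,
    # instead of waiting for the full answer.
    for phrase in fake_openai_stream():
        rabbit.put(phrase)
    rabbit.put(None)  # sentinel: the answer is complete

def flask_socket_backend():
    # Like app/flask_socket.py: consume phrases and forward each one
    # to the browser over the websocket immediately.
    received = []
    while (phrase := rabbit.get()) is not None:
        received.append(phrase)  # socketio.emit(...) in the real app
    return received

another_backend()
print("".join(flask_socket_backend()))  # → Here are the basic steps.
```

The browser sees each phrase as it is consumed, rather than the joined string all at once.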

Why do this

  • The user does not want to wait for the whole response to be generated before it is sent to them.
  • OpenAI starts to respond after about 1 second and finishes at about 3 seconds for a question like how to play piano.
  • So if we don't stream, the user has to wait the full 3 seconds! (You can observe this in the log when you trigger app/llm.py.)

How it can be implemented in a production environment

  • Suppose we have a chat app that involves some personal data, done the RAG way.
  • The user posts a question like "how to play bass". Here the frontend needs to send the question_id to the backend, instead of the backend generating the id.
  • The backend receives the question and saves it into the db with the frontend-generated id, along with admin info such as who posted the question and when.
  • After the save (which is super fast), it can start using app/llm.py to put the response into RabbitMQ word by word, but each message must carry the question_id. Unlike this demo, which opens 2 websockets for 2 questions, an actual implementation gives 1 user 1 websocket for all chats, so we need to know which conversation each incoming word/phrase belongs to: each one must be labeled with its question_id. The message actually put into RabbitMQ will therefore be something like json.dumps({"question_id": "question1", "phrase": "Here are the basic steps to start learning how to play bass: "}), so the frontend knows where to append the results.
  • After all the words/phrases are streamed out, another_backend finally returns a 200 response, so the frontend knows the question is fully answered!
  • another_backend knows both the user input and the AI output, so it is responsible for saving the conversation into the database for fetching later (when the user refreshes the page, they need to see the past conversation!)
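The labeling described above can be sketched like this. make_phrase_message, route_phrase, and the "phrase" field name are hypothetical helpers; the README only fixes the idea that every message carries its question_id:

```python
import json

def make_phrase_message(question_id: str, phrase: str) -> str:
    # Hypothetical helper: wrap each streamed phrase in an envelope
    # carrying its question_id before publishing to RabbitMQ.
    return json.dumps({"question_id": question_id, "phrase": phrase})

def route_phrase(message: str, conversations: dict) -> None:
    # Hypothetical frontend-side routing: look up the conversation
    # via question_id and append the phrase to it.
    body = json.loads(message)
    conversations.setdefault(body["question_id"], []).append(body["phrase"])

conversations = {}
route_phrase(make_phrase_message("question1", "Here are the basic steps "), conversations)
route_phrase(make_phrase_message("question1", "to start learning bass."), conversations)
print("".join(conversations["question1"]))
```

With this envelope, phrases from different questions can safely share one websocket, because the receiver always knows which conversation to append to.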

About

Try streaming, so the response looks faster.
