## Step 0: reopen in devcontainer

Bring it up and running:
- Start the Flask websocket server with gunicorn:

  ```shell
  gunicorn -b 0.0.0.0:5000 --workers 4 --threads 100 app.flask_socket:app
  ```

  This is the service the browser establishes a websocket with, and it listens to RabbitMQ.
- Start the other backend, which actually receives the question:

  ```shell
  python another_backend.py
  ```
- Open localhost:5000 in your browser (please watch it: the essence of this repo is showing the phrase-by-phrase response to the user...).
- Use postman or curl to send in a question like below; the localhost:5000 page should then show the answer:

  ```shell
  curl --location 'localhost:5001/questions' \
    --header 'Content-Type: application/json' \
    --data '{"question_id": "question1", "question": "how to play bass"}'

  curl --location 'localhost:5001/questions' \
    --header 'Content-Type: application/json' \
    --data '{"question_id": "question2", "question": "how to play piano"}'
  ```
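For reference, the same two requests can be sent from Python. This is a sketch assuming the demo's `/questions` endpoint on localhost:5001 and the third-party `requests` package:

```python
import json


def build_question(question_id: str, question: str) -> str:
    """Build the JSON body that the /questions endpoint expects."""
    return json.dumps({"question_id": question_id, "question": question})


if __name__ == "__main__":
    import requests  # third-party: pip install requests

    for qid, q in [("question1", "how to play bass"),
                   ("question2", "how to play piano")]:
        resp = requests.post(
            "http://localhost:5001/questions",
            headers={"Content-Type": "application/json"},
            data=build_question(qid, q),
        )
        print(qid, resp.status_code)
```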
## How it works

- In the app folder, there are 2 backends.
- The first backend is `app/flask_socket.py`. This is the backend the browser establishes a websocket with, and it listens to RabbitMQ.
- The second backend is `another_backend.py`. This is the backend that answers the user's question (like "how to play piano"): it streams the streamed response from OpenAI into RabbitMQ (`app/llm.py`), where it is consumed by the flask socket backend (which listens to RabbitMQ).
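The two-backend flow above can be sketched with an in-memory `queue.Queue` standing in for RabbitMQ (a simplification: the real app uses a broker, and the emit would go out over the websocket):

```python
import json
import queue
import threading

# Stand-in for the RabbitMQ queue sitting between the two backends.
broker = queue.Queue()


def another_backend(question_id: str, phrases: list) -> None:
    """Plays the role of another_backend.py: push each streamed phrase."""
    for phrase in phrases:
        broker.put(json.dumps({question_id: phrase}))
    broker.put(None)  # sentinel: the stream is finished


def flask_socket_consumer(emit) -> None:
    """Plays the role of app/flask_socket.py: consume and 'emit' to the browser."""
    while True:
        msg = broker.get()
        if msg is None:
            break
        emit(json.loads(msg))


received = []
t = threading.Thread(target=another_backend,
                     args=("question1", ["Here are ", "the basics ", "of bass."]))
t.start()
flask_socket_consumer(received.append)
t.join()
print(received)
```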
## Why doing this

- The user does not want to wait for the whole response to be generated before anything is sent to them.
- OpenAI starts to respond after about 1 second and finishes at about 3 seconds for a question like "how to play piano".
- So if we don't stream, the user has to wait the full 3 seconds! (You can observe this in the log when you trigger `app/llm.py`.)
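A tiny simulation of why streaming matters, using made-up delays that mimic the ~1 s first phrase / ~3 s total numbers above (scaled down so it runs fast):

```python
import time


def fake_openai_stream():
    """Yield phrases with a delay before each, mimicking token latency."""
    for phrase in ["How to ", "play ", "piano: ..."]:
        time.sleep(0.1)  # pretend per-phrase generation latency
        yield phrase


start = time.monotonic()
first_at = None
chunks = []
for chunk in fake_openai_stream():
    if first_at is None:
        first_at = time.monotonic() - start  # streaming: user sees something here
    chunks.append(chunk)
total = time.monotonic() - start  # no streaming: user waits this long

print(f"first phrase after {first_at:.2f}s, full answer after {total:.2f}s")
```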
## How it can be implemented in production environments

- Suppose we have a chat app which involves some personal data, via a RAG way of doing things.
- The user posts a question like "how to play bass". Here the frontend needs to send the question_id to the backend, instead of the backend generating the id.
- The backend receives it and saves it into the db with the id generated by the frontend; we also keep track of the question poster, when it was posted, and other admin info.
- After saving into the db, which is super fast, it can start to use `app/llm.py` to put the words/phrases into RabbitMQ, but we need to put them there with the question_id. Unlike the demo app here, which has 2 websockets for 2 questions, in an actual implementation 1 user gets 1 websocket for chats. So we need to know where to put the words/phrases coming from the websocket, which means each word/phrase must be labeled with the question_id. The actual info put into RabbitMQ will be `json.dumps({"question_id": "Here are the basic steps to start learning how to play bass: "})` or something similar, so the frontend will know where to append the results.
- After all words/phrases are spit out, the another_backend finally returns a 200 response, so the frontend knows the question is fully answered!
- The `another_backend` knows the user input and the AI output, so it is responsible for saving the conversation into the database for fetching later (when the user refreshes the page, he needs to see the past conversation!).
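The question_id labeling scheme above can be sketched like this on the receiving side (the handler and payload shape here are illustrative assumptions, not the repo's actual code):

```python
import json

# Frontend-side buffers: one per question_id, so each phrase
# lands in the right chat even when streams are interleaved.
answers = {}


def on_ws_message(raw: str) -> None:
    """Append each labeled phrase to the right question's answer."""
    payload = json.loads(raw)
    for question_id, phrase in payload.items():
        answers[question_id] = answers.get(question_id, "") + phrase


# Simulated messages coming off the websocket, interleaved across questions.
for msg in [
    json.dumps({"question1": "Here are the basic steps "}),
    json.dumps({"question2": "Start with the C major scale. "}),
    json.dumps({"question1": "to start learning bass."}),
]:
    on_ws_message(msg)

# Once another_backend returns 200, the full answer can be saved to the db.
conversation = {"question": "how to play bass", "answer": answers["question1"]}
print(conversation)
```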