|
# Get started

- [Get started](#get-started)
  - [Install Neural Solution](#install-neural-solution)
    - [Prerequisites](#prerequisites)
    - [Method 1. Using pip](#method-1-using-pip)
    - [Method 2. Building from source](#method-2-building-from-source)
  - [Start service](#start-service)
  - [Submit task](#submit-task)
  - [Query task status](#query-task-status)
  - [Stop service](#stop-service)
  - [Inspect logs](#inspect-logs)

## Install Neural Solution
### Prerequisites
- Install [Anaconda](https://docs.anaconda.com/free/anaconda/install/)
- Install [Open MPI](https://www.open-mpi.org/faq/?category=building#easy-build)
- Python 3.8 or later

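A quick way to confirm the prerequisites are visible from your shell (assuming `mpirun` and `python` are already on `PATH`):

```shell
# check the MPI runtime and the Python version (3.8 or later is required)
mpirun --version
python --version
```
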
There are two ways to install Neural Solution:
### Method 1. Using pip
```shell
pip install neural-solution
```
### Method 2. Building from source

```shell
# get source code
git clone https://github.com/intel/neural-compressor
cd neural-compressor

# install neural compressor
pip install -r requirements.txt
python setup.py install

# install neural solution
pip install -r neural_solution/requirements.txt
python setup.py neural_solution install
```
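Whichever method you use, a quick sanity check that the installation succeeded is to resolve the CLI entry point and import the package from the same environment:

```shell
# the CLI should print its usage text
neural_solution -h

# the Python package should be importable
python -c "import neural_solution; print('neural_solution imported successfully')"
```
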
## Start service

```shell
# Start neural solution service with custom configuration
neural_solution start --task_monitor_port=22222 --result_monitor_port=33333 --restful_api_port=8001

# Help Manual
neural_solution -h
# Help output

usage: neural_solution {start,stop} [-h] [--hostfile HOSTFILE] [--restful_api_port RESTFUL_API_PORT] [--grpc_api_port GRPC_API_PORT]
                                    [--result_monitor_port RESULT_MONITOR_PORT] [--task_monitor_port TASK_MONITOR_PORT] [--api_type API_TYPE]
                                    [--workspace WORKSPACE] [--conda_env CONDA_ENV] [--upload_path UPLOAD_PATH]

Neural Solution

positional arguments:
  {start,stop}          start/stop service

optional arguments:
  -h, --help            show this help message and exit
  --hostfile HOSTFILE   start backend serve host file which contains all available nodes
  --restful_api_port RESTFUL_API_PORT
                        start restful serve with {restful_api_port}, default 8000
  --grpc_api_port GRPC_API_PORT
                        start gRPC with {restful_api_port}, default 8000
  --result_monitor_port RESULT_MONITOR_PORT
                        start serve for result monitor at {result_monitor_port}, default 3333
  --task_monitor_port TASK_MONITOR_PORT
                        start serve for task monitor at {task_monitor_port}, default 2222
  --api_type API_TYPE   start web serve with all/grpc/restful, default all
  --workspace WORKSPACE
                        neural solution workspace, default "./ns_workspace"
  --conda_env CONDA_ENV
                        specify the running environment for the task
  --upload_path UPLOAD_PATH
                        specify the file path for the tasks

```
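The same command also covers multi-node setups: point the backend at a hostfile listing the available nodes and, if needed, name the conda environment that tasks should run in. A minimal sketch, where `./hostfile` and the `inc_env` environment name are placeholders:

```shell
# start the service across the nodes listed in ./hostfile,
# running tasks inside the "inc_env" conda environment
neural_solution start --hostfile=./hostfile --conda_env=inc_env
```
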
## Submit task

- For RESTful API: `[user@server hf_model]$ curl -H "Content-Type: application/json" --data @./task.json http://localhost:8000/task/submit/` (a sample `task.json` is sketched below)
- For gRPC API: `python -m neural_solution.frontend.gRPC.client submit --request="test.json"`

> For more details, please refer to the [API description](./description_api.md) and [examples](../../examples/README.md).

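The submitted JSON file describes the tuning task. Its exact schema is defined in the [API description](./description_api.md); the fields below are illustrative only, not the authoritative format:

```shell
# Sketch only: field names are examples, see description_api.md for the real schema
cat > task.json <<'EOF'
{
    "script_url": "https://github.com/huggingface/transformers/blob/v4.21-release/examples/pytorch/text-classification/run_glue.py",
    "optimized": "False",
    "arguments": [],
    "approach": "static",
    "requirements": [],
    "workers": 1
}
EOF

# submit it to the RESTful endpoint started above
curl -H "Content-Type: application/json" --data @./task.json http://localhost:8000/task/submit/
```
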
## Query task status

Query the task status and result by its `task_id`.

- For RESTful API: `[user@server hf_model]$ curl -X GET http://localhost:8000/task/status/{task_id}`
- For gRPC API: `python -m neural_solution.frontend.gRPC.client query --task_id={task_id}`

> For more details, please refer to the [API description](./description_api.md) and [examples](../../examples/README.md).

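Tuning can take a while, so it is often convenient to re-run the status query periodically instead of polling by hand. A simple sketch using the RESTful endpoint, with `{task_id}` replaced by the id returned at submission:

```shell
# re-issue the documented status query every 30 seconds
watch -n 30 "curl -s -X GET http://localhost:8000/task/status/{task_id}"
```
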
## Stop service

```shell
# Stop neural solution service with default configuration
neural_solution stop
```

## Inspect logs

The default logs are located in `./ns_workspace/`. Users can specify a custom workspace when starting the service with `neural_solution start --workspace=/path/to/custom/workspace`.

There are several logs under the workspace:

```shell
(ns) [username@servers ns_workspace]$ tree
.
├── db
│   └── task.db                                      # database to save the task-related information
├── serve_log                                        # service running log
│   ├── backend.log                                  # backend log
│   ├── frontend_grpc.log                            # gRPC frontend log
│   └── frontend.log                                 # HTTP/RESTful frontend log
├── task_log                                         # overall log for each task
│   ├── task_bdf0bd1b2cc14bc19bce12d4f9b333c7.txt    # task log
│   └── ...
└── task_workspace                                   # the workspace for each task
    ...
    ├── bdf0bd1b2cc14bc19bce12d4f9b333c7             # task_id
    ...
```
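For a live view while a task is running, tailing the serve logs is usually enough (paths are relative to the workspace tree shown above):

```shell
# follow the backend and RESTful frontend logs
tail -f ns_workspace/serve_log/backend.log ns_workspace/serve_log/frontend.log
```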