|
| 1 | +[.enterprise-edition] |
| 2 | +[[graph-project-apache-arrow]] |
| 3 | += Projecting graphs using Apache Arrow |
| 4 | + |
| 5 | +[abstract] |
| 6 | +-- |
| 7 | +This chapter explains how to import data using Apache Arrow into the Graph Data Science library. |
| 8 | +-- |
| 9 | + |
| 10 | +include::../../management-ops/alpha-note.adoc[] |
| 11 | + |
| 12 | +include::../../common-usage/not-on-aurads-note.adoc[] |
| 13 | + |
| 14 | +Projecting graphs via https://arrow.apache.org/[Apache Arrow] allows importing graph data which is stored outside of Neo4j. |
| 15 | +Apache Arrow is a language-agnostic in-memory, columnar data structure specification. |
| 16 | +With Arrow Flight, it also contains a protocol for serialization and generic data transport. |
| 17 | + |
| 18 | +GDS exposes an Arrow Flight Server which accepts graph data from an Arrow Flight Client. |
| 19 | +The data that is being sent is represented using the Arrow columnar format. |
| 20 | +Projecting graphs via Arrow Flight follows a specific client-server protocol. |
| 21 | +In this chapter, we explain that protocol, message formats and schema constraints. |
| 22 | + |
| 23 | +We assume that a Flight server has already been set up and configured. |
| 24 | +To learn more about the installation, please refer to the <<installation-apache-arrow, installation chapter>>. |
| 25 | + |
| 26 | + |
| 27 | +== Client-Server protocol |
| 28 | + |
| 29 | +The protocol describes the projection of a single in-memory graph into GDS. |
| 30 | +Each projection is represented as an import process on the server side. |
| 31 | +The protocol divides the import process into three phases. |
| 32 | + |
| 33 | +image::arrow/import-protocol.png[Client-server protocol for Arrow import in GDS,align="center"] |
| 34 | + |
| 35 | +1. Initialize the import process |
| 36 | ++ |
| 37 | +To initialize the import process, the client needs to execute a Flight action on the server. |
| 38 | +The action type is called `CREATE_GRAPH` and the action body configures the import process. |
| 39 | +The server receives the action, creates the import process and acknowledges success. |
| 40 | ++ |
| 41 | +See <<arrow-initialize-import-process>> for more details. |
| 42 | ++ |
| 43 | +2. Send node records via an Arrow Flight stream |
| 44 | ++ |
| 45 | +In the second phase, the client sends record batches of nodes via `PUT` as a Flight stream. |
| 46 | +Once all record batches are sent, the client needs to indicate that all nodes have been sent. |
| 47 | +This is done by sending another Flight action with type `NODE_LOAD_DONE`. |
| 48 | ++ |
| 49 | +See <<arrow-send-nodes>> for more details. |
| 50 | ++ |
| 51 | +3. Send relationship records via an Arrow Flight stream |
| 52 | ++ |
| 53 | +In the third and last phase, the client sends record batches of relationships via `PUT` as a Flight stream. |
| 54 | +Once all record batches are sent, the client needs to indicate that the import process is complete. |
| 55 | +This is done by sending another Flight action with type `RELATIONSHIP_LOAD_DONE`. |
| 56 | +The server finalizes the construction of the in-memory graph and stores the graph in the graph catalog. |
| 57 | ++ |
| 58 | +See <<arrow-send-relationships>> for more details. |
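
The ordering of the three phases can be sketched as a small client-side state machine. This is only an illustration under assumptions: the `Phase` enum, the `TRANSITIONS` table and the `ImportTracker` class are hypothetical helpers that are not part of GDS; only the action types and entity types come from the protocol described above.

```python
from enum import Enum


class Phase(Enum):
    """Phases of a single import process, in protocol order."""
    NOT_STARTED = 0
    NODES = 1          # after CREATE_GRAPH: node streams are expected
    RELATIONSHIPS = 2  # after NODE_LOAD_DONE: relationship streams are expected
    DONE = 3           # after RELATIONSHIP_LOAD_DONE: graph is in the catalog


# Action type -> (phase required before the action, phase after the action)
TRANSITIONS = {
    "CREATE_GRAPH": (Phase.NOT_STARTED, Phase.NODES),
    "NODE_LOAD_DONE": (Phase.NODES, Phase.RELATIONSHIPS),
    "RELATIONSHIP_LOAD_DONE": (Phase.RELATIONSHIPS, Phase.DONE),
}


class ImportTracker:
    """Hypothetical helper that tracks the phase of one import process."""

    def __init__(self):
        self.phase = Phase.NOT_STARTED

    def action(self, action_type):
        """Record a Flight action, rejecting it if it is out of order."""
        required, following = TRANSITIONS[action_type]
        if self.phase is not required:
            raise RuntimeError(
                f"{action_type} is not valid in phase {self.phase.name}"
            )
        self.phase = following

    def can_put(self, entity_type):
        """Return True if a PUT stream of this entity type fits the phase."""
        expected = {"node": Phase.NODES, "relationship": Phase.RELATIONSHIPS}
        return expected[entity_type] is self.phase
```

A client could consult such a tracker before each call so that, for example, relationship batches are never sent before `NODE_LOAD_DONE`.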
| 59 | + |
| 60 | + |
| 61 | +[[arrow-initialize-import-process]] |
| 62 | +== Initializing the import process |
| 63 | + |
| 64 | +An import process is initialized by sending a Flight action using the action type `CREATE_GRAPH`. |
| 65 | +The action body is a JSON document containing metadata for the import process: |
| 66 | + |
| 67 | +---- |
| 68 | +{ |
| 69 | + "name": "my_graph", |
| 70 | + "database_name": "neo4j", |
| 71 | + "concurrency": 4 |
| 72 | +} |
| 73 | +---- |
| 74 | + |
| 75 | +The `name` identifies the import process and is also the name of the resulting in-memory graph in the graph catalog. |
| 76 | +The `database_name` tells the server on which database the projected graph will be available. |
| 77 | +The `concurrency` key is optional; it is used when the server finalizes the in-memory graph after all data has been received. |
| 78 | + |
| 79 | +The server acknowledges creating the import process by sending a result JSON document which contains the name of the import process. |
| 80 | +If an error occurs, e.g., if the graph already exists or if the server is not started, the client is informed accordingly. |
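
As a minimal sketch, the action body could be assembled on the client as follows. The function name `create_graph_body` is hypothetical, and UTF-8 encoding of the JSON document is an assumption rather than something this chapter specifies.

```python
import json


def create_graph_body(name, database_name, concurrency=None):
    """Build the body of a CREATE_GRAPH action as JSON bytes.

    Assumption: the server accepts a UTF-8 encoded JSON document.
    """
    body = {"name": name, "database_name": database_name}
    # `concurrency` is optional, so it is only included when given.
    if concurrency is not None:
        body["concurrency"] = concurrency
    return json.dumps(body).encode("utf-8")
```

With the Python Arrow bindings, the resulting bytes could presumably be wrapped in `pyarrow.flight.Action("CREATE_GRAPH", body)` and sent with `FlightClient.do_action`.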
| 81 | + |
| 82 | + |
| 83 | +[[arrow-send-nodes]] |
| 84 | +== Sending node records via PUT as a Flight stream |
| 85 | + |
| 86 | +Nodes need to be turned into Arrow record batches and sent to the server via a Flight stream. |
| 87 | +Each stream needs to target an import process on the server. |
| 88 | +That information is encoded in the Flight descriptor body as a JSON document: |
| 89 | + |
| 90 | +---- |
| 91 | +{ |
| 92 | + "name": "my_graph", |
| 93 | + "entity_type": "node" |
| 94 | +} |
| 95 | +---- |
| 96 | + |
| 97 | +The server expects the node records to adhere to a specific schema. |
| 98 | +Given an example node such as `(:Pokemon { weight: 8.5, height: 0.6, hp: 39 })`, its record must be represented as follows: |
| 99 | + |
| 100 | +[[arrow-node-schema]] |
| 101 | +[opts=header,cols="1m,1m,1m,1m,1m"] |
| 102 | +|=== |
| 103 | +| node_id | label | weight | height | hp |
| 104 | +| 0 | "Pokemon" | 8.5 | 0.6 | 39 |
| 105 | +|=== |
| 106 | + |
| 107 | +The following table describes the node columns with reserved names. |
| 108 | + |
| 109 | +[[arrow-node-columns]] |
| 110 | +[opts=header,cols="1m,1m,1m,1m,1"] |
| 111 | +|=== |
| 112 | +| Name | Type | Optional | Nullable | Description |
| 113 | +| node_id | Integer | No | No | Unique 64-bit node identifiers for the in-memory graph. Must be positive values. |
| 114 | +| label | String or Integer | Yes | No | Single node label. Either a string literal or a dictionary encoded number. |
| 115 | +|=== |
| 116 | + |
| 117 | +Any additional column is interpreted as a node property. |
| 118 | +The supported data types are equivalent to the GDS node property types, i.e., `long`, `double`, `long[]`, `double[]` and `float[]`. |
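
The column rules above can be checked on the client before any data is sent. The sketch below works on a plain mapping from column name to a list of values instead of a real Arrow record batch; `check_node_columns` is a hypothetical helper, and the server performs its own validation regardless.

```python
RESERVED_NODE_COLUMNS = {"node_id", "label"}


def check_node_columns(columns):
    """Validate node columns and return the names of the property columns.

    `columns` maps a column name to a list of values, one entry per node.
    """
    if "node_id" not in columns:
        raise ValueError("missing mandatory column 'node_id'")
    ids = columns["node_id"]
    # All columns of a record batch must have the same length.
    for name, values in columns.items():
        if len(values) != len(ids):
            raise ValueError(f"column {name!r} has a different length")
    # Reserved columns are not nullable.
    for name in RESERVED_NODE_COLUMNS & set(columns):
        if any(value is None for value in columns[name]):
            raise ValueError(f"column {name!r} must not contain nulls")
    if len(set(ids)) != len(ids):
        raise ValueError("node ids must be unique")
    # Every non-reserved column is interpreted as a node property.
    return sorted(set(columns) - RESERVED_NODE_COLUMNS)
```

For the example node above, the mapping `{"node_id": [0], "label": ["Pokemon"], "weight": [8.5], "height": [0.6], "hp": [39]}` passes the check and yields the property columns `height`, `hp` and `weight`.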
| 119 | + |
| 120 | +To increase the throughput, multiple Flight streams can be sent in parallel. |
| 121 | +The server manages multiple incoming streams for the same import process. |
| 122 | +In addition to the number of parallel streams, the size of a single record batch can also affect the overall throughput. |
| 123 | +The client has to make sure that node ids are unique across all streams. |
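
One way to keep node ids unique across streams is to partition the input rows into disjoint batches before distributing them. The following sketch assumes the node ids are already unique in the input; `plan_streams` is a hypothetical helper and the round-robin assignment is only one possible choice.

```python
def plan_streams(num_rows, batch_size, num_streams):
    """Split rows [0, num_rows) into batches and assign them to streams.

    Returns one list per stream; each entry is a half-open (start, end)
    row range. Every row belongs to exactly one batch, so no node is
    sent twice.
    """
    streams = [[] for _ in range(num_streams)]
    for index, start in enumerate(range(0, num_rows, batch_size)):
        end = min(start + batch_size, num_rows)
        # Round-robin assignment keeps the streams evenly loaded.
        streams[index % num_streams].append((start, end))
    return streams
```

Both `batch_size` and `num_streams` are tuning parameters; suitable values depend on the data and the deployment and are best found by measurement.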
| 124 | + |
| 125 | +Once all node record batches are sent to the server, the client needs to indicate that node loading is done. |
| 126 | +This is achieved by sending another Flight action with the action type `NODE_LOAD_DONE` and the following JSON document as action body: |
| 127 | + |
| 128 | +---- |
| 129 | +{ |
| 130 | + "name": "my_graph" |
| 131 | +} |
| 132 | +---- |
| 133 | + |
| 134 | +The server acknowledges the action by returning a JSON document including the name of the import process and the number of nodes that have been imported: |
| 135 | + |
| 136 | +---- |
| 137 | +{ |
| 138 | + "name": "my_graph", |
| 139 | + "node_count": 42 |
| 140 | +} |
| 141 | +---- |
| 142 | + |
| 143 | +[[arrow-send-relationships]] |
| 144 | +== Sending relationship records via PUT as a Flight stream |
| 145 | + |
| 146 | +Similar to nodes, relationships need to be turned into record batches in order to send them to the server via a Flight stream. |
| 147 | +The Flight descriptor is a JSON document containing the name of the import process as well as the entity type: |
| 148 | + |
| 149 | +---- |
| 150 | +{ |
| 151 | + "name": "my_graph", |
| 152 | + "entity_type": "relationship" |
| 153 | +} |
| 154 | +---- |
| 155 | + |
| 156 | +As for nodes, the server expects a specific schema for relationship records. |
| 157 | +For example, given the relationship `(a)-[:EVOLVES_TO { at_level: 16 }]->(b)` and assuming node id `0` for `a` and node id `1` for `b`, the record must be represented as follows: |
| 158 | + |
| 159 | +[[arrow-relationship-schema]] |
| 160 | +[opts=header,cols="1m,1m,1m,1m"] |
| 161 | +|=== |
| 162 | +| source_id | target_id | type | at_level |
| 163 | +| 0 | 1 | "EVOLVES_TO" | 16 |
| 164 | +|=== |
| 165 | + |
| 166 | +The following table describes the relationship columns with reserved names. |
| 167 | + |
| 168 | +[[arrow-relationship-columns]] |
| 169 | +[opts=header,cols="1m,1m,1m,1m,1"] |
| 170 | +|=== |
| 171 | +| Name | Type | Optional | Nullable | Description |
| 172 | +| source_id | Integer | No | No | Unique 64-bit source node identifiers. Must be positive values and present in the imported nodes. |
| 173 | +| target_id | Integer | No | No | Unique 64-bit target node identifiers. Must be positive values and present in the imported nodes. |
| 174 | +| type | String or Integer | Yes | No | Single relationship type. Either a string literal or a dictionary encoded number. |
| 175 | +|=== |
| 176 | + |
| 177 | +Any additional column is interpreted as a relationship property. |
| 178 | +GDS only supports relationship properties of type `double`. |
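
The relationship rules can be checked on the client in a similar way. The sketch below again uses a plain mapping from column name to a list of values; `check_relationship_columns` is a hypothetical helper, and converting property values with `float` is an assumption about how a client might prepare `double` columns.

```python
RESERVED_RELATIONSHIP_COLUMNS = {"source_id", "target_id", "type"}


def check_relationship_columns(columns, node_ids):
    """Validate relationship columns and return the properties as floats.

    `node_ids` are the ids of the nodes sent in the previous phase.
    """
    known = set(node_ids)
    for name in ("source_id", "target_id"):
        if name not in columns:
            raise ValueError(f"missing mandatory column {name!r}")
        # Nulls are rejected here too, since None is never a known id.
        unknown = [value for value in columns[name] if value not in known]
        if unknown:
            raise ValueError(f"{name!r} refers to unknown nodes: {unknown}")
    # Every non-reserved column is a relationship property of type double.
    return {
        name: [float(value) for value in values]
        for name, values in columns.items()
        if name not in RESERVED_RELATIONSHIP_COLUMNS
    }
```

For the example relationship above, the integer property `at_level: 16` would be sent as the double value `16.0`.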
| 179 | + |
| 180 | +Similar to sending nodes, the overall throughput depends on the number of parallel Flight streams and the record batch size. |
| 181 | + |
| 182 | +Once all relationship record batches are sent to the server, the client needs to indicate that the import process is done. |
| 183 | +This is achieved by sending a final Flight action with the action type `RELATIONSHIP_LOAD_DONE` and the following JSON document as action body: |
| 184 | + |
| 185 | +---- |
| 186 | +{ |
| 187 | + "name": "my_graph" |
| 188 | +} |
| 189 | +---- |
| 190 | + |
| 191 | + |
| 192 | +The server finalizes the graph projection and stores the in-memory graph in the graph catalog. |
| 193 | +Once completed, the server acknowledges the action by returning a JSON document including the name of the import process and the number of relationships that have been imported: |
| 194 | + |
| 195 | +---- |
| 196 | +{ |
| 197 | + "name": "my_graph", |
| 198 | + "relationship_count": 1337 |
| 199 | +} |
| 200 | +---- |
| 201 | + |