Commit 5601af0

Merge pull request #5299 from s1ck/arrow-docs
[Arrow] Documentation
2 parents 3033764 + 0d0ce4d commit 5601af0

File tree: 7 files changed, +315 −0 lines changed

doc/antora/content-nav.adoc

Lines changed: 3 additions & 0 deletions

@@ -7,6 +7,7 @@
 ** xref:installation/installation-enterprise-edition/index.adoc[]
 ** xref:installation/installation-docker/index.adoc[]
 ** xref:installation/installation-causal-cluster/index.adoc[]
+** xref:installation/installation-apache-arrow/index.adoc[]
 ** xref:installation/additional-config-parameters/index.adoc[]
 ** xref:installation/System-requirements/index.adoc[]
 * xref:common-usage/index.adoc[]
@@ -21,13 +22,15 @@
 *** xref:graph-project/index.adoc[]
 *** xref:graph-project-cypher/index.adoc[]
 *** xref:graph-project-cypher-aggregation/index.adoc[]
+*** xref:graph-project-apache-arrow/index.adoc[]
 *** xref:graph-list/index.adoc[]
 *** xref:graph-exists/index.adoc[]
 *** xref:graph-drop/index.adoc[]
 *** xref:graph-project-subgraph/index.adoc[]
 *** xref:graph-catalog-node-ops/index.adoc[]
 *** xref:graph-catalog-relationship-ops/index.adoc[]
 *** xref:graph-catalog-export-ops/index.adoc[]
+*** xref:graph-catalog-apache-arrow-ops/index.adoc[]
 ** xref:management-ops/node-properties/index.adoc[]
 ** xref:management-ops/utility-functions/index.adoc[]
 ** xref:management-ops/create-cypher-db/index.adoc[]
Lines changed: 100 additions & 0 deletions

@@ -0,0 +1,100 @@

[.enterprise-edition]
[[installation-apache-arrow]]
= Apache Arrow

[abstract]
--
This chapter explains how to set up Apache Arrow Flight in the Neo4j Graph Data Science library.
--

include::../management-ops/alpha-note.adoc[]

include::../common-usage/not-on-aurads-note.adoc[]

GDS supports importing graphs and exporting properties via https://arrow.apache.org/[Apache Arrow Flight].
This chapter is dedicated to configuring the Arrow Flight Server as part of the Neo4j and GDS installation.
For using Arrow Flight with an Arrow client, please refer to our documentation for <<graph-project-apache-arrow, projecting graphs>> and <<graph-catalog-apache-arrow-ops, streaming properties>>.

Arrow is bundled with the GDS Enterprise Edition, which must be <<neo4j-server, installed>>.


== Installation

On a standalone Neo4j Server, Arrow needs to be explicitly enabled and configured.
The Flight Server is disabled by default. To enable it, add the following to your `$NEO4J_HOME/conf/neo4j.conf` file:

----
gds.arrow.enabled=true
----

The following additional settings are available:

[[table-arrow-settings]]
[opts=header,cols="2m,1m,1,1"]
|===
| Name | Default | Optional | Description
| gds.arrow.listen_address | localhost:8491 | Yes | Address the GDS Arrow Flight Server should bind to.
| gds.arrow.abortion_timeout | 10 | Yes | The maximum time in minutes to wait for the next command before aborting the import process.
| gds.arrow.batch_size | 10000 | Yes | The batch size used for Arrow property export.
|===

Note that any change to the configuration requires a database restart.
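
Putting the settings together, a `neo4j.conf` fragment that enables Arrow and spells out the defaults from the table above might look like this (the values shown are the documented defaults; adjust as needed):

----
gds.arrow.enabled=true
gds.arrow.listen_address=localhost:8491
gds.arrow.abortion_timeout=10
gds.arrow.batch_size=10000
----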


== Authentication

Client connections to the Arrow Flight server are authenticated using the https://neo4j.com/docs/operations-manual/current/authentication-authorization/introduction/[Neo4j native auth provider].
Any authenticated user can perform all available Arrow operations, i.e., graph projection and property streaming.
There are no dedicated roles to configure.

To enable authentication, use the following DBMS setting:

----
dbms.security.auth_enabled=true
----


== Encryption

Communication between client and server can optionally be encrypted.
The Arrow Flight server re-uses the https://neo4j.com/docs/operations-manual/current/security/ssl-framework/[Neo4j native SSL framework].
In terms of https://neo4j.com/docs/operations-manual/current/security/ssl-framework/#ssl-configuration[configuration scope], the Arrow Server supports `https` and `bolt`.
If both scopes are configured, the Arrow Server prioritizes the `https` scope.

To enable encryption for `https`, use the following DBMS settings:

----
dbms.ssl.policy.https.enabled=true
dbms.ssl.policy.https.private_key=private.key
dbms.ssl.policy.https.public_certificate=public.crt
----


== Monitoring

To return details about the status of the GDS Flight server, GDS provides the `gds.debug.arrow` procedure.

======
.Run the debug procedure.
[source, cypher, role=noplay]
----
CALL gds.debug.arrow()
YIELD
  running: Boolean,
  enabled: Boolean,
  listenAddress: String,
  batchSize: Integer,
  abortionTimeout: Integer
----

.Results
[opts="header",cols="1,1,6"]
|===
| Name | Type | Description
| running | Boolean | True if the Arrow Flight Server is currently running.
| enabled | Boolean | True if the corresponding setting is enabled.
| listenAddress | String | Address (host and port) the Arrow Flight Server is bound to.
| batchSize | Integer | The batch size used for Arrow property export.
| abortionTimeout | Duration | The maximum time to wait for the next command before aborting the import process.
|===
======

doc/asciidoc/installation/installation.adoc

Lines changed: 2 additions & 0 deletions

@@ -19,6 +19,7 @@ This chapter is divided into the following sections:
 . <<installation-enterprise-edition>>
 . <<installation-docker>>
 . <<installation-causal-cluster>>
+. <<installation-apache-arrow>>
 . <<additional-config-parameters>>
 . <<System-requirements>>

@@ -28,5 +29,6 @@ include::neo4j-server.adoc[leveloffset=+1]
 include::installation-enterprise-edition.adoc[leveloffset=+1]
 include::installation-docker.adoc[leveloffset=+1]
 include::installation-causal-cluster.adoc[leveloffset=+1]
+include::installation-apache-arrow.adoc[leveloffset=+1]
 include::additional-config-parameters.adoc[leveloffset=+1]
 include::system-requirements.adoc[leveloffset=+1]
Lines changed: 2 additions & 0 deletions

@@ -0,0 +1,2 @@

[[graph-catalog-apache-arrow-ops]]
= Apache Arrow operations
Lines changed: 201 additions & 0 deletions

@@ -0,0 +1,201 @@

[.enterprise-edition]
[[graph-project-apache-arrow]]
= Projecting graphs using Apache Arrow

[abstract]
--
This chapter explains how to import data into the Graph Data Science library using Apache Arrow.
--

include::../../management-ops/alpha-note.adoc[]

include::../../common-usage/not-on-aurads-note.adoc[]

Projecting graphs via https://arrow.apache.org/[Apache Arrow] allows importing graph data that is stored outside of Neo4j.
Apache Arrow is a language-agnostic specification for columnar in-memory data.
With Arrow Flight, it also provides a protocol for serialization and generic data transport.

GDS exposes an Arrow Flight Server which accepts graph data from an Arrow Flight Client.
The data that is being sent is represented using the Arrow columnar format.
Projecting graphs via Arrow Flight follows a specific client-server protocol.
In this chapter, we explain that protocol, the message formats and the schema constraints.

Throughout this chapter, we assume that a Flight server has been set up and configured.
To learn more about the installation, please refer to the <<installation-apache-arrow, installation chapter>>.


== Client-Server protocol

The protocol describes the projection of a single in-memory graph into GDS.
Each projection is represented as an import process on the server side.
The protocol divides the import process into three phases.

image::arrow/import-protocol.png[Client-server protocol for Arrow import in GDS,align="center"]

1. Initialize the import process
+
To initialize the import process, the client needs to execute a Flight action on the server.
The action type is called `CREATE_GRAPH` and the action body configures the import process.
The server receives the action, creates the import process and acknowledges success.
+
See <<arrow-initialize-import-process>> for more details.
+
2. Send node records via an Arrow Flight stream
+
In the second phase, the client sends record batches of nodes via `PUT` as a Flight stream.
Once all record batches are sent, the client needs to indicate that all nodes have been sent.
This is done by sending another Flight action with type `NODE_LOAD_DONE`.
+
See <<arrow-send-nodes>> for more details.
+
3. Send relationship records via an Arrow Flight stream
+
In the third and last phase, the client sends record batches of relationships via `PUT` as a Flight stream.
Once all record batches are sent, the client needs to indicate that the import process is complete.
This is done by sending another Flight action with type `RELATIONSHIP_LOAD_DONE`.
The server finalizes the construction of the in-memory graph and stores the graph in the graph catalog.
+
See <<arrow-send-relationships>> for more details.
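
The action bodies exchanged in the three phases above are small JSON documents. As a minimal sketch of how a client might construct them, assuming a Python client (the helper names below are ours, not part of the GDS API; the actual `PUT` and action calls depend on the Flight client library used):

[source, python]
----
import json

# Builds the CREATE_GRAPH action body (phase 1). Field names follow the
# JSON documents shown in this chapter; the helper itself is hypothetical.
def create_graph_body(name, database_name, concurrency=None):
    body = {"name": name, "database_name": database_name}
    if concurrency is not None:
        body["concurrency"] = concurrency  # optional, used during finalization
    return json.dumps(body).encode("utf-8")

# Shared body shape for NODE_LOAD_DONE (phase 2) and
# RELATIONSHIP_LOAD_DONE (phase 3).
def load_done_body(name):
    return json.dumps({"name": name}).encode("utf-8")

# Phase 1: send a Flight action of type CREATE_GRAPH with this body.
init_body = create_graph_body("my_graph", "neo4j", concurrency=4)
# Phase 2: PUT node record batches, then send NODE_LOAD_DONE.
node_done = load_done_body("my_graph")
# Phase 3: PUT relationship batches, then send RELATIONSHIP_LOAD_DONE.
rel_done = load_done_body("my_graph")
----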


[[arrow-initialize-import-process]]
== Initializing the Import Process

An import process is initialized by sending a Flight action using the action type `CREATE_GRAPH`.
The action body is a JSON document containing metadata for the import process:

----
{
    name: "my_graph",
    database_name: "neo4j",
    concurrency: 4
}
----

The `name` is used to identify the import process; it is also the name of the resulting in-memory graph in the graph catalog.
The `database_name` tells the server on which database the projected graph will be available.
The `concurrency` key is optional; it is used when finalizing the in-memory graph on the server after all data has been received.

The server acknowledges creating the import process by sending a result JSON document which contains the name of the import process.
If an error occurs, e.g., if the graph already exists or if the server is not started, the client is informed accordingly.


[[arrow-send-nodes]]
== Sending node records via PUT as a Flight stream

Nodes need to be turned into Arrow record batches and sent to the server via a Flight stream.
Each stream needs to target an import process on the server.
That information is encoded in the Flight descriptor body as a JSON document:

----
{
    name: "my_graph",
    entity_type: "node"
}
----

The server expects the node records to adhere to a specific schema.
Given an example node such as `(:Pokemon { weight: 8.5, height: 0.6, hp: 39 })`, its record must be represented as follows:

[[arrow-node-schema]]
[opts=header,cols="1m,1m,1m,1m,1m"]
|===
| node_id | label | weight | height | hp
| 0 | "Pokemon" | 8.5 | 0.6 | 39
|===

The following table describes the node columns with reserved names.

[[arrow-node-columns]]
[opts=header,cols="1m,1m,1m,1m,1"]
|===
| Name | Type | Optional | Nullable | Description
| node_id | Integer | No | No | Unique 64-bit node identifiers for the in-memory graph. Must be positive values.
| label | String or Integer | Yes | No | Single node label. Either a string literal or a dictionary-encoded number.
|===

Any additional column is interpreted as a node property.
The supported data types are equivalent to the GDS node property types, i.e., `long`, `double`, `long[]`, `double[]` and `float[]`.

To increase the throughput, multiple Flight streams can be sent in parallel.
The server manages multiple incoming streams for the same import process.
In addition to the number of parallel streams, the size of a single record batch can also affect the overall throughput.
The client has to make sure that node ids are unique across all streams.
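
One simple way to keep node ids unique across parallel streams is to assign each stream a disjoint id range up front. A sketch, assuming the client partitions its input before sending (the helper below is hypothetical, not part of GDS):

[source, python]
----
# Assign each of `streams` parallel Flight streams a disjoint range of
# node ids, so that node_id values never collide across streams.
def partition_node_ids(node_count, streams):
    chunk = -(-node_count // streams)  # ceiling division
    return [range(i * chunk, min((i + 1) * chunk, node_count))
            for i in range(streams)]

# Three disjoint ranges that together cover ids 0..9.
parts = partition_node_ids(10, 3)
----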

Once all node record batches are sent to the server, the client needs to indicate that node loading is done.
This is achieved by sending another Flight action with the action type `NODE_LOAD_DONE` and the following JSON document as action body:

----
{
    name: "my_graph"
}
----

The server acknowledges the action by returning a JSON document including the name of the import process and the number of nodes that have been imported:

----
{
    name: "my_graph",
    node_count: 42
}
----

[[arrow-send-relationships]]
== Sending relationship records via PUT as a Flight stream

Similar to nodes, relationships need to be turned into record batches in order to send them to the server via a Flight stream.
The Flight descriptor is a JSON document containing the name of the import process as well as the entity type:

----
{
    name: "my_graph",
    entity_type: "relationship"
}
----

As for nodes, the server expects a specific schema for relationship records.
For example, given the relationship `(a)-[:EVOLVES_TO { at_level: 16 }]->(b)` and assuming node id `0` for `a` and node id `1` for `b`, the record must be represented as follows:

[[arrow-relationship-schema]]
[opts=header,cols="1m,1m,1m,1m"]
|===
| source_id | target_id | type | at_level
| 0 | 1 | "EVOLVES_TO" | 16
|===

The following table describes the relationship columns with reserved names.

[[arrow-relationship-columns]]
[opts=header,cols="1m,1m,1m,1m,1"]
|===
| Name | Type | Optional | Nullable | Description
| source_id | Integer | No | No | 64-bit source node identifiers. Must be positive values and present in the imported nodes.
| target_id | Integer | No | No | 64-bit target node identifiers. Must be positive values and present in the imported nodes.
| type | String or Integer | Yes | No | Single relationship type. Either a string literal or a dictionary-encoded number.
|===

Any additional column is interpreted as a relationship property.
GDS only supports relationship properties of type `double`.
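
In columnar terms, the example relationship above corresponds to one row across four columns. A minimal sketch of such a batch as plain Python columns (values are illustrative; a real client would build an Arrow record batch from data laid out like this):

[source, python]
----
# Columnar layout matching the documented relationship schema.
# Each list is one column; row i across all columns is one relationship.
relationship_batch = {
    "source_id": [0],
    "target_id": [1],
    "type":      ["EVOLVES_TO"],
    "at_level":  [16.0],  # relationship properties are doubles in GDS
}

# All columns must have the same length: one entry per relationship.
row_count = len(relationship_batch["source_id"])
----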

Similar to sending nodes, the overall throughput depends on the number of parallel Flight streams and the record batch size.

Once all relationship record batches are sent to the server, the client needs to indicate that the import process is done.
This is achieved by sending a final Flight action with the action type `RELATIONSHIP_LOAD_DONE` and the following JSON document as action body:

----
{
    name: "my_graph"
}
----

The server finalizes the graph projection and stores the in-memory graph in the graph catalog.
Once completed, the server acknowledges the action by returning a JSON document including the name of the import process and the number of relationships that have been imported:

----
{
    name: "my_graph",
    relationship_count: 1337
}
----

doc/docbook/content-map.xml

Lines changed: 7 additions & 0 deletions

@@ -29,6 +29,9 @@
 <d:tocentry linkend="installation-causal-cluster">
 <?dbhtml filename="installation/installation-causal-cluster/index.html"?>
 </d:tocentry>
+<d:tocentry linkend="installation-apache-arrow">
+<?dbhtml filename="installation/installation-apache-arrow/index.html"?>
+</d:tocentry>
 <d:tocentry linkend="additional-config-parameters">
 <?dbhtml filename="installation/additional-config-parameters/index.html"?>
 </d:tocentry>
@@ -67,6 +70,8 @@
 </d:tocentry>
 <d:tocentry linkend="catalog-graph-project-cypher-aggregation"><?dbhtml filename="graph-project-cypher-aggregation/index.html"?>
 </d:tocentry>
+<d:tocentry linkend="graph-project-apache-arrow"><?dbhtml filename="graph-project-apache-arrow/index.html"?>
+</d:tocentry>
 <d:tocentry linkend="catalog-graph-list"><?dbhtml filename="graph-list/index.html"?>
 </d:tocentry>
 <d:tocentry linkend="catalog-graph-exists"><?dbhtml filename="graph-exists/index.html"?>
@@ -81,6 +86,8 @@
 </d:tocentry>
 <d:tocentry linkend="graph-catalog-export-ops"><?dbhtml filename="graph-catalog-export-ops/index.html"?>
 </d:tocentry>
+<d:tocentry linkend="graph-catalog-apache-arrow-ops"><?dbhtml filename="graph-catalog-apache-arrow-ops/index.html"?>
+</d:tocentry>
 </d:tocentry>
 <d:tocentry linkend="node-properties"><?dbhtml filename="management-ops/node-properties/index.html"?>
 </d:tocentry>
(binary image file added, 225 KB, not rendered)