Merged
29 commits
28a85da
Add dependencies for new feature `gpu`
itzmeanjan Mar 19, 2025
11e5b77
Use `u32` for matrix dimensions
itzmeanjan Mar 19, 2025
cfb9124
Add compute shader for matrix-matrix multiplication
itzmeanjan Mar 19, 2025
4cb8d97
Setup a Vulkan device and queue so that commands can be submitted to it
itzmeanjan Mar 19, 2025
4b9bac8
Setup gpu returns a memory allocator and command buffer allocator too
itzmeanjan Mar 19, 2025
9c2ba00
Given a matrix, returns a buffer with transfer-src flag set
itzmeanjan Mar 19, 2025
dce7da2
Add error enum for vulkan buffer creation failure
itzmeanjan Mar 19, 2025
2b8d84c
Simplify return in matrix to transfer source buffer function
itzmeanjan Mar 19, 2025
96552dc
Add function recording Vulkan buffer to buffer data transfer command
itzmeanjan Mar 19, 2025
0e21934
Make error type more explicit
itzmeanjan Mar 19, 2025
1b3c3bc
Add function to create empty Vulkan storage buffer
itzmeanjan Mar 19, 2025
e526074
Add function to submit transfer command buffer to queue and wait till…
itzmeanjan Mar 19, 2025
db5aca1
Rename error enum variant to be more generic
itzmeanjan Mar 19, 2025
8ff3965
Add function for computing number of bytes required to encode matrix
itzmeanjan Mar 19, 2025
9f4e0ea
Matrix-matrix multiplication command submission and execution on GPU …
itzmeanjan Mar 20, 2025
3d5757b
Reformat GLSL compute shader using clang-format
itzmeanjan Mar 20, 2025
1cc4806
Add matrix transpose compute shader
itzmeanjan Mar 20, 2025
98a0746
Submit and wait for matrix transpose job to finish on GPU
itzmeanjan Mar 20, 2025
679bc17
Fix matrix transpose shader
itzmeanjan Mar 20, 2025
3f33f81
Refactor function for transferring host matrix to device
itzmeanjan Mar 20, 2025
9b50f41
Maintain two different functions for host-accessible and device-local…
itzmeanjan Mar 20, 2025
450d7dc
Implement server-setup phase for `gpu` feature
itzmeanjan Mar 20, 2025
3be9c22
Add row-vector transposed matrix multiplication compute shader
itzmeanjan Apr 1, 2025
40ba459
Implement server-respond function, using `gpu` feature
itzmeanjan Apr 1, 2025
ec4a802
Change work-group size for vector-matrix multiplication shader invoca…
itzmeanjan Apr 4, 2025
1d2ed91
Duplicate comment for `gpu` feature-gated version of `server-respond`…
itzmeanjan Apr 4, 2025
fe5ce49
Avoid computing vector-matrix multiplication on GPU during `server-re…
itzmeanjan Apr 5, 2025
1e391ad
Update project documentation mentioning about the `gpu` feature gate
itzmeanjan Apr 6, 2025
42a6736
Prepare for release v0.5.0
itzmeanjan Apr 6, 2025
15 changes: 12 additions & 3 deletions Cargo.toml
Original file line number Diff line number Diff line change
@@ -1,6 +1,6 @@
[package]
name = "chalamet_pir"
version = "0.4.0"
version = "0.5.0"
edition = "2024"
resolver = "2"
rust-version = "1.85.0"
Expand All @@ -9,14 +9,22 @@ description = "Simple, Stateful, Single-Server Private Information Retrieval for
readme = "README.md"
repository = "https://github.com/itzmeanjan/ChalametPIR.git"
license = "MPL-2.0"
keywords = ["priv-info-retrieval", "lwe-pir", "frodo-pir", "chalamet-pir"]
categories = ["cryptography", "data-structures"]
keywords = [
"priv-info-retrieval",
"lwe-pir",
"frodo-pir",
"chalamet-pir",
"gpu",
]
categories = ["cryptography", "data-structures", "concurrency"]

[dependencies]
turboshake = "=0.4.1"
rayon = "=1.10.0"
rand = "=0.9.0"
rand_chacha = "=0.9.0"
vulkano = { version = "=0.35.1", optional = true }
vulkano-shaders = { version = "=0.35.0", optional = true }

[dev-dependencies]
test-case = "=3.3.1"
@@ -34,6 +42,7 @@ required-features = ["mutate_internal_client_state"]

[features]
mutate_internal_client_state = []
gpu = ["dep:vulkano", "dep:vulkano-shaders"]

[profile.optimized]
inherits = "release"
55 changes: 38 additions & 17 deletions README.md
@@ -9,14 +9,12 @@ built on top of FrodoPIR - a practical, single-server, stateful LWE-based PIR s
- Binary Fuse Filter was proposed in https://arxiv.org/pdf/2201.01174.
- And ChalametPIR was proposed in https://ia.cr/2024/092.

ChalametPIR allows a client to retrieve a specific value from a key-value database on a server without revealing the requested key.
It uses Binary Fuse Filters to encode key-value pairs in form of a matrix. And then it applies FrodoPIR on the encoded database matrix
to actually retrieve values for requested keys.
ChalametPIR allows a client to retrieve a specific value from a key-value database, stored on a server, without revealing the requested key to the server. It uses Binary Fuse Filters to encode key-value pairs in the form of a matrix, and then applies FrodoPIR on the encoded database matrix to actually retrieve values for requested keys.

The protocol has two participants:

**Server:**
* **`setup`:** Initializes the server with a key-value database, generating a public matrix, a hint matrix, and a Binary Fuse Filter (3-wise XOR or 4-wise XOR, compile-time configurable). Returns serialized representations of the hint matrix and filter parameters. This phase can be completed in offline and it's completely client agnostic.
* **`setup`:** Initializes the server with a key-value database, generating a public matrix, a hint matrix, and a Binary Fuse Filter (3-wise XOR or 4-wise XOR, configurable at compile time). It returns serialized representations of the hint matrix and filter parameters. This phase can be completed offline and is completely client-agnostic. But it is very compute-intensive, which is why this library allows you to offload expensive matrix multiplication and transposition to a GPU, gated behind the opt-in `gpu` feature. For large key-value databases (e.g., with >= $2^{18}$ entries), I recommend enabling the `gpu` feature, as it can significantly reduce the cost of the server-setup phase.
* **`respond`:** Processes a client's query and returns an encrypted response vector.

**Client:**
@@ -28,8 +26,8 @@ To paint a more practical picture, imagine, we have a database with $2^{20}$ (~1

Machine Type | Machine | Kernel | Compiler | Memory Read Speed
--- | --- | --- | --- | ---
aarch64 server | AWS EC2 `m8g.8xlarge` | `Linux 6.8.0-1021-aws aarch64` | `rustc 1.84.1 (e71f9a9a9 2025-01-27)` | 28.25 GB/s
x86_64 server | AWS EC2 `m7i.8xlarge` | `Linux 6.8.0-1021-aws x86_64` | `rustc 1.84.1 (e71f9a9a9 2025-01-27)` | 10.33 GB/s
aarch64 server | AWS EC2 `m8g.8xlarge` | `Linux 6.8.0-1021-aws aarch64` | `rustc 1.85.1 (e71f9a9a9 2025-01-27)` | 28.25 GB/s
x86_64 server | AWS EC2 `m7i.8xlarge` | `Linux 6.8.0-1021-aws x86_64` | `rustc 1.85.1 (e71f9a9a9 2025-01-27)` | 10.33 GB/s

and this implementation of ChalametPIR is compiled with specified compiler, in `optimized` profile. See [Cargo.toml](./Cargo.toml).

@@ -44,22 +42,34 @@ Step | `(a)` Time Taken on `aarch64` server | `(b)` Time Taken on `x86_64` serve
`server_respond` | 18.01 milliseconds | 32.16 milliseconds | 0.56
`client_process_response` | 11.73 microseconds | 16.75 microseconds | 0.7

> [!NOTE]
> In above table, I show only the median timing measurements, while the DB is encoded using a 3 -wise XOR Binary Fuse Filter. For more results, with more database configurations, see benchmarking [section](#benchmarking) below.

So, the median bandwidth of the `server_respond` algorithm, which needs to traverse the whole processed database, is
- (a) For `aarch64` server: 53.82 GB/s
- (b) For `x86_64` server: 30.12 GB/s
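These two bandwidth figures can be cross-checked against the `server_respond` latencies in the table above: multiplying each median bandwidth by the corresponding median latency should recover the same processed-database size on both machines. A minimal sketch of that consistency check (the ~0.97 GB figure is derived here, not stated in the original):

```rust
// Cross-check of the bandwidth figures: bandwidth (GB/s) x median
// `server_respond` latency (s) = size of the processed database
// traversed per query. Both servers should agree on that size.
fn processed_db_gb(bandwidth_gb_per_s: f64, latency_s: f64) -> f64 {
    bandwidth_gb_per_s * latency_s
}

fn main() {
    let aarch64 = processed_db_gb(53.82, 0.01801); // 18.01 ms median latency
    let x86_64 = processed_db_gb(30.12, 0.03216); // 32.16 ms median latency
    // Both come out to roughly 0.97 GB of processed database.
    assert!((aarch64 - x86_64).abs() < 0.01);
    println!("aarch64: {aarch64:.3} GB, x86_64: {x86_64:.3} GB");
}
```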

To demonstrate the effectiveness of offloading parts of the server-setup phase to a GPU, I benchmark it on the AWS EC2 instance `g6e.8xlarge`, which features an NVIDIA L40S Tensor Core GPU and $3^{rd}$ generation AMD EPYC CPUs.

Number of entries in DB | Key length | Value length | `(a)` Time taken to setup PIR server on CPU | `(b)` Time taken to setup PIR server, partially offloading to GPU | Ratio `a / b`
:-- | --: | --: | --: | --: | --:
$2^{16}$ | 32B | 1kB | 19.55 seconds | 19.39 seconds | 1.0
$2^{18}$ | 32B | 1kB | 6.0 minutes | 2.23 minutes | 2.69
$2^{20}$ | 32B | 1kB | 25.89 minutes | 25.58 seconds | 60.72

For small key-value databases, it is not worth offloading server-setup to the GPU; but for databases with >= $2^{18}$ entries, it is recommended to enable the `gpu` feature when a GPU is available.

> [!NOTE]
> In both of the above tables, I show only the median timing measurements; the DB is encoded using a 3-wise XOR Binary Fuse Filter. For more results, with more database configurations, see the benchmarking [section](#benchmarking) below.

## Prerequisites
Rust stable toolchain; see https://rustup.rs for installation guide. MSRV for this crate is 1.84.0.
Rust stable toolchain; see https://rustup.rs for the installation guide. MSRV for this crate is 1.85.0.

```bash
# While developing this library, I was using
$ rustc --version
rustc 1.84.1 (e71f9a9a9 2025-01-27)
rustc 1.85.1 (e71f9a9a9 2025-01-27)
```

If you plan to offload server-setup to the GPU, you need to install the Vulkan drivers and libraries for your target setup. I followed https://linux.how2shout.com/how-to-install-vulkan-on-ubuntu-24-04-or-22-04-lts-linux on Ubuntu 24.04 LTS with Nvidia GPUs - it was easy to set up.

## Testing
The `chalamet_pir` library includes comprehensive tests to ensure functional correctness.

@@ -69,8 +79,12 @@ The `chalamet_pir` library includes comprehensive tests to ensure functional cor
To run the tests, go to the project's root directory and issue:

```bash
cargo test --profile test-release # Custom profile to make tests run faster!
# Default debug mode is too slow!
# Custom profile to make tests run faster!
# Default debug mode is too slow!
cargo test --profile test-release

# For testing if offloading to GPU works as expected.
cargo test --features gpu --profile test-release
```


@@ -80,9 +94,12 @@ Performance benchmarks are included to evaluate the efficiency of the PIR scheme
To run the benchmarks, execute the following command from the root of the project:

```bash
cargo bench --all-features --profile optimized # For benchmarking the online phase of the PIR,
# you need to enable feature `mutate_internal_client_state`,
# passing `--all-features` does that.
# For benchmarking the online phase of the PIR,
# you need to enable feature `mutate_internal_client_state`.
cargo bench --features mutate_internal_client_state --profile optimized

# For benchmarking only the server-setup phase, offloaded to the GPU.
cargo bench --features gpu --profile optimized --bench offline_phase -q server_setup
```

> [!WARNING]
@@ -101,7 +118,11 @@ First, add this library crate as a dependency in your Cargo.toml file.

```toml
[dependencies]
chalamet_pir = "=0.4.0"
chalamet_pir = "=0.5.0"
# Or, if you want to offload server-setup to a GPU.
# chalamet_pir = { version = "=0.5.0", features = ["gpu"] }
rand = "=0.9.0"
rand_chacha = "=0.9.0"
```

Then, let's code a very simple keyword PIR scheme:
37 changes: 37 additions & 0 deletions shaders/mat_transpose.glsl
@@ -0,0 +1,37 @@
#version 460
#pragma shader_stage(compute)

layout(local_size_x = 8, local_size_y = 8, local_size_z = 1) in;

layout(set = 0, binding = 0) buffer readonly MatrixA {
uint rows;
uint cols;
uint[] elems;
}
matrix_a;

layout(set = 0, binding = 1) buffer writeonly MatrixB {
uint rows;
uint cols;
uint[] elems;
}
matrix_b;

void main() {
const uint row_idx = gl_GlobalInvocationID.x;
const uint col_idx = gl_GlobalInvocationID.y;

if (row_idx >= matrix_a.rows || col_idx >= matrix_a.cols) {
return;
}

if ((row_idx == 0) && (col_idx == 0)) {
matrix_b.rows = matrix_a.cols;
matrix_b.cols = matrix_a.rows;
}

const uint src_index = row_idx * matrix_a.cols + col_idx;
const uint dst_index = col_idx * matrix_a.rows + row_idx;

matrix_b.elems[dst_index] = matrix_a.elems[src_index];
}
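The transpose shader above maps each invocation `(row_idx, col_idx)` of the source matrix to a flat destination index. A CPU reference using the same row-major flat indexing can serve as a sketch for validating GPU output; the `Mat` helper type below is illustrative, not part of the crate's public API:

```rust
// Minimal CPU reference for the transpose shader: same flat row-major
// indexing (src = r * cols + c, dst = c * rows + r). `Mat` is a
// hypothetical helper type used only for this sketch.
struct Mat {
    rows: u32,
    cols: u32,
    elems: Vec<u32>,
}

fn transpose(a: &Mat) -> Mat {
    let (rows, cols) = (a.rows as usize, a.cols as usize);
    let mut elems = vec![0u32; rows * cols];
    for r in 0..rows {
        for c in 0..cols {
            // Mirrors the shader: dst_index = col_idx * matrix_a.rows + row_idx.
            elems[c * rows + r] = a.elems[r * cols + c];
        }
    }
    // The shader's (0, 0) invocation swaps the stored dimensions; do the same.
    Mat { rows: a.cols, cols: a.rows, elems }
}

fn main() {
    let a = Mat { rows: 2, cols: 3, elems: vec![1, 2, 3, 4, 5, 6] };
    let b = transpose(&a);
    assert_eq!((b.rows, b.cols), (3, 2));
    assert_eq!(b.elems, vec![1, 4, 2, 5, 3, 6]);
}
```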
47 changes: 47 additions & 0 deletions shaders/mat_x_mat.glsl
@@ -0,0 +1,47 @@
#version 460
#pragma shader_stage(compute)

layout(local_size_x = 8, local_size_y = 8, local_size_z = 1) in;

layout(set = 0, binding = 0) buffer readonly MatrixA {
uint rows;
uint cols;
uint[] elems;
}
matrix_a;

layout(set = 0, binding = 1) buffer readonly MatrixB {
uint rows;
uint cols;
uint[] elems;
}
matrix_b;

layout(set = 0, binding = 2) buffer writeonly MatrixC {
uint rows;
uint cols;
uint[] elems;
}
matrix_c;

void main() {
const uint row_idx = gl_GlobalInvocationID.x;
const uint col_idx = gl_GlobalInvocationID.y;

if (row_idx >= matrix_a.rows || col_idx >= matrix_b.cols) {
return;
}

if ((row_idx == 0) && (col_idx == 0)) {
matrix_c.rows = matrix_a.rows;
matrix_c.cols = matrix_b.cols;
}

uint sum = 0;
for (uint i = 0; i < matrix_a.cols; i++) {
sum += matrix_a.elems[row_idx * matrix_a.cols + i] *
matrix_b.elems[i * matrix_b.cols + col_idx];
}

matrix_c.elems[row_idx * matrix_b.cols + col_idx] = sum;
}
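The multiplication shader computes one output element per invocation, accumulating `sum` in GLSL `uint` arithmetic, which wraps modulo $2^{32}$. A CPU reference with matching wrapping semantics, plus the ceil-division workgroup count implied by the `local_size_x = 8, local_size_y = 8` layout, can be sketched as follows (free functions on flat row-major slices; these helpers are illustrative, not crate API):

```rust
// CPU reference for the matrix-multiplication shader, for validating
// GPU results. GLSL `uint` arithmetic wraps modulo 2^32, so the
// reference uses wrapping_mul / wrapping_add.
fn mat_mul(a_rows: usize, a_cols: usize, a: &[u32], b_cols: usize, b: &[u32]) -> Vec<u32> {
    let mut c = vec![0u32; a_rows * b_cols];
    for r in 0..a_rows {
        for k in 0..a_cols {
            let a_rk = a[r * a_cols + k];
            for col in 0..b_cols {
                // Mirrors the shader's accumulation, with explicit mod-2^32 wrap.
                c[r * b_cols + col] =
                    c[r * b_cols + col].wrapping_add(a_rk.wrapping_mul(b[k * b_cols + col]));
            }
        }
    }
    c
}

// With local_size 8x8, the host must dispatch ceil(dim / 8) workgroups
// per axis; the shader's bounds check then discards the ragged edge.
fn workgroups(dim: u32) -> u32 {
    dim.div_ceil(8)
}

fn main() {
    // (2x3) x (3x2) example.
    let a = [1, 2, 3, 4, 5, 6];
    let b = [7, 8, 9, 10, 11, 12];
    assert_eq!(mat_mul(2, 3, &a, 2, &b), vec![58, 64, 139, 154]);
    assert_eq!(workgroups(20), 3); // 20 rows -> 3 groups of 8
}
```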
4 changes: 2 additions & 2 deletions src/client.rs
@@ -42,7 +42,7 @@ impl Client {
let filter = BinaryFuseFilter::from_bytes(filter_param_bytes)?;

let pub_mat_a_num_rows = LWE_DIMENSION;
let pub_mat_a_num_cols = filter.num_fingerprints;
let pub_mat_a_num_cols = filter.num_fingerprints as u32;

let pub_mat_a = Matrix::generate_from_seed(pub_mat_a_num_rows, pub_mat_a_num_cols, seed_μ)?;
let hint_mat_m = Matrix::from_bytes(hint_bytes)?;
@@ -225,7 +225,7 @@ impl Client {
let hashed_key = binary_fuse_filter::hash_of_key(key);
let hash = binary_fuse_filter::mix256(&hashed_key, &self.filter.seed);

let recovered_row = (0..response_vector.num_cols())
let recovered_row = (0..response_vector.num_cols() as usize)
.map(|idx| {
let unscaled_res = response_vector[(0, idx)].wrapping_sub(secret_vec_c[(0, idx)]);

5 changes: 4 additions & 1 deletion src/lib.rs
@@ -8,6 +8,7 @@
//! * **Secure Private Information Retrieval:** Allows clients to retrieve value from a PIR server without disclosing corresponding key. Server learns neither the value nor the queried key.
//! * **Error Handling:** Comprehensive error handling to catch and report issues during setup, query generation, and response processing.
//! * **Flexibility:** Supports both 3-wise and 4-wise XOR Binary Fuse Filters, allowing a choice between trade-offs in client/server computation and communication costs.
//! * **Efficient:** It supports offloading parts of the server-setup phase to a GPU, using the Vulkan compute API, which can drastically reduce the time taken to set up the PIR server for large key-value databases.
//!
//! ## Usage
//!
@@ -18,7 +19,9 @@
//!
//! ```toml
//! [dependencies]
//! chalametpir = "=0.4.0"
//! chalamet_pir = "=0.5.0"
//! # Or, if you want to offload server-setup to GPU.
//! # chalamet_pir = { version = "=0.5.0", features = ["gpu"] }
//! rand = "=0.9.0"
//! rand_chacha = "=0.9.0"
//! ```
29 changes: 29 additions & 0 deletions src/pir_internals/error.rs
@@ -6,6 +6,21 @@ use std::{error::Error, fmt::Display};
/// It includes errors related to matrix operations, binary fuse filter operations, and PIR operations.
#[derive(Debug, PartialEq)]
pub enum ChalametPIRError {
// GPU
VulkanLibraryNotFound,
VulkanInstanceCreationFailed,
VulkanPhysicalDeviceNotFound,
VulkanDeviceCreationFailed,
VulkanBufferCreationFailed,
VulkanCommandBufferBuilderCreationFailed,
VulkanCommandBufferRecordingFailed,
VulkanCommandBufferBuildingFailed,
VulkanCommandBufferExecutionFailed,
VulkanReadingFromBufferFailed,
VulkanComputeShaderLoadingFailed,
VulkanComputePipelineCreationFailed,
VulkanDescriptorSetCreationFailed,

// Matrix
InvalidMatrixDimension,
IncompatibleDimensionForMatrixMultiplication,
@@ -36,6 +51,20 @@ pub enum ChalametPIRError {
impl Display for ChalametPIRError {
fn fmt(&self, f: &mut std::fmt::Formatter<'_>) -> std::fmt::Result {
match self {
Self::VulkanLibraryNotFound => write!(f, "Failed to load the default Vulkan library for the system."),
Self::VulkanInstanceCreationFailed => write!(f, "Failed to create a new instance of Vulkan."),
Self::VulkanPhysicalDeviceNotFound => write!(f, "Failed to find a compatible Vulkan physical device."),
Self::VulkanDeviceCreationFailed => write!(f, "Failed to create a Vulkan device and associated queue."),
Self::VulkanBufferCreationFailed => write!(f, "Failed to create a Vulkan transfer source buffer."),
Self::VulkanCommandBufferBuilderCreationFailed => write!(f, "Failed to create a Vulkan command buffer builder."),
Self::VulkanCommandBufferRecordingFailed => write!(f, "Failed to record command in a Vulkan command buffer."),
Self::VulkanCommandBufferBuildingFailed => write!(f, "Failed to build a Vulkan command buffer."),
Self::VulkanCommandBufferExecutionFailed => write!(f, "Failed to execute the Vulkan command buffer."),
Self::VulkanReadingFromBufferFailed => write!(f, "Failed to read from Vulkan buffer."),
Self::VulkanComputeShaderLoadingFailed => write!(f, "Failed to load Vulkan compute shader module."),
Self::VulkanComputePipelineCreationFailed => write!(f, "Failed to create Vulkan compute pipeline."),
Self::VulkanDescriptorSetCreationFailed => write!(f, "Failed to create descriptor set for Vulkan compute pipeline."),

Self::InvalidMatrixDimension => write!(f, "The number of rows and columns in the matrix must be non-zero."),
Self::IncompatibleDimensionForMatrixMultiplication => write!(f, "The matrix dimensions do not allow multiplication."),
Self::IncompatibleDimensionForMatrixAddition => write!(f, "The matrix dimensions do not allow addition."),