Skip to content

parlay::sequence constructor much slower than reserve + fill, and both slower than std::vector #87

@WhateverLiu

Description

@WhateverLiu

The following code measures

  1. Time for creating parlay::sequence via reserve + fill and the number of distinct buffers allocated in 100000 iterations of a parallel for loop.
  2. Time for creating parlay::sequence via its constructor and the number of distinct buffers allocated in 100000 iterations of a parallel for loop.
  3. Time for creating std::vector via its constructor and the number of distinct buffers allocated in 100000 iterations of a parallel for loop.
#include <iostream>
#include <vector>
#include <chrono>
// #include "infra/infra.hpp"
#include <parlay>

int main() {
  int ntab = 100000;
  auto now = std::chrono::steady_clock::now();
  auto ptrs1 = parlay::tabulate(ntab, [&](int i)->auto {
    parlay::sequence<double> a; a.reserve(20000); std::fill(a.begin(), a.end(), 0);
    parlay::sequence<double> b; b.reserve(20000); std::fill(b.begin(), b.end(), 0);
    return std::pair(a.data(), b.data());
  });
  std::cout << "Reserve and fill, time cost = " << std::chrono::duration_cast<
    std::chrono::microseconds> (std::chrono::steady_clock::now() - now).count();
  std::cout << "\nNumber of distinct vectors allocated = " << 
  parlay::remove_duplicates(parlay::slice(
      &ptrs1.front().first, &ptrs1.back().second + 1)).size() << "\n\n";
  decltype(ptrs1)().swap(ptrs1);
  
  
  now = std::chrono::steady_clock::now();
  auto ptrs2 = parlay::tabulate(ntab, [&](int i)->auto {
    parlay::sequence<double> a(20000, 0);
    parlay::sequence<double> b(20000, 0);
    return std::pair(a.data(), b.data());
  });
  std::cout << "Plain initialization, time cost = " << std::chrono::duration_cast<
    std::chrono::microseconds> (std::chrono::steady_clock::now() - now).count();
  std::cout << "\nNumber of distinct vectors allocated = " << 
    parlay::remove_duplicates(parlay::slice(
        &ptrs2.front().first, &ptrs2.back().second + 1)).size() << "\n\n";
  decltype(ptrs2)().swap(ptrs2);
  
  
  now = std::chrono::steady_clock::now();
  auto ptrs3 = parlay::tabulate(ntab, [&](int i)->auto {
    std::vector<double> a(20000, 0);
    std::vector<double> b(20000, 0);
    return std::pair(a.data(), b.data());
  });
  std::cout << "Use std::vector, time cost = " << std::chrono::duration_cast<
    std::chrono::microseconds> (std::chrono::steady_clock::now() - now).count();
  std::cout << "\nNumber of distinct vectors allocated = " << 
    parlay::remove_duplicates(parlay::slice(
        &ptrs3.front().first, &ptrs3.back().second + 1)).size() << "\n\n";
  decltype(ptrs3)().swap(ptrs3);
} 

Compiling with g++ -std=c++20 -O2 -lpthread -march=native code/cpp/parlayTest.cpp -o test, a sample printout can be:

Reserve and fill, time cost = 59335
Number of distinct vectors allocated = 28

Plain initialization, time cost = 154726
Number of distinct vectors allocated = 49

Use std::vector, time cost = 50010
Number of distinct vectors allocated = 34

The numbers 28 and 49 will change in different runs.

What surprised me a little bit is that (i) parlay::sequence() is substantially slower than reserve + fill and triggers more new allocations, (ii) using parlay::sequence is slower than using std::vector.

I have long believed in the superiority of parlay's memory allocation, but this example says otherwise. Is there a simple explanation?

What's the rule of thumb for allocating sequences within a parallel for loop using parlay?

Thanks a lot!

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions