Thread Pool#
Join provides thread pool implementations for efficient parallel task execution. The library offers both a persistent thread pool for continuous task processing and one-shot parallel algorithms for data distribution.
Thread pool features:
- topology-aware sizing — defaults to the number of physical CPU cores via `CpuTopology`
- work queue — `std::deque`-backed job distribution protected by a mutex and condition variable
- graceful shutdown — workers drain the queue before exiting
- parallel algorithms — `distribute()` and `parallelForEach()`
ThreadPool#
Persistent pool of worker threads that process tasks from a shared queue.
Creating a pool#
```cpp
#include <join/threadpool.hpp>

using namespace join;

// Default: one worker per physical core (CpuTopology)
ThreadPool defaultPool;

// Custom worker count
ThreadPool customPool(4);
```

The worker count must be strictly positive — passing 0 or a negative value throws `std::invalid_argument`.
ThreadPool is neither copyable nor movable.
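When ownership of a pool must still change hands (for example, when it is created by a factory), one common workaround is to hold it behind a `std::unique_ptr`; a minimal sketch:

```cpp
#include <join/threadpool.hpp>

#include <memory>
#include <utility>

using namespace join;

// The pool itself cannot be moved, but a unique_ptr owning it can.
auto pool = std::make_unique<ThreadPool>(4);
auto owner = std::move(pool);  // transfers ownership; workers keep running
```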
Submitting tasks#
```cpp
void task(int id)
{
    // ...
}

ThreadPool pool;

// Submit function with arguments
pool.push(task, 42);

// Submit lambda
pool.push([]() {
    // ...
});

// Submit with captured variables
int value = 100;
pool.push([value]() {
    // ...
});
```

Tasks are enqueued as `std::function<void()>` objects and executed by the first available worker thread.
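The pool does not hand back futures (see the summary table at the end of this page), but a task can still publish a result through a `std::promise`. Because `std::function<void()>` requires copyable callables, this sketch holds the promise in a `std::shared_ptr`:

```cpp
#include <join/threadpool.hpp>

#include <future>
#include <memory>

using namespace join;

ThreadPool pool;

auto promise = std::make_shared<std::promise<int>>();
std::future<int> result = promise->get_future();

pool.push([promise]() {
    promise->set_value(42);  // publish the task's result
});

int answer = result.get();  // blocks until the task has run
```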
Pool size#
```cpp
size_t n = pool.size();
```

Shutdown behaviour#
The destructor:
- Sets the internal stop flag.
- Broadcasts to all waiting worker threads via `_condition.broadcast()`.
- Workers finish their current task, then drain any remaining queued jobs before exiting.
- Each `WorkerThread` destructor joins its underlying `Thread`.
This guarantees that all queued tasks complete before the pool is destroyed.
```cpp
{
    ThreadPool pool(4);

    for (int i = 0; i < 100; ++i)
    {
        pool.push([i]() { process(i); });
    }

    // All 100 tasks are guaranteed to complete here
}
```

Parallel algorithms#
distribute()#
Splits an iterator range across threads and executes a function on each partition.
Uses CpuTopology to determine the number of threads — capped to the number of elements.
```cpp
#include <join/threadpool.hpp>

#include <vector>

using namespace join;

std::vector<int> data(10000);

distribute(data.begin(), data.end(),
    [](auto first, auto last) {
        for (auto it = first; it != last; ++it)
        {
            // process *it
        }
    }
);
```

How it works:
- Computes the number of threads as `min(cores, count)`.
- Divides the range into equal partitions, distributing remainder elements to the first partitions.
- Spawns `concurrency - 1` threads to process their partitions.
- The calling thread processes the last partition directly.
- Joins all spawned threads before returning.
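The partitioning in the second step can be illustrated with a small standalone helper (hypothetical, not part of the Join API) that computes the index bounds for a given element and core count:

```cpp
#include <algorithm>
#include <cstddef>
#include <utility>
#include <vector>

// Hypothetical helper mirroring the partitioning described above.
// Assumes count > 0.
std::vector<std::pair<std::size_t, std::size_t>>
partitionBounds(std::size_t count, std::size_t cores)
{
    std::size_t concurrency = std::min(cores, count);
    std::size_t base  = count / concurrency;  // elements per partition
    std::size_t extra = count % concurrency;  // remainder, spread over the first partitions

    std::vector<std::pair<std::size_t, std::size_t>> bounds;
    std::size_t begin = 0;

    for (std::size_t i = 0; i < concurrency; ++i)
    {
        std::size_t length = base + (i < extra ? 1 : 0);
        bounds.emplace_back(begin, begin + length);
        begin += length;
    }

    return bounds;
}

// partitionBounds(10, 4) yields {0,3}, {3,6}, {6,8}, {8,10}.
```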
parallelForEach()#
Applies a unary function to each element in the range in parallel.
This is a convenience wrapper around distribute().
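A minimal sketch of how such a wrapper can be written in terms of `distribute()` (the actual implementation in Join may differ):

```cpp
#include <join/threadpool.hpp>

#include <algorithm>

using namespace join;

template <typename Iterator, typename Function>
void parallelForEachSketch(Iterator first, Iterator last, Function func)
{
    // Delegate the partitioning to distribute() and apply func
    // sequentially within each partition.
    distribute(first, last, [&func](Iterator begin, Iterator end) {
        std::for_each(begin, end, func);
    });
}
```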
```cpp
std::vector<int> data = {1, 2, 3, 4, 5, 6, 7, 8};

parallelForEach(data.begin(), data.end(),
    [](int& value) {
        value *= 2;
    }
);
```

Equivalent to `std::for_each` but executed in parallel across physical cores.
Usage patterns#
Background job processor#
```cpp
class JobProcessor
{
public:
    JobProcessor() : _pool(4) {}

    void submit(const std::string& data)
    {
        _pool.push([data]() { process(data); });
    }

private:
    static void process(const std::string& data) { /* ... */ }

    ThreadPool _pool;
};
```

Parallel computation#
```cpp
std::vector<double> results(1000);

parallelForEach(results.begin(), results.end(),
    [](double& result) {
        result = expensiveComputation();
    }
);
```

Map-reduce pattern#
```cpp
std::vector<int> data(10000);
std::atomic<int> total{0};

parallelForEach(data.begin(), data.end(),
    [&total](int value) {
        total += compute(value);
    }
);
```
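If contention on the shared atomic becomes a bottleneck, the same reduction can be expressed with `distribute()` and a per-partition partial sum that is merged once per partition; a sketch (summing raw elements rather than `compute()` results, for brevity):

```cpp
#include <join/threadpool.hpp>

#include <mutex>
#include <numeric>
#include <vector>

using namespace join;

std::vector<int> data(10000);
int total = 0;
std::mutex totalMutex;

distribute(data.begin(), data.end(),
    [&](auto first, auto last) {
        // Accumulate locally; take the lock only once per partition.
        int partial = std::accumulate(first, last, 0);
        std::lock_guard<std::mutex> lock(totalMutex);
        total += partial;
    }
);
```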
Performance considerations#

Sizing for CPU vs. I/O workloads#
```cpp
// CPU-bound: one thread per core (default)
ThreadPool cpuPool;

// I/O-bound: more threads to overlap blocking waits
ThreadPool ioPool(std::thread::hardware_concurrency() * 2);
```

Task granularity#
```cpp
// BAD: per-element push — queue overhead dominates
for (int i = 0; i < 1000000; ++i)
{
    pool.push([&data, i]() { data[i] *= 2; });
}

// GOOD: use parallel algorithms for data parallelism
parallelForEach(data.begin(), data.end(), [](int& v) { v *= 2; });
```

Prefer `parallelForEach()` and `distribute()` for data-parallel loops. Reserve `ThreadPool::push()` for heterogeneous tasks that must be queued independently.
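When small operations must still go through the pool, batching them into chunks amortises the per-push cost; a sketch, with `chunkSize` as a tuning assumption rather than a library constant:

```cpp
#include <algorithm>
#include <cstddef>

// Push one job per chunk instead of one per element.
const std::size_t chunkSize = 4096;  // tuning assumption

for (std::size_t begin = 0; begin < data.size(); begin += chunkSize)
{
    std::size_t end = std::min(begin + chunkSize, data.size());

    pool.push([&data, begin, end]() {
        for (std::size_t i = begin; i < end; ++i)
        {
            data[i] *= 2;
        }
    });
}
```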
Best practices#
- Let the default size stand — `CpuTopology`-based sizing avoids over-subscription on CPU-bound work
- Batch small tasks — group tiny operations to amortise queue and mutex overhead
- Prefer parallel algorithms — `distribute()` and `parallelForEach()` handle partitioning automatically
- Synchronize shared state — protect any data written by multiple tasks with mutexes or atomics
- Avoid blocking in tasks — long blocking calls stall a worker and reduce throughput
- Let the destructor handle shutdown — it drains the queue and joins all workers
Comparison: ThreadPool vs. distribute()#
| Feature | ThreadPool | distribute() |
|---|---|---|
| Lifetime | Persistent workers | One-shot execution |
| Use case | Ongoing heterogeneous tasks | Parallel data processing |
| Thread reuse | ✅ | ❌ Creates new threads |
| Overhead | Lower for many tasks | Lower for single batch |
| Shutdown | Drains full queue | Waits for all partitions |
Choose ThreadPool when processing a continuous or heterogeneous stream of tasks.
Choose distribute() when processing a single large dataset in one parallel pass.
Summary#
| Feature | Supported |
|---|---|
| Persistent worker pool | ✅ |
| Topology-aware sizing | ✅ |
| Task queue | ✅ |
| Graceful shutdown | ✅ |
| distribute() | ✅ |
| parallelForEach() | ✅ |
| Work stealing | ❌ |
| Future/promise support | ❌ |