In multi-threaded C++ programs, the primary concern is protecting shared resources from concurrent access. Developers typically reach for std::mutex with std::lock_guard or another RAII lock variant. As systems grow more complex, however, it becomes clear that no one-size-fits-all locking policy exists: different concurrency scenarios demand different synchronization designs, an aspect that is often overlooked.
Is std::mutex always the best choice? Consider the classic reader-writer scenario: multiple threads need to read shared data, while only one thread writes to it. Developers often default to a standard std::mutex for protection here.

However, readers pose no threat to the shared data, since they access it without modifying it. Forcing readers to queue behind an exclusive std::mutex means only one reader touches the data at a time, even though reading does not alter shared state.
This serialization turns potentially parallel read operations into strictly sequential processing, creating an unnecessary bottleneck in read-heavy workloads.
The only time exclusive locking is truly required is when a writer enters the critical section. For exactly this reader-writer pattern, the standard library offers a specialized read-write mutex, std::shared_mutex, which supports both shared reader access and exclusive writer access.
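Before measuring anything, here is a minimal sketch of the pattern (the names here are illustrative and separate from the benchmark code later in this post):

#include <map>
#include <shared_mutex>

std::shared_mutex table_mtx;   // guards `table`
std::map<int, double> table;

double ReadEntry(int key) {
    // Many readers can hold a shared_lock at the same time.
    std::shared_lock<std::shared_mutex> lock(table_mtx);
    auto it = table.find(key);
    return it != table.end() ? it->second : 0.0;
}

void WriteEntry(int key, double value) {
    // A unique_lock waits for all readers to drain and excludes everyone else.
    std::unique_lock<std::shared_mutex> lock(table_mtx);
    table[key] = value;
}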
std::shared_mutex is designed for read-heavy workloads with many concurrent readers and few writers, where std::mutex would serialize read access, reduce parallelism, and waste available CPU cores on modern multi-core systems. The performance trade-offs are not immediately obvious, so we will use Google Benchmark to measure how std::mutex and std::shared_mutex scale and to see how these read-write locks behave under real multi-threaded workloads.
In this post we will discuss the following topics:
- Installing Google Benchmark on Ubuntu
- Designing the Google Benchmark Test Code
- Benchmark Test Environment
- Performance Comparison: std::mutex vs std::shared_mutex Benchmarks
- Addendum: Multiple Readers with Lighter Read Workloads
- What to Keep in Mind
- Source Code
Installing Google Benchmark on Ubuntu
Before we dive into the code, the benchmark library must be installed on your system. On Ubuntu, you can set this up using the following commands:
sudo apt update
sudo apt install libbenchmark-dev libgtest-dev cmake g++
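To verify the installation works, you can compile and run a trivial benchmark first (the file and benchmark names below are arbitrary):

// smoke_test.cpp
#include <benchmark/benchmark.h>

static void BM_Noop(benchmark::State &state) {
    for (auto _ : state) {
        int x = 0;
        benchmark::DoNotOptimize(x);  // Keep the empty loop from being optimized away
    }
}
BENCHMARK(BM_Noop);
BENCHMARK_MAIN();

g++ -O2 smoke_test.cpp -o smoke_test -lbenchmark -lpthread && ./smoke_test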
Designing the Google Benchmark Test Code
We will design the test step by step to compare a standard mutex against a shared mutex.
Defining the Data Context
We encapsulate the shared data and the mutexes into a single structure. To ensure the most accurate results, we use alignas(64). This prevents “False Sharing,” where the hardware cache coherency protocol causes performance degradation because two different mutexes happen to reside on the same CPU cache line.
struct BenchmarkContext {
    std::map<int, double> data;
    // Mutexes are separated by 64 bytes to avoid cache line contention
    alignas(64) std::mutex regular_mtx;
    alignas(64) std::shared_mutex shared_mtx;

    void setup() {
        if (data.empty()) {
            for (int i = 0; i < 1000; ++i) {
                data[i] = std::sqrt(i);
            }
        }
    }
};
static BenchmarkContext g_ctx;
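If you want to sanity-check the layout at runtime, a small hypothetical helper can print which 64-byte line each mutex starts on (assuming <cstdint> and <cstdio> are included):

void CheckMutexPlacement() {
    const auto line_index = [](const void *p) {
        // Divide the address by 64 to get a cache line index
        return static_cast<unsigned long long>(
            reinterpret_cast<std::uintptr_t>(p) / 64);
    };
    // Different indices confirm the mutexes occupy separate cache lines.
    std::printf("regular_mtx line: %llu, shared_mtx line: %llu\n",
                line_index(&g_ctx.regular_mtx), line_index(&g_ctx.shared_mtx));
}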
Simulating the Workload
We require a “Heavy Read” to make the test realistic. If the work inside the lock is too small, the benchmark will only measure the overhead of the lock mechanism itself. By performing trigonometric calculations, we simulate actual data processing.
void DoHeavyRead() {
    double total = 0;
    // Performing math operations forces the CPU to spend time inside the lock
    for (int i = 0; i < 50; ++i) {
        // Use at() so the read never inserts into the map: operator[] on a
        // std::map is not a read-only operation and is unsafe under a shared lock.
        total += std::sin(g_ctx.data.at(i % 1000));
    }
    benchmark::DoNotOptimize(total);
}
benchmark::DoNotOptimize(total) prevents the compiler from discarding the result of our read workload as dead code, ensuring Google Benchmark captures the true cost of the work done under the lock.
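A related tool is benchmark::ClobberMemory(), which forces pending writes to actually reach memory. A brief sketch of the usual pairing (the vector example is purely illustrative, not part of our benchmark; assumes <vector> is included):

static void BM_VectorPush(benchmark::State &state) {
    for (auto _ : state) {
        std::vector<int> v;
        v.reserve(1);
        benchmark::DoNotOptimize(v.data());  // Make the compiler treat v's storage as observed
        v.push_back(42);
        benchmark::ClobberMemory();          // Force the store of 42 to be written to memory
    }
}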
The Comparison Logic
In our Google Benchmark tests, we use state.thread_index() to identify each thread’s unique index. Thread index zero becomes the writer, while all remaining threads act as concurrent readers.
- Regular Mutex: every thread uses std::lock_guard, which is always exclusive.
- Shared Mutex: the writer uses std::unique_lock (exclusive), while readers use std::shared_lock (shared).
// Logic inside the loop for the shared mutex
if (state.thread_index() == 0) {
    std::unique_lock<std::shared_mutex> lock(g_ctx.shared_mtx);
    DoWrite();      // Exclusive access
} else {
    std::shared_lock<std::shared_mutex> lock(g_ctx.shared_mtx);
    DoHeavyRead();  // Parallel access
}
The Write Simulation
The writer's job is to modify shared state under exclusive access, which is what guarantees correctness in a multithreaded environment. In our benchmark, this role lives in a small, focused write function:
void DoWrite() {
    g_ctx.data[0] += 1.1;
    benchmark::DoNotOptimize(g_ctx.data[0]);
}
This writer function performs a single shared data update, incrementing the first element of g_ctx.data. The operation stays intentionally simple: we’re isolating synchronization overhead in std::shared_mutex benchmarks.
While the writer holds the std::unique_lock, all readers are blocked: pending readers cannot acquire the shared lock, and the writer itself first had to wait for any active readers to finish.
Complete Code Listing
The following program implements the logic discussed above using the Google Benchmark framework.
Example: A Performance Comparison of Exclusive vs. Shared Locking.
/**
 * @file shared_mutex_vs_mutex_bench.cpp
 * @brief Performance comparison between std::mutex and std::shared_mutex.
 */
#include <benchmark/benchmark.h>
#include <cmath>
#include <map>
#include <mutex>
#include <shared_mutex>

struct BenchmarkContext {
    std::map<int, double> data;
    alignas(64) std::mutex regular_mtx;
    alignas(64) std::shared_mutex shared_mtx;

    void setup() {
        if (data.empty()) {
            for (int i = 0; i < 1000; ++i) {
                data[i] = std::sqrt(i);
            }
        }
    }
};

static BenchmarkContext g_ctx;

void DoHeavyRead() {
    double total = 0;
    for (int i = 0; i < 50; ++i) {
        // at() never inserts, so concurrent readers stay race-free.
        total += std::sin(g_ctx.data.at(i % 1000));
    }
    benchmark::DoNotOptimize(total);
}

/**
 * @brief Simulates a writer modifying shared state.
 */
void DoWrite() {
    g_ctx.data[0] += 1.1;
    benchmark::DoNotOptimize(g_ctx.data[0]);
}

static void BM_RegularMutex_Mixed(benchmark::State &state) {
    // Only thread 0 performs setup; Google Benchmark synchronizes all
    // threads before the timed loop begins, so this is race-free.
    if (state.thread_index() == 0) {
        g_ctx.setup();
    }
    for (auto _ : state) {
        // All threads—readers and writers—must wait for exclusive access.
        std::lock_guard<std::mutex> lock(g_ctx.regular_mtx);
        if (state.thread_index() == 0) {
            DoWrite();
        } else {
            DoHeavyRead();
        }
    }
}
BENCHMARK(BM_RegularMutex_Mixed)->ThreadRange(2, 8)->UseRealTime();

static void BM_SharedMutex_Mixed(benchmark::State &state) {
    if (state.thread_index() == 0) {
        g_ctx.setup();
    }
    for (auto _ : state) {
        if (state.thread_index() == 0) {
            // Writer acquires exclusive access, pausing all readers.
            std::unique_lock<std::shared_mutex> lock(g_ctx.shared_mtx);
            DoWrite();
        } else {
            // Readers acquire shared access, allowing parallel execution.
            std::shared_lock<std::shared_mutex> lock(g_ctx.shared_mtx);
            DoHeavyRead();
        }
    }
}
BENCHMARK(BM_SharedMutex_Mixed)->ThreadRange(2, 8)->UseRealTime();

BENCHMARK_MAIN();
Compiling and Running the Program
To compile the code, we use the -O3 optimization flag, explicitly select C++17 (required for std::shared_mutex), and link the benchmark and pthread libraries.
g++ -O3 -std=c++17 shared_mutex_vs_mutex_bench.cpp -o lock_bench -lbenchmark -lpthread
./lock_bench
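If your project uses CMake instead, a minimal CMakeLists.txt along these lines should also work with the packaged library (the target name is illustrative):

cmake_minimum_required(VERSION 3.16)
project(lock_bench CXX)
set(CMAKE_CXX_STANDARD 17)
find_package(benchmark REQUIRED)          # provided by libbenchmark-dev
add_executable(lock_bench shared_mutex_vs_mutex_bench.cpp)
target_link_libraries(lock_bench benchmark::benchmark)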
Benchmark Test Environment
Tested on an Intel i7-3770 @ 3.40 GHz (4 cores, 8 threads with Hyper-Threading) running Ubuntu 22.04.5 LTS (Linux 6.8.0 kernel). All std::mutex vs std::shared_mutex benchmarks were built with g++ -O3 and run with ThreadRange(2, 8) and UseRealTime() for wall-clock accuracy.
Key Hardware Factors
- 8 logical cores via Hyper-Threading on 4 physical cores
- 8 MB shared L3 cache impacts lock contention patterns
- 64-byte cache line alignment (alignas(64)) prevents false sharing
- PREEMPT_DYNAMIC kernel minimizes scheduling variability
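You can confirm these details on your own machine with standard Linux tools:

lscpu | grep -E 'Model name|Core|Thread|L3'
getconf LEVEL3_CACHE_SIZE        # L3 size in bytes
getconf LEVEL1_DCACHE_LINESIZE   # cache line size, typically 64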
Performance Comparison: std::mutex vs std::shared_mutex Benchmarks
When the test binary is executed, Google Benchmark provides a detailed breakdown of the performance metrics. Below is a sample run from a machine with an 8-core CPU.

Understanding the Columns
Here is what the headers in this output represent:
- Benchmark: the name of the test followed by the thread count. For example, threads:8 means 1 writer and 7 readers were active.
- Time (Real Time): the "wall clock" time. This is the most important metric for multi-threaded code, as it measures how long the task actually took from start to finish, including time spent waiting for locks.
- CPU: the total CPU time consumed. In multi-threaded tests, this is often higher than Real Time because it sums the work done across all active cores.
- Iterations: the number of times the framework ran the loop to get a statistically stable average.
Analysis of BM_RegularMutex_Mixed (std::mutex)
Now let us explore the performance of std::mutex in a multi-threaded C++ program where one writer thread updates shared data and multiple reader threads access it concurrently. Understanding this behaviour is essential for developers looking to optimise C++ concurrency and multi-threaded read performance. The table below shows the Google Benchmark results for different thread counts:
| Threads | Real Time (ns) | CPU (ns) |
|---|---|---|
| 2 | 848 | 1369 |
| 4 | 2137 | 4588 |
| 8 | 1791 | 3599 |
std::mutex shows increasing contention as the thread count grows, limiting read throughput. The Real Time column measures wall-clock time per iteration, including time spent waiting for the mutex. With 2 threads, Real Time is 848 nanoseconds; contention is minimal because only one reader shares access with the writer, and the mutex enforces exclusive access.
When the thread count increases to 4, Real Time rises sharply to 2137 nanoseconds. This reflects the serialized nature of std::mutex. Every thread, whether reading or writing, must wait its turn. As more threads compete, thread contention increases, reducing the efficiency of multi-threaded reads.
At 8 threads, Real Time drops slightly to 1791 nanoseconds. This minor variation is within the expected statistical margin of multi-threaded benchmarking and does not indicate improved performance.
The CPU Time column indicates the total CPU time consumed across all threads.
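To separate real effects from run-to-run noise like the dip seen at 8 threads, Google Benchmark can repeat the whole suite and report aggregate statistics (mean, median, standard deviation):

./lock_bench --benchmark_repetitions=10 --benchmark_report_aggregates_only=true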
Analysis of BM_SharedMutex_Mixed (std::shared_mutex)
Now, let us examine how std::shared_mutex performs under a mixed workload consisting of one writer and multiple readers, as measured using Google Benchmark and compared against std::mutex.
The following table presents the performance statistics observed for std::shared_mutex during the benchmark run.
| Threads | Real Time (ns) | CPU (ns) |
|---|---|---|
| 2 | 7875 | 7846 |
| 4 | 344 | 1021 |
| 8 | 222 | 1538 |
From the benchmark results, it is clear that with two threads std::shared_mutex records a Real Time of 7875 nanoseconds, which is significantly higher than the 848 nanoseconds observed with std::mutex. This behaviour is expected in a low-concurrency configuration where there is only one reader and one writer. In such a case, access to the shared data is effectively serialized, and the shared mutex is unable to exploit its core advantage of concurrent reads.
Unlike std::mutex, std::shared_mutex must maintain additional internal state to distinguish between shared and exclusive ownership and to manage transitions between them. When the number of readers is small, this extra bookkeeping cost is not amortised and directly increases the cost of lock acquisition and release, resulting in higher observed latency.
As the number of threads increases, this trade-off changes. With four threads, the Real Time drops sharply as multiple readers are now able to acquire the shared lock concurrently whenever the writer releases it. This behaviour follows directly from the use of std::shared_lock in the code and reflects the scenario for which std::shared_mutex is designed. With eight threads, the effect becomes more pronounced, as several reader threads execute the read workload in parallel, allowing the system to make effective use of available CPU cores.
Addendum: Multiple Readers with Lighter Read Workloads
The situation may change if the read-side critical section is much lighter. To test this case, I added the following code and ran the benchmark again:
Light-Read Source Code for Testing
The only change is a much lighter read workload:
/**
 * @brief An extremely light read workload.
 * Performs a single lookup instead of 50 trig calculations.
 */
void DoLightRead() {
    // at() keeps the lookup race-free under the shared lock.
    double value = g_ctx.data.at(500);
    benchmark::DoNotOptimize(value);
}
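The results table below extends to 16 threads, so the light-read variants were presumably registered over a wider thread range; a sketch of what that registration might look like (the function name here is hypothetical):

static void BM_SharedMutex_LightMixed(benchmark::State &state) {
    if (state.thread_index() == 0) {
        g_ctx.setup();  // Thread 0 populates the map, as in the full listing
    }
    for (auto _ : state) {
        if (state.thread_index() == 0) {
            std::unique_lock<std::shared_mutex> lock(g_ctx.shared_mtx);
            DoWrite();
        } else {
            std::shared_lock<std::shared_mutex> lock(g_ctx.shared_mtx);
            DoLightRead();  // Single lookup instead of DoHeavyRead()
        }
    }
}
BENCHMARK(BM_SharedMutex_LightMixed)->ThreadRange(2, 16)->UseRealTime();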
With these changes in place we can run the test again and see the following output:

Let's compare the two side by side:
| Threads | std::mutex (Real Time) | std::shared_mutex (Real Time) | Observations |
|---|---|---|---|
| 2 | 101 ns | 5872 ns | shared_mutex much slower |
| 4 | 77.4 ns | 1473 ns | shared_mutex much slower |
| 8 | 127 ns | 80.3 ns | shared_mutex faster |
| 16 | 131 ns | 86.0 ns | shared_mutex faster |
The above table shows the results of a much lighter read combined with higher thread counts. With extremely light read-side critical sections, std::shared_mutex performs worse at low thread counts: its per-acquisition bookkeeping overhead dominates when there is almost no work to parallelize. As the thread count grows, contention on std::mutex's exclusive lock becomes the dominant cost and a crossover reappears: std::shared_mutex scales better by allowing concurrent readers.
What to Keep in Mind
- There is no single best mutex for all situations. The right choice depends entirely on how your data is accessed in a multi-threaded environment.
- std::shared_mutex is designed to benefit read-heavy workloads, where many threads frequently read shared data and write operations are relatively rare.
- The performance advantage of std::shared_mutex appears only when there are enough reader threads to take advantage of concurrent access.
- With few readers, shared locking provides little or no benefit and can perform worse than a regular mutex due to its additional management overhead.
- std::mutex remains a strong and efficient choice for low-concurrency scenarios or write-heavy designs, where exclusive access is unavoidable anyway.
- As reader concurrency increases, exclusive locking becomes a bottleneck, while shared locking allows the system to make better use of available CPU cores.
- The key to good concurrency design is understanding your workload, especially the balance between reads and writes, rather than assuming that a more advanced synchronization primitive will always be faster.
- Measuring and observing real behaviour, even with simple benchmarks, leads to better design decisions than relying on assumptions or defaults.
Source Code
All the source code for the benchmark programs discussed in this article is available in the accompanying GitHub repository.

