Tracing memory allocations is critical when debugging performance or memory-related issues in high-performance applications. However, Rust still lacks a mature toolset for this, and many techniques require code changes, such as swapping out the default global allocator.
In this article, we will explore how we used bpftrace to track down the large memory allocations we were seeing in Readyset, such as the one in the graph above.
BPFtrace is a high-level tracing language for Linux, built on the Berkeley Packet Filter (BPF) technology. BPFtrace allows developers, system administrators, and performance engineers to write concise, expressive scripts to probe kernel and user-space events for debugging, performance tuning, and observability.
It provides a powerful way to interact with the system’s BPF subsystem without requiring in-depth kernel programming expertise. BPFtrace is inspired by tools like awk and DTrace, offering a scripting interface that abstracts the complexities of writing raw eBPF programs.
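As a taste of the syntax, here is a classic bpftrace one-liner (borrowed from the standard bpftrace tutorial, not specific to Readyset) that prints every file opened on the system:

```
bpftrace -e 'tracepoint:syscalls:sys_enter_openat { printf("%s opened %s\n", comm, str(args->filename)); }'
```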
Avoiding Overhead from Tracing malloc
Tracing every malloc call is impractical in high-performance applications due to the sheer volume of allocations and the resulting overhead. Profiling tools that hook into malloc can introduce significant slowdowns, making them unsuitable for production workloads or scenarios requiring minimal performance impact.
Memory allocators like glibc's malloc or jemalloc manage memory allocation requests in user-space programs and, for large allocations, often rely on system calls like brk or mmap. These calls allow allocators to handle memory efficiently by choosing the appropriate mechanism depending on the size and nature of the allocation.
The brk system call adjusts the program’s data segment size by modifying the program break, which marks the end of the process’s heap segment. Allocators typically use brk for small or medium-sized allocations, as it allows them to expand the heap in a contiguous manner. When a program requests memory, the allocator can increase the program break to expand the heap and carve out memory for subsequent allocations from this newly acquired region. This approach is efficient for managing contiguous memory and works well for allocations that fit comfortably within the heap’s structure.
For large memory allocations, allocators often use the mmap system call instead of relying on the heap managed by brk. Unlike brk, which simply extends the heap’s size, mmap works by mapping memory directly into the process’s virtual address space, outside the heap. This allows the allocator to request memory at arbitrary locations in the address space, independent of the existing heap layout.
The use of mmap is particularly suited for large allocations because it avoids the fragmentation that can occur when large and small allocations are mixed in the same region of memory. Instead of crowding the heap, large allocations are handled in separate, dedicated regions of the address space. Additionally, memory provided by mmap is naturally aligned to page boundaries, which matches how modern operating systems manage virtual memory and ensures efficient access.
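To see this split in practice, you can watch the allocator's system calls with strace; a quick sketch, assuming a release binary at ./target/release/readyset (the path is illustrative):

```
# Only memory-management syscalls: small heap growth tends to show up as brk,
# large allocations as mmap (exact behavior depends on the allocator and sizes).
strace -f -e trace=mmap,brk ./target/release/readyset
```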
Readyset uses jemalloc as its global allocator, and jemalloc uses mmap to allocate large chunks of memory. Because mmap calls happen much less often than malloc calls, tracing them introduces negligible overhead while still offering valuable information about large memory allocations.
Finding the mmap Symbol to Trace
To trace jemalloc's use of mmap, we need to find the relevant symbol in the compiled binary. Using a tool such as nm, we can search the binary's text (code) segment symbols for mmap-related functions:
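Something along these lines (the path and address are illustrative, and the binary must not be stripped of symbols; "t"/"T" marks text-segment symbols):

```
$ nm target/release/readyset | grep -i mmap
00000000045a1c20 t _rjem_je_extent_alloc_mmap
...
```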
In the above example, the relevant symbol is _rjem_je_extent_alloc_mmap.
Using BPFtrace to Trace Allocations Larger than 128 MB
With the symbol identified, a BPFtrace script can be written to monitor memory allocations exceeding 128 MB. The following script attaches a probe to _rjem_je_extent_alloc_mmap and filters for allocations above the specified size:
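A sketch of such a script (the binary path is illustrative; adjust it to where your readyset binary lives):

```
#!/usr/bin/env bpftrace
// Fires whenever jemalloc allocates an extent via mmap; arg1 is the
// requested size (the function's second parameter).
uprobe:/usr/bin/readyset:_rjem_je_extent_alloc_mmap
/arg1 > 128 * 1024 * 1024/
{
    time("%H:%M:%S ");
    printf("PID %d TID %d (%s) mmap allocation: %d MB\n",
        pid, tid, comm, arg1 / (1024 * 1024));
    printf("%s\n", ustack);
}
```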
Script Breakdown:
1. Probe: A user-space probe (uprobe) is attached to the _rjem_je_extent_alloc_mmap symbol in the binary.
2. Filter: The script filters calls where the size argument (arg1) exceeds 128 MB.
3. Output: For each matching event, the script logs:
   - Timestamp
   - Process ID (PID)
   - Thread ID (TID)
   - Thread name (comm)
   - Allocation size in MB
   - A user stack trace for context
Example output:
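(Illustrative output; PIDs, addresses, and symbol offsets will differ on your system. Note how the stack trace ends in unresolved addresses.)

```
11:32:49 PID 64819 TID 64943 (tokio-runtime-w) mmap allocation: 2048 MB

        _rjem_je_extent_alloc_mmap+0
        ...
        alloc::raw_vec::finish_grow+80
        0x7fa3c4e11e40
        0x7fa3c4e0f2b8
```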
Incomplete Rust Stack Traces
Our first experiments with BPFtrace showed incomplete stack traces, making it difficult to trace where in the codebase an allocation was triggered. In the above example, we can tell that 2 GB are being mmap'ed as part of a vector resize, but we cannot tell which internal Readyset code triggered it.
BPFtrace uses frame pointers to generate full and accurate stack traces when profiling or tracing user-space and kernel-space applications.
Frame pointers are a convention used in compiled programs to manage and trace function call stacks. They rely on a dedicated CPU register (commonly the base pointer or frame pointer, represented as %rbp on x86-64) that points to the start of the current function's stack frame. This allows debuggers, profilers, and tracing tools to walk the stack easily and reconstruct the sequence of function calls. Since this register is not strictly necessary to run the program, some compilers omit frame pointers by default to free it up for general use.
To turn frame pointers on, we can either pass a flag through an environment variable at build time or adjust .cargo/config.toml:
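Both routes set the same rustc codegen flag. Via an environment variable at build time:

```
RUSTFLAGS="-C force-frame-pointers=yes" cargo build --release
```

Or persistently, in .cargo/config.toml:

```
[build]
rustflags = ["-C", "force-frame-pointers=yes"]
```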
If we check now, we will get the full stack trace:
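(Again illustrative and abbreviated; the point is that the Rust frames now resolve all the way up into application code instead of ending in unknown addresses.)

```
        _rjem_je_extent_alloc_mmap+0
        ...
        alloc::raw_vec::finish_grow+80
        indexmap::map::IndexMap<K,V,S>::insert+97
        <readermap frames from the Readyset codebase>
        tokio::runtime::task::raw::poll+173
```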
With the complete stack trace, we were able to track down exactly which vector was growing: in this case, a vector inside an indexmap backing our readermap.
Performance Impact
When using tracing tools like BPFtrace, a common concern is whether attaching a probe to a running program will slow it down. However, during testing, we found that running the program with BPFtrace attached to trace mmap calls caused no noticeable performance impact.
This is largely because mmap is called relatively infrequently compared to other memory operations like malloc. Since mmap is typically used for large memory allocations, the overhead of tracing these calls is minimal due to their low frequency. BPFtrace efficiently handles these events, and its reliance on the eBPF framework ensures that most of the work is done in the kernel with minimal disruption to the program.
Combined, the low frequency of mmap calls and the efficiency of eBPF make this approach safe for observing large memory allocations without affecting the program's performance, which means BPFtrace can be used confidently even in production environments where performance is critical.
Summary
In this post, we explored how to trace large memory allocations in Rust applications using BPFtrace, focusing on the less frequent mmap calls used by jemalloc for large allocations. By identifying the relevant symbol in the binary with nm and attaching a BPFtrace uprobe, we efficiently tracked allocations over 128 MB with minimal performance impact.
Initially, incomplete stack traces made it hard to pinpoint allocation sources, but enabling frame pointers during compilation resolved this, allowing us to trace memory growth back to specific data structures. Our tests showed that tracing mmap calls with BPFtrace introduced no noticeable slowdown, thanks to their low frequency and eBPF’s efficiency.
This approach provided valuable insights into memory usage in Readyset, offering a practical, low-overhead solution for debugging and optimizing high-performance applications.