Tracing Large Memory Allocations in Rust with BPFtrace

Tracing memory allocations is critical when debugging performance or memory-related issues in high-performance applications. However, Rust still lacks a mature toolset for this, and many techniques require code changes, such as swapping out the default global allocator.

In this article we will explore how we used bpftrace to track down the large memory allocations we were seeing in Readyset, such as the one in the graph above.

BPFtrace is a high-level tracing language for Linux, built on the Berkeley Packet Filter (BPF) technology. BPFtrace allows developers, system administrators, and performance engineers to write concise, expressive scripts to probe kernel and user-space events for debugging, performance tuning, and observability.

It provides a powerful way to interact with the system’s BPF subsystem without requiring in-depth kernel programming expertise. BPFtrace is inspired by tools like awk and DTrace, offering a scripting interface that abstracts the complexities of writing raw eBPF programs.

Avoiding Overhead from tracing malloc

Tracing every malloc call is impractical in high-performance applications due to the sheer volume of allocations and the resulting overhead. Profiling tools that hook into malloc can introduce significant slowdowns, making them unsuitable for production workloads or scenarios requiring minimal performance impact.

Memory allocators like glibc’s malloc or jemalloc manage memory allocation requests in user-space programs, and for large allocations, they often rely on system calls like brk or mmap. These calls allow allocators to handle memory efficiently by choosing the appropriate mechanism depending on the size and nature of the allocation.

The brk system call adjusts the program’s data segment size by modifying the program break, which marks the end of the process’s heap segment. Allocators typically use brk for small or medium-sized allocations, as it allows them to expand the heap in a contiguous manner. When a program requests memory, the allocator can increase the program break to expand the heap and carve out memory for subsequent allocations from this newly acquired region. This approach is efficient for managing contiguous memory and works well for allocations that fit comfortably within the heap’s structure.

For large memory allocations, allocators often use the mmap system call instead of relying on the heap managed by brk. Unlike brk, which simply extends the heap’s size, mmap works by mapping memory directly into the process’s virtual address space, outside the heap. This allows the allocator to request memory at arbitrary locations in the address space, independent of the existing heap layout.

The use of mmap is particularly suited for large allocations because it avoids the fragmentation that can occur when large and small allocations are mixed in the same region of memory. Instead of crowding the heap, large allocations are handled in separate, dedicated regions of the address space. Additionally, memory provided by mmap is naturally aligned to page boundaries, which matches how modern operating systems manage virtual memory and ensures efficient access.
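As an aside (not part of our original investigation), you can watch this brk-versus-mmap split for yourself with a quick system-wide bpftrace one-liner that counts how often each process issues either syscall; press Ctrl-C to stop it and print the per-process counts:

sudo bpftrace -e '
tracepoint:syscalls:sys_enter_brk  { @brk[comm]  = count(); }
tracepoint:syscalls:sys_enter_mmap { @mmap[comm] = count(); }'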

Readyset uses jemalloc as its default global allocator, and jemalloc uses mmap to allocate large chunks of memory. Because mmap calls happen much less often than malloc calls, tracing them introduces negligible overhead while still offering valuable information about large memory allocations.
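Readyset's actual allocator setup lives in its own crates, but for reference, opting into jemalloc as the global allocator in a Rust program typically looks something like the sketch below. This is a minimal, illustrative example assuming the tikv-jemallocator crate; the dependency version and names are assumptions, not Readyset's exact code:

// Cargo.toml (assumed): tikv-jemallocator = "0.6"
use tikv_jemallocator::Jemalloc;

// Route all global allocations through jemalloc instead of the system allocator.
#[global_allocator]
static GLOBAL: Jemalloc = Jemalloc;

fn main() {
    // An allocation this large is typically served by jemalloc via mmap,
    // which is exactly what the bpftrace probe below will observe.
    let big: Vec<u8> = vec![0u8; 256 * 1024 * 1024];
    println!("allocated {} bytes", big.len());
}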

Finding the mmap Symbol to Trace

To trace jemalloc's use of mmap, we need to find the relevant symbol in the compiled binary. Using a tool such as nm, we can search the binary's symbol table for mmap-related symbols:

$ nm ./target/release/readyset | grep mmap
00000001039cc90c T _rjem_je_extent_alloc_mmap
00000001039cc9ac T _rjem_je_extent_dalloc_mmap
                 U _mmap

In the above example, the relevant symbol is _rjem_je_extent_alloc_mmap.
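Alternatively (an extra check, not part of the original workflow), bpftrace itself can list the user-space symbols it is able to attach to, which is a quick way to confirm the probe target before writing the full script; the binary path here is just an example:

$ sudo bpftrace -l 'uprobe:./target/release/readyset:*mmap*'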

Using BPFtrace to Trace Allocations Larger than 128 MB

With the symbol identified, a BPFtrace script can be written to monitor memory allocations exceeding 128 MB. The following script attaches a probe to _rjem_je_extent_alloc_mmap and filters for allocations above the specified size:

sudo bpftrace -e '
uprobe:/home/ec2-user/readyset/public/target/release/readyset:_rjem_je_extent_alloc_mmap
/(int64)arg1 > (128 * 1024 * 1024)/
{
    printf("[%s] PID: %d, TID: %d TName: %s\n", strftime("%Y-%m-%d %H:%M:%S", nsecs), pid, tid, comm);
    printf("Called mmap, size: %llu MB\n", ((uint64) arg1) / ((uint64)(1024 * 1024)));
    printf("User Stack Trace:\n%s\n", ustack(perf));
}'

Script Breakdown:

1. Probe: A user-space probe (uprobe) is attached to the _rjem_je_extent_alloc_mmap symbol in the binary.

2. Filter: The script only fires for calls where the size argument (arg1) exceeds 128 MB.

3. Output: For each matching event, the script logs:
   - Timestamp
   - Process ID (PID)
   - Thread ID (TID)
   - Thread name (comm)
   - Allocation size in MB
   - A user stack trace for context

Example output:

[2025-01-02 18:17:30] PID: 9473, TID: 9542 TName: Domain 16.0.0
Called mmap, size: 2048 MB
User Stack Trace:
        556c55924800 _rjem_je_extent_alloc_mmap+0 (/home/ec2-user/readyset/public/target/release/readyset)
        556c5591d298 _rjem_je_ehooks_default_alloc_impl+216 (/home/ec2-user/readyset/public/target/release/readyset)
        556c55924147 _rjem_je_ecache_alloc_grow+1895 (/home/ec2-user/readyset/public/target/release/readyset)
        556c5592de95 pac_alloc_real+149 (/home/ec2-user/readyset/public/target/release/readyset)
        556c5592dff0 pac_alloc_impl+224 (/home/ec2-user/readyset/public/target/release/readyset)
        556c5592cfd1 _rjem_je_pa_alloc+81 (/home/ec2-user/readyset/public/target/release/readyset)
        556c558d5a36 _rjem_je_arena_extent_alloc_large+166 (/home/ec2-user/readyset/public/target/release/readyset)
        556c5592b1a1 _rjem_je_large_palloc+209 (/home/ec2-user/readyset/public/target/release/readyset)
        556c5592bd47 _rjem_je_large_ralloc+247 (/home/ec2-user/readyset/public/target/release/readyset)
        556c558ca8d6 do_rallocx+454 (/home/ec2-user/readyset/public/target/release/readyset)
        556c541344ed _ZN5alloc7raw_vec11finish_grow17h129f024319b1753dE.llvm.15453257542483125133+77 (/home/ec2-user/readyset/public/target/release/readyset)
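A variation we find useful on busy systems (not part of the script above) is to aggregate instead of printing every event, keying maps by the user stack; bpftrace prints the accumulated maps when the script exits:

sudo bpftrace -e '
uprobe:/home/ec2-user/readyset/public/target/release/readyset:_rjem_je_extent_alloc_mmap
/(int64)arg1 > (128 * 1024 * 1024)/
{
    // Total bytes requested and number of large allocations per unique user stack.
    @bytes[ustack] = sum(arg1);
    @count[ustack] = count();
}'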

Incomplete Rust Stack Traces

Our first experiments with BPFtrace showed incomplete stack traces, making it difficult to track down where in the codebase the allocation was triggered. In the example above we can tell that 2 GB is being mmap’ed as part of a vector resize, but we cannot tell which internal Readyset code is triggering it.

BPFtrace uses frame pointers to generate full and accurate stack traces when profiling or tracing user-space and kernel-space applications. 

Frame pointers are a convention used in compiled programs to manage and trace function call stacks. They rely on a dedicated CPU register (commonly the base pointer or frame pointer, represented as %rbp on x86-64 systems) that points to the start of the current function’s stack frame. This allows debuggers, profilers, and tracing tools to walk the stack easily and reconstruct the sequence of function calls. Since this additional register is not strictly necessary to run the program, many compilers omit frame pointers by default as an optimization.

To enable frame pointers, we can either pass the flag through the RUSTFLAGS environment variable or set it in .cargo/config.toml:

# Environment variable
RUSTFLAGS="-C force-frame-pointers=yes" cargo build

# .cargo/config.toml
[build]
rustflags = ["-C", "force-frame-pointers=yes"]

If we rebuild with frame pointers enabled and trace again, we get the full stack trace:

[2025-01-02 19:44:35] PID: 119243, TID: 119280 TName: Domain 16.0.0
Called mmap, size: 2048 MB
User Stack Trace:
        559736979fd0 _rjem_je_extent_alloc_mmap+0 (/home/ec2-user/readyset/public/target/release/readyset)
        559736972a68 _rjem_je_ehooks_default_alloc_impl+216 (/home/ec2-user/readyset/public/target/release/readyset)
        559736979917 _rjem_je_ecache_alloc_grow+1895 (/home/ec2-user/readyset/public/target/release/readyset)
        559736983665 pac_alloc_real+149 (/home/ec2-user/readyset/public/target/release/readyset)
        5597369837c0 pac_alloc_impl+224 (/home/ec2-user/readyset/public/target/release/readyset)
        5597369827a1 _rjem_je_pa_alloc+81 (/home/ec2-user/readyset/public/target/release/readyset)
        55973692b206 _rjem_je_arena_extent_alloc_large+166 (/home/ec2-user/readyset/public/target/release/readyset)
        559736980971 _rjem_je_large_palloc+209 (/home/ec2-user/readyset/public/target/release/readyset)
        559736981517 _rjem_je_large_ralloc+247 (/home/ec2-user/readyset/public/target/release/readyset)
        5597369200a6 do_rallocx+454 (/home/ec2-user/readyset/public/target/release/readyset)
        5597350f25fd _ZN5alloc7raw_vec11finish_grow17h129f024319b1753dE.llvm.6655621825799014452+77 (/home/ec2-user/readyset/public/target/release/readyset)
        559735164f89 indexmap::map::core::RefMut$LT$K$C$V$GT$::insert_unique::h0abd8b6f6c4c6b2f+809 (/home/ec2-user/readyset/public/target/release/readyset)
        55973508d81e reader_map::inner::Inner$LT$K$C$V$C$M$C$T$C$S$C$I$GT$::data_entry::h920bdc650ca27f1f+1870 (/home/ec2-user/readyset/public/target/release/readyset)
        55973508edb3 reader_map::write::_$LT$impl$u20$left_right..Absorb$LT$reader_map..write..Operation$LT$K$C$V$C$M$C$T$GT$$GT$$u20$for$u20$reader_map..inner..Inner$LT$K$C$V$C$M$C$T$C$S$C$I$GT$$GT$::absorb_second::h9d499e62d36b0d02+1139 (/home/ec2-user/readyset/public/target/release/readyset)
        559735086062 left_right::write::WriteHandle$LT$T$C$O$GT$::publish::h517265aabf0c17c5+1234 (/home/ec2-user/readyset/public/target/release/readyset)
        559735026f7b readyset_dataflow::domain::Domain::handle_replay::h14d216df813b75ad+13915 (/home/ec2-user/readyset/public/target/release/readyset)
        55973501e7c0 readyset_dataflow::domain::Domain::handle::ha32f6fa845cfeb87+3824 (/home/ec2-user/readyset/public/target/release/readyset)
        559735036fef readyset_dataflow::domain::Domain::handle_packet::ha30cf516896dc980+799 (/home/ec2-user/readyset/public/target/release/readyset)
        559735bb99cc _ZN15readyset_server6worker7replica7Replica9run_inner28_$u7b$$u7b$closure$u7d$$u7d$17h71548d1ec4da5765E.llvm.17110769118536154604+4748 (/home/ec2-user/readyset/public/target/release/readyset)
        559735b91c6b _$LT$tracing..instrument..Instrumented$LT$T$GT$$u20$as$u20$core..future..future..Future$GT$::poll::h12cf7e57c90010a5+235 (/home/ec2-user/readyset/public/target/release/readyset)
        559735721f12 _ZN15readyset_server6worker7replica7Replica3run28_$u7b$$u7b$closure$u7d$$u7d$17h4e50dd40335201a1E.llvm.18257486542318815986+210 (/home/ec2-user/readyset/public/target/release/readyset)
        5597356f593b tokio::runtime::task::core::Core$LT$T$C$S$GT$::poll::hcdbcea595282fa4a+171 (/home/ec2-user/readyset/public/target/release/readyset)
        559735d420e5 tokio::runtime::task::harness::Harness$LT$T$C$S$GT$::poll::hd1ccfc0b4ed5da95+229 (/home/ec2-user/readyset/public/target/release/readyset)
        559735695560 tokio::runtime::scheduler::current_thread::CurrentThread::block_on::h23da13ca1c25d625+1600 (/home/ec2-user/readyset/public/target/release/readyset)
        559735d53cdb tokio::runtime::runtime::Runtime::block_on::h88ca14758f883377+203 (/home/ec2-user/readyset/public/target/release/readyset)
        5597355aa24b std::sys::backtrace::__rust_begin_short_backtrace::h26ec098603e0887d+91 (/home/ec2-user/readyset/public/target/release/readyset)
        559735c8d275 core::ops::function::FnOnce::call_once$u7b$$u7b$vtable.shim$u7d$$u7d$::hd39c8bb29e363132+421 (/home/ec2-user/readyset/public/target/release/readyset)
        55973630a8fb std::sys::pal::unix::thread::Thread::new::thread_start::hcc78f3943333fa94+43 (/home/ec2-user/readyset/public/target/release/readyset)
        7f99ef89f832 start_thread+722 (/usr/lib64/libc.so.6)

With the complete stack trace, we were able to track down which particular vector was growing; in this case it was a vector inside an indexmap used by our reader_map.
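One last tip (optional, and not something bpftrace does for you): the Rust frames in these traces are printed as mangled symbol names. They can be made readable by piping the tracer's output through a demangler such as rustfilt:

$ cargo install rustfilt
$ sudo bpftrace -e '<script from the previous section>' | rustfilt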

Performance Impact

When using tracing tools like BPFtrace, a common concern is whether attaching a probe to a running program will slow it down. However, during testing, we found that running the program with BPFtrace attached to trace mmap calls caused no noticeable performance impact.

This is largely because mmap is called relatively infrequently compared to other memory operations like malloc. Since mmap is typically used for large memory allocations, the overhead of tracing these calls is minimal due to their low frequency. BPFtrace efficiently handles these events, and its reliance on the eBPF framework ensures that most of the work is done in the kernel with minimal disruption to the program.

The low frequency of mmap calls, combined with the efficiency of eBPF, makes this approach safe for observing large memory allocations without affecting the program’s performance. This means BPFtrace can be used confidently even in production environments where performance is critical.

Summary

In this post, we explored how to trace large memory allocations in Rust applications using BPFtrace, focusing on the less frequent mmap calls used by jemalloc for large allocations. By identifying the relevant symbol in the binary with nm and attaching a BPFtrace uprobe, we efficiently tracked allocations over 128 MB with minimal performance impact.

Initially, incomplete stack traces made it hard to pinpoint allocation sources, but enabling frame pointers during compilation resolved this, allowing us to trace memory growth back to specific data structures. Our tests showed that tracing mmap calls with BPFtrace introduced no noticeable slowdown, thanks to their low frequency and eBPF’s efficiency.

This approach provided valuable insights into memory usage in Readyset, offering a practical, low-overhead solution for debugging and optimizing high-performance applications.