DHAT: a dynamic heap analysis tool
Intro to DHAT
To fully understand DHAT please read the Valgrind docs of DHAT. Here's just a short summary and quote from the docs:
DHAT is primarily a tool for examining how programs use their heap allocations. It tracks the allocated blocks, and inspects every memory access to find which block, if any, it is to. It presents, on a program point basis, information about these blocks such as sizes, lifetimes, numbers of reads and writes, and read and write patterns.
The rest of this chapter is dedicated to how DHAT is integrated into Gungraun.
The DHAT modes
Gungraun supports all three modes heap
(the default), copy
and ad-hoc
which can be changed on the command-line with
--dhat-args=--mode=ad-hoc
or in the benchmark itself with Dhat::args
. Note
that ad-hoc
mode requires client requests which have
prerequisites. If running the benchmarks in ad-hoc
mode, it is highly
recommended to turn off the EntryPoint
with EntryPoint::None
(See next
section). However, DHAT is normally run in heap
mode and it is assumed that
this is the mode used in the next sections.
The default entry point
The DHAT default entry point EntryPoint::Default
in library benchmarks behaves
like Callgrind's EntryPoint
. This
centers the collected metrics shown in the terminal output on the benchmark
function. The entry point is set to EntryPoint::None
for binary benchmarks.
But, if necessary, the entry point can be turned off or customized in
Dhat::entry_point
.
Similar to Callgrind's entry point, the default entry point for DHAT excludes
metrics related to setup
and/or teardown
code, as well as any
elements specified in the args
parameter of the #[bench]
or #[benches]
attributes. This behavior typically aligns with user expectations. However, DHAT
has a unique characteristic: if the benchmarked function utilizes an array
created in the setup function, the metrics will not capture the reads and writes
to that array. To accurately measure these reads and writes, it is necessary to
set the entry point to the setup function (in this case, the
setup_worst_case_array
function).
extern crate gungraun; mod my_lib { pub fn bubble_sort(_: Vec<i32>) -> Vec<i32> { vec![] } } use std::hint::black_box; use gungraun::{ library_benchmark, library_benchmark_group, main, Dhat, EntryPoint, LibraryBenchmarkConfig, }; pub fn setup_worst_case_array(start: i32) -> Vec<i32> { if start.is_negative() { (start..0).rev().collect() } else { (0..start).rev().collect() } } #[library_benchmark] #[bench::worst_case_3(setup_worst_case_array(3))] fn bench_library(array: Vec<i32>) -> Vec<i32> { black_box(my_lib::bubble_sort(array)) } library_benchmark_group!(name = my_group; benchmarks = bench_library); fn main() { main!( config = LibraryBenchmarkConfig::default() .tool(Dhat::default() .entry_point( EntryPoint::Custom("*::setup_worst_case_array".to_owned()) ) ); library_benchmark_groups = my_group ); }
Usage on the command-line
Running DHAT instead of or in addition to Callgrind is pretty straight-forward and not different to any other tool:
Either use command-line arguments or environment
variables: --default-tool=dhat
or
GUNGRAUN_DEFAULT_TOOL=dhat
(replaces callgrind as default tool) or
--tools=dhat
or GUNGRAUN_TOOLS=dhat
(runs DHAT in addition to the
default tool).
Usage in a benchmark and a small example analysis
Running DHAT in addition to Callgrind can also be carried out in the benchmark
itself with the Dhat
struct in LibraryBenchmarkConfig::tool
. We stick to the
example from above. The above benchmark will produce the following metrics:
lib_bench_dhat::my_group::bench_library worst_case_3:vec! [3, 2, 1]
======= CALLGRIND ====================================================================
Instructions: 83|N/A (*********)
L1 Hits: 110|N/A (*********)
LL Hits: 0|N/A (*********)
RAM Hits: 3|N/A (*********)
Total read+write: 113|N/A (*********)
Estimated Cycles: 215|N/A (*********)
======= DHAT =========================================================================
Total bytes: 12|N/A (*********)
Total blocks: 1|N/A (*********)
At t-gmax bytes: 0|N/A (*********)
At t-gmax blocks: 0|N/A (*********)
At t-end bytes: 0|N/A (*********)
At t-end blocks: 0|N/A (*********)
Reads bytes: 24|N/A (*********)
Writes bytes: 36|N/A (*********)
Gungraun result: Ok. 1 without regressions; 0 regressed; 0 filtered; 1 benchmarks finished in 0.55554s
Analyzing the DHAT data, there are a total of 12 bytes
of allocations (The
vector: 3 * sizeof(i32)
bytes = 3 * 4
bytes) in 1
block during the setup
of the benchmark in setup_worst_case_array
. That's also 12
bytes of writes
to fill the vector with the values. That makes 24
bytes of reads and 24
bytes of writes in the bubble_sort
function. Also, there are no
(de-)allocations of heap memory in bubble_sort
itself.
Soft limits and hard limits
Based on that data, we could define for example hard limits (or soft limits or
both whatever you think is appropriate) to ensure bubble_sort
is not getting
worse than that.
extern crate gungraun; mod my_lib { pub fn bubble_sort(_: Vec<i32>) -> Vec<i32> { vec![] } } use gungraun::{ library_benchmark, library_benchmark_group, main, LibraryBenchmarkConfig, Dhat, DhatMetric }; use std::hint::black_box; #[library_benchmark] #[bench::worst_case_3( args = (vec![3, 2, 1]), config = LibraryBenchmarkConfig::default() .tool(Dhat::default() .hard_limits([ (DhatMetric::ReadsBytes, 24), (DhatMetric::WritesBytes, 32) ]) ) )] fn bench_bubble_sort(array: Vec<i32>) -> Vec<i32> { black_box(my_lib::bubble_sort(array)) } library_benchmark_group!(name = my_group; benchmarks = bench_bubble_sort); fn main() { main!( config = LibraryBenchmarkConfig::default() .tool(Dhat::default()); library_benchmark_groups = my_group ); }
Now, if bubble_sort
would read more than 24
bytes or if there were more than
32
bytes of writes during the benchmark, the benchmark would fail and exit
with error.
Frames and benchmarking multi-threaded functions
It is possible to specify additional Dhat::frames
for example when
benchmarking multi-threaded functions. Like in callgrind, each thread/subprocess
in DHAT is treated as a separate unit and thus requires frames
(the
Gungraun specific approximation of callgrind toggles) in addition to the
default entry point to include the interesting ones in the measurements.
By example. Suppose there's a function in the benchmark_tests
library
find_primes_multi_thread(num_threads: usize)
which searches for primes in the
range 0
- 10000 * num_threads
. This multi-threaded function is splitting the
work for each 10000
numbers into a separate thread each calling the
single-threaded function benchmark_tests::find_primes
which does the actual
work. The inner workings aren't important but this description should be enough
to understand the basic idea.
extern crate gungraun; mod benchmark_tests { pub fn find_primes_multi_thread (_: u64) -> Vec<u64> { vec![] } } use std::hint::black_box; use gungraun::{ library_benchmark, library_benchmark_group, main, LibraryBenchmarkConfig, ValgrindTool, }; #[library_benchmark( config = LibraryBenchmarkConfig::default() .default_tool(ValgrindTool::DHAT) )] fn bench_library() -> Vec<u64> { black_box(benchmark_tests::find_primes_multi_thread(black_box(1))) } library_benchmark_group!(name = my_group; benchmarks = bench_library); fn main() { main!(library_benchmark_groups = my_group); }
Running the benchmark produces the following output:
lib_bench_find_primes::my_group::bench_library
======= DHAT =========================================================================
Total bytes: 11472|N/A (No change)
Total blocks: 9|N/A (No change)
At t-gmax bytes: 10264|N/A (No change)
At t-gmax blocks: 4|N/A (No change)
At t-end bytes: 0|N/A (No change)
At t-end blocks: 0|N/A (No change)
Reads bytes: 808|N/A (No change)
Writes bytes: 10345|N/A (No change)
Gungraun result: Ok. 1 without regressions; 0 regressed; 0 filtered; 1 benchmarks finished in 0.47449s
The problem here is, that the spawned thread is not included in the metrics.
Looking at the output files of the dhat output in dh_view.html
(heavily
shortened to safe some space):
Invocation {
Mode: heap
Command: /home/some/project/target/release/deps/lib_bench_find_primes-59f714debd12ac0a --gungraun-run my_group 0 0 lib_bench_find_primes::my_group::bench_library
PID: 262049
}
Times {
t-gmax: 2,826,328 instrs (99.56% of program duration)
t-end: 2,838,795 instrs
}
▼ PP 1/1 (3 children) {
Total: 47,303 bytes (100%, 16,663.06/Minstr) in 39 blocks (100%, 13.74/Minstr), avg size 1,212.9 bytes, avg lifetime 858,522.46 instrs (30.24% of program duration)
At t-gmax: 27,261 bytes (100%) in 10 blocks (100%), avg size 2,726.1 bytes
At t-end: 456 bytes (100%) in 1 blocks (100%), avg size 456 bytes
Reads: 45,572 bytes (100%, 16,053.29/Minstr), 0.96/byte
Writes: 48,000 bytes (100%, 16,908.58/Minstr), 1.01/byte
Allocated at {
#0: [root]
}
}
├─▼ PP 1.1/3 (14 children) {
│ Total: 46,503 bytes (98.31%, 16,381.25/Minstr) in 30 blocks (76.92%, 10.57/Minstr), avg size 1,550.1 bytes, avg lifetime 880,394 instrs (31.01% of program duration)
│ At t-gmax: 26,925 bytes (98.77%) in 8 blocks (80%), avg size 3,365.63 bytes
│ At t-end: 456 bytes (100%) in 1 blocks (100%), avg size 456 bytes
│ Reads: 45,108 bytes (98.98%, 15,889.84/Minstr), 0.97/byte
│ Writes: 47,640 bytes (99.25%, 16,781.77/Minstr), 1.02/byte
│ Allocated at {
│ #1: 0x48F47A8: malloc (in /usr/lib/valgrind/vgpreload_dhat-amd64-linux.so)
│ }
│ }
│ ├── PP 1.1.1/14 {
│ │ Total: 32,736 bytes (69.2%, 11,531.65/Minstr) in 10 blocks (25.64%, 3.52/Minstr), avg size 3,273.6 bytes, avg lifetime 235,134.1 instrs (8.28% of program duration)
│ │ Max: 16,384 bytes in 1 blocks, avg size 16,384 bytes
│ │ At t-gmax: 16,384 bytes (60.1%) in 1 blocks (10%), avg size 16,384 bytes
│ │ At t-end: 0 bytes (0%) in 0 blocks (0%), avg size 0 bytes
│ │ Reads: 26,184 bytes (57.46%, 9,223.63/Minstr), 0.8/byte
│ │ Writes: 26,184 bytes (54.55%, 9,223.63/Minstr), 0.8/byte
│ │ Allocated at {
│ │ ^1: 0x48F47A8: malloc (in /usr/lib/valgrind/vgpreload_dhat-amd64-linux.so)
│ │ #2: 0x4050CE3: UnknownInlinedFun (alloc.rs:94)
│ │ #3: 0x4050CE3: UnknownInlinedFun (alloc.rs:189)
│ │ #4: 0x4050CE3: UnknownInlinedFun (alloc.rs:250)
│ │ #5: 0x4050CE3: UnknownInlinedFun (mod.rs:476)
│ │ #6: 0x4050CE3: with_capacity_in<alloc::alloc::Global> (mod.rs:422)
│ │ #7: 0x4050CE3: with_capacity_in<u64, alloc::alloc::Global> (mod.rs:190)
│ │ #8: 0x4050CE3: with_capacity_in<u64, alloc::alloc::Global> (mod.rs:929)
│ │ #9: 0x4050CE3: with_capacity<u64> (mod.rs:500)
│ │ #10: 0x4050CE3: from_iter<u64, core::iter::adapters::filter::Filter<core::ops::range::RangeInclusive<u64>, benchmark_tests::find_primes::{closure_env#0}>> (spec_from_iter_nested.rs:31)
│ │ #11: 0x4050CE3: <alloc::vec::Vec<T> as alloc::vec::spec_from_iter::SpecFromIter<T,I>>::from_iter (spec_from_iter.rs:34)
│ │ #12: 0x404EF57: from_iter<u64, core::iter::adapters::filter::Filter<core::ops::range::RangeInclusive<u64>, benchmark_tests::find_primes::{closure_env#0}>> (mod.rs:3633)
│ │ #13: 0x404EF57: collect<core::iter::adapters::filter::Filter<core::ops::range::RangeInclusive<u64>, benchmark_tests::find_primes::{closure_env#0}>, alloc::vec::Vec<u64, alloc::alloc::Global>> (iterator.rs:2027)
│ │ #14: 0x404EF57: benchmark_tests::find_primes (lib.rs:31)
│ │ #15: 0x40504D0: {closure#0} (lib.rs:38)
│ │ #16: 0x40504D0: std::sys::backtrace::__rust_begin_short_backtrace (backtrace.rs:158)
│ │ #17: 0x404FAD4: {closure#0}<benchmark_tests::find_primes_multi_thread::{closure_env#0}, alloc::vec::Vec<u64, alloc::alloc::Global>> (mod.rs:559)
│ │ #18: 0x404FAD4: call_once<alloc::vec::Vec<u64, alloc::alloc::Global>, std::thread::{impl#0}::spawn_unchecked_::{closure#1}::{closure_env#0}<benchmark_tests::find_primes_multi_thread::{closure_env#0}, alloc::vec::Vec<u64, alloc::alloc::Global>>> (unwind_safe.rs:272)
│ │ #19: 0x404FAD4: do_call<core::panic::unwind_safe::AssertUnwindSafe<std::thread::{impl#0}::spawn_unchecked_::{closure#1}::{closure_env#0}<benchmark_tests::find_primes_multi_thread::{closure_env#0}, alloc::vec::Vec<u64, alloc::alloc::Global>>>, alloc::vec::Vec<u64, alloc::alloc::Global>> (panicking.rs:589)
│ │ #20: 0x404FAD4: catch_unwind<alloc::vec::Vec<u64, alloc::alloc::Global>, core::panic::unwind_safe::AssertUnwindSafe<std::thread::{impl#0}::spawn_unchecked_::{closure#1}::{closure_env#0}<benchmark_tests::find_primes_multi_thread::{closure_env#0}, alloc::vec::Vec<u64, alloc::alloc::Global>>>> (panicking.rs:552)
│ │ #21: 0x404FAD4: catch_unwind<core::panic::unwind_safe::AssertUnwindSafe<std::thread::{impl#0}::spawn_unchecked_::{closure#1}::{closure_env#0}<benchmark_tests::find_primes_multi_thread::{closure_env#0}, alloc::vec::Vec<u64, alloc::alloc::Global>>>, alloc::vec::Vec<u64, alloc::alloc::Global>> (panic.rs:359)
│ │ #22: 0x404FAD4: {closure#1}<benchmark_tests::find_primes_multi_thread::{closure_env#0}, alloc::vec::Vec<u64, alloc::alloc::Global>> (mod.rs:557)
│ │ #23: 0x404FAD4: core::ops::function::FnOnce::call_once{{vtable.shim}} (function.rs:253)
│ │ #24: 0x408461E: call_once<(), dyn core::ops::function::FnOnce<(), Output=()>, alloc::alloc::Global> (boxed.rs:1971)
│ │ #25: 0x408461E: std::sys::pal::unix::thread::Thread::new::thread_start (thread.rs:107)
│ │ #26: 0x49F09CA: ??? (in /usr/lib/libc.so.6)
│ │ #27: 0x4A74833: clone (in /usr/lib/libc.so.6)
│ │ }
│ │ }
...
The missing metrics of the thread are caused by the default entry point which
only includes the program points with the benchmark function in their call
stack. But, looking closely at the program point PP 1.1.1/12
and the call
stack, there's no frame of the benchmark function bench_library
or a main
function. As mentioned earlier, this is because the thread is completely
separated by DHAT.
There are multiple ways to go on depending on what we want to measure. To show
two different approaches, at first, I'll go with measuring the benchmark
function with the function spawning the threads (the default entry point which
doesn't have to be specified) and additionally all threads which execute the
benchmark_tests::find_primes
function.
extern crate gungraun; mod benchmark_tests { pub fn find_primes_multi_thread (_: u64) -> Vec<u64> { vec![] } } use std::hint::black_box; use gungraun::{ library_benchmark, library_benchmark_group, main, LibraryBenchmarkConfig, ValgrindTool, Dhat }; #[library_benchmark( config = LibraryBenchmarkConfig::default() .default_tool(ValgrindTool::DHAT) .tool(Dhat::default() .frames(["benchmark_tests::find_primes"]) ) )] fn bench_library() -> Vec<u64> { black_box(benchmark_tests::find_primes_multi_thread(black_box(1))) } library_benchmark_group!(name = my_group; benchmarks = bench_library); fn main() { main!(library_benchmark_groups = my_group); }
Now, the metrics include the spawned thread(s):
lib_bench_find_primes::my_group::bench_library
======= DHAT =========================================================================
Total bytes: 44208|N/A (No change)
Total blocks: 19|N/A (No change)
At t-gmax bytes: 26648|N/A (No change)
At t-gmax blocks: 5|N/A (No change)
At t-end bytes: 0|N/A (No change)
At t-end blocks: 0|N/A (No change)
Reads bytes: 26992|N/A (No change)
Writes bytes: 36529|N/A (No change)
Gungraun result: Ok. 1 without regressions; 0 regressed; 0 filtered; 1 benchmarks finished in 0.48695s
If we were only interested in the threads themselves, then using
EntryPoint::Custom
would be one way to do it. Setting a custom entry point is
syntactic sugar for disabling the entry point with EntryPoint::None
and
specifying a frame with Dhat::frames
:
extern crate gungraun; mod benchmark_tests { pub fn find_primes_multi_thread (_: u64) -> Vec<u64> { vec![] } } use std::hint::black_box; use gungraun::{ library_benchmark, library_benchmark_group, main, LibraryBenchmarkConfig, ValgrindTool, Dhat, EntryPoint }; #[library_benchmark( config = LibraryBenchmarkConfig::default() .default_tool(ValgrindTool::DHAT) .tool(Dhat::default() .entry_point( EntryPoint::Custom("benchmark_tests::find_primes".to_owned()) ) ) )] fn bench_library() -> Vec<u64> { black_box(benchmark_tests::find_primes_multi_thread(black_box(1))) } library_benchmark_group!(name = my_group; benchmarks = bench_library); fn main() { main!(library_benchmark_groups = my_group); }
Running this benchmark results in:
lib_bench_find_primes::my_group::bench_library
======= DHAT =========================================================================
Total bytes: 32736|N/A (*********)
Total blocks: 10|N/A (*********)
At t-gmax bytes: 16384|N/A (*********)
At t-gmax blocks: 1|N/A (*********)
At t-end bytes: 0|N/A (*********)
At t-end blocks: 0|N/A (*********)
Reads bytes: 26184|N/A (*********)
Writes bytes: 26184|N/A (*********)
Gungraun result: Ok. 1 without regressions; 0 regressed; 0 filtered; 1 benchmarks finished in 0.45178s
To verify our setup, let's compare these numbers with the data of the program
point with the thread of the dh_view.html
output shown above. Eventually,
these are the same metrics:
│ ├── PP 1.1.1/12 {
│ │ Total: 32,736 bytes (69.91%, 11,537.69/Minstr) in 10 blocks (27.03%, 3.52/Minstr), avg size 3,273.6 bytes, avg lifetime 235,111.9 instrs (8.29% of program duration)
│ │ Max: 16,384 bytes in 1 blocks, avg size 16,384 bytes
│ │ At t-gmax: 16,384 bytes (61.03%) in 1 blocks (11.11%), avg size 16,384 bytes
│ │ At t-end: 0 bytes (0%) in 0 blocks (0%), avg size 0 bytes
│ │ Reads: 26,184 bytes (57.08%, 9,228.46/Minstr), 0.8/byte
│ │ Writes: 26,184 bytes (54.23%, 9,228.46/Minstr), 0.8/byte