DHAT: A Dynamic Heap Analysis Tool
Intro to DHAT
To fully understand DHAT please read the Valgrind docs for DHAT. Here’s just a short summary and quote from the docs:
DHAT is primarily a tool for examining how programs use their heap allocations. It tracks the allocated blocks, and inspects every memory access to find which block, if any, it is to. It presents, on a program point basis, information about these blocks such as sizes, lifetimes, numbers of reads and writes, and read and write patterns.
The rest of this chapter is dedicated to how DHAT is integrated into Gungraun.
The DHAT modes
Gungraun supports all three modes heap (the default), copy and ad-hoc
which can be changed on the command-line with
--dhat-args=--mode=ad-hoc or in the benchmark itself with Dhat::args. Note
that ad-hoc mode requires client requests which have
prerequisites. If running the benchmarks in ad-hoc mode, it is highly
recommended to turn off the EntryPoint with EntryPoint::None (See next
section). However, DHAT is normally run in heap mode and it is assumed that
this is the mode used in the next sections.
The Default Entry Point
The DHAT default entry point EntryPoint::Default in library benchmarks behaves
like
Callgrind's EntryPoint.
This centers the collected metrics shown in the terminal output on the benchmark
function. The entry point is set to EntryPoint::None for binary benchmarks.
But, if necessary, the entry point can be turned off or customized in
Dhat::entry_point.
Similar to Callgrind’s entry point, the default entry point for DHAT excludes
metrics related to
setup and/or teardown code,
as well as any elements specified in the args parameter of the #[bench] or
#[benches] attributes. This behavior typically aligns with user expectations.
However, DHAT has a unique characteristic: if the benchmarked function uses an
array created in the setup function, the metrics will not capture the reads and
writes to that array. To accurately measure these reads and writes, it is
necessary to set the entry point to the setup function (in this case, the
setup_worst_case_array function).
extern crate gungraun;
mod my_lib { pub fn bubble_sort(_: Vec<i32>) -> Vec<i32> { vec![] } }
use std::hint::black_box;
use gungraun::prelude::*;
use gungraun::{Dhat, EntryPoint};
pub fn setup_worst_case_array(start: i32) -> Vec<i32> {
if start.is_negative() {
(start..0).rev().collect()
} else {
(0..start).rev().collect()
}
}
#[library_benchmark]
#[bench::worst_case_3(setup_worst_case_array(3))]
fn bench_library(array: Vec<i32>) -> Vec<i32> {
black_box(my_lib::bubble_sort(black_box(array)))
}
library_benchmark_group!(name = my_group, benchmarks = bench_library);
fn main() {
main!(
config = LibraryBenchmarkConfig::default()
.tool(Dhat::default()
.entry_point(
EntryPoint::Custom("*::setup_worst_case_array".to_owned())
)
),
library_benchmark_groups = my_group
);
}
Sanitized DHAT Output Files
Gungraun rewrites DHAT output files by default (SanitizeOutput::Yes) to match
the configured Dhat::entry_point and additional Dhat::frames filters. This
keeps the metrics shown in dh_view.html aligned with the metrics Gungraun
reports in the terminal.
Use Dhat::sanitize_output(SanitizeOutput::No) to keep DHAT output files
unchanged, or SanitizeOutput::KeepOrig to write sanitized output while keeping
the original files with an .orig extension.
Usage on the Command-Line
Running DHAT instead of or in addition to Callgrind is straightforward and no different from any other tool:
Either use
command-line arguments or environment variables:
--default-tool=dhat or GUNGRAUN_DEFAULT_TOOL=dhat (replaces Callgrind as
default tool) or --tools=dhat or GUNGRAUN_TOOLS=dhat (runs DHAT in addition
to the default tool).
Usage in a Benchmark and a Small Example Analysis
Running DHAT in addition to Callgrind can also be carried out in the benchmark
itself with the Dhat struct in LibraryBenchmarkConfig::tool. We stick to the
example from above. The above benchmark will produce the following metrics:
lib_bench_dhat::my_group::bench_library worst_case_3:vec! [3, 2, 1]
======= CALLGRIND ====================================================================
Instructions: 83|N/A (*********)
L1 Hits: 110|N/A (*********)
LL Hits: 0|N/A (*********)
RAM Hits: 3|N/A (*********)
Total read+write: 113|N/A (*********)
Estimated Cycles: 215|N/A (*********)
======= DHAT =========================================================================
Total bytes: 12|N/A (*********)
Total blocks: 1|N/A (*********)
At t-gmax bytes: 0|N/A (*********)
At t-gmax blocks: 0|N/A (*********)
At t-end bytes: 0|N/A (*********)
At t-end blocks: 0|N/A (*********)
Reads bytes: 24|N/A (*********)
Writes bytes: 36|N/A (*********)
Gungraun result: Ok. 1 without regressions; 0 regressed; 0 filtered; 1 benchmarks finished in 0.55554s
Analyzing the DHAT data, there are a total of 12 bytes of allocations (The
vector: 3 * sizeof(i32) bytes = 3 * 4 bytes) in 1 block during the setup
of the benchmark in setup_worst_case_array. That’s also 12 bytes of writes
to fill the vector with the values. That makes 24 bytes of reads and 24
bytes of writes in the bubble_sort function. Also, there are no
(de-)allocations of heap memory in bubble_sort itself.
Soft Limits and Hard Limits
Based on that data, we could define for example hard limits (or soft limits or
both whatever you think is appropriate) to ensure bubble_sort is not getting
worse than that.
extern crate gungraun;
mod my_lib { pub fn bubble_sort(_: Vec<i32>) -> Vec<i32> { vec![] } }
use gungraun::prelude::*;
use gungraun::{Dhat, DhatMetric};
use std::hint::black_box;
#[library_benchmark]
#[bench::worst_case_3(
args = (vec![3, 2, 1]),
config = LibraryBenchmarkConfig::default()
.tool(Dhat::default()
.hard_limits([
(DhatMetric::ReadsBytes, 24),
(DhatMetric::WritesBytes, 32)
])
)
)]
fn bench_bubble_sort(array: Vec<i32>) -> Vec<i32> {
black_box(my_lib::bubble_sort(black_box(array)))
}
library_benchmark_group!(name = my_group, benchmarks = bench_bubble_sort);
fn main() {
main!(
config = LibraryBenchmarkConfig::default()
.tool(Dhat::default()),
library_benchmark_groups = my_group
);
}
Now, if bubble_sort would read more than 24 bytes or if there were more than
32 bytes of writes during the benchmark, the benchmark would fail and exit
with error.
Frames and Benchmarking Multi-Threaded Functions
It is possible to specify additional Dhat::frames for example when
benchmarking multi-threaded functions. Like in Callgrind, each thread/subprocess
in DHAT is treated as a separate unit and thus requires frames (the Gungraun
specific approximation of Callgrind toggles) in addition to the default entry
point to include the interesting ones in the measurements.
By example. Suppose there’s a function in the benchmark_tests library
find_primes_multi_thread(num_threads: usize) which searches for primes in the
range 0 - 10000 * num_threads. This multi-threaded function is splitting the
work for each 10000 numbers into a separate thread each calling the
single-threaded function benchmark_tests::find_primes which does the actual
work. The inner workings aren’t important but this description should be enough
to understand the basic idea.
extern crate gungraun;
mod benchmark_tests { pub fn find_primes_multi_thread (_: u64) -> Vec<u64> { vec![] } }
use std::hint::black_box;
use gungraun::prelude::*;
use gungraun::ValgrindTool;
#[library_benchmark(
config = LibraryBenchmarkConfig::default()
.default_tool(ValgrindTool::DHAT)
)]
fn bench_library() -> Vec<u64> {
black_box(benchmark_tests::find_primes_multi_thread(black_box(1)))
}
library_benchmark_group!(name = my_group, benchmarks = bench_library);
fn main() {
main!(library_benchmark_groups = my_group);
}
Running the benchmark produces the following output:
lib_bench_find_primes::my_group::bench_library
======= DHAT =========================================================================
Total bytes: 11464|N/A (*********)
Total blocks: 9|N/A (*********)
At t-gmax bytes: 10264|N/A (*********)
At t-gmax blocks: 4|N/A (*********)
At t-end bytes: 0|N/A (*********)
At t-end blocks: 0|N/A (*********)
Reads bytes: 776|N/A (*********)
Writes bytes: 10337|N/A (*********)
Gungraun result: Ok. 1 without regressions; 0 regressed; 0 filtered; 1 benchmarks finished in 0.55922s
The problem here is that the spawned thread is not included in the metrics. The
DHAT output files shown in dh_view.html also do not show any threads because
DHAT output files are sanitized by default. The output in this example is
shortened to save space:
Invocation {
Mode: heap
Command: /fast/lenny/workspace/programming/gungraun/gungraun/target/release/deps/lib_bench_find_primes-f6831ae18d944771 --gungraun-run 00000 00000 00000
PID: 1971323
}
Times {
t-gmax: 2,942,796 instrs (99.58% of program duration)
t-end: 2,955,202 instrs
}
▼ PP 1/1 (8 children) {
Total: 11,464 bytes (100%, 3,879.26/Minstr) in 9 blocks (100%, 3.05/Minstr), avg size 1,273.78 bytes, avg lifetime 1,657,484.56 instrs (56.09% of program duration)
At t-gmax: 10,264 bytes (100%) in 4 blocks (100%), avg size 2,566 bytes
At t-end: 0 bytes (0%) in 0 blocks (0%), avg size 0 bytes
Reads: 776 bytes (100%, 262.59/Minstr), 0.07/byte
Writes: 10,337 bytes (100%, 3,497.9/Minstr), 0.9/byte
Allocated at {
#0: [root]
}
}
├─▼ PP 1.1/8 (2 children) {
│ Total: 9,928 bytes (86.6%, 3,359.5/Minstr) in 2 blocks (22.22%, 0.68/Minstr), avg size 4,964 bytes, avg lifetime 1,244,629 instrs (42.12% of program duration)
│ At t-gmax: 9,928 bytes (96.73%) in 2 blocks (50%), avg size 4,964 bytes
│ At t-end: 0 bytes (0%) in 0 blocks (0%), avg size 0 bytes
│ Reads: 24 bytes (3.09%, 8.12/Minstr), 0/byte
│ Writes: 9,856 bytes (95.35%, 3,335.14/Minstr), 0.99/byte
│ Allocated at {
│ #1: 0x40522FC: alloc (alloc.rs:95)
│ #2: 0x40522FC: alloc_impl_runtime (alloc.rs:190)
│ #3: 0x40522FC: alloc_impl (alloc.rs:312)
│ #4: 0x40522FC: allocate (alloc.rs:429)
│ #5: 0x40522FC: alloc::raw_vec::RawVecInner<A>::finish_grow (mod.rs:558)
│ }
│ }
│ ├── PP 1.1.1/2 {
│ │ Total: 9,832 bytes (85.76%, 3,327.01/Minstr) in 1 blocks (11.11%, 0.34/Minstr), avg size 9,832 bytes, avg lifetime 6,263 instrs (0.21% of program duration)
│ │ Max: 9,832 bytes in 1 blocks, avg size 9,832 bytes
│ │ At t-gmax: 9,832 bytes (95.79%) in 1 blocks (25%), avg size 9,832 bytes
│ │ At t-end: 0 bytes (0%) in 0 blocks (0%), avg size 0 bytes
│ │ Reads: 0 bytes (0%, 0/Minstr), 0/byte
│ │ Writes: 9,832 bytes (95.11%, 3,327.01/Minstr), 1/byte
│ │ Allocated at {
│ │ ^1: 0x40522FC: alloc (alloc.rs:95)
│ │ ^2: 0x40522FC: alloc_impl_runtime (alloc.rs:190)
│ │ ^3: 0x40522FC: alloc_impl (alloc.rs:312)
│ │ ^4: 0x40522FC: allocate (alloc.rs:429)
│ │ ^5: 0x40522FC: alloc::raw_vec::RawVecInner<A>::finish_grow (mod.rs:558)
│ │ #6: 0x4052393: grow_amortized<alloc::alloc::Global> (mod.rs:527)
│ │ #7: 0x4052393: alloc::raw_vec::RawVecInner<A>::reserve::do_reserve_and_handle (mod.rs:666)
│ │ #8: 0x4050284: reserve<alloc::alloc::Global> (mod.rs:673)
│ │ #9: 0x4050284: reserve<u64, alloc::alloc::Global> (mod.rs:340)
│ │ #10: 0x4050284: reserve<u64, alloc::alloc::Global> (mod.rs:1446)
│ │ #11: 0x4050284: append_elements<u64, alloc::alloc::Global> (mod.rs:2879)
│ │ #12: 0x4050284: spec_extend<u64, alloc::alloc::Global, alloc::alloc::Global> (spec_extend.rs:34)
│ │ #13: 0x4050284: extend<u64, alloc::alloc::Global, alloc::vec::Vec<u64, alloc::alloc::Global>> (mod.rs:3933)
│ │ #14: 0x4050284: benchmark_tests::find_primes_multi_thread (lib.rs:49)
│ │ #15: 0x404D140: lib_bench_find_primes::bench_library::__gungraun_wrapper_mod::bench_library (lib_bench_find_primes.rs:16)
│ │ #16: 0x404D158: lib_bench_find_primes::bench_library::__gungraun_wrapper_id_mod::wrapper (lib_bench_find_primes.rs:15)
│ │ #17: 0x404D0F1: lib_bench_find_primes::bench_library::__run_wrapper (lib_bench_find_primes.rs:6)
│ │ #18: 0x404E331: lib_bench_find_primes::main (macros.rs:588)
│ │ #19: 0x404D1F2: call_once<fn(), ()> (function.rs:250)
│ │ #20: 0x404D1F2: std::sys::backtrace::__rust_begin_short_backtrace (backtrace.rs:166)
│ │ #21: 0x404D1E8: std::rt::lang_start::{{closure}} (rt.rs:206)
│ │ #22: 0x4085263: call_once<(), (dyn core::ops::function::Fn<(), Output=i32> + core::marker::Sync + core::panic::unwind_safe::RefUnwindSafe)> (function.rs:287)
│ │ #23: 0x4085263: do_call<&(dyn core::ops::function::Fn<(), Output=i32> + core::marker::Sync + core::panic::unwind_safe::RefUnwindSafe), i32> (panicking.rs:581)
│ │ #24: 0x4085263: catch_unwind<i32, &(dyn core::ops::function::Fn<(), Output=i32> + core::marker::Sync + core::panic::unwind_safe::RefUnwindSafe)> (panicking.rs:544)
│ │ #25: 0x4085263: catch_unwind<&(dyn core::ops::function::Fn<(), Output=i32> + core::marker::Sync + core::panic::unwind_safe::RefUnwindSafe), i32> (panic.rs:359)
│ │ #26: 0x4085263: {closure#0} (rt.rs:175)
│ │ #27: 0x4085263: do_call<std::rt::lang_start_internal::{closure_env#0}, isize> (panicking.rs:581)
│ │ #28: 0x4085263: catch_unwind<isize, std::rt::lang_start_internal::{closure_env#0}> (panicking.rs:544)
│ │ #29: 0x4085263: catch_unwind<std::rt::lang_start_internal::{closure_env#0}, isize> (panic.rs:359)
│ │ #30: 0x4085263: std::rt::lang_start_internal (rt.rs:171)
│ │ #31: 0x404F7FB: main (in /fast/lenny/workspace/programming/gungraun/gungraun/target/release/deps/lib_bench_find_primes-f6831ae18d944771)
│ │ }
│ │ }
...
To actually see all program points that DHAT records you need to either run without sanitization or keep the original files and inspect those. We’re going for the latter, which conveniently lets us inspect the sanitized and original output files.
extern crate gungraun;
mod benchmark_tests { pub fn find_primes_multi_thread (_: u64) -> Vec<u64> { vec![] } }
use std::hint::black_box;
use gungraun::prelude::*;
use gungraun::{Dhat, ValgrindTool, SanitizeOutput};
#[library_benchmark(
config = LibraryBenchmarkConfig::default()
.default_tool(ValgrindTool::DHAT)
.tool(Dhat::default().sanitize_output(SanitizeOutput::KeepOrig))
)]
fn bench_library() -> Vec<u64> {
black_box(benchmark_tests::find_primes_multi_thread(black_box(1)))
}
fn main() {}
After running the benchmark again, the dhat.*.out.orig file (also shortened)
includes the metrics of the thread:
Invocation {
Mode: heap
Command: /fast/lenny/workspace/programming/gungraun/gungraun/target/release/deps/lib_bench_find_primes-f6831ae18d944771 --gungraun-run 00000 00000 00000
PID: 2007925
}
Times {
t-gmax: 2,940,202 instrs (99.58% of program duration)
t-end: 2,952,524 instrs
}
▼ PP 1/1 (19 children) {
Total: 47,327 bytes (100%, 16,029.34/Minstr) in 38 blocks (100%, 12.87/Minstr), avg size 1,245.45 bytes, avg lifetime 863,000.68 instrs (29.23% of program duration)
At t-gmax: 27,326 bytes (100%) in 10 blocks (100%), avg size 2,732.6 bytes
At t-end: 544 bytes (100%) in 1 blocks (100%), avg size 544 bytes
Reads: 45,739 bytes (100%, 15,491.49/Minstr), 0.97/byte
Writes: 48,163 bytes (100%, 16,312.48/Minstr), 1.02/byte
Allocated at {
#0: [root]
}
}
├── PP 1.1/19 {
│ Total: 32,736 bytes (69.17%, 11,087.46/Minstr) in 10 blocks (26.32%, 3.39/Minstr), avg size 3,273.6 bytes, avg lifetime 248,000.9 instrs (8.4% of program duration)
│ Max: 16,384 bytes in 1 blocks, avg size 16,384 bytes
│ At t-gmax: 16,384 bytes (59.96%) in 1 blocks (10%), avg size 16,384 bytes
│ At t-end: 0 bytes (0%) in 0 blocks (0%), avg size 0 bytes
│ Reads: 26,184 bytes (57.25%, 8,868.34/Minstr), 0.8/byte
│ Writes: 26,184 bytes (54.37%, 8,868.34/Minstr), 0.8/byte
│ Allocated at {
│ #1: 0x4052446: alloc (alloc.rs:95)
│ #2: 0x4052446: alloc_impl_runtime (alloc.rs:190)
│ #3: 0x4052446: alloc_impl (alloc.rs:312)
│ #4: 0x4052446: allocate (alloc.rs:429)
│ #5: 0x4052446: try_allocate_in<alloc::alloc::Global> (mod.rs:464)
│ #6: 0x4052446: with_capacity_in<alloc::alloc::Global> (mod.rs:433)
│ #7: 0x4052446: with_capacity_in<u64, alloc::alloc::Global> (mod.rs:177)
│ #8: 0x4052446: with_capacity_in<u64, alloc::alloc::Global> (mod.rs:965)
│ #9: 0x4052446: with_capacity<u64> (mod.rs:524)
│ #10: 0x4052446: <alloc::vec::Vec<T> as alloc::vec::spec_from_iter_nested::SpecFromIterNested<T,I>>::from_iter (spec_from_iter_nested.rs:30)
│ #11: 0x40507F0: from_iter<u64, core::iter::adapters::filter::Filter<core::ops::range::RangeInclusive<u64>, benchmark_tests::find_primes::{closure_env#0}>> (spec_from_iter.rs:33)
│ #12: 0x40507F0: from_iter<u64, core::iter::adapters::filter::Filter<core::ops::range::RangeInclusive<u64>, benchmark_tests::find_primes::{closure_env#0}>> (mod.rs:3865)
│ #13: 0x40507F0: collect<core::iter::adapters::filter::Filter<core::ops::range::RangeInclusive<u64>, benchmark_tests::find_primes::{closure_env#0}>, alloc::vec::Vec<u64, alloc::alloc::Global>> (iterator.rs:2064)
│ #14: 0x40507F0: benchmark_tests::find_primes (lib.rs:33)
│ #15: 0x4052BD9: {closure#0} (lib.rs:98)
│ #16: 0x4052BD9: std::sys::backtrace::__rust_begin_short_backtrace (backtrace.rs:166)
│ #17: 0x4051568: {closure#0}<benchmark_tests::find_primes_multi_thread::{closure_env#1}, alloc::vec::Vec<u64, alloc::alloc::Global>> (lifecycle.rs:91)
│ #18: 0x4051568: call_once<alloc::vec::Vec<u64, alloc::alloc::Global>, std::thread::lifecycle::spawn_unchecked::{closure#1}::{closure_env#0}<benchmark_tests::find_primes_multi_thread::{closure_env#1}, alloc::vec::Vec<u64, alloc::alloc::Global>>> (unwind_safe.rs:274)
│ #19: 0x4051568: do_call<core::panic::unwind_safe::AssertUnwindSafe<std::thread::lifecycle::spawn_unchecked::{closure#1}::{closure_env#0}<benchmark_tests::find_primes_multi_thread::{closure_env#1}, alloc::vec::Vec<u64, alloc::alloc::Global>>>, alloc::vec::Vec<u64, alloc::alloc::Global>> (panicking.rs:581)
│ #20: 0x4051568: catch_unwind<alloc::vec::Vec<u64, alloc::alloc::Global>, core::panic::unwind_safe::AssertUnwindSafe<std::thread::lifecycle::spawn_unchecked::{closure#1}::{closure_env#0}<benchmark_tests::find_primes_multi_thread::{closure_env#1}, alloc::vec::Vec<u64, alloc::alloc::Global>>>> (panicking.rs:544)
│ #21: 0x4051568: catch_unwind<core::panic::unwind_safe::AssertUnwindSafe<std::thread::lifecycle::spawn_unchecked::{closure#1}::{closure_env#0}<benchmark_tests::find_primes_multi_thread::{closure_env#1}, alloc::vec::Vec<u64, alloc::alloc::Global>>>, alloc::vec::Vec<u64, alloc::alloc::Global>> (panic.rs:359)
│ #22: 0x4051568: {closure#1}<benchmark_tests::find_primes_multi_thread::{closure_env#1}, alloc::vec::Vec<u64, alloc::alloc::Global>> (lifecycle.rs:89)
│ #23: 0x4051568: core::ops::function::FnOnce::call_once{{vtable.shim}} (function.rs:250)
│ #24: 0x408D0FE: call_once<(), (dyn core::ops::function::FnOnce<(), Output=()> + core::marker::Send), alloc::alloc::Global> (boxed.rs:2240)
│ #25: 0x408D0FE: <std::sys::thread::unix::Thread>::new::thread_start (unix.rs:118)
│ #26: 0x49F81B8: ??? (in /usr/lib/libc.so.6)
│ #27: 0x4A7D043: clone (in /usr/lib/libc.so.6)
│ }
│ }
...
The missing metrics of the thread are caused by the default entry point which
only includes the program points with the benchmark function in their call
stack. But, looking closely at the program point PP 1.1.1/12 and the call
stack, there’s no frame of the benchmark function bench_library or a main
function. As mentioned earlier, this is because the thread is completely
separated by DHAT.
There are multiple ways to go on depending on what we want to measure. To show
two different approaches, at first, I’ll go with measuring the benchmark
function with the function spawning the threads (the default entry point which
doesn’t have to be specified) and additionally all threads which execute the
benchmark_tests::find_primes function.
extern crate gungraun;
mod benchmark_tests { pub fn find_primes_multi_thread (_: u64) -> Vec<u64> { vec![] } }
use std::hint::black_box;
use gungraun::prelude::*;
use gungraun::{Dhat, ValgrindTool};
#[library_benchmark(
config = LibraryBenchmarkConfig::default()
.default_tool(ValgrindTool::DHAT)
.tool(Dhat::default()
.frames(["benchmark_tests::find_primes"])
)
)]
fn bench_library() -> Vec<u64> {
black_box(benchmark_tests::find_primes_multi_thread(black_box(1)))
}
library_benchmark_group!(name = my_group, benchmarks = bench_library);
fn main() {
main!(library_benchmark_groups = my_group);
}
Now, the metrics include the spawned thread(s):
lib_bench_find_primes::my_group::bench_library
======= DHAT =========================================================================
Total bytes: 44208|N/A (No change)
Total blocks: 19|N/A (No change)
At t-gmax bytes: 26648|N/A (No change)
At t-gmax blocks: 5|N/A (No change)
At t-end bytes: 0|N/A (No change)
At t-end blocks: 0|N/A (No change)
Reads bytes: 26992|N/A (No change)
Writes bytes: 36529|N/A (No change)
Gungraun result: Ok. 1 without regressions; 0 regressed; 0 filtered; 1 benchmarks finished in 0.48695s
If we were only interested in the threads themselves, then using
EntryPoint::Custom would be one way to do it. Setting a custom entry point is
syntactic sugar for disabling the entry point with EntryPoint::None and
specifying a frame with Dhat::frames:
extern crate gungraun;
mod benchmark_tests { pub fn find_primes_multi_thread (_: u64) -> Vec<u64> { vec![] } }
use std::hint::black_box;
use gungraun::prelude::*;
use gungraun::{Dhat, EntryPoint, ValgrindTool};
#[library_benchmark(
config = LibraryBenchmarkConfig::default()
.default_tool(ValgrindTool::DHAT)
.tool(Dhat::default()
.entry_point(
EntryPoint::Custom("benchmark_tests::find_primes".to_owned())
)
)
)]
fn bench_library() -> Vec<u64> {
black_box(benchmark_tests::find_primes_multi_thread(black_box(1)))
}
library_benchmark_group!(name = my_group, benchmarks = bench_library);
fn main() {
main!(library_benchmark_groups = my_group);
}
Running this benchmark results in:
lib_bench_find_primes::my_group::bench_library
======= DHAT =========================================================================
Total bytes: 32736|N/A (*********)
Total blocks: 10|N/A (*********)
At t-gmax bytes: 16384|N/A (*********)
At t-gmax blocks: 1|N/A (*********)
At t-end bytes: 0|N/A (*********)
At t-end blocks: 0|N/A (*********)
Reads bytes: 26184|N/A (*********)
Writes bytes: 26184|N/A (*********)
Gungraun result: Ok. 1 without regressions; 0 regressed; 0 filtered; 1 benchmarks finished in 0.45178s
Compare these numbers with the thread program point from the original output shown above. The sanitized output file then contains only that program point:
Invocation {
Mode: heap
Command: /fast/lenny/workspace/programming/gungraun/gungraun/target/release/deps/lib_bench_find_primes-f6831ae18d944771 --gungraun-run 00000 00000 00000
PID: 1941428
}
Times {
t-gmax: 2,940,191 instrs (99.58% of program duration)
t-end: 2,952,454 instrs
}
─ PP 1/1 {
Total: 32,736 bytes (100%, 11,087.73/Minstr) in 10 blocks (100%, 3.39/Minstr), avg size 3,273.6 bytes, avg lifetime 248,000.9 instrs (8.4% of program duration)
At t-gmax: 16,384 bytes (100%) in 1 blocks (100%), avg size 16,384 bytes
At t-end: 0 bytes (0%) in 0 blocks (0%), avg size 0 bytes
Reads: 26,184 bytes (100%, 8,868.55/Minstr), 0.8/byte
Writes: 26,184 bytes (100%, 8,868.55/Minstr), 0.8/byte
Allocated at {
#0: [root]
#1: 0x4052536: alloc (alloc.rs:95)
#2: 0x4052536: alloc_impl_runtime (alloc.rs:190)
#3: 0x4052536: alloc_impl (alloc.rs:312)
#4: 0x4052536: allocate (alloc.rs:429)
#5: 0x4052536: try_allocate_in<alloc::alloc::Global> (mod.rs:464)
#6: 0x4052536: with_capacity_in<alloc::alloc::Global> (mod.rs:433)
#7: 0x4052536: with_capacity_in<u64, alloc::alloc::Global> (mod.rs:177)
#8: 0x4052536: with_capacity_in<u64, alloc::alloc::Global> (mod.rs:965)
#9: 0x4052536: with_capacity<u64> (mod.rs:524)
#10: 0x4052536: <alloc::vec::Vec<T> as alloc::vec::spec_from_iter_nested::SpecFromIterNested<T,I>>::from_iter (spec_from_iter_nested.rs:30)
#11: 0x40508E0: from_iter<u64, core::iter::adapters::filter::Filter<core::ops::range::RangeInclusive<u64>, benchmark_tests::find_primes::{closure_env#0}>> (spec_from_iter.rs:33)
#12: 0x40508E0: from_iter<u64, core::iter::adapters::filter::Filter<core::ops::range::RangeInclusive<u64>, benchmark_tests::find_primes::{closure_env#0}>> (mod.rs:3865)
#13: 0x40508E0: collect<core::iter::adapters::filter::Filter<core::ops::range::RangeInclusive<u64>, benchmark_tests::find_primes::{closure_env#0}>, alloc::vec::Vec<u64, alloc::alloc::Global>> (iterator.rs:2064)
#14: 0x40508E0: benchmark_tests::find_primes (lib.rs:33)
#15: 0x4052CC9: {closure#0} (lib.rs:98)
#16: 0x4052CC9: std::sys::backtrace::__rust_begin_short_backtrace (backtrace.rs:166)
#17: 0x4051658: {closure#0}<benchmark_tests::find_primes_multi_thread::{closure_env#1}, alloc::vec::Vec<u64, alloc::alloc::Global>> (lifecycle.rs:91)
#18: 0x4051658: call_once<alloc::vec::Vec<u64, alloc::alloc::Global>, std::thread::lifecycle::spawn_unchecked::{closure#1}::{closure_env#0}<benchmark_tests::find_primes_multi_thread::{closure_env#1}, alloc::vec::Vec<u64, alloc::alloc::Global>>> (unwind_safe.rs:274)
#19: 0x4051658: do_call<core::panic::unwind_safe::AssertUnwindSafe<std::thread::lifecycle::spawn_unchecked::{closure#1}::{closure_env#0}<benchmark_tests::find_primes_multi_thread::{closure_env#1}, alloc::vec::Vec<u64, alloc::alloc::Global>>>, alloc::vec::Vec<u64, alloc::alloc::Global>> (panicking.rs:581)
#20: 0x4051658: catch_unwind<alloc::vec::Vec<u64, alloc::alloc::Global>, core::panic::unwind_safe::AssertUnwindSafe<std::thread::lifecycle::spawn_unchecked::{closure#1}::{closure_env#0}<benchmark_tests::find_primes_multi_thread::{closure_env#1}, alloc::vec::Vec<u64, alloc::alloc::Global>>>> (panicking.rs:544)
#21: 0x4051658: catch_unwind<core::panic::unwind_safe::AssertUnwindSafe<std::thread::lifecycle::spawn_unchecked::{closure#1}::{closure_env#0}<benchmark_tests::find_primes_multi_thread::{closure_env#1}, alloc::vec::Vec<u64, alloc::alloc::Global>>>, alloc::vec::Vec<u64, alloc::alloc::Global>> (panic.rs:359)
#22: 0x4051658: {closure#1}<benchmark_tests::find_primes_multi_thread::{closure_env#1}, alloc::vec::Vec<u64, alloc::alloc::Global>> (lifecycle.rs:89)
#23: 0x4051658: core::ops::function::FnOnce::call_once{{vtable.shim}} (function.rs:250)
#24: 0x408D24E: call_once<(), (dyn core::ops::function::FnOnce<(), Output=()> + core::marker::Send), alloc::alloc::Global> (boxed.rs:2240)
#25: 0x408D24E: <std::sys::thread::unix::Thread>::new::thread_start (unix.rs:118)
#26: 0x49F81B8: ??? (in /usr/lib/libc.so.6)
#27: 0x4A7D043: clone (in /usr/lib/libc.so.6)
}
}
PP significance threshold: total >= 0.1 blocks (1%)