Thursday, January 12, 2023

bpftrace vs. strace

Many years ago I compared performance overhead of dtrace vs. truss on Solaris, in a microbenchmark. This time I run a similar test but comparing bpftrace vs. strace on Linux. To do so, I wrote a simple C program which stawns X threads and each thread will call stat("/tmp") N times in a loop. The program then prints the total time it took for all threads to execute.
While it is not necesarilly a very realistic test, it does show a potential overhead of tracing and difference between different technologies. Also sometimes you do need to trace a very tight loop which then might result in overheads like shown below (or even higher).

Let's run it three times (to see if we get consistent results) with 4 threads, each one calling stat() 100k times.
# for i in $(seq 1 3); do ./threads-loop-test-1 4 1000000; done
Total time: 0.982799
Total time: 0.977269
Total time: 0.980996
It took just below 1s to execute.
Now, let's run it under bpftrace which will count how many times the stat() was called by all threads.
# for i in $(seq 1 3); do bpftrace -qe 't:syscalls:sys_enter_newstat
                                        /pid==cpid/{@[probe]=count();}'
                                   -c "./threads-loop-test-1 4 1000000";
                       done

Total time: 1.192433

@[tracepoint:syscalls:sys_enter_newstat]: 4000019
Total time: 1.204695

@[tracepoint:syscalls:sys_enter_newstat]: 4000019
Total time: 1.143513

@[tracepoint:syscalls:sys_enter_newstat]: 4000019
There is roughly a 20% overhead - not bad.
Adding an extra condition to the predicate str(args->filename)=="/tmp" has little impact - resulting in total times <1.24s.
Again, not bad, especially given that string comparison like this is rather expensive.

Now time for strace.
# for i in $(seq 1 3); do strace -qfc -e trace=stat ./threads-loop-test-1 4 1000000; done
Total time: 49.478656
% time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ----------------
100.00   87.720184          21   4000019        18 stat
------ ----------- ----------- --------- --------- ----------------
100.00   87.720184          21   4000019        18 total

Total time: 49.336942
% time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ----------------
100.00   87.463778          21   4000019        18 stat
------ ----------- ----------- --------- --------- ----------------
100.00   87.463778          21   4000019        18 total

Total time: 49.250562
% time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ----------------
100.00   87.923629          21   4000019        18 stat
------ ----------- ----------- --------- --------- ----------------
100.00   87.923629          21   4000019        18 total
It took about 50 times longer to execute!

While there have been many improvements to strace to reduce its impact, it is still significant in some cases.
It doesn't mean that strace is a bad tool and you should avoid it - in fact, it is often more handy and quicker to use than bpftrace or systemtap.
However be mindful of its potentially much higher overhead, especially in tight loops.


The source code for the test program.
#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/stat.h>
#include <time.h>

void *thread_handler(void *arg) {
    int N = (long)arg;
    struct stat file_stat;
    while (N--) {
        stat("/tmp", &file_stat);
    }
    pthread_exit(NULL);
}

int main(int argc, char *argv[]) {
    struct timespec start, end;
    int num_threads;
    long N;
    if (argc != 3) {
        printf("Usage: %s  \n", argv[0]);
        exit(-1);
    }
    num_threads = atoi(argv[1]);
    N = abs(atol(argv[2]));
    pthread_t threads[num_threads];
    int rc;
    long t;
    clock_gettime(CLOCK_MONOTONIC, &start);
    for(t=0; t<num_threads; t++){
        rc = pthread_create(&threads[t], NULL, thread_handler, (void *)N);
        if (rc){
            printf("ERROR; return code from pthread_create() is %d\n", rc);
            exit(-1);
        }
    }

 /* Wait for all threads to complete */
    for(t=0; t<num_threads; t++) {
        pthread_join(threads[t], NULL);
    }

    clock_gettime(CLOCK_MONOTONIC, &end);
    double time_taken = (end.tv_sec - start.tv_sec) + (double)(end.tv_nsec - start.tv_nsec) / (double)1000000000;
    printf("Total time: %lf\n", time_taken);
    
    return 0;
}