Tuesday, March 28, 2023

BPF & Unix timestamp

Recently I've been using bpftrace to trace some events across servers and it would be really useful if bpftrace allowed to use actual unix timestamps. The nsecs variable won't work as it represents time since boot, while time() and strftime() return strings. 

In this specific case, an application running on server A was injecting timestamps into packets being sent to an app on server B where I run tracing. This allowed for tracing time it takes to send a packet over network between apps. Unfortunately bpftrace did not make it easy to do so...

This only makes sense if time across servers is synchronised with accuracy much better that the time deltas you need to measure. In our case, servers use PTP and are generally synchronised to GM <<100ns, while the time differences measured here were in many microseconds.

I'm not the only one asking for actual timestamps in bpf. For example, see here and here. The latter is a discussion about implementing a bpf function to provide epoch timestamp. Unfortunately, there was some resistance in doing so (I don't agree with the arguments), but at least there was a workaround suggested. It is unnecessarily complicated and clanky, but works.

> Not sure what problem you're trying to solve and thus what exactly you
> need... but you can probably get something very very close with 1 or 2
> of clock_gettime(CLOCK_{BOOTTIME,MONOTONIC,REALTIME}) possibly in a
> triple vdso call sandwich and iterated a few times and picking the one
> with smallest delta between 1st and 3rd calls. And then assuming the
> avg of 1st and 3rd matches the 2nd.
> ie.
>
> 5 times do:
> t1[i] = clock_gettime(REALTIME)
> t2[i] = clock_gettime(BOOTTIME)
> t3[i] = clock_gettime(REALTIME)
>
> pick i so t3[i] - t1[i] is smallest
>
> t2[i] is near equivalent to (t1[i] + t3[i]) / 2
>
> which basically gives you a REAL to BOOT offset.
Let's quickly implement it:
#include <time.h>
#include <stdio.h>
#include <stdint.h>

#define ITERATIONS 5

int main(int argc, char **argv) {
  struct timespec ts1[ITERATIONS], ts2[ITERATIONS], ts3[ITERATIONS];
  uint64_t t1, t2, t3, t4, t5, smallest_dt = 0;
  int ret, i, smallest_dt_i;

  for (i = 0; i < ITERATIONS; i++) {
    ret = clock_gettime(CLOCK_REALTIME, &ts1[i]);
    ret = clock_gettime(CLOCK_BOOTTIME, &ts2[i]);
    ret = clock_gettime(CLOCK_REALTIME, &ts3[i]);
  }

  for (i = 0; i < ITERATIONS; i++) {
    t1 = ts1[i].tv_sec * (uint64_t)1000000000 + ts1[i].tv_nsec;
    t3 = ts3[i].tv_sec * (uint64_t)1000000000 + ts3[i].tv_nsec;
    printf("i: %d dt: %lu\n", i, t3-t1);
    if (!smallest_dt || ((t3 - t1) < smallest_dt)) {
      smallest_dt = t3 - t1;
      smallest_dt_i = i;
    }
  }

  t1 = ts1[smallest_dt_i].tv_sec * (uint64_t)1000000000 + ts1[smallest_dt_i].tv_nsec;
  t2 = ts2[smallest_dt_i].tv_sec * (uint64_t)1000000000 + ts2[smallest_dt_i].tv_nsec;
  t3 = ts3[smallest_dt_i].tv_sec * (uint64_t)1000000000 + ts3[smallest_dt_i].tv_nsec;
  t4 = (uint64_t)(t1+t3)/2;
  t5 = t4 - t2;

  printf("\n");
  printf("t5=t4-t2: %lu\n", t5);
}

Now on a system where you need actual timestamp in bpftrace you do: $t_now = (uint64)NNN + nsecs;
where NNN is the t5 timestamp reported for the system.

I wish one could get it directly in bpftrace, but we are where we are...