Wednesday, July 19, 2023

LLAMA 2 - hello world

 $ torchrun --nproc_per_node 1 test2_text_completion.py \
    --tokenizer_path tokenizer.model --max_seq_len 128 --max_batch_size 4 --ckpt_dir llama-2-7b/ \
    --prompt "Hello everyone, I'm LLAMA-2"

...

Loaded in 4.32 seconds

Hello everyone, I'm LLAMA-2 000. I'm a new member here. I'm from China. I'm a girl, I'm 16. I like animals, especially lamas. I like to play games, especially the games on the internet. I like to make friends. I hope I can make

Tuesday, June 20, 2023

Azure/DALLE-2

The latest version (0.27.8) of the pypi/openai library now supports Azure/DALLE-2. To ensure compatibility, set openai.api_version = '2023-06-01-preview'.
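For example, generating an image against an Azure OpenAI resource with that SDK version looks roughly like the sketch below. This is a minimal sketch: the resource URL and key are placeholders, and the api_type/api_base setup is the usual Azure configuration for the pre-1.0 openai SDK.

import openai

# Azure OpenAI configuration for the pre-1.0 openai SDK
openai.api_type = "azure"
openai.api_base = "https://YOUR-RESOURCE.openai.azure.com/"  # placeholder
openai.api_version = "2023-06-01-preview"
openai.api_key = "YOUR-AZURE-OPENAI-KEY"                     # placeholder

# Image.create is the SDK's DALL-E image-generation call
response = openai.Image.create(prompt="a llama writing a blog post", n=1, size="1024x1024")
print(response["data"][0]["url"])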

Tuesday, March 28, 2023

BPF & Unix timestamp

Recently I've been using bpftrace to trace some events across servers, and it would be really useful if bpftrace exposed actual Unix timestamps. The nsecs builtin won't work, as it represents time since boot, while time() and strftime() return formatted strings.

In this specific case, an application running on server A was injecting timestamps into packets sent to an app on server B, where I ran tracing. This allowed measuring the time it takes to send a packet over the network between the apps. Unfortunately, bpftrace did not make it easy to do so...

This only makes sense if time across servers is synchronised with accuracy much better than the time deltas you need to measure. In our case, the servers use PTP and are generally synchronised to the grandmaster clock to well under 100ns, while the time differences measured here were many microseconds - so the clock error contributes at most about 1% to a 10µs measurement.

I'm not the only one asking for actual timestamps in BPF. For example, see here and here. The latter is a discussion about implementing a BPF function to provide epoch timestamps. Unfortunately, there was some resistance to doing so (I don't agree with the arguments), but at least a workaround was suggested. It is unnecessarily complicated and clunky, but it works.

> Not sure what problem you're trying to solve and thus what exactly you
> need... but you can probably get something very very close with 1 or 2
> of clock_gettime(CLOCK_{BOOTTIME,MONOTONIC,REALTIME}) possibly in a
> triple vdso call sandwich and iterated a few times and picking the one
> with smallest delta between 1st and 3rd calls. And then assuming the
> avg of 1st and 3rd matches the 2nd.
> ie.
>
> 5 times do:
> t1[i] = clock_gettime(REALTIME)
> t2[i] = clock_gettime(BOOTTIME)
> t3[i] = clock_gettime(REALTIME)
>
> pick i so t3[i] - t1[i] is smallest
>
> t2[i] is near equivalent to (t1[i] + t3[i]) / 2
>
> which basically gives you a REAL to BOOT offset.
Let's quickly implement it:
#include <time.h>
#include <stdio.h>
#include <stdint.h>
#include <inttypes.h>

#define ITERATIONS 5
#define NSEC_PER_SEC 1000000000ULL

static uint64_t ts_to_ns(const struct timespec *ts) {
  return (uint64_t)ts->tv_sec * NSEC_PER_SEC + (uint64_t)ts->tv_nsec;
}

int main(void) {
  struct timespec ts1[ITERATIONS], ts2[ITERATIONS], ts3[ITERATIONS];
  uint64_t t1, t2, t3, t4, t5, smallest_dt = 0;
  int i, smallest_dt_i = 0;

  /* REALTIME / BOOTTIME / REALTIME sandwich, repeated a few times */
  for (i = 0; i < ITERATIONS; i++) {
    clock_gettime(CLOCK_REALTIME, &ts1[i]);
    clock_gettime(CLOCK_BOOTTIME, &ts2[i]);
    clock_gettime(CLOCK_REALTIME, &ts3[i]);
  }

  /* pick the iteration with the smallest delta between the two REALTIME reads */
  for (i = 0; i < ITERATIONS; i++) {
    t1 = ts_to_ns(&ts1[i]);
    t3 = ts_to_ns(&ts3[i]);
    printf("i: %d dt: %" PRIu64 "\n", i, t3 - t1);
    if (!smallest_dt || (t3 - t1) < smallest_dt) {
      smallest_dt = t3 - t1;
      smallest_dt_i = i;
    }
  }

  t1 = ts_to_ns(&ts1[smallest_dt_i]);
  t2 = ts_to_ns(&ts2[smallest_dt_i]);
  t3 = ts_to_ns(&ts3[smallest_dt_i]);
  t4 = (t1 + t3) / 2; /* assume BOOTTIME was read halfway between the REALTIME reads */
  t5 = t4 - t2;       /* the REALTIME-to-BOOTTIME offset */

  printf("\n");
  printf("t5=t4-t2: %" PRIu64 "\n", t5);
  return 0;
}

Now, on a system where you need an actual timestamp in bpftrace, you do:

$t_now = (uint64)NNN + nsecs;

where NNN is the t5 offset reported for that system.
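By the way, if compiling C feels like overkill for a one-off measurement, the same sandwich trick is a few lines of Python - a sketch assuming Linux and Python 3.7+, which expose time.CLOCK_BOOTTIME:

import time

# Same trick: pick the REALTIME/BOOTTIME/REALTIME sandwich with the
# smallest delta between the two REALTIME reads.
best_dt = offset = None
for _ in range(5):
    t1 = time.clock_gettime_ns(time.CLOCK_REALTIME)
    t2 = time.clock_gettime_ns(time.CLOCK_BOOTTIME)
    t3 = time.clock_gettime_ns(time.CLOCK_REALTIME)
    if best_dt is None or t3 - t1 < best_dt:
        best_dt, offset = t3 - t1, (t1 + t3) // 2 - t2

print("REALTIME-BOOTTIME offset (ns):", offset)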

I wish one could get it directly in bpftrace, but we are where we are...

Tuesday, February 21, 2023

Siri + ChatGPT

Although I appreciate Siri's ability to automate my home, there are times when it can be frustrating, such as when it replies with the annoying message "I found some web results, I sent them to your iPhone." Let's make Siri much more useful and fun by integrating it with OpenAI/ChatGPT via the Shortcuts app. It even works with HomePod and Apple Watch.

To get started, we need to define a variable with our OpenAI token, and then use the Dictate Text action to convert speech to text and store the result in another variable.
Next, let's set another variable with the parameters for the OpenAI query.
Now, let's issue a REST call (roughly equivalent to the Python sketch below).
And finally, let's extract the reply and pass it back to Siri to read aloud.
Now you can say "Hey Siri, 2001" (I named the shortcut "2001") and it will ask you for a prompt; once you provide one, a moment later it will read the OpenAI/ChatGPT reply back to you, making Siri really useful and fun to engage with. It even works in CarPlay mode while driving.
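For reference, here is what that REST call looks like outside of Shortcuts - a minimal Python sketch; the model, max_tokens, and temperature values are my assumptions, so substitute whatever parameters you configured in the Shortcut:

import requests

OPENAI_TOKEN = "sk-..."  # placeholder for your OpenAI API token

def ask(prompt):
    # POST to the OpenAI completions endpoint, as the Shortcut's
    # "Get Contents of URL" action does
    resp = requests.post(
        "https://api.openai.com/v1/completions",
        headers={"Authorization": "Bearer " + OPENAI_TOKEN},
        json={
            "model": "text-davinci-003",  # assumed model
            "prompt": prompt,
            "max_tokens": 256,            # assumed limit
            "temperature": 0.7,           # assumed temperature
        },
    )
    return resp.json()["choices"][0]["text"].strip()

print(ask("Open the pod bay doors"))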

You can download the Shortcut as well, enjoy.

btw: I asked ChatGPT to improve this blog post; here is the result:

The given blog post can be improved by making some modifications and clarifications. Below is the improved version of the post:

Siri is an amazing tool that can automate various tasks and make life much easier. However, sometimes it can be frustrating when it responds with the message "I found some web results, I sent them to your iPhone." To make Siri even more useful and fun, we can integrate Siri/Shortcuts with OpenAI/ChatGPT, which can work with HomePod too.

To get started, we need to define a variable with our OpenAI token. Then, we can use the Dictate Text action to convert speech to text and store the result in another variable. This allows Siri to understand our speech and perform the necessary actions.

Next, we need to set another variable with the parameters for the OpenAI query. These parameters include the text we want to generate, the model we want to use, and the temperature of the output. Temperature refers to the randomness of the generated text. Higher temperatures will result in more creative and unexpected output, but lower temperatures will produce more accurate and predictable text.

Once we have set the necessary variables, we can issue a REST call to the OpenAI API. This will send our query to the OpenAI server and retrieve the generated text. We can then use the Text action to speak the generated text aloud or display it on the screen.

It's important to note that using OpenAI/ChatGPT with Siri requires some programming skills and knowledge of APIs. However, with some practice and experimentation, anyone can create their own Siri shortcuts and integrate them with OpenAI. With this powerful combination, we can create a wide range of useful and entertaining applications, from generating creative writing prompts to answering trivia questions.

Thursday, January 12, 2023

bpftrace vs. strace

Many years ago I compared the performance overhead of dtrace vs. truss on Solaris in a microbenchmark. This time I ran a similar test, comparing bpftrace vs. strace on Linux. To do so, I wrote a simple C program which spawns X threads, each calling stat("/tmp") N times in a loop. The program then prints the total time it took for all threads to execute.
While it is not necessarily a very realistic test, it does show the potential overhead of tracing and the difference between technologies. Also, sometimes you do need to trace a very tight loop, which can result in overheads like those shown below (or even higher).

Let's run it three times (to see if we get consistent results) with 4 threads, each one calling stat() one million times.
# for i in $(seq 1 3); do ./threads-loop-test-1 4 1000000; done
Total time: 0.982799
Total time: 0.977269
Total time: 0.980996
It took just below 1s to execute.
Now, let's run it under bpftrace, which will count how many times stat() was called by all threads.
# for i in $(seq 1 3); do bpftrace -qe 't:syscalls:sys_enter_newstat
                                        /pid==cpid/{@[probe]=count();}' \
                                   -c "./threads-loop-test-1 4 1000000";
                       done

Total time: 1.192433

@[tracepoint:syscalls:sys_enter_newstat]: 4000019
Total time: 1.204695

@[tracepoint:syscalls:sys_enter_newstat]: 4000019
Total time: 1.143513

@[tracepoint:syscalls:sys_enter_newstat]: 4000019
There is roughly a 20% overhead - not bad.
Adding an extra condition to the predicate, str(args->filename) == "/tmp", has little impact, resulting in total times under 1.24s.
Again, not bad, especially given that a string comparison like this is rather expensive.

Now time for strace.
# for i in $(seq 1 3); do strace -qfc -e trace=stat ./threads-loop-test-1 4 1000000; done
Total time: 49.478656
% time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ----------------
100.00   87.720184          21   4000019        18 stat
------ ----------- ----------- --------- --------- ----------------
100.00   87.720184          21   4000019        18 total

Total time: 49.336942
% time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ----------------
100.00   87.463778          21   4000019        18 stat
------ ----------- ----------- --------- --------- ----------------
100.00   87.463778          21   4000019        18 total

Total time: 49.250562
% time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ----------------
100.00   87.923629          21   4000019        18 stat
------ ----------- ----------- --------- --------- ----------------
100.00   87.923629          21   4000019        18 total
It took about 50 times longer to execute!

While there have been many improvements to strace to reduce its impact, the overhead is still significant in some cases.
That doesn't mean strace is a bad tool you should avoid - in fact, it is often handier and quicker to use than bpftrace or systemtap.
However, be mindful of its potentially much higher overhead, especially in tight loops.


The source code for the test program:
#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/stat.h>
#include <time.h>

void *thread_handler(void *arg) {
    long N = (long)arg;
    struct stat file_stat;
    while (N--) {
        stat("/tmp", &file_stat);
    }
    pthread_exit(NULL);
}

int main(int argc, char *argv[]) {
    struct timespec start, end;
    int num_threads;
    long N;
    if (argc != 3) {
        printf("Usage: %s  \n", argv[0]);
        exit(-1);
    }
    num_threads = atoi(argv[1]);
    N = labs(atol(argv[2]));
    pthread_t threads[num_threads];
    int rc;
    long t;
    clock_gettime(CLOCK_MONOTONIC, &start);
    for(t=0; t<num_threads; t++){
        rc = pthread_create(&threads[t], NULL, thread_handler, (void *)N);
        if (rc){
            printf("ERROR; return code from pthread_create() is %d\n", rc);
            exit(-1);
        }
    }

    /* Wait for all threads to complete */
    for(t=0; t<num_threads; t++) {
        pthread_join(threads[t], NULL);
    }

    clock_gettime(CLOCK_MONOTONIC, &end);
    double time_taken = (end.tv_sec - start.tv_sec) + (double)(end.tv_nsec - start.tv_nsec) / (double)1000000000;
    printf("Total time: %lf\n", time_taken);
    
    return 0;
}