Posts

  • Exporting Prometheus metrics from Go

    Exporting Prometheus metrics is quite straightforward, especially from a Go application - Prometheus itself is a Go project, after all - as long as you know the basics of the process. The first step is to understand that Prometheus is not just a monitoring system, but also a time series database. So in order to collect metrics with it, there are three components involved: an application exporting its metrics in the Prometheus format, a Prometheus scraper that grabs these metrics at pre-defined intervals, and a time series database that stores them for later consumption - usually Prometheus itself, but it’s possible to use other storage backends. The focus here is the first component, the metrics export process.

    The first decision is which type is most suitable for the metric to be exported. The Prometheus documentation gives a nice explanation of the four offered types (Counter, Gauge, Histogram and Summary). What’s important to understand is that each of them is basically a metric name (like job_queue_size), possibly associated with labels (like {type="email"}), that will have a numeric value associated with it (like 10). When scraped, these values are associated with the collection time, which makes it possible, for instance, to later plot them in a graph. Different types of metrics offer different facilities to collect the data.
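
    Before diving into the export code, it helps to see what declaring each type looks like. The sketch below uses the official Go client library (prometheus/client_golang); the metric names and the values being observed are made up purely for illustration:

    package main

    import "github.com/prometheus/client_golang/prometheus"

    var (
      // Counter: a value that only goes up, e.g. the number of processed jobs.
      jobsProcessed = prometheus.NewCounter(prometheus.CounterOpts{
        Name: "jobs_processed_total",
        Help: "Total number of processed jobs",
      })

      // Gauge: a value that can go up and down, e.g. the current queue size.
      jobQueueSize = prometheus.NewGauge(prometheus.GaugeOpts{
        Name: "job_queue_size",
        Help: "Current number of jobs in the queue",
      })

      // Histogram: counts observations into configurable buckets.
      jobDuration = prometheus.NewHistogram(prometheus.HistogramOpts{
        Name:    "job_duration_seconds",
        Help:    "Time spent processing a job",
        Buckets: prometheus.LinearBuckets(0.1, 0.1, 10),
      })

      // Summary: like a histogram, but exposes client-side calculated quantiles.
      jobSize = prometheus.NewSummary(prometheus.SummaryOpts{
        Name:       "job_size_bytes",
        Help:       "Size of the processed jobs",
        Objectives: map[float64]float64{0.5: 0.05, 0.9: 0.01, 0.99: 0.001},
      })
    )

    func main() {
      prometheus.MustRegister(jobsProcessed, jobQueueSize, jobDuration, jobSize)

      jobsProcessed.Inc()       // Counter: increment by one
      jobQueueSize.Set(10)      // Gauge: set to an arbitrary value
      jobDuration.Observe(0.42) // Histogram: record a single observation
      jobSize.Observe(2048)     // Summary: record a single observation
    }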

    Next, there’s a need to decide when metrics will be observed. The short answer is “synchronously, at collection time”. The application shouldn’t observe metrics in the background and serve the last collected values when scraped; the scrape request itself should trigger the metrics observation - it doesn’t matter if this process isn’t instant. The long answer is that it depends: when monitoring events, like HTTP requests or jobs processed in a queue, metrics will be observed at event time and later collected when scraped.
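
    To make the short answer concrete, here is a minimal sketch of a metric observed synchronously at collection time. It assumes the same official Go client library; prometheus.NewGaugeFunc calls the given function whenever /metrics is scraped, so no background collection loop is needed. The queueSize() helper is hypothetical:

    package main

    import (
      "log"
      "net/http"

      "github.com/prometheus/client_golang/prometheus"
      "github.com/prometheus/client_golang/prometheus/promhttp"
    )

    // queueSize stands in for whatever the application would query at scrape
    // time: a database count, the length of an in-memory queue, and so on.
    func queueSize() int { return 10 }

    func main() {
      // The function below runs synchronously on every scrape of /metrics.
      jobQueueSize := prometheus.NewGaugeFunc(
        prometheus.GaugeOpts{
          Name: "job_queue_size",
          Help: "Current number of jobs waiting in the queue",
        },
        func() float64 { return float64(queueSize()) },
      )
      prometheus.MustRegister(jobQueueSize)

      http.Handle("/metrics", promhttp.Handler())
      log.Fatal(http.ListenAndServe(":8080", nil))
    }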

    The following example will illustrate how metrics can be observed at event time:

    package main
    
    import (
      "io"
      "log"
      "net/http"
    
      "github.com/gorilla/mux"
      "github.com/prometheus/client_golang/prometheus"
      "github.com/prometheus/client_golang/prometheus/promhttp"
    )
    
    var httpRequestsTotal = prometheus.NewCounter(
      prometheus.CounterOpts{
        Name:        "http_requests_total",
        Help:        "Total number of HTTP requests",
        ConstLabels: prometheus.Labels{"server": "api"},
      },
    )
    
    func HealthCheck(w http.ResponseWriter, r *http.Request) {
      httpRequestsTotal.Inc()
      w.WriteHeader(http.StatusOK)
      io.WriteString(w, "OK")
    }
    
    func main() {
      prometheus.MustRegister(httpRequestsTotal)
    
      r := mux.NewRouter()
      r.HandleFunc("/healthcheck", HealthCheck)
      r.Handle("/metrics", promhttp.Handler())
    
      addr := ":8080"
      srv := &http.Server{
        Addr:    addr,
        Handler: r,
      }
      log.Print("Starting server at ", addr)
      log.Fatal(srv.ListenAndServe())
    }
    

    There’s a single Counter metric called http_requests_total (the “total” suffix is a naming convention) with a constant label {server="api"}. The HealthCheck() HTTP handler itself calls the Inc() method responsible for incrementing this counter, but in a real-life application that would preferably be done in an HTTP middleware. It’s important not to forget to register the metrics variable with the prometheus library itself, otherwise it won’t show up in the collection.

    Let’s see how they work using the xh HTTPie Rust clone:

    $ xh localhost:8080/metrics | grep http_requests_total
    # HELP http_requests_total Total number of HTTP requests
    # TYPE http_requests_total counter
    http_requests_total{server="api"} 0
    
    $ xh localhost:8080/healthcheck
    HTTP/1.1 200 OK
    content-length: 2
    content-type: text/plain; charset=utf-8
    date: Sat, 14 Aug 2021 12:26:03 GMT
    
    OK
    
    $ xh localhost:8080/metrics | grep http_requests_total
    # HELP http_requests_total Total number of HTTP requests
    # TYPE http_requests_total counter
    http_requests_total{server="api"} 1
    

    This is cool, but as the metric relies on constant labels, the measurement isn’t that granular. With a small modification we can use dynamic labels to store this counter per route and HTTP method:

    diff --git a/main.go b/main.go
    index 5d6079a..53249b1 100644
    --- a/main.go
    +++ b/main.go
    @@ -10,16 +10,17 @@ import (
            "github.com/prometheus/client_golang/prometheus/promhttp"
     )
    
    -var httpRequestsTotal = prometheus.NewCounter(
    +var httpRequestsTotal = prometheus.NewCounterVec(
            prometheus.CounterOpts{
                    Name:        "http_requests_total",
                    Help:        "Total number of HTTP requests",
                    ConstLabels: prometheus.Labels{"server": "api"},
            },
    +       []string{"route", "method"},
     )
    
     func HealthCheck(w http.ResponseWriter, r *http.Request) {
    -       httpRequestsTotal.Inc()
    +       httpRequestsTotal.WithLabelValues("/healthcheck", r.Method).Inc()
            w.WriteHeader(http.StatusOK)
            io.WriteString(w, "OK")
     }
    

    Again, in a real-life application it’s better to let the route be auto-discovered at runtime instead of hard-coding its value within the handler (a middleware sketch for that is shown at the end of this post). The result will look like:

    $ xh localhost:8080/metrics | grep http_requests_total
    # HELP http_requests_total Total number of HTTP requests
    # TYPE http_requests_total counter
    http_requests_total{route="/healthcheck",method="GET",server="api"} 1
    

    The key here is to understand that the counter vector doesn’t mean that multiple values will be stored in the same metric. What it does is use different label values to create a multi-dimensional metric, where each label combination is an element of the vector.
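
    Coming back to the earlier remark about not hard-coding the route: below is a minimal sketch of a middleware that discovers the matched route template at runtime with gorilla/mux, reusing the httpRequestsTotal counter vector from the example above (the middleware name and the "unknown" fallback are illustrative choices, not part of the original example):

    // metricsMiddleware increments the counter for every request, using the
    // route template matched by gorilla/mux instead of a hard-coded path.
    func metricsMiddleware(next http.Handler) http.Handler {
      return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
        path := "unknown"
        if route := mux.CurrentRoute(r); route != nil {
          if tpl, err := route.GetPathTemplate(); err == nil {
            path = tpl
          }
        }
        httpRequestsTotal.WithLabelValues(path, r.Method).Inc()
        next.ServeHTTP(w, r)
      })
    }

    It would be attached to the router from the example with r.Use(metricsMiddleware), and the hard-coded Inc() call inside HealthCheck() could then be removed.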

  • RAID on the Ubuntu Server Live installer

    My first contact with Ubuntu was in 2006, a little after the first Long-Term Support (LTS) version, 6.06 (Dapper Drake), was out. Although it still feels like yesterday, 15 years is a heck of a long time. Things were a bit different back then, as the Canonical LTS offer was about 3 years on desktop and 5 years on server releases - instead of 5 years for both, as it stands to this date. They even sent free CDs to anyone in the world, shipping included, from 2005 to 2011, when the initiative was ended. This may look stupid now, but downloading a CD over a 56k dial-up connection (which was still a thing in multiple parts of the world) used to take over a day. Even ADSL connections were not that much faster, as the most common ones were around 256-300 Kbps.

    It took me a few more years to use Linux on a desktop, which I did around the end of 2012, although I had been using it on my servers since at least 2010 - the year I started to grab cheap VPS offers from LowEndBox. By 2013 I started to work with Herberth Amaral (who is also one of the most competent professionals I know), where Ubuntu was being used on the servers instead of Debian - the latter being the Linux distribution I was used to. That didn’t make a huge difference, as both are quite similar when you don’t consider their desktop UI, but I still opted for Debian on my own machines.

    This trend continued when I started to contribute to the Debian Project in 2014, where I used a Debian server as my primary development machine. But, except for this server that I still have 7 years later, almost every other server I had, or company that I worked for, used Ubuntu - except for one employer that used CentOS. So by the end of last year, when I realized that this machine hadn’t been getting security updates for almost six months since Debian Stretch support ended, I started to think: why not just install Ubuntu on it? By doing that, I could forget about this machine for 5 more years, until the LTS support ends.

    To be fair, saying that I use Ubuntu on “almost every other server” is an understatement. Ubuntu is my go-to OS option in almost every kind of computing environment I use - except for my desktop, which has been running macOS since 2016. Ubuntu is the OS I use when starting a virtual machine with vagrant up, an EC2 instance on AWS, or when I want to try something quick with docker run (although I frequently use Alpine Linux in this last case). So opting for it on a server that is going to run for at least a few more years felt like a natural choice to me - at least until I faced their new server installer.

    To give a bit of context: being a Debian-based distribution, Ubuntu used the regular Debian Installer for its server distribution until the 18.04 (Bionic Beaver) LTS release, when it introduced the Ubuntu Server Live Installer. It didn’t work for me, as at the time it didn’t support non-standard setups like RAID and encrypted LVM. This wasn’t a big deal, as it was quite easy to find ISOs with the classic installer, so I ignored the issue for a while. The old setup offered the features I needed, and my expectation was that it was a matter of time for the new installer to be mature enough to properly replace the former.

    Last year the new Ubuntu 20.04 (Focal Fossa) LTS release came out, and the developers considered the installer transition to be complete. The release notes mention the features I missed, so I thought it would be a good idea to try it out. So let’s see how a RAID-0 installation pans out:

    Wait, what?! What do you mean by If you put all disks into RAIDs or LVM VGs, there will be nowhere to put the boot partition? GRUB has supported booting from RAID devices since at least 2008, so I guess it’s reasonable to expect that a Linux distribution installer won’t complain about that 13 years later. To make sure I’m not crazy or being betrayed by my own memory, I tried the same on a Debian Buster installation:

    No complaints, no error messages. The installation went all the way and booted fine in the end. “Something is odd”, I thought. By comparing the two partitioning summaries, I noticed that the Debian one is using partitions as a base for the RAID setup, while the Ubuntu one is relying on the entire disks. I went back to the Ubuntu installer and tried to use similar steps. The problem now is that if the option Add GPT Partition is used for both devices, it creates two partitions on the first disk and only one on the second disk. So I dropped to a shell from the Live Server installer with ALT + F2 and manually created empty partitions on both disks with fdisk (cfdisk will do fine as well). After a reboot, I tried again:

    Well, the complaint went away. But after creating the RAID array, opting to format the newly created device as ext4 and choosing / as its mount point, the Done button was still grayed out. Looking at the top of the screen again, the Mount a filesystem at / item was gone, so the last item that still needed to be satisfied was Select a boot disk. Clicking on one of the disks and selecting the option Use As Boot Device did the trick.

  • Slack threads are one honking great idea -- let's use more of those!

    Slack, one of the world’s most popular business communication platforms, launched the threaded conversations feature over 3 years ago. To this day, there are still people who don’t use them, be it out of inertia, personal taste, or because they don’t understand their purpose. The goal of this article is to illustrate that by taking the effortless action of clicking the Start/View a thread button when answering a message, you will not only be improving the life of your future self but also doing a huge favor to all your colleagues.

    Threaded messages are not just a communication style. They rely on one single pillar to improve the chat tool usability: reduce distractions by giving freedom to their users. Freedom in the sense that they can choose which conversation streams to follow, without having to leave or mute channels - things that may not be wanted or even feasible. And there are many nice side-effects to get in return. Let’s go through some Slack messaging characteristics to better understand their implications.

    There’s no need to send multiple separate messages

    Which of the following examples is easier to understand:

    A bunch of messages sent separately?

    Or a single message containing all the needed information:

    It’s important to note that the second example lends itself directly to the usage of threads: the messages that originated it are not scattered around. Also, if you need to append more information, the message can be edited (depending on the Workspace settings). That’s not just aesthetically pleasing; the main issue is that…

    Every message sent to a channel is a potential disruption

    A channel message may not result in a notification sent to your cellphone or desktop browser, but there are a couple of implications. First, there’s the “unread messages” icon, where the tab favicon turns white. This icon per se can catch someone else’s attention, making them wonder whether their assistance is needed or not. Second, there’s the problem that everybody will have to catch up with all channel messages when they return after being away from the chat. By using threads, the number of channel messages is reduced, making it easier for people to skim through the unread ones, choosing what they need to follow.

    Be careful when using the “also send to #channel” option

    There’s an option to also send the message to the channel when replying to a thread. It should be used with care, for the reasons mentioned above: it will generate a channel message, which comes with all its implications. It’s fine to use it, for instance, when sending a reminder about a thread that was started a while ago and needs attention from people who might have not seen it. Selecting this option just to “make a point”, showing what you are answering to people who might not be interested in the thread, may sound condescending and should be avoided.

    A thread is a group of related messages

    The main goal of using threads - grouping related messages - facilitates a few use cases. A thread can be, for instance, a support request from another team. After the issue is solved, one can tag it with a checkmark emoji indicating that it was concluded.

    This can help either someone else taking over the shift to understand whether any action is needed, or an interested third party to figure out whether the problem was properly answered/addressed without going through all the messages. Without a thread, it’s hard - impossible in high-traffic channels - to even figure out where the conversation ended.

    Threads improve message history significantly

    Another situation greatly improved by threads is going through the message history, which is especially useful in the paid - and unlimited - Slack version. Whether you get there through the search or through a link, when you find the relevant thread all the information is in there: the parent message containing the whole context, and all the discussion properly indicating where it started and where it ended. The true value of that can be easily seen, for instance, when a link to a discussion is attached to a ticket in the issue tracker and accessed months later.

    Closing thoughts

    Threads were invented with a noble goal: to make text-based communication more efficient. Even if it might be tempting to take a shortcut and start typing a response as soon as you see a message in a channel, remember that clicking on the Start/View thread button is a small step for you, but a giant leap for the whole chatting experience. By doing that, the quality of life of everyone who might be involved in a Slack conversation, either at that exact point in time or long into the future, will be greatly improved.

  • Formatting a list of strings in Ansible

    My Kubernetes Ansible setup - which is, to this date, still the easiest way to bootstrap an on-premises Kubernetes cluster - has a task that installs the packages tied to a specific Kubernetes version. When using it with an Ansible version newer than the one used when it was written, I was bothered by the following deprecation warning:

    [DEPRECATION WARNING]: Invoking "apt" only once while using a loop via
    squash_actions is deprecated. Instead of using a loop to supply multiple items
    and specifying `name: "{{ item }}={{ kubernetes_version }}-00"`, please use
    `name: '{{ kubernetes_packages }}'` and remove the loop. This feature will be
    removed in version 2.11. Deprecation warnings can be disabled by setting
    deprecation_warnings=False in ansible.cfg.
    

    This warning is a bit misleading. It’s clear that item, which comes from the kubernetes_packages variable used in a with_items option, is just one part of the equation. The package name is being interpolated with its version, glued together with other characters (= and -00), producing something like kubectl=1.17.2-00. Simply changing it to kubernetes_packages isn’t enough. The process of replacing this in-place interpolation with a proper list, as Ansible wants, can be achieved in a few ways:

    • Write down a list that interpolates hard-coded package names with version, like: kubectl={{ kubernetes_version }}-00. The problem is that this pattern has to be repeated for every package.
    • Find a way to generate this list dynamically, by applying the interpolation to every item of the kubernetes_packages list.

    Repetition isn’t always bad, but I prefer to avoid it here. The latter option can be easily achieved in any programming language with functional constructs, like JavaScript, which offers a map() array method that accepts a function (here, an arrow function) as the argument and returns another array:

    let pkgs = ['kubelet', 'kubectl', 'kubeadm'];
    
    let version = '1.17.2';
    
    pkgs.map(p => `${p}=${version}-00`);
    (3) ["kubelet=1.17.2-00", "kubectl=1.17.2-00", "kubeadm=1.17.2-00"]
    

    Python, the language in which Ansible is written, offers a map() function which accepts a function (here, a lambda expression) and a list as arguments. The object it returns can then be converted to a list:

    In [1]: pkgs = ['kubelet', 'kubectl', 'kubeadm']
    
    In [2]: version = '1.17.2'
    
    In [3]: list(map(lambda p: '{}={}-00'.format(p, version), pkgs))
    Out[3]: ['kubelet=1.17.2-00', 'kubectl=1.17.2-00', 'kubeadm=1.17.2-00']
    

    That was supposed to be similarly easy in Ansible, given that Jinja, its template language, offers a format() filter. The problem is that it does not - and will not - support combining the format() and map() filters. Another way to do the same would be to use the format() filter in a list comprehension, but that’s also not supported. But not all hope is lost, as Ansible supports additional regular expression filters, like regex_replace(). It can be used in many different ways, but here we will use it for a single thing: concatenating the package name with a suffix that is itself built by string concatenation. This way, the following task:

    - name: install packages
      apt:
        name: "{{ item }}={{ kubernetes_version }}-00"
      with_items: "{{ kubernetes_packages }}"
    

    Is equivalent to:

    - name: install packages
      apt:
        name: "{{ kubernetes_packages |
          map('regex_replace', '$', '=' + kubernetes_version + '-00') |
          list
        }}"
    

    The key is that the '$' character matches the end of the string, so replacing it is akin to concatenating two strings. The list filter at the end is needed because, just like the equivalent Python built-in function, the Jinja map() filter also returns a generator. This object then needs to be converted to a list, otherwise it would result in errors like No package matching '<generator object do_map at 0x10bbedba0>' is available, given that its string representation would be used as the package name.

  • Web app updates without polling

    A co-worker wasn’t happy with the current solution to update a page in our web app: polling some API endpoints every minute. This isn’t just wasteful, as most of the time the requests bring no new data, but also slow, as every change can take up to a minute to propagate. He asked if I knew a better way to do that, and while I didn’t have an answer to give right away, I did remember having heard about solutions for this problem in the past. A quick Stack Overflow search showed three options:

    • Long polling: close to what we were already doing, but with the server holding each request open until new data is available.
    • WebSockets: persistent connections that can be used to transfer data in both ways.
    • Server-Sent Events (SSEs): a one-way option to push data from server to client over a single long-lived connection.

    WebSockets looked like the most interesting option. Their name also reminded me of another technique applications use to push updates to each other: webhooks. That’s how, for instance, a CI/CD job is triggered after changes to a repository are pushed to GitHub. The only problem is that webhooks are, by definition, a server-to-server interaction - a server cannot send an HTTP request to a client. I started to ask myself: if so, how do websites like the super cool Webhook.site work?

    Webhook.site is a tool meant to debug webhooks. One can set up a webhook, for instance, in GitHub, and inspect the entire body/headers/method of the request in a nice interface, without having to set up a server for that. The most interesting part is that the requests sent by the webhooks are displayed on the webpage in (near) real time: exactly the problem I was looking to solve. So I started to look around to figure out how they managed to achieve that.

    Digging through the page source, I found some references to Socket.IO, which is indeed an engine that was designed to offer bidirectional real-time communication. Before even trying to use it, I tried to understand if it worked over WebSockets and found the following quote on its Wikipedia page:

    Socket.IO is not a WebSocket library with fallback options to other realtime protocols. It is a custom realtime transport protocol implementation on top of other realtime protocols.

    So, Socket.IO may be a nice tool, but not the best alternative for our use case. There’s no need to adopt a custom protocol - which might force us, for instance, to replace our server implementation - when we can opt for an IETF/W3C standard like WebSockets. So I started to think about how webhooks could be integrated with WebSockets.

    The easiest way would be to store, in memory, all the currently open WebSocket connections to an endpoint and loop over them every time a webhook request comes in. The problem is that this architecture doesn’t scale. It’s fine when there’s a single API server, but it wouldn’t work when there are multiple clients connected to different instances of the API server. In that case, only a subset of the clients - the ones held in memory by the instance that received the webhook - would be notified about the update.

    By this point, a former co-worker and university colleague mentioned that with Redis it would be even simpler to achieve that. It didn’t immediately make sense to me, as at first I thought Redis would only replace the in-memory connection list, until I found out about the PUBLISH command. Not only would Redis take care of almost all the logic involved in notifying the connected clients: when combined with the SUBSCRIBE command, it actually offers an entire Pub/Sub solution. It was awesome to learn that the tool I had used for over 5 years mainly as a centralized hash table - or a queue at most - was the key to simplifying the whole architecture.
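
    A minimal sketch of how these pieces could fit together, using only the standard net/http package together with the gorilla/websocket and go-redis libraries; the channel name "webhooks", the handler paths and every other identifier below are illustrative assumptions, not taken from the actual project:

    package main

    import (
      "context"
      "io"
      "log"
      "net/http"
      "sync"

      "github.com/go-redis/redis/v8"
      "github.com/gorilla/websocket"
    )

    var (
      upgrader = websocket.Upgrader{CheckOrigin: func(r *http.Request) bool { return true }}
      rdb      = redis.NewClient(&redis.Options{Addr: "localhost:6379"})

      // clients holds the WebSocket connections opened against this instance only.
      mu      sync.Mutex
      clients = map[*websocket.Conn]bool{}
    )

    // wsHandler registers a browser connection that wants live updates.
    func wsHandler(w http.ResponseWriter, r *http.Request) {
      conn, err := upgrader.Upgrade(w, r, nil)
      if err != nil {
        return
      }
      mu.Lock()
      clients[conn] = true
      mu.Unlock()
    }

    // webhookHandler receives the webhook and publishes its body to Redis,
    // so that every API instance - not just this one - hears about it.
    func webhookHandler(w http.ResponseWriter, r *http.Request) {
      body, _ := io.ReadAll(r.Body)
      rdb.Publish(r.Context(), "webhooks", body)
      w.WriteHeader(http.StatusAccepted)
    }

    // broadcast subscribes to the Redis channel and forwards every message
    // to the WebSocket clients connected to this particular instance.
    func broadcast(ctx context.Context) {
      sub := rdb.Subscribe(ctx, "webhooks")
      for msg := range sub.Channel() {
        mu.Lock()
        for conn := range clients {
          if err := conn.WriteMessage(websocket.TextMessage, []byte(msg.Payload)); err != nil {
            conn.Close()
            delete(clients, conn)
          }
        }
        mu.Unlock()
      }
    }

    func main() {
      go broadcast(context.Background())
      http.HandleFunc("/ws", wsHandler)
      http.HandleFunc("/webhook", webhookHandler)
      log.Fatal(http.ListenAndServe(":8080", nil))
    }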

    Now it was time to assemble the planned solution. With Gin + Gorilla WebSocket + Redis in the backend, together with React in the frontend, I’ve managed to create Webhooks + WebSockets, a very basic clone of Webhook.site, showing how real-time server updates can be achieved in a web app by combining the two standards, backed by multiple technologies. The source code is available on GitHub.

Subscribe via RSS