service-health-go

Go helper library implementing version 1.0 of the Service Health Reporting Format (rfc-service-health).

The RFC defines a single endpoint, /service-health, that every service in a cooperating ecosystem exposes. It serves two representations of the same state — a human-friendly JSON form and a Prometheus-scrapable OpenMetrics form — selected by HTTP content negotiation.
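
To make the negotiation step concrete, here is a minimal sketch using only the standard library. It is not this module's implementation: wantsOpenMetrics, healthHandler and fetch are hypothetical names, the state is hard-coded, and q-values in the Accept header are ignored for brevity. The assumption is simply that a request naming application/openmetrics-text gets the metrics form and everything else gets JSON.

```go
package main

import (
	"fmt"
	"io"
	"net/http"
	"net/http/httptest"
	"strings"
)

// wantsOpenMetrics is a hypothetical helper: it reports whether the Accept
// header lists the OpenMetrics media type. q-values are ignored for brevity.
func wantsOpenMetrics(accept string) bool {
	for _, part := range strings.Split(accept, ",") {
		mediaType, _, _ := strings.Cut(part, ";")
		if strings.EqualFold(strings.TrimSpace(mediaType), "application/openmetrics-text") {
			return true
		}
	}
	return false
}

// healthHandler serves one hard-coded green state in either representation.
func healthHandler(w http.ResponseWriter, r *http.Request) {
	if wantsOpenMetrics(r.Header.Get("Accept")) {
		w.Header().Set("Content-Type", "application/openmetrics-text; version=1.0.0; charset=utf-8")
		io.WriteString(w, "# TYPE service_health gauge\nservice_health{service=\"demo\"} 1\n# EOF\n")
		return
	}
	w.Header().Set("Content-Type", "application/json")
	io.WriteString(w, `{"name":"demo","status":1}`+"\n")
}

// fetch runs an in-memory request against healthHandler and returns the body.
func fetch(accept string) string {
	req := httptest.NewRequest(http.MethodGet, "/service-health", nil)
	if accept != "" {
		req.Header.Set("Accept", accept)
	}
	rec := httptest.NewRecorder()
	healthHandler(rec, req)
	return rec.Body.String()
}

func main() {
	fmt.Print(fetch(""))                                           // JSON form
	fmt.Print(fetch("application/openmetrics-text;version=1.0.0")) // metrics form
}
```

Defaulting to JSON when no Accept header is sent matches the plain curl call shown under Usage; how the RFC treats q-values and wildcards is not covered by this sketch.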

This module gives you a Reporter, a few Register(...) calls for your sub-checks, and an http.Handler to mount. Done.

Install

go env -w GOPRIVATE='forge.stacktop.network/*'
go get forge.stacktop.network/openstacktop/service-health-go@latest

Requires Go 1.26+. No external dependencies.

Usage

Minimal — service with no dependencies

package main

import (
    "net/http"

    sh "forge.stacktop.network/openstacktop/service-health-go"
)

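func main() {
    r, err := sh.New("hello-svc")
    if err != nil {
        panic(err) // do not ignore the constructor error
    }
    mux := http.NewServeMux()
    mux.Handle(sh.EndpointPath, r.Handler())
    if err := http.ListenAndServe("127.0.0.1:8080", mux); err != nil {
        panic(err)
    }
}

$ curl -s http://127.0.0.1:8080/service-health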
{"name":"hello-svc","status":1}

Full — Postgres + outbound API + background refresh

package main

import (
    "context"
    "database/sql"
    "net/http"
    "time"

    sh "forge.stacktop.network/openstacktop/service-health-go"
    _ "github.com/jackc/pgx/v5/stdlib"
)

func main() {
    db, err := sql.Open("pgx", "postgres://localhost/app")
    if err != nil {
        panic(err)
    }
    defer db.Close()

    r, err := sh.New("billing-api")
    if err != nil {
        panic(err)
    }

    // Well-known sub-check name; cheap (default 15s cadence).
    _ = r.Register("database", func(ctx context.Context) sh.Result {
        if err := db.PingContext(ctx); err != nil {
            return sh.Result{Status: sh.Red, Msg: "db ping failed"}
        }
        return sh.Result{Status: sh.Green}
    })

    // Expensive upstream — cache up to one minute. Active failures
    // override this and refresh every 15s anyway (RFC §4.7 rule 4).
    _ = r.Register("stripe", func(ctx context.Context) sh.Result {
        req, _ := http.NewRequestWithContext(ctx, "GET", "https://api.stripe.com/healthcheck", nil)
        resp, err := http.DefaultClient.Do(req)
        if err != nil {
            return sh.Result{Status: sh.Red, Msg: "stripe unreachable"}
        }
        defer resp.Body.Close()
        if resp.StatusCode >= 500 {
            return sh.Result{Status: sh.Yellow, Msg: "stripe degraded"}
        }
        return sh.Result{Status: sh.Green}
    }, sh.Interval(time.Minute))

    // Optional: keeps cache fresh during scrape gaps. Stop by cancelling ctx.
    ctx, cancel := context.WithCancel(context.Background())
    defer cancel()
    go r.Start(ctx)

    mux := http.NewServeMux()
    mux.Handle("/api/", apiRouter())              // your existing routes
    mux.Handle(sh.EndpointPath, r.Handler())      // mount /service-health
    if err := http.ListenAndServe("127.0.0.1:8080", mux); err != nil {
        panic(err)
    }
}

// apiRouter stands in for your application's existing routes.
func apiRouter() http.Handler { return http.NewServeMux() }

$ curl -s http://127.0.0.1:8080/service-health | jq
{
  "name": "billing-api",
  "status": 1,
  "children": [
    {"name": "database", "status": 1},
    {"name": "stripe",   "status": 1}
  ]
}

$ curl -s -H 'Accept: application/openmetrics-text;version=1.0.0' \
        http://127.0.0.1:8080/service-health
# HELP service_health Aggregate health (0=red, 0.5=yellow, 1=green).
# TYPE service_health gauge
service_health{service="billing-api"} 1
# HELP service_health_check Health of a sub-check within the service.
# TYPE service_health_check gauge
service_health_check{service="billing-api",check="database"} 1
service_health_check{service="billing-api",check="stripe"} 1
# EOF
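
The metrics output above can be reproduced with a few lines of standard-library Go. This is a hedged sketch, not the module's render.go: Check, escapeLabel, token and render are hypothetical names. It assumes the two metric families shown above and applies the OpenMetrics label-value escapes (backslash, double quote, line feed).

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// Check is a hypothetical stand-in for one named sub-check and its status.
type Check struct {
	Name   string
	Status float64 // 0, 0.5 or 1
}

// escapeLabel applies the OpenMetrics label-value escapes:
// backslash, double quote and line feed.
func escapeLabel(v string) string {
	return strings.NewReplacer(`\`, `\\`, `"`, `\"`, "\n", `\n`).Replace(v)
}

// token formats a status as the shortest exact decimal: 0, 0.5 or 1.
func token(s float64) string {
	return strconv.FormatFloat(s, 'f', -1, 64)
}

// render writes both metric families followed by the closing # EOF marker.
func render(service string, root float64, checks []Check) string {
	var b strings.Builder
	svc := escapeLabel(service)
	b.WriteString("# HELP service_health Aggregate health (0=red, 0.5=yellow, 1=green).\n")
	b.WriteString("# TYPE service_health gauge\n")
	fmt.Fprintf(&b, "service_health{service=\"%s\"} %s\n", svc, token(root))
	b.WriteString("# HELP service_health_check Health of a sub-check within the service.\n")
	b.WriteString("# TYPE service_health_check gauge\n")
	for _, c := range checks {
		fmt.Fprintf(&b, "service_health_check{service=\"%s\",check=\"%s\"} %s\n",
			svc, escapeLabel(c.Name), token(c.Status))
	}
	b.WriteString("# EOF\n")
	return b.String()
}

func main() {
	checks := []Check{{Name: "database", Status: 1}, {Name: "stripe", Status: 1}}
	fmt.Print(render("billing-api", 1, checks))
}
```

Formatting with strconv rather than %v keeps the numeric tokens exact, so a yellow status is always written as 0.5 and never in exponent form.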

When stripe returns Yellow, the root rolls up to 0.5 and the JSON gains a "msg" field with the first failing child's note:

{
  "name": "billing-api",
  "status": 0.5,
  "msg": "stripe: stripe degraded",
  "children": [
    {"name": "database", "status": 1},
    {"name": "stripe",   "status": 0.5, "msg": "stripe degraded"}
  ]
}

Setting a service-level self-status

Use this when the service itself is degraded for reasons that aren't captured by a sub-check (e.g. config reload pending, queue backlog above threshold):

_ = r.SetStatus(sh.Yellow, "draining: 142 jobs queued before shutdown")
// later, once drained:
_ = r.SetStatus(sh.Green, "")

The rollup still takes min(self, children), so a healthy self with one red child is reported as red.
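
The min(self, children) rule can be sketched in a few lines. The types and the rollup function below are hypothetical and only mirror what this README shows. In particular, the message handling (self message first, otherwise "name: msg" of the first non-green child) is an assumption inferred from the JSON example above, not a statement of the RFC.

```go
package main

import "fmt"

// Status mirrors the three numeric tokens the format uses: 0, 0.5 and 1.
type Status float64

const (
	Red    Status = 0
	Yellow Status = 0.5
	Green  Status = 1
)

// Check is a hypothetical stand-in for one sub-check result.
type Check struct {
	Name   string
	Status Status
	Msg    string
}

// rollup returns min(self, children...). The message is the self message if
// set, otherwise "name: msg" of the first non-green child (an assumption).
func rollup(self Status, selfMsg string, children []Check) (Status, string) {
	status, msg := self, selfMsg
	for _, c := range children {
		if c.Status < status {
			status = c.Status
		}
		if msg == "" && c.Status < Green {
			msg = c.Name + ": " + c.Msg
		}
	}
	return status, msg
}

func main() {
	children := []Check{
		{Name: "database", Status: Green},
		{Name: "stripe", Status: Yellow, Msg: "stripe degraded"},
	}
	status, msg := rollup(Green, "", children)
	fmt.Println(status, msg) // prints: 0.5 stripe: stripe degraded
}
```

Because the statuses are ordered numbers, "worst wins" reduces to a plain minimum, which is why a green self-status cannot mask a red child.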

What the library handles for you

  • Rollup: root status = min(self, all children) — RFC §4.6.
  • Refresh cadence: 15 s for cheap checks, configurable up to 10 min for expensive ones, with active failures forced back to 15 s — RFC §4.7.
  • Content negotiation, Cache-Control: no-store, always-200 responses, exact 0 / 0.5 / 1 numeric tokens — RFC §3.3, §3.5, §4.4, §5.1.
  • OpenMetrics output with # HELP / # TYPE / # EOF and correctly escaped label values.
  • Open CORS (Access-Control-Allow-Origin: *) with OPTIONS preflight, so browser-based aggregator dashboards on any origin can scrape the endpoint. The payload carries no credentialled data and the endpoint is expected to be internal-only (see "What stays your job" below).
  • Panic recovery in user check functions, message sanitisation (control chars stripped, 200-rune cap), name validation.
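
Two of the rules in this list are small enough to sketch directly: the refresh cadence and the message sanitisation. The helpers below are hypothetical and use only the standard library. Two details are assumptions not stated here: that an interval below 15 s is raised to 15 s, and that the 200-rune cap is applied after control characters are removed.

```go
package main

import (
	"fmt"
	"strings"
	"time"
	"unicode"
)

const (
	baseInterval = 15 * time.Second // cadence for cheap or failing checks
	maxInterval  = 10 * time.Minute // upper bound for expensive checks
	maxMsgRunes  = 200              // message length cap, counted in runes
)

// effectiveInterval clamps a configured interval to [15s, 10min] and forces
// a failing check back to the 15s base cadence.
func effectiveInterval(configured time.Duration, failing bool) time.Duration {
	if failing || configured < baseInterval {
		return baseInterval
	}
	if configured > maxInterval {
		return maxInterval
	}
	return configured
}

// sanitizeMsg drops control characters and keeps at most maxMsgRunes runes.
func sanitizeMsg(msg string) string {
	var b strings.Builder
	kept := 0
	for _, r := range msg {
		if unicode.IsControl(r) {
			continue
		}
		if kept == maxMsgRunes {
			break
		}
		b.WriteRune(r)
		kept++
	}
	return b.String()
}

func main() {
	fmt.Println(effectiveInterval(time.Minute, false))        // 1m0s
	fmt.Println(effectiveInterval(time.Minute, true))         // 15s
	fmt.Println(effectiveInterval(time.Hour, false))          // 10m0s
	fmt.Printf("%q\n", sanitizeMsg("db ping\x00 failed\r\n")) // "db ping failed"
}
```

Counting runes rather than bytes means a multi-byte character is never cut in half at the cap.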

What stays your job

  • Don't put secrets, credentials, PII, or stack traces in Msg. It surfaces in dashboards (RFC §8).
  • Don't expose /service-health on the public Internet. Bind to an internal interface or protect at the ingress layer (RFC §8).
  • Pick well-known sub-check names (database, network, disk) when applicable; their absence is itself a meaningful signal (RFC §4.3).
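
If binding to an internal interface is not an option, one way to keep the endpoint internal is a small guard in front of the handler. The sketch below is an illustration under stated assumptions, not part of this module: internalOnly and statusFor are hypothetical names, the stand-in health handler always answers 200, and the guard trusts r.RemoteAddr, which behind a reverse proxy is the proxy's address rather than the client's.

```go
package main

import (
	"fmt"
	"net/http"
	"net/http/httptest"
	"net/netip"
)

// internalOnly is a hypothetical guard: it forwards requests from loopback
// or private address ranges and answers 403 to everything else.
func internalOnly(next http.Handler) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		ap, err := netip.ParseAddrPort(r.RemoteAddr)
		if err != nil {
			http.Error(w, "forbidden", http.StatusForbidden)
			return
		}
		addr := ap.Addr().Unmap() // treat IPv4-mapped IPv6 as IPv4
		if !addr.IsLoopback() && !addr.IsPrivate() {
			http.Error(w, "forbidden", http.StatusForbidden)
			return
		}
		next.ServeHTTP(w, r)
	})
}

// statusFor sends an in-memory request from remoteAddr through the guard
// and returns the resulting HTTP status code.
func statusFor(remoteAddr string) int {
	health := http.HandlerFunc(func(w http.ResponseWriter, _ *http.Request) {
		w.WriteHeader(http.StatusOK) // stand-in for the real health handler
	})
	req := httptest.NewRequest(http.MethodGet, "/service-health", nil)
	req.RemoteAddr = remoteAddr
	rec := httptest.NewRecorder()
	internalOnly(health).ServeHTTP(rec, req)
	return rec.Code
}

func main() {
	fmt.Println(statusFor("10.0.0.5:4321"))    // 200
	fmt.Println(statusFor("203.0.113.9:4321")) // 403
}
```

In a real service the guard would wrap the handler at mount time; protecting the path at the ingress layer, as recommended above, remains the more robust option.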

License

MIT. Maintained by openstacktop (the open source unit of stacktop GmbH). The referenced RFC is licensed CC BY 4.0 by the same author.