service-health-go
Go helper library implementing version 1.0 of the Service Health Reporting Format (rfc-service-health).
The RFC defines a single endpoint, /service-health, that every service in a
cooperating ecosystem exposes. It serves two representations of the same
state — a human-friendly JSON form and a Prometheus-scrapable OpenMetrics
form — selected by HTTP content negotiation.
This module gives you a Reporter, a few Register(...) calls for your
sub-checks, and an http.Handler to mount. Done.
- Spec: https://forge.stacktop.network/openstacktop/rfc-service-health
- Spec source (verbatim): https://forge.stacktop.network/openstacktop/rfc-service-health/raw/branch/main/docs/rfc-service-health.txt
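As a rough illustration of the content negotiation described above, the sketch below picks a representation from an Accept header. It is not the library's code: pickRepresentation is a hypothetical helper, and it assumes that JSON is the default and that q-values and wildcards can be ignored, which the RFC may treat differently.

```go
package main

import (
	"fmt"
	"mime"
	"strings"
)

// pickRepresentation returns "openmetrics" when the Accept header lists
// application/openmetrics-text, and "json" in every other case.
// Assumption: JSON is the fallback and q-values are not weighed.
func pickRepresentation(accept string) string {
	for _, part := range strings.Split(accept, ",") {
		mediaType, _, err := mime.ParseMediaType(strings.TrimSpace(part))
		if err != nil {
			continue // skip empty or malformed entries
		}
		if mediaType == "application/openmetrics-text" {
			return "openmetrics"
		}
	}
	return "json"
}

func main() {
	fmt.Println(pickRepresentation("application/openmetrics-text;version=1.0.0")) // openmetrics
	fmt.Println(pickRepresentation("application/json"))                           // json
	fmt.Println(pickRepresentation(""))                                           // json
}
```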
Install
go env -w GOPRIVATE='forge.stacktop.network/*'
go get forge.stacktop.network/openstacktop/service-health-go@latest
Requires Go 1.26+. No external dependencies.
Usage
Minimal — service with no dependencies
package main

import (
	"net/http"

	sh "forge.stacktop.network/openstacktop/service-health-go"
)

func main() {
	r, _ := sh.New("hello-svc")
	mux := http.NewServeMux()
	mux.Handle(sh.EndpointPath, r.Handler())
	_ = http.ListenAndServe("127.0.0.1:8080", mux)
}
$ curl -s http://127.0.0.1:8080/service-health
{"name":"hello-svc","status":1}
Full — Postgres + outbound API + background refresh
package main

import (
	"context"
	"database/sql"
	"net/http"
	"time"

	sh "forge.stacktop.network/openstacktop/service-health-go"
	_ "github.com/jackc/pgx/v5/stdlib"
)

func main() {
	db, err := sql.Open("pgx", "postgres://localhost/app")
	if err != nil {
		panic(err)
	}
	r, _ := sh.New("billing-api")

	// Well-known sub-check name; cheap (default 15s cadence).
	_ = r.Register("database", func(ctx context.Context) sh.Result {
		if err := db.PingContext(ctx); err != nil {
			return sh.Result{Status: sh.Red, Msg: "db ping failed"}
		}
		return sh.Result{Status: sh.Green}
	})

	// Expensive upstream: cache up to one minute. Active failures
	// override this and refresh every 15s anyway (RFC §4.7 rule 4).
	_ = r.Register("stripe", func(ctx context.Context) sh.Result {
		req, _ := http.NewRequestWithContext(ctx, "GET", "https://api.stripe.com/healthcheck", nil)
		resp, err := http.DefaultClient.Do(req)
		if err != nil {
			return sh.Result{Status: sh.Red, Msg: "stripe unreachable"}
		}
		defer resp.Body.Close()
		if resp.StatusCode >= 500 {
			return sh.Result{Status: sh.Yellow, Msg: "stripe degraded"}
		}
		return sh.Result{Status: sh.Green}
	}, sh.Interval(time.Minute))

	// Optional: keeps cache fresh during scrape gaps. Stop by cancelling ctx.
	ctx, cancel := context.WithCancel(context.Background())
	defer cancel()
	go r.Start(ctx)

	mux := http.NewServeMux()
	mux.Handle("/api/", apiRouter())         // your existing routes
	mux.Handle(sh.EndpointPath, r.Handler()) // mount /service-health
	_ = http.ListenAndServe("127.0.0.1:8080", mux)
}

func apiRouter() http.Handler { return http.NewServeMux() }
$ curl -s http://127.0.0.1:8080/service-health | jq
{
  "name": "billing-api",
  "status": 1,
  "children": [
    {"name": "database", "status": 1},
    {"name": "stripe", "status": 1}
  ]
}
$ curl -s -H 'Accept: application/openmetrics-text;version=1.0.0' \
http://127.0.0.1:8080/service-health
# HELP service_health Aggregate health (0=red, 0.5=yellow, 1=green).
# TYPE service_health gauge
service_health{service="billing-api"} 1
# HELP service_health_check Health of a sub-check within the service.
# TYPE service_health_check gauge
service_health_check{service="billing-api",check="database"} 1
service_health_check{service="billing-api",check="stripe"} 1
# EOF
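To make the shape of the OpenMetrics output above concrete, here is a minimal sketch of how such text could be rendered. The check type and the renderOpenMetrics, escapeLabel and token helpers are hypothetical names for illustration, not the library's API. The sketch assumes the sub-check family is omitted when there are no sub-checks, and it escapes backslash, double quote and newline in label values as the OpenMetrics text format requires.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// check is a hypothetical stand-in for one sub-check result.
type check struct {
	Name   string
	Status float64 // 0, 0.5 or 1
}

// labelEscaper escapes backslash, double quote and newline in label values.
var labelEscaper = strings.NewReplacer(`\`, `\\`, `"`, `\"`, "\n", `\n`)

func escapeLabel(s string) string { return labelEscaper.Replace(s) }

// token formats a status as the shortest decimal: "0", "0.5" or "1".
func token(v float64) string { return strconv.FormatFloat(v, 'f', -1, 64) }

// renderOpenMetrics writes the root gauge, then one gauge per sub-check,
// and terminates the exposition with "# EOF".
func renderOpenMetrics(service string, status float64, checks []check) string {
	var b strings.Builder
	svc := escapeLabel(service)
	b.WriteString("# HELP service_health Aggregate health (0=red, 0.5=yellow, 1=green).\n")
	b.WriteString("# TYPE service_health gauge\n")
	fmt.Fprintf(&b, "service_health{service=\"%s\"} %s\n", svc, token(status))
	if len(checks) > 0 { // assumption: no family header without samples
		b.WriteString("# HELP service_health_check Health of a sub-check within the service.\n")
		b.WriteString("# TYPE service_health_check gauge\n")
		for _, c := range checks {
			fmt.Fprintf(&b, "service_health_check{service=\"%s\",check=\"%s\"} %s\n",
				svc, escapeLabel(c.Name), token(c.Status))
		}
	}
	b.WriteString("# EOF\n")
	return b.String()
}

func main() {
	fmt.Print(renderOpenMetrics("billing-api", 0.5, []check{
		{Name: "database", Status: 1},
		{Name: "stripe", Status: 0.5},
	}))
}
```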
When stripe returns Yellow, the root rolls up to 0.5 and the JSON
gains a "msg" field with the first failing child's note:
{
  "name": "billing-api",
  "status": 0.5,
  "msg": "stripe: stripe degraded",
  "children": [
    {"name": "database", "status": 1},
    {"name": "stripe", "status": 0.5, "msg": "stripe degraded"}
  ]
}
Setting a service-level self-status
Use this when the service itself is degraded for reasons that aren't captured by a sub-check (e.g. config reload pending, queue backlog above threshold):
_ = r.SetStatus(sh.Yellow, "draining: 142 jobs queued before shutdown")
// later, once drained:
_ = r.SetStatus(sh.Green, "")
The rollup still takes min(self, children), so a healthy self with one
red child is reported as red.
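The min(self, children) rule can be sketched in a few lines. The Status, Child and rollup names below are illustrative and not taken from the library. The sketch also assumes that "first failing child" means the first child, in registration order, that is below green and carries a message.

```go
package main

import "fmt"

// Status mirrors the three numeric tokens: 0 (red), 0.5 (yellow), 1 (green).
type Status float64

const (
	Red    Status = 0
	Yellow Status = 0.5
	Green  Status = 1
)

// Child is a hypothetical stand-in for one sub-check result.
type Child struct {
	Name   string
	Status Status
	Msg    string
}

// rollup returns the minimum of the self-status and all child statuses,
// plus a "name: msg" note taken from the first non-green child.
func rollup(self Status, children []Child) (Status, string) {
	status, msg := self, ""
	for _, c := range children {
		if c.Status < status {
			status = c.Status
		}
		if msg == "" && c.Status < Green && c.Msg != "" {
			msg = c.Name + ": " + c.Msg
		}
	}
	return status, msg
}

func main() {
	status, msg := rollup(Green, []Child{
		{Name: "database", Status: Green},
		{Name: "stripe", Status: Yellow, Msg: "stripe degraded"},
	})
	fmt.Println(status, msg) // 0.5 stripe: stripe degraded
}
```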
What the library handles for you
- Rollup: root status = min(self, all children) (RFC §4.6).
- Refresh cadence: 15 s for cheap checks, configurable up to 10 min for expensive ones, with active failures forced back to 15 s (RFC §4.7).
- Content negotiation, Cache-Control: no-store, always-200 responses, exact 0/0.5/1 numeric tokens (RFC §3.3–§3.5, §4.4, §5.1).
- OpenMetrics output with # HELP / # TYPE / # EOF and correctly escaped label values.
- Open CORS (Access-Control-Allow-Origin: *) with OPTIONS preflight, so browser-based aggregator dashboards on any origin can scrape the endpoint. The payload carries no credentialled data and the endpoint is expected to be internal-only (see security note below).
- Panic recovery in user check functions, message sanitisation (control chars stripped, 200-rune cap), name validation.
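The refresh cadence rule in the list above can be read as a small clamp. The sketch below is one possible interpretation and not the library's implementation: effectiveInterval is a hypothetical helper, and it assumes out-of-range intervals are clamped rather than rejected and that "active failure" means the last result was not green.

```go
package main

import (
	"fmt"
	"time"
)

const (
	minInterval = 15 * time.Second // cadence for cheap checks and for failing checks
	maxInterval = 10 * time.Minute // upper bound for expensive checks
)

// effectiveInterval returns how long a cached result may be reused.
// A failing check is always refreshed at minInterval, whatever was configured.
func effectiveInterval(configured time.Duration, failing bool) time.Duration {
	if failing || configured < minInterval {
		return minInterval
	}
	if configured > maxInterval {
		return maxInterval
	}
	return configured
}

func main() {
	fmt.Println(effectiveInterval(time.Minute, false)) // 1m0s
	fmt.Println(effectiveInterval(time.Minute, true))  // 15s
	fmt.Println(effectiveInterval(time.Hour, false))   // 10m0s
}
```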
What stays your job
- Don't put secrets, credentials, PII, or stack traces in Msg. It surfaces in dashboards (RFC §8).
- Don't expose /service-health on the public Internet. Bind to an internal interface or protect at the ingress layer (RFC §8).
- Pick well-known sub-check names (database, network, disk) when applicable; their absence is itself a meaningful signal (RFC §4.3).
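The message sanitisation mentioned earlier (control chars stripped, 200-rune cap) only normalises the form of Msg. It cannot tell whether the text is sensitive, which is why the content stays your responsibility. A rough sketch of that kind of normalisation follows. sanitiseMsg is a hypothetical name, and the exact behaviour (dropping control characters instead of replacing them, truncating without an ellipsis) is an assumption.

```go
package main

import (
	"fmt"
	"unicode"
)

const maxMsgRunes = 200

// sanitiseMsg drops control characters and keeps at most maxMsgRunes runes.
// It does not look for secrets: a password in the input stays in the output.
func sanitiseMsg(s string) string {
	out := make([]rune, 0, maxMsgRunes)
	for _, r := range s {
		if unicode.IsControl(r) {
			continue
		}
		if len(out) == maxMsgRunes {
			break
		}
		out = append(out, r)
	}
	return string(out)
}

func main() {
	fmt.Println(sanitiseMsg("db ping\x00 failed\n")) // db ping failed
}
```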
License
MIT. Maintained by openstacktop (open source unit of stacktop GmbH). The referenced RFC is licensed CC BY 4.0 by the same author.