Categraf prometheus plugin frequently drops data

Question

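Background: Nightingale (n9e) and Categraf are deployed on Kubernetes. Nightingale receives the metrics and writes them to a VictoriaMetrics cluster. This Categraf instance has only the prometheus plugin enabled and is dedicated to scraping Prometheus endpoints.
Problem: there are about 50 target URLs, and almost every one of them has gaps in its data, as shown in the attached screenshot. The Categraf log shows nothing abnormal.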

Categraf base configuration

    [global]
    print_configs = false
    hostname = ""
    omit_hostname = true
    precision = "ms"
    interval = 30

    [writer_opt]
    batch = 5000
    chan_size = 2000000

    [[writers]]
    url = "http://n9e-server:19000/prometheus/v1/write"
    # timeout settings, unit: ms
    timeout = 5000
    dial_timeout = 2500
    max_idle_conns_per_host = 1000

Categraf inputs.prometheus configuration

    [[instances]]
    labels = { ident="master", cluster="aaa", role="master" }
    urls = [
        "http://machine1:19091/metrics",
        "http://machine2:19091/metrics",
        "http://machine3:19091/metrics",
        "http://machine4:19091/metrics",
        "http://machine5:19091/metrics"
    ]
    url_label_key = "instance"
    url_label_value = "{{.Host}}"
    headers = ["X-From", "monitor"]
    timeout = "25s" # timeout for every url

    [[instances]]
    # Volume Server
    labels = { ident="volume", cluster="aaa", role="volume" }
    urls = [
        "http://machine1:19092/metrics",
        "http://machine2:19092/metrics",
        "http://machine3:19092/metrics",
        "http://machine4:19092/metrics",
        "http://machine5:19092/metrics",
        "http://machine6:19092/metrics",
    ]
    url_label_key = "instance"
    url_label_value = "{{.Host}}"
    headers = ["X-From", "monitor"]
    timeout = "25s" # timeout for every url

VictoriaMetrics component startup flags

vmstorage

    vmstorage-prod -loggerTimezone Asia/Shanghai -storageDataPath /data5/victoria-metrics-data/ -dedup.minScrapeInterval=5s -retentionPeriod=400d -search.maxUniqueTimeseries=1000000 -vminsertAddr=:8400 -vmselectAddr=:8401 -httpListenAddr=:8482

vmselect

    vmselect-prod -storageNode=n1:8401,n2:8411,n3:8401,n4:8411 -dedup.minScrapeInterval=1ms -search.maxSeries=10000000 -search.maxUniqueTimeseries=1000000 vmstorageDialTimeout=15s

vminsert

    vminsert-prod -replicationFactor=2 -maxInsertRequestSize=9999999999 -vmstorageDialTimeout=15s -storageNode=n1:8400,n2:8410,n3:8400,n4:8410

ulricqin · Answer

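Here are two things you could try:
1. Check whether Categraf, n9e, or VictoriaMetrics log any errors, and look for clues in those logs.
2. Replace Categraf with Prometheus running in agent mode and see whether the problem persists.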

m3xn · Answer

These are the only log lines Categraf produces:

2023/04/07 17:54:54 prometheus.go:231: E! failed to parse response body, url: http://machine3:8030/metrics error: reading text format failed: text format parsing error in line 404: second TYPE line for metric name "doris_fe_query_latency_ms", or TYPE reported after samples
2023/04/07 17:54:54 prometheus.go:231: E! failed to parse response body, url: http://machine1:8030/metrics error: reading text format failed: text format parsing error in line 406: second TYPE line for metric name "doris_fe_query_latency_ms", or TYPE reported after samples
2023/04/07 17:54:54 prometheus.go:231: E! failed to parse response body, url: http://machine6:8031/metrics error: reading text format failed: text format parsing error in line 401: second TYPE line for metric name "doris_fe_query_latency_ms", or TYPE reported after samples
2023/04/07 17:54:54 prometheus.go:231: E! failed to parse response body, url: http://machine5:8031/metrics error: reading text format failed: text format parsing error in line 378: second TYPE line for metric name "doris_fe_query_latency_ms", or TYPE reported after samples
2023/04/07 17:54:54 prometheus.go:231: E! failed to parse response body, url: http://machine4:8031/metrics error: reading text format failed: text format parsing error in line 507: second TYPE line for metric name "doris_fe_query_latency_ms", or TYPE reported after samples
2023/04/07 17:54:54 prometheus.go:231: E! failed to parse response body, url: http://machine2:8030/metrics error: reading text format failed: text format parsing error in line 496: second TYPE line for metric name "doris_fe_query_latency_ms", or TYPE reported after samples
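
When the log is long, it can help to group the failing URLs by the metric named in the error, to see whether every failure points at the same metric. The following is a minimal sketch that assumes the log lines have exactly the shape shown above; `FailuresByMetric` and the regular expression are hypothetical helpers, not part of Categraf.

```go
package main

import (
	"fmt"
	"regexp"
	"sort"
	"strings"
)

// parseFailure matches the "second TYPE line" errors in the log excerpt above.
// Group 1 is the scraped URL, group 2 is the metric name in the error.
var parseFailure = regexp.MustCompile(`url: (\S+) error: .*second TYPE line for metric name "([^"]+)"`)

// FailuresByMetric (hypothetical helper) returns, for each metric name found
// in such errors, the sorted list of distinct URLs whose scrape failed.
func FailuresByMetric(logText string) map[string][]string {
	seen := map[string]map[string]bool{}
	for _, line := range strings.Split(logText, "\n") {
		m := parseFailure.FindStringSubmatch(line)
		if m == nil {
			continue // not a TYPE parse failure
		}
		url, metric := m[1], m[2]
		if seen[metric] == nil {
			seen[metric] = map[string]bool{}
		}
		seen[metric][url] = true
	}
	out := map[string][]string{}
	for metric, urls := range seen {
		for u := range urls {
			out[metric] = append(out[metric], u)
		}
		sort.Strings(out[metric])
	}
	return out
}

func main() {
	logText := `2023/04/07 17:54:54 prometheus.go:231: E! failed to parse response body, url: http://machine3:8030/metrics error: reading text format failed: text format parsing error in line 404: second TYPE line for metric name "doris_fe_query_latency_ms", or TYPE reported after samples
2023/04/07 17:54:54 prometheus.go:231: E! failed to parse response body, url: http://machine1:8030/metrics error: reading text format failed: text format parsing error in line 406: second TYPE line for metric name "doris_fe_query_latency_ms", or TYPE reported after samples`
	for metric, urls := range FailuresByMetric(logText) {
		fmt.Printf("%s: %d failing URL(s): %v\n", metric, len(urls), urls)
	}
}
```

In the excerpt above, all six failures name the same metric, `doris_fe_query_latency_ms`.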

waney · Answer

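I looked at the code. This error appears to come from the Prometheus SDK. It roughly means that two TYPE lines appeared for the same metric name, something like the case below? @ulricqin, could you advise?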

    # HELP my_metric This is my metric
    # TYPE my_metric counter
    my_metric{label="value"} 42
    # TYPE my_metric gauge

The relevant code:

// prometheus/common@v0.39.0/expfmt/text_parse.go
// readingType represents the state where the last byte read (now in
// p.currentByte) is the first byte of the type hint after 'HELP'.
func (p *TextParser) readingType() stateFn {
	if p.currentMF.Type != nil {
		p.parseError(fmt.Sprintf("second TYPE line for metric name %q, or TYPE reported after samples", p.currentMF.GetName()))
		return nil
	}
	// Rest of line is the type.
	if p.readTokenUntilNewline(false); p.err != nil {
		return nil // Unexpected end of input.
	}
	metricType, ok := dto.MetricType_value[strings.ToUpper(p.currentToken.String())]
	if !ok {
		p.parseError(fmt.Sprintf("unknown metric type %q", p.currentToken.String()))
		return nil
	}
	p.currentMF.Type = dto.MetricType(metricType).Enum()
	return p.startOfLine
}
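
To find the offending lines in a scraped body without pulling in the SDK, the check above can be approximated with the standard library alone. This is only a sketch: `FindTypeConflicts`, `TypeIssue`, and `inTypedFamily` are hypothetical names, and the handling of `_bucket`, `_sum`, and `_count` samples is a simplification that may not match the real parser in every case.

```go
package main

import (
	"bufio"
	"fmt"
	"strings"
)

// TypeIssue (hypothetical) records one TYPE line that a strict text-format
// parser would likely reject.
type TypeIssue struct {
	Line   int    // 1-based line number in the scraped body
	Metric string // metric name on the offending TYPE line
}

// inTypedFamily reports whether a sample such as foo_bucket belongs to an
// already declared histogram or summary family foo.
func inTypedFamily(name string, types map[string]string) bool {
	for _, suffix := range []string{"_bucket", "_sum", "_count"} {
		if !strings.HasSuffix(name, suffix) {
			continue
		}
		t := types[strings.TrimSuffix(name, suffix)]
		if t == "histogram" || (t == "summary" && suffix != "_bucket") {
			return true
		}
	}
	return false
}

// FindTypeConflicts flags every TYPE line whose metric name already has a
// TYPE line or already has samples earlier in the body.
func FindTypeConflicts(body string) []TypeIssue {
	// family name -> declared type; "untyped" if only samples were seen
	types := map[string]string{}
	var issues []TypeIssue
	sc := bufio.NewScanner(strings.NewReader(body))
	for n := 1; sc.Scan(); n++ {
		line := strings.TrimSpace(sc.Text())
		if line == "" {
			continue
		}
		if strings.HasPrefix(line, "#") {
			f := strings.Fields(line)
			if len(f) >= 4 && f[0] == "#" && f[1] == "TYPE" {
				if _, seen := types[f[2]]; seen {
					issues = append(issues, TypeIssue{Line: n, Metric: f[2]})
				} else {
					types[f[2]] = strings.ToLower(f[3])
				}
			}
			continue // HELP lines and other comments are ignored
		}
		// Sample line: the metric name ends at '{' or at the first whitespace.
		name := line
		if i := strings.IndexAny(line, "{ \t"); i >= 0 {
			name = line[:i]
		}
		if _, seen := types[name]; !seen && !inTypedFamily(name, types) {
			types[name] = "untyped"
		}
	}
	return issues
}

func main() {
	body := "# HELP my_metric This is my metric\n" +
		"# TYPE my_metric counter\n" +
		"my_metric{label=\"value\"} 42\n" +
		"# TYPE my_metric gauge\n"
	for _, is := range FindTypeConflicts(body) {
		fmt.Printf("line %d: repeated or late TYPE for %q\n", is.Line, is.Metric)
	}
}
```

Running it against the saved output of one of the failing `/metrics` endpoints should point at the TYPE lines for `doris_fe_query_latency_ms` if the exposition really repeats them.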
