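Interviewers often ask which incidents I have handled. Because I never prepare for the question, I tend to stumble, explain things badly, or draw a complete blank. The next few posts summarize incidents I have dealt with. This first one covers Calico failing to start because /etc/nsswitch.conf was missing from the alpine image.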
Error log
    2018-08-26 15:06:04.447 [ERROR][68] health.go 196: Health endpoint failed, trying to restart it... error=listen tcp 115.18.44.218:9099: bind: cannot assign requested address
    bird: Mesh_10_112_35_117: State changed to start
    2018-08-26 15:06:05.448 [ERROR][68] health.go 196: Health endpoint failed, trying to restart it... error=listen tcp 115.18.44.218:9099: bind: cannot assign requested address
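Troubleshooting
My first thought was to report a bug, so I opened an issue on GitHub. Before the author replied, I went through the documentation and set the FELIX_HEALTHHOST variable to spec.nodeName (in my environment spec.nodeName is the host IP). That made the problem go away, at least on the surface.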
    225a226,230
    > # FELIX_HEALTHHOST
    > - name: FELIX_HEALTHHOST
    >   valueFrom:
    >     fieldRef:
    >       fieldPath: spec.nodeName
    283c288
    < host: localhost
    ---
    > #host: 127.0.0.1
    288,292c293,301
    < exec:
    <   command:
    <   - /bin/calico-node
    <   - -bird-ready
    <   - -felix-ready
    ---
    > httpGet:
    >   path: /readiness
    >   port: 9099
    >   # host: 127.0.0.1
    > #exec:
    > #  command:
    > #  - /bin/calico-node
    > #  - -bird-ready
    > #  - -felix-ready
The author's reply
calico-node -felix-ready should just be doing an http get under the covers. It's mainly so we can wrap multiple liveness checks into a single command.
I haven't seen a need to set HEALTHHOST explicitly before, it's a bit odd. It should default to localhost per this: https://github.com/projectcalico/felix/blob/2a2fedd7e2831db07d4f36d2ddc928df783e19bb/config/config_params.go
What does localhost resolve to on this machine?
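To make "an http get under the covers" concrete, here is a rough Go sketch of such a probe. It is not the calico-node implementation. The /readiness path and port 9099 are taken from the manifest diff above, while readinessURL, probe and the 2-second timeout are hypothetical choices made for illustration.

```go
package main

import (
	"fmt"
	"net"
	"net/http"
	"strconv"
	"time"
)

// readinessURL builds the probe URL for a host and port.
// (Hypothetical helper; the path comes from the manifest diff.)
func readinessURL(host string, port int) string {
	return "http://" + net.JoinHostPort(host, strconv.Itoa(port)) + "/readiness"
}

// probe performs a single HTTP GET and reports whether the status is 2xx.
func probe(url string) (bool, error) {
	client := &http.Client{Timeout: 2 * time.Second} // assumed timeout
	resp, err := client.Get(url)
	if err != nil {
		return false, err
	}
	defer resp.Body.Close()
	return resp.StatusCode >= 200 && resp.StatusCode < 300, nil
}

func main() {
	url := readinessURL("localhost", 9099)
	ok, err := probe(url)
	fmt.Println(url, ok, err)
}
```

A probe like this depends on how the host name resolves, both for the client and for the address the health endpoint binds to, which is why the author's last question matters.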
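Following the author's reply, I turned to investigating how localhost was being resolved. A web search turned up the root cause: a newer alpine release had removed /etc/nsswitch.conf, so Go programs went straight to DNS to resolve names, the entries in /etc/hosts never took effect, and localhost was no longer correctly resolved to 127.0.0.1.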
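The bind error in the log fits this explanation: if localhost resolves to an address that is not assigned to any local interface, listening on it fails. The Go sketch below reproduces that class of failure under stated assumptions. tryListen is a hypothetical helper, and 192.0.2.1 (a documentation-only address from TEST-NET-1) stands in for the unexpected address, assuming it is not configured on the machine running the code.

```go
package main

import (
	"fmt"
	"net"
)

// tryListen attempts to bind a TCP listener on host with an ephemeral port
// and returns the bind error, if any. (Hypothetical helper.)
func tryListen(host string) error {
	ln, err := net.Listen("tcp", net.JoinHostPort(host, "0"))
	if err != nil {
		return err
	}
	return ln.Close()
}

func main() {
	// 127.0.0.1 is local; 192.0.2.1 is assumed not to be assigned here.
	for _, host := range []string{"127.0.0.1", "192.0.2.1"} {
		if err := tryListen(host); err != nil {
			fmt.Printf("%s: %v\n", host, err)
			continue
		}
		fmt.Printf("%s: ok\n", host)
	}
}
```

On a typical Linux host the second attempt should fail with a bind error similar to the one in the log, while the loopback attempt succeeds.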
A simple program to verify how localhost resolves:
    package main

    import (
    	"fmt"
    	"net"
    	"os"
    )

    func main() {
    	// Resolve localhost the same way any Go program on this host would.
    	ns, err := net.LookupHost("localhost")
    	if err != nil {
    		fmt.Fprintf(os.Stderr, "Err: %s\n", err.Error())
    		return
    	}

    	// Print every address returned for localhost.
    	for _, n := range ns {
    		fmt.Fprintf(os.Stdout, "--%s\n", n)
    	}
    }
Running it:
    # without nsswitch.conf (#hosts: files dns myhostname)
    [root@repo tmp]# go run dns.go
    --115.9.3.123
    # with nsswitch.conf
    [root@repo tmp]# vim /etc/nsswitch.conf
    [root@repo tmp]#
    [root@repo tmp]# go run dns.go
    --127.0.0.1
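The manual comparison above could also be turned into an automated sanity check. The following is a minimal Go sketch, not something taken from Calico: allLoopback is a hypothetical helper that reports whether every address returned for localhost is a loopback address.

```go
package main

import (
	"fmt"
	"net"
)

// allLoopback reports whether addrs is non-empty and every entry parses as
// a loopback IP (127.0.0.0/8 or ::1). (Hypothetical helper.)
func allLoopback(addrs []string) bool {
	if len(addrs) == 0 {
		return false
	}
	for _, a := range addrs {
		ip := net.ParseIP(a)
		if ip == nil || !ip.IsLoopback() {
			return false
		}
	}
	return true
}

func main() {
	addrs, err := net.LookupHost("localhost")
	if err != nil {
		fmt.Println("lookup failed:", err)
		return
	}
	if !allLoopback(addrs) {
		fmt.Println("localhost resolved to non-loopback addresses:", addrs)
		return
	}
	fmt.Println("localhost resolves to loopback:", addrs)
}
```

With the addresses from the two runs above, allLoopback would reject 115.9.3.123 and accept 127.0.0.1.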
Fix
Add /etc/nsswitch.conf to the image, with the following content:
    hosts: files dns myhostname
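To guard against the file going missing again, an image could be checked before it ships. Below is a small Go sketch under simplifying assumptions: hostsSources and filesBeforeDNS are hypothetical helpers, and the parser handles only the basic "hosts:" line format (comments and bracketed action items are skipped), not the full nsswitch.conf syntax.

```go
package main

import (
	"bufio"
	"fmt"
	"os"
	"strings"
)

// hostsSources returns the sources on the "hosts:" line of an
// nsswitch.conf-style text, in order. Simplified: text after '#' is dropped
// and bracketed items such as [NOTFOUND=return] are skipped.
func hostsSources(conf string) []string {
	sc := bufio.NewScanner(strings.NewReader(conf))
	for sc.Scan() {
		line := sc.Text()
		if i := strings.IndexByte(line, '#'); i >= 0 {
			line = line[:i]
		}
		fields := strings.Fields(line)
		if len(fields) == 0 || fields[0] != "hosts:" {
			continue
		}
		var srcs []string
		for _, f := range fields[1:] {
			if !strings.HasPrefix(f, "[") {
				srcs = append(srcs, f)
			}
		}
		return srcs
	}
	return nil
}

// filesBeforeDNS reports whether "files" is listed and comes before "dns".
func filesBeforeDNS(srcs []string) bool {
	for _, s := range srcs {
		switch s {
		case "files":
			return true
		case "dns":
			return false
		}
	}
	return false
}

func main() {
	data, err := os.ReadFile("/etc/nsswitch.conf")
	if err != nil {
		fmt.Println("cannot read /etc/nsswitch.conf:", err)
		return
	}
	srcs := hostsSources(string(data))
	fmt.Println("hosts sources:", srcs, "files before dns:", filesBeforeDNS(srcs))
}
```

Applied to the one-line file above, the check passes; with the hosts line commented out, as in the first test run, no sources are found and the check fails.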
The latest version of Calico has already fixed this problem.