问题场景
重新部署Elasticsearch容器后,容器IP地址发生变化,
运行curl -v staging-search_elasticsearch_1:9200
不能正确连接到容器新的IP地址,导致ES查询失败。
问题定位
首先在客户端运行dig staging-search_elasticsearch_1
,
返回如下结果:
; <<>> DiG 9.10.3-P4-Ubuntu <<>> staging-search_elasticsearch_1
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NXDOMAIN, id: 57550
;; flags: qr rd ra; QUERY: 1, ANSWER: 0, AUTHORITY: 1, ADDITIONAL: 1
;; OPT PSEUDOSECTION:
; EDNS: version: 0, flags:; udp: 512
;; QUESTION SECTION:
;staging-search_elasticsearch_1. IN A
;; AUTHORITY SECTION:
. 86397 IN SOA a.root-servers.net. nstld.verisign-grs.com. 2016112401 1800 900 604800 86400
;; Query time: 59 msec
;; SERVER: 172.31.99.1#53(172.31.99.1)
;; WHEN: Fri Nov 25 09:08:31 CST 2016
;; MSG SIZE rcvd: 134
可以看到DNS服务器是172.31.99.1
,是VPN + DNS服务器的IP地址,
说明客户端查询的DNS服务是正确的,但返回结果里没有;; ANSWER SECTION:
段,
说明DNS服务没有正确返回DNS结果。
也可以使用nslookup staging-search_elasticsearch_1
.
原因分析和解决方案
缓存问题?
由于DNS使用dnsmasq作为服务,猜测可能是由于dnsmasq使用了缓存, 在容器IP地址变化后没有更新导致的,所以在DNS容器上重启了dnsmasq并将其缓存设置为0:
pkill dnsmasq
dnsmasq -r /etc/resolv.dnsmasq -c 0
相应修改容器启动脚本entrypoint.sh中的启动dnsmasq命令,
在后面加上-c 0
。
在客户端上,appbuild.py所在目录下启动IPython,运行如下指令:
from appbuilder import *
am = AppManager(pm, 'staging', confs)
am.clean(confs['services']['all'])
am.build_service(confs['services']['all'])
am.clean(['search'])
am.build_service(['search'])
重新部署的容器IP地址始终没有变化,但第二天再次出现找不到IP地址的情况, 说明用调整缓存的方法解决不了问题。
nameserver顺序错误
使用dnsmasq -8 /var/log/dnsmasq.log -q -r /etc/resolv.dnsmasq
启动dnsmasq,
其中-8
指定log文件位置,-q
指令要求dnsmasq记录dns query.
然后用tail -f /var/log/dnsmasq.log
查看dnsmasq收到的dns查询,
发现DNS周期性失效的原因是:dns cache超出TTL后,需要重新查询,
在/etc/resolv.dnsmasq内容如下:
nameserver 127.0.0.11
nameserver 8.8.8.8
nameserver 8.8.4.4
当客户端查询staging-search_elasticsearch_1
时,
dnsmasq没有查询第一位的上游nameserver 127.0.0.11
,
而是8.8.8.8
,返回结果为'NXDOMAIN‘,存入cache,在后续收到查询有继续返回'NXDOMAIN‘,
解决方案是只使用一个nameserver 127.0.0.11
.
由于系统默认的/etc/resolv.conf符合这个要求,用下面的方法启动dnsmasq:
dnsmasq -8 /var/log/dnsmasq.log -q
.
Debug
Alpine系统中默认没有curl
,可以用wget -qO- staging-search_elasticsearch_1:9200
代替,
其中-q
表示不输出进度条等信息,-O-
是-O -
的简写,表示输出到stdout.
跟踪DNS跳转:
-
Alpine上:
traceroute staging-search_elasticsearch_1
-
Ubuntu上:
tracepath staging-search_elasticsearch_1
dnsmasq 日志分析
以下日志是在一台VPN内网服务器上执行curl -v staging-search_elasticsearch_1:9200
后,
从dnsmasq所在Alphi容器的/var/log/messages中截取的日志,显示一次完整的容器名称查询过程:
2016-11 ... : query[A] staging-search_elasticsearch_1.localdomain from 172.31.99.100
2016-11 ... : cached staging-search_elasticsearch_1.localdomain is NXDOMAIN
2016-11 ... : query[AAAA] staging-search_elasticsearch_1.localdomain from 172.31.99.100
2016-11 ... : forwarded staging-search_elasticsearch_1.localdomain to 127.0.0.11
2016-11 ... : forwarded staging-search_elasticsearch_1.localdomain to 8.8.8.8
2016-11 ... : forwarded staging-search_elasticsearch_1.localdomain to 8.8.4.4
2016-11 ... : reply staging-search_elasticsearch_1.localdomain is NXDOMAIN
2016-11 ... : query[A] staging-search_elasticsearch_1.pek2.qingcloud.com from 172.31.99.100
2016-11 ... : cached staging-search_elasticsearch_1.pek2.qingcloud.com is NODATA-IPv4
2016-11 ... : query[AAAA] staging-search_elasticsearch_1.pek2.qingcloud.com from 172.31.99.100
2016-11 ... : forwarded staging-search_elasticsearch_1.pek2.qingcloud.com to 127.0.0.11
2016-11 ... : reply staging-search_elasticsearch_1.pek2.qingcloud.com is NODATA-IPv6
2016-11 ... : query[A] staging-search_elasticsearch_1 from 172.31.99.100
2016-11 ... : cached staging-search_elasticsearch_1 is 172.18.2.5
2016-11 ... : query[AAAA] staging-search_elasticsearch_1 from 172.31.99.100
2016-11 ... : forwarded staging-search_elasticsearch_1 to 127.0.0.11
客户端的/etc/resolv.conf内容:
nameserver 172.31.99.1
nameserver 127.0.1.1
search localdomain pek2.qingcloud.com