DarkMatter in Cyberspace
  • Home
  • Categories
  • Tags
  • Archives

解决重新部署容器后DNS失效问题


问题场景

重新部署Elasticsearch容器后,容器IP地址发生变化, 运行curl -v staging-search_elasticsearch_1:9200 不能正确连接到容器新的IP地址,导致ES查询失败。

问题定位

首先在客户端运行dig staging-search_elasticsearch_1, 返回如下结果:

; <<>> DiG 9.10.3-P4-Ubuntu <<>> staging-search_elasticsearch_1
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NXDOMAIN, id: 57550
;; flags: qr rd ra; QUERY: 1, ANSWER: 0, AUTHORITY: 1, ADDITIONAL: 1

;; OPT PSEUDOSECTION:
; EDNS: version: 0, flags:; udp: 512
;; QUESTION SECTION:
;staging-search_elasticsearch_1.        IN      A

;; AUTHORITY SECTION:
.                       86397   IN      SOA     a.root-servers.net. nstld.verisign-grs.com. 2016112401 1800 900 604800 86400

;; Query time: 59 msec
;; SERVER: 172.31.99.1#53(172.31.99.1)
;; WHEN: Fri Nov 25 09:08:31 CST 2016
;; MSG SIZE  rcvd: 134

可以看到DNS服务器是172.31.99.1,是VPN + DNS服务器的IP地址, 说明客户端查询的DNS服务是正确的,但返回结果里没有;; ANSWER SECTION:段, 说明DNS服务没有正确返回DNS结果。

也可以使用nslookup staging-search_elasticsearch_1.

原因分析和解决方案

缓存问题?

由于DNS使用dnsmasq作为服务,猜测可能是由于dnsmasq使用了缓存, 在容器IP地址变化后没有更新导致的,所以在DNS容器上重启了dnsmasq并将其缓存设置为0:

pkill dnsmasq
dnsmasq -r /etc/resolv.dnsmasq -c 0

相应修改容器启动脚本entrypoint.sh中的启动dnsmasq命令, 在后面加上-c 0。

在客户端上,appbuild.py所在目录下启动IPython,运行如下指令:

from appbuilder import *
am = AppManager(pm, 'staging', confs)
am.clean(confs['services']['all'])
am.build_service(confs['services']['all'])
am.clean(['search'])
am.build_service(['search'])

重新部署的容器IP地址始终没有变化,但第二天再次出现找不到IP地址的情况, 说明用调整缓存的方法解决不了问题。

nameserver顺序错误

使用dnsmasq -8 /var/log/dnsmasq.log -q -r /etc/resolv.dnsmasq启动dnsmasq, 其中-8指定log文件位置,-q指令要求dnsmasq记录dns query. 然后用tail -f /var/log/dnsmasq.log查看dnsmasq收到的dns查询, 发现DNS周期性失效的原因是:dns cache超出TTL后,需要重新查询, 在/etc/resolv.dnsmasq内容如下:

nameserver 127.0.0.11
nameserver 8.8.8.8
nameserver 8.8.4.4

当客户端查询staging-search_elasticsearch_1时, dnsmasq没有查询第一位的上游nameserver 127.0.0.11, 而是8.8.8.8,返回结果为'NXDOMAIN‘,存入cache,在后续收到查询有继续返回'NXDOMAIN‘,

解决方案是只使用一个nameserver 127.0.0.11. 由于系统默认的/etc/resolv.conf符合这个要求,用下面的方法启动dnsmasq: dnsmasq -8 /var/log/dnsmasq.log -q.

Debug

Alpine系统中默认没有curl,可以用wget -qO- staging-search_elasticsearch_1:9200代替, 其中-q表示不输出进度条等信息,-O-是-O -的简写,表示输出到stdout.

跟踪DNS跳转:

  • Alpine上:traceroute staging-search_elasticsearch_1

  • Ubuntu上:tracepath staging-search_elasticsearch_1

dnsmasq 日志分析

以下日志是在一台VPN内网服务器上执行curl -v staging-search_elasticsearch_1:9200后, 从dnsmasq所在Alphi容器的/var/log/messages中截取的日志,显示一次完整的容器名称查询过程:

2016-11 ... : query[A] staging-search_elasticsearch_1.localdomain from 172.31.99.100
2016-11 ... : cached staging-search_elasticsearch_1.localdomain is NXDOMAIN
2016-11 ... : query[AAAA] staging-search_elasticsearch_1.localdomain from 172.31.99.100
2016-11 ... : forwarded staging-search_elasticsearch_1.localdomain to 127.0.0.11
2016-11 ... : forwarded staging-search_elasticsearch_1.localdomain to 8.8.8.8
2016-11 ... : forwarded staging-search_elasticsearch_1.localdomain to 8.8.4.4
2016-11 ... : reply staging-search_elasticsearch_1.localdomain is NXDOMAIN
2016-11 ... : query[A] staging-search_elasticsearch_1.pek2.qingcloud.com from 172.31.99.100
2016-11 ... : cached staging-search_elasticsearch_1.pek2.qingcloud.com is NODATA-IPv4
2016-11 ... : query[AAAA] staging-search_elasticsearch_1.pek2.qingcloud.com from 172.31.99.100
2016-11 ... : forwarded staging-search_elasticsearch_1.pek2.qingcloud.com to 127.0.0.11
2016-11 ... : reply staging-search_elasticsearch_1.pek2.qingcloud.com is NODATA-IPv6
2016-11 ... : query[A] staging-search_elasticsearch_1 from 172.31.99.100
2016-11 ... : cached staging-search_elasticsearch_1 is 172.18.2.5
2016-11 ... : query[AAAA] staging-search_elasticsearch_1 from 172.31.99.100
2016-11 ... : forwarded staging-search_elasticsearch_1 to 127.0.0.11

客户端的/etc/resolv.conf内容:

nameserver 172.31.99.1
nameserver 127.0.1.1
search localdomain pek2.qingcloud.com


Published

Nov 25, 2016

Last Updated

Nov 25, 2016

Category

Tech

Tags

  • dns 4
  • dnsmasq 1
  • docker 10

Contact

  • Powered by Pelican. Theme: Elegant by Talha Mansoor