kompira動作停止について
kompiraのジョブフローが動作しなくなりました。
問題事象は11/1.00:00あたりに発生
毎月1日00:00のジョブフロー実行後にQueueManagerがsocket err
その後、kompiradが再起動、同エラーを検出して再起動
を繰り返しているようにみえます。
どのような原因が考えられるでしょうか?
ログは以下です。
調査をよろしくお願いいたします。
---
[2020-11-01 00:10:30,602:4233:kompirad:QueueManager] ERROR: [QueueManager] socket error: [Errno 104] Connection reset by peer
[2020-11-01 00:10:30,846:4233:kompirad:QueueManager] WARNING: [QueueManager] failed to unregister handler: [ctag=[Errno 32] Broken pipe] amq.ctag-FUMO8SVMq_5BlGnfliW3XQ
[2020-11-01 00:10:30,916:4233:kompirad:QueueManager] WARNING: [QueueManager] failed to unregister handler: [ctag=[Errno 32] Broken pipe] amq.ctag-0P8HQmpCCPsG4N7J1dx9aw
[2020-11-01 00:10:30,917:4233:kompirad:QueueManager] WARNING: [QueueManager] failed to unregister handler: [ctag=[Errno 32] Broken pipe] amq.ctag-KfYemH5b4GqxTSgZoEHokA
[2020-11-01 00:10:30,917:4233:kompirad:QueueManager] ERROR: [QueueManager] failed to close connection: [Errno 32] Broken pipe
[2020-11-01 00:10:30,926:4233:kompirad:QueueManager] INFO: [QueueManager] waiting 30 seconds for retry connection ...
[2020-11-01 00:11:00,947:4233:kompirad:QueueManager] INFO: [QueueManager] retry connection
[2020-11-01 00:11:00,947:4233:kompirad:QueueManager] INFO: [QueueManager] _connect start
[2020-11-01 00:11:01,163:4233:kompirad:QueueManager] ERROR: [QueueManager] socket error: [Errno 111] Connection refused
[2020-11-01 00:11:01,198:4233:kompirad:MainThread] INFO: [Engine] caught KeyboardInterrupt
[2020-11-01 00:11:01,941:4233:kompirad:ExpireManager] INFO: ExpireManager: finished: remain=0
[2020-11-01 00:11:03,076:4233:kompirad:CollectThread] INFO: collect_handler: finished: count=0
[2020-11-01 00:11:03,469:4233:kompirad:ResultThread] INFO: result_handler: finished: count=0
[2020-11-01 00:11:03,518:4233:kompirad:MainThread] INFO: [WorkerManager] terminate_all_workers: []
[2020-11-01 00:11:03,533:4233:kompirad:MainThread] INFO: [Engine] finished.
[2020-11-01 00:11:03,533:4233:kompirad:MainThread] INFO: kompirad: going to terminate engine_server
[2020-11-01 00:11:03,969:4233:kompirad:EngineServer] INFO: [EngineServer] thread finished
[2020-11-01 00:11:04,190:4233:kompirad:MainThread] INFO: terminate_children: finish
[2020-11-01 00:11:04,224:4233:kompirad:MainThread] INFO: kompirad: bye
[2020-11-01 00:11:08] INFO: starting pre-start of kompirad [en]
[2020-11-01 00:11:09] INFO: PostgreSQL is running
[2020-11-01 00:11:15,880:800:kompirad:MainThread] INFO: kompira version = 1.5.5.post8
[2020-11-01 00:11:15,880:800:kompirad:MainThread] INFO: sys.version = 2.7.5 (default, Apr 2 2020, 13:16:51)
[GCC 4.8.5 20150623 (Red Hat 4.8.5-39)]
[2020-11-01 00:11:15,880:800:kompirad:MainThread] INFO: sys.stdin = <open file '<stdin>', mode 'r' at 0x7f145c6880c0> (UTF-8)
[2020-11-01 00:11:15,880:800:kompirad:MainThread] INFO: sys.stdout = <open file '<stdout>', mode 'w' at 0x7f145c688150> (UTF-8)
[2020-11-01 00:11:15,880:800:kompirad:MainThread] INFO: sys.stderr = <open file '<stderr>', mode 'w' at 0x7f145c6881e0> (UTF-8)
[2020-11-01 00:11:15,880:800:kompirad:MainThread] INFO: sys.getdefaultencoding = ascii
[2020-11-01 00:11:15,880:800:kompirad:MainThread] INFO: sys.getfilesystemencoding = UTF-8
[2020-11-01 00:11:15,880:800:kompirad:MainThread] INFO: locale.getdefaultlocale = en_US, UTF-8
[2020-11-01 00:11:15,880:800:kompirad:MainThread] INFO: locale.getpreferredencoding = UTF-8
[2020-11-01 00:11:15,881:800:kompirad:EngineServer] INFO: [EngineServer] thread started
[2020-11-01 00:11:15,881:800:kompirad:MainThread] INFO: [Engine] starting...
[2020-11-01 00:11:15,883:800:kompirad:ResultThread] INFO: result_handler: started: result_timeout=1.0
[2020-11-01 00:11:15,883:800:kompirad:CollectThread] INFO: collect_handler: started: collect_interval=0.2
[2020-11-01 00:11:15,883:800:kompirad:MainThread] INFO: [WorkerManager] initialize: max_worker=32, max_alive=64
[2020-11-01 00:11:15,883:800:kompirad:MainThread] INFO: [Engine] initializing in recovery mode
[2020-11-01 00:11:16,012:800:kompirad:MainThread] ERROR: JobflowRuntime(242).restart: error occured when unpickling data: AttributeError("'Process' object has no attribute 'snapshot'",)
[2020-11-01 00:11:16,023:800:kompirad:MainThread] INFO: Process(242).abort: finished /log/get_all_log [ABORTED] elapsed=0:11:15.054904
[2020-11-01 00:11:16,087:800:kompirad:MainThread] ERROR: JobflowRuntime(238).restart: error occured when unpickling data: AttributeError("'Process' object has no attribute 'snapshot'",)
[2020-11-01 00:11:16,089:800:kompirad:MainThread] INFO: Process(238).abort: finished /system/startup/start_recv_event [ABORTED] elapsed=31 days, 6:02:50.766447
[2020-11-01 00:11:16,185:800:kompirad:QueueManager] INFO: [QueueManager] thread started
[2020-11-01 00:11:16,185:800:kompirad:QueueManager] INFO: [QueueManager] _connect start
[2020-11-01 00:11:16,197:800:kompirad:QueueManager] ERROR: [QueueManager] socket error: [Errno 111] Connection refused
[2020-11-01 00:11:16,197:800:kompirad:QueueManager] INFO: [QueueManager] waiting 30 seconds for retry connection ...
[2020-11-01 00:11:16,198:800:kompirad:ExpireManager] INFO: ExpireManager: started
---
-
Active: inactive (dead)
となっております。
/varlog/rabbitmq/rabbit@ha-kompira1.log-20201101
を見たところ
---=INFO REPORT==== 1-Nov-2020::00:10:30 ===
Stopping RabbitMQ=INFO REPORT==== 1-Nov-2020::00:10:30 ===
stopped TCP Listener on [::]:5672=ERROR REPORT==== 1-Nov-2020::00:10:30 ===
AMQP connection <0.294.0> (running), channel 0 - error:
{amqp_error,connection_forced,
"broker forced connection closure with reason 'shutdown'",none}=ERROR REPORT==== 1-Nov-2020::00:10:30 ===
AMQP connection <0.273.0> (running), channel 0 - error:
{amqp_error,connection_forced,
"broker forced connection closure with reason 'shutdown'",none}=INFO REPORT==== 1-Nov-2020::00:10:31 ===
Halting Erlang VM
---
となっていました。 -
crm_mon -Af1の結果は以下となります。
Stack: corosync
Current DC: ha-kompira1 (version 1.1.21-4.el7-f14e36fd43) - partition with quorum
Last updated: Mon Nov 9 13:33:22 2020
Last change: Wed Sep 30 18:10:41 2020 by root via crm_attribute on ha-kompira12 nodes configured
9 resources configuredOnline: [ ha-kompira1 ha-kompira2 ]
Active resources:
Resource Group: webserver
res_vip (ocf::heartbeat:IPaddr2): Started ha-kompira1
res_httpd (ocf::heartbeat:apache): Started ha-kompira1
res_kompirad (ocf::kompira:kompirad): Started ha-kompira1
res_kompira_jobmngrd (ocf::kompira:kompira_jobmngrd): Started ha-kompira1
res_lsyncd (systemd:lsyncd): FAILED ha-kompira1 (blocked)
Master/Slave Set: ms_pgsql [res_pgsql]
Masters: [ ha-kompira1 ]
Slaves: [ ha-kompira2 ]
Master/Slave Set: ms_rabbitmq [res_rabbitmq]
Masters: [ ha-kompira2 ]Node Attributes:
* Node ha-kompira1:
+ master-res_pgsql : 1000
+ res_pgsql-data-status : LATEST
+ res_pgsql-master-baseline : 0000000073DD51F8
+ res_pgsql-status : PRI
* Node ha-kompira2:
+ master-res_pgsql : 100
+ master-res_rabbitmq : 10
+ res_pgsql-data-status : STREAMING|ASYNC
+ res_pgsql-status : HS:asyncMigration Summary:
* Node ha-kompira1:
res_httpd: migration-threshold=1 fail-count=3 last-failure='Sun Nov 1 00:10:23 2020'
res_rabbitmq: migration-threshold=1 fail-count=1 last-failure='Sun Nov 1 00:10:23 2020'
res_kompirad: migration-threshold=1 fail-count=565 last-failure='Mon Nov 9 13:30:33 2020'
res_vip: migration-threshold=1 fail-count=1 last-failure='Sun Nov 1 00:10:23 2020'
res_kompira_jobmngrd: migration-threshold=1 fail-count=144 last-failure='Mon Nov 9 11:17:30 2020'
res_lsyncd: migration-threshold=1 fail-count=1000000 last-failure='Thu Oct 1 00:09:54 2020'
* Node ha-kompira2:Failed Resource Actions:
* res_httpd_monitor_5000 on ha-kompira1 'unknown error' (1): call=45, status=Timed Out, exitreason='',
last-rc-change='Sun Nov 1 00:09:22 2020', queued=0ms, exec=0ms
* res_rabbitmq_monitor_7000 on ha-kompira1 'unknown error' (1): call=75, status=Timed Out, exitreason='',
last-rc-change='Sun Nov 1 00:09:51 2020', queued=0ms, exec=0ms
* res_kompirad_monitor_5000 on ha-kompira1 'not running' (7): call=47, status=complete, exitreason='',
last-rc-change='Mon Nov 9 13:30:33 2020', queued=0ms, exec=0ms
* res_vip_monitor_10000 on ha-kompira1 'unknown error' (1): call=43, status=Timed Out, exitreason='',
last-rc-change='Sun Nov 1 00:09:50 2020', queued=0ms, exec=0ms
* res_kompira_jobmngrd_monitor_5000 on ha-kompira1 'not running' (7): call=49, status=complete, exitreason='',
last-rc-change='Mon Nov 9 11:17:30 2020', queued=0ms, exec=0ms
* res_lsyncd_stop_0 on ha-kompira1 'unknown error' (1): call=65, status=complete, exitreason='',
last-rc-change='Thu Oct 1 00:09:32 2020', queued=0ms, exec=22368ms -
ありがとうございます。
ログがそれだけであれば、lsyncdは問題無さそうです。
crm_monの結果によると、
res_httpd_monitor_5000 on ha-kompira1 'unknown error' (1): call=45, status=Timed Out, exitreason='',
last-rc-change='Sun Nov 1 00:09:22 2020', queued=0ms, exec=0msとありますので、きっかけとしてはhttpd(Apache)が何らかの障害によって停止(監視タイムアウト)し、フェールオーバーしようとしたところ失敗して、現在のような状態になっていると考えられます。
手動で切り替えるために、1号機側(ha-kompira1)で以下のコマンドをroot権限で実行し、pacemakerを終了させ、切り替わるか試して頂けますでしょうか?
# systemctl stop pacemaker
-
調査のために、障害発生付近の以下のログ(1号機、2号機とも)を送付いただけますでしょうか?
- /var/log/cluster/corosync.log
- /var/log/httpd/*
- /var/log/kompira/*
- /var/log/lsyncd.log
- /var/log/pacemaker.log
- /var/log/pcsd/pcsd.log
- /var/log/postgresql.log
- /var/log/rabbitmq/*
- /var/log/rsyncd.log
また、合わせてご利用のKompiraとOSのバージョンもご連絡いただければと思います。
なお、障害発生の状況によっては、完全に原因が解明できないこともございますが、その際はご容赦ください。
以上、お手数をおかけしますが、どうぞよろしくお願いいたします。
-
ありがとうございます。
まで、送付いただけますでしょうか?
どうぞよろしくお願いいたします。
サインインしてコメントを残してください。
コメント
22件のコメント