It's the end of the world…

…for a few minutes!

Wikipedia, Wiktionary and, more generally, the vast majority of the sites run by the Wikimedia Foundation are "down", unreachable.

One power problem and disaster strikes: it seems that this power outage, which began at 2:12:21 CEST, is affecting pmtpa (a cluster located in Tampa, Florida, which houses a whole heap of servers¹) and is therefore blocking access to all the sites mentioned. Tough luck. More precisely, it appears that the cooling system is what suffers first. Servers, too, can suffer from the heat, after all. Could a move of all the servers to Greenland be on the cards? I have my doubts, all the same 😀

#wikimedia-tech, the Freenode channel that collects all the technical information about the state of the Wikimedia cluster, is under siege. Almost as much by the folks who don't use their connection to read its topic (which currently says: "Status: Down due to A/C issues at pmtpa | Wikimedia servers administration | 100k | MediaWiki: #mediawiki | Toolserver: #wikimedia-toolserver | Pastebin: http://p.defau.lt/ | Server admin log: http://tr.im/JEF6") as by nagios-wm, a bot that periodically tries to check the state of said cluster and report it to the gallery (flood warning!):

[02:12:21] <nagios-wm> PROBLEM – Host srv167 is DOWN: PING CRITICAL – Packet loss = 100%

[02:12:21] <nagios-wm> PROBLEM – Host srv163 is DOWN: PING CRITICAL – Packet loss = 100%

[02:12:21] <nagios-wm> PROBLEM – Host srv155 is DOWN: PING CRITICAL – Packet loss = 100%

[02:12:21] <nagios-wm> PROBLEM – Host srv154 is DOWN: PING CRITICAL – Packet loss = 100%

[02:12:21] <nagios-wm> PROBLEM – Host srv166 is DOWN: PING CRITICAL – Packet loss = 100%

[02:12:21] <nagios-wm> PROBLEM – Host srv164 is DOWN: PING CRITICAL – Packet loss = 100%

[02:12:31] <nagios-wm> PROBLEM – Host srv175 is DOWN: PING CRITICAL – Packet loss = 100%

[02:12:31] <nagios-wm> PROBLEM – Host srv176 is DOWN: PING CRITICAL – Packet loss = 100%

[02:12:31] <nagios-wm> PROBLEM – Host srv179 is DOWN: PING CRITICAL – Packet loss = 100%

[02:12:31] <nagios-wm> PROBLEM – Host srv181 is DOWN: PING CRITICAL – Packet loss = 100%

[02:12:31] <nagios-wm> PROBLEM – Host srv178 is DOWN: PING CRITICAL – Packet loss = 100%

[02:12:31] <nagios-wm> PROBLEM – Host srv182 is DOWN: PING CRITICAL – Packet loss = 100%

[02:12:41] <nagios-wm> PROBLEM – Host srv183 is DOWN: PING CRITICAL – Packet loss = 100%

[02:12:41] <nagios-wm> PROBLEM – Host srv186 is DOWN: PING CRITICAL – Packet loss = 100%

[02:12:41] <nagios-wm> PROBLEM – Host srv184 is DOWN: PING CRITICAL – Packet loss = 100%

[02:12:41] <nagios-wm> PROBLEM – Host srv185 is DOWN: PING CRITICAL – Packet loss = 100%

[02:13:11] <nagios-wm> PROBLEM – Host storage3 is DOWN: CRITICAL – Host Unreachable (208.80.152.169)

[02:13:11] <nagios-wm> PROBLEM – Host sanger is DOWN: CRITICAL – Host Unreachable (208.80.152.187)

[02:13:11] <nagios-wm> PROBLEM – Host mchenry is DOWN: CRITICAL – Host Unreachable (208.80.152.186)

[02:13:11] <nagios-wm> PROBLEM – Host sq76 is DOWN: PING CRITICAL – Packet loss = 100%

[02:13:11] <nagios-wm> PROBLEM – Host sq77 is DOWN: PING CRITICAL – Packet loss = 100%

[02:13:21] <nagios-wm> PROBLEM – Host srv152 is DOWN: PING CRITICAL – Packet loss = 100%

[02:13:21] <nagios-wm> PROBLEM – Host srv151 is DOWN: PING CRITICAL – Packet loss = 100%

[02:13:21] <nagios-wm> PROBLEM – Host db5 is DOWN: PING CRITICAL – Packet loss = 100%

[02:13:21] <nagios-wm> PROBLEM – Host db7 is DOWN: PING CRITICAL – Packet loss = 100%

[02:13:21] <nagios-wm> PROBLEM – Host db8 is DOWN: PING CRITICAL – Packet loss = 100%

[02:13:31] <nagios-wm> PROBLEM – Host tridge is DOWN: CRITICAL – Host Unreachable (208.80.152.170)

[02:13:31] <nagios-wm> PROBLEM – Host hume is DOWN: CRITICAL – Host Unreachable (208.80.152.190)

[02:13:31] <nagios-wm> PROBLEM – Host lvs4 is DOWN: CRITICAL – Host Unreachable (208.80.152.123)

[02:13:31] <nagios-wm> PROBLEM – Host db9 is DOWN: PING CRITICAL – Packet loss = 100%

[02:13:31] <nagios-wm> PROBLEM – Host rr.pmtpa is DOWN: PING CRITICAL – Packet loss = 100%

[02:13:31] <nagios-wm> PROBLEM – Host sq75 is DOWN: PING CRITICAL – Packet loss = 100%

[02:13:32] <nagios-wm> PROBLEM – Host sq72 is DOWN: PING CRITICAL – Packet loss = 100%

[02:13:32] <nagios-wm> PROBLEM – Host sq73 is DOWN: PING CRITICAL – Packet loss = 100%

[02:13:41] <nagios-wm> PROBLEM – Host srv168 is DOWN: PING CRITICAL – Packet loss = 100%

[02:13:41] <nagios-wm> PROBLEM – Host srv153 is DOWN: PING CRITICAL – Packet loss = 100%

[02:13:41] <nagios-wm> PROBLEM – Host srv165 is DOWN: PING CRITICAL – Packet loss = 100%

[02:13:41] <nagios-wm> PROBLEM – Host srv156 is DOWN: PING CRITICAL – Packet loss = 100%

[02:13:41] <nagios-wm> PROBLEM – Host srv177 is DOWN: PING CRITICAL – Packet loss = 100%

[02:13:41] <nagios-wm> PROBLEM – Host srv180 is DOWN: PING CRITICAL – Packet loss = 100%

[02:13:51] <nagios-wm> PROBLEM – Host locke is DOWN: CRITICAL – Host Unreachable (208.80.152.138)

[02:13:51] <nagios-wm> PROBLEM – Host sq74 is DOWN: PING CRITICAL – Packet loss = 100%

[02:13:51] <nagios-wm> PROBLEM – Host sq71 is DOWN: PING CRITICAL – Packet loss = 100%

[02:14:01] <nagios-wm> PROBLEM – Host lvs2 is DOWN: PING CRITICAL – Packet loss = 100%

[02:14:11] <nagios-wm> PROBLEM – check_all_memcacheds on spence is CRITICAL: MEMCACHED CRITICAL – Can not connect to 10.0.2.183:11000 (Connection timed out)

[02:14:11] <nagios-wm> PROBLEM – Host upload.pmtpa is DOWN: PING CRITICAL – Packet loss = 100%

[02:14:51] <nagios-wm> RECOVERY – SSH on lily is OK: SSH OK – OpenSSH_4.7p1 Debian-8ubuntu1.2 (protocol 2.0)

[02:15:31] <nagios-wm> RECOVERY – Disk free on lily is OK: DISK OK

[02:17:51] <nagios-wm> PROBLEM – SSH on lily is CRITICAL: CRITICAL – Socket timeout after 10 seconds

[02:18:31] <nagios-wm> PROBLEM – Disk free on lily is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.

[02:19:01] <nagios-wm> PROBLEM – Misc_Db_Slave on db10 is CRITICAL: CRITICAL: Slave running: expected Yes, got No

[02:20:21] <nagios-wm> PROBLEM – Mobile_Web on mobile2 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Internal Server Error

[02:20:31] <nagios-wm> RECOVERY – Disk free on lily is OK: DISK OK

[02:20:51] <nagios-wm> RECOVERY – SSH on lily is OK: SSH OK – OpenSSH_4.7p1 Debian-8ubuntu1.2 (protocol 2.0)

[02:27:58] * nagios-wm has quit (~nagios-wm@spence.wikimedia.org – Ping timeout: 265 seconds)

[02:31:03] * nagios-wm (~nagios-wm@spence.wikimedia.org) has joined #wikimedia-tech

[02:31:11] <nagios-wm> RECOVERY – Host srv182 is UP: PING OK – Packet loss = 0%, RTA = 0.20 ms

[02:31:11] <nagios-wm> RECOVERY – Host srv155 is UP: PING OK – Packet loss = 0%, RTA = 0.21 ms

[02:31:11] <nagios-wm> RECOVERY – Host db14 is UP: PING OK – Packet loss = 0%, RTA = 0.28 ms

[02:31:11] <nagios-wm> RECOVERY – Host bayes is UP: PING OK – Packet loss = 0%, RTA = 0.17 ms

[02:31:11] <nagios-wm> RECOVERY – Host amane is UP: PING OK – Packet loss = 0%, RTA = 0.17 ms

[02:31:21] <nagios-wm> RECOVERY – Host browne is UP: PING OK – Packet loss = 0%, RTA = 0.34 ms

[02:31:21] <nagios-wm> RECOVERY – Host srv181 is UP: PING OK – Packet loss = 0%, RTA = 0.17 ms

[02:31:31] <nagios-wm> RECOVERY – Host db15 is UP: PING OK – Packet loss = 0%, RTA = 0.14 ms

[02:31:31] <nagios-wm> RECOVERY – Host search13 is UP: PING OK – Packet loss = 0%, RTA = 1.96 ms

[02:31:31] <nagios-wm> RECOVERY – Host search19 is UP: PING OK – Packet loss = 0%, RTA = 0.27 ms

[02:31:31] <nagios-wm> RECOVERY – Host search16 is UP: PING OK – Packet loss = 0%, RTA = 1.71 ms

[02:31:40] <nagios-wm> RECOVERY – Host srv152 is UP: PING OK – Packet loss = 0%, RTA = 0.61 ms

[02:31:41] <nagios-wm> RECOVERY – Host srv164 is UP: PING OK – Packet loss = 0%, RTA = 2.55 ms

[02:31:41] <nagios-wm> RECOVERY – Host srv167 is UP: PING OK – Packet loss = 0%, RTA = 1.45 ms

[02:31:41] <nagios-wm> RECOVERY – Host srv183 is UP: PING OK – Packet loss = 0%, RTA = 0.19 ms

[02:31:41] <nagios-wm> RECOVERY – Host srv179 is UP: PING OK – Packet loss = 0%, RTA = 0.18 ms

[02:31:41] <nagios-wm> RECOVERY – Host srv185 is UP: PING OK – Packet loss = 0%, RTA = 1.50 ms

[02:31:42] <nagios-wm> RECOVERY – Host srv176 is UP: PING OK – Packet loss = 0%, RTA = 0.19 ms

[02:31:42] <nagios-wm> RECOVERY – Host srv175 is UP: PING OK – Packet loss = 0%, RTA = 0.20 ms

[02:31:43] <nagios-wm> RECOVERY – Host srv166 is UP: PING OK – Packet loss = 0%, RTA = 0.23 ms

[02:31:43] <nagios-wm> RECOVERY – Host search14 is UP: PING OK – Packet loss = 0%, RTA = 0.23 ms

[02:31:44] <nagios-wm> RECOVERY – Host srv154 is UP: PING OK – Packet loss = 0%, RTA = 0.18 ms

[02:31:44] <nagios-wm> RECOVERY – Host search17 is UP: PING OK – Packet loss = 0%, RTA = 0.26 ms

[02:31:45] <nagios-wm> RECOVERY – Host srv149 is UP: PING OK – Packet loss = 0%, RTA = 0.17 ms

[02:31:45] <nagios-wm> RECOVERY – Host db13 is UP: PING OK – Packet loss = 0%, RTA = 0.14 ms

[02:32:01] <nagios-wm> PROBLEM – MySQL on thistle is CRITICAL: Connection refused

[02:32:01] <nagios-wm> RECOVERY – Host sq75 is UP: PING OK – Packet loss = 0%, RTA = 0.21 ms

[02:32:01] <nagios-wm> RECOVERY – Host sq72 is UP: PING OK – Packet loss = 0%, RTA = 0.17 ms

[02:32:01] <nagios-wm> RECOVERY – Host sq73 is UP: PING OK – Packet loss = 0%, RTA = 0.17 ms

[02:32:01] <nagios-wm> PROBLEM – MySQL on db1 is CRITICAL: Connection refused

[02:32:02] <nagios-wm> PROBLEM – MySQL on db4 is CRITICAL: Connection refused

[02:32:11] <nagios-wm> RECOVERY – Host srv168 is UP: PING OK – Packet loss = 0%, RTA = 0.18 ms

[02:32:11] <nagios-wm> RECOVERY – Host srv165 is UP: PING OK – Packet loss = 0%, RTA = 0.19 ms

[02:32:11] <nagios-wm> RECOVERY – Host srv153 is UP: PING OK – Packet loss = 0%, RTA = 0.18 ms

[02:32:11] <nagios-wm> RECOVERY – Host srv156 is UP: PING OK – Packet loss = 0%, RTA = 0.19 ms

[02:32:11] <nagios-wm> RECOVERY – Host srv177 is UP: PING OK – Packet loss = 0%, RTA = 0.19 ms

[02:32:12] <nagios-wm> RECOVERY – Host srv180 is UP: PING OK – Packet loss = 0%, RTA = 0.17 ms

[02:32:12] <nagios-wm> RECOVERY – Host srv186 is UP: PING OK – Packet loss = 0%, RTA = 0.17 ms

[02:32:13] <nagios-wm> RECOVERY – Host srv151 is UP: PING OK – Packet loss = 0%, RTA = 0.18 ms

[02:32:13] <nagios-wm> RECOVERY – Host hume is UP: PING OK – Packet loss = 0%, RTA = 0.17 ms

[02:32:14] <nagios-wm> RECOVERY – Host db17 is UP: PING OK – Packet loss = 0%, RTA = 0.14 ms

[02:32:14] <nagios-wm> RECOVERY – Host db20 is UP: PING OK – Packet loss = 0%, RTA = 0.14 ms

[02:32:15] <nagios-wm> RECOVERY – Host db11 is UP: PING OK – Packet loss = 0%, RTA = 0.14 ms

[02:32:15] <nagios-wm> RECOVERY – Host db8 is UP: PING OK – Packet loss = 0%, RTA = 0.17 ms

[02:32:16] <nagios-wm> RECOVERY – Host sq74 is UP: PING OK – Packet loss = 0%, RTA = 0.26 ms

[02:32:16] <nagios-wm> RECOVERY – Host sq71 is UP: PING OK – Packet loss = 0%, RTA = 0.17 ms

[02:32:17] <nagios-wm> RECOVERY – Host sq77 is UP: PING OK – Packet loss = 0%, RTA = 0.18 ms

[02:32:17] <nagios-wm> PROBLEM – MySQL on db3 is CRITICAL: CRITICAL – Socket timeout after 10 seconds

[02:32:21] <nagios-wm> RECOVERY – Host srv163 is UP: PING OK – Packet loss = 0%, RTA = 0.19 ms

[02:32:21] <nagios-wm> RECOVERY – Host srv178 is UP: PING OK – Packet loss = 0%, RTA = 0.19 ms

[02:32:21] <nagios-wm> RECOVERY – Host ms1 is UP: PING OK – Packet loss = 0%, RTA = 0.26 ms

[02:32:21] <nagios-wm> RECOVERY – Host ms2 is UP: PING OK – Packet loss = 0%, RTA = 0.35 ms

[02:32:21] <nagios-wm> RECOVERY – Host srv184 is UP: PING OK – Packet loss = 0%, RTA = 3.98 ms

[02:32:22] <nagios-wm> PROBLEM – MySQL on db2 is CRITICAL: Connection refused

[02:32:22] <nagios-wm> PROBLEM – MySQL on ixia is CRITICAL: Connection refused

[02:32:31] <nagios-wm> RECOVERY – Host db12 is UP: PING OK – Packet loss = 0%, RTA = 0.14 ms

[02:32:31] <nagios-wm> RECOVERY – Host db18 is UP: PING OK – Packet loss = 0%, RTA = 0.13 ms

[02:32:31] <nagios-wm> RECOVERY – Host locke is UP: PING OK – Packet loss = 0%, RTA = 0.16 ms

[02:32:31] <nagios-wm> RECOVERY – Host db9 is UP: PING OK – Packet loss = 0%, RTA = 0.23 ms

[02:32:31] <nagios-wm> RECOVERY – Host db5 is UP: PING OK – Packet loss = 0%, RTA = 0.16 ms

[02:32:51] <nagios-wm> RECOVERY – Host db7 is UP: PING OK – Packet loss = 0%, RTA = 0.17 ms

[02:32:51] <nagios-wm> RECOVERY – Host sq76 is UP: PING OK – Packet loss = 0%, RTA = 0.18 ms

[02:33:01] <nagios-wm> PROBLEM – MySQL on db16 is CRITICAL: Connection refused

[02:33:01] <nagios-wm> PROBLEM – SSH on search18 is CRITICAL: Connection refused

[02:33:01] <nagios-wm> PROBLEM – SSH on search15 is CRITICAL: Connection refused

[02:33:01] <nagios-wm> PROBLEM – SSH on search20 is CRITICAL: Connection refused

[02:33:10] <nagios-wm> RECOVERY – Host sq39 is UP: PING OK – Packet loss = 0%, RTA = 0.19 ms

[02:33:21] <nagios-wm> PROBLEM – MySQL on db14 is CRITICAL: Connection refused

[02:33:31] <nagios-wm> RECOVERY – Host lvs2 is UP: PING OK – Packet loss = 0%, RTA = 0.16 ms

[02:33:31] <nagios-wm> RECOVERY – Host sql-text11.knams is UP: PING OK – Packet loss = 0%, RTA = 122.00 ms

[02:33:31] <nagios-wm> RECOVERY – Host sql-text16.knams is UP: PING OK – Packet loss = 0%, RTA = 122.11 ms

[02:33:40] <nagios-wm> RECOVERY – Host sql-text1.knams is UP: PING OK – Packet loss = 0%, RTA = 121.99 ms

[02:33:41] <nagios-wm> RECOVERY – Host sql-text14.knams is UP: PING OK – Packet loss = 0%, RTA = 122.02 ms

[02:33:41] <nagios-wm> RECOVERY – Host sql-text12.knams is UP: PING OK – Packet loss = 0%, RTA = 122.00 ms

[02:33:41] <nagios-wm> RECOVERY – Host sql-text13.knams is UP: PING OK – Packet loss = 0%, RTA = 124.20 ms

[02:33:41] <nagios-wm> RECOVERY – Host sql-text10.knams is UP: PING OK – Packet loss = 0%, RTA = 122.11 ms

[02:33:41] <nagios-wm> PROBLEM – IRC on browne is CRITICAL: Connection refused

[02:33:42] <nagios-wm> PROBLEM – Disk free on amane is CRITICAL: Connection refused by host

[02:33:42] <nagios-wm> PROBLEM – MySQL on db15 is CRITICAL: Connection refused

[02:33:43] <nagios-wm> PROBLEM – SSH on search14 is CRITICAL: Connection refused

[02:33:43] <nagios-wm> PROBLEM – SSH on search17 is CRITICAL: Connection refused

[02:33:51] <nagios-wm> RECOVERY – Host sanger is UP: PING OK – Packet loss = 0%, RTA = 0.17 ms

[02:33:51] <nagios-wm> RECOVERY – Host mchenry is UP: PING OK – Packet loss = 0%, RTA = 0.15 ms

[02:33:51] <nagios-wm> RECOVERY – Host sql-text4.knams is UP: PING OK – Packet loss = 0%, RTA = 122.06 ms

[02:33:51] <nagios-wm> RECOVERY – Host sql-text5.knams is UP: PING OK – Packet loss = 0%, RTA = 122.00 ms

[02:33:51] <nagios-wm> RECOVERY – Host sql-text6.knams is UP: PING OK – Packet loss = 0%, RTA = 121.95 ms

[02:33:52] <nagios-wm> RECOVERY – Host sql-text19.knams is UP: PING OK – Packet loss = 0%, RTA = 122.00 ms

[02:33:52] <nagios-wm> RECOVERY – Host sql-text17.knams is UP: PING OK – Packet loss = 0%, RTA = 122.08 ms

[02:33:53] <nagios-wm> RECOVERY – Host sql-text7.knams is UP: PING OK – Packet loss = 0%, RTA = 122.01 ms

[02:33:53] <nagios-wm> RECOVERY – Host sql-text18.knams is UP: PING OK – Packet loss = 0%, RTA = 121.99 ms

[02:33:54] <nagios-wm> RECOVERY – Host sql-text8.knams is UP: PING OK – Packet loss = 0%, RTA = 122.42 ms

[02:33:54] <nagios-wm> RECOVERY – Host sql-text3.knams is UP: PING OK – Packet loss = 0%, RTA = 122.07 ms

[02:33:55] <nagios-wm> RECOVERY – Host sql-text9.knams is UP: PING OK – Packet loss = 0%, RTA = 122.05 ms

[02:34:01] <nagios-wm> PROBLEM – Apache on srv149 is CRITICAL: Connection refused

[02:34:01] <nagios-wm> PROBLEM – SSH status on srv149 is CRITICAL: Connection refused

[02:34:01] <nagios-wm> PROBLEM – SSH on amane is CRITICAL: Connection refused

[02:34:01] <nagios-wm> PROBLEM – MySQL on db13 is CRITICAL: Connection refused

[02:34:10] <nagios-wm> RECOVERY – Host tridge is UP: PING OK – Packet loss = 0%, RTA = 0.18 ms

[02:34:21] <nagios-wm> PROBLEM – SSH on search13 is CRITICAL: Connection refused

[02:34:21] <nagios-wm> PROBLEM – SSH on search16 is CRITICAL: Connection refused

[02:34:21] <nagios-wm> PROBLEM – SSH on search19 is CRITICAL: Connection refused

[02:34:21] <nagios-wm> PROBLEM – MySQL on db11 is CRITICAL: Connection refused

[02:34:21] <nagios-wm> PROBLEM – MySQL on db17 is CRITICAL: Connection refused

[02:34:22] <nagios-wm> PROBLEM – MySQL on db8 is CRITICAL: Connection refused

[02:34:31] <nagios-wm> RECOVERY – Host upload.pmtpa is UP: PING OK – Packet loss = 0%, RTA = 0.16 ms

[02:34:41] <nagios-wm> PROBLEM – MySQL on db12 is CRITICAL: Connection refused

[02:34:41] <nagios-wm> PROBLEM – MySQL on db18 is CRITICAL: Connection refused

[02:34:41] <nagios-wm> PROBLEM – MySQL on db5 is CRITICAL: Connection refused

[02:35:01] <nagios-wm> PROBLEM – MySQL on db7 is CRITICAL: Connection refused

[02:35:30] <nagios-wm> RECOVERY – Host ms5 is UP: PING OK – Packet loss = 0%, RTA = 0.15 ms

[02:36:01] <nagios-wm> RECOVERY – Host temp-es19 is UP: PING OK – Packet loss = 0%, RTA = 0.22 ms

[02:36:01] <nagios-wm> RECOVERY – Host temp-es12 is UP: PING OK – Packet loss = 0%, RTA = 0.42 ms

[02:36:01] <nagios-wm> RECOVERY – Host temp-es16 is UP: PING OK – Packet loss = 0%, RTA = 0.18 ms

[02:36:01] <nagios-wm> RECOVERY – Host temp-es18 is UP: PING OK – Packet loss = 0%, RTA = 0.27 ms

[02:36:01] <nagios-wm> RECOVERY – Host temp-es17 is UP: PING OK – Packet loss = 0%, RTA = 0.17 ms

[02:36:02] <nagios-wm> RECOVERY – MySQL on db3 is OK: TCP OK – 0.000 second response time on port 3306

[02:36:02] <nagios-wm> PROBLEM – SSH on mchenry is CRITICAL: Connection refused

[02:36:21] <nagios-wm> PROBLEM – SSH on tridge is CRITICAL: Connection refused

[02:36:21] <nagios-wm> PROBLEM – MySQL status on db17 is CRITICAL: (Service Check Timed Out)

[02:36:21] <nagios-wm> PROBLEM – MySQL status on db23 is CRITICAL: (Service Check Timed Out)

[02:36:21] <nagios-wm> PROBLEM – MySQL status on db36 is CRITICAL: (Service Check Timed Out)

[02:36:21] <nagios-wm> PROBLEM – Old ext. store status on srv167 is CRITICAL: (Service Check Timed Out)

[02:36:22] <nagios-wm> PROBLEM – Old ext. store status on temp-es1 is CRITICAL: (Service Check Timed Out)

[02:36:22] <nagios-wm> PROBLEM – Old ext. store status on srv152 is CRITICAL: (Service Check Timed Out)

[02:36:23] <nagios-wm> PROBLEM – Old ext. store status on srv164 is CRITICAL: (Service Check Timed Out)

[02:36:23] <nagios-wm> PROBLEM – Disk free on sanger is CRITICAL: Connection refused by host

[02:36:31] <nagios-wm> PROBLEM – Old ext. store status on srv176 is CRITICAL: (Service Check Timed Out)

[02:36:31] <nagios-wm> PROBLEM – Old ext. store status on srv179 is CRITICAL: (Service Check Timed Out)

[02:36:31] <nagios-wm> PROBLEM – MySQL status on db8 is CRITICAL: (Service Check Timed Out)

[02:36:31] <nagios-wm> PROBLEM – Old ext. store status on srv161 is CRITICAL: (Service Check Timed Out)

[02:36:31] <nagios-wm> PROBLEM – Old ext. store status on srv185 is CRITICAL: (Service Check Timed Out)

[02:36:32] <nagios-wm> PROBLEM – Old ext. store status on srv173 is CRITICAL: (Service Check Timed Out)

[02:36:40] <nagios-wm> PROBLEM – Old ext. store status on srv153 is CRITICAL: (Service Check Timed Out)

[02:36:40] <nagios-wm> PROBLEM – Misc_Db_Slave on db10 is CRITICAL: (Service Check Timed Out)

[02:36:41] <nagios-wm> PROBLEM – SSH on sanger is CRITICAL: Connection refused

[02:37:01] <nagios-wm> PROBLEM – Old ext. store status on srv180 is CRITICAL: (Service Check Timed Out)

[02:37:01] <nagios-wm> PROBLEM – Old ext. store status on srv168 is CRITICAL: (Service Check Timed Out)

[02:37:01] <nagios-wm> PROBLEM – MySQL status on db29 is CRITICAL: (Service Check Timed Out)

[02:37:01] <nagios-wm> PROBLEM – Old ext. store status on srv177 is CRITICAL: (Service Check Timed Out)

[02:37:01] <nagios-wm> PROBLEM – MySQL status on db15 is CRITICAL: (Service Check Timed Out)

[02:37:02] <nagios-wm> PROBLEM – Old ext. store status on srv165 is CRITICAL: (Service Check Timed Out)

[02:37:02] <nagios-wm> PROBLEM – check_all_memcacheds on spence is CRITICAL: (Service Check Timed Out)

[02:37:03] <nagios-wm> PROBLEM – Current ext. store master status on ms3 is CRITICAL: (Service Check Timed Out)

[02:37:03] <nagios-wm> PROBLEM – Old ext. store status on srv157 is CRITICAL: (Service Check Timed Out)

[02:37:04] <nagios-wm> PROBLEM – Old ext. store status on srv151 is CRITICAL: (Service Check Timed Out)

[02:37:04] <nagios-wm> PROBLEM – Old ext. store status on srv178 is CRITICAL: (Service Check Timed Out)

[02:37:05] <nagios-wm> PROBLEM – Old ext. store status on srv163 is CRITICAL: (Service Check Timed Out)

[02:37:05] <nagios-wm> PROBLEM – Old ext. store status on srv155 is CRITICAL: (Service Check Timed Out)

[02:37:06] <nagios-wm> PROBLEM – Old ext. store status on srv181 is CRITICAL: (Service Check Timed Out)

[02:37:06] <nagios-wm> PROBLEM – Old ext. store status on srv184 is CRITICAL: (Service Check Timed Out)

[02:37:07] <nagios-wm> PROBLEM – Old ext. store status on srv175 is CRITICAL: (Service Check Timed Out)

[02:37:07] <nagios-wm> PROBLEM – Disk space on tridge is CRITICAL: Connection refused by host

[02:37:10] <nagios-wm> PROBLEM – Old ext. store status on srv170 is CRITICAL: (Service Check Timed Out)

[02:37:10] <nagios-wm> PROBLEM – Old ext. store status on srv154 is CRITICAL: (Service Check Timed Out)

[02:37:10] <nagios-wm> PROBLEM – Misc_Db_Master on db9 is CRITICAL: (Service Check Timed Out)

[02:37:11] <nagios-wm> PROBLEM – Old ext. store status on srv160 is CRITICAL: (Service Check Timed Out)

[02:37:11] <nagios-wm> PROBLEM – Old ext. store status on srv166 is CRITICAL: (Service Check Timed Out)

[02:37:11] <nagios-wm> PROBLEM – Old ext. store status on srv172 is CRITICAL: (Service Check Timed Out)

[02:37:11] <nagios-wm> PROBLEM – Old ext. store status on srv158 is CRITICAL: (Service Check Timed Out)

[02:38:10] <nagios-wm> PROBLEM – MySQL status on db31 is CRITICAL: (Service Check Timed Out)

[02:38:10] <nagios-wm> PROBLEM – MySQL status on db24 is CRITICAL: (Service Check Timed Out)

[02:38:11] <nagios-wm> PROBLEM – MySQL status on db38 is CRITICAL: (Service Check Timed Out)

[02:38:11] <nagios-wm> PROBLEM – MySQL status on db27 is CRITICAL: (Service Check Timed Out)

[02:38:11] <nagios-wm> PROBLEM – MySQL status on thistle is CRITICAL: (Service Check Timed Out)

[02:38:11] <nagios-wm> PROBLEM – MySQL status on db3 is CRITICAL: (Service Check Timed Out)

[02:38:11] <nagios-wm> PROBLEM – MySQL status on db21 is CRITICAL: (Service Check Timed Out)

[02:38:12] <nagios-wm> PROBLEM – MySQL status on db1 is CRITICAL: (Service Check Timed Out)

[02:38:12] <nagios-wm> PROBLEM – MySQL status on db32 is CRITICAL: (Service Check Timed Out)

[02:38:13] <nagios-wm> PROBLEM – MySQL status on db4 is CRITICAL: (Service Check Timed Out)

[02:38:13] <nagios-wm> PROBLEM – MySQL status on db25 is CRITICAL: (Service Check Timed Out)

[02:38:41] <nagios-wm> PROBLEM – MySQL status on db26 is CRITICAL: (Service Check Timed Out)

[02:38:41] <nagios-wm> PROBLEM – MySQL status on db40 is CRITICAL: (Service Check Timed Out)

[02:38:41] <nagios-wm> PROBLEM – MySQL status on db30 is CRITICAL: (Service Check Timed Out)

[02:38:41] <nagios-wm> PROBLEM – MySQL status on ixia is CRITICAL: (Service Check Timed Out)

[02:39:21] <nagios-wm> PROBLEM – MySQL status on db16 is CRITICAL: (Service Check Timed Out)

[02:39:21] <nagios-wm> RECOVERY – Disk free on sanger is OK: DISK OK

[02:39:41] <nagios-wm> RECOVERY – SSH on sanger is OK: SSH OK – OpenSSH_4.7p1 Debian-8ubuntu1.2 (protocol 2.0)

[02:39:51] <nagios-wm> PROBLEM – MySQL status on db14 is CRITICAL: (Service Check Timed Out)

Not to mention the priceless comeback in full force…

[03:09:21] <nagios-wm> PROBLEM – Host sanger is DOWN: CRITICAL – Host Unreachable (208.80.152.187)

[03:09:22] <nagios-wm> PROBLEM – Host srv152 is DOWN: PING CRITICAL – Packet loss = 100%

[03:09:22] <nagios-wm> PROBLEM – Host srv151 is DOWN: PING CRITICAL – Packet loss = 100%

[03:09:22] <nagios-wm> PROBLEM – Host db7 is DOWN: PING CRITICAL – Packet loss = 100%

[03:09:22] <nagios-wm> PROBLEM – Host srv164 is DOWN: PING CRITICAL – Packet loss = 100%

[03:09:22] <nagios-wm> PROBLEM – Host srv155 is DOWN: PING CRITICAL – Packet loss = 100%

[03:09:22] <nagios-wm> PROBLEM – Host srv183 is DOWN: PING CRITICAL – Packet loss = 100%

[03:09:22] <nagios-wm> PROBLEM – Host srv167 is DOWN: PING CRITICAL – Packet loss = 100%

[03:09:23] <nagios-wm> PROBLEM – Host srv185 is DOWN: PING CRITICAL – Packet loss = 100%

[03:09:23] <nagios-wm> PROBLEM – Host srv179 is DOWN: PING CRITICAL – Packet loss = 100%

[03:09:24] <nagios-wm> PROBLEM – Host srv176 is DOWN: PING CRITICAL – Packet loss = 100%

[03:09:24] <nagios-wm> PROBLEM – Host srv175 is DOWN: PING CRITICAL – Packet loss = 100%

[03:09:25] <nagios-wm> PROBLEM – Host srv181 is DOWN: PING CRITICAL – Packet loss = 100%

[03:09:25] <nagios-wm> PROBLEM – Host srv166 is DOWN: PING CRITICAL – Packet loss = 100%

[03:09:26] <nagios-wm> PROBLEM – Host srv154 is DOWN: PING CRITICAL – Packet loss = 100%

[03:09:26] <nagios-wm> PROBLEM – Host srv163 is DOWN: PING CRITICAL – Packet loss = 100%

[03:09:27] <nagios-wm> PROBLEM – Host srv178 is DOWN: PING CRITICAL – Packet loss = 100%

[03:09:27] <nagios-wm> PROBLEM – Host srv184 is DOWN: PING CRITICAL – Packet loss = 100%

[03:09:28] <nagios-wm> PROBLEM – Host srv182 is DOWN: PING CRITICAL – Packet loss = 100%

[03:09:31] <nagios-wm> PROBLEM – Host sq72 is DOWN: CRITICAL – Host Unreachable (208.80.152.82)

[03:09:31] <nagios-wm> PROBLEM – Host sq75 is DOWN: CRITICAL – Host Unreachable (208.80.152.85)

[03:09:31] <nagios-wm> PROBLEM – Host hume is DOWN: CRITICAL – Host Unreachable (208.80.152.190)

[03:09:31] <nagios-wm> PROBLEM – Host tridge is DOWN: CRITICAL – Host Unreachable (208.80.152.170)

[03:09:31] <nagios-wm> PROBLEM – Host db9 is DOWN: PING CRITICAL – Packet loss = 100%

[03:09:31] <nagios-wm> PROBLEM – Host sq73 is DOWN: PING CRITICAL – Packet loss = 100%

[03:09:31] <nagios-wm> PROBLEM – Host sq76 is DOWN: PING CRITICAL – Packet loss = 100%

[03:09:41] <nagios-wm> PROBLEM – Host db8 is DOWN: PING CRITICAL – Packet loss = 100%

[03:09:41] <nagios-wm> PROBLEM – Host lvs4 is DOWN: PING CRITICAL – Packet loss = 100%

[03:09:51] <nagios-wm> PROBLEM – Host locke is DOWN: CRITICAL – Host Unreachable (208.80.152.138)

[03:09:51] <nagios-wm> PROBLEM – Host srv168 is DOWN: PING CRITICAL – Packet loss = 100%

[03:09:51] <nagios-wm> PROBLEM – Host srv165 is DOWN: PING CRITICAL – Packet loss = 100%

[03:09:51] <nagios-wm> PROBLEM – Host srv153 is DOWN: PING CRITICAL – Packet loss = 100%

[03:09:51] <nagios-wm> PROBLEM – Host srv156 is DOWN: PING CRITICAL – Packet loss = 100%

[03:09:51] <nagios-wm> PROBLEM – Host srv177 is DOWN: PING CRITICAL – Packet loss = 100%

[03:09:51] <nagios-wm> PROBLEM – Host srv180 is DOWN: PING CRITICAL – Packet loss = 100%

[03:09:52] <nagios-wm> PROBLEM – Host srv186 is DOWN: PING CRITICAL – Packet loss = 100%

[03:09:52] <nagios-wm> PROBLEM – Host sq74 is DOWN: PING CRITICAL – Packet loss = 100%

[03:09:53] <nagios-wm> PROBLEM – Host sq71 is DOWN: PING CRITICAL – Packet loss = 100%

[03:09:53] <nagios-wm> PROBLEM – Host sq77 is DOWN: PING CRITICAL – Packet loss = 100%

[03:10:01] <nagios-wm> PROBLEM – Host lvs2 is DOWN: CRITICAL – Host Unreachable (208.80.152.121)

[03:10:01] <nagios-wm> PROBLEM – Host ms1 is DOWN: PING CRITICAL – Packet loss = 100%

[03:10:11] <nagios-wm> RECOVERY – Host tridge is UP: PING OK – Packet loss = 0%, RTA = 0.16 ms

[03:10:11] <nagios-wm> PROBLEM – Host db5 is DOWN: PING CRITICAL – Packet loss = 100%

[03:10:21] <nagios-wm> RECOVERY – Host srv186 is UP: PING OK – Packet loss = 0%, RTA = 0.17 ms

[03:10:31] <nagios-wm> RECOVERY – Host lvs2 is UP: PING OK – Packet loss = 0%, RTA = 0.16 ms

[03:10:41] <nagios-wm> RECOVERY – Host srv164 is UP: PING OK – Packet loss = 0%, RTA = 0.17 ms

[03:10:41] <nagios-wm> RECOVERY – Host srv167 is UP: PING OK – Packet loss = 0%, RTA = 0.18 ms

[03:10:41] <nagios-wm> RECOVERY – Host srv183 is UP: PING OK – Packet loss = 0%, RTA = 0.18 ms

[03:10:41] <nagios-wm> RECOVERY – Host srv185 is UP: PING OK – Packet loss = 0%, RTA = 0.17 ms

[03:10:41] <nagios-wm> RECOVERY – Host srv179 is UP: PING OK – Packet loss = 0%, RTA = 0.17 ms

[03:10:41] <nagios-wm> RECOVERY – Host srv176 is UP: PING OK – Packet loss = 0%, RTA = 0.17 ms

[03:10:41] <nagios-wm> RECOVERY – Host srv155 is UP: PING OK – Packet loss = 0%, RTA = 0.17 ms

[03:10:42] <nagios-wm> RECOVERY – Host srv166 is UP: PING OK – Packet loss = 0%, RTA = 0.19 ms

[03:10:42] <nagios-wm> RECOVERY – Host srv154 is UP: PING OK – Packet loss = 0%, RTA = 0.26 ms

[03:10:51] <nagios-wm> RECOVERY – Host sanger is UP: PING OK – Packet loss = 0%, RTA = 0.16 ms

[03:11:11] <nagios-wm> RECOVERY – Host srv165 is UP: PING OK – Packet loss = 0%, RTA = 0.19 ms

[03:11:11] <nagios-wm> RECOVERY – Host srv168 is UP: PING OK – Packet loss = 0%, RTA = 0.80 ms

[03:11:11] <nagios-wm> RECOVERY – Host srv182 is UP: PING OK – Packet loss = 0%, RTA = 0.30 ms

[03:11:11] <nagios-wm> RECOVERY – Host srv153 is UP: PING OK – Packet loss = 0%, RTA = 0.17 ms

[03:11:11] <nagios-wm> RECOVERY – Host srv152 is UP: PING OK – Packet loss = 0%, RTA = 0.27 ms

[03:11:11] <nagios-wm> RECOVERY – Host srv180 is UP: PING OK – Packet loss = 0%, RTA = 0.19 ms

[03:11:11] <nagios-wm> RECOVERY – Host srv177 is UP: PING OK – Packet loss = 0%, RTA = 0.17 ms

[03:11:12] <nagios-wm> RECOVERY – Host lvs4 is UP: PING OK – Packet loss = 0%, RTA = 0.15 ms

[03:11:12] <nagios-wm> RECOVERY – Host srv156 is UP: PING OK – Packet loss = 0%, RTA = 0.18 ms

[03:11:13] <nagios-wm> RECOVERY – Host mchenry is UP: PING OK – Packet loss = 0%, RTA = 0.15 ms

[03:11:13] <nagios-wm> RECOVERY – Host hume is UP: PING OK – Packet loss = 0%, RTA = 0.16 ms

[03:11:14] <nagios-wm> RECOVERY – Host srv151 is UP: PING OK – Packet loss = 0%, RTA = 0.18 ms

[03:11:21] <nagios-wm> RECOVERY – Host srv175 is UP: PING OK – Packet loss = 0%, RTA = 0.19 ms

[03:11:21] <nagios-wm> RECOVERY – Host srv163 is UP: PING OK – Packet loss = 0%, RTA = 0.20 ms

[03:11:21] <nagios-wm> RECOVERY – Host srv181 is UP: PING OK – Packet loss = 0%, RTA = 0.18 ms

[03:11:21] <nagios-wm> RECOVERY – Host srv178 is UP: PING OK – Packet loss = 0%, RTA = 0.20 ms

[03:11:21] <nagios-wm> RECOVERY – Host srv184 is UP: PING OK – Packet loss = 0%, RTA = 0.17 ms

[03:11:41] <nagios-wm> RECOVERY – Host db5 is UP: PING OK – Packet loss = 0%, RTA = 0.17 ms

[03:11:41] <nagios-wm> RECOVERY – Host sq71 is UP: PING OK – Packet loss = 0%, RTA = 0.19 ms

[03:11:41] <nagios-wm> RECOVERY – Host sq73 is UP: PING OK – Packet loss = 0%, RTA = 0.18 ms

[03:11:41] <nagios-wm> RECOVERY – Host sq74 is UP: PING OK – Packet loss = 0%, RTA = 0.17 ms

[03:11:51] <nagios-wm> RECOVERY – Host db7 is UP: PING OK – Packet loss = 0%, RTA = 0.18 ms

[03:11:51] <nagios-wm> RECOVERY – Host db9 is UP: PING OK – Packet loss = 0%, RTA = 0.18 ms

[03:11:51] <nagios-wm> RECOVERY – Host sq72 is UP: PING OK – Packet loss = 0%, RTA = 0.19 ms

[03:11:51] <nagios-wm> RECOVERY – Host sq75 is UP: PING OK – Packet loss = 0%, RTA = 0.16 ms

[03:11:51] <nagios-wm> RECOVERY – Host sq76 is UP: PING OK – Packet loss = 0%, RTA = 1.76 ms

[03:12:11] <nagios-wm> RECOVERY – Host sq77 is UP: PING OK – Packet loss = 0%, RTA = 0.24 ms

[03:12:11] <nagios-wm> RECOVERY – Host db8 is UP: PING OK – Packet loss = 0%, RTA = 3.73 ms

[03:12:31] <nagios-wm> RECOVERY – Host locke is UP: PING OK – Packet loss = 0%, RTA = 0.19 ms

In short, a fine mess, eh 🙂

Among the big winners, let's mention JamesC93, a regular on the channel, who treated us to this:
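To put a number on that mess, here is a minimal Python sketch that pairs each "Host … is DOWN" line with the next "Host … is UP" line for the same host and measures the gap. The function name `host_downtimes` and the regular expression are my own assumptions based on the lines pasted above; this is not a Nagios tool, and it accepts both "-" and "–" after PROBLEM/RECOVERY, since the dash in the paste may have been altered by the blog software.

```python
import re
from datetime import datetime, timedelta

# Host-level lines as they appear in the pasted log, e.g.
# [02:12:21] <nagios-wm> PROBLEM – Host srv167 is DOWN: PING CRITICAL – ...
HOST_LINE = re.compile(
    r"^\[(?P<time>\d{2}:\d{2}:\d{2})\] <nagios-wm> "
    r"(?:PROBLEM|RECOVERY) [-–] Host (?P<host>\S+) is (?P<state>DOWN|UP):"
)

def host_downtimes(lines):
    """Return {host: [timedelta, ...]}, one entry per DOWN -> UP pair."""
    down_since = {}   # host -> time of the first unmatched DOWN
    outages = {}      # host -> list of completed outage durations
    for line in lines:
        m = HOST_LINE.match(line.strip())
        if not m:
            continue  # service checks, joins/quits, human chatter
        t = datetime.strptime(m["time"], "%H:%M:%S")
        host = m["host"]
        if m["state"] == "DOWN":
            down_since.setdefault(host, t)
        elif host in down_since:
            outages.setdefault(host, []).append(t - down_since.pop(host))
    return outages

# Four lines copied from the log above, for a single host.
sample = [
    "[02:12:21] <nagios-wm> PROBLEM – Host srv167 is DOWN: PING CRITICAL – Packet loss = 100%",
    "[02:31:41] <nagios-wm> RECOVERY – Host srv167 is UP: PING OK – Packet loss = 0%, RTA = 1.45 ms",
    "[03:09:22] <nagios-wm> PROBLEM – Host srv167 is DOWN: PING CRITICAL – Packet loss = 100%",
    "[03:10:41] <nagios-wm> RECOVERY – Host srv167 is UP: PING OK – Packet loss = 0%, RTA = 0.18 ms",
]

for host, gaps in host_downtimes(sample).items():
    print(host, [str(g) for g in gaps])
```

The sketch assumes all timestamps fall on the same day and only measures what the bot saw: a RECOVERY with no earlier DOWN in the paste (db14 or bayes, for instance) is simply ignored.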

[02:28:14] * JamesC93 calls dibs on power failure

Let's also mention Darkoneko, a channel regular too, who nonetheless treated us to this:

[02:51:42] <darkoneko> hello

[02:52:06] <darkoneko> i’m hitting an error when trying to access http://fr.wikipedia.org/wiki/Discussion_utilisateur:Ju_gatsu_mikka (which is a redirection)

[02:52:10] <darkoneko> error is : Unknown error (10.0.0.241)

Among the losers, let's mention:

  • The English-language Wiktionary: "Wiktionary has a problem."

  • The French-language Wiktionnaire: "Wiktionnaire has a problem."

  • The French-language Wikipedia: "Ce wiki a un problème." ("This wiki has a problem.")

  • … and the MediaWiki site! Yes, inevitably, someone pointed it out on the #mediawiki channel: "MediaWiki has a problem."

Unfortunately, the servers coming back up does not mean the WMF sites are back to normal: the processes still have to be restarted (notably MySQL, which seems to be sulking a bit at the moment). And we have to hope there won't be any more power cuts, and therefore cooling failures, in the next few minutes… it's far from in the bag!
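The log shows this gap nicely: hosts answer pings again while MySQL still reports "Connection refused", and db3 only comes back with "TCP OK … on port 3306". As a rough illustration, here is a small Python sketch of a TCP-level check. The function `tcp_service_state` and its return strings are hypothetical and only loosely modelled on the messages above; this is not how the Nagios plugins are actually implemented.

```python
import socket

def tcp_service_state(host, port, timeout=10.0):
    """Try to open a TCP connection and classify the outcome.

    A host can answer pings while this still fails, because the
    service (MySQL on port 3306, say) has not been restarted yet.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return "OK"
    except ConnectionRefusedError:
        # The machine is up, but nothing is listening on that port.
        return "CRITICAL: Connection refused"
    except socket.timeout:
        # No answer at all within the allotted time.
        return "CRITICAL: Socket timeout"
    except OSError as exc:
        # Anything else: unreachable host, DNS failure, and so on.
        return f"CRITICAL: {exc}"

if __name__ == "__main__":
    # Hypothetical target; replace with a host and port you may probe.
    print(tcp_service_state("127.0.0.1", 3306, timeout=2.0))
```

The 10-second default merely echoes the "Socket timeout after 10 seconds" lines in the log. A successful connection only proves that something is listening, not that the database is actually answering queries.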

  1. according to this page, the servers adler, albert, alrazi, amane, anthony, ariel, avicenna, bacon, bart, bayle, benet, biruni, bleueunn, browne, chloe, coronelli, dalembert, diderot, ennael, friedrich, goeje, harris, holbach, humboldt, hypatia, isidore, ixia, khaldun, kluge, larousse, lomaria, maurus, mchenry, moreri, rabanus, rose, samuel, sanger, smellie, suda, thistle, tingxi, vincent, webster, will, yongle, zwinger; as well as the database servers db1, db2, db3, db4, the squids sq1 to sq50, and the Apache servers srv0 to srv189

1 comment to C’est la fin du monde…

  • Yeah, well, in my defense, it was 3 in the morning and I was quietly lying on my bed checking one last thing before going to sleep; in short… I had my head up my backside.

    Since (at first) I only had the problem on that page (I had tested several other pages, but they worked… probably because they were cached), I assumed a local problem ("this particular redirect has an issue" or "redirects are screwed up") rather than a general one ("the servers have blown up").
