GitLab.com accidentally deleted its data and is now completely down

This topic was created 2987 days ago; the information in it may have since developed or changed.

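It has already been down for more than two hours, and the outage is expected to continue.

They published a Google Doc, and some of what's in it is worth learning from.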

https://docs.google.com/document/d/1GCK53YDcBWQveod9kfzW-VCxIABGiryG7_z_6jHdVik/pub

Addendum 1 · 2017-02-01 14:40:08 +08:00

The incident affected the database (including issues and merge requests) but not the git repo's (repositories and wikis).

Looks like they're done for; the database may never be recovered at this rate! The impact is huge.

Addendum 2 · 2017-02-01 20:46:00 +08:00

https://about.gitlab.com/2017/02/01/gitlab-dot-com-database-incident/

30 replies • 2017-02-03 10:28:27 +08:00

yangqi

2017-02-01 10:04:14 +08:00

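This is hilarious: 5 backup methods, and not one of them backed up correctly. Nobody checks these day to day, and out of a whole team nobody realized how important backups are.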

dzxx36gyy

2017-02-01 10:04:32 +08:00 via Android

Pfft, hahaha

ZE3kr

2017-02-01 10:07:33 +08:00

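When I read "5 backup/replication techniques deployed none are working reliably or set up in the first place" I was surprised too. At the very least, I regularly check whether my backups are being updated and whether they have a non-zero size (backups often come out empty because of configuration problems).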
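The routine check described in this reply (is the backup recent, and is it non-empty?) can be sketched in a few lines of Python. This is only an illustration: the function name `check_backup`, the 24-hour age limit, and the 1 KiB minimum size are assumptions chosen for the example, not anything GitLab or the poster actually used.

```python
import time
from pathlib import Path


def check_backup(path, max_age_hours=24.0, min_bytes=1024, now=None):
    """Return a list of problems with a backup file; an empty list means it looks OK.

    The thresholds are illustrative defaults, not recommendations.
    """
    p = Path(path)
    if not p.is_file():
        return [f"{p}: backup file is missing"]

    st = p.stat()
    now = time.time() if now is None else now
    problems = []

    # A backup that has not been rewritten recently suggests the job stopped running.
    age_hours = (now - st.st_mtime) / 3600
    if age_hours > max_age_hours:
        problems.append(
            f"{p}: last updated {age_hours:.1f} h ago (limit {max_age_hours} h)"
        )

    # A file of only a few bytes usually means the dump command failed silently.
    if st.st_size < min_bytes:
        problems.append(
            f"{p}: only {st.st_size} bytes (expected at least {min_bytes})"
        )

    return problems
```

Run from a scheduled job, any non-empty result could be sent as an alert. A size and age check like this only catches the crudest failures; actually restoring the backup somewhere is the stronger test.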

ZE3kr

2017-02-01 10:10:02 +08:00

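I moved to GitLab.com's public service precisely because maintaining GitLab myself was too much trouble. I didn't expect their public service to be the one that ran into problems...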

EPr2hh6LADQWqRVH

2017-02-01 10:27:40 +08:00 via Android

Worrying times, Ruby folks

AstroProfundis

2017-02-01 10:30:31 +08:00

That final conclusion is really painful to read...

deleted

2017-02-01 10:36:54 +08:00 via Android

Something's gone wrong at GitLab...

wzxjohn

2017-02-01 10:46:35 +08:00

Shocked... Luckily I'm running a self-hosted instance...

Havee

2017-02-01 11:02:50 +08:00

Yikes...

Unknwon

2017-02-01 11:07:24 +08:00

This feels like the perfect moment for a product plug.. I don't think I can pass it up..

https://gogs.io/ You deserve it...

DoraJDJ

2017-02-01 11:13:32 +08:00 via Android

rm -rf / strikes again

Glad I moved to Coding Pages ahead of time

yangqi

2017-02-01 11:23:30 +08:00

@Unknwon Heh, you'd have to show off your own backup plan to really seize this perfect moment

neilp

2017-02-01 11:29:57 +08:00

All 5 approaches actually failed.
That's really something else.

irainsoft

2017-02-01 11:34:06 +08:00

All 5 not working, and nobody noticing for this long, is quite a feat....

Sharuru

2017-02-01 11:55:29 +08:00

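Of the five backups, apart from the last one (the S3 storage being set up incorrectly), the rest failed because the backup interval was too long (24 hours).

Backups are the last resort for getting a failed system back to a usable state;
if the goal is fast recovery after a failure, that is the job of HA.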

ZE3kr

2017-02-01 12:01:42 +08:00 via iPhone

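@Sharuru For the regular backups, they couldn't find where they were stored
The disk snapshots didn't cover the database

In short, every piece of bad luck hit at once; otherwise it wouldn't have been down for this long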

yangqi

2017-02-01 12:02:05 +08:00

@Sharuru The only usable backup from within 24 hours was the LVM snapshot; all the others were useless

"2. Regular backups seem to also only be taken once per 24 hours, though YP has not yet been able to figure out where they are stored. According to JN these don't appear to be working, producing files only a few bytes in size."
"3. Disk snapshots in Azure are enabled for the NFS server, but not for the DB servers."
"4. The synchronisation process removes webhooks once it has synchronised data to staging. "
"5. The replication procedure is super fragile, prone to error, relies on a handful of random shell scripts, and is badly documented"

ZE3kr

2017-02-01 12:06:25 +08:00 via iPhone

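Speaking of which, I recently lost a very important code file myself. Just as I was getting really anxious and wondering whether I'd have to write the code all over again, I found I had a snapshot from a day earlier, and the relief was instant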

DaCong

2017-02-01 12:07:00 +08:00 via Android

No wonder downloading the source from gitlab kept failing today when I was updating a package on the AUR.

matrix67

2017-02-01 12:09:41 +08:00

The good news is: This incident affected the database (including issues and merge requests) but not the git repo's (repositories and wikis).

matrix67

2017-02-01 12:12:22 +08:00

The main cause was being too quick on the keyboard

YP thinks that perhaps pg_basebackup is being super pedantic about there being an empty data directory, decides to remove the directory. After a second or two he notices he ran it on db1.cluster.gitlab.com, instead of db2.cluster.gitlab.com

2017/01/31 23:27 YP - terminates the removal, but it's too late. Of around 310 GB only about 4.5 GB is left - Slack
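The mix-up between db1 and db2 in the timeline above is the kind of mistake a small guard in front of a destructive command can catch. The sketch below is a hypothetical helper, not something from GitLab's tooling: the name `remove_data_dir`, its parameters, and the dry-run default are all assumptions made for illustration.

```python
import shutil
import socket


def remove_data_dir(path, expected_host, current_host=None, dry_run=True):
    """Remove `path` only when running on `expected_host`.

    `current_host` defaults to this machine's hostname; it is a parameter
    so the check can be exercised without changing machines.
    """
    if current_host is None:
        current_host = socket.gethostname()

    # Refuse outright if the operator is on a different machine than intended.
    if current_host != expected_host:
        raise RuntimeError(
            f"refusing to remove {path}: on {current_host!r}, "
            f"expected {expected_host!r}"
        )

    # Default to a dry run so deletion needs an explicit opt-in.
    if dry_run:
        return f"would remove {path} on {current_host}"

    shutil.rmtree(path)
    return f"removed {path} on {current_host}"
```

Having to type the intended hostname as an argument forces a second look at which machine the shell is on; it does not replace tested backups.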

And then they still wanted to try recovering with this method. Were they googling it on the spot? Surely not.

JEJ: Probably too late, but isn't it sometimes possible if you make the disk read-only quickly enough? Also might still have file descriptor if the file was in use by a running process according to http://unix.stackexchange.com/a/101247/213510
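The file-descriptor idea in JEJ's comment relies on the fact that, on Linux, a deleted file's data stays on disk for as long as some process still has it open, and that the open descriptor is reachable under `/proc/<pid>/fd/`. A minimal sketch of copying the data back out, assuming a Linux system with `/proc` mounted (the function name `recover_from_open_fd` is made up for this example, and nothing here reflects what GitLab actually tried):

```python
import shutil


def recover_from_open_fd(fd, dest_path, pid="self"):
    """Copy the data behind an open file descriptor to `dest_path`.

    Linux only: reads through /proc/<pid>/fd/<fd>, which keeps working
    even after the file's directory entry has been removed, as long as
    the process still holds the descriptor open.
    """
    proc_path = f"/proc/{pid}/fd/{fd}"
    with open(proc_path, "rb") as src, open(dest_path, "wb") as dst:
        shutil.copyfileobj(src, dst)
    return dest_path
```

For another process, `pid` would be that process's ID and `fd` the descriptor number shown by a tool such as `lsof`; reading it requires sufficient privileges. This only helps for files that a running process still has open at the time.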

Azure sure deletes fast. It reminds me of that Adobe joke. A Mac needs something from Adobe installed to calm its nerves.

Also, Azure is apparently also really good in removing data quickly, but not at sending it over to replicas.