After the Instapaper Outage

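On February 9, Instapaper went down because its database file had grown past the 2TB single-file limit of the ext3 filesystem (Instapaper's own announcement called it an Extended Outage). The service was not fully restored until February 14.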

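For the past while I had been reading Homo Deus on my tablet and organizing my book notes, and had not opened Instapaper at all, so I was completely unaware of this major outage until I saw 灣區日報 and the post Instapaper Outage Cause & Recovery, published on Making Instapaper by Instapaper employee Brian Donohue.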

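In the post, Brian Donohue concisely explains the root cause and how the problem was resolved. A 2.5 TB database brought the system to a halt. Even more alarming, although Instapaper backed up the database every day, the same file size limit meant that each daily backup never exceeded 2TB, so the daily backups were not truly complete backups.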

The reason this limitation exists is because MySQL RDS instances created before April 2014 used an ext3 filesystem which has a 2TB file size limit. Instances created after April 2014 are backed by an ext4 filesystem and subject to a 6TB file size limit.

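One more point matters a great deal: AWS RDS provided no monitoring or alerting for this limit. For site operators who have staked everything on RDS, failing to plan routine housekeeping and Disaster Recovery in advance means the consequences are still worrying, even with the file size limit raised to 6TB!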

As far as we can tell, there’s no information in the RDS console in the form of monitoring, alerts or logging that would have let us know we were approaching the 2TB file size limit, or that we were subject to it in the first place. Even now, there’s nothing to indicate that our hosted database has a critical issue.
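Since the console gave no warning, an operator would have to watch the headroom themselves. Below is a minimal sketch of such a check in Python, not anything Instapaper or AWS actually ships. The function name `headroom_status` and the 70% and 90% thresholds are my own assumptions, and I treat 1 TB as 1024^4 bytes. On MySQL, the current size might be estimated from `data_length + index_length` in `information_schema.TABLES`, though that is only an approximation of the on-disk file size.

```python
TB = 1024 ** 4  # assumption: 1 TB treated as 1024^4 bytes

# File size limits quoted in the Instapaper post.
EXT3_LIMIT_BYTES = 2 * TB  # RDS MySQL instances created before April 2014
EXT4_LIMIT_BYTES = 6 * TB  # RDS MySQL instances created after April 2014


def headroom_status(file_bytes, limit_bytes, warn_at=0.7, critical_at=0.9):
    """Classify how close a database file is to its filesystem limit.

    The thresholds are hypothetical defaults; tune them to how fast
    the data actually grows.
    """
    if limit_bytes <= 0:
        raise ValueError("limit_bytes must be positive")
    used = file_bytes / limit_bytes  # fraction of the limit consumed
    if used >= critical_at:
        return "critical"
    if used >= warn_at:
        return "warning"
    return "ok"
```

Run daily as part of housekeeping, a check like this would flag a 1.5 TB file on an ext3-backed instance as "warning" well before the limit is hit, while the same file on an ext4-backed instance would still be "ok".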

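Brian Donohue admits that they did not have a good disaster recovery plan to begin with, so of course they could not have been running regular drills for major incidents. After this lesson, I expect Instapaper knows what to do.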


