Note: This is a re-upload from my last website; it's actually been ~9 months since I left Facebook.
I take a lot of pride in what I'm able to do at work. I think this dedication pays off, but I'll let you be the judge.
Context: We have accidental data-loss SEVs. When one hits, a significant portion of the response is spent investigating which objects are recoverable, and then pushing the recovery step by step through metadata repair, cold-storage restores, etc. This project:
- automated the restore process by streamlining all the steps needed to attempt recovery of the data from backups.
- saves on-call time, prevents human mistakes, and is well maintained, unlike the ad-hoc scripts individual team members used to write.
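To give a flavor of what "streamlining the steps" means, here is a minimal sketch of a step-driven restore runner. Everything here is hypothetical (the step names, the `restore` function, the success-only stubs); the real tool's steps and interfaces aren't described in the post. The core idea is that each object walks the same ordered step list, and the runner records exactly where any object stalls so on-call can resume rather than re-investigate.

```python
# Hypothetical sketch of an automated, resumable restore pipeline.
# Step names and functions are illustrative, not Everstore's real ones.

# Each step takes an object ID and returns True on success.
def metadata_repair(obj_id):
    return True  # stub: repair the object's metadata

def cold_storage_restore(obj_id):
    return True  # stub: pull the object's data back from cold storage

def verify(obj_id):
    return True  # stub: confirm the restored object is readable

STEPS = [
    ("METADATA_REPAIR", metadata_repair),
    ("COLD_STORAGE_RESTORE", cold_storage_restore),
    ("VERIFY", verify),
]

def restore(obj_ids):
    """Run every object through the restore steps in order,
    recording where each one stops so a rerun can pick up there."""
    progress = {}
    for obj_id in obj_ids:
        for step_name, run in STEPS:
            if not run(obj_id):
                progress[obj_id] = step_name  # stalled at this step
                break
        else:
            progress[obj_id] = "RECOVERED"
    return progress
```

The per-object progress map is what replaces the "investigate, then push step by step by hand" loop: a failed object surfaces with the step it stalled on already attached.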
Context: We have a lot of blobs. Like... exabytes. Sometimes we want to edit the metadata of records in the ~100k–1M range. Previously, to do that, you would write some filler code in a pipeline that was meant to handle state transitions (e.g., active → deleted), not anything along the lines of "if this blob is in this list of blobs, apply this edit." This project:
- allows an Everstore engineer to bulk-edit blobs via a simple, declarative binary, or to extend it with their own logic (like the example above). It safeguards us from making unexpected changes and lets the engineer do a "dry run," which previously wasn't possible.
- runs these edits in parallel, processing ~100k records per minute.
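The dry-run safeguard plus parallel fan-out can be sketched roughly like this. All names here (`bulk_edit`, `edit_fn`, `apply_fn`) are invented for illustration; the real tool is an internal binary whose interface the post doesn't specify. The key design point is separating the pure "compute the proposed change" step from the mutating "write it back" step, so a dry run exercises everything except the write.

```python
# Hypothetical sketch: parallel bulk metadata edits with a dry-run mode.
from concurrent.futures import ThreadPoolExecutor

def bulk_edit(blob_ids, edit_fn, apply_fn, dry_run=True, workers=32):
    """Compute the proposed edit for each blob in parallel; only
    write it back when dry_run is False. Returns {blob_id: proposal}
    so a dry run can be reviewed before the real pass."""
    def process(blob_id):
        proposal = edit_fn(blob_id)       # pure: compute the change only
        if not dry_run:
            apply_fn(blob_id, proposal)   # the actual mutation
        return blob_id, proposal

    with ThreadPoolExecutor(max_workers=workers) as pool:
        return dict(pool.map(process, blob_ids))
```

A reviewer can diff the dry-run output against expectations, then re-invoke with `dry_run=False`; the parallelism knob is what makes the ~100k-records-per-minute class of throughput plausible for I/O-bound edits.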
- Extended the configuration of Cold Storage backups, decreasing compaction overhead by 35% during high-stress COVID load. In absolute terms, that is about 1.2 EB of savings.
- Decreased our online queue of pending volume deadings from 3M to 300k by short-circuiting items with known failure modes.
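The short-circuiting idea above can be sketched as follows. This is my reconstruction under stated assumptions, not the actual implementation: I assume each queued deading carries its last failure signature, and that a small set of signatures is known to never succeed on retry, so those items can be resolved immediately instead of cycling through the queue forever.

```python
# Hypothetical sketch: drain a retry queue, short-circuiting items
# whose last error is a known-fatal signature (assumed set below).
KNOWN_FATAL = {"HOST_DECOMMISSIONED", "VOLUME_MISSING"}

def drain(queue, dead_fn):
    """Process pending deadings once. Items with a known-fatal last
    error are resolved immediately (short-circuit); the rest are
    attempted via dead_fn and requeued on transient failure."""
    resolved, requeued = [], []
    for item in queue:
        if item["last_error"] in KNOWN_FATAL:
            resolved.append(item["volume"])   # never retry: resolve now
        elif dead_fn(item["volume"]):
            resolved.append(item["volume"])   # retry succeeded
        else:
            requeued.append(item)             # transient: try again later
    return resolved, requeued
```

Removing the permanently-failing items from the retry loop is what collapses a queue dominated by known failures, which matches the 3M → 300k drop described above.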