After years of testing various approaches for detecting silent data corruptions (SDCs), Meta has outlined its approach for resolving the hardware issue.
SDCs are data errors that do not leave any record or trace in system logs. Sources of SDCs include datapath dependencies, temperature variance, and age, among other silicon factors. Since these data errors are silent, they can stay undetected within workloads and propagate across several services.
These errors can affect memory, storage, networking, and CPUs, causing data loss and corruption.
Meta engineers began testing three years ago because SDCs proved difficult to detect once components had already entered one of its production data centre fleets.
“We [needed] novel detection approaches for preserving application health and fleet resiliency by detecting SDCs and mitigating them at scale,” Meta engineer Harish Dattatraya Dixit said in a blog post.
According to its tests, Meta found the preferred way of detecting SDCs is to use both out-of-production and ripple testing.
Out-of-production testing is an SDC detection method that runs when machines go through a maintenance event such as a system reboot, kernel upgrade, or host provisioning. It piggybacks on these events so that tests can run for longer, enabling a “more intrusive nature of detection”.
Ripple testing, meanwhile, runs silent error detection alongside active workloads. It does this by shadow testing with workloads and intermittently injecting bit patterns with expected results across fleets and workloads, which Meta found detected SDCs faster than out-of-production testing.
This faster type of testing “ripples” through Meta’s infrastructure, allowing for test times that are 1,000 times shorter than out-of-production test runtimes.
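The idea of injecting bit patterns with expected results can be illustrated with a minimal known-answer test. This is only a sketch under stated assumptions, not Meta's implementation: the patterns, the `popcount` stand-in workload, and the `faulty_popcount` fault model are all hypothetical, chosen so the expected values can be checked by hand.

```python
from typing import Callable

# Hypothetical bit patterns paired with precomputed ("golden") results.
# Each expected value is the number of set bits in the 64-byte pattern.
KNOWN_ANSWERS = [
    (bytes([0x00]) * 64, 0),          # all zeros: no bits set
    (bytes([0xFF]) * 64, 512),        # all ones: 64 bytes * 8 bits
    (bytes([0xAA, 0x55]) * 32, 256),  # alternating bits: 4 bits per byte
]


def popcount(data: bytes) -> int:
    """Stand-in workload (an assumption): count the set bits in a buffer."""
    return sum(bin(byte).count("1") for byte in data)


def run_known_answer_test(compute: Callable[[bytes], int] = popcount) -> list[int]:
    """Return the indices of patterns whose result differs from the golden value."""
    return [
        index
        for index, (pattern, expected) in enumerate(KNOWN_ANSWERS)
        if compute(pattern) != expected
    ]


def faulty_popcount(data: bytes) -> int:
    """Hypothetical defective unit that silently drops one bit on the all-ones input."""
    result = popcount(data)
    return result - 1 if result == 512 else result
```

Calling `run_known_answer_test()` on a healthy machine returns an empty list, while `run_known_answer_test(faulty_popcount)` flags the all-ones pattern. No exception or log entry is produced by the faulty computation itself, which is why a comparison against a known answer is needed to surface the error.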
Meta engineers observed, however, that ripple testing could detect only 70% of fleet data corruptions, although it did so within 15 days. By comparison, out-of-production testing took six months to detect the same corruptions, along with others.
In explaining these benefits and tradeoffs, Dattatraya Dixit recommended that organisations with large-scale infrastructure should use both approaches to detect SDCs.
“We recommend using and deploying both in a large-scale fleet,” Dattatraya Dixit said.
“While detecting SDCs is a challenging problem for large-scale infrastructures, years of testing have shown us that [out-of-production] and ripple testing can provide a novel solution for detecting SDCs at scale as quickly as possible.”
When Meta engineers used both tests for detecting SDCs, they found all SDCs could eventually be detected. Meta said ripple testing detected 70% of SDCs within 15 days, out-of-production testing caught a further 23% within six months, and the remaining 7% was found through repeated ripple instances within its data centre fleets.
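The reported figures can be tallied to show how the two methods add up to full coverage. The sketch below assumes each percentage is a share of all SDCs found (the three figures sum to 100 on that reading); the stage labels and the `cumulative_coverage` helper are illustrative names, not Meta's.

```python
# Share of all detected SDCs attributed to each stage, as reported above.
# Assumption: each figure is a percentage of the total, not of a remainder.
STAGES = [
    ("ripple testing, 15 days", 70),
    ("out-of-production testing, six months", 23),
    ("repeated ripple instances", 7),
]


def cumulative_coverage(stages: list[tuple[str, int]]) -> list[tuple[str, int]]:
    """Return the running total of detected SDCs (in percent) after each stage."""
    running_total = 0
    totals = []
    for name, share in stages:
        running_total += share
        totals.append((name, running_total))
    return totals


for name, covered in cumulative_coverage(STAGES):
    print(f"{name}: {covered}% detected so far")
```

On this reading, ripple testing alone leaves 30% of SDCs undetected, and out-of-production testing narrows that gap to 7% before repeated ripple runs close it.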
To push further innovation in detecting SDCs, Meta has also announced it will provide five grants, each worth around $50,000, for academic research proposals in this field.