AMD EPYC 7002 Processors Have a Bug That Causes Freezing After 1044 Days
AMD has reported an unusual bug in its EPYC 7002 processors. After about 1044 days of continuous operation (roughly 2 years and 10 months), the processor may freeze, and the server then has to be rebooted. AMD has warned that the problem will not be fixed.
What Causes the Bug?
According to the manufacturer, a CPU core fails to exit the CC6 power-saving state if the last system reset was more than about 1044 days ago. The exact time to failure may vary with the REFCLK frequency.
Theory of Reddit User acid_migrain
Reddit user acid_migrain suggests that the problem actually appears not after 1044 days but after 1042 days and 12 hours. According to this theory, the hang occurs when the TSC (Time Stamp Counter), which counts clock cycles since the last reset, reaches 0x380000000000000 while running at 2800 MHz (2800 * 10**6 * 86400 * 1042.5, or about 2.52 * 10**17 cycles).
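The arithmetic behind this theory can be checked with a short Python sketch. The threshold and the 2800 MHz rate come from the Reddit post; the function name and the other rates printed are illustrative assumptions, not AMD figures. They only show how a different counter frequency would move the failure time.

```python
# Threshold quoted in acid_migrain's theory; note it equals 0x38 << 52.
TSC_THRESHOLD = 0x380000000000000
SECONDS_PER_DAY = 86_400


def days_until_threshold(tsc_hz: int, threshold: int = TSC_THRESHOLD) -> float:
    """Days of uptime before a counter ticking at tsc_hz reaches threshold."""
    return threshold / tsc_hz / SECONDS_PER_DAY


if __name__ == "__main__":
    # 2800 MHz is the rate assumed in the post; the other rates are
    # hypothetical and only illustrate the dependence on frequency.
    for mhz in (2000, 2800, 3000):
        days = days_until_threshold(mhz * 10**6)
        print(f"{mhz} MHz: {days:.2f} days")
```

At 2800 MHz the result is almost exactly 1042.5 days, which is consistent with the theory but does not confirm that the TSC is the actual cause.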
Workaround Suggested by AMD
As a workaround, AMD suggests that administrators either reboot the server at least once every 1044 days, which resets the CPU and restarts the 1044-day “timer”, or disable the CC6 power-saving state.
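For the reboot option, a minimal monitoring sketch in Python might look like the following. It assumes a Linux host, where the first field of /proc/uptime is the number of seconds since boot, and it treats uptime as an approximation of the time since the last system reset. The 30-day warning margin and all function names are hypothetical choices, not AMD recommendations.

```python
from pathlib import Path

LIMIT_DAYS = 1044        # figure quoted by AMD
WARN_MARGIN_DAYS = 30    # hypothetical safety margin; adjust as needed
SECONDS_PER_DAY = 86_400


def read_uptime_seconds(path: str = "/proc/uptime") -> float:
    """Return seconds since boot (first field of /proc/uptime on Linux)."""
    return float(Path(path).read_text().split()[0])


def days_remaining(uptime_seconds: float, limit_days: float = LIMIT_DAYS) -> float:
    """Days left before uptime reaches the limit."""
    return limit_days - uptime_seconds / SECONDS_PER_DAY


def should_schedule_reboot(uptime_seconds: float,
                           margin_days: float = WARN_MARGIN_DAYS) -> bool:
    """True once the remaining time falls within the warning margin."""
    return days_remaining(uptime_seconds) <= margin_days


if __name__ == "__main__":
    uptime = read_uptime_seconds()
    print(f"Days until the 1044-day mark: {days_remaining(uptime):.1f}")
    if should_schedule_reboot(uptime):
        print("Warning: schedule a reboot soon.")
```

A check like this could run from a daily cron job or be wired into an existing monitoring system, so the reboot can be planned for a maintenance window.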
AMD EPYC 7002 processors are powerful and reliable, but this bug could be a serious issue for servers that need to run for long periods without interruption. Since no fix is coming, administrators should apply one of the workarounds to make sure their servers are not affected.