Tolerating faults in modern processors

 

From the perspective of processor applications, reliability is as important as high performance. The main goal of our project is therefore to learn how to design a reliable processor.

 

First, many factors degrade the reliability of modern processors, including CMOS time-dependent dielectric breakdown (TDDB), design errors, coupling noise, and power supply noise. The source we are most concerned about is radiation-induced transient faults, for example single event upsets (SEUs). These faults have long been a concern in application domains such as spacecraft and aircraft, and they are becoming more severe even in ground-level applications. One well-known reason is CMOS technology scaling.

 

Second, to tolerate radiation-induced faults, we need fault models that describe how radioactive particle strikes affect modern processors.
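As an illustration only, one commonly used abstraction of a single event upset is a single bit flip in a stored word. The sketch below assumes that abstraction; the function name `inject_seu`, the 32-bit default width, and the uniform choice of bit position are hypothetical and are not taken from the project's own fault models.

```python
import random


def inject_seu(word, width=32, rng=None):
    """Flip one randomly chosen bit of `word` (assumed single-bit-flip model).

    Returns the corrupted word and the index of the flipped bit.
    """
    rng = rng or random.Random()
    bit = rng.randrange(width)          # pick a bit position in [0, width)
    return word ^ (1 << bit), bit       # XOR with a one-hot mask flips that bit


# Usage: corrupt a register value and confirm exactly one bit changed.
original = 0x1234
corrupted, bit = inject_seu(original, rng=random.Random(0))
changed_bits = bin(original ^ corrupted).count("1")
```

A seeded `random.Random` is passed in so that a fault injection run can be repeated exactly.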

 

Third, based on our examination of these fault models, we will develop mechanisms to handle the faults at run time. We have proposed an improved control flow checking approach that protects processors from violations of sequential execution.
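The general idea behind control flow checking can be sketched as follows. This is a generic, minimal illustration and not the improved approach proposed in this project: it assumes the program is divided into basic blocks, that the legal successors of each block are known in advance, and that the executed block sequence can be observed. The graph `CFG`, its block names, and the function `first_illegal_transition` are hypothetical.

```python
# Hypothetical control flow graph: block name -> set of legal successor blocks.
CFG = {
    "entry": {"loop"},
    "loop": {"loop", "exit"},
    "exit": set(),
}


def first_illegal_transition(trace, cfg):
    """Return the index of the first block reached by an illegal jump, or None."""
    for i in range(1, len(trace)):
        prev, cur = trace[i - 1], trace[i]
        # A transition is legal only if `cur` is a known successor of `prev`.
        if cur not in cfg.get(prev, set()):
            return i
    return None


# Usage: a fault-free run, and a run where a fault skips the loop entirely.
good_run = first_illegal_transition(["entry", "loop", "loop", "exit"], CFG)
bad_run = first_illegal_transition(["entry", "exit"], CFG)
```

In this sketch a legal trace yields `None`, while a trace containing a jump that the graph does not allow yields the position at which the violation is first observed. A check of this kind only detects errors that produce a transition outside the graph; a wrong but legal branch would pass unnoticed.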

 

Finally, we will evaluate the proposed mechanisms.