Simultaneous MultiThreading for Network Processors

Problem: As network processing applications become increasingly sophisticated and Internet traffic continues to grow in intensity, the design of future network processors will face major constraints. One of them is deep packet classification, a performance-critical function: network applications such as QoS, URL matching, virus detection, intrusion detection, and load balancing require deep packet classification to extract data from headers from Layer 3 to Layer 7. A second fundamental problem is security-related processing, which has become essential for Web switches and servers. In general, security-related processing is a CPU-intensive task that requires more CPU power than most other network applications. Furthermore, all of these network applications, including deep packet classification and security, will have to run at line rate in the network processors of the near future.

Most programmable network processors on the market today (such as the Intel® IXP2800) target relatively low-performance (100 Mbps to 10 Gbps), low-cost edge routers. They cannot easily cope with upcoming sophisticated network applications that must be processed at line rate. This means that routing in the backbone must be left to ASICs, and that new programmable network processors are needed to handle upcoming sophisticated network applications efficiently.

Objective: The goal of the work proposed in this document is to develop architectural solutions to the challenges posed by the workloads of typical network processor applications. Specifically, it will examine and improve upon SMT (Simultaneous MultiThreading) techniques for executing network applications of an irregular, dynamic, and unstructured nature.

Novelty: The data streams presented to network processors and to general-purpose CPUs exhibit different characteristics: network applications are data-intensive, and they exhibit irregularity caused by their branch-intensive nature, packet parallelism, and packet dependencies. All of these features can potentially be exploited to enhance the performance of SMT network processors.

Approach: To enhance the performance of SMT network processors, thread scheduling will be examined first. Although many thread scheduling schemes have been developed and applied to SMT processors, their performance has been found to remain well below the achievable peak because of incorrect speculation [1]. In the proposed work, advanced thread scheduling schemes will be studied in order to design an effective fetch unit for an SMT processor targeted at network applications. To gain further performance from improved thread scheduling, the inherent packet parallelism and packet dependencies will be exploited explicitly. For example, skipping threads that are stalled on packet dependencies and raising the priority of threads without packet dependencies are two immediately feasible approaches. Second, effective branch prediction strategies will be investigated, since network applications are branch-intensive and branch prediction in SMT processors is difficult: multiple independent threads share the branch prediction resources, so the threads may interfere with one another's predictions. Traditional schemes, such as bimodal branch prediction with a branch target buffer, will be evaluated to determine whether they remain valid for network applications on SMT processors, and branch handling schemes more advanced than those that exist for conventional applications will also be considered. Finally, memory models for network processors will be studied, for two reasons. Network applications process data packet by packet, and packet sizes do not necessarily fit the memory organization (in terms of block and page size). In addition, SMT inherently demands more bandwidth from the primary caches, because it allows many more loads and stores, which interfere significantly with the operation of the L1 cache.
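
As an illustration only, the two scheduling heuristics named above (skipping threads stalled on packet dependencies, and prioritizing threads without packet dependencies) can be sketched as a fetch-selection function. This is a minimal sketch, not the proposed design: the `Thread` fields, the `pick_fetch_threads` name, the fetch width of 2, and the tie-break on the number of in-flight instructions are all assumptions introduced for this example.

```python
from dataclasses import dataclass


@dataclass
class Thread:
    """Hypothetical per-thread state seen by the fetch unit (illustrative fields)."""
    tid: int
    in_flight: int               # instructions currently in the pipeline front end
    stalled_on_packet: bool      # waiting for an earlier packet to finish
    has_packet_dependency: bool  # its packet depends on another in-flight packet


def pick_fetch_threads(threads, width=2):
    """Return the ids of up to `width` threads to fetch from this cycle."""
    # Heuristic 1: skip threads that are stalled on a packet dependency.
    runnable = [t for t in threads if not t.stalled_on_packet]
    # Heuristic 2: threads without packet dependencies go first.
    # Assumed tie-break: fewer in-flight instructions, then lower thread id.
    runnable.sort(key=lambda t: (t.has_packet_dependency, t.in_flight, t.tid))
    return [t.tid for t in runnable[:width]]


threads = [
    Thread(tid=0, in_flight=5, stalled_on_packet=False, has_packet_dependency=True),
    Thread(tid=1, in_flight=9, stalled_on_packet=False, has_packet_dependency=False),
    Thread(tid=2, in_flight=1, stalled_on_packet=True, has_packet_dependency=False),
    Thread(tid=3, in_flight=3, stalled_on_packet=False, has_packet_dependency=False),
]
print(pick_fetch_threads(threads))  # [3, 1]
```

Thread 2 is skipped even though it has the fewest in-flight instructions, because it is stalled. Thread 0 is runnable but ranks behind the two dependency-free threads. A real fetch unit would evaluate such a policy in hardware every cycle, so the sort here stands in for a small priority network.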