| Emerging Software-Defined Network(SDN)is a radical innovation to the traditional distributed control plane architecture and vertically integrated network device model,which advocate the separation of the data plane and the control plane.However,SDN’s programmability is limited to the control plane,and the data plane still uses fixed-function forwarding hardware.For this issue,the Programmable Data Plane(PDP)and the P4 language were proposed.For the programmable data plane,the Reconfigurable Match Tables(RMT)forwarding model is adopted to define packet processing behaviors,such as programmable parsing,programmable matching,and programmable actions.And the P4 programming language allows for a quick and concise description of packet processing behaviors.The programmable forwarding chips are the implementations of the programmable data plane,which achieves high performance,programmability,appropriate chip area,and appropriate power consumption at the same time.The programmable data plane gains widespread attention from industry and academia.For industry,programmable forwarding chips can support the rapid deployment of new network functions without purchasing new equipment,which effectively reduces equipment procurement costs.For academia,a programmable data plane can support the deployment of new ideas in a real environment,which greatly accelerates innovation in network research.There are three research directions on the programmable data plane,which are research on forwarding architectures and forwarding algorithms,research on programming languages and compilers,and research on programmable data plane applications.The first research is the foundation of the other two,and its results directly affect the forwarding performance and programmability of the programmable data plane.The research on forwarding architectures and forwarding algorithms require hardware development experience and a hardware verification environment such as an FPGA platform,which has a high research threshold.With the rapid development of information technology such as big data,cloud computing,and 5G,network traffic has shown exponential growth in recent years.However,with the slowdown of Moore’s law,advances in process can not lead to a significant increase in performance anymore.To cope with the fast-growing network traffic,it is necessary to do innovations from the perspective of architectures and algorithms,which leads to new forwarding architectures and forwarding algorithms.This dissertation focuses on the ultra-high-speed packet processing requirement(Tbps per port),research forwarding architectures and forwarding algorithms of the next-generation programmable data plane.The research proposes new forwarding architectures and new forwarding algorithms to achieve higher packet processing performance,lower area overhead,and bigger table size while maintaining data plane programmability.The main contributions are as follows.1.The high-performance and programmable CRC architecture and algorithms are studied.There are three problems for the ultra-high-speed CRC computation of the next-generation programmable data plane,which are area overhead problem,variable-length padding problem and bus efficiency problem.This dissertation proposes a high-performance and programmable CRC architecture with segmented bus support to address the three problems.The architecture supports a wide bus to support high throughput.And three algorithms are proposed to address the area overhead problem and variable-length padding problem.The proposed architecture supports segmented bus to address the bus efficiency problem.Finally,high throughput and low area cost can be achieved together.The Cyclic Redundancy Check(CRC)algorithm is widely used in Ethernet protocol for error detection,and it is also the actual choice for hash computation in network forwarding chips.The ultra-high-speed next-generation programmable data plane pursues higher throughput,and a wider data bus inside the chip is required to meet the throughput requirements under the frequency constraint,which leads to the area overhead problem,variable-length padding problem and bus efficiency problem.The three problems make the CRC computation a bottleneck for a further improvement of the packet processing performance of the next-generation programmable data plane.For the area overhead problem,the stride-by-N algorithm is proposed,which establish a connection between the selection of the stride and the underlying hardware structure.And the proposed algorithm can fully utilize the hardware.For the FPGA implementation,the ICAP-based CRC parameter reprogramming algorithm is proposed to reduce the area overhead by reusing the FPGA configuration path.For the variable-length padding problem,the pipelining go back algorithm is proposed,which firstly sets all the padding bits to 0 and performs CRC computation using the complete bus word after padding.And then the algorithm performs the go back operation in the form of a pipeline.The number of go back steps depends on the length of the padding and the final result can be achieved when all the pipeline stages are traversed.For a bus of n bits,the area complexity of the pipelining go back algorithm is O(log2(n)),which is better than the O(n)area complexity of the traditional scheme.For the bus efficiency problem,a high-performance and programmable CRC computing architecture is proposed,which can supports a segmented bus and carry multiple data frames in one bus word.The system throughput can be guaranteed even in the worst case.The experimental results show that,compared with the state-of-the-art work with segmented bus support,the proposed architecture reduces the area overhead by 2.9%-20.8%,while the throughput increases by and 32.2%-80.2%.2.The architectures and algorithms of the high-performance and programmable parser are studied.This dissertation proposes HyperParser to meet the five parsing requirements of the packet processing of ultra-high-speed next-generation programmable data plane,which are high performance,low power consumption,low area overhead,low latency,and deep parsing.HyperParser leverages inverse butterfly network and butterfly network to meet the five requirements,which can support high performance,low power consumption,low area overhead,low latency,and wide data bus.The ultra-high-speed next-generation programmable data plane requires a significant improvement in PPALD(Performance,Power,Area,Latency,and parser Depth)of programmable parsers.HyperParser can support a simple segmented bus,which can guarantee the bus efficiency and low area overhead.The reduced inverse butterfly network is used to realize the bit extraction operation and the butterfly network is used to extract the concerned fields.Compared with Crossbar,inverse butterfly network and butterfly network reduce the area overhead significantly,and the high performance,low latency,low power consumption,and deep parsing can be guaranteed.The HyperParser+is proposed to address the problem of the limited parsing capacity of HyperParser.The parsing capacity is limited by the complexity of BV-TCAM,and the HyperParser+divides the rule set into multiple subsets to reduce the complexity of BV-TCAM.As a result,the parsing capacity becomes bigger.The bit selection algorithm is proposed to execute the bit selection operation.The algorithm combines preprocessing and heuristic column removal algorithm,and it can reduce the number of selected bits by 73%-92%.For the routing of the inverse butterfly network and the butterfly network,the inverse-iteration routing algorithm is proposed,which can produce the configuration of the crosspoint switches in the inverse butterfly and butterfly networks.The FPGA implementation of HyperParser outperforms existing works in throughput by 17.5%~136.6%,and the ASIC implementation of HyperParser can achieve a 4x throughput compared with the state-of-the-art.HyperParser also outperforms existing works in power consumption,area,latency,and parsing depth.3.The algorithm and architecture of minimal perfect hashing with high-speed construction and high-speed lookup are studied.This dissertation proposes the Cuckoo Hashing Assisted Minimal Perfect Hashing(CHAMPH)framework to meet the requirements of the packet processing of ultra-high-speed next-generation programmable data plane for high-speed construction and high-speed lookup of big tables.Both the construction and lookup can be accelerated by hardware,which can achieve high-speed construction and high-speed lookup of the minimal perfect hash table.The minimal perfect hashing can achieve one off-chip access and low on-chip memory overhead,which can meet the demand of next-generation programmable data plane for high throughput and large table size.Perfect hashing deployed to programmable data planes requires optimization in terms of construction time and lookup speed.Existing works map the dataset to non-uniform buckets,making it difficult to accelerate the construction process by hardware.And it is also difficult to deploy lookup operations in hardware to achieve high-speed lookups.To address the above problem,the Cuckoo Hashing Assisted Minimal Perfect Hashing(CHAMPH)framework is proposed.In the construction phase,the Cuckoo hashing is used to realize uniform bucket mapping,and the RecSplit algorithm is used to implement the minimal perfect hash function in the bucket.For a billion-scale dataset,uniform buckets can be achieved within a four-step kick operation.In addition,the FPGA-based uniform bucket mapping architecture is proposed.Compared with the software implementation,the FPGA acceleration scheme can reduce the bucket mapping time by 95%.The FPGA acceleration architecture of the RecSplit construction is proposed,which realizes 8192 hash functions in parallel and reduces the construction time by 99.83%compared with the 8-thread software implementation.The CHAMPH lookup architecture can be deployed in FPGA and ASIC.The storage efficiency and lookup speed of exponential Golomb coding are difficult to guarantee at the same time.To address this problem,a multi-level bitmap pipeline is used to store the unary part,and an odd-even memory is used to store the fixed part,which can guarantee high performance and low area cost.To address the DRAM bank access collision problem,the fair bank round-robin algorithm is proposed.The experimental results show that the proposed CHAMPH algorithm reduces the construction time by 63.6%compared with the state-of-the-art,and the proposed CHAMPH lookup architecture improves the lookup performance by 70%compared with the state-of-the-art. |