| The microprocessor is the core of an embedded system. Based on RISC32E, a 32-bit embedded processor designed by the author of this thesis, we analyze the frequency and performance bottlenecks of scalar processors and propose a heterogeneous dual-issue media architecture framework called POLLUX. The research presented in this thesis mainly concerns the design of the POLLUX pipeline microarchitecture, branch prediction structure, and multimedia datapath. A higher clock frequency and a more advanced architecture are the two ways to increase the performance of an embedded microprocessor. The memory wall restricts the frequency of short pipelines, and the maximum throughput of a scalar processor is bounded by one instruction per cycle. The author partitioned the POLLUX pipeline around memory access and designed separate multimedia and integer pipelines to implement a dual-issue architecture with out-of-order execution. A high-performance data bypassing network and a novel coarse-grained distributed pipeline control strategy were designed to resolve pipeline interlocks. A low-cost reorder buffer was designed to ensure precise exceptions. Experiments show that, with TSMC13G technology, POLLUX can run at 400 MHz in the worst case and 580 MHz in the typical case. Furthermore, POLLUX achieves 1.4 DMIPS/MHz and has strong media data processing capability. Branch instructions are increasingly important in determining overall machine performance as recent processors exploit increasing degrees of instruction-level parallelism (ILP). To minimize the branch penalty and maximize instruction flow throughput, POLLUX incorporates a dynamic branch prediction mechanism to exploit ILP. The author evaluated the power, area, and performance of different branch predictors, concluded that the bimodal and gshare predictors are well suited to implementation in an embedded processor, and designed a software-configurable multi-mode predictor. 
Experiments show that the dynamic branch prediction mechanism achieves 91% prediction accuracy at a cost of 13.9 kgates. The multimedia datapath is an important component of the POLLUX microarchitecture. Based on the POLLUX media instruction extension, this thesis presents a general, standard-cell-based optimization process at the architecture level for datapath design, targeting low delay and low power. The proposed method was applied to the split multiply-accumulate (MAC) unit in the multimedia datapath. Experiments show that the optimized MAC improves speed by 33.6% while reducing power by 27.1%. |
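
To make the two predictor types named above concrete, the following is a minimal behavioral sketch of a bimodal and a gshare predictor sharing one table of 2-bit saturating counters, with a flag that switches between the two modes. It is not the POLLUX implementation: the class name `TwoBitPredictor`, the table size, the history length, the initial counter state, and the way the mode is selected are all assumptions chosen for illustration.

```python
class TwoBitPredictor:
    """Behavioral model of a bimodal or gshare branch predictor.

    All sizes and the mode flag are illustrative assumptions,
    not the configuration used in POLLUX.
    """

    def __init__(self, index_bits=10, use_gshare=False):
        self.mask = (1 << index_bits) - 1
        self.use_gshare = use_gshare
        self.history = 0                       # global branch history register
        self.table = [1] * (1 << index_bits)   # 2-bit counters, start weakly not-taken

    def _index(self, pc):
        idx = (pc >> 2) & self.mask            # drop byte offset of 4-byte instructions
        if self.use_gshare:
            idx ^= self.history                # gshare: XOR the PC bits with global history
        return idx

    def predict(self, pc):
        # Counter values 2 and 3 mean "predict taken".
        return self.table[self._index(pc)] >= 2

    def update(self, pc, taken):
        i = self._index(pc)
        if taken:
            self.table[i] = min(3, self.table[i] + 1)
        else:
            self.table[i] = max(0, self.table[i] - 1)
        # Shift the outcome into the history register (kept in both modes for simplicity).
        self.history = ((self.history << 1) | int(taken)) & self.mask


def accuracy(predictor, trace):
    """Fraction of correct predictions over a list of (pc, taken) pairs."""
    correct = 0
    for pc, taken in trace:
        if predictor.predict(pc) == taken:
            correct += 1
        predictor.update(pc, taken)
    return correct / len(trace)
```

As a usage example, feeding both modes a synthetic trace of a single branch that strictly alternates between taken and not-taken (for instance `[(0x40, i % 2 == 0) for i in range(200)]`) shows the qualitative difference: the bimodal mode keeps mispredicting because its counter oscillates around the threshold, while the gshare mode learns the pattern once the history register has warmed up. Such a toy trace says nothing about the 91% accuracy reported above, which was measured on the author's own experiments.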