Tiled Processors can address the power, wire-delay, and design-complexity problems of the nanometer chip technology era, exploiting the growing on-chip transistor resources to improve application performance. However, they place new demands on cache design. On one hand, a distributed cache architecture is needed to serve parallel memory access requests from a large number of processor cores. On the other hand, a distributed memory-dependence disambiguation mechanism is needed to guarantee correct memory access ordering. Compared with traditional centrally controlled caches, Non-Uniform Cache Architecture is well suited to these demands. Therefore, this thesis focuses on the design of Non-Uniform Caches applied to the Level-Two and Level-One Caches of a Tiled Processor, respectively. Furthermore, we optimize the Non-Uniform Level-One Cache according to the specific load behaviors of Tiled Processors, in a scheme we call the Load Execution Localization Model; finally, we implement and evaluate the model. This research will guide further design and implementation of cache architectures on Tiled Processors.

In this thesis, the design of and optimizations to the Level-One and Level-Two Caches target the TPA-PI (Tiled Processor Architecture - Processor for ILP) platform, a new processor proposed and studied by my colleagues in the laboratory. The detailed work includes the following aspects. (1) We design a Non-Uniform Level-Two Cache for TPA-PI, covering the static data mapping algorithm, the on-chip interconnect, the internal structure of a cache bank, the transaction processing logic, and the pipeline design. Based on this design, we implement a Level-Two Cache simulator in C, written in a signal-accurate style that can be easily transformed into a hardware design. (2) We optimize the long load-transfer latency of the Non-Uniform Level-One Cache on TPA-PI.
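Contribution (1) mentions a static data mapping algorithm for the banked Level-Two Cache. The thesis does not give its exact mapping here, so the following is only a minimal sketch of one common static scheme, line-interleaved bank selection from low-order address bits; the bank count and line size are assumptions for illustration:

```c
#include <stdint.h>

#define NUM_BANKS 16   /* assumed: one L2 bank per tile in a 4x4 grid */
#define LINE_BITS 6    /* assumed: 64-byte cache lines */

/* Statically map a physical address to an L2 bank by taking the
 * bits just above the line offset (cache-line interleaving), so
 * consecutive lines spread across consecutive banks. */
static unsigned bank_of(uint64_t paddr) {
    return (unsigned)((paddr >> LINE_BITS) & (NUM_BANKS - 1));
}
```

For example, addresses 0x1000 and 0x10C0 fall three cache lines apart and map to banks 0 and 3. Because the mapping is a fixed function of the address, every tile can compute a line's home bank locally, with no directory lookup on the request path.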
We first profile load behaviors, which leads to the observation that performance can be gained by eliminating the long latency between the load-issuing side and the data-holding side. We then propose an optimized model that localizes load execution, together with several replication strategies and a coherence-maintenance method that control the overhead induced by cache copying and store multicast. Finally, we evaluate the model through functional and timing simulation; the results show that the basic model gains a 5.72% performance improvement on average. We also observe that while the cache-copy overhead does not affect the cache hit rate much, the store-multicast overhead is more critical to performance.

We draw the following conclusions from this work. (1) The Non-Uniform Level-Two Cache is loosely coupled with TPA-PI and thus can be designed separately. (2) The Non-Uniform Level-One Cache is tightly coupled with the TPA-PI architecture and execution model, and the key to performance is reducing the routing latency and communication overhead in the cache architecture.
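The localization model above replicates data near the load-issuing tile and multicasts stores to keep replicas coherent. A toy sketch of this interplay, with invented names and sizes (the thesis's actual structures and replication strategies are not specified here), might look like:

```c
#include <stdint.h>

#define NUM_TILES 4   /* assumed tile count for illustration */
#define NUM_LINES 8   /* toy address space of 8 cache lines */

/* sharers[l] has bit t set when tile t holds a replica of line l. */
static uint8_t  sharers[NUM_LINES];
static uint32_t data[NUM_TILES][NUM_LINES];  /* per-tile replica storage */

/* Load on tile t: hit locally if a replica exists; otherwise copy the
 * line from its home tile and record the replica, so later loads by
 * this tile execute locally instead of crossing the interconnect. */
static uint32_t load(int t, int line, int home) {
    if (!(sharers[line] & (1u << t))) {
        data[t][line] = data[home][line];    /* cache-copy overhead */
        sharers[line] |= 1u << t;
    }
    return data[t][line];
}

/* Store: multicast the new value to every tile holding a replica --
 * this is the store-multicast overhead the evaluation found to be
 * more critical to performance than the copy overhead. */
static void store(int line, uint32_t value) {
    for (int t = 0; t < NUM_TILES; t++)
        if (sharers[line] & (1u << t))
            data[t][line] = value;
}
```

The sketch shows the trade-off directly: each replica created by a load shortens that tile's future load latency, but also lengthens the sharer list that every subsequent store must update.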