| The LegUp High-Level Synthesis (HLS) tool synthesizes multi-threaded software into parallel hardware, in which parallel software threads are realized as concurrently operating hardware units. A common performance bottleneck in any parallel implementation is memory bandwidth: parallel threads demand concurrent access to memory, and the resulting contention hurts performance. FPGAs contain an abundance of independently accessible memories, offering high internal memory bandwidth. We describe an approach for leveraging this bandwidth when synthesizing parallel software into hardware. Our approach applies trace-based profiling to determine how a program's arrays should be automatically partitioned into sub-arrays, which are then implemented in separate RAM blocks within the target FPGA. As a result, each thread, when implemented in hardware, has exclusive access to its own memories to the extent possible, which significantly reduces contention and arbitration and thus raises performance. |
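
To make the idea of trace-based partitioning concrete, the following is a minimal sketch, not the algorithm used in LegUp. It assumes a hypothetical trace format of `(thread_id, index)` pairs for a single array, and it groups contiguous elements touched by exactly one thread into sub-arrays that could each be placed in a separate RAM block. The names `partition_from_trace` and `SHARED` are illustrative assumptions.

```python
from collections import defaultdict

SHARED = None  # assumed marker: element touched by several threads, or by none


def partition_from_trace(trace, array_len):
    """Return (start, end_exclusive, owner) ranges for one array.

    `trace` is an iterable of (thread_id, index) pairs, a hypothetical
    format standing in for whatever a real profiler would record.
    """
    # Record the set of threads that accessed each element.
    touched = defaultdict(set)
    for thread_id, index in trace:
        touched[index].add(thread_id)

    def owner_of(index):
        threads = touched.get(index, set())
        # Exactly one accessing thread means the element is exclusive to it.
        return next(iter(threads)) if len(threads) == 1 else SHARED

    # Merge neighbouring elements with the same owner into one sub-array.
    partitions = []
    start = 0
    for i in range(1, array_len + 1):
        if i == array_len or owner_of(i) != owner_of(start):
            partitions.append((start, i, owner_of(start)))
            start = i
    return partitions
```

For an 8-element array where thread 0 accesses elements 0 to 4 and thread 1 accesses elements 4 to 7, the sketch yields three ranges: one exclusive to each thread and a single shared element at index 4, which would still require arbitration.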