| In the era of big data, applications across many industries are growing explosively, and the complexity of statistical analysis is increasing with them. Traditional computing technology and the processing capability of current information systems fall far short of users’ demands. With the rapid development of high-performance computing platforms, more and more statisticians are combining statistical methods with big data platforms or optimizing language implementations for better compatibility with high-performance computing platforms. The R language was developed by statisticians and is widely used around the world, but like other interpreted languages it suffers from many performance bottlenecks. Parallelism is one good way to improve its performance. However, developing a parallel program is not as easy as writing a serial one: it requires both the skill to implement the program's functionality and a solid understanding of distributed systems and parallel programming. Programmers face a huge gap between high-level algorithm design and low-level debugging of distributed programs, and frameworks that narrow this gap are badly needed. In this paper, we first analyze the characteristics of the R language. Because R is a young language with no systematic reference, we studied its source code as the most authoritative documentation and describe it in detail, focusing on its type system, vectorized programming style, and functional-language properties. We next use GNU R as the research object to discuss its runtime execution framework; clarifying the responsibilities of the runtime stage provides the theoretical basis for the subsequent runtime optimization. We also propose a system design for distributed parallel programming in R (Rdp). Providing several parallel function interfaces as the programming model, we divide the parallel runtime framework into four levels, from top to bottom: parallel applications, parallel interfaces, the runtime environment, and
the underlying hardware resources. Using the Message Passing Interface (MPI) as the main implementation, we give the layout of the system and the functional design of the API. We then describe the implementation of the system according to the hierarchical design mentioned above. The parallel programming APIs are abstracted into three categories according to their operating granularity: the bottom level, which senses the system; the middle level, which schedules worker processes; and the high level, which provides application interfaces. The MPI functionality in Rdp is implemented in C to optimize resource distribution and task scheduling. The apply function family is chosen as the main optimization target because the goal is to preserve users' coding habits while providing a user-friendly interface for high-performance computing. In the remainder of the paper, we first run several benchmarks to analyze the performance of the main functions in Rdp against their counterparts in the snow and parallel packages. The results show that Rdp performs better in handling large-scale data sets, in scalability, and in load balancing. Then the profiling tool traceR is used to analyze runtime behavior and memory consumption when running those functions. Based on this analysis, several corresponding optimization strategies are given. Finally, an application test with the TWIX package is run to demonstrate the usability and efficiency of Rdp. |