| Many domains, such as astronomy, information retrieval, and social networks, are experiencing a data explosion. The knowledge buried in these enormous datasets is so valuable that the ability to apply sophisticated statistical analysis methods to such data is becoming more and more essential. As the most popular statistical language, R provides rich functionality for data analysis, but it simply fails when the data becomes too large. We propose a generic framework, JRBridge, which integrates R with JVM-based open source computational infrastructures. Integrating R with Hadoop on this framework improves R's capacity for large-scale statistical computing. To benefit from the latest research achievements in computing frameworks and programming models for large-scale data processing, our approach is to integrate the R language with computational infrastructures. Based on a detailed analysis of how to integrate R with popular open source infrastructures, we propose the SF and UDF integration models. Because most open source computational infrastructures are JVM-based and their default APIs are written in Java, we designed JRBridge as a JVM-based framework to make integration easy. A JVM-based R language interpreter interprets and executes R code embedded in Java; a Java class loader and executor in R, exposed through jload, import, and the $ operator, makes it possible to call methods in Java libraries; and the R2J and J2R type converters automatically perform type conversion on context switches. With these cooperating mechanisms and the plugins for integrating R with Hadoop, large-scale statistical computing becomes easy to handle in R. The HDFS plugin provides a way to store and access datasets with millions of objects in HDFS, and the MapReduce plugin provides a natural environment for coding MapReduce algorithms in R. 
On a Hadoop cluster with 5 worker nodes, the running time of wordcount in JRBridge is reduced to nearly 1/7 of that in original R. The experimental results show that JRBridge scales linearly with the size of the dataset and thus provides a scalable solution for large-scale statistical computing in R. |