Performance Measurement of Apache Spark with the H2O Library Using the HiBench Benchmark Based on Cloud Computing

Authors

  • Aminudin Aminudin, Department of Informatics Engineering, Universitas Muhammadiyah Malang
  • Eko Budi Cahyono, Department of Informatics Engineering, Universitas Muhammadiyah Malang

DOI:

https://doi.org/10.25126/jtiik.2019651520

Abstract

Apache Spark is a platform for processing relatively large data sets (big data) by distributing the data across the nodes of a designated cluster, an approach known as parallel computing. Compared with similar frameworks such as Apache Hadoop, Apache Spark has the advantage of processing data as a stream: data entering the Apache Spark environment can be processed immediately, without waiting for other data to accumulate. To enable machine learning on Apache Spark, this paper reports an experiment that integrates Apache Spark, acting as the large-scale parallel data processing environment, with the H2O library, which is dedicated to processing data with machine learning algorithms. In tests in a cloud computing environment, Apache Spark was able to process weather data from NCDC, the largest weather data archive, at data sizes of up to 6 GB. The data were processed with deep learning, one of the machine learning models available through the H2O library, with the work divided among several nodes set up in the cloud computing environment. This outcome is reflected in the measured test parameters: running time, throughput, average memory usage, and average CPU usage, all obtained from the HiBench benchmark. All of these values are influenced by the amount of data and the number of nodes.
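As a rough illustration of how two of the reported metrics relate, the sketch below derives throughput from input size and running time, assuming the usual HiBench convention of bytes processed per second, with a per-node figure obtained by dividing by the node count. The function name, the 6 GiB input, the 300-second running time, and the three nodes are hypothetical values chosen for illustration, not results from the paper.

```python
def throughput(input_bytes: int, duration_s: float, nodes: int = 1) -> tuple:
    """Return (bytes per second, bytes per second per node).

    Assumes throughput is defined as input size divided by running time,
    and per-node throughput as that value divided by the node count.
    """
    if duration_s <= 0:
        raise ValueError("duration_s must be positive")
    if nodes <= 0:
        raise ValueError("nodes must be positive")
    total = input_bytes / duration_s
    return total, total / nodes


GIB = 1024 ** 3
MIB = 1024 ** 2

# Hypothetical run: 6 GiB of input processed in 300 s on 3 nodes.
total, per_node = throughput(6 * GIB, 300.0, nodes=3)
print(f"{total / MIB:.2f} MiB/s total, {per_node / MIB:.2f} MiB/s per node")
```

Under this definition, a longer running time for the same input lowers throughput, and adding nodes raises total throughput only if the running time drops accordingly, which is consistent with the abstract's remark that the metrics depend on both data volume and node count.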


Published

08-10-2019

Section

Computer Science

How to Cite

Pengukuran Performa Apache Spark dengan Library H2O Menggunakan Benchmark Hibench Berbasis Cloud Computing. (2019). Jurnal Teknologi Informasi Dan Ilmu Komputer, 6(5), 519-526. https://doi.org/10.25126/jtiik.2019651520