Big data execution time based on Spark Machine Learning Libraries
Title | Big data execution time based on Spark Machine Learning Libraries |
Publication Type | Conference Paper |
Year of Publication | 2019 |
Authors | Garate-Escamilla AKaren, Hassani AHajjam El, Andres E |
Conference Name | Proceedings of the 2019 3rd International Conference on Cloud and Big Data Computing (ICCBDC 2019) |
Publisher | Association for Computing Machinery |
Conference Location | 1515 BROADWAY, NEW YORK, NY 10036-9998 USA |
ISBN Number | 978-1-4503-7165-0 |
Keywords | Apache Spark, Execution time prediction, Machine learning, Performance prediction model |
Abstract | The paper explores the time consumption of supervised and unsupervised models of the Apache Spark framework on massive datasets. Big data analytics has become relevant in industry because of the need to convert information into knowledge. Among the challenges of big data is the creation of strategies to reduce the execution cost of running machine learning models to make a prediction. Apache Spark is a powerful in-memory platform that offers an extensive machine learning library for regression, classification, clustering, and rule extraction. This investigation, from a computation cost perspective, performs different experiments using real datasets. The main contribution of the paper is a comparison of the execution time of different machine learning models, such as random forests, decision trees, logistic regression, linear support vector machines, and kNN. The work aims to combine the areas of big data and machine learning, comparing the results across different configurations and with the optimization methods cache and persist. The evaluation experiments show that logistic regression had the shortest execution time of the Spark MLlib models. |
DOI | 10.1145/3358505.3358519 |
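
The execution-time comparison described in the abstract can be illustrated with a small timing harness. The following is a minimal sketch in plain Python, not the authors' experimental code: the helper names `time_call` and `rank_by_median` are hypothetical, and the workloads are stand-ins (`time.sleep`) for model-fitting calls.

```python
import statistics
import time
from typing import Callable, Dict, List, Tuple


def time_call(fn: Callable[[], object], repeats: int = 3) -> List[float]:
    """Run fn `repeats` times; return each run's wall-clock duration in seconds."""
    durations = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        durations.append(time.perf_counter() - start)
    return durations


def rank_by_median(
    candidates: Dict[str, Callable[[], object]], repeats: int = 3
) -> List[Tuple[str, float]]:
    """Time every candidate; return (name, median_seconds) pairs, fastest first."""
    medians = {
        name: statistics.median(time_call(fn, repeats))
        for name, fn in candidates.items()
    }
    return sorted(medians.items(), key=lambda item: item[1])


# Stand-in workloads (hypothetical); replace them with model-fitting calls.
workloads = {
    "short_job": lambda: time.sleep(0.01),
    "long_job": lambda: time.sleep(0.05),
}
for name, seconds in rank_by_median(workloads):
    print(f"{name}: {seconds:.3f} s")
```

With PySpark, a candidate could be something like `lambda: LogisticRegression().fit(train_df)`, where `train_df` is an assumed training DataFrame. To compare cached and uncached runs, one would call `train_df.cache()` or `train_df.persist()` and trigger an action such as `train_df.count()` before timing, since Spark only materializes cached data when an action runs.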