Big data execution time based on Spark Machine Learning Libraries

Title: Big data execution time based on Spark Machine Learning Libraries
Publication Type: Conference Paper
Year of Publication: 2019
Authors: Garate-Escamilla AKaren, Hassani AHajjam El, Andres E
Conference Name: PROCEEDINGS OF 2019 3RD INTERNATIONAL CONFERENCE ON CLOUD AND BIG DATA COMPUTING (ICCBDC 2019)
Publisher: ASSOC COMPUTING MACHINERY
Conference Location: 1515 BROADWAY, NEW YORK, NY 10036-9998 USA
ISBN Number: 978-1-4503-7165-0
Keywords: Apache Spark, Execution time prediction, Machine learning, Performance prediction model
Abstract

The paper explores the execution time of supervised and unsupervised models in the Apache Spark framework on massive datasets. Big Data analytics has become relevant in industry because of the need to convert information into knowledge. Among the challenges of big data is the creation of strategies to reduce the execution cost of running machine learning models to make predictions. Apache Spark is a powerful in-memory platform that offers an extensive machine learning library for regression, classification, clustering, and rule extraction. This investigation performs several experiments on real datasets from a computational cost perspective. The main contribution of the paper is a comparison of the execution time of different machine learning models, such as random forest, decision tree, logistic regression, linear support vector machine, and kNN. The work aims to combine the areas of big data and machine learning, comparing results across different configurations and with the optimization methods cache and persist. The evaluation experiments show that logistic regression had the shortest execution time of the Spark MLlib models.

DOI: 10.1145/3358505.3358519