D. Text Mining and Machine Learning with Apache Spark
Apache Spark is currently one of the most popular open-source cluster-computing frameworks. With its Machine Learning Library (MLlib), it makes it easy to scale a range of feature extraction and machine learning tasks commonly employed in text mining. Furthermore, it offers interfaces for both Python and R.
The tutorial will first cover the basics of using an Apache Spark cluster for text mining and machine learning. It will then walk through the text classification solution developed within the framework of the Hungarian leg of the Comparative Agendas Project (with the support of the MTA SZTAKI Cloud team) as a use case illustrating the possibilities opened up by the increased speed of parallel computing.
The tutorial will address, among other things: a) configuring the Apache Spark cluster, b) using a Hadoop Distributed File System (HDFS) with the cluster, c) operating the cluster via an RStudio Server and sparklyr (the Spark interface for R developed by RStudio), and d) the differences in the functionality of the Machine Learning Library available through sparklyr, SparkR (the R API developed by Apache Spark), and PySpark (the Python API for Spark).
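Steps (a) and (b) can be sketched roughly as follows for a standalone Spark deployment. The hostnames, file names, and paths are illustrative assumptions, and the exact helper-script names vary slightly across Spark versions:

```shell
# Assumes a Spark binary distribution unpacked at $SPARK_HOME on every node
# and an HDFS service already running; names below are placeholders.

# (a) Start a standalone master on the head node (web UI on port 8080)
$SPARK_HOME/sbin/start-master.sh

# On each worker node, register a worker with the master
$SPARK_HOME/sbin/start-slave.sh spark://master-host:7077

# (b) Put the corpus on HDFS so every worker reads the same data
hdfs dfs -mkdir -p /user/analyst/corpus
hdfs dfs -put speeches.csv /user/analyst/corpus/
```

From R, such a cluster would then typically be reached with sparklyr's `spark_connect(master = "spark://master-host:7077")`.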
About the instructor
Zoltan Kacsuk holds a doctoral degree from Kyoto Seika University. He is a postdoctoral researcher at the Japanese Visual Media Graph project, Institute for Applied Artificial Intelligence, Stuttgart Media University, and is also a part-time research fellow at the Department of Government and Public Policy, Institute for Political Science, HAS Centre for Social Sciences.
For the participants
The easiest way to follow along with the tutorial will be to have an activated Google Cloud account (the free trial version will be enough). Another option is to have an Ubuntu 18.04 system on a multi-core machine (4+ cores/CPUs), available either locally or remotely via SSH, with root privileges and internet connectivity.