Distributed architectures for big data processing and analytics (2019/2020)
Table of contents
- General information
- Exam rules
- Announcements
- Slides
- Exercises
- Practices
- Exam Examples
- Additional material
Note that this is the web page for the academic year 2019/2020.
General information
- ECTS: 8
- Professor: Paolo Garza
- Teaching assistant: Martino Trevisan
Exam rules
- Exam rules Academic Year 2019-2020 – ONLINE EXAMINATION SESSION (pdf)
- Exam rules Academic Year 2019-2020 (pdf)
Announcements
- (24/02/2020)
- No lab activities during the first week
Slides
- Video lectures:
- Teaching portal / Material / Virtual classroom
- or
- Teaching portal / Material / Link relativi al corso ("Course-related links") -> Links to the Dropbox copy of the video lectures
- Introduction to the course content and exam rules (2 slides per page, 6 slides per page)
- Question session – March 10, 2020
- Introduction to Big Data (2 slides per page, 6 slides per page)
- Big Data Architectures (2 slides per page, 6 slides per page)
- Hadoop and MapReduce
- Introduction to Apache Hadoop and the MapReduce programming paradigm (2 slides per page, 6 slides per page)
- Interaction with HDFS and Hadoop by means of the command line (2 slides per page, 6 slides per page)
- Hadoop implementation of MapReduce (2 slides per page, 6 slides per page)
- Source code of the Word Count Eclipse project (WordCount.zip) – Use the "import Maven project" option to import it into Eclipse
- PDF version of the code (i.e., PDF version of the java files) (WordCountPDF.zip)
- BigData@Polito environment + Jupyter – How to submit MapReduce jobs on BigData@Polito (2 slides per page, 6 slides per page)
- MapReduce – Design patterns – Part 1 (2 slides per page, 6 slides per page)
- MapReduce and Hadoop – Advanced Topics: Multiple inputs, Multiple outputs, Distributed cache (2 slides per page, 6 slides per page)
- Updated on April 20, 2020 with some more details on the Distributed cache topic
- MapReduce – Design patterns – Part 2 (2 slides per page, 6 slides per page)
- MapReduce – Relational Algebra/SQL operators (2 slides per page, 6 slides per page)
- Spark
- Introduction to Apache Spark (2 slides per page, 6 slides per page)
- How to submit Spark applications (2 slides per page, 6 slides per page)
- How to use Jupyter notebooks for your Spark applications (2 slides per page, 6 slides per page)
- A useful online tutorial for those who want to install and run Spark locally on their PCs (tested for Linux)
- “How to use PySpark on your computer” by Favio Vázquez (link)
- Some comments and hints
- Download the following pre-built version of Spark from spark.apache.org: spark-2.4.5-bin-hadoop2.7.tgz
- Make sure to install and use Python 3
- Install Java 8 (Spark 2.4.5 runs on Java 8)
- e.g., sudo apt-get install openjdk-8-jdk
- and then set the JAVA_HOME variable in your environment to the folder containing Java 8
- e.g., export JAVA_HOME="/usr/lib/jvm/java-8-openjdk-amd64/"
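The hints above can be sanity-checked before launching Spark locally. The following is only a minimal sketch using the Python standard library: the function name check_spark_prerequisites is hypothetical, and the test on the JAVA_HOME folder name is a rough heuristic that assumes the folder is named like the example above (containing "java-8" or "1.8"). It does not verify the installed Java version itself.

```python
import os
import sys


def check_spark_prerequisites(env, python_version):
    """Return a list of problems found; an empty list means the checks passed.

    env            -- a mapping of environment variables (e.g., os.environ)
    python_version -- a tuple-like version (e.g., sys.version_info)
    """
    problems = []

    # Hint from the page: install and use Python 3.
    if python_version[0] != 3:
        problems.append("Python 3 is required")

    # Hint from the page: JAVA_HOME must point to the folder containing Java 8.
    java_home = env.get("JAVA_HOME", "")
    if not java_home:
        problems.append("JAVA_HOME is not set")
    elif "java-8" not in java_home and "1.8" not in java_home:
        # Assumption: the folder name reveals the Java version (heuristic only).
        problems.append("JAVA_HOME does not look like a Java 8 folder: " + java_home)

    return problems


if __name__ == "__main__":
    for problem in check_spark_prerequisites(os.environ, sys.version_info):
        print("WARNING:", problem)
```

Running the script prints one warning per failed check and prints nothing when both checks pass.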
- RDD-based programs
- RDDs: creation, basic transformations and actions (2 slides per page, 6 slides per page)
- Key-value RDDs: transformations and actions on key-value RDDs (2 slides per page, 6 slides per page)
- DoubleRDDs (2 slides per page, 6 slides per page)
- Advanced Topics: Cache, accumulators, broadcast variables (2 slides per page, 6 slides per page)
- Advanced Topics – Part II: Custom partitioners, broadcast join (2 slides per page, 6 slides per page)
- RDD partition examples (RDDPartitionsExamples.zip)
- PageRank example (RDDPageRank.zip)
- Spark SQL and DataFrames
- Spark SQL (2 slides per page, 6 slides per page)
- Simple examples – Jupyter notebook (SparkSQLSimpleExamples.zip)
- Spark SQL – Part II (2 slides per page, 6 slides per page)
- Data mining and Machine learning algorithms with Spark
- MLlib
- Introduction and Preprocessing (2 slides per page, 6 slides per page)
- Classification (2 slides per page, 6 slides per page)
- Classification examples – Jupyter notebooks and sample data (ExampleClassificationMLlib.zip)
- Clustering (2 slides per page, 6 slides per page)
- Clustering example – Jupyter notebook and sample data (ExampleClusteringMLlib.zip)
- Regression (2 slides per page, 6 slides per page)
- Regression example – Jupyter notebook and sample data (ExampleRegressionMLlib.zip)
- Itemset and Association rule mining (2 slides per page, 6 slides per page)
- Itemset and Association rule mining example – Jupyter notebook and sample data (ExampleItemsetMLlib.zip)
- GraphX/GraphFrames
- Introduction to GraphX and GraphFrames (2 slides per page, 6 slides per page) Updated on May 16, 2020
- Graph Algorithms with GraphFrames (2 slides per page, 6 slides per page)
- Simple example – Jupyter notebook (GraphFrameExamples.zip)
- Select kernel GraphFrames (Yarn) to run it on jupyter.polito.it
- Run "pyspark --packages graphframes:graphframes:0.8.0-spark2.4-s_2.11" to run it locally on your PC
- Streaming data analytics
- Spark Streaming
- Spark Streaming (DStreams) (2 slides per page, 6 slides per page)
- Simple examples – Jupyter notebooks (SparkSteamingExamples.zip)
- Structured Streaming (2 slides per page, 6 slides per page)
- Simple examples – Jupyter notebooks (SparkStructutedStreamingExamples.zip)
- Introduction to other big stream processing frameworks: Apache Storm, Apache Flink, ... (2 slides per page, 6 slides per page)
- Relational and Non-relational databases for Big data
- Introduction to relational and non-relational databases for Big data: Hive, HBase (NOT COVERED THIS YEAR)
Exercises
- MapReduce
- MapReduce exercises (2 slides per page, 6 slides per page)
- Solutions of Exercises 1-12 (Solutions1_12.zip)
- Solutions of Exercises 13-22 (Solutions13_22.zip)
- Solutions of Exercises 23-29 (Solutions23_29.zip)
- Solution of Exercise 23 – Two Jobs – Version 2: Updated version (SolExercise23TwoJobsV2Cluster.zip). The former version does not find the cached file when it is executed on the cluster.
- Basic project
- Linux and macOS
- Basic Eclipse project for MapReduce applications (based on maven) (MapReduceBasicProject.zip)
- Windows
- Setup instructions (ConfigureWindowsEnviroment.pdf)
- Winutils executable (winutils.zip)
- Basic Eclipse project for MapReduce applications (based on maven) (MapReduceBasicProjectWindows.zip)
- Spark
- Spark RDD-, DataFrame-based exercises (2 slides per page, 6 slides per page)
- Example data – One folder with (few) data for each exercise (ExSparkData.zip)
- Solutions of Exercises 30-36 – Jupyter notebooks (SparkNotebooksSol30_36.zip)
- Solutions of Exercises 37-42 – Jupyter notebooks (SparkNotebooksSol37_42.zip)
- Solutions of Exercises 43-46 – Jupyter notebooks (SparkNotebooksSol43_46.zip)
- Spark SQL-based Solutions
- Exercises 37-38 – Spark SQL-based solutions – Jupyter notebooks (SparkNotebooksSol37_38DataframeSQL.zip)
- Spark SQL exercises (2 slides per page, 6 slides per page)
- Example data – One folder with (few) data for each exercise (ExSparkSQLData.zip)
- Solutions of Exercises 47-48 – Jupyter notebooks (SparkNotebooksSol47_48.zip)
- Solutions of Exercises 49-50 – Jupyter notebooks (SparkNotebooksSol49_50.zip) – The problem specifications of these two exercises are in Spark RDD-, DataFrame-based exercises (2 slides per page, 6 slides per page)
- Spark MLlib exercises (2 slides per page, 6 slides per page)
- Example data – One folder with (few) data for each exercise (ExampleMLlibData.zip)
- Solutions of Exercise 51 (SparkNotebooksSol51.zip)
- GraphFrame exercises (2 slides per page, 6 slides per page)
- Example data – One folder with (few) data for each exercise (ExampleGraphFrameData.zip)
- Solutions of Exercises 52-57 – Jupyter notebooks (SparkNotebooksSol52_57.zip)
- Spark streaming exercises (2 slides per page, 6 slides per page)
- Example data – One folder with (few) data for each exercise (ExampleSparkStreamingData.zip)
- Solutions of Exercises 58-65 – Jupyter notebooks (SparkNotebooksSol58_65.zip)
Practices
- Lab1: Hadoop and MapReduce (Wednesday, March 18 – 13:00-14:30)
- Problem specification (pdf)
- How to import a MapReduce program and run it locally on your PC by using Eclipse + Maven (01_ImportProject_LocalRun.mp4)
- How to create a jar file and execute your application on the remote cluster BigData@Polito (02_Jar_ClusterExecution.mp4)
- Basic project and small example data set
- Linux and macOS (Lab1.zip)
- Windows (Lab1Windows.zip)
- Solution
- Bonus track: Lab1_SolBonus_1920.zip
- Lab2: Filter with Hadoop MapReduce (Friday, March 20 – 13:00-14:30)
- Problem specification (pdf)
- Skeleton Eclipse project Hadoop – MapReduce
- Linux and macOS (Lab2_Skeleton1920.zip)
- Windows (Lab2Windows_Skeleton1920.zip)
- Solution
- Lab3: Frequently bought/reviewed together application with Hadoop MapReduce (Friday, March 27 – 13:00-14:30)
- Problem specification (pdf)
- Skeleton Eclipse project Hadoop – MapReduce
- Linux and macOS (Lab3_Skeleton1920.zip)
- Windows (Lab3Windows_Skeleton1920.zip)
- Sample file (AmazonTransposedDataset_Sample.txt)
- Solution
- Lab3_Sol1920.zip – Three alternative solutions are provided (the solutions differ in efficiency)
- Comments on the three uploaded solutions (2 slides per page, 6 slides per page)
- Lab4: Normalized ratings for product recommendations with Hadoop MapReduce (Friday, April 3 – 13:00-14:30)
- Problem specification (pdf)
- Sample dataset (ReviewsSample.csv)
- Skeleton Eclipse project Hadoop – MapReduce
- Linux and macOS (Lab4_Skeleton1920.zip)
- Windows (Lab4Windows_Skeleton1920.zip)
- Solution
- Lab5: Filter data and compute basic statistics with Apache Spark (Friday, April 17 – 13:00-14:30)
- Problem specification (pdf)
- Sample file (SampleLocalFile.csv)
- Solution
- Lab5_Sol1920.zip – Jupyter notebook (Lab5_Sol.ipynb) and Python script (Lab5_Sol.py)
- Lab6: Frequently bought/reviewed together application with Apache Spark (Friday, April 24 – 13:00-14:30)
- Problem specification (pdf)
- Sample dataset (ReviewsSample.csv)
- Solution
- Lab6_Sol1920.zip – Jupyter notebook (Lab6_1920Sol.ipynb) and Python script (Lab6_1920Sol.py)
- Lab7: Bike sharing data analysis (Wednesday, April 29 – 13:00-14:30)
- Problem specification (pdf)
- Sample data (zip)
- Example KML file (zip)
- Another KML visualizer that can be used to display the results of your analysis on a map: http://kmlviewer.nsspot.net
- Solution
- Lab7_Sol1920.zip – Jupyter notebook (Lab7_1920Sol.ipynb) and Python script (Lab7_1920Sol.py)
- Lab8: Bike sharing data analysis based on Spark SQL (Friday, May 8 – 13:00-14:30)
- Problem specification (pdf)
- Sample data (zip)
- Solution
- Lab8_Sol1920.zip – Jupyter notebook and Python script
- Lab9: A classification pipeline with MLlib + SparkSQL (Friday, May 15 – 13:00-14:30)
- Problem specification (pdf)
- Template (zip)
- Solution
- Lab9_Sol1920.zip – Jupyter notebooks
- Lab10: GraphFrame (Friday, May 22 – 13:00-14:30)
- Problem specification (pdf)
- Solution
- Lab10_Sol1920.zip – Jupyter notebook
- Lab11: Tweet analysis – Spark streaming (Friday, May 29 – 13:00-14:30)
- Problem specification (pdf)
- Example files – tweets (exampledata_tweets.zip)
- Solution
- Lab11_Sol1920.zip – Jupyter notebook
Exam Examples
- Exam Example #1 (pdf)
- Solution
- Question 1: (d)
- Question 2: (c)
- SolutionExamExample1.zip
- Exam Example #2 (pdf)
- Solution
- Question 1: (d)
- Question 2: (c)
- SolutionExamExample2.zip
- Exam Example #3 (pdf)
- Solution
- Question 1: (c)
- Question 2: (c)
- SolutionExamExample3.zip
- Exam Example #4 (pdf)
- Solution
- Question 1: (d)
- Question 2: (c)
- SolutionExamExample4.zip
- Exam Example #5 (pdf)
- Solution
- Question 1: (b)
- Question 2: (b)
- SolutionExamExample5.zip
- Exam June 27, 2020 (pdf)
- Solution
- Question 1: (b)
- Question 2: (a)
- Part II: MapReduce and Spark (DBD_Exam20200627Sol.zip)
- Exam July 20, 2020 (pdf)
- Solution
- Question 1: (d)
- Question 2: (b) – Note that there are three actions and hence the input file is read three times.
- Part II: MapReduce and Spark (DBD_Exam20200720Sol.zip)
- Exam September 14, 2020 (pdf)
- Solution
- Question 1: (d)
- Question 2: (c)
- Part II: MapReduce and Spark (DBD_Exam20200914Sol.zip)
Additional material
- Slides and screencasts about Java (kindly provided by Prof. Torchiano) (link)
- Suggested slides/lectures for students who have never used Java
- OO Paradigm and UML (The UML part is not mandatory)
- The Java Environment
- Java Basic Features
- Java Inheritance