Distributed architectures for big data processing and analytics (2019/2020)

Table of contents

General information
Exam rules
Announcements
Slides
Exercises
Practices
Exam Examples
Additional material

Note that this is the web page for the academic year 2019/2020

General information

ECTS: 8
Professor: Paolo Garza
Teaching assistant: Martino Trevisan

Exam rules

Exam rules Academic Year 2019-2020 – ONLINE EXAMINATION SESSION (pdf)
Exam rules Academic Year 2019-2020 (pdf)

Announcements

(24/02/2020)
- No lab activities during the first week

Slides

Video lectures:
- Teaching portal / Material / Virtual classroom
- or
- Teaching portal / Material / Link relativi al corso (course-related links) -> Links to the Dropbox copy of the video lectures

Introduction to the course content and exam rules (2 slides per page, 6 slides per page)
- Question session – March 10, 2020
Introduction to Big Data (2 slides per page, 6 slides per page)
Big Data Architectures (2 slides per page, 6 slides per page)
Hadoop and MapReduce
- Introduction to Apache Hadoop and the MapReduce programming paradigm (2 slides per page, 6 slides per page)
  - Interaction with HDFS and Hadoop by means of the command line (2 slides per page, 6 slides per page)
- Hadoop implementation of MapReduce (2 slides per page, 6 slides per page)
  - Source code of the Word Count Eclipse project (WordCount.zip) – Use the "Import Maven project" option to import it into Eclipse
  - PDF version of the code (i.e., PDF version of the java files) (WordCountPDF.zip)
  - BigData@Polito environment + Jupyter – How to submit MapReduce jobs on BigData@Polito (2 slides per page, 6 slides per page)
- MapReduce – Design patterns – Part 1 (2 slides per page, 6 slides per page)
- MapReduce and Hadoop – Advanced Topics: Multiple inputs, Multiple outputs, Distributed cache (2 slides per page, 6 slides per page)
  - Updated on April 20, 2020 with some more details on the Distributed cache topic
- MapReduce – Design patterns – Part 2 (2 slides per page, 6 slides per page)
- MapReduce – Relational Algebra/SQL operators (2 slides per page, 6 slides per page)
Spark
- Introduction to Apache Spark (2 slides per page, 6 slides per page)
  - How to submit Spark applications (2 slides per page, 6 slides per page)
  - How to use Jupyter notebooks for your Spark applications (2 slides per page, 6 slides per page)
    - A useful online tutorial for those who want to install and run Spark locally on their PCs (tested for Linux)
      - “How to use PySpark on your computer” by Favio Vázquez (link)
    - Some comments and hints
      - Download the following pre-built version of Spark from spark.apache.org: spark-2.4.5-bin-hadoop2.7.tgz
      - Make sure to install and use Python 3
      - Install Java 8 (Spark 2.4.5 runs on Java 8)

        e.g., sudo apt-get install openjdk-8-jdk
      - Then set the JAVA_HOME variable in your environment to the folder containing Java 8

        e.g., export JAVA_HOME="/usr/lib/jvm/java-8-openjdk-amd64/"
- RDD-based programs
  - RDDs: creation, basic transformations and actions (2 slides per page, 6 slides per page)
  - Key-value RDDs: transformations and actions on key-value RDDs (2 slides per page, 6 slides per page)
  - DoubleRDDs (2 slides per page, 6 slides per page)
  - Advanced Topics: Cache, accumulators, broadcast variables (2 slides per page, 6 slides per page)
  - Advanced Topics – Part II: Custom partitioners, broadcast join (2 slides per page, 6 slides per page)
    - RDD partition examples (RDDPartitionsExamples.zip)
    - PageRank example (RDDPageRank.zip)
- Spark SQL and DataFrames
  - Spark SQL (2 slides per page, 6 slides per page)
    - Simple examples – Jupyter notebook (SparkSQLSimpleExamples.zip)
  - Spark SQL – Part II (2 slides per page, 6 slides per page)
- Data mining and Machine learning algorithms with Spark
  - MLlib
    - Introduction and Preprocessing (2 slides per page, 6 slides per page)
    - Classification (2 slides per page, 6 slides per page)
      - Classification examples – Jupyter notebooks and sample data (ExampleClassificationMLlib.zip)
    - Clustering (2 slides per page, 6 slides per page)
      - Clustering example – Jupyter notebook and sample data (ExampleClusteringMLlib.zip)
    - Regression (2 slides per page, 6 slides per page)
      - Regression example – Jupyter notebook and sample data (ExampleRegressionMLlib.zip)
    - Itemset and Association rule mining (2 slides per page, 6 slides per page)
      - Itemset and Association rule mining example – Jupyter notebook and sample data (ExampleItemsetMLlib.zip)
- GraphX/GraphFrames
  - Introduction to GraphX and GraphFrames (2 slides per page, 6 slides per page) – Updated on May 16, 2020
  - Graph Algorithms with GraphFrames (2 slides per page, 6 slides per page)
    - Simple example – Jupyter notebook (GraphFrameExamples.zip)
      - Select kernel GraphFrames (Yarn) to run it on jupyter.polito.it
      - Run "pyspark --packages graphframes:graphframes:0.8.0-spark2.4-s_2.11" to run it locally on your PC
Streaming data analytics
- Spark Streaming
  - Spark Streaming (DStreams) (2 slides per page, 6 slides per page)
    - Simple examples – Jupyter notebooks (SparkSteamingExamples.zip)
  - Structured Streaming (2 slides per page, 6 slides per page)
    - Simple examples – Jupyter notebooks (SparkStructutedStreamingExamples.zip)
- Introduction to other big stream processing frameworks: Apache Storm, Apache Flink, ... (2 slides per page, 6 slides per page)
Relational and Non-relational databases for Big data
- Introduction to relational and non-relational databases for Big data: Hive, HBase (NOT COVERED THIS YEAR)
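
The hints above for running Spark locally (Python 3, Java 8, JAVA_HOME) can be checked before starting PySpark. The following is only a minimal sketch in plain Python: the function name check_local_spark_env is hypothetical, and it only verifies that Python 3 is in use and that JAVA_HOME points to an existing folder. It does not verify that the folder actually contains Java 8.

```python
import os
import sys


def check_local_spark_env(env, python_version):
    """Return a list of human-readable problems; an empty list means the checks passed.

    env            -- mapping of environment variables (e.g. os.environ)
    python_version -- (major, minor) tuple such as sys.version_info[:2]
    """
    problems = []

    # The hints ask for Python 3.
    if python_version[0] < 3:
        problems.append("Python 3 is required, found %d.%d" % tuple(python_version[:2]))

    # The hints ask for JAVA_HOME to be set to the folder containing Java 8.
    # Assumption: we only check that the folder exists, not the Java version.
    java_home = env.get("JAVA_HOME", "")
    if not java_home:
        problems.append("JAVA_HOME is not set")
    elif not os.path.isdir(java_home):
        problems.append("JAVA_HOME does not point to an existing folder: " + java_home)

    return problems


if __name__ == "__main__":
    issues = check_local_spark_env(os.environ, sys.version_info[:2])
    for issue in issues:
        print("WARNING:", issue)
    if not issues:
        print("Python version and JAVA_HOME look consistent with the hints.")
```

Run it with the same Python interpreter and in the same shell session that will be used to start pyspark, so that the same JAVA_HOME value is seen.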

Exercises

MapReduce
- MapReduce exercises (2 slides per page, 6 slides per page)
  - Solutions of Exercises 1-12 (Solutions1_12.zip)
  - Solutions of Exercises 13-22 (Solutions13_22.zip)
  - Solutions of Exercises 23-29 (Solutions23_29.zip)
    - Solution of Exercise 23 – Two Jobs – Version 2: Updated version (SolExercise23TwoJobsV2Cluster.zip). The former version does not find the cached file when it is executed on the cluster.
- Basic project
  - Linux and macOS
    - Basic Eclipse project for MapReduce applications (based on maven) (MapReduceBasicProject.zip)
  - Windows
    - Setup instructions (ConfigureWindowsEnviroment.pdf)
    - Winutils executable (winutils.zip)
    - Basic Eclipse project for MapReduce applications (based on maven) (MapReduceBasicProjectWindows.zip)
Spark
- Spark RDD-, DataFrame-based exercises (2 slides per page, 6 slides per page)
  - Example data – One folder with (few) data for each exercise (ExSparkData.zip)
  - Solutions of Exercises 30-36 – Jupyter notebooks (SparkNotebooksSol30_36.zip)
  - Solutions of Exercises 37-42 – Jupyter notebooks (SparkNotebooksSol37_42.zip)
  - Solutions of Exercises 43-46 – Jupyter notebooks (SparkNotebooksSol43_46.zip)
  - Spark SQL-based Solutions
    - Exercises 37-38 – Spark SQL-based solutions – Jupyter notebooks (SparkNotebooksSol37_38DataframeSQL.zip)
- Spark SQL exercises (2 slides per page, 6 slides per page)
  - Example data – One folder with (few) data for each exercise (ExSparkSQLData.zip)
  - Solutions of Exercises 47-48 – Jupyter notebooks (SparkNotebooksSol47_48.zip)
  - Solutions of Exercises 49-50 – Jupyter notebooks (SparkNotebooksSol49_50.zip) – The problem specifications of these two exercises are in Spark RDD-, DataFrame-based exercises (2 slides per page, 6 slides per page)
- Spark MLlib exercises (2 slides per page, 6 slides per page)
  - Example data – One folder with (few) data for each exercise (ExampleMLlibData.zip)
  - Solutions of Exercise 51 (SparkNotebooksSol51.zip)
- GraphFrame exercises (2 slides per page, 6 slides per page)
  - Example data – One folder with (few) data for each exercise (ExampleGraphFrameData.zip)
  - Solutions of Exercises 52-57 – Jupyter notebooks (SparkNotebooksSol52_57.zip)
- Spark streaming exercises (2 slides per page, 6 slides per page)
  - Example data – One folder with (few) data for each exercise (ExampleSparkStreamingData.zip)
  - Solutions of Exercises 58-65 – Jupyter notebooks (SparkNotebooksSol58_65.zip)

Practices

Lab1: Hadoop and MapReduce (Wednesday, March 18 – 13:00-14:30)
- Problem specification (pdf)
- How to import a MapReduce program and run it locally on your PC by using Eclipse + Maven (01_ImportProject_LocalRun.mp4)
- How to create a jar file and execute your application on the remote cluster BigData@Polito (02_Jar_ClusterExecution.mp4)
- Basic project and small example data set
  - Linux and macOS (Lab1.zip)
  - Windows (Lab1Windows.zip)
- Solution
  - Bonus track: Lab1_SolBonus_1920.zip
Lab2: Filter with Hadoop MapReduce (Friday, March 20 – 13:00-14:30)
- Problem specification (pdf)
- Skeleton Eclipse project Hadoop – MapReduce
  - Linux and macOS (Lab2_Skeleton1920.zip)
  - Windows (Lab2Windows_Skeleton1920.zip)
- Solution
  - Lab2_Sol1920.zip
Lab3: Frequently bought/reviewed together application with Hadoop MapReduce (Friday, March 27 – 13:00-14:30)
- Problem specification (pdf)
- Skeleton Eclipse project Hadoop – MapReduce
  - Linux and macOS (Lab3_Skeleton1920.zip)
  - Windows (Lab3Windows_Skeleton1920.zip)
- Sample file (AmazonTransposedDataset_Sample.txt)
- Solution
  - Lab3_Sol1920.zip – Three alternative solutions are provided (the solutions differ in efficiency)
  - Comments on the three uploaded solutions (2 slides per page, 6 slides per page)
Lab4: Normalized ratings for product recommendations with Hadoop MapReduce (Friday, April 3 – 13:00-14:30)
- Problem specification (pdf)
- Sample dataset (ReviewsSample.csv)
- Skeleton Eclipse project Hadoop – MapReduce
  - Linux and macOS (Lab4_Skeleton1920.zip)
  - Windows (Lab4Windows_Skeleton1920.zip)
- Solution
  - Lab4_Sol1920.zip
Lab5: Filter data and compute basic statistics with Apache Spark (Friday, April 17 – 13:00-14:30)
- Problem specification (pdf)
- Sample file (SampleLocalFile.csv)
- Solution
  - Lab5_Sol1920.zip – Jupyter notebook (Lab5_Sol.ipynb) and Python script (Lab5_Sol.py)
Lab6: Frequently bought/reviewed together application with Apache Spark (Friday, April 24 – 13:00-14:30)
- Problem specification (pdf)
- Sample dataset (ReviewsSample.csv)
- Solution
  - Lab6_Sol1920.zip – Jupyter notebook (Lab6_1920Sol.ipynb) and Python script (Lab6_1920Sol.py)
Lab7: Bike sharing data analysis (Wednesday, April 29 – 13:00-14:30)
- Problem specification (pdf)
- Sample data (zip)
- Example KML file (zip)
- Another KML visualizer that can be used to show the result of your analysis on a map: http://kmlviewer.nsspot.net
- Solution
  - Lab7_Sol1920.zip – Jupyter notebook (Lab7_1920Sol.ipynb) and Python script (Lab7_1920Sol.py)
Lab8: Bike sharing data analysis based on Spark SQL (Friday, May 8 – 13:00-14:30)
- Problem specification (pdf)
- Sample data (zip)
- Solution
  - Lab8_Sol1920.zip – Jupyter notebook and Python script
Lab9: A classification pipeline with MLlib + SparkSQL (Friday, May 15 – 13:00-14:30)
- Problem specification (pdf)
- Template (zip)
- Solution
  - Lab9_Sol1920.zip – Jupyter notebooks
Lab10: GraphFrame (Friday, May 22 – 13:00-14:30)
- Problem specification (pdf)
- Solution
  - Lab10_Sol1920.zip – Jupyter notebook
Lab11: Tweet analysis – Spark streaming (Friday, May 29 – 13:00-14:30)
- Problem specification (pdf)
- Example files – tweets (exampledata_tweets.zip)
- Solution
  - Lab11_Sol1920.zip – Jupyter notebook

Exam Examples

Exam Example #1 (pdf)
- Solution
  - Question 1: (d)
  - Question 2: (c)
  - SolutionExamExample1.zip
Exam Example #2 (pdf)
- Solution
  - Question 1: (d)
  - Question 2: (c)
  - SolutionExamExample2.zip
Exam Example #3 (pdf)
- Solution
  - Question 1: (c)
  - Question 2: (c)
  - SolutionExamExample3.zip
Exam Example #4 (pdf)
- Solution
  - Question 1: (d)
  - Question 2: (c)
  - SolutionExamExample4.zip
Exam Example #5 (pdf)
- Solution
  - Question 1: (b)
  - Question 2: (b)
  - SolutionExamExample5.zip
Exam June 27, 2020 (pdf)
- Solution
  - Question 1: (b)
  - Question 2: (a)
  - Part II: MapReduce and Spark (DBD_Exam20200627Sol.zip)
Exam July 20, 2020 (pdf)
- Solution
  - Question 1: (d)
  - Question 2: (b) – Note that there are three actions and hence the input file is read three times.
  - Part II: MapReduce and Spark (DBD_Exam20200720 Sol.zip)
Exam September 14, 2020 (pdf)
- Solution
  - Question 1: (d)
  - Question 2: (c)
  - Part II: MapReduce and Spark (DBD_Exam20200914Sol.zip)
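
The note in the solution of the July 20, 2020 exam (three actions, hence three reads of the input file) relies on the fact that, without caching, each action re-executes the computation starting from the input. The exam code is not reproduced here, so the following is only a toy model in plain Python, not Spark code: the class ToyLazyDataset and its attributes are hypothetical and merely count how many times the "file" is scanned.

```python
class ToyLazyDataset:
    """Toy stand-in for a lazily evaluated dataset (NOT the Spark API)."""

    def __init__(self, lines):
        self._lines = list(lines)  # pretend this is the content of the input file
        self.reads = 0             # how many times the "file" has been scanned
        self._use_cache = False
        self._cached = None

    def cache(self):
        # Only marks the dataset; nothing is read until the first action.
        self._use_cache = True
        return self

    def _compute(self):
        if self._cached is not None:
            return self._cached    # reuse the materialized content
        self.reads += 1            # one full scan of the input
        data = list(self._lines)
        if self._use_cache:
            self._cached = data
        return data

    # Actions: each one triggers the computation.
    def count(self):
        return len(self._compute())

    def first(self):
        return self._compute()[0]

    def collect(self):
        return list(self._compute())


# Three actions without caching: the input is scanned three times.
uncached = ToyLazyDataset(["a", "b", "c"])
uncached.count()
uncached.first()
uncached.collect()
print(uncached.reads)  # 3

# Same three actions after cache(): the input is scanned only once.
cached = ToyLazyDataset(["a", "b", "c"]).cache()
cached.count()
cached.first()
cached.collect()
print(cached.reads)  # 1
```

In Spark the same effect is obtained by caching the RDD or DataFrame that is reused by several actions, as discussed in the slides on cache, accumulators and broadcast variables.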

Additional material

Slides and screencasts about Java (kindly provided by prof. Torchiano) (link)
- Suggested slides/lectures for those students who have never used Java
  - OO Paradigm and UML (The UML part is not mandatory)
  - The Java Environment
  - Java Basic Features
  - Java Inheritance

DataBase and Data Mining Group
