Big Data: Architectures and Data Analytics (2019/2020)

Table of contents

General information
Exam rules
Announcements
Slides
Exercises
Practices
Exam Examples
Additional material

General information

ECTS: 6
Professor: Paolo Garza
Teaching assistants:
- Alessandro Farasin
- Marilisa Montemurro

Exam rules

Exam rules Academic Year 2019-2020 – ONLINE EXAMINATION SESSION (pdf)
Exam rules Academic Year 2019-2020 (pdf)

Announcements

(24/02/2020)
- No lab activities during the first two weeks.

Slides

Introduction to the course content and exam rules (2 slides per page, 6 slides per page)
Introduction to Big Data (2 slides per page, 6 slides per page)
Hadoop and MapReduce
- Introduction to Apache Hadoop and the MapReduce programming paradigm (2 slides per page, 6 slides per page)
  - Interaction with HDFS and Hadoop by means of the command line (2 slides per page, 6 slides per page)
- Hadoop implementation of MapReduce (2 slides per page, 6 slides per page)
  - Source code of the Word Count Eclipse project (WordCount.zip) – Use the "Import Maven project" option to import it into Eclipse
  - PDF version of the code (i.e., PDF version of the java files) (WordCountPDF.zip)
  - BigData@Polito environment + Jupyter – How to submit MapReduce jobs on BigData@Polito (2 slides per page, 6 slides per page)
- MapReduce – Design patterns – Part 1 (2 slides per page, 6 slides per page)
- MapReduce and Hadoop – Advanced Topics: Multiple inputs, Multiple outputs, Distributed cache (2 slides per page, 6 slides per page)
  - Updated on April 20, 2020 with some more details on the Distributed cache topic
- MapReduce – Design patterns – Part 2 (2 slides per page, 6 slides per page)
- MapReduce – Relational Algebra/SQL operators (2 slides per page, 6 slides per page)
Spark
- Introduction to Apache Spark (2 slides per page, 6 slides per page) (2 slides per page – no black background, 6 slides per page – no black background)
  - How to submit Spark applications (2 slides per page, 6 slides per page) (2 slides per page – no black background, 6 slides per page – no black background)
- RDD-based programs
  - RDDs: creation, basic transformations and actions (2 slides per page, 6 slides per page) (2 slides per page – no black background, 6 slides per page – no black background)
  - Key-value pair RDDs: transformations and actions on PairRDDs (2 slides per page, 6 slides per page) (2 slides per page – no black background, 6 slides per page – no black background)
  - DoubleRDDs (2 slides per page, 6 slides per page) (2 slides per page – no black background, 6 slides per page – no black background)
  - Advanced Topics: Cache, accumulators, broadcast variables (2 slides per page, 6 slides per page) (2 slides per page – no black background, 6 slides per page – no black background)
- Spark SQL, Datasets and DataFrames (2 slides per page, 6 slides per page) (2 slides per page – no black background, 6 slides per page – no black background)
- Data mining and Machine learning algorithms with Spark MLlib
  - Data Mining – Recap
    - Introduction (2 slides per page, 6 slides per page)
    - Data and Preprocessing (2 slides per page, 6 slides per page)
    - Itemset mining and Association rules (2 slides per page, 6 slides per page)
    - Classification (2 slides per page, 6 slides per page)
    - Clustering (2 slides per page, 6 slides per page)
  - Spark MLlib
    - Spark MLlib – Introduction and Classification of structured data (2 slides per page, 6 slides per page) (2 slides per page – no black background, 6 slides per page – no black background)
      - Logistic Regression example code (zip)
      - Decision Trees example code (zip)
      - Decision Trees and Categorical class label example code (zip)
    - Spark MLlib – Classification of textual data (2 slides per page, 6 slides per page) (2 slides per page – no black background, 6 slides per page – no black background)
      - Textual data classification example code (zip)
    - Spark MLlib – Classification and Parameter tuning (2 slides per page, 6 slides per page) (2 slides per page – no black background, 6 slides per page – no black background)
      - Parameter tuning example code (zip)
    - Spark MLlib – Clustering of structured data (2 slides per page, 6 slides per page) (2 slides per page – no black background, 6 slides per page – no black background)
      - Clustering example code (zip)
    - Spark MLlib – Itemset and Association rule mining (2 slides per page, 6 slides per page) (2 slides per page – no black background, 6 slides per page – no black background)
      - Itemset and Association rule mining example code (zip)
    - Spark MLlib – Linear regression (2 slides per page, 6 slides per page) (2 slides per page – no black background, 6 slides per page – no black background)
      - Linear regression example code (zip)
- Spark Streaming (2 slides per page, 6 slides per page) (2 slides per page – no black background, 6 slides per page – no black background)
  - Word Count – Streaming version (zip)
  - Word Count and Window (zip)
  - Word Count – Stateful version (zip)
  - Word Count – Streaming version – Read data from HDFS folder (zip)
  - Word Count – Output sort by key – Based on the transformPair() transformation (zip)
Relational and Non-relational databases for Big data
- Introduction to relational and non-relational databases for Big data (2 slides per page, 6 slides per page) (2 slides per page – no black background, 6 slides per page – no black background)

Exercises

MapReduce
- MapReduce exercises (2 slides per page, 6 slides per page)
  - Solutions of Exercises 1-12 (Solutions1_12.zip)
  - Solutions of Exercises 13-22 (Solutions13_22.zip)
  - Solutions of Exercises 23-29 (Solutions23_29.zip)
    - Solution of Exercise 23 – Two Jobs – Version 2: Updated version (SolExercise23TwoJobsV2Cluster.zip). The former version does not find the cached file when it is executed on the cluster.
- Basic project
  - Linux and macOS
    - Basic Eclipse project for MapReduce applications (based on Maven) (MapReduceBasicProject.zip)
  - Windows
    - Setup instructions (ConfigureWindowsEnviroment.pdf)
    - Winutils executable (winutils.zip)
    - Basic Eclipse project for MapReduce applications (based on Maven) (MapReduceBasicProjectWindows.zip)
Spark
- Spark RDD-, Dataset-, DataFrame-based exercises (2 slides per page, 6 slides per page)
  - Example data – One folder with a small data sample for each exercise (ExSparkData.zip)
  - Solutions of Exercises 30-46 (Solutions30_46.zip)
- Spark SQL exercises (2 slides per page, 6 slides per page)
  - Example data – One folder with a small data sample for each exercise (ExSparkSQLData.zip)
  - Solutions of Exercises 47-48 (Solutions47_48.zip)
  - Solutions of Exercises 49-50 (Solutions49_50.zip) – The problem specifications of these two exercises are in Spark RDD-, Dataset-, DataFrame-based exercises (2 slides per page, 6 slides per page)
- Spark streaming exercises (2 slides per page, 6 slides per page)

Practices

No lab activities during the first two weeks
TEAM 1: Students from A to H – Tuesday from 5.30pm to 7pm
TEAM 2: Students from I to Z – Wednesday from 5.30pm to 7pm

Lab1: Hadoop and MapReduce
- Online virtual lab, for online questions and answers
  - Team 1: Tuesday, March 24, from 5.30pm to 7pm
  - Team 2: Wednesday, March 25, from 5.30pm to 7pm
- Problem specification (pdf)
- How to import a MapReduce program and run it locally on your PC by using Eclipse + Maven (01_ImportProject_LocalRun.mp4)
- How to create a jar file and execute your application on the remote cluster BigData@Polito (02_Jar_ClusterExecution.mp4)
- Basic project and small example data set
  - Linux and macOS (Lab1.zip)
  - Windows (Lab1Windows.zip)
- Solution
  - Bonus track: Lab1_SolBonus_1920.zip
Lab2: Filter with Hadoop MapReduce
- Problem specification (pdf)
- Skeleton Eclipse project Hadoop – MapReduce
  - Linux and macOS (Lab2_Skeleton1920.zip)
  - Windows (Lab2Windows_Skeleton1920.zip)
- Solution
  - Lab2_Sol1920.zip
Lab3: Frequently bought/reviewed together application with Hadoop MapReduce
- Problem specification (pdf)
- Skeleton Eclipse project Hadoop – MapReduce
  - Linux and macOS (Lab3_Skeleton1920.zip)
  - Windows (Lab3Windows_Skeleton1920.zip)
- Sample file (AmazonTransposedDataset_Sample.txt)
- Solution
  - Lab3_Sol1920.zip – Three alternative solutions are provided (the solutions differ in efficiency)
  - Comments on the three uploaded solutions (2 slides per page, 6 slides per page)
Lab4: Normalized ratings for product recommendations with Hadoop MapReduce (Team 1: Tuesday, April 7 – Team 2: Wednesday, April 8)
- Problem specification (pdf)
- Sample dataset (ReviewsSample.csv)
- Skeleton Eclipse project Hadoop – MapReduce
  - Linux and macOS (Lab4_Skeleton1920.zip)
  - Windows (Lab4Windows_Skeleton1920.zip)
- Solution
  - Lab4_Sol1920.zip
Lab5: Filter data and compute basic statistics with Apache Spark (Team 1: Tuesday, April 21 – Team 2: Wednesday, April 22)
- Problem specification (pdf)
- Sample file (SampleLocalFile.csv)
- Skeleton Eclipse project Spark (Lab5BigData_Template1920.zip)
- Solution
  - Lab5BigData_Sol1920.zip
Lab6: Frequently bought/reviewed together application with Apache Spark (Team 1: Tuesday, April 28 – Team 2: Wednesday, April 29)
- Problem specification (pdf)
- Sample file (ReviewsSample.csv)
- Skeleton Eclipse project Spark (Lab6BigData_Template1920.zip)
- Solution
  - Lab6BigData_Sol1920.zip
Lab7: Bike sharing data analysis (Team 1 and Team 2: Tuesday, May 5)
- Problem specification (pdf)
- Sample data (zip)
- Skeleton Eclipse project Spark (Lab7BigData_Template1920.zip)
- Example KML file (zip)
- Another KML viewer that can be used to display the results of your analysis on a map: http://kmlviewer.nsspot.net
- Solution
  - Lab7BigData_Sol1920.zip
Lab8: Bike sharing data analysis based on Spark SQL (Team 1: Tuesday, May 19 – Team 2: Wednesday, May 20)
- Problem specification (pdf)
- Sample data (zip)
- Skeleton Eclipse project Spark (Lab8BigData_Template1920.zip)
- Solution
  - Lab8BigData_Sol1920.zip
Lab9: A classification pipeline with MLlib + SparkSQL
- Problem specification (pdf)
- Sample file with 100 reviews (ReviewsSample.csv)
- Skeleton Eclipse project Spark (Lab9BigData_Template1920.zip)
- Solution
  - Logistic regression (zip)
  - DecisionTree (zip)
  - Logistic regression based on text analysis (zip)
  - DecisionTree based on text analysis (zip)
Lab10: Tweet analysis – Spark streaming (Team 1 and Team 2: Wednesday, June 3)
- Problem specification (pdf)
- Example files – tweets (exampledata_tweets.zip)
- Skeleton Eclipse project Spark (Lab10BigData_Template1920.zip)
- Solution
  - Lab10BigData_Sol1920.zip

Exam Examples

At the exam, a template for the Driver part of the Hadoop-based exercise will be provided (Hadoop template)
- For the Spark exercises, no templates are provided

Exam example #1
- Exam (pdf)
- Solution
  - Source code/Eclipse projects (zip)
Exam example #2
- Exam (pdf)
- Solution
  - Source code/Eclipse projects (zip)
Exam July 1, 2016
- Exam (pdf)
- Solution
  - Question 1: (d)
  - Question 2: (b)
  - Source code/Eclipse projects (zip)
Exam July 12, 2016
- Exam (pdf)
- Solution
  - Question 1: (a)
  - Question 2: (a)
  - Source code/Eclipse projects (zip)
Exam September 19, 2016
- Exam (pdf)
- Solution
  - Question 1: (c)
  - Question 2: (a)
  - Source code/Eclipse projects (zip)
Exam June 30, 2017
- Exam (pdf)
- Solution
  - Question 1: (b)
  - Question 2: (c)
  - Source code/Eclipse projects (zip) – Updated on June 12, 2019
Exam July 14, 2017
- Exam (pdf)
- Solution
  - Question 1: (d)
  - Question 2: (c)
  - Source code/Eclipse projects (zip)
Exam September 14, 2017
- Exam (pdf)
- Solution
  - Question 1: (a)
  - Question 2: (b)
  - Source code/Eclipse projects (zip)
Exam January 22, 2018
- Exam (pdf)
- Solution
  - Question 1: (b)
  - Question 2: (b)
  - Source code/Eclipse projects (zip)
Exam June 26, 2018
- Exam – Version #1 (pdf)
  - Draft of the solution
    - Question 1: (c)
    - Question 2: (c)
    - Source code/Eclipse projects (zip)
- Exam – Version #2 (pdf)
  - Draft of the solution
    - Question 1: (b)
    - Question 2: (c)
    - Source code/Eclipse projects (zip)
Exam July 16, 2018
- Exam – Version #1 (pdf)
  - Draft of the solution
    - Question 1: (d)
    - Question 2: (a)
    - Source code/Eclipse projects (zip)
- Exam – Version #2 (pdf)
  - Draft of the solution
    - Question 1: (b)
    - Question 2: (d)
    - Source code/Eclipse projects (zip)
Exam September 3, 2018
- Exam – Version #1 (pdf)
  - Draft of the solution
    - Question 1: (d)
    - Question 2: (c)
    - Source code/Eclipse projects (zip)
- Exam – Version #2 (pdf)
  - Draft of the solution
    - Question 1: (b)
    - Question 2: (c)
Exam February 15, 2019
- Exam – Version #1 (pdf)
  - Draft of the solution
    - Question 1: (d)
    - Question 2: (c)
    - Source code/Eclipse projects (zip)
- Exam – Version #2 (pdf)
  - Draft of the solution
    - Question 1: (d)
    - Question 2: (b)
Exam July 2, 2019
- Exam – Version #1 (pdf)
  - Draft of the solution
    - Question 1: (a)
    - Question 2: (b)
    - Source code/Eclipse projects (zip)
- Exam – Version #2 (pdf)
  - Draft of the solution
    - Question 1: (a)
    - Question 2: (b)
    - Source code/Eclipse projects (zip)
Exam July 18, 2019
- Exam – Version #1 (pdf)
  - Draft of the solution
    - Question 1: (b)
    - Question 2: (b)
    - Source code/Eclipse projects (zip)
- Exam – Version #2 (pdf)
  - Draft of the solution
    - Question 1: (c)
    - Question 2: (b)
    - Source code/Eclipse projects (zip)
Exam July 2, 2020
- Exam (pdf)
  - Draft of the solution
    - Question 1: (b)
    - Question 2: (a) – Note that there are two actions and both actions are associated with paths that include the filter transformation. Hence, the filter transformation is executed two times. For this reason the value of the accumulator is 4.
    - Source code/Eclipse projects (zip)
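
The note on Question 2 above can be illustrated with a small sketch. This is a plain-Java analogy, not Spark code and not the exam's program: a Supplier<Stream> stands in for a lazy RDD lineage, an AtomicInteger stands in for the accumulator, and the class name, input values and filter condition are all hypothetical. It assumes the accumulator is incremented inside the filter, so every action that replays the lineage increments it again.

```java
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Supplier;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Plain-Java analogy of lazy evaluation (NOT the Spark API; all names are hypothetical).
class LazyAccumulatorSketch {

    // Stand-in for a Spark accumulator.
    static final AtomicInteger accumulator = new AtomicInteger(0);

    // Runs two "actions" on the same lazy pipeline and returns the accumulator value.
    static int runTwoActions(List<Integer> input) {
        accumulator.set(0);

        // Lazy "lineage": the filter runs only when a terminal operation is invoked,
        // and it runs from scratch for every terminal operation.
        Supplier<Stream<Integer>> filtered = () -> input.stream().filter(x -> {
            boolean keep = x > 10;                    // hypothetical filter condition
            if (keep) {
                accumulator.incrementAndGet();        // side effect inside the filter
            }
            return keep;
        });

        // "Action" 1: materialize the selected elements.
        List<Integer> selected = filtered.get().collect(Collectors.toList());

        // "Action" 2: aggregate the selected elements; the filter is executed again.
        int sum = filtered.get().mapToInt(Integer::intValue).sum();

        System.out.println("selected=" + selected + ", sum=" + sum);
        return accumulator.get();
    }

    public static void main(String[] args) {
        // Two elements pass the filter, and the filter is executed by both actions.
        System.out.println("accumulator=" + runTwoActions(Arrays.asList(5, 12, 20)));
    }
}
```

With the hypothetical input above, two elements pass the filter and each of the two actions replays it, so the counter ends at 4 rather than 2.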
Exam July 16, 2020
- Exam (pdf)
  - Draft of the solution
    - Question 1: (b)
    - Question 2: (b) – Note that there are two actions and hence the input file is read two times.
    - Source code/Eclipse projects (zip)
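
The note on Question 2 above (two actions, hence two reads of the input) can be sketched in the same spirit. Again this is a plain-Java analogy rather than Spark code: the Supplier named source is a hypothetical stand-in for reading the input file, and the class name, the filter and the two terminal operations are illustrative assumptions, not the exam's program.

```java
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Supplier;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Plain-Java analogy (NOT the Spark API; all names are hypothetical).
class LazySourceReadSketch {

    // Returns how many times the "input file" is read when two actions are executed.
    static int countSourceReads(List<String> lines) {
        AtomicInteger reads = new AtomicInteger(0);

        // Stand-in for reading the input file: every call opens the source again.
        Supplier<Stream<String>> source = () -> {
            reads.incrementAndGet();
            return lines.stream();
        };

        // Lazy pipeline built on top of the source; nothing is read at this point.
        Supplier<Stream<String>> nonEmpty = () -> source.get().filter(l -> !l.isEmpty());

        // "Action" 1 triggers a first read of the source.
        List<String> collected = nonEmpty.get().collect(Collectors.toList());

        // "Action" 2 triggers a second, independent read of the source.
        int totalLength = nonEmpty.get().mapToInt(String::length).sum();

        System.out.println("collected=" + collected + ", totalLength=" + totalLength);
        return reads.get();
    }

    public static void main(String[] args) {
        System.out.println("reads=" + countSourceReads(Arrays.asList("a", "", "bc")));
    }
}
```

In this sketch the read counter reaches 2 because nothing is kept between the two terminal operations; each one starts again from the source.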
Exam September 19, 2020
- Exam (pdf)
  - Draft of the solution
    - Question 1: (d)
    - Question 2: (c)
    - Source code/Eclipse projects (zip)

Additional material

Slides and screencasts about Java (kindly provided by prof. Torchiano) (link)
- Suggested slides/lectures for those students who have never used Java
  - OO Paradigm and UML (The UML part is not mandatory)
  - The Java Environment
  - Java Basic Features
  - Java Inheritance

DataBase and Data Mining Group