This is the old version of the web page of the Big data course.

Web page of the academic year 2021/22: link

General information

ECTS: 6
Professor: Paolo Garza
Teaching assistants:
- Luca Colomba
- Francesco Ventura

Exam rules

Exam rules Academic Year 2020-2021 (link)

Announcements

(24/09/2020)
- First (online) lecture: Tuesday, September 29 at 13.00 – Online virtual classroom
(24/09/2020)
- No lab activities during the first two weeks.
- The lab activities scheduled for Monday, September 28 from 17:30 to 19:00 and Tuesday, September 29 from 8:30 to 10:00 are cancelled.

Slides

Introduction to the course content and exam rules (slides)
Introduction to Big Data (slides) (slides – no black background)
Big Data Architectures (slides) (slides – no black background)
Hadoop and MapReduce
- Introduction to Apache Hadoop and the MapReduce programming paradigm (slides) (slides – no black background)
  - Interaction with HDFS and Hadoop by means of the command line (slides) (slides – no black background)
- Hadoop implementation of MapReduce (slides) (slides – no black background)
  - Source code of the Word Count Eclipse project (WordCount.zip) – Use the import Maven project option to import it into Eclipse
  - PDF version of the code (i.e., PDF version of the java files) (WordCountPDF.zip)
  - BigData@Polito environment + Jupyter – How to submit MapReduce jobs on BigData@Polito (slides) (slides – no black background)
- MapReduce – Design patterns – Part 1 (slides) (slides – no black background)
- MapReduce and Hadoop – Advanced Topics: Multiple inputs, Multiple outputs, Distributed cache (slides) (slides – no black background)
- MapReduce – Design patterns – Part 2 (slides) (slides – no black background)
- MapReduce – Relational Algebra/SQL operators (slides) (slides – no black background)
Spark
- Introduction to Apache Spark (slides) (slides – no black background)
  - How to submit Spark applications (slides) (slides – no black background)
- RDD-based programs
  - RDDs: creation, basic transformations and actions (slides) (slides – no black background)
  - Key-value pair RDDs: transformations and actions on PairRDDs (slides) (slides – no black background)
  - DoubleRDDs (slides) (slides – no black background)
  - Advanced Topics: Cache, accumulators, broadcast variables (slides) (slides – no black background)
- Spark SQL, Datasets and DataFrames (slides) (slides – no black background)
- Data Mining – Recap
  - Introduction (slides)
- Spark MLlib
  - Spark MLlib (slides) (slides – no black background)
    - Logistic Regression example code (zip)
    - Decision Trees example code (zip)
    - Decision Trees and Categorical class label example code (zip)
  - Spark MLlib – Classification of textual data (slides) (slides – no black background)
    - Textual data classification example code (zip)
  - Spark MLlib – Classification and Parameter tuning (slides) (slides – no black background)
    - Parameter tuning example code (zip)
  - Spark MLlib – Clustering of structured data (slides) (slides – no black background)
    - Clustering example code (zip)
  - Spark MLlib – Itemset and Association rule mining (slides) (slides – no black background)
    - Itemset and Association rule mining example code (zip)
  - Spark MLlib – Linear regression (slides) (slides – no black background)
    - Linear regression example code (zip)
- Spark Streaming (slides) (slides – no black background) Last update – Dec 11, 2020 (Slides 64-69 are new. The other slides have not been changed.)
  - Word Count – Streaming version (zip)
  - Word Count and Window (zip)
  - Word Count – Stateful version (zip)
  - Word Count – Streaming version – Read data from HDFS folder (zip)
  - Word Count – Output sort by key – Based on the transformPair() transformation (zip)
Relational and Non-relational databases for Big data
- Introduction to relational and non-relational databases for Big data (slides) (slides – no black background)

Exercises

MapReduce
- Basic project
  - Linux and macOS
    - Basic Eclipse project for MapReduce applications (based on maven) (MapReduceBasicProject.zip)
  - Windows
    - Setup instructions (ConfigureWindowsEnviroment.pdf)
      - You must also install JDK 1.8 and select it for the imported project in Eclipse. If you already have a JDK installed but its version is newer than 1.8, you must install JDK 1.8 as well.
    - Winutils executable (winutils.zip)
    - Basic Eclipse project for MapReduce applications (based on maven) (MapReduceBasicProjectWindows.zip)
- MapReduce exercises (slides) (slides – no black background)
  - Solutions of Exercises 1-12 (Solutions1_12.zip)
  - Solutions of Exercises 13-22 (Solutions13_22.zip)
  - Solutions of Exercises 23-29 (Solutions23_29.zip) – The solution of Exercise 23 Bis has been updated (October 29, 2020)
Spark
- Spark RDD-, Dataset-, DataFrame-based exercises (slides) (slides – no black background)
  - Example data – One folder with (few) data for each exercise (ExSparkData.zip)
  - Solutions of Exercises 30-50 (SolutionsExSpark.zip)
    - Ex. 39 Bis – Comparison between two alternative solutions (slides) (slides – no black background)
- Spark streaming exercises (slides) (slides – no black background)
  - Solutions of Exercises 51-53 (SolutionsSparkStreaming.zip)

Practices

No lab activities during the first two weeks

TEAM 1: Students from A to H – Monday from 5.30 pm to 7 pm
TEAM 2: Students from I to Z – Tuesday from 8.30 am to 10 am

Lab1: Hadoop and MapReduce
- Online virtual lab, for online questions and answers
  - Team 1: Monday, October 12 – 5.30 pm to 7 pm
  - Team 2: Tuesday, October 13 – 8.30 am to 10 am
- Problem specification (pdf)
- How to import a MapReduce program and run it locally on your PC using Eclipse + Maven (01_ImportProject_LocalRun.mp4)
- How to create a jar file and execute your application on the remote cluster BigData@Polito (02_Jar_ClusterExecution.mp4)
- Basic project and small example data set
  - Linux and macOS (Lab1.zip)
  - Windows (Lab1Windows.zip)
- Bigger data set: finefoods_text.txt (zip)
- Solution
  - Bonus track: Lab1_SolBonus_1920.zip
Lab2: Filter with Hadoop MapReduce
- Problem specification (pdf)
- Skeleton Eclipse project Hadoop – MapReduce
  - Linux and macOS (Lab2_Skeleton1920.zip)
  - Windows (Lab2Windows_Skeleton1920.zip)
- Outputs of the first lab (OutputFolderLab1.zip) (OutputFolderLab1BonusTrack.zip). You can use them to test your application locally on your PC.
- Solution
  - Lab2_Sol1920.zip
  - Lab2_SolBonus1920.zip
Lab3: Frequently bought/reviewed together application with Hadoop MapReduce
- Problem specification (pdf)
- Skeleton Eclipse project Hadoop – MapReduce
  - Linux and macOS (Lab3_Skeleton1920.zip)
  - Windows (Lab3Windows_Skeleton1920.zip)
- Input file (AmazonTransposedDataset_Sample.txt)
- Expected output/result (part-r-00000)
- Solution
  - Lab3_Sol1920.zip – Three alternative solutions are provided (they differ in efficiency)
  - Comments on the three uploaded solutions (slides) (slides – no black background)
Lab4: Normalized ratings for product recommendations with Hadoop MapReduce
- Problem specification (pdf)
- Sample dataset (ReviewsSample.csv)
- Skeleton Eclipse project Hadoop – MapReduce
  - Linux and macOS (Lab4_Skeleton1920.zip)
  - Windows (Lab4Windows_Skeleton1920.zip)
- Expected output (the input is the large file Reviews.csv) (resLab4.txt)
- Solution
  - Lab4_Sol.zip
Lab5: Filter data and compute basic statistics with Apache Spark
- Problem specification (pdf)
- Sample file (SampleLocalFile.csv)
- Skeleton Eclipse project Spark (Lab5BigData_Template1920.zip)
- Solution
  - Lab5_Sol.zip

Lab6: Frequently bought/reviewed together application with Apache Spark
- Problem specification (pdf)
- Sample file (ReviewsSample.csv)
- Skeleton Eclipse project Spark (Lab6BigData_Template1920.zip)
- Expected output – Task 1 (the input is the large file Reviews.csv) (outputTask1Lab6.zip)
- Solution
  - Lab6BigData_Sol1920.zip
Lab7: Bike sharing data analysis
- Problem specification (pdf)
- Sample data (zip)
- Skeleton Eclipse project Spark (Lab7BigData_Template1920.zip)
- Example KML file (zip)
- Expected output
  - Execution on sample data (sampleData/registerSample.csv and sampleData/stations.csv) and minimum criticality threshold = 0.4 (part-00000)
  - Execution on complete data (/data/students/bigdata-01QYD/Lab7/register.csv and /data/students/bigdata-01QYD/Lab7/stations.csv) and minimum criticality threshold = 0.6 (part-00000)
- Another KML viewer that you can use to display the result of your analysis on a map: http://kmlviewer.nsspot.net
- Solution
  - Lab7BigData_Sol1920.zip
Lab8: Bike sharing data analysis based on Spark SQL
- Problem specification (pdf)
- Sample data (zip)
- Skeleton Eclipse project Spark (Lab8BigData_Template1920.zip)
- Expected output
  - Execution on sample data (sampleData/registerSample.csv and sampleData/stations.csv) and minimum criticality threshold = 0.4 (out_Lab8sample.zip)
  - Execution on complete data (/data/students/bigdata-01QYD/Lab8/register.csv and /data/students/bigdata-01QYD/Lab8/stations.csv) and minimum criticality threshold = 0.6 (out_Lab8.zip)
- Solution
  - Lab8BigData_Sol.zip
Lab9: A classification pipeline with MLlib + SparkSQL
- Problem specification (pdf)
- Sample file with 100 reviews (ReviewsSample.csv)
- Skeleton Eclipse project Spark (Lab9BigData_Template1920.zip)
- Solution
  - Logistic regression (zip)
  - DecisionTree (zip)
  - Logistic regression based on text analysis (zip)
  - DecisionTree based on text analysis (zip)
Lab10: Tweet analysis – Spark streaming
- Problem specification (pdf)
- Example files – tweets (exampledata_tweets.zip)
- Skeleton Eclipse project Spark (Lab10BigData_Template1920.zip)
- Solution
  - Lab10BigData_Sol1920.zip

Exam Examples

Note that, starting from this academic year (2020/21), the exam is closed book.

Exam June 26, 2018
- Exam – Version #1 (pdf)
  - Draft of the solution
    - Question 1: (c)
    - Question 2: (c)
    - Source code/Eclipse projects (zip)
- Exam – Version #2 (pdf)
  - Draft of the solution
    - Question 1: (b)
    - Question 2: (c)
    - Source code/Eclipse projects (zip)
Exam July 16, 2018
- Exam – Version #1 (pdf)
  - Draft of the solution
    - Question 1: (d)
    - Question 2: (a)
    - Source code/Eclipse projects (zip)
- Exam – Version #2 (pdf)
  - Draft of the solution
    - Question 1: (b)
    - Question 2: (d)
    - Source code/Eclipse projects (zip)
Exam September 3, 2018
- Exam – Version #1 (pdf)
  - Draft of the solution
    - Question 1: (d)
    - Question 2: (c)
    - Source code/Eclipse projects (zip)
- Exam – Version #2 (pdf)
  - Draft of the solution
    - Question 1: (b)
    - Question 2: (c)
    - Source code/Eclipse projects (zip)
Exam February 15, 2019
- Exam – Version #1 (pdf)
  - Draft of the solution
    - Question 1: (d)
    - Question 2: (c)
    - Source code/Eclipse projects (zip)
- Exam – Version #2 (pdf)
  - Draft of the solution
    - Question 1: (d)
    - Question 2: (b)
Exam July 2, 2019
- Exam – Version #1 (pdf)
  - Draft of the solution
    - Question 1: (a)
    - Question 2: (b)
    - Source code/Eclipse projects (zip)
- Exam – Version #2 (pdf)
  - Draft of the solution
    - Question 1: (a)
    - Question 2: (b)
    - Source code/Eclipse projects (zip)
Exam July 18, 2019
- Exam – Version #1 (pdf)
  - Draft of the solution
    - Question 1: (b)
    - Question 2: (b)
    - Source code/Eclipse projects (zip)
- Exam – Version #2 (pdf)
  - Draft of the solution
    - Question 1: (c)
    - Question 2: (b)
    - Source code/Eclipse projects (zip)
Exam July 2, 2020
- Exam (pdf)
  - Draft of the solution
    - Question 1: (b)
    - Question 2: (a)
    - Source code/Eclipse projects (zip)
Exam July 16, 2020
- Exam (pdf)
  - Draft of the solution
    - Question 1: (b)
    - Question 2: (b) – Note that there are two actions, so the input file is read twice.
    - Source code/Eclipse projects (zip)
Exam September 17, 2020
- Exam (pdf)
  - Draft of the solution
    - Question 1: (d)
    - Question 2: (c)
    - Source code/Eclipse projects (zip)
Exam February 5, 2021
- Exam (pdf)
  - Draft of the solution
    - Question 1: (b)
    - Question 2: (c)
    - Source code/Eclipse projects (zip)
Exam June 30, 2021
- Exam (pdf)
  - Draft of the solution
    - Question 1: (a)
    - Question 2: (c)
    - Source code/Eclipse projects (zip)

Additional material

Slides and screencasts about Java (kindly provided by prof. Torchiano) (link)
- Suggested slides/lectures for those students who have never used Java
  - OO Paradigm and UML (The UML part is not mandatory)
  - The Java Environment
  - Java Basic Features
  - Java Inheritance
Data mining – Centralized algorithms
- Data and Preprocessing (slides)
- Itemset mining and Association rules (slides)
- Classification (slides) (slides)
- Clustering (slides) (slides)

DataBase and Data Mining Group

Big Data: Architectures and Data Analytics (2020/2021)
