Big Data: Architectures and Data Analytics (2019/2020)
Table of contents
General information
Exam rules
Exam rules Academic Year 2019-2020 – ONLINE EXAMINATION SESSION (pdf)
Exam rules Academic Year 2019-2020 (pdf)
Announcements
(24/02/2020)
No lab activities during the first two weeks.
Slides
Introduction to the course content and exam rules (2 slides per page , 6 slides per page )
Introduction to Big Data (2 slides per page , 6 slides per page )
Hadoop and MapReduce
Introduction to Apache Hadoop and the MapReduce programming paradigm (2 slides per page , 6 slides per page )
Hadoop implementation of MapReduce (2 slides per page , 6 slides per page )
Source code of the Word Count Eclipse project (WordCount.zip) – Use the "Import Maven project" option to import it into Eclipse
PDF version of the code (i.e., PDF version of the Java files) (WordCountPDF.zip)
BigData@Polito environment + Jupyter – How to submit MapReduce jobs on BigData@Polito (2 slides per page , 6 slides per page )
MapReduce – Design patterns – Part 1 (2 slides per page , 6 slides per page )
MapReduce and Hadoop – Advanced Topics: Multiple inputs, Multiple outputs, Distributed cache (2 slides per page , 6 slides per page )
Updated on April 20, 2020 with some more details on the Distributed cache topic
MapReduce – Design patterns – Part 2 (2 slides per page , 6 slides per page )
MapReduce – Relational Algebra/SQL operators (2 slides per page , 6 slides per page )
Spark
Introduction to Apache Spark (2 slides per page , 6 slides per page ) (2 slides per page – no black background , 6 slides per page – no black background )
RDD-based programs
RDDs: creation, basic transformations and actions (2 slides per page , 6 slides per page ) (2 slides per page – no black background , 6 slides per page – no black background )
Key-value pair RDDs: transformations and actions on PairRDDs (2 slides per page , 6 slides per page ) (2 slides per page – no black background , 6 slides per page – no black background )
DoubleRDDs (2 slides per page , 6 slides per page ) (2 slides per page – no black background , 6 slides per page – no black background )
Advanced Topics: Cache, accumulators, broadcast variables (2 slides per page , 6 slides per page ) (2 slides per page – no black background , 6 slides per page – no black background )
Spark SQL, Datasets and DataFrames (2 slides per page , 6 slides per page ) (2 slides per page – no black background , 6 slides per page – no black background )
Data mining and Machine learning algorithms with Spark MLlib
Data Mining – Recap
Spark MLlib
Spark MLlib – Introduction and Classification of structured data (2 slides per page , 6 slides per page ) (2 slides per page – no black background , 6 slides per page – no black background )
Logistic Regression example code (zip )
Decision Trees example code (zip )
Decision Trees and Categorical class label example code (zip )
Spark MLlib – Classification of textual data (2 slides per page , 6 slides per page ) (2 slides per page – no black background , 6 slides per page – no black background )
Textual data classification example code (zip )
Spark MLlib – Classification and Parameter tuning (2 slides per page , 6 slides per page ) (2 slides per page – no black background , 6 slides per page – no black background )
Parameter tuning example code (zip )
Spark MLlib – Clustering of structured data (2 slides per page , 6 slides per page ) (2 slides per page – no black background , 6 slides per page – no black background )
Clustering example code (zip )
Spark MLlib – Itemset and Association rule mining (2 slides per page , 6 slides per page ) (2 slides per page – no black background , 6 slides per page – no black background )
Itemset and Association rule mining example code (zip )
Spark MLlib – Linear regression (2 slides per page , 6 slides per page ) (2 slides per page – no black background , 6 slides per page – no black background )
Linear regression example code (zip )
Spark Streaming (2 slides per page , 6 slides per page ) (2 slides per page – no black background , 6 slides per page – no black background )
Word Count – Streaming version (zip )
Word Count and Window (zip )
Word Count – Stateful version (zip )
Word Count – Streaming version – Read data from HDFS folder (zip )
Word Count – Output sort by key – Based on the transformPair() transformation (zip )
Relational and Non-relational databases for Big data
Exercises
Practices
No lab activities during the first two weeks
TEAM 1: Students from A to H – Tuesday from 5.30pm to 7pm
TEAM 2: Students from I to Z – Wednesday from 5.30pm to 7pm
Lab1: Hadoop and MapReduce
Online virtual lab, for online questions and answers
Team 1: Tuesday, March 24, 5.30pm – 7pm
Team 2: Wednesday, March 25, 5.30pm – 7pm
Problem specification (pdf )
How to import a MapReduce program and run it locally on your PC using Eclipse + Maven (01_ImportProject_LocalRun.mp4)
How to create a jar file and execute your application on the remote BigData@Polito cluster (02_Jar_ClusterExecution.mp4)
Basic project and small example data set
Solution
Lab2: Filter with Hadoop MapReduce
Problem specification (pdf )
Skeleton Eclipse project Hadoop – MapReduce
Solution
Lab3: Frequently bought/reviewed together application with Hadoop MapReduce
Lab4: Normalized ratings for product recommendations with Hadoop MapReduce (Team 1: Tuesday, April 7 – Team 2: Wednesday, April 8 )
Problem specification (pdf )
Sample dataset (ReviewsSample.csv )
Skeleton Eclipse project Hadoop – MapReduce
Solution
Lab5: Filter data and compute basic statistics with Apache Spark (Team 1: Tuesday, April 21 – Team 2: Wednesday, April 22 )
Lab6: Frequently bought/reviewed together application with Apache Spark (Team 1: Tuesday, April 28 – Team 2: Wednesday, April 29 )
Lab7: Bike sharing data analysis (Team 1 and Team 2: Tuesday, May 5 )
Lab8: Bike sharing data analysis based on Spark SQL (Team 1: Tuesday, May 19 – Team 2: Wednesday, May 20 )
Lab9: A classification pipeline with MLlib + SparkSQL
Lab10: Tweet analysis – Spark streaming (Team 1 and Team 2: Wednesday, June 3 )
Exam Examples
At the exam, the following template will be provided for the Driver part of the Hadoop exercise (Hadoop template)
For the Spark exercises, no templates are provided
Exam example #1
Exam (pdf )
Solution
Source code/Eclipse projects (zip )
Exam example #2
Exam (pdf )
Solution
Source code/Eclipse projects (zip )
Exam July 1, 2016
Exam (pdf )
Solution
Question 1: (d)
Question 2: (b)
Source code/Eclipse projects (zip )
Exam July 12, 2016
Exam (pdf )
Solution
Question 1: (a)
Question 2: (a)
Source code/Eclipse projects (zip )
Exam September 19, 2016
Exam (pdf )
Solution
Question 1: (c)
Question 2: (a)
Source code/Eclipse projects (zip )
Exam June 30, 2017
Exam (pdf )
Solution
Question 1: (b)
Question 2: (c)
Source code/Eclipse projects (zip ) – Updated on June 12, 2019
Exam July 14, 2017
Exam (pdf )
Solution
Question 1: (d)
Question 2: (c)
Source code/Eclipse projects (zip )
Exam September 14, 2017
Exam (pdf )
Solution
Question 1: (a)
Question 2: (b)
Source code/Eclipse projects (zip )
Exam January 22, 2018
Exam (pdf )
Solution
Question 1: (b)
Question 2: (b)
Source code/Eclipse projects (zip )
Exam June 26, 2018
Exam – Version #1 (pdf )
Draft of the solution
Question 1: (c)
Question 2: (c)
Source code/Eclipse projects (zip )
Exam – Version #2 (pdf )
Draft of the solution
Question 1: (b)
Question 2: (c)
Source code/Eclipse projects (zip )
Exam July 16, 2018
Exam – Version #1 (pdf )
Draft of the solution
Question 1: (d)
Question 2: (a)
Source code/Eclipse projects (zip )
Exam – Version #2 (pdf )
Draft of the solution
Question 1: (b)
Question 2: (d)
Source code/Eclipse projects (zip )
Exam September 3, 2018
Exam – Version #1 (pdf )
Draft of the solution
Question 1: (d)
Question 2: (c)
Source code/Eclipse projects (zip )
Exam – Version #2 (pdf )
Draft of the solution
Question 1: (b)
Question 2: (c)
Exam February 15, 2019
Exam – Version #1 (pdf )
Draft of the solution
Question 1: (d)
Question 2: (c)
Source code/Eclipse projects (zip )
Exam – Version #2 (pdf )
Draft of the solution
Question 1: (d)
Question 2: (b)
Exam July 2, 2019
Exam – Version #1 (pdf )
Draft of the solution
Question 1: (a)
Question 2: (b)
Source code/Eclipse projects (zip )
Exam – Version #2 (pdf )
Draft of the solution
Question 1: (a)
Question 2: (b)
Source code/Eclipse projects (zip )
Exam July 18, 2019
Exam – Version #1 (pdf )
Draft of the solution
Question 1: (b)
Question 2: (b)
Source code/Eclipse projects (zip )
Exam – Version #2 (pdf )
Draft of the solution
Question 1: (c)
Question 2: (b)
Source code/Eclipse projects (zip )
Exam July 2, 2020
Exam (pdf )
Draft of the solution
Question 1: (b)
Question 2: (a) – Note that there are two actions, and the path leading to each action includes the filter transformation. Hence, the filter transformation is executed twice, which is why the final value of the accumulator is 4.
Source code/Eclipse projects (zip )
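The note on Question 2 of the July 2, 2020 exam can be illustrated with a small plain-Java analogy. This is only a sketch under stated assumptions: it is not Spark code and not the program from the exam. The class name, the input values, and the filter condition are hypothetical; a `Supplier` stands in for an uncached RDD and an `AtomicInteger` stands in for the accumulator. The input is chosen so that each pass over the filter adds 2 to the counter.

```java
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Supplier;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Plain-Java analogy of lazy evaluation with a side-effecting filter.
// NOT Spark code: all names and data below are hypothetical.
class LazyAccumulatorSketch {

    // Runs two "actions" on the same lazy pipeline and returns the counter value.
    static int runTwoActions() {
        // Hypothetical input: 2 of the 3 values satisfy the filter condition.
        List<Integer> input = Arrays.asList(5, 12, 20);

        // Stand-in for a Spark accumulator.
        AtomicInteger accumulator = new AtomicInteger(0);

        // Stand-in for an uncached RDD: each call to get() rebuilds the
        // pipeline, so the filter and its side effect run again.
        Supplier<Stream<Integer>> filtered = () -> input.stream()
                .filter(v -> {
                    boolean keep = v > 10;
                    if (keep) {
                        accumulator.incrementAndGet();
                    }
                    return keep;
                });

        // First "action": materialize the filtered values (filter runs once).
        List<Integer> kept = filtered.get().collect(Collectors.toList());

        // Second "action": sum the filtered values (filter runs a second time).
        int sum = filtered.get().mapToInt(Integer::intValue).sum();

        System.out.println("kept = " + kept + ", sum = " + sum);
        return accumulator.get();
    }

    public static void main(String[] args) {
        System.out.println("accumulator = " + runTwoActions());
    }
}
```

In this sketch the counter ends at twice the per-pass increment because nothing is stored between the two actions; the same reasoning applies to the accumulator in the exam question.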
Exam July 16, 2020
Exam (pdf )
Draft of the solution
Question 1: (b)
Question 2: (b) – Note that there are two actions; hence, the input file is read twice.
Source code/Eclipse projects (zip )
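The note on Question 2 of the July 16, 2020 exam follows the same reasoning: without caching, each action re-evaluates its whole lineage, starting from the input. The sketch below is a plain-Java analogy, not Spark code and not the exam's program. The class name, the sample lines, and the in-memory "file" are assumptions made only for illustration; the `Supplier` plays the role of a lazily evaluated read of the input.

```java
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Supplier;

// Plain-Java analogy: each "action" triggers a fresh read of the input.
// NOT Spark code: all names and data below are hypothetical.
class LazyInputReadSketch {

    // Runs two "actions" and returns how many times the input was read.
    static int countReads() {
        // Hypothetical file content, kept in memory for simplicity.
        List<String> fileContent = Arrays.asList("a,1", "b,2", "c,3");

        // Counts how many times the input is (re)read.
        AtomicInteger reads = new AtomicInteger(0);

        // Stand-in for a lazily evaluated, uncached read of the input file.
        Supplier<List<String>> readInput = () -> {
            reads.incrementAndGet();
            return fileContent;
        };

        // First "action": count all lines (triggers one read).
        int totalLines = readInput.get().size();

        // Second "action": count the lines starting with "a" (triggers another read).
        long startingWithA = readInput.get().stream()
                .filter(line -> line.startsWith("a"))
                .count();

        System.out.println("lines = " + totalLines + ", starting with a = " + startingWithA);
        return reads.get();
    }

    public static void main(String[] args) {
        System.out.println("input reads = " + countReads());
    }
}
```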
Exam September 19, 2020
Exam (pdf )
Draft of the solution
Question 1: (d)
Question 2: (c)
Source code/Eclipse projects (zip )
Additional material
Slides and screencasts about Java (kindly provided by prof. Torchiano) (link )
Suggested slides/lectures for those students who have never used Java
OO Paradigm and UML (The UML part is not mandatory)
The Java Environment
Java Basic Features
Java Inheritance
© 2025 - DataBase and Data Mining Group