Table of contents
General information
Exam rules
- Exam rules Academic Year 2019-2020 – ONLINE EXAMINATION SESSION (pdf)
- Exam rules Academic Year 2019-2020 (pdf)
Announcements
- (24/02/2020) No lab activities during the first two weeks.
 
Slides
- Introduction to the course content and exam rules (2 slides per page, 6 slides per page)
- Introduction to Big Data (2 slides per page, 6 slides per page) 
- Hadoop and MapReduce
- Introduction to Apache Hadoop and the MapReduce programming paradigm (2 slides per page, 6 slides per page)
- Hadoop implementation of MapReduce (2 slides per page, 6 slides per page)
- Source code of the Word Count Eclipse project (WordCount.zip) – Use the "Import Maven project" option to import it into Eclipse
- PDF version of the code (i.e., PDF version of the Java files) (WordCountPDF.zip)
- BigData@Polito environment + Jupyter – How to submit MapReduce jobs on BigData@Polito (2 slides per page, 6 slides per page)
 
- MapReduce – Design patterns – Part 1 (2 slides per page, 6 slides per page)
- MapReduce and Hadoop – Advanced Topics: Multiple inputs, Multiple outputs, Distributed cache (2 slides per page, 6 slides per page)
- Updated on April 20, 2020 with some more details on the Distributed cache topic
 
- MapReduce – Design patterns – Part 2 (2 slides per page, 6 slides per page)
- MapReduce – Relational Algebra/SQL operators (2 slides per page, 6 slides per page)
 
- Spark
- Introduction to Apache Spark (2 slides per page, 6 slides per page) (2 slides per page – no black background, 6 slides per page – no black background)
- RDD-based programs
- RDDs: creation, basic transformations and actions (2 slides per page, 6 slides per page) (2 slides per page – no black background, 6 slides per page – no black background)
- Key-value pair RDDs: transformations and actions on PairRDDs (2 slides per page, 6 slides per page) (2 slides per page – no black background, 6 slides per page – no black background)
- DoubleRDDs (2 slides per page, 6 slides per page) (2 slides per page – no black background, 6 slides per page – no black background)
- Advanced Topics: Cache, accumulators, broadcast variables (2 slides per page, 6 slides per page) (2 slides per page – no black background, 6 slides per page – no black background)
 
- Spark SQL, Datasets and DataFrames (2 slides per page, 6 slides per page) (2 slides per page – no black background, 6 slides per page – no black background)
- Data mining and Machine learning algorithms with Spark MLlib
- Data Mining – Recap
- Spark MLlib
- Spark MLlib – Introduction and Classification of structured data (2 slides per page, 6 slides per page) (2 slides per page – no black background, 6 slides per page – no black background)
- Logistic Regression example code (zip)
- Decision Trees example code (zip)
- Decision Trees and Categorical class label example code (zip)
 
- Spark MLlib – Classification of textual data (2 slides per page, 6 slides per page) (2 slides per page – no black background, 6 slides per page – no black background)
- Textual data classification example code (zip)
 
- Spark MLlib – Classification and Parameter tuning (2 slides per page, 6 slides per page) (2 slides per page – no black background, 6 slides per page – no black background)
- Parameter tuning example code (zip)
 
- Spark MLlib – Clustering of structured data (2 slides per page, 6 slides per page) (2 slides per page – no black background, 6 slides per page – no black background)
- Clustering example code (zip)
 
- Spark MLlib – Itemset and Association rule mining (2 slides per page, 6 slides per page) (2 slides per page – no black background, 6 slides per page – no black background)
- Itemset and Association rule mining example code (zip)
 
- Spark MLlib – Linear regression (2 slides per page, 6 slides per page) (2 slides per page – no black background, 6 slides per page – no black background)
- Linear regression example code (zip)
 
 
 
- Spark Streaming (2 slides per page, 6 slides per page) (2 slides per page – no black background, 6 slides per page – no black background)
- Word Count – Streaming version (zip)
- Word Count and Window (zip)
- Word Count – Stateful version (zip)
- Word Count – Streaming version – Read data from HDFS folder (zip)
- Word Count – Output sort by key – Based on the transformPair() transformation (zip)
 
 
- Relational and Non-relational databases for Big data
Exercises
Practices
- No lab activities during the first two weeks
- TEAM 1: Students from A to H – Tuesday from 5.30pm to 7pm
- TEAM 2: Students from I to Z – Wednesday from 5.30pm to 7pm
- Lab1: Hadoop and MapReduce
- Online virtual lab, for online questions and answers
- Team 1: Tuesday, March 24 – 5.30pm to 7pm
- Team 2: Wednesday, March 25 – 5.30pm to 7pm
 
- Problem specification (pdf)
- How to import a MapReduce program and run it locally on your PC using Eclipse + Maven (01_ImportProject_LocalRun.mp4)
- How to create a jar file and execute your application on the remote cluster BigData@Polito (02_Jar_ClusterExecution.mp4)
- Basic project and small example data set
- Solution
 
- Lab2: Filter with Hadoop MapReduce
- Problem specification (pdf)
- Skeleton Eclipse project Hadoop – MapReduce
- Solution
 
- Lab3: Frequently bought/reviewed together application with Hadoop MapReduce
- Lab4: Normalized ratings for product recommendations with Hadoop MapReduce (Team 1: Tuesday, April 7 – Team 2: Wednesday, April 8)
- Problem specification (pdf)
- Sample dataset (ReviewsSample.csv)
- Skeleton Eclipse project Hadoop – MapReduce
- Solution
 
- Lab5: Filter data and compute basic statistics with Apache Spark (Team 1: Tuesday, April 21 – Team 2: Wednesday, April 22)
- Lab6: Frequently bought/reviewed together application with Apache Spark (Team 1: Tuesday, April 28 – Team 2: Wednesday, April 29)
- Lab7: Bike sharing data analysis (Team 1 and Team 2: Tuesday, May 5)
- Lab8: Bike sharing data analysis based on Spark SQL (Team 1: Tuesday, May 19 – Team 2: Wednesday, May 20)
- Lab9: A classification pipeline with MLlib + SparkSQL
- Lab10: Tweet analysis – Spark streaming (Team 1 and Team 2: Wednesday, June 3)
Exam Examples
- At the exam, the following template will be provided for the Driver part of the Hadoop-based exercise (Hadoop template)
- For the Spark exercises, no templates are provided
 
- Exam example #1
- Exam (pdf)
- Solution
- Source code/Eclipse projects (zip)
 
 
- Exam example #2
- Exam (pdf)
- Solution
- Source code/Eclipse projects (zip)
 
 
- Exam July 1, 2016
- Exam (pdf)
- Solution
- Question 1: (d)
- Question 2: (b)
- Source code/Eclipse projects (zip)
 
 
- Exam July 12, 2016
- Exam (pdf)
- Solution
- Question 1: (a)
- Question 2: (a)
- Source code/Eclipse projects (zip)
 
 
- Exam September 19, 2016
- Exam (pdf)
- Solution
- Question 1: (c)
- Question 2: (a)
- Source code/Eclipse projects (zip)
 
 
- Exam June 30, 2017
- Exam (pdf)
- Solution
- Question 1: (b)
- Question 2: (c)
- Source code/Eclipse projects (zip) – Updated on June 12, 2019
 
 
- Exam July 14, 2017
- Exam (pdf)
- Solution
- Question 1: (d)
- Question 2: (c)
- Source code/Eclipse projects (zip)
 
 
- Exam September 14, 2017
- Exam (pdf)
- Solution
- Question 1: (a)
- Question 2: (b)
- Source code/Eclipse projects (zip)
 
 
- Exam January 22, 2018
- Exam (pdf)
- Solution
- Question 1: (b)
- Question 2: (b)
- Source code/Eclipse projects (zip)
 
 
- Exam June 26, 2018
- Exam – Version #1 (pdf)
- Draft of the solution
- Question 1: (c)
- Question 2: (c)
- Source code/Eclipse projects (zip)
 
 
- Exam – Version #2 (pdf)
- Draft of the solution
- Question 1: (b)
- Question 2: (c)
- Source code/Eclipse projects (zip)
 
 
 
- Exam July 16, 2018
- Exam – Version #1 (pdf)
- Draft of the solution
- Question 1: (d)
- Question 2: (a)
- Source code/Eclipse projects (zip)
 
 
- Exam – Version #2 (pdf)
- Draft of the solution
- Question 1: (b)
- Question 2: (d)
- Source code/Eclipse projects (zip)
 
 
 
- Exam September 3, 2018
- Exam – Version #1 (pdf)
- Draft of the solution
- Question 1: (d)
- Question 2: (c)
- Source code/Eclipse projects (zip)
 
 
- Exam – Version #2 (pdf)
- Draft of the solution
- Question 1: (b)
- Question 2: (c)
 
 
 
 
- Exam February 15, 2019
- Exam – Version #1 (pdf)
- Draft of the solution
- Question 1: (d)
- Question 2: (c)
- Source code/Eclipse projects (zip)
 
 
- Exam – Version #2 (pdf)
- Draft of the solution
- Question 1: (d)
- Question 2: (b)
 
 
 
- Exam July 2, 2019
- Exam – Version #1 (pdf)
- Draft of the solution
- Question 1: (a)
- Question 2: (b)
- Source code/Eclipse projects (zip)
 
 
- Exam – Version #2 (pdf)
- Draft of the solution
- Question 1: (a)
- Question 2: (b)
- Source code/Eclipse projects (zip)
 
 
 
- Exam July 18, 2019
- Exam – Version #1 (pdf)
- Draft of the solution
- Question 1: (b)
- Question 2: (b)
- Source code/Eclipse projects (zip)
 
 
- Exam – Version #2 (pdf)
- Draft of the solution
- Question 1: (c)
- Question 2: (b)
- Source code/Eclipse projects (zip)
 
 
 
- Exam July 2, 2020
- Exam (pdf)
- Draft of the solution
- Question 1: (b)
- Question 2: (a) – Note that there are two actions, and the path leading to each of them includes the filter transformation. Hence, the filter transformation is executed twice, which is why the value of the accumulator is 4.
- Source code/Eclipse projects (zip)
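The note on Question 2 can be illustrated with a minimal plain-Java sketch. It does not use Spark and does not reproduce the exam's code or data: a `Supplier<Stream<Integer>>` stands in for a lazily evaluated chain of transformations, an `AtomicInteger` stands in for the accumulator, and the two-element input is a hypothetical choice made so that the total matches the value 4. The class and method names are illustrative assumptions.

```java
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Supplier;
import java.util.stream.Collectors;
import java.util.stream.Stream;

public class LazyAccumulatorSketch {

    // Runs two "actions" over the same lazy filter and returns the counter value.
    static int runTwoActions(List<Integer> input) {
        // Stand-in for a Spark accumulator (assumption: incremented once per element seen by the filter).
        AtomicInteger accumulator = new AtomicInteger(0);

        // Lazy pipeline: defining it executes nothing. Every get() rebuilds it from the input.
        Supplier<Stream<Integer>> filtered = () -> input.stream()
                .filter(x -> {
                    accumulator.incrementAndGet();
                    return x > 0;
                });

        // First "action": materializes the filtered elements, so the filter runs once per element.
        List<Integer> kept = filtered.get().collect(Collectors.toList());

        // Second "action": the pipeline is re-evaluated, so the filter runs once per element again.
        int sum = filtered.get().reduce(0, Integer::sum);

        return accumulator.get();
    }

    public static void main(String[] args) {
        // Hypothetical two-element input: 2 elements x 2 actions = 4 increments.
        System.out.println("accumulator = " + runTwoActions(Arrays.asList(5, 7)));
    }
}
```

Each call to `filtered.get()` rebuilds the pipeline, which mirrors how each Spark action re-executes the transformations on its path when the intermediate result is not cached.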
 
 
 
- Exam July 16, 2020
- Exam (pdf)
- Draft of the solution
- Question 1: (b)
- Question 2: (b) – Note that there are two actions, and hence the input file is read twice.
- Source code/Eclipse projects (zip)
 
 
 
- Exam September 19, 2020
- Exam (pdf)
- Draft of the solution
- Question 1: (d)
- Question 2: (c)
- Source code/Eclipse projects (zip)
 
 
 
Additional material
- Slides and screencasts about Java (kindly provided by Prof. Torchiano) (link)
- Suggested slides/lectures for those students who have never used Java
- OO Paradigm and UML (The UML part is not mandatory)
- The Java Environment
- Java Basic Features
- Java Inheritance