Big Data: Architectures and Data Analytics (2018/2019)
Table of contents
- General information
- Exam rules
- Announcements
- Materials
- Exercises
- Exam Examples
- Practices
- Additional materials
General information
- ECTS: 6
- Professor: Paolo Garza
- Students from AA to GZ
- Teaching assistants:
- Alessandro Farasin
- Francesco Ventura
- Students from HA to ZZ
- Teaching assistants:
- Andrea Pasini
- Marilisa Montemurro
Exam rules
- Exam rules Academic Year 2018-2019 (pdf)
Announcements
- (21/01/2020)
- The exam scheduled for January 24, 2020 will be held at 11:00 in Classroom 3I
- Please remember to bring with you:
- the student card and/or an identity document
- sheets of paper (“fogli protocollo”)
Materials
- Introduction to the course (2 slides per page, 6 slides per page)
- Introduction to Big Data (2 slides per page, 6 slides per page)
- Hadoop and MapReduce
- Introduction to Apache Hadoop and the MapReduce programming paradigm (2 slides per page, 6 slides per page)
- Hadoop implementation of MapReduce – Basic structure of MapReduce programs in Hadoop (2 slides per page, 6 slides per page)
- Source code of the Word Count Eclipse project (WordCount.zip) – Use the import option to import it into Eclipse
- PDF version of the code (i.e., PDF version of the java files) (WordCountPDF.zip)
- BigData@Polito environment (2 slides per page, 6 slides per page)
- Interaction with HDFS and Hadoop by means of the command line (2 slides per page, 6 slides per page)
- MapReduce programs and Hadoop – Part 2 (2 slides per page, 6 slides per page)
- MapReduce programs and Hadoop – Part 3 (2 slides per page, 6 slides per page)
- MapReduce – Design patterns – Part 1 (2 slides per page, 6 slides per page)
- MapReduce – Multiple Inputs and Multiple Outputs (2 slides per page, 6 slides per page)
- MapReduce – Distributed cache
- New APIs (2 slides per page, 6 slides per page)
- MapReduce – Design patterns – Part 2 (2 slides per page, 6 slides per page)
- MapReduce – Relational Algebra/SQL operators (2 slides per page, 6 slides per page)
- Spark
- Introduction to Apache Spark (2 slides per page, 6 slides per page)
- Introduction to Apache Spark – Part 2 (2 slides per page, 6 slides per page)
- RDD-based programs (RDDs creation and basic transformations) – Part 1 (2 slides per page, 6 slides per page)
- RDD-based programs (RDDs basic actions) – Part 2 (2 slides per page, 6 slides per page)
- How to submit a Spark application (2 slides per page, 6 slides per page)
- RDD-based programs (key-value pair RDDs and transformations on PairRDDs) – Part 3 (2 slides per page, 6 slides per page) – Updated on April 10, 2018 (five new slides on flatMapToPair)
- RDD-based programs (Set transformations and actions on PairRDDs) – Part 4 (2 slides per page, 6 slides per page)
- RDD-based programs (DoubleRDDs) – Part 5 (2 slides per page, 6 slides per page)
- RDD-based programs (Cache, accumulators, broadcast variables) – Part 6 (2 slides per page, 6 slides per page)
- Datasets, DataFrames and Spark SQL (2 slides per page, 6 slides per page)
- Spark SQL example – DataFrames vs Datasets vs SQL
- Problem specification (2 slides per page, 6 slides per page)
- Solution (zip)
- Spark SQL and User Defined Functions (UDFs) (2 slides per page, 6 slides per page)
- Data mining and Machine learning algorithms with Spark MLlib
- Data Mining – Recap
- Introduction (2 slides per page, 6 slides per page)
- Data and Preprocessing (2 slides per page, 6 slides per page)
- Itemset mining and Association rules (2 slides per page, 6 slides per page)
- Classification (2 slides per page, 6 slides per page)
- Clustering (2 slides per page, 6 slides per page)
- Spark MLlib
- Spark MLlib – Introduction and Classification of structured data (2 slides per page, 6 slides per page)
- Spark MLlib – Classification of textual data (2 slides per page, 6 slides per page)
- Textual data classification example code (zip)
- Spark MLlib – Parameter tuning (2 slides per page, 6 slides per page)
- Parameter tuning example code (zip)
- Spark MLlib – Clustering of structured data (2 slides per page, 6 slides per page)
- Clustering example code (zip)
- Spark MLlib – Itemset and Association rule mining (2 slides per page, 6 slides per page)
- Itemset and Association rule mining example code (zip)
- Spark MLlib – Linear regression (2 slides per page, 6 slides per page)
- Linear regression example code (zip)
- Spark Streaming (2 slides per page, 6 slides per page)
- DBMS for Big data
- Relational and Non-relational databases for Big data (2 slides per page, 6 slides per page)
Exercises
- MapReduce Exercises – Part 1 (2 slides per page, 6 slides per page)
- MapReduce Exercises – Part 2 (2 slides per page, 6 slides per page)
- Solutions – Part 1 and 2
- Source code/Eclipse – maven projects (SolutionsExercisesPart1_Part2.zip)
- MapReduce Exercises – Part 3 (2 slides per page, 6 slides per page)
- Solutions – Part 3
- Source code/Eclipse – maven projects (SolutionsExercisesPart3.zip)
- MapReduce Exercises – Part 4 (2 slides per page, 6 slides per page)
- Solutions – Part 4
- Source code/Eclipse projects (SolutionsExercisesPart4.zip)
- MapReduce Exercises – Part 5 (2 slides per page, 6 slides per page)
- Solutions – Part 5
- Source code/Eclipse projects (SolutionsExercisesPart5.zip)
- MapReduce Exercises – Part 6 (2 slides per page, 6 slides per page)
- Solutions – Part 6
- Source code/Eclipse projects (SolutionsExercisesPart6.zip) – An alternative solution for exercise 23 has been uploaded on March 25, 2019 (Exercise23TwoJobsV2)
- MapReduce Exercises – Part 7 (2 slides per page, 6 slides per page)
- Solutions – Part 7
- Source code/Eclipse projects (SolutionsExercisesPart7.zip)
- Spark Exercises – Part 8 (2 slides per page, 6 slides per page)
- Simulation – Exercise #31 (2 slides per page, 6 slides per page)
- Solutions – Part 8
- Source code/Eclipse projects (SolutionsExercisesPart8.zip)
- Solutions of Exercises 32-36 based on Spark SQL (with Dataset, DataFrame, SQL-like language)
- Source code/Eclipse projects (SolutionsExercisesPart8SparkSQL.zip)
- Spark Exercises – Part 9 (2 slides per page, 6 slides per page) – Updated on April 13, 2019 (Exercise 39 bis has been included)
- Solutions – Part 9
- Source code/Eclipse projects (SolutionsExercisesPart9.zip)
- Source code/Eclipse projects (SolutionsExercise39Bis.zip)
- Solutions of Exercises 37-38 based on Spark SQL (with Dataset, DataFrame, SQL-like language)
- Source code/Eclipse projects (SolutionsExercisesPart9SparkSQL.zip)
- Spark Exercises – Part 10 (2 slides per page, 6 slides per page)
- Solutions – Part 10
- Source code/Eclipse projects (SolutionsExercisesPart10.zip)
- Spark Exercises – Part 11 (2 slides per page, 6 slides per page)
- Solutions – Part 11
- Source code/Eclipse projects (SolutionsExercisesPart11.zip)
- Spark Exercises – Part 12 (2 slides per page, 6 slides per page)
- Solutions – Part 12
- Source code/Eclipse projects (SolutionsExercisesPart12.zip)
- Spark UDFs Exercises – Part 14 (2 slides per page, 6 slides per page)
- Solutions – Part 14
- Source code/Eclipse projects (SolutionsExercisesPart14.zip)
- Spark Streaming Exercises – Part 13 (2 slides per page, 6 slides per page)
- Solutions – Part 13
- Source code/Eclipse projects (SolutionsExercisesPart13.zip)
- Spark Streaming Exercises – Part 15 (2 slides per page, 6 slides per page) – Uploaded on May 29, 2019
- Solutions – Part 15
- Source code/Eclipse projects (SolutionsExercisesPart15.zip)
Exam Examples
- At the exam, the following template for the Driver part will be provided for the Hadoop-based exercise (Hadoop template)
- For the Spark exercises, no templates are provided
- Exam example #1
- Exam example #2
- Exam July 1, 2016
- Exam July 12, 2016
- Exam September 19, 2016
- Exam June 30, 2017
- Exam July 14, 2017
- Exam September 14, 2017
- Exam January 22, 2018
- Exam June 26, 2018
- Exam July 16, 2018
- Exam September 3, 2018
- Exam February 15, 2019
- Exam July 2, 2019
- Exam July 18, 2019
- Exam September 19, 2019
- Text (pdf)
- Draft of the solution
- Question 1: (b)
- Question 2: (b)
Practices
- No lab activities during the first two weeks
- Schedule of the lab activities
- Students from AA to GZ
- TEAM 1: Students from AA to CI – Tuesday from 5.30pm to 7pm
- TEAM 2: Students from CL to GZ – Wednesday from 5.30pm to 7pm

| Lab | Team 1 | Team 2 |
|---|---|---|
| Lab #1 | Tuesday, March 19 – from 5.30pm to 7pm | Wednesday, March 20 – from 5.30pm to 7pm |
| Lab #2 | Tuesday, March 26 – from 5.30pm to 7pm | Wednesday, March 27 – from 5.30pm to 7pm |
| Lab #3 | Tuesday, April 2 – from 5.30pm to 7pm | Wednesday, April 3 – from 5.30pm to 7pm |
| Lab #4 | Tuesday, April 9 – from 5.30pm to 7pm | Wednesday, April 10 – from 5.30pm to 7pm |
| Lab #5 | Tuesday, April 16 – from 5.30pm to 7pm | Wednesday, April 17 – from 5.30pm to 7pm |
| Lab #6 | Tuesday, May 7 – from 5.30pm to 7pm | Wednesday, May 8 – from 5.30pm to 7pm |
| Lab #7 | Tuesday, May 14 – from 5.30pm to 7pm | Wednesday, May 15 – from 5.30pm to 7pm |
| Lab #8 | Tuesday, May 21 – from 5.30pm to 7pm | Wednesday, May 22 – from 5.30pm to 7pm |
| Lab #9 | Tuesday, May 28 – from 5.30pm to 7pm | Wednesday, May 29 – from 5.30pm to 7pm |
| Lab #10 | Tuesday, June 4 – from 5.30pm to 7pm | Wednesday, June 5 – from 5.30pm to 7pm |

- Students from HA to ZZ
- TEAM 1: Students from HA to QZ – Tuesday from 10am to 11.30am
- TEAM 2: Students from RA to ZZ – Wednesday from 1pm to 2.30pm

| Lab | Team 1 | Team 2 |
|---|---|---|
| Lab #1 | Tuesday, March 19 – from 10am to 11.30am | Wednesday, March 20 – from 1pm to 2.30pm |
| Lab #2 | Tuesday, March 26 – from 10am to 11.30am | Wednesday, March 27 – from 1pm to 2.30pm |
| Lab #3 | Tuesday, April 2 – from 10am to 11.30am | Wednesday, April 3 – from 1pm to 2.30pm |
| Lab #4 | Tuesday, April 9 – from 10am to 11.30am | Wednesday, April 10 – from 1pm to 2.30pm |
| Lab #5 | Tuesday, April 16 – from 10am to 11.30am | Wednesday, April 17 – from 1pm to 2.30pm |
| Lab #6 | Tuesday, May 7 – from 10am to 11.30am | Wednesday, May 8 – from 1pm to 2.30pm |
| Lab #7 | Tuesday, May 14 – from 10am to 11.30am | Wednesday, May 15 – from 1pm to 2.30pm |
| Lab #8 | Tuesday, May 21 – from 10am to 11.30am | Wednesday, May 22 – from 1pm to 2.30pm |
| Lab #9 | Tuesday, May 28 – from 10am to 11.30am | Wednesday, May 29 – from 1pm to 2.30pm |
| Lab #10 | Tuesday, June 4 – from 10am to 11.30am | Wednesday, June 5 – from 1pm to 2.30pm |

- Lab1: Hadoop and MapReduce
- BigData@Polito environment (2 slides per page, 6 slides per page)
- Text (pdf)
- Project and data (Lab1_BigData.zip)
- Solution
- Bonus track (zip)
- Lab2: Filter with Hadoop MapReduce
- Text (pdf)
- Skeleton Eclipse project Hadoop – MapReduce (Lab_Skeleton.zip)
- Solution
- Lab3: Frequently bought/reviewed together application with Hadoop MapReduce
- Text (pdf)
- Skeleton Eclipse project Hadoop – MapReduce (Lab3_Skeleton.zip)
- Solution
- Solution (zip) – Three alternative solutions are provided
- Comments on the three uploaded possible solutions (2 slides per page, 6 slides per page)
- Lab4: Normalized ratings for product recommendations with Hadoop MapReduce
- Text (pdf)
- Sample dataset (ReviewsSample.csv)
- Skeleton Eclipse project Hadoop – MapReduce (Lab4_Skeleton.zip)
- Solution (zip)
- Lab5: Filter data and compute basic statistics with Apache Spark
- Text (pdf)
- SampleLocalFile.csv (SampleLocalFile.csv)
- Skeleton Eclipse project – Spark (Lab5_Template.zip)
- Solution (zip)
- Lab6: Frequently bought/reviewed together application with Apache Spark
- Text (pdf)
- Sample dataset (ReviewsSample.csv)
- Skeleton Eclipse project – Spark (Lab6_Template.zip)
- Solution (zip)
- Lab7: Bike sharing data analysis
- Text (pdf)
- Sample data (zip)
- Skeleton Eclipse project – Spark (Lab7_Template.zip)
- Example of KML file (zip)
- Another KML visualizer that can be used to visualize on a map the result of your analysis: http://kmlviewer.nsspot.net/
- Solution (zip)
- Lab8: Bike sharing data analysis based on Spark SQL
- Text (pdf)
- Sample data (zip)
- Skeleton Eclipse project – Spark (Lab8_Template.zip)
- Solution (Dataset-based.zip) (SQL-based.zip) (DataFrame-based.zip)
- Lab9: A classification pipeline with MLlib + SparkSQL
- Text (pdf)
- Skeleton Eclipse project – Spark (Lab9_Template.zip)
- Sample file with 100 reviews (ReviewsSample.csv)
- Solution
- Lab10: Tweet analysis – Spark streaming
- Text (pdf)
- Skeleton Eclipse project – (Lab10_Template.zip)
- Example files – tweets (exampledata_tweets.zip)
- Solution (zip)
Additional materials
- Slides and screencasts about Java (kindly provided by prof. Torchiano) (link)
- Suggested slides/lectures for those students who do not know Java
- OO Paradigm and UML (the UML part is not mandatory)
- The Java Environment
- Java Basic Features
- Java Inheritance
- Slides about the Scala language – These slides are not part of the course program (no questions or exercises on these slides at the exam)
- MapReduce – Hadoop internals (2 slides per page, 6 slides per page) – These slides are not part of the course program (no questions or exercises on these slides at the exam)
- Apache HIVE (2 slides per page, 6 slides per page) – These slides are not part of the course program (no questions or exercises on these slides at the exam)
- Apache Storm – These slides are not part of the course program (no questions or exercises on these slides at the exam)
- Introduction (2 slides per page, 6 slides per page)
- Storm Architecture (2 slides per page, 6 slides per page)
- Developing Storm applications (2 slides per page, 6 slides per page)
- Advanced topics (2 slides per page, 6 slides per page)
- Trident (2 slides per page, 6 slides per page)