Big Data: Architectures and Data Analytics (2019/2020)
Table of contents
- General information
- Exam rules
- Announcements
- Slides
- Exercises
- Practices
- Exam Examples
- Additional material
General information
- ECTS: 6
- Professor: Paolo Garza
- Teaching assistants:
- Alessandro Farasin
- Marilisa Montemurro
Exam rules
- Exam rules Academic Year 2019-2020 – ONLINE EXAMINATION SESSION (pdf)
- Exam rules Academic Year 2019-2020 (pdf)
Announcements
- (24/02/2020)
- No lab activities during the first two weeks.
Slides
- Introduction to the course content and exam rules (2 slides per page, 6 slides per page)
- Introduction to Big Data (2 slides per page, 6 slides per page)
- Hadoop and MapReduce
- Introduction to Apache Hadoop and the MapReduce programming paradigm (2 slides per page, 6 slides per page)
- Interaction with HDFS and Hadoop by means of the command line (2 slides per page, 6 slides per page)
- Hadoop implementation of MapReduce (2 slides per page, 6 slides per page)
- Source code of the Word Count Eclipse project (WordCount.zip) – Use the "Import Maven project" option to import it into Eclipse
- PDF version of the code (i.e., PDF version of the Java files) (WordCountPDF.zip)
- BigData@Polito environment + Jupyter – How to submit MapReduce jobs on BigData@Polito (2 slides per page, 6 slides per page)
- MapReduce – Design patterns – Part 1 (2 slides per page, 6 slides per page)
- MapReduce and Hadoop – Advanced Topics: Multiple inputs, Multiple outputs, Distributed cache (2 slides per page, 6 slides per page)
- Updated on April 20, 2020 with some more details on the Distributed cache topic
- MapReduce – Design patterns – Part 2 (2 slides per page, 6 slides per page)
- MapReduce – Relational Algebra/SQL operators (2 slides per page, 6 slides per page)
- Spark
- Introduction to Apache Spark (2 slides per page, 6 slides per page) (2 slides per page – no black background, 6 slides per page – no black background)
- How to submit Spark applications (2 slides per page, 6 slides per page) (2 slides per page – no black background, 6 slides per page – no black background)
- RDD-based programs
- RDDs: creation, basic transformations and actions (2 slides per page, 6 slides per page) (2 slides per page – no black background, 6 slides per page – no black background)
- Key-value pair RDDs: transformations and actions on PairRDDs (2 slides per page, 6 slides per page) (2 slides per page – no black background, 6 slides per page – no black background)
- DoubleRDDs (2 slides per page, 6 slides per page) (2 slides per page – no black background, 6 slides per page – no black background)
- Advanced Topics: Cache, accumulators, broadcast variables (2 slides per page, 6 slides per page) (2 slides per page – no black background, 6 slides per page – no black background)
- Spark SQL, Datasets and DataFrames (2 slides per page, 6 slides per page) (2 slides per page – no black background, 6 slides per page – no black background)
- Data mining and Machine learning algorithms with Spark MLlib
- Data Mining – Recap
- Introduction (2 slides per page, 6 slides per page)
- Data and Preprocessing (2 slides per page, 6 slides per page)
- Itemset mining and Association rules (2 slides per page, 6 slides per page)
- Classification (2 slides per page, 6 slides per page)
- Clustering (2 slides per page, 6 slides per page)
- Spark MLlib
- Spark MLlib – Introduction and Classification of structured data (2 slides per page, 6 slides per page) (2 slides per page – no black background, 6 slides per page – no black background)
- Spark MLlib – Classification of textual data (2 slides per page, 6 slides per page) (2 slides per page – no black background, 6 slides per page – no black background)
- Textual data classification example code (zip)
- Spark MLlib – Classification and Parameter tuning (2 slides per page, 6 slides per page) (2 slides per page – no black background, 6 slides per page – no black background)
- Parameter tuning example code (zip)
- Spark MLlib – Clustering of structured data (2 slides per page, 6 slides per page) (2 slides per page – no black background, 6 slides per page – no black background)
- Clustering example code (zip)
- Spark MLlib – Itemset and Association rule mining (2 slides per page, 6 slides per page) (2 slides per page – no black background, 6 slides per page – no black background)
- Itemset and Association rule mining example code (zip)
- Spark MLlib – Linear regression (2 slides per page, 6 slides per page) (2 slides per page – no black background, 6 slides per page – no black background)
- Linear regression example code (zip)
- Spark Streaming (2 slides per page, 6 slides per page) (2 slides per page – no black background, 6 slides per page – no black background)
- Relational and Non-relational databases for Big data
- Introduction to relational and non-relational databases for Big data (2 slides per page, 6 slides per page) (2 slides per page – no black background, 6 slides per page – no black background)
Exercises
- MapReduce
- MapReduce exercises (2 slides per page, 6 slides per page)
- Solutions of Exercises 1-12 (Solutions1_12.zip)
- Solutions of Exercises 13-22 (Solutions13_22.zip)
- Solutions of Exercises 23-29 (Solutions23_29.zip)
- Solution of Exercise 23 – Two Jobs – Version 2: Updated version (SolExercise23TwoJobsV2Cluster.zip). The former version does not find the cached file when it is executed on the cluster.
- Basic project
- Linux and macOS
- Basic Eclipse project for MapReduce applications (based on Maven) (MapReduceBasicProject.zip)
- Windows
- Setup instructions (ConfigureWindowsEnviroment.pdf)
- Winutils executable (winutils.zip)
- Basic Eclipse project for MapReduce applications (based on Maven) (MapReduceBasicProjectWindows.zip)
- Spark
- Spark RDD-, Dataset-, DataFrame-based exercises (2 slides per page, 6 slides per page)
- Example data – One folder with a small amount of data for each exercise (ExSparkData.zip)
- Solutions of Exercises 30-46 (Solutions30_46.zip)
- Spark SQL exercises (2 slides per page, 6 slides per page)
- Example data – One folder with a small amount of data for each exercise (ExSparkSQLData.zip)
- Solutions of Exercises 47-48 (Solutions47_48.zip)
- Solutions of Exercises 49-50 (Solutions49_50.zip) – The problem specifications of these two exercises are in Spark RDD-, Dataset-, DataFrame-based exercises (2 slides per page, 6 slides per page)
- Spark streaming exercises (2 slides per page, 6 slides per page)
Practices
- No lab activities during the first two weeks
- TEAM 1: Students from A to H – Tuesday from 5.30pm to 7pm
- TEAM 2: Students from I to Z – Wednesday from 5.30pm to 7pm
- Lab1: Hadoop and MapReduce
- Online virtual lab, for online questions and answers
- Team 1: Tuesday, March 24, from 5.30pm to 7pm
- Team 2: Wednesday, March 25, from 5.30pm to 7pm
- Problem specification (pdf)
- How to import a MapReduce program and run it locally on your PC by using Eclipse + Maven (01_ImportProject_LocalRun.mp4)
- How to create a jar file and execute your application on the remote cluster BigData@Polito (02_Jar_ClusterExecution.mp4)
- Basic project and small example data set
- Linux and macOS (Lab1.zip)
- Windows (Lab1Windows.zip)
- Solution
- Bonus track: Lab1_SolBonus_1920.zip
- Lab2: Filter with Hadoop MapReduce
- Problem specification (pdf)
- Skeleton Eclipse project Hadoop – MapReduce
- Linux and macOS (Lab2_Skeleton1920.zip)
- Windows (Lab2Windows_Skeleton1920.zip)
- Solution
- Lab3: Frequently bought/reviewed together application with Hadoop MapReduce
- Problem specification (pdf)
- Skeleton Eclipse project Hadoop – MapReduce
- Linux and macOS (Lab3_Skeleton1920.zip)
- Windows (Lab3Windows_Skeleton1920.zip)
- Sample file (AmazonTransposedDataset_Sample.txt)
- Solution
- Lab3_Sol1920.zip – Three alternative solutions are provided (the solutions differ in efficiency)
- Comments on the three uploaded solutions (2 slides per page, 6 slides per page)
- Lab4: Normalized ratings for product recommendations with Hadoop MapReduce (Team 1: Tuesday, April 7 – Team 2: Wednesday, April 8)
- Problem specification (pdf)
- Sample dataset (ReviewsSample.csv)
- Skeleton Eclipse project Hadoop – MapReduce
- Linux and macOS (Lab4_Skeleton1920.zip)
- Windows (Lab4Windows_Skeleton1920.zip)
- Solution
- Lab5: Filter data and compute basic statistics with Apache Spark (Team 1: Tuesday, April 21 – Team 2: Wednesday, April 22)
- Problem specification (pdf)
- Sample file (SampleLocalFile.csv)
- Skeleton Eclipse project Spark (Lab5BigData_Template1920.zip)
- Solution
- Lab6: Frequently bought/reviewed together application with Apache Spark (Team 1: Tuesday, April 28 – Team 2: Wednesday, April 29)
- Problem specification (pdf)
- Sample file (ReviewsSample.csv)
- Skeleton Eclipse project Spark (Lab6BigData_Template1920.zip)
- Solution
- Lab7: Bike sharing data analysis (Team 1 and Team 2: Tuesday, May 5)
- Problem specification (pdf)
- Sample data (zip)
- Skeleton Eclipse project Spark (Lab7BigData_Template1920.zip)
- Example KML file (zip)
- Another KML viewer that can be used to display the result of your analysis on a map: http://kmlviewer.nsspot.net
- Solution
- Lab8: Bike sharing data analysis based on Spark SQL (Team 1: Tuesday, May 19 – Team 2: Wednesday, May 20)
- Problem specification (pdf)
- Sample data (zip)
- Skeleton Eclipse project Spark (Lab8BigData_Template1920.zip)
- Solution
- Lab9: A classification pipeline with MLlib + SparkSQL
- Problem specification (pdf)
- Sample file with 100 reviews (ReviewsSample.csv)
- Skeleton Eclipse project Spark (Lab9BigData_Template1920.zip)
- Solution
- Lab10: Tweet analysis – Spark streaming (Team 1 and Team 2: Wednesday, June 3)
- Problem specification (pdf)
- Example files – tweets (exampledata_tweets.zip)
- Skeleton Eclipse project Spark (Lab10BigData_Template1920.zip)
- Solution
Exam Examples
- At the exam, the following template will be provided for the Driver part of the Hadoop-based exercise (Hadoop template)
- For the Spark exercises, no templates are provided
- Exam example #1
- Exam example #2
- Exam July 1, 2016
- Exam July 12, 2016
- Exam September 19, 2016
- Exam June 30, 2017
- Exam July 14, 2017
- Exam September 14, 2017
- Exam January 22, 2018
- Exam June 26, 2018
- Exam July 16, 2018
- Exam September 3, 2018
- Exam February 15, 2019
- Exam July 2, 2019
- Exam July 18, 2019
- Exam July 2, 2020
- Exam (pdf)
- Draft of the solution
- Question 1: (b)
- Question 2: (a) – Note that there are two actions, and the path leading to each of them includes the filter transformation. Hence, the filter transformation is executed twice, and the value of the accumulator is 4.
- Source code/Eclipse projects (zip)
- Exam July 16, 2020
- Exam September 19, 2020
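The reasoning given for Question 2 of the July 2, 2020 exam (a transformation shared by two actions is executed once per action) can be illustrated with a minimal plain-Java sketch. It does not use the Spark API and it is not the exam's code: LazyFiltered, runTwoActions, the input values, and the AtomicInteger standing in for the accumulator are all hypothetical, chosen only so that two elements pass the filter on each pass.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Predicate;

class LazyFilterDemo {

    /** Plain-Java stand-in for an uncached, lazily evaluated filtered RDD (not the Spark API). */
    static final class LazyFiltered {
        private final List<Integer> source;
        private final Predicate<Integer> predicate;

        LazyFiltered(List<Integer> source, Predicate<Integer> predicate) {
            this.source = source;
            this.predicate = predicate;
        }

        // Nothing is stored: each call re-applies the predicate to the whole source.
        private List<Integer> compute() {
            List<Integer> out = new ArrayList<>();
            for (Integer x : source) {
                if (predicate.test(x)) {
                    out.add(x);
                }
            }
            return out;
        }

        long count() { return compute().size(); }      // first "action"
        List<Integer> collect() { return compute(); }  // second "action"
    }

    /** Runs two actions on the same filtered data and returns the final counter value. */
    static int runTwoActions() {
        AtomicInteger accumulator = new AtomicInteger(0);        // stand-in for a Spark accumulator
        List<Integer> input = Arrays.asList(10, -3, 25, -8);     // hypothetical data: two positives

        // The counter is incremented inside the filter function, once per matching element.
        LazyFiltered positives = new LazyFiltered(input, x -> {
            if (x > 0) {
                accumulator.incrementAndGet();
                return true;
            }
            return false;
        });

        positives.count();    // first pass over the filter: counter becomes 2
        positives.collect();  // second pass over the filter: counter becomes 4
        return accumulator.get();
    }

    public static void main(String[] args) {
        System.out.println(runTwoActions()); // prints 4
    }
}
```

In Spark itself, caching the filtered RDD before the first action would normally avoid the second pass, as long as the cached partitions are not evicted; the sketch above models the uncached case only.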
Additional material
- Slides and screencasts about Java (kindly provided by prof. Torchiano) (link)
- Suggested slides/lectures for those students who have never used Java
- OO Paradigm and UML (The UML part is not mandatory)
- The Java Environment
- Java Basic Features
- Java Inheritance