Big Data: Architectures and Data Analytics (2020/2021)
Table of contents
- General information
- Exam rules
- Announcements
- Slides
- Exercises
- Practices
- Exam Examples
- Additional material
This is the old version of the web page of the Big data course.
Web page of the academic year 2021/22: link
General information
- ECTS: 6
- Professor: Paolo Garza
- Teaching assistants:
- Luca Colomba
- Francesco Ventura
Exam rules
- Exam rules Academic Year 2020-2021 (link)
Announcements
- (24/09/2020)
- First (online) lecture: Tuesday, September 29 at 13.00 – Online virtual classroom
- (24/09/2020)
- No lab activities during the first two weeks.
- The lab activities scheduled for Monday, September 28 from 17:30 to 19:00 and Tuesday, September 29 from 8:30 to 10:00 are cancelled.
Slides
- Introduction to the course content and exam rules (slides)
- Introduction to Big Data (slides) (slides – no black background)
- Big Data Architectures (slides) (slides – no black background)
- Hadoop and MapReduce
- Introduction to Apache Hadoop and the MapReduce programming paradigm (slides) (slides – no black background)
- Interaction with HDFS and Hadoop by means of the command line (slides) (slides – no black background)
- Hadoop implementation of MapReduce (slides) (slides – no black background)
- Source code of the Word Count Eclipse project (WordCount.zip) – Use the import Maven project option to import it into Eclipse
- PDF version of the code (i.e., PDF version of the java files) (WordCountPDF.zip)
- BigData@Polito environment + Jupyter – How to submit MapReduce jobs on BigData@Polito (slides) (slides – no black background)
- MapReduce – Design patterns – Part 1 (slides) (slides – no black background)
- MapReduce and Hadoop – Advanced Topics: Multiple inputs, Multiple outputs, Distributed cache (slides) (slides – no black background)
- MapReduce – Design patterns – Part 2 (slides) (slides – no black background)
- MapReduce – Relational Algebra/SQL operators (slides) (slides – no black background)
- Spark
- Introduction to Apache Spark (slides) (slides – no black background)
- How to submit Spark applications (slides) (slides – no black background)
- RDD-based programs
- RDDs: creation, basic transformations and actions (slides) (slides – no black background)
- Key-value pair RDDs: transformations and actions on PairRDDs (slides) (slides – no black background)
- DoubleRDDs (slides) (slides – no black background)
- Advanced Topics: Cache, accumulators, broadcast variables (slides) (slides – no black background)
- Spark SQL, Datasets and DataFrames (slides) (slides – no black background)
- Data Mining – Recap
- Introduction (slides)
- Spark MLlib
- Spark MLlib – (slides) (slides – no black background)
- Spark MLlib – Classification of textual data (slides) (slides – no black background)
- Textual data classification example code (zip)
- Spark MLlib – Classification and Parameter tuning (slides) (slides – no black background)
- Parameter tuning example code (zip)
- Spark MLlib – Clustering of structured data (slides) (slides – no black background)
- Clustering example code (zip)
- Spark MLlib – Itemset and Association rule mining (slides) (slides – no black background)
- Itemset and Association rule mining example code (zip)
- Spark MLlib – Linear regression (slides) (slides – no black background)
- Linear regression example code (zip)
- Spark Streaming (slides) (slides – no black background) Last update – Dec 11, 2020 (Slides 64-69 are new. The other slides have not been changed.)
- Relational and Non-relational databases for Big data
- Introduction to relational and non-relational databases for Big data (slides) (slides – no black background)
Exercises
- MapReduce
- Basic project
- Linux and macOS
- Basic Eclipse project for MapReduce applications (based on maven) (MapReduceBasicProject.zip)
- Windows
- Setup instructions (ConfigureWindowsEnviroment.pdf)
- You must also install JDK 1.8 and select it for the imported project in Eclipse. If you already have a JDK installed but its version is newer than 1.8, you must still install JDK 1.8 as well.
- Winutils executable (winutils.zip)
- Basic Eclipse project for MapReduce applications (based on maven) (MapReduceBasicProjectWindows.zip)
- MapReduce exercises (slides) (slides – no black background)
- Solutions of Exercises 1-12 (Solutions1_12.zip)
- Solutions of Exercises 13-22 (Solutions13_22.zip)
- Solutions of Exercises 23-29 (Solutions23_29.zip) – The solution of Exercise 23 Bis has been updated (October 29, 2020)
- Spark
- Spark RDD-, Dataset-, DataFrame-based exercises (slides) (slides – no black background)
- Example data – One folder with a small amount of data for each exercise (ExSparkData.zip)
- Solutions of Exercises 30-50 (SolutionsExSpark.zip)
- Ex. 39 Bis – Comparison between two alternative solutions (slides) (slides – no black background)
- Spark streaming exercises (slides) (slides – no black background)
- Solutions of Exercises 51-53 (SolutionsSparkStreaming.zip)
Practices
- No lab activities during the first two weeks
- TEAM 1: Students from A to H – Monday from 5.30 pm to 7 pm
- TEAM 2: Students from I to Z – Tuesday from 8.30 am to 10 am
- Lab1: Hadoop and MapReduce
- Online virtual lab, for online questions and answers
- Team 1: Monday, October 12, from 5.30 pm to 7 pm
- Team 2: Tuesday, October 13, from 8.30 am to 10 am
- Problem specification (pdf)
- How to import a MapReduce program and run it locally on your PC by using Eclipse + Maven (01_ImportProject_LocalRun.mp4)
- How to create a jar file and execute your application on the remote cluster BigData@Polito (02_Jar_ClusterExecution.mp4)
- Basic project and small example data set
- Linux and macOS (Lab1.zip)
- Windows (Lab1Windows.zip)
- Bigger data set: finefoods_text.txt (zip)
- Solution
- Bonus track: Lab1_SolBonus_1920.zip
- Lab2: Filter with Hadoop MapReduce
- Problem specification (pdf)
- Skeleton Eclipse project Hadoop – MapReduce
- Linux and macOS (Lab2_Skeleton1920.zip)
- Windows (Lab2Windows_Skeleton1920.zip)
- Outputs of the first lab (OutputFolderLab1.zip) (OutputFolderLab1BonusTrack.zip). You can use them to test your application locally on your PC
- Solution
- Lab3: Frequently bought/reviewed together application with Hadoop MapReduce
- Problem specification (pdf)
- Skeleton Eclipse project Hadoop – MapReduce
- Linux and macOS (Lab3_Skeleton1920.zip)
- Windows (Lab3Windows_Skeleton1920.zip)
- Input file (AmazonTransposedDataset_Sample.txt)
- Expected output/result (part-r-00000)
- Solution
- Lab3_Sol1920.zip – Three alternative solutions are provided (they differ in efficiency)
- Comments on the three uploaded solutions (slides) (slides – no black background)
- Lab4: Normalized ratings for product recommendations with Hadoop MapReduce
- Problem specification (pdf)
- Sample dataset (ReviewsSample.csv)
- Skeleton Eclipse project Hadoop – MapReduce
- Linux and macOS (Lab4_Skeleton1920.zip)
- Windows (Lab4Windows_Skeleton1920.zip)
- Expected output (the input is the large file Reviews.csv) (resLab4.txt)
- Solution
- Lab5: Filter data and compute basic statistics with Apache Spark
- Problem specification (pdf)
- Sample file (SampleLocalFile.csv)
- Skeleton Eclipse project Spark (Lab5BigData_Template1920.zip)
- Solution
- Lab6: Frequently bought/reviewed together application with Apache Spark
- Problem specification (pdf)
- Sample file (ReviewsSample.csv)
- Skeleton Eclipse project Spark (Lab6BigData_Template1920.zip)
- Expected output – Task 1 (the input is the large file Reviews.csv) (outputTask1Lab6.zip)
- Solution
- Lab7: Bike sharing data analysis
- Problem specification (pdf)
- Sample data (zip)
- Skeleton Eclipse project Spark (Lab7BigData_Template1920.zip)
- Example KML file (zip)
- Expected output
- Execution on sample data (sampleData/registerSample.csv and sampleData/stations.csv) and minimum criticality threshold = 0.4 (part-00000)
- Execution on complete data (/data/students/bigdata-01QYD/Lab7/register.csv and /data/students/bigdata-01QYD/Lab7/stations.csv) and minimum criticality threshold = 0.6 (part-00000)
- Another KML visualizer that you can use to display the result of your analysis on a map: http://kmlviewer.nsspot.net
- Solution
- Lab8: Bike sharing data analysis based on Spark SQL
- Problem specification (pdf)
- Sample data (zip)
- Skeleton Eclipse project Spark (Lab8BigData_Template1920.zip)
- Expected output
- Execution on sample data (sampleData/registerSample.csv and sampleData/stations.csv) and minimum criticality threshold = 0.4 (out_Lab8sample.zip)
- Execution on complete data (/data/students/bigdata-01QYD/Lab8/register.csv and /data/students/bigdata-01QYD/Lab8/stations.csv) and minimum criticality threshold = 0.6 (out_Lab8.zip)
- Solution
- Lab9: A classification pipeline with MLlib + SparkSQL
- Problem specification (pdf)
- Sample file with 100 reviews (ReviewsSample.csv)
- Skeleton Eclipse project Spark (Lab9BigData_Template1920.zip)
- Solution
- Lab10: Tweet analysis – Spark streaming
- Problem specification (pdf)
- Example files – tweets (exampledata_tweets.zip)
- Skeleton Eclipse project Spark (Lab10BigData_Template1920.zip)
- Solution
Exam Examples
Note that, starting from this academic year (2020/21), the exam is closed book.
- Exam June 26, 2018
- Exam July 16, 2018
- Exam September 3, 2018
- Exam February 15, 2019
- Exam July 2, 2019
- Exam July 18, 2019
- Exam July 2, 2020
- Exam July 16, 2020
- Exam September 17, 2020
- Exam February 5, 2021
- Exam June 30, 2021
Additional material
- Slides and screencasts about Java (kindly provided by prof. Torchiano) (link)
- Suggested slides/lectures for those students who have never used Java
- OO Paradigm and UML (The UML part is not mandatory)
- The Java Environment
- Java Basic Features
- Java Inheritance
- Data mining – Centralized algorithms