Big data: architectures and data analytics (2017/2018)

Table of contents

General information
Exam rules
Announcements
Lectures: Schedule and topics
Materials
Screencast
Exercises
Exam Examples
Practices
Additional materials

AY 2017/2018

New web page: Big data: architectures and data analytics – AY 2018/2019 (link)

General information

ECTS: 6
Professor: Paolo Garza
Students from AA to LZ
- Teaching assistants:
  - Daniele Apiletti
  - Eliana Pastor
  - Alessandro Farasin
  - Andrea Pasini
Students from MA to ZZ
- Teaching assistants:
  - Francesco Ventura
  - Andrea Pasini
  - Alessandro Farasin

Exam rules

Exam rules Academic Year 2017-2018 (pdf)

Announcements

(13/02/2019) The exam scheduled for February 15, 2019 at 8:30 will be held in Classroom 3I
- Please remember to bring:
  - the student card and/or an identity document
  - sheets of paper

(10/09/2018)
- The results of the “Big data: architectures and data analytics – July 16, 2018” exam are available on the “Teaching Portal, Exams, Provisional exam results”.
- Exam papers will be discussed on Wednesday, September 12, 2018 at 10:00 in Sala Colloqui DAUIN (on the 4th floor of the Dipartimento di Automatica e Informatica) (map).
- Students who received a grade >= 18 on the written exam and want to reject it must send an e-mail to Paolo Garza by Friday, September 14, 2018 at 14:00.
(31/08/2018) The exam scheduled for September 3, 2018 will be held at 16:00 in:
- Classroom 2D: Students from AA to LZ
- Classroom 4D: Students from MA to ZZ
- Please remember to bring:
  - the student card and/or an identity document
  - sheets of paper

Lectures: Schedule and topics

Students from AA to LZ
- Schedule of the lectures with the list of covered topics (link)
Students from MA to ZZ
- Schedule of the lectures with the list of covered topics (link)

Materials

Introduction to the course (2 slides per page, 6 slides per page)
Introduction to Big Data (2 slides per page, 6 slides per page)
Hadoop and MapReduce
- Introduction to Apache Hadoop and the MapReduce programming paradigm (2 slides per page, 6 slides per page)
- Hadoop implementation of MapReduce – Basic structure of MapReduce programs in Hadoop (2 slides per page, 6 slides per page)
  - Source code of the Word Count Eclipse project (WordCount.zip) – Use the import option to import it into Eclipse
  - PDF version of the code (i.e., PDF version of the Java files) (WordCountPDF.zip)
- Interaction with HDFS and Hadoop by means of the command line (2 slides per page, 6 slides per page)
- MapReduce programs and Hadoop – Part 2 (2 slides per page, 6 slides per page)
- MapReduce programs and Hadoop – Part 3 (2 slides per page, 6 slides per page)
- MapReduce – Design patterns – Part 1 (2 slides per page, 6 slides per page)
- MapReduce – Multiple Inputs and Multiple Outputs (2 slides per page, 6 slides per page)
- MapReduce – Distributed cache
  - New APIs (2 slides per page, 6 slides per page) UPLOADED on April 18, 2017
  - Deprecated APIs (2 slides per page, 6 slides per page)
- MapReduce – Design patterns – Part 2 (2 slides per page, 6 slides per page)
- MapReduce – Relational Algebra/SQL operators (2 slides per page, 6 slides per page)
Spark
- Introduction to Apache Spark (2 slides per page, 6 slides per page)
- Introduction to Apache Spark – Part 2 (2 slides per page, 6 slides per page)
- RDD-based programs (RDDs creation and basic transformations) – Part 1 (2 slides per page, 6 slides per page) UPDATED ON April 13, 2018
- RDD-based programs (RDDs basic actions) – Part 2 (2 slides per page, 6 slides per page)
- How to submit a Spark application (2 slides per page, 6 slides per page)
- RDD-based programs (key-value pair RDDs and transformations on PairRDDs) – Part 3 (2 slides per page, 6 slides per page)
- RDD-based programs (Set transformations and actions on PairRDDs) – Part 4 (2 slides per page, 6 slides per page)
- RDD-based programs (DoubleRDDs) – Part 5 (2 slides per page, 6 slides per page)
- RDD-based programs (Cache, accumulators, broadcast variables) – Part 6 (2 slides per page, 6 slides per page)
Datasets, DataFrames and Spark SQL (2 slides per page, 6 slides per page)
- Spark SQL example – DataFrames vs Datasets vs SQL
  - Problem specification (2 slides per page, 6 slides per page)
  - Solution (zip)
- Spark SQL and User Defined Functions (UDFs) (2 slides per page, 6 slides per page) UPLOADED ON May 16, 2018
- These slides about Spark SQL significantly extend the ones used in the previous academic year with the following new concepts:
  - Dataset
  - Read and Write Dataset, DataFrame
  - Dataset and map
  - Aggregate functions
  - GroupBy and aggregate functions
Data mining and Machine learning algorithms with Spark MLlib
- Data Mining – Recap
  - Introduction (2 slides per page, 6 slides per page)
  - Data and Preprocessing (2 slides per page, 6 slides per page)
  - Itemset mining and Association rules (2 slides per page, 6 slides per page)
  - Classification (2 slides per page, 6 slides per page)
  - Clustering (2 slides per page, 6 slides per page)
Spark MLlib
- Spark MLlib – Introduction and Classification of structured data (2 slides per page, 6 slides per page)
  - Logistic Regression example code (zip)
  - Decision Trees example code (zip)
  - Decision Trees and Categorical class label example code (zip)
- Spark MLlib – Classification of textual data (2 slides per page, 6 slides per page)
  - Textual data classification example code (zip)
- Spark MLlib – Parameter tuning (2 slides per page, 6 slides per page)
  - Parameter tuning example code (zip)
- Spark MLlib – Clustering of structured data (2 slides per page, 6 slides per page)
  - Clustering example code (zip)
- Spark MLlib – Itemset and Association rule mining (2 slides per page, 6 slides per page)
  - Itemset and Association rule mining example code (zip)
- Spark MLlib – Linear regression (2 slides per page, 6 slides per page)
  - Linear regression example code (zip)
Spark Streaming (2 slides per page, 6 slides per page) UPDATED ON May 25, 2018
- Word Count – Streaming version (zip)
- Word Count and Window (zip)
- Word Count – Stateful version (zip)
- Word Count – Streaming version – Read data from HDFS folder (zip)
- Word Count – Output sort by key – Based on the transformPair() transformation (zip)
DBMS for Big data
- Relational and Non-relational databases for Big data (2 slides per page, 6 slides per page)

Screencast

Note that each video is longer than 1 hour. The integrated Dropbox player plays only the first hour of each video. To watch an entire video, use one of the following approaches:

Download the videos you are interested in and play them with a player installed on your PC
Add the videos you are interested in to your own Dropbox. The Dropbox player plays entire videos if they are stored in your Dropbox.

Videos

Monday, March 12, 2018, 11:30-13:00 (mp4) (m4v)
- Introduction to the MapReduce programming paradigm and Hadoop implementation of MapReduce – Basic structure of MapReduce programs in Hadoop
Friday, March 16, 2018, 8:30-11:30 (part1.mp4 – part2.mp4) (part1.m4v – part2.m4v)
- MapReduce programming paradigm and Hadoop and How to run an application on the Hadoop cluster
- Exercises 1, 2, 3, 5
- Combiner, personalized data types
Monday, March 19, 2018, 11:30-13:00 (mp4) (m4v)
- Combiner, personalized data types
- Exercises 5 and 6 (without and with combiner)
Friday, March 23, 2018, 8:30-11:30 (part1.mp4 – part2.mp4) (part1.m4v – part2.m4v)
- Personalized properties
- Personalized counters
- Map-only jobs
- In-mapper combiners
- Exercises 9, 8, 12, 10, 14, 15
Friday, April 6, 2018, 8:30 – 11:30 (part1.mp4 – part2.mp4) (part1.m4v – part2.m4v)
- Design patterns – Part 1
- Multiple Inputs and Multiple Outputs
- Distributed cache
- Exercises 4, 13, 17, 22, 23
Monday, April 9, 2018, 11:30-13:00 (mp4) (m4v)
- Design patterns – Part 2
- Exercises 23, 26
Friday, April 13, 2018, 8:30 – 11:30 (part1.mp4 – part2.mp4) (part1.m4v – part2.m4v)
- MapReduce – Relational Algebra/SQL operators
- Exercises 27 and 29
- Introduction to Spark
- Introduction to Spark – Part 2
Monday, April 16, 2018, 11:30-13:00 (mp4) (m4v)
- RDD-based programs (RDDs creation and basic transformations) – Part 1 (Slides: 1-78)
Friday, April 20, 2018, 8:30 – 11:30 (part1.mp4 – part2.mp4) (part1.m4v – part2.m4v)
- Discussion of three possible MapReduce solutions for Lab#3
- RDD-based programs (RDDs creation and basic transformations) – Part 1 (Slides: 79-end)
- RDD-based programs (RDDs basic actions) – Part 2 (Slides: 1-53)
- Spark-submit
- Exercise #30
Monday, April 23, 2018, 11:30-13:00 (mp4) (m4v)
- RDD-based programs (RDDs basic actions) – Part 2 (Slides: 54-end)
- Exercises 31, 32, 33, 34
Friday, April 27, 2018, 8:30 – 11:30 (part1.mp4 – part2.mp4) (part1.m4v – part2.m4v)
- RDD-based programs (key-value pair RDDs and transformations on PairRDDs) – Part 3
- RDD-based programs (Set transformations and actions on PairRDDs) – Part 4 (Slides: 1-27)
- Exercises 35, 36, 37, 38 and 39
Friday, May 4, 2018, 8:30 – 11:30 (part1.mp4 – part2.mp4) (part1.m4v – part2.m4v)
- RDD-based programs (Set transformations and actions on PairRDDs) – Part 4 (Slides: 28-end)
- RDD-based programs (DoubleRDDs)
- RDD-based programs (Cache, accumulators, broadcast variables)
- Exercises 40, 41, 43 (Discussion of a possible solution for the first two parts)
Monday, May 7, 2018, 11:30-13:00 (mp4) (m4v)
- Exercise 43 (parts 1, 2, 3)
- Datasets, DataFrames and Spark SQL (Slides: 1-36)
Friday, May 11, 2018, 8:30 – 11:30 (part1.mp4 – part2.mp4) (part1.m4v – part2.m4v)
- Datasets, DataFrames and Spark SQL (Slides: 37-end)
- Spark SQL example
- Exercise 32 (solved by using the Spark SQL component)
Monday, May 14, 2018, 11:30-13:00 (mp4) (m4v)
- Exercises 33, 36, 38 (solved by using the Spark SQL component)
- Data Mining – Recap – Introduction
Friday, May 18, 2018, 8:30 – 11:30 (part1.mp4 – part2.mp4) (part1.m4v – part2.m4v)
- Spark SQL: User Defined Functions (UDFs)
- Exercises 49, 50
- Spark MLlib – Introduction and Classification of structured data
- Spark MLlib – Classification of textual data
- Spark MLlib – Parameter tuning
Monday, May 21, 2018, 11:30-13:00 (mp4) (m4v)
- Spark MLlib – Clustering of structured data
- Spark MLlib – Itemset and Association rule mining
- Spark MLlib – Linear regression
- Spark Streaming (Slides 1-10)
Friday, May 25, 2018, 8:30 – 11:30 (part1.mp4 – part2.mp4) (part1.m4v – part2.m4v)
- Spark Streaming (Slides 11-end)
- Relational and Non-relational databases for Big data
Monday, May 28, 2018, 11:30-13:00 (mp4) (m4v)
- Exercises #44 and #46
Friday, June 1, 2018, 8:30 – 11:30 (part1.mp4 – part2.mp4) (part1.m4v – part2.m4v)
- Exercises from the exams of June 30, 2017, July 14, 2017, and September 14, 2017

Exercises

MapReduce Exercises – Part 1 (2 slides per page, 6 slides per page)
MapReduce Exercises – Part 2 (2 slides per page, 6 slides per page)
- Solutions – Part 1 and 2
  - Source code/Eclipse – maven projects (SolutionsExercisesPart1_Part2.zip)
MapReduce Exercises – Part 3 (2 slides per page, 6 slides per page)
- Solutions – Part 3
  - Source code/Eclipse – maven projects (SolutionsExercisesPart3.zip)
MapReduce Exercises – Part 4 (2 slides per page, 6 slides per page)
- Solutions – Part 4
  - Source code/Eclipse projects (SolutionsExercisesPart4.zip)
MapReduce Exercises – Part 5 (2 slides per page, 6 slides per page)
- Solutions – Part 5
  - Source code/Eclipse projects (SolutionsExercisesPart5.zip)
MapReduce Exercises – Part 6 (2 slides per page, 6 slides per page)
- Solutions – Part 6
  - Source code/Eclipse projects (SolutionsExercisesPart6.zip)
MapReduce Exercises – Part 7 (2 slides per page, 6 slides per page)
- Solutions – Part 7
  - Source code/Eclipse projects (SolutionsExercisesPart7.zip)

Spark Exercises – Part 8 (2 slides per page, 6 slides per page)
- Simulation – Exercise #31 (2 slides per page, 6 slides per page)
- Solutions – Part 8
  - Source code/Eclipse projects (SolutionsExercisesPart8.zip)
- Solutions of Exercises 32-36 based on Spark SQL (with Dataset, DataFrame, SQL-like language)
  - Source code/Eclipse projects (SolutionsExercisesPart8SparkSQL.zip)
Spark Exercises – Part 9 (2 slides per page, 6 slides per page)
- Solutions – Part 9
  - Source code/Eclipse projects (SolutionsExercisesPart9.zip)
- Solutions of Exercises 37-38 based on Spark SQL (with Dataset, DataFrame, SQL-like language)
  - Source code/Eclipse projects (SolutionsExercisesPart9SparkSQL.zip)
Spark Exercises – Part 10 (2 slides per page, 6 slides per page)
- Solutions – Part 10
  - Source code/Eclipse projects (SolutionsExercisesPart10.zip)
Spark Exercises – Part 11 (2 slides per page, 6 slides per page)
- Solutions – Part 11
  - Source code/Eclipse projects (SolutionsExercisesPart11.zip)
Spark Exercises – Part 12 (2 slides per page, 6 slides per page)
- Solutions – Part 12
  - Source code/Eclipse projects (SolutionsExercisesPart12.zip)
Spark UDFs Exercises – Part 14 (2 slides per page, 6 slides per page)
- Solutions – Part 14
  - Source code/Eclipse projects (SolutionsExercisesPart14.zip)
Spark Streaming Exercises – Part 13 (2 slides per page, 6 slides per page)
- Solutions – Part 13
  - Source code/Eclipse projects (SolutionsExercisesPart13.zip)

Exam Examples

At the exam, the following template will be provided for the Driver part of the Hadoop-based exercise (Hadoop template)
- For the Spark exercises, no templates are provided

Exam example #1
- Text (pdf)
- Solution
  - Source code/Eclipse projects (zip)
Exam example #2
- Text (pdf)
- Solution
  - Source code/Eclipse projects (zip)
Exam July 1, 2016
- Text (pdf)
- Solution
  - Question 1: (d)
  - Question 2: (b)
  - Source code/Eclipse projects (zip)
Exam July 12, 2016
- Text (pdf)
- Solution
  - Question 1: (a)
  - Question 2: (a)
  - Source code/Eclipse projects (zip)
Exam September 19, 2016
- Text (pdf)
- Solution
  - Question 1: (c)
  - Question 2: (a)
  - Source code/Eclipse projects (zip)
Exam June 30, 2017
- Text (pdf)
- Solution
  - Question 1: (b)
  - Question 2: (c)
  - Source code/Eclipse projects (zip)
Exam July 14, 2017
- Text (pdf)
- Solution
  - Question 1: (d)
  - Question 2: (c)
  - Source code/Eclipse projects (zip)
Exam September 14, 2017
- Text (pdf)
- Solution
  - Question 1: (a)
  - Question 2: (b)
  - Source code/Eclipse projects (zip)
Exam January 22, 2018 – NEW: UPLOADED ON June 12, 2018
- Text (pdf)
- Solution
  - Question 1: (b)
  - Question 2: (b)
  - Source code/Eclipse projects (zip)
Exam June 26, 2018
- Text Version #1 (pdf)
  - Draft of the solution
    - Question 1: (c)
    - Question 2: (c)
    - Source code/Eclipse projects (zip)
- Text Version #2 (pdf)
  - Draft of the solution
    - Question 1: (b)
    - Question 2: (c)
    - Source code/Eclipse projects (zip)
Exam July 16, 2018
- Text Version #1 (pdf)
  - Draft of the solution
    - Question 1: (d)
    - Question 2: (a)
    - Source code/Eclipse projects (zip)
- Text Version #2 (pdf)
  - Draft of the solution
    - Question 1: (b)
    - Question 2: (d)
    - Source code/Eclipse projects (zip)
Exam September 9, 2018
- Text Version #1 (pdf)
  - Draft of the solution
    - Question 1: (d)
    - Question 2: (c)
    - Source code/Eclipse projects (zip)
- Text Version #2 (pdf)
  - Draft of the solution
    - Question 1: (b)
    - Question 2: (c)
    - Source code/Eclipse projects (zip)

Practices

Schedule of the lab activities