Big data: architectures and data analytics (2017/2018)


AY 2017/2018

New web page: Big data: architectures and data analytics – AY 2018/2019 (link)

General information

  • ECTS: 6
  • Professor: Paolo Garza
  • Students from AA to LZ
    • Teaching assistants:
  • Students from MA to ZZ
    • Teaching assistants:
      • Francesco Ventura
      • Andrea Pasini
      • Alessandro Farasin

Exam rules

  • Exam rules Academic Year 2017-2018 (pdf)

Announcements

  • (13/02/2019) The exam scheduled for February 15, 2019 at 8:30 will be held in Classroom 3I
    • Please remember to bring:
      • the student card and/or an identity document
      • sheets of paper
  • (10/09/2018)
    • The results of the “Big data: architectures and data analytics – July 16, 2018” exam are available on the “Teaching Portal, Exams, Provisional exam results”.
    • Exam papers will be discussed on Wednesday, September 12, 2018 at 10:00 in Sala Colloqui DAUIN (on the 4th floor of the Dipartimento di Automatica e Informatica) (map).
    • Students who received a grade of 18 or higher on the written exam and want to reject it must send an e-mail to Paolo Garza by Friday, September 14, 2018 at 14:00.
  • (31/08/2018) The exam scheduled for September 3, 2018 will be held at 16:00 in
    • Classroom 2D: Students from AA to LZ
    • Classroom 4D: Students from MA to ZZ
    • Please remember to bring:
      • the student card and/or an identity document
      • sheets of paper

Lectures: Schedule and topics

  • Students from AA to LZ
    • Schedule of the lectures with the list of covered topics (link)
  • Students from MA to ZZ
    • Schedule of the lectures with the list of covered topics (link)

Materials

Screencast

Note that each video is longer than one hour, and the integrated Dropbox player plays only the first hour of each video. To watch an entire video, use one of the following approaches:

  1. Download the videos you are interested in and watch them with a player installed on your PC.
  2. Add the videos you are interested in to your own Dropbox. The Dropbox player plays entire videos when they are stored in your own Dropbox.

Videos

  • Monday, March 12, 2018, 11:30-13:00 (mp4) (m4v)
    • Introduction to the MapReduce programming paradigm and Hadoop implementation of MapReduce – Basic structure of MapReduce programs in Hadoop
  • Friday, March 16, 2018, 8:30-11:30 (part1.mp4 – part2.mp4) (part1.m4v – part2.m4v)
    • MapReduce programming paradigm and Hadoop; how to run an application on the Hadoop cluster
    • Exercises 1, 2, 3, 5
    • Combiner, personalized data types
  • Monday, March 19, 2018, 11:30-13:00 (mp4) (m4v)
    • Combiner, personalized data types
    • Exercises 5 and 6 (without and with combiner)
  • Friday, March 23, 2018, 8:30-11:30 (part1.mp4 – part2.mp4) (part1.m4v – part2.m4v)
    • Personalized properties
    • Personalized counters
    • Map-only jobs
    • In-mapper combiners
    • Exercises 9, 8, 12, 10, 14, 15
  • Friday, April 6, 2018, 8:30-11:30 (part1.mp4 – part2.mp4) (part1.m4v – part2.m4v)
    • Design patterns – Part 1
    • Multiple Inputs and Multiple Outputs
    • Distributed cache
    • Exercises 4, 13, 17, 22, 23
  • Monday, April 9, 2018, 11:30-13:00 (mp4) (m4v)
    • Design patterns – Part 2
    • Exercises 23, 26
  • Friday, April 13, 2018, 8:30-11:30 (part1.mp4 – part2.mp4) (part1.m4v – part2.m4v)
    • MapReduce – Relational Algebra/SQL operators
    • Exercises 27 and 29
    • Introduction to Spark
    • Introduction to Spark – Part 2
  • Monday, April 16, 2018, 11:30-13:00 (mp4) (m4v)
    • RDD-based programs (RDDs creation and basic transformations) – Part 1 (Slides: 1-78)
  • Friday, April 20, 2018, 8:30-11:30 (part1.mp4 – part2.mp4) (part1.m4v – part2.m4v)
    • Discussion of three possible MapReduce solutions for Lab #3
    • RDD-based programs (RDDs creation and basic transformations) – Part 1 (Slides: 79-end)
    • RDD-based programs (RDDs basic actions) – Part 2 (Slides: 1-53)
    • Spark-submit
    • Exercise #30
  • Monday, April 23, 2018, 11:30-13:00 (mp4) (m4v)
    • RDD-based programs (RDDs basic actions) – Part 2 (Slides: 54-end)
    • Exercises 31, 32, 33, 34
  • Friday, April 27, 2018, 8:30-11:30 (part1.mp4 – part2.mp4) (part1.m4v – part2.m4v)
    • RDD-based programs (key-value pair RDDs and transformations on PairRDDs) – Part 3
    • RDD-based programs (Set transformations and actions on PairRDDs) – Part 4 (Slides: 1-27)
    • Exercises 35, 36, 37, 38 and 39
  • Friday, May 4, 2018, 8:30-11:30 (part1.mp4 – part2.mp4) (part1.m4v – part2.m4v)
    • RDD-based programs (Set transformations and actions on PairRDDs) – Part 4 (Slides: 28-end)
    • RDD-based programs (DoubleRDDs)
    • RDD-based programs (Cache, accumulators, broadcast variables)
    • Exercises 40, 41, 43 (Discussion of a possible solution for the first two parts)
  • Monday, May 7, 2018, 11:30-13:00 (mp4) (m4v)
    • Exercise 43 (parts 1, 2, 3)
    • Datasets, DataFrames and Spark SQL (Slides: 1-36)
  • Friday, May 11, 2018, 8:30-11:30 (part1.mp4 – part2.mp4) (part1.m4v – part2.m4v)
    • Datasets, DataFrames and Spark SQL (Slides: 37-end)
    • Spark SQL example
    • Exercise 32 (solved using the Spark SQL component)
  • Monday, May 14, 2018, 11:30-13:00 (mp4) (m4v)
    • Exercises 33, 36, 38 (solved using the Spark SQL component)
    • Data Mining – Recap – Introduction
  • Friday, May 18, 2018, 8:30-11:30 (part1.mp4 – part2.mp4) (part1.m4v – part2.m4v)
    • Spark SQL: User Defined Functions (UDFs)
    • Exercises 49, 50
    • Spark MLlib – Introduction and Classification of structured data
    • Spark MLlib – Classification of textual data
    • Spark MLlib – Parameter tuning
  • Monday, May 21, 2018, 11:30-13:00 (mp4) (m4v)
    • Spark MLlib – Clustering of structured data
    • Spark MLlib – Itemset and Association rule mining
    • Spark MLlib – Linear regression
    • Spark Streaming (Slides 1-10)
  • Friday, May 25, 2018, 8:30-11:30 (part1.mp4 – part2.mp4) (part1.m4v – part2.m4v)
    • Spark Streaming (Slides 11-end)
    • Relational and Non-relational databases for Big data
  • Monday, May 28, 2018, 11:30-13:00 (mp4) (m4v)
    • Exercises #44 and #46
  • Friday, June 1, 2018, 8:30-11:30 (part1.mp4 – part2.mp4) (part1.m4v – part2.m4v)
    • Exercises from the exams of June 30, 2017, July 14, 2017, and September 14, 2017

Exercises

Exam Examples

  • At the exam, the following template will be provided for the Driver part of the Hadoop-based exercise (Hadoop template)
    • No templates are provided for the Spark exercises
  • Exam example #1
    • Text (pdf)
    • Solution
      • Source code/Eclipse projects (zip)
  • Exam example #2
    • Text (pdf)
    • Solution
      • Source code/Eclipse projects (zip)
  • Exam July 1, 2016
    • Text (pdf)
    • Solution
      • Question 1: (d)
      • Question 2: (b)
      • Source code/Eclipse projects (zip)
  • Exam July 12, 2016
    • Text (pdf)
    • Solution
      • Question 1: (a)
      • Question 2: (a)
      • Source code/Eclipse projects (zip)
  • Exam September 19, 2016
    • Text (pdf)
    • Solution
      • Question 1: (c)
      • Question 2: (a)
      • Source code/Eclipse projects (zip)
  • Exam June 30, 2017
    • Text (pdf)
    • Solution
      • Question 1: (b)
      • Question 2: (c)
      • Source code/Eclipse projects (zip)
  • Exam July 14, 2017
    • Text (pdf)
    • Solution
      • Question 1: (d)
      • Question 2: (c)
      • Source code/Eclipse projects (zip)
  • Exam September 14, 2017
    • Text (pdf)
    • Solution
      • Question 1: (a)
      • Question 2: (b)
      • Source code/Eclipse projects (zip)
  • Exam January 22, 2018 – NEW: UPLOADED ON June 12, 2018
    • Text (pdf)
    • Solution
      • Question 1: (b)
      • Question 2: (b)
      • Source code/Eclipse projects (zip)
  • Exam June 26, 2018
    • Text Version #1 (pdf)
      • Draft of the solution
        • Question 1: (c)
        • Question 2: (c)
        • Source code/Eclipse projects (zip)
    • Text Version #2 (pdf)
      • Draft of the solution
        • Question 1: (b)
        • Question 2: (c)
        • Source code/Eclipse projects (zip)
  • Exam July 16, 2018
    • Text Version #1 (pdf)
      • Draft of the solution
        • Question 1: (d)
        • Question 2: (a)
        • Source code/Eclipse projects (zip)
    • Text Version #2 (pdf)
      • Draft of the solution
        • Question 1: (b)
        • Question 2: (d)
        • Source code/Eclipse projects (zip)
  • Exam September 9, 2018
    • Text Version #1 (pdf)
      • Draft of the solution
        • Question 1: (d)
        • Question 2: (c)
        • Source code/Eclipse projects (zip)
    • Text Version #2 (pdf)
      • Draft of the solution
        • Question 1: (b)
        • Question 2: (c)
        • Source code/Eclipse projects (zip)

Practices

  • Schedule of the lab activities
    • Students from AA to LZ
      • TEAM 1: Students from A to C – Tuesday from 5.30pm to 7pm
      • TEAM 2: Students from D to L – Wednesday from 5.30pm to 7pm
        Lab     | Team 1 (5.30pm to 7pm) | Team 2 (5.30pm to 7pm)
        Lab #1  | Tuesday, March 20      | Wednesday, March 21
        Lab #2  | Tuesday, March 27      | Wednesday, March 28
        Lab #3  | Tuesday, April 10      | Wednesday, April 11
        Lab #4  | Tuesday, April 17      | Wednesday, April 18
        Lab #5  | Tuesday, April 24      | Wednesday, May 2
        Lab #6  | Tuesday, May 8         | Wednesday, May 9
        Lab #7  | Tuesday, May 15        | Wednesday, May 16
        Lab #8  | Tuesday, May 22        | Wednesday, May 23
        Lab #9  | Tuesday, May 29        | Wednesday, May 30
        Lab #10 | Tuesday, June 5        | Wednesday, June 6
    • Students from MA to ZZ
      • TEAM 1: Students from M to Q – Tuesday from 10am to 11.30am
      • TEAM 2: Students from R to Z – Wednesday from 1pm to 2.30pm
        Lab     | Team 1 (10am to 11.30am) | Team 2 (1pm to 2.30pm)
        Lab #1  | Tuesday, March 20        | Wednesday, March 21
        Lab #2  | Tuesday, March 27        | Wednesday, March 28
        Lab #3  | Tuesday, April 10        | Wednesday, April 11
        Lab #4  | Tuesday, April 17        | Wednesday, April 18
        Lab #5  | Tuesday, April 24        | Wednesday, May 2
        Lab #6  | Tuesday, May 8           | Wednesday, May 9
        Lab #7  | Tuesday, May 15          | Wednesday, May 16
        Lab #8  | Tuesday, May 22          | Wednesday, May 23
        Lab #9  | Tuesday, May 29          | Wednesday, May 30
        Lab #10 | Tuesday, June 5          | Wednesday, June 6

Additional materials

  • Slides and screencasts about Java (kindly provided by prof. Torchiano) (link)
    • Suggested slides/lectures for those students who do not know Java
      • OO Paradigm and UML (The UML part is not mandatory)
      • The Java Environment
      • Java Basic Features
      • Java Inheritance
  • Virtual Machine and Docker image with CDH5.12 – Hadoop
    • Cloudera Virtual Machine and Docker image: https://www.cloudera.com/downloads/quickstart_vms/5-12.html
      • Note that Java 1.7 is installed on the Cloudera Virtual Machine and the Cloudera Docker image. Hence, lambda functions and some other features that require a newer Java version cannot be used. Use the following Eclipse Project as a template if you are using the Cloudera virtual machine/Docker image and Java 1.7:
  • Slides about the Scala language – These slides are not part of the course program (no questions or exercises on these slides at the exam)
    • Introduction (pdf)
    • Data types, variables, expressions, loops, basic console operations (pdf)
    • Scala and Functional programming (pdf)
    • Collections (pdf)
    • Scala and Object-oriented programming (pdf)
    • Exercises
  • MapReduce – Hadoop internals (2 slides per page, 6 slides per page) – The “Hadoop internals” topic is not covered this year
  • Apache HIVE (2 slides per page, 6 slides per page) – The “Apache HIVE” topic is not covered this year
  • Apache Storm – These slides are not part of the course program (no questions or exercises on these slides at the exam)
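
The Java 1.7 restriction noted above for the Cloudera virtual machine and Docker image means that code written with lambda functions has to be rewritten with anonymous inner classes. The following is a minimal sketch of that rewrite, not the Eclipse template of the course: the class Java7StyleExample, the Mapper interface, and the map and wordLengths helpers are hypothetical names introduced only for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical example class (not part of the course material).
class Java7StyleExample {

    // Hypothetical single-method interface, standing in for the function
    // interfaces that a framework asks the programmer to implement.
    interface Mapper<T, R> {
        R call(T value);
    }

    // Applies the mapper to every element and collects the results.
    static <T, R> List<R> map(List<T> input, Mapper<T, R> mapper) {
        List<R> output = new ArrayList<R>();
        for (T value : input) {
            output.add(mapper.call(value));
        }
        return output;
    }

    static List<Integer> wordLengths(List<String> words) {
        // With Java 8 or later this could be written as a lambda:
        //     map(words, w -> w.length())
        // With Java 1.7 the same logic needs an anonymous inner class.
        return map(words, new Mapper<String, Integer>() {
            @Override
            public Integer call(String word) {
                return word.length();
            }
        });
    }

    public static void main(String[] args) {
        // prints [3, 4, 6]
        System.out.println(wordLengths(Arrays.asList("big", "data", "hadoop")));
    }
}
```

The anonymous inner class compiles with both Java 1.7 and newer versions, so code written this way can be moved unchanged to an environment where lambda functions are available.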