Big data: architectures and data analytics (2017/2018)


General information

  • ECTS: 6
  • Professor: Paolo Garza
  • Students from AA to LZ
    • Teaching assistants:
  • Students from MA to ZZ
    • Teaching assistants:
      • Francesco Ventura
      • Andrea Pasini
      • Alessandro Farasin

Exam rules

  • Exam rules Academic Year 2017-2018 (pdf)

Announcements

  • Students from AA to LZ
    • (22/3/2018) The lecture scheduled for Friday, March 23 from 16:00 to 17:30 is cancelled
    • (28/2/2018) First lecture: Thursday, March 8 – 13:00-16:00
    • (28/2/2018) No lab the first two weeks of the course
  • Students from MA to ZZ
    • (28/2/2018) First lecture: Friday, March 9 – 08:30-11:30
    • (28/2/2018) No lab the first two weeks of the course
    • (28/2/2018) The lecture scheduled for Monday, March 5 is cancelled (all lectures scheduled for Monday, March 5 are cancelled due to the Italian election day).

Lectures: Schedule and topics

  • Students from AA to LZ
    • Schedule of the lectures with the list of covered topics (link)
  • Students from MA to ZZ
    • Schedule of the lectures with the list of covered topics (link)

Materials

Screencast

Note that each video is longer than one hour. The integrated Dropbox player plays only the first hour of each video. To watch a video in full, use one of the following approaches:

  1. Download the videos you are interested in and play them with a player installed on your PC
  2. Add the videos you are interested in to your own Dropbox. The Dropbox player plays a video in full when it is stored in your Dropbox.

Videos

  • Monday, March 12, 2018, 11:30-13:00 (mp4) (m4v)
    • Introduction to the MapReduce programming paradigm and the Hadoop implementation of MapReduce – Basic structure of MapReduce programs in Hadoop
  • Friday, March 16, 2018, 8:30-11:30 (part1.mp4, part2.mp4) (part1.m4v, part2.m4v)
    • MapReduce programming paradigm and Hadoop; how to run an application on the Hadoop cluster
    • Exercises 1, 2, 3, 5
    • Combiner, personalized data types
  • Monday, March 19, 2018, 11:30-13:00 (mp4) (m4v)
    • Combiner, personalized data types
    • Exercises 5 and 6 (without and with combiner)
  • Friday, March 23, 2018, 8:30-11:30 (part1.mp4, part2.mp4) (part1.m4v, part2.m4v)
    • Personalized properties
    • Personalized counters
    • Map-only jobs
    • In-mapper combiners
    • Exercises 9, 8, 12, 10, 14, 15
  • Friday, April 6, 2018, 8:30-11:30 (part1.mp4, part2.mp4) (part1.m4v, part2.m4v)
    • Design patterns – Part 1
    • Multiple Inputs and Multiple Outputs
    • Distributed cache
    • Exercises 4, 13, 17, 22, 23
  • Monday, April 9, 2018, 11:30-13:00 (mp4) (m4v)
    • Design patterns – Part 2
    • Exercises 23, 26
  • Friday, April 13, 2018, 8:30-11:30 (part1.mp4, part2.mp4) (part1.m4v, part2.m4v)
    • MapReduce – Relational Algebra/SQL operators
    • Exercises 27 and 29
    • Introduction to Spark
    • Introduction to Spark – Part 2
  • Monday, April 16, 2018, 11:30-13:00 (mp4) (m4v)
    • RDD-based programs (RDDs creation and basic transformations) – Part 1 (Slides: 1-78)

Exercises

 

Exam Examples

  • At the exam, the following template will be provided for the Driver part of the Hadoop-based exercise (Hadoop template)
    • For the Spark exercises, no templates are provided
  • Exam example #1
  • Exam example #2
  • Exam July 1, 2016
  • Exam July 12, 2016
  • Exam September 19, 2016
  • Exam June 30, 2017
  • Exam July 14, 2017
  • Exam September 14, 2017

 

Practices

  • Schedule of the lab activities
    • Students from AA to LZ
      • TEAM 1: Students from A to C – Tuesday from 5.30pm to 7pm
      • TEAM 2: Students from D to L – Wednesday from 5.30pm to 7pm
      • Lab     Team 1                                  Team 2
        Lab #1  Tuesday, March 20 – from 5.30pm to 7pm  Wednesday, March 21 – from 5.30pm to 7pm
        Lab #2  Tuesday, March 27 – from 5.30pm to 7pm  Wednesday, March 28 – from 5.30pm to 7pm
        Lab #3  Tuesday, April 10 – from 5.30pm to 7pm  Wednesday, April 11 – from 5.30pm to 7pm
        Lab #4  Tuesday, April 17 – from 5.30pm to 7pm  Wednesday, April 18 – from 5.30pm to 7pm
        Lab #5  Tuesday, April 24 – from 5.30pm to 7pm  Wednesday, May 2 – from 5.30pm to 7pm
        Lab #6  Tuesday, May 8 – from 5.30pm to 7pm     Wednesday, May 9 – from 5.30pm to 7pm
        Lab #7  Tuesday, May 15 – from 5.30pm to 7pm    Wednesday, May 16 – from 5.30pm to 7pm
        Lab #8  Tuesday, May 22 – from 5.30pm to 7pm    Wednesday, May 23 – from 5.30pm to 7pm
        Lab #9  Tuesday, May 29 – from 5.30pm to 7pm    Wednesday, May 30 – from 5.30pm to 7pm
        Lab #10 Tuesday, June 5 – from 5.30pm to 7pm    Wednesday, June 6 – from 5.30pm to 7pm
    • Students from MA to ZZ
      • TEAM 1: Students from M to Q – Tuesday from 10am to 11.30am
      • TEAM 2: Students from R to Z – Wednesday from 1pm to 2.30pm
      • Lab     Team 1                                    Team 2
        Lab #1  Tuesday, March 20 – from 10am to 11.30am  Wednesday, March 21 – from 1pm to 2.30pm
        Lab #2  Tuesday, March 27 – from 10am to 11.30am  Wednesday, March 28 – from 1pm to 2.30pm
        Lab #3  Tuesday, April 10 – from 10am to 11.30am  Wednesday, April 11 – from 1pm to 2.30pm
        Lab #4  Tuesday, April 17 – from 10am to 11.30am  Wednesday, April 18 – from 1pm to 2.30pm
        Lab #5  Tuesday, April 24 – from 10am to 11.30am  Wednesday, May 2 – from 1pm to 2.30pm
        Lab #6  Tuesday, May 8 – from 10am to 11.30am     Wednesday, May 9 – from 1pm to 2.30pm
        Lab #7  Tuesday, May 15 – from 10am to 11.30am    Wednesday, May 16 – from 1pm to 2.30pm
        Lab #8  Tuesday, May 22 – from 10am to 11.30am    Wednesday, May 23 – from 1pm to 2.30pm
        Lab #9  Tuesday, May 29 – from 10am to 11.30am    Wednesday, May 30 – from 1pm to 2.30pm
        Lab #10 Tuesday, June 5 – from 10am to 11.30am    Wednesday, June 6 – from 1pm to 2.30pm

Additional materials

  • Slides and screencasts about Java (kindly provided by prof. Torchiano) (link)
    • Suggested slides/lectures for those students who do not know Java
      • OO Paradigm and UML (The UML part is not mandatory)
      • The Java Environment
      • Java Basic Features
      • Java Inheritance
  • Virtual Machine and Docker image with CDH5.12 – Hadoop
    • Cloudera Virtual Machine and Docker image: https://www.cloudera.com/downloads/quickstart_vms/5-12.html
      • Note that Java 1.7 is installed on the Cloudera virtual machine and the Cloudera Docker image. Hence, lambda expressions and other features introduced in Java 8 cannot be used. Use the following Eclipse project as a template if you are using the Cloudera virtual machine/Docker image and Java 1.7:
  • Slides about the Scala language
    • Introduction (pdf)
    • Data types, variables, expressions, loops, basic console operations (pdf)
    • Scala and Functional programming (pdf)
    • Collections (pdf)
    • Scala and Object-oriented programming (pdf)
    • Exercises
  • MapReduce – Hadoop internals (2 slides per page, 6 slides per page) – The “Hadoop internals” topic is not covered this year
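
The Java 1.7 restriction mentioned above for the Cloudera virtual machine and Docker image means that any code usually written with a lambda expression has to be written with an anonymous inner class instead. The following is only a minimal sketch of that rewrite and is not taken from the course material: the class Java7FilterExample, the interface LinePredicate, and the methods filter and errorLines are hypothetical names introduced for illustration. They stand in for the single-method function interfaces that frameworks typically ask the programmer to implement.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class Java7FilterExample {

    // Hypothetical single-method interface (an assumption for this sketch),
    // standing in for a framework-provided function interface.
    interface LinePredicate {
        boolean test(String line);
    }

    // Returns the lines accepted by the predicate, in their original order.
    static List<String> filter(List<String> lines, LinePredicate predicate) {
        List<String> selected = new ArrayList<String>();
        for (String line : lines) {
            if (predicate.test(line)) {
                selected.add(line);
            }
        }
        return selected;
    }

    // Selects the lines that contain the string "ERROR".
    public static List<String> errorLines(List<String> lines) {
        // With Java 8 or later this could be written as:
        //     filter(lines, line -> line.contains("ERROR"))
        // Java 1.7 has no lambda expressions, so an anonymous inner class is used.
        return filter(lines, new LinePredicate() {
            @Override
            public boolean test(String line) {
                return line.contains("ERROR");
            }
        });
    }

    public static void main(String[] args) {
        List<String> log = Arrays.asList("INFO start", "ERROR disk full", "INFO stop");
        System.out.println(errorLines(log)); // prints [ERROR disk full]
    }
}
```

The anonymous inner class is more verbose than the lambda form but behaves the same way here, so code written in this style also compiles and runs unchanged on newer Java versions.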