Big data: architectures and data analytics (2016/2017)




General information

Exam rules

  • Exam rules Academic Year 2016-2017 (pdf)

Announcements

  • (23/7/2017) Exam July 14, 2017
    • The results of the exam of Big Data (July 14, 2017) are available on the “Portale della Didattica – valutazioni provvisorie”
    • Exam papers will be discussed on Tuesday July 25, 2017 at 11:00 (Room 6D).
    • All exam grades will be recorded
    • Students who received a grade >= 18 on the written exam and want to reject it must send an e-mail to Paolo Garza by Wednesday, July 26, 2017 at 19:00
  • (16/07/2017) The text and the solution of the exam held on July 14, 2017 are available in the “Exam examples” section

Materials

Exercises

Exam Examples

  • Exam example #1
  • Exam example #2
  • Exam July 1, 2016
  • Exam July 12, 2016
    • Text (pdf)
    • Solution – Draft (zip) – As discussed during the last lecture, Exercise 2.B can also be solved by simply counting, for each station, the number of lines of the historical data stored in StationsOccupancy.txt that report fewer than 3 free slots. If that count is 0, the station is a well-sized station. Otherwise, it is not.
  • Exam September 19, 2016
  • Exam June 30, 2017
    • Text – version #1 (pdf)
      • Solution (Source code + pdf version) (zip)
    • Text – version #2 (pdf)
      • Solution (Source code + pdf version) (zip)
  • Exam July 14, 2017
    • Text – version #1 (pdf)
      • Solution (Source code + pdf version) (zip)
      • Answers to the theoretical questions:
        • Question 1: (d)
        • Question 2: (c)
    • Text – version #2 (pdf)
      • Solution (Source code + pdf version) (zip)
      • Answers to the theoretical questions:
        • Question 1: (c)
        • Question 2: (b)
  • At the exam, a template for the Driver part of the Hadoop exercise will be provided (Hadoop template)
    • For the Spark exercises, no templates are provided
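
The counting approach mentioned in the note on the July 12, 2016 exam (Exercise 2.B) can be sketched in plain Java, without Hadoop or Spark. This is only an illustration, not the official solution: the record layout is an assumption (station id as the first comma-separated field, number of free slots as the last one), and the class and method names are hypothetical.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

class WellSizedStations {

    // Counts, for each station, the lines that report fewer than `threshold` free slots.
    // ASSUMPTION: stationId is the first field and the free-slot count is the last field.
    static Map<String, Integer> countCriticalLines(List<String> lines, int threshold) {
        Map<String, Integer> counts = new TreeMap<>();
        for (String line : lines) {
            String[] fields = line.split(",");
            String stationId = fields[0].trim();
            int freeSlots = Integer.parseInt(fields[fields.length - 1].trim());
            // Add 0 for non-critical lines so that every station seen gets an entry.
            counts.merge(stationId, freeSlots < threshold ? 1 : 0, Integer::sum);
        }
        return counts;
    }

    // A station is well-sized when its count of critical lines is 0.
    static Set<String> wellSized(Map<String, Integer> counts) {
        Set<String> result = new TreeSet<>();
        for (Map.Entry<String, Integer> entry : counts.entrySet()) {
            if (entry.getValue() == 0) {
                result.add(entry.getKey());
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // Made-up sample records, only to exercise the two methods.
        List<String> sample = Arrays.asList(
                "s1,2016-05-01,10:00,5,7",
                "s1,2016-05-01,11:00,10,2",
                "s2,2016-05-01,10:00,3,9");
        System.out.println(wellSized(countCriticalLines(sample, 3)));
    }
}
```

With the made-up sample above, station s1 has one line with fewer than 3 free slots and s2 has none, so only s2 is reported. In a MapReduce or Spark solution the same logic would map each line to a (station, 0 or 1) pair, sum by key, and keep the keys whose sum is 0.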

Practices

  • TEAM A: Students from AA to LE – Tuesday from 5.30pm to 7pm
  • TEAM B: Students from LI to ZZ – Thursday from 5.30pm to 7pm
  • Schedule of the lab activities
    Lab      Team A (5.30pm to 7pm)   Team B (5.30pm to 7pm)
    Lab #1   Tuesday, March 21        Thursday, March 23
    Lab #2   Tuesday, March 28        Thursday, March 30
    Lab #3   Tuesday, April 4         Thursday, April 6
    Lab #4   Tuesday, April 11        Thursday, April 20
    Lab #5   Tuesday, May 2           Thursday, May 4
    Lab #6   Tuesday, May 9           Thursday, May 11
    Lab #7   Tuesday, May 16          Thursday, May 18
    Lab #8   Tuesday, May 23          Thursday, May 25
    Lab #9   Tuesday, May 30          Thursday, June 1
    Lab #9   Tuesday, June 6          Thursday, June 8
  • Lab1: Hadoop and MapReduce
  • Lab2: Filter with Hadoop MapReduce
  • Lab3: Frequently bought/reviewed together application with Hadoop MapReduce
  • Lab4: Normalized ratings for product recommendations with Hadoop MapReduce
  • Lab5: Filter with Apache Spark
  • Lab6: Frequently bought/reviewed together application with Apache Spark
  • Lab7: Bike sharing data analysis
  • Lab8: A classification pipeline with MLlib + SparkSQL
    • Text (pdf)
    • Skeleton Eclipse project – Spark (Lab8_Template.zip)
    • Sample file with 100 reviews (ReviewsSample.csv)
    • File with all the reviews (Reviews.zip – 110MB)
    • Solution (zip)
    • Solution provided by one of you (zip) – 90% correct predictions
    • Solution of another student (zip) – 93% correct predictions
  • Lab9: Tweet analysis – Spark streaming
    • Text (pdf)
    • Skeleton Eclipse project – Spark 2.0 (Lab9_TemplateSpark2.0.0.zip) – Use this to deploy/run your application locally at LABINF
      • Note that flatMap and flatMapToPair in Spark 2.0 differ slightly from the versions available in Spark 1.6
      • In version 2.0, the Java RDD flatMap and flatMapToPair methods require functions that return Java Iterators, whereas in Spark 1.6 the functions returned Iterables (official reference)
    • Example files – tweets (exampledata_tweets.zip)
    • Skeleton Eclipse project – Spark 1.6 (Lab9_Template.zip) – Use this to deploy/run your application on the BigData@Polito cluster
    • Solution Spark 1.6 (zip)
    • Solution Spark 2.0 (zip)
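
The flatMap difference between Spark 1.6 and Spark 2.0 noted for the Lab9 skeletons can be illustrated with a small, Spark-free sketch. The two interfaces below are hypothetical stand-ins that only mimic the return type of the function passed to flatMap in each version (an Iterable in 1.6, an Iterator in 2.0); they are not the Spark classes, and the helper methods are simplified for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

class FlatMapShapes {

    // Stand-in for the Spark 1.6 shape: the function returns an Iterable.
    interface FlatMapFunction16<T, R> {
        Iterable<R> call(T t);
    }

    // Stand-in for the Spark 2.0 shape: the function returns an Iterator.
    interface FlatMapFunction20<T, R> {
        Iterator<R> call(T t);
    }

    // Simplified flatMap over a list, consuming an Iterable per input element.
    static <T, R> List<R> flatMap16(List<T> input, FlatMapFunction16<T, R> f) {
        List<R> out = new ArrayList<>();
        for (T t : input) {
            for (R r : f.call(t)) {
                out.add(r);
            }
        }
        return out;
    }

    // Simplified flatMap over a list, consuming an Iterator per input element.
    static <T, R> List<R> flatMap20(List<T> input, FlatMapFunction20<T, R> f) {
        List<R> out = new ArrayList<>();
        for (T t : input) {
            Iterator<R> it = f.call(t);
            while (it.hasNext()) {
                out.add(it.next());
            }
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> tweets = Arrays.asList("big data", "spark streaming lab");
        // 1.6 style: return the collection itself.
        List<String> words16 = flatMap16(tweets, line -> Arrays.asList(line.split(" ")));
        // 2.0 style: same collection, but call .iterator() on it.
        List<String> words20 = flatMap20(tweets, line -> Arrays.asList(line.split(" ")).iterator());
        System.out.println(words16);
        System.out.println(words20);
    }
}
```

Both calls produce the same list of words. When porting lab code from the 1.6 skeleton to the 2.0 skeleton, the usual change is therefore to append `.iterator()` to the collection returned inside flatMap and flatMapToPair.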

Additional materials

  • Slides about the Scala language
    • Introduction (pdf)
    • Data types, variables, expressions, loops, basic console operations (pdf)
    • Scala and Functional programming (pdf)
    • Collections (pdf)
    • Scala and Object-oriented programming (pdf)
    • Exercises
  • Slides and screencasts about Java (kindly provided by prof. Torchiano) (link)
    • Suggested slides/lectures for those students who do not know Java
      • OO Paradigm and UML (the UML part is not mandatory)
      • The Java Environment
      • Java Basic Features
      • Java Inheritance
  • Virtual Machine with Hadoop and Spark
    • How to import and use the Virtual Machine on your laptop or personal workstation (howto.pdf)
    • Virtual machine image – BigData_localHadoop.ova (download from Dropbox) – Size of the file: ~7 GB
      • The virtual machine does not include HUE. Hence, use the hdfs command-line tool (for example, hdfs dfs -put and hdfs dfs -get) to upload files to and download files from HDFS.

 

  • Link to the CIKM AnalytiCup challenge (link)