Big data: architectures and data analytics (2016/2017)



Table of contents

General information

Exam rules

  • Exam rules Academic Year 2016-2017 (pdf)

Announcements

  • (07/06/2017) Schedule of the last two weeks of the course
    • Tuesday, June 6 from 17:30 to 19:00 – Lab #9: Team A
    • Thursday, June 8 from 13:00 to 16:00 – Lecture
      • (1) Relational vs Non-relational databases for Big data
      • (2) Hive
      • (3) Exercises selected from the exams of year 2015-2016
    • Thursday, June 8 from 17:30 to 19:00 – Lab #9: Team B
    • Friday, June 9 from 16:00 to 17:30 – Final lecture
      • Exercises selected from the exams of year 2015-2016
    • The lectures of the last week (from June 12 to June 16) are cancelled.
      • On Tuesday, June 13 and Thursday, June 15, from 17:30 to 19:00, LABINF is reserved for you. You can use these slots to complete the previous lab activities. I will be at LABINF during both slots to answer questions about the previous labs and about any other topic covered in the course.
  • (19/05/2017) No lab this week. Lab #9 is postponed to next week.
    • The following laboratory/practice sessions are cancelled:
      • Tuesday, May 30 (from 17:30 to 19:00) – no lab
      • Thursday, June 1 (from 17:30 to 19:00) – no lab
  • (23/03/2017) The schedule of the lab activities has been published in the “practices” section
  • (15/03/2017) The email with the credentials for the BigData@Polito cluster has been sent (one personalized email per student).
    • Please contact Paolo Garza if you are already enrolled in the course but did not receive the email.
    • The information about how to use your account to connect to the BigData@Polito cluster will be provided during the first practice at LABINF.
  • (09/03/2017) The first lab practice will be held on Tuesday, March 21, 2017 from 17:30 to 19:00 at LABINF (map)
    • Please make sure you have an account on the LABINF PCs before the first lab practice. You can register an account at LABINF (map) every day from 2pm to 3pm. A student card and a certificate of enrollment are required.
    • Next week, I will also send you an email with your credentials for the BigData@Polito cluster (this account is not the same as the LABINF laboratory account).
    • LAB SCHEDULE. To fit the LABINF capacity, students must attend all labs according to the following schedule:
      • TEAM A: Students from AA to LE – Tuesday from 5:30pm to 7pm
      • TEAM B: Students from LI to ZZ – Thursday from 5:30pm to 7pm
  • (28/02/2017) First lecture: March 9, 2017 at 13:00
  • (28/02/2017) No lab the first two weeks of the course

Materials

Exercises

Exam Examples

  • Exam example #1
  • Exam example #2
  • Exam July 1, 2016
  • Exam July 12, 2016
    • Text (pdf)
    • Solution – Draft (zip) – As discussed during the last lecture, Exercise 2.B can also be solved by simply counting, for each station, the number of lines of the historical data stored in StationsOccupancy.txt that report fewer than 3 free slots. If that count is equal to 0, the station is a well-sized station. Otherwise, it is not a well-sized station.
  • At the exam, the following template will be provided for the Driver part of the Hadoop-based exercise (Hadoop template)
    • For the Spark exercises, no templates are provided
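
The counting approach described above for Exercise 2.B of the July 12, 2016 exam can be sketched in plain Java, without Hadoop or Spark, to make the logic explicit. This is only an illustrative sketch, not the official solution: the class name WellSizedStations, the helper methods, and above all the line layout (stationId,timestamp,freeSlots) are assumptions, since the actual format of StationsOccupancy.txt is defined in the exam text.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class WellSizedStations {

    // Threshold from the exercise description: "fewer than 3 free slots".
    static final int MIN_FREE_SLOTS = 3;

    // Counts, for each station, the lines reporting fewer than MIN_FREE_SLOTS free slots.
    // ASSUMPTION: each line has the hypothetical layout "stationId,timestamp,freeSlots".
    static Map<String, Integer> countCriticalReadings(List<String> lines) {
        Map<String, Integer> counts = new TreeMap<>();
        for (String line : lines) {
            String[] fields = line.split(",");
            String stationId = fields[0].trim();
            int freeSlots = Integer.parseInt(fields[2].trim());
            // Adding 0 for non-critical lines keeps every station seen in the map.
            counts.merge(stationId, freeSlots < MIN_FREE_SLOTS ? 1 : 0, Integer::sum);
        }
        return counts;
    }

    // A station is well-sized if it appears in the data and its critical count is 0.
    static boolean isWellSized(Map<String, Integer> counts, String stationId) {
        Integer count = counts.get(stationId);
        return count != null && count == 0;
    }

    public static void main(String[] args) {
        // Made-up sample readings, for illustration only.
        List<String> sample = Arrays.asList(
            "s1,2016-05-01 10:00,5",
            "s1,2016-05-01 11:00,3",
            "s2,2016-05-01 10:00,2",
            "s2,2016-05-01 11:00,7");
        Map<String, Integer> counts = countCriticalReadings(sample);
        for (String id : counts.keySet()) {
            System.out.println(id + " well-sized: " + isWellSized(counts, id));
        }
    }
}
```

In a MapReduce or Spark version, the per-line 0/1 value would typically be emitted as a (stationId, value) pair and summed per key, with a final filter keeping the stations whose sum is 0.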

Practices

  • TEAM A: Students from AA to LE – Tuesday from 5.30pm to 7pm
  • TEAM B: Students from LI to ZZ – Thursday from 5.30pm to 7pm
  • Schedule of the lab activities
  • Lab                  Team A                                    Team B
    Lab #1               Tuesday, March 21 – from 5.30pm to 7pm    Thursday, March 23 – from 5.30pm to 7pm
    Lab #2               Tuesday, March 28 – from 5.30pm to 7pm    Thursday, March 30 – from 5.30pm to 7pm
    Lab #3               Tuesday, April 4 – from 5.30pm to 7pm     Thursday, April 6 – from 5.30pm to 7pm
    Lab #4               Tuesday, April 11 – from 5.30pm to 7pm    Thursday, April 20 – from 5.30pm to 7pm
    Lab #5               Tuesday, May 2 – from 5.30pm to 7pm       Thursday, May 4 – from 5.30pm to 7pm
    Lab #6               Tuesday, May 9 – from 5.30pm to 7pm       Thursday, May 11 – from 5.30pm to 7pm
    Lab #7               Tuesday, May 16 – from 5.30pm to 7pm      Thursday, May 18 – from 5.30pm to 7pm
    Lab #8               Tuesday, May 23 – from 5.30pm to 7pm      Thursday, May 25 – from 5.30pm to 7pm
    Lab #9 (cancelled)   Tuesday, May 30 – from 5.30pm to 7pm      Thursday, June 1 – from 5.30pm to 7pm
    Lab #9               Tuesday, June 6 – from 5.30pm to 7pm      Thursday, June 8 – from 5.30pm to 7pm
  • Lab1: Hadoop and MapReduce
  • Lab2: Filter with Hadoop MapReduce
  • Lab3: Frequently bought/reviewed together application with Hadoop MapReduce
  • Lab4: Normalized ratings for product recommendations with Hadoop MapReduce
  • Lab5: Filter with Apache Spark
  • Lab6: Frequently bought/reviewed together application with Apache Spark
  • Lab7: Bike sharing data analysis
  • Lab8: A classification pipeline with MLlib + SparkSQL
    • Text (pdf)
    • Skeleton Eclipse project – Spark (Lab8_Template.zip)
    • Sample file with 100 reviews (ReviewsSample.csv)
    • File with all the reviews (Reviews.zip – 110MB)
    • Solution (zip)
    • Solution provided by one of you (zip) – 90% correct predictions
    • Solution of another student (zip) – 93% correct predictions
  • Lab9: Tweet analysis – Spark streaming
    • Text (pdf)
    • Skeleton Eclipse project – Spark 2.0 (Lab9_TemplateSpark2.0.0.zip) – Use this to deploy/run your application locally at LABINF
      • Note that flatMap and flatMapToPair in Spark 2.0 differ slightly from the versions available in Spark 1.6
      • In version 2.0, Java RDD’s flatMap and flatMapToPair have been updated to require functions returning Java Iterators, whereas in 1.6 they returned Java Iterables (official reference)
    • Example files – tweets (exampledata_tweets.zip)
    • Skeleton Eclipse project – Spark 1.6 (Lab9_Template.zip) – Use this to deploy/run your application on the BigData@Polito cluster
    • Solution Spark 1.6 (zip)
    • Solution Spark 2.0 (zip)

Additional materials

  • Slides about the Scala language
    • Introduction (pdf)
    • Data types, variables, expressions, loops, basic console operations (pdf)
    • Scala and Functional programming (pdf)
    • Collections (pdf)
    • Scala and Object-oriented programming (pdf)
    • Exercises
  • Slides and screencasts about Java (kindly provided by prof. Torchiano) (link)
    • Suggested slides/lectures for those students who do not know Java
      • OO Paradigm and UML (The UML part is not mandatory)
      • The Java Environment
      • Java Basic Features
      • Java Inheritance
  • Virtual Machine with Hadoop and Spark
    • How to import and use the Virtual Machine on your laptop or personal workstation (howto.pdf)
    • Virtual machine image – BigData_localHadoop.ova (download from Dropbox) – Size of the file: ~7 GB
      • The virtual machine does not include HUE. Hence, use the hdfs command-line tool to put files into and get files from the HDFS file system.

 

  • Link to the CIKM AnalytiCup challenge (link)