Table of contents
AY 2016/2017
General information
Exam rules
- Exam rules Academic Year 2016-2017 (pdf)
Announcements
- (4/2/2018) Exam January 22, 2018
- The results of the exam “Big data: architectures and data analytics – January 22, 2018” are available on the “Portale della Didattica – valutazioni provvisorie”.
- Exam papers will be discussed on Wednesday, February 7, 2018 at 16:00 (Room “Sala Colloqui DAUIN” – 4th floor of the Department of Control and Computer Engineering).
- All exam grades will be recorded.
- Students who received a grade >= 18 on the written exam and want to reject it must send an e-mail to Paolo Garza by Friday, February 9, 2018 at 18:00.
Materials
- Introduction to the course (2 slides per page, 6 slides per page)
- Introduction to Big Data (2 slides per page, 6 slides per page)
- Hadoop and MapReduce
- Introduction to Apache Hadoop and the MapReduce programming paradigm (2 slides per page, 6 slides per page)
- Hadoop implementation of MapReduce – Basic structure of MapReduce programs in Hadoop (2 slides per page, 6 slides per page)
- Source code of the Word Count Eclipse project (WordCount.zip) – Use the import option to import it into Eclipse
- PDF version of the code (i.e., PDF version of the Java files) (WordCountPDF.zip)
- Interaction with HDFS and Hadoop by means of the command line (2 slides per page, 6 slides per page)
- MapReduce programs and Hadoop – Part 2 (2 slides per page, 6 slides per page)
- MapReduce programs and Hadoop – Part 3 (2 slides per page, 6 slides per page)
- MapReduce – Design patterns – Part 1 (2 slides per page, 6 slides per page)
- MapReduce – Multiple Inputs and Multiple Outputs (2 slides per page, 6 slides per page)
- MapReduce – Distributed cache (2 slides per page, 6 slides per page)
- MapReduce – Design patterns – Part 2 (2 slides per page, 6 slides per page)
- MapReduce – Relational Algebra/SQL operators (2 slides per page, 6 slides per page)
- MapReduce – Hadoop internals (2 slides per page, 6 slides per page)
- Spark
- Introduction to Apache Spark (2 slides per page, 6 slides per page)
- Introduction to Apache Spark – Part 2 (2 slides per page, 6 slides per page)
- RDD-based programs (RDDs creation and basic transformations) – Part 1 (2 slides per page, 6 slides per page)
- Note that the slides are based on Spark 1.6.
- Note that flatMap and flatMapToPair in Spark 2.0 differ slightly from the versions available in Spark 1.6:
- In version 1.6, Java RDD’s flatMap and flatMapToPair required functions returning a Java Iterable. In Spark 2.0 they require functions returning a Java Iterator, so the functions do not need to materialize all the data (official reference)
- RDD-based programs (RDDs basic actions) – Part 2 (2 slides per page, 6 slides per page)
- How to submit a Spark application (2 slides per page, 6 slides per page)
- RDD-based programs (key-value pair RDDs) – Part 3 (2 slides per page, 6 slides per page)
- RDD-based programs (transformations on two PairRDDs and actions on PairRDDs) – Part 4 (2 slides per page, 6 slides per page)
- RDD-based programs (DoubleRDDs) – Part 5 (2 slides per page, 6 slides per page)
- RDD-based programs (Cache, accumulators, broadcast variables) – Part 6 (2 slides per page, 6 slides per page)
- Spark SQL
- Data Mining – Recap
- Spark MLlib
- Streaming data analysis
- Spark Streaming (2 slides per page, 6 slides per page)
- Word Count – Streaming version (zip)
- Word Count and Window (zip)
- Word Count – Stateful version (zip)
- Word Count – Streaming version – Read data from HDFS folder (zip)
- Word Count – Output sorted by key – Based on the transformPair() transformation (zip)
- Apache Storm
- DBMS for Big data
Exercises
- BigData@Polito and Virtual machine environments (2 slides per page, 6 slides per page)
- MapReduce Exercises – Part 1 (2 slides per page, 6 slides per page)
- MapReduce Exercises – Part 2 (2 slides per page, 6 slides per page)
- MapReduce Exercises – Part 3 (2 slides per page, 6 slides per page)
- MapReduce Exercises – Part 4 (2 slides per page, 6 slides per page)
- MapReduce Exercises – Part 5 (2 slides per page, 6 slides per page)
- MapReduce Exercises – Part 6 (2 slides per page, 6 slides per page)
- MapReduce Exercises – Part 7 (2 slides per page, 6 slides per page)
- Spark Exercises – Part 8 (2 slides per page, 6 slides per page)
- Spark Exercises – Part 9 (2 slides per page, 6 slides per page)
- Spark Exercises – Part 10 (2 slides per page, 6 slides per page)
- Spark Exercises – Part 11 (2 slides per page, 6 slides per page)
- Spark Exercises – Part 12 (2 slides per page, 6 slides per page)
- Spark Streaming Exercises – Part 13 (2 slides per page, 6 slides per page)