Big data: architectures and data analytics (2017/2018)
Table of contents
- General information
- Exam rules
- Announcements
- Lectures: Schedule and topics
- Materials
- Screencast
- Exercises
- Exam Examples
- Practices
- Additional materials
AY 2017/2018
New web page: Big data: architectures and data analytics – AY 2018/2019 (link)
General information
- ECTS: 6
- Professor: Paolo Garza
- Students from AA to LZ
- Teaching assistants:
- Daniele Apiletti
- Eliana Pastor
- Alessandro Farasin
- Andrea Pasini
- Students from MA to ZZ
- Teaching assistants:
- Francesco Ventura
- Andrea Pasini
- Alessandro Farasin
Exam rules
- Exam rules Academic Year 2017-2018 (pdf)
Announcements
- (13/02/2019) The exam scheduled for February 15, 2019 at 8:30 will be held in Classroom 3I
- Please remember to bring:
- the student card and/or an identity document
- sheets of paper
Lectures: Schedule and topics
- Students from AA to LZ
- Schedule of the lectures with the list of covered topics (link)
- Students from MA to ZZ
- Schedule of the lectures with the list of covered topics (link)
Materials
- Introduction to the course (2 slides per page, 6 slides per page)
- Introduction to Big Data (2 slides per page, 6 slides per page)
- Hadoop and MapReduce
- Introduction to Apache Hadoop and the MapReduce programming paradigm (2 slides per page, 6 slides per page)
- Hadoop implementation of MapReduce – Basic structure of MapReduce programs in Hadoop (2 slides per page, 6 slides per page)
- Source code of the Word Count Eclipse project (WordCount.zip) – Use the import option to import it into Eclipse
- PDF version of the code (i.e., PDF version of the java files) (WordCountPDF.zip)
- Interaction with HDFS and Hadoop by means of the command line (2 slides per page, 6 slides per page)
- MapReduce programs and Hadoop – Part 2 (2 slides per page, 6 slides per page)
- MapReduce programs and Hadoop – Part 3 (2 slides per page, 6 slides per page)
- MapReduce – Design patterns – Part 1 (2 slides per page, 6 slides per page)
- MapReduce – Multiple Inputs and Multiple Outputs (2 slides per page, 6 slides per page)
- MapReduce – Distributed cache
- New APIs (2 slides per page, 6 slides per page) UPLOADED on April 18, 2017
- Deprecated APIs (2 slides per page, 6 slides per page)
- MapReduce – Design patterns – Part 2 (2 slides per page, 6 slides per page)
- MapReduce – Relational Algebra/SQL operators (2 slides per page, 6 slides per page)
- Spark
- Introduction to Apache Spark (2 slides per page, 6 slides per page)
- Introduction to Apache Spark – Part 2 (2 slides per page, 6 slides per page)
- RDD-based programs (RDDs creation and basic transformations) – Part 1 (2 slides per page, 6 slides per page) UPDATED ON April 13, 2018
- RDD-based programs (RDDs basic actions) – Part 2 (2 slides per page, 6 slides per page)
- How to submit a Spark application (2 slides per page, 6 slides per page)
- RDD-based programs (key-value pair RDDs and transformations on PairRDDs) – Part 3 (2 slides per page, 6 slides per page)
- RDD-based programs (Set transformations and actions on PairRDDs) – Part 4 (2 slides per page, 6 slides per page)
- RDD-based programs (DoubleRDDs) – Part 5 (2 slides per page, 6 slides per page)
- RDD-based programs (Cache, accumulators, broadcast variables) – Part 6 (2 slides per page, 6 slides per page)
- Datasets, DataFrames and Spark SQL (2 slides per page, 6 slides per page)
- Spark SQL example – DataFrames vs Datasets vs SQL
- Problem specification (2 slides per page, 6 slides per page)
- Solution (zip)
- Spark SQL and User Defined Functions (UDFs) (2 slides per page, 6 slides per page) UPLOADED ON May 16, 2018
- These slides about Spark SQL significantly extend the ones used in the previous academic year with the following new concepts:
- Dataset
- Read and Write Dataset, DataFrame
- Dataset and map
- Aggregate functions
- GroupBy and aggregate functions
- Data mining and Machine learning algorithms with Spark MLlib
- Data Mining – Recap
- Introduction (2 slides per page, 6 slides per page)
- Data and Preprocessing (2 slides per page, 6 slides per page)
- Itemset mining and Association rules (2 slides per page, 6 slides per page)
- Classification (2 slides per page, 6 slides per page)
- Clustering (2 slides per page, 6 slides per page)
- Spark MLlib
- Spark MLlib – Introduction and Classification of structured data (2 slides per page, 6 slides per page)
- Spark MLlib – Classification of textual data (2 slides per page, 6 slides per page)
- Textual data classification example code (zip)
- Spark MLlib – Parameter tuning (2 slides per page, 6 slides per page)
- Parameter tuning example code (zip)
- Spark MLlib – Clustering of structured data (2 slides per page, 6 slides per page)
- Clustering example code (zip)
- Spark MLlib – Itemset and Association rule mining (2 slides per page, 6 slides per page)
- Itemset and Association rule mining example code (zip)
- Spark MLlib – Linear regression (2 slides per page, 6 slides per page)
- Linear regression example code (zip)
- Spark Streaming (2 slides per page, 6 slides per page) UPDATED ON May 25, 2018
- DBMS for Big data
- Relational and Non-relational databases for Big data (2 slides per page, 6 slides per page)
Screencast
Note that each video is longer than one hour, and the integrated Dropbox player plays only the first hour of each video. To watch an entire video, use one of the following approaches:
- Download the videos you are interested in and watch them with a player installed on your PC
- Add the videos you are interested in to your own Dropbox. The Dropbox player plays entire videos when they are stored in your Dropbox.
Videos
- Monday, March 12, 2018, 11:30-13:00 (mp4) (m4v)
- Introduction to the MapReduce programming paradigm and Hadoop implementation of MapReduce – Basic structure of MapReduce programs in Hadoop
- Friday, March 16, 2018, 8:30-11:30 (part1.mp4 – part2.mp4) (part1.m4v – part2.m4v)
- MapReduce programming paradigm and Hadoop and How to run an application on the Hadoop cluster
- Exercises 1, 2, 3, 5
- Combiner, personalized data types
- Monday, March 19, 2018, 11:30-13:00 (mp4) (m4v)
- Combiner, personalized data types
- Exercises 5 and 6 (without and with combiner)
- Friday, March 23, 2018, 8:30-11:30 (part1.mp4 – part2.mp4) (part1.m4v – part2.m4v)
- Personalized properties
- Personalized counters
- Map-only jobs
- In-mapper combiners
- Exercises 9, 8, 12, 10, 14, 15
- Friday, April 6, 2018, 8:30 – 11:30 (part1.mp4 – part2.mp4) (part1.m4v – part2.m4v)
- Design patterns – Part 1
- Multiple Inputs and Multiple Outputs
- Distributed cache
- Exercises 4, 13, 17, 22, 23
- Monday, April 9, 2018, 11:30-13:00 (mp4) (m4v)
- Design patterns – Part 2
- Exercises 23, 26
- Friday, April 13, 2018, 8:30 – 11:30 (part1.mp4 – part2.mp4) (part1.m4v – part2.m4v)
- MapReduce – Relational Algebra/SQL operators
- Exercises 27 and 29
- Introduction to Spark
- Introduction to Spark – Part 2
- Monday, April 16, 2018, 11:30-13:00 (mp4) (m4v)
- RDD-based programs (RDDs creation and basic transformations) – Part 1 (Slides: 1-78)
- Friday, April 20, 2018, 8:30 – 11:30 (part1.mp4 – part2.mp4) (part1.m4v – part2.m4v)
- Discussion of three possible MapReduce solutions for Lab #3
- RDD-based programs (RDDs creation and basic transformations) – Part 1 (Slides: 79-end)
- RDD-based programs (RDDs basic actions) – Part 2 (Slides: 1-53)
- Spark-submit
- Exercise #30
- Monday, April 23, 2018, 11:30-13:00 (mp4) (m4v)
- RDD-based programs (RDDs basic actions) – Part 2 (Slides: 54-end)
- Exercises 31, 32, 33, 34
- Friday, April 27, 2018, 8:30 – 11:30 (part1.mp4 – part2.mp4) (part1.m4v – part2.m4v)
- RDD-based programs (key-value pair RDDs and transformations on PairRDDs) – Part 3
- RDD-based programs (Set transformations and actions on PairRDDs) – Part 4 (Slides: 1-27)
- Exercises 35, 36, 37, 38 and 39
- Friday, May 4, 2018, 8:30 – 11:30 (part1.mp4 – part2.mp4) (part1.m4v – part2.m4v)
- RDD-based programs (Set transformations and actions on PairRDDs) – Part 4 (Slides: 28-end)
- RDD-based programs (DoubleRDDs)
- RDD-based programs (Cache, accumulators, broadcast variables)
- Exercises 40, 41, 43 (Discussion of a possible solution for the first two parts)
- Monday, May 7, 2018, 11:30-13:00 (mp4) (m4v)
- Exercise 43 (parts 1, 2, 3)
- Datasets, DataFrames and Spark SQL (Slides: 1-36)
- Friday, May 11, 2018, 8:30 – 11:30 (part1.mp4 – part2.mp4) (part1.m4v – part2.m4v)
- Datasets, DataFrames and Spark SQL (Slides: 37-end)
- Spark SQL example
- Exercise 32 (solved by using the Spark SQL component)
- Monday, May 14, 2018, 11:30-13:00 (mp4) (m4v)
- Exercises 33, 36, 38 (solved by using the Spark SQL component)
- Data Mining – Recap – Introduction
- Friday, May 18, 2018, 8:30 – 11:30 (part1.mp4 – part2.mp4) (part1.m4v – part2.m4v)
- Spark SQL: User Defined Functions (UDFs)
- Exercises 49, 50
- Spark MLlib – Introduction and Classification of structured data
- Spark MLlib – Classification of textual data
- Spark MLlib – Parameter tuning
- Monday, May 21, 2018, 11:30-13:00 (mp4) (m4v)
- Spark MLlib – Clustering of structured data
- Spark MLlib – Itemset and Association rule mining
- Spark MLlib – Linear regression
- Spark Streaming (Slides 1-10)
- Friday, May 25, 2018, 8:30 – 11:30 (part1.mp4 – part2.mp4) (part1.m4v – part2.m4v)
- Spark Streaming (Slides 11-end)
- Relational and Non-relational databases for Big data
- Monday, May 28, 2018, 11:30-13:00 (mp4) (m4v)
- Exercises #44 and #46
- Friday, June 1, 2018, 8:30 – 11:30 (part1.mp4 – part2.mp4) (part1.m4v – part2.m4v)
- Exercises from exams June 30, 2017, July 14, 2017, September 14, 2017
Exercises
- MapReduce Exercises – Part 1 (2 slides per page, 6 slides per page)
- MapReduce Exercises – Part 2 (2 slides per page, 6 slides per page)
- Solutions – Part 1 and 2
- Source code/Eclipse – maven projects (SolutionsExercisesPart1_Part2.zip)
- MapReduce Exercises – Part 3 (2 slides per page, 6 slides per page)
- Solutions – Part 3
- Source code/Eclipse – maven projects (SolutionsExercisesPart3.zip)
- MapReduce Exercises – Part 4 (2 slides per page, 6 slides per page)
- Solutions – Part 4
- Source code/Eclipse projects (SolutionsExercisesPart4.zip)
- MapReduce Exercises – Part 5 (2 slides per page, 6 slides per page)
- Solutions – Part 5
- Source code/Eclipse projects (SolutionsExercisesPart5.zip)
- MapReduce Exercises – Part 6 (2 slides per page, 6 slides per page)
- Solutions – Part 6
- Source code/Eclipse projects (SolutionsExercisesPart6.zip)
- MapReduce Exercises – Part 7 (2 slides per page, 6 slides per page)
- Solutions – Part 7
- Source code/Eclipse projects (SolutionsExercisesPart7.zip)
- Spark Exercises – Part 8 (2 slides per page, 6 slides per page)
- Simulation – Exercise #31 (2 slides per page, 6 slides per page)
- Solutions – Part 8
- Source code/Eclipse projects (SolutionsExercisesPart8.zip)
- Solutions of Exercises 32-36 based on Spark SQL (with Dataset, DataFrame, SQL-like language)
- Source code/Eclipse projects (SolutionsExercisesPart8SparkSQL.zip)
- Spark Exercises – Part 9 (2 slides per page, 6 slides per page)
- Solutions – Part 9
- Source code/Eclipse projects (SolutionsExercisesPart9.zip)
- Solutions of Exercises 37-38 based on Spark SQL (with Dataset, DataFrame, SQL-like language)
- Source code/Eclipse projects (SolutionsExercisesPart9SparkSQL.zip)
- Spark Exercises – Part 10 (2 slides per page, 6 slides per page)
- Solutions – Part 10
- Source code/Eclipse projects (SolutionsExercisesPart10.zip)
- Spark Exercises – Part 11 (2 slides per page, 6 slides per page)
- Solutions – Part 11
- Source code/Eclipse projects (SolutionsExercisesPart11.zip)
- Spark Exercises – Part 12 (2 slides per page, 6 slides per page)
- Solutions – Part 12
- Source code/Eclipse projects (SolutionsExercisesPart12.zip)
- Spark UDFs Exercises – Part 14 (2 slides per page, 6 slides per page)
- Solutions – Part 14
- Source code/Eclipse projects (SolutionsExercisesPart14.zip)
- Spark Streaming Exercises – Part 13 (2 slides per page, 6 slides per page)
- Solutions – Part 13
- Source code/Eclipse projects (SolutionsExercisesPart13.zip)
Exam Examples
- At the exam, the following template will be provided for the Driver part of the Hadoop-based exercise (Hadoop template)
- For the Spark exercises, no templates are provided
- Exam example #1
- Exam example #2
- Exam July 1, 2016
- Exam July 12, 2016
- Exam September 19, 2016
- Exam June 30, 2017
- Exam July 14, 2017
- Exam September 14, 2017
- Exam January 22, 2018 – NEW: UPLOADED ON June 12, 2018
- Exam June 26, 2018
- Exam July 16, 2018
- Exam September 9, 2018
Practices
- Schedule of the lab activities
- Students from AA to LZ
- TEAM 1: Students from A to C – Tuesday from 5.30pm to 7pm
- TEAM 2: Students from D to L – Wednesday from 5.30pm to 7pm

| Lab | Team 1 | Team 2 |
|---|---|---|
| Lab #1 | Tuesday, March 20 – from 5.30pm to 7pm | Wednesday, March 21 – from 5.30pm to 7pm |
| Lab #2 | Tuesday, March 27 – from 5.30pm to 7pm | Wednesday, March 28 – from 5.30pm to 7pm |
| Lab #3 | Tuesday, April 10 – from 5.30pm to 7pm | Wednesday, April 11 – from 5.30pm to 7pm |
| Lab #4 | Tuesday, April 17 – from 5.30pm to 7pm | Wednesday, April 18 – from 5.30pm to 7pm |
| Lab #5 | Tuesday, April 24 – from 5.30pm to 7pm | Wednesday, May 2 – from 5.30pm to 7pm |
| Lab #6 | Tuesday, May 8 – from 5.30pm to 7pm | Wednesday, May 9 – from 5.30pm to 7pm |
| Lab #7 | Tuesday, May 15 – from 5.30pm to 7pm | Wednesday, May 16 – from 5.30pm to 7pm |
| Lab #8 | Tuesday, May 22 – from 5.30pm to 7pm | Wednesday, May 23 – from 5.30pm to 7pm |
| Lab #9 | Tuesday, May 29 – from 5.30pm to 7pm | Wednesday, May 30 – from 5.30pm to 7pm |
| Lab #10 | Tuesday, June 5 – from 5.30pm to 7pm | Wednesday, June 6 – from 5.30pm to 7pm |

- Students from MA to ZZ
- TEAM 1: Students from M to Q – Tuesday from 10am to 11.30am
- TEAM 2: Students from R to Z – Wednesday from 1pm to 2.30pm

| Lab | Team 1 | Team 2 |
|---|---|---|
| Lab #1 | Tuesday, March 20 – from 10am to 11.30am | Wednesday, March 21 – from 1pm to 2.30pm |
| Lab #2 | Tuesday, March 27 – from 10am to 11.30am | Wednesday, March 28 – from 1pm to 2.30pm |
| Lab #3 | Tuesday, April 10 – from 10am to 11.30am | Wednesday, April 11 – from 1pm to 2.30pm |
| Lab #4 | Tuesday, April 17 – from 10am to 11.30am | Wednesday, April 18 – from 1pm to 2.30pm |
| Lab #5 | Tuesday, April 24 – from 10am to 11.30am | Wednesday, May 2 – from 1pm to 2.30pm |
| Lab #6 | Tuesday, May 8 – from 10am to 11.30am | Wednesday, May 9 – from 1pm to 2.30pm |
| Lab #7 | Tuesday, May 15 – from 10am to 11.30am | Wednesday, May 16 – from 1pm to 2.30pm |
| Lab #8 | Tuesday, May 22 – from 10am to 11.30am | Wednesday, May 23 – from 1pm to 2.30pm |
| Lab #9 | Tuesday, May 29 – from 10am to 11.30am | Wednesday, May 30 – from 1pm to 2.30pm |
| Lab #10 | Tuesday, June 5 – from 10am to 11.30am | Wednesday, June 6 – from 1pm to 2.30pm |

- Lab1: Hadoop and MapReduce
- BigData@Polito environment (2 slides per page, 6 slides per page)
- Text (pdf)
- Project and data (Lab1_BigData.zip)
- Solution
- Bonus track (zip)
- Lab2: Filter with Hadoop MapReduce
- Text (pdf)
- Skeleton Eclipse project Hadoop – MapReduce (Lab_Skeleton.zip)
- Solution
- Lab3: Frequently bought/reviewed together application with Hadoop MapReduce
- Text (pdf)
- Skeleton Eclipse project Hadoop – MapReduce (Lab3_Skeleton.zip)
- Solution
- Solution (zip) – Three alternative solutions are provided – The 3rd solution has been UPDATED ON April 20, 2018
- Comments on the three uploaded possible solutions (2 slides per page, 6 slides per page)
- Lab4: Normalized ratings for product recommendations with Hadoop MapReduce
- Text (pdf)
- Sample dataset (ReviewsSample.csv)
- Skeleton Eclipse project Hadoop – MapReduce (Lab4_Skeleton.zip)
- Solution (zip)
- Lab5: Filter with Apache Spark
- Text (pdf)
- SampleLocalFile.csv (SampleLocalFile.csv)
- Skeleton Eclipse project – Spark (Lab5_Template.zip)
- Solution (zip)
- Lab6: Frequently bought/reviewed together application with Apache Spark
- Text (pdf)
- Sample dataset (ReviewsSample.csv)
- Skeleton Eclipse project – Spark (Lab6_Template.zip)
- Solution (zip)
- Lab7: Bike sharing data analysis
- Text (pdf)
- Sample data (zip)
- Skeleton Eclipse project – Spark (Lab7_Template.zip)
- Example of KML file (zip)
- Another KML visualizer that can be used to visualize on a map the result of your analysis: http://kmlviewer.nsspot.net/
- Solution (zip)
- Lab8: Bike sharing data analysis based on Spark SQL
- Text (pdf)
- Sample data (zip)
- Skeleton Eclipse project – Spark (Lab8_Template.zip)
- Solution (Dataset-based.zip) (SQL-based.zip) (DataFrame-based.zip)
- Lab9: A classification pipeline with MLlib + SparkSQL
- Text (pdf)
- Skeleton Eclipse project – Spark (Lab9_Template.zip)
- Sample file with 100 reviews (ReviewsSample.csv)
- Solution
- Lab10: Tweet analysis – Spark streaming
- Text (pdf)
- Skeleton Eclipse project – (Lab10_Template.zip)
- Example files – tweets (exampledata_tweets.zip)
- Solution (zip)
Additional materials
- Slides and screencasts about Java (kindly provided by prof. Torchiano) (link)
- Suggested slides/lectures for those students who do not know Java
- OO Paradigm and UML (The UML part is not mandatory)
- The Java Environment
- Java Basic Features
- Java Inheritance
- Virtual Machine and Docker image with CDH5.12 – Hadoop
- Cloudera Virtual Machine and Docker image: https://www.cloudera.com/downloads/quickstart_vms/5-12.html
- Note that Java 1.7 is installed on the Cloudera Virtual Machine and the Cloudera Docker image. Hence, lambda expressions and other features introduced in Java 8 cannot be used. If you are using the Cloudera virtual machine/Docker image with Java 1.7, use the following Eclipse project as a template:
- Skeleton Eclipse project Hadoop – MapReduce for Java 1.7 (Lab_Skeleton.zip)
- Slides about the Scala language – These slides are not part of the course program (no questions or exercises on these slides at the exam)
- MapReduce – Hadoop internals (2 slides per page, 6 slides per page) – The “Hadoop internals” topic is not covered this year
- Apache HIVE (2 slides per page, 6 slides per page) – The “Apache HIVE” topic is not covered this year
- Apache Storm – These slides are not part of the course program (no questions or exercises on these slides at the exam)
- Introduction (2 slides per page, 6 slides per page)
- Storm Architecture (2 slides per page, 6 slides per page)
- Developing Storm applications (2 slides per page, 6 slides per page)
- Advanced topics (2 slides per page, 6 slides per page)
- Trident (2 slides per page, 6 slides per page)
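The Java 1.7 restriction mentioned for the Cloudera virtual machine and Docker image means that code relying on lambda expressions has to be written with anonymous inner classes instead. The following is only a minimal sketch in plain Java of what that rewrite can look like. The `LineFilter` interface and the `filter` and `linesContaining` helpers are hypothetical names introduced for illustration; they are not part of the course material or of the Hadoop or Spark APIs.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class AnonymousClassFilter {

    // Hypothetical single-method interface (an assumption for this sketch),
    // standing in for the interfaces a lambda would implement on Java 8+.
    interface LineFilter {
        boolean keep(String line);
    }

    // Returns the lines accepted by the filter, preserving their order.
    static List<String> filter(List<String> lines, LineFilter f) {
        List<String> kept = new ArrayList<String>();
        for (String line : lines) {
            if (f.keep(line)) {
                kept.add(line);
            }
        }
        return kept;
    }

    // Java 1.7-compatible version: an anonymous inner class replaces the
    // Java 8 lambda `line -> line.contains(word)`. The captured variable
    // must be declared final on Java 1.7.
    static List<String> linesContaining(List<String> lines, final String word) {
        return filter(lines, new LineFilter() {
            @Override
            public boolean keep(String line) {
                return line.contains(word);
            }
        });
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList("hadoop mapreduce", "spark rdd", "spark sql");
        System.out.println(linesContaining(lines, "spark")); // prints [spark rdd, spark sql]
    }
}
```

The same pattern applies wherever an API expects an object with a single method: on Java 1.7 the object is created with `new Interface() { ... }`, while on Java 8 and later the lambda form can be used.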