Table of contents
General information
- ECTS: 8
- Professor: Paolo Garza
- Teaching assistant: Luca Colomba
Exam rules
Slides
- Introduction to the course content and exam rules (pdf)
- Introduction to Big Data (pdf)
- Big Data Architectures (pdf)
- Hadoop and MapReduce
- Introduction to Apache Hadoop and the MapReduce programming paradigm (pdf)
- Interaction with HDFS and Hadoop by means of the command line (pdf)
- Hadoop implementation of MapReduce (pdf)
- Source code of the Word Count Eclipse project (WordCount.zip) – Use the "Import Maven project" option to import it into Eclipse
- PDF version of the code (i.e., PDF version of the Java files) (WordCountPDF.zip)
- BigData@Polito environment + Jupyter – How to submit MapReduce jobs on BigData@Polito (pdf)
- MapReduce – Design patterns – Part 1 (pdf)
- MapReduce and Hadoop – Advanced Topics: Multiple inputs, Multiple outputs, Distributed cache (pdf)
- MapReduce – Design patterns – Part 2 (pdf)
- MapReduce – Relational Algebra/SQL operators (pdf)
- Spark
- Introduction to Apache Spark (pdf)
- RDD-based programs
- RDDs: creation, basic transformations and actions (pdf)
- Key-value RDDs: transformations and actions on key-value RDDs (pdf)
- DoubleRDDs (pdf)
- Advanced Topics: Cache, accumulators, broadcast variables (pdf)
- Advanced Topics – Part II: Custom partitioners, broadcast join (pdf)
- Spark SQL and DataFrames
- Spark SQL (pdf)
- Spark SQL – Part II (pdf)
- Data mining and Machine learning algorithms with Spark
- MLlib
- Introduction and Preprocessing (pdf)
- Classification (pdf)
- Clustering (pdf)
- Regression (pdf)
- Itemset and Association rule mining (pdf)
- GraphX/GraphFrames
- Introduction to GraphX and GraphFrames (pdf)
- Graph Algorithms with GraphFrames (pdf)
- Simple example – Jupyter notebook (GraphFrameExamples.zip)
- Select the GraphFrames (Yarn) kernel to run it on jupyter.polito.it
- Run “pyspark --packages graphframes:graphframes:0.8.0-spark2.4-s_2.11” to run it locally on your PC
- Streaming data analytics
- Spark Streaming
- Spark Streaming (DStreams) (pdf)
- Structured Streaming (pdf)
- Introduction to other big stream processing frameworks: Apache Storm, Apache Flink, ... (pdf)
Exercises
- MapReduce
- MapReduce exercises (pdf)
- Basic project
- Linux and macOS
- Windows
- Setup instructions (ConfigureWindowsEnviroment.pdf)
- You must also install JDK 1.8 and select it for the imported project in Eclipse. If you already have a JDK installed but its version is later than 1.8, you must still install JDK 1.8 alongside it.
- Winutils executable (winutils.zip)
- Basic Eclipse project for MapReduce applications (based on Maven) (MapReduceBasicProjectWindows.zip)
- Spark
- Spark RDD- and DataFrame-based exercises (pdf)
- Spark SQL exercises (pdf)
- Spark MLlib exercises (pdf)
- GraphFrame exercises (pdf)
- Spark streaming exercises (pdf)
Practices
- No lab activities during the first week
- TEAM 1: Students from A to L – Friday from 2.30 pm to 4 pm
- TEAM 2: Students from M to Z – Friday from 4 pm to 5.30 pm
Exam Examples
- Exam Example #1 (pdf)
- Exam Example #2 (pdf)
- Exam Example #3 (pdf)
- Exam Example #4 (pdf)
- Exam Example #5 (pdf)
- Exam June 27, 2020 (pdf)
- Exam July 20, 2020 (pdf)
- Solution
- Question 1: (d)
- Question 2: (b) – Note that there are three actions and hence the input file is read three times.
- Part II: MapReduce and Spark (DBD_Exam20200720Sol.zip)
- Exam September 14, 2020 (pdf)
Additional material
- Slides and screencasts about Java (kindly provided by prof. Torchiano) (link)
- Suggested slides/lectures for those students who have never used Java
- OO Paradigm and UML (The UML part is not needed)
- The Java Environment
- Java Basic Features
- Java Inheritance
- Slides about the relational model and the SQL language (link)
- Suggested parts
- Relational data model
- SQL language:
- Basics
- The SELECT statement: basics
- Nested queries
- Set operators