Distributed architectures for big data processing and analytics (2020/2021)

General information

  • ECTS: 8
  • Professor: Paolo Garza
  • Teaching assistant: Luca Colomba

Exam rules


  • Introduction to the course content and exam rules (pdf)
  • Introduction to Big Data (pdf)
  • Big Data Architectures (pdf)
  • Hadoop and MapReduce
    • Introduction to Apache Hadoop and the MapReduce programming paradigm (pdf)
      • Interaction with HDFS and Hadoop by means of the command line (pdf)
    • Hadoop implementation of MapReduce (pdf)
      • Source code of the Word Count Eclipse project (WordCount.zip) – Use the “Import Maven project” option to import it into Eclipse
      • PDF version of the code (i.e., PDF version of the Java files) (WordCountPDF.zip)
      • BigData@Polito environment + Jupyter – How to submit MapReduce jobs on BigData@Polito (pdf)
    • MapReduce – Design patterns – Part 1 (pdf)
    • MapReduce and Hadoop – Advanced Topics: Multiple inputs, Multiple outputs, Distributed cache (pdf)
    • MapReduce – Design patterns – Part 2 (pdf)
    • MapReduce – Relational Algebra/SQL operators (pdf)
  • Spark
    • Introduction to Apache Spark (pdf)
      • How to submit Spark applications (pdf)
      • How to use Jupyter notebooks for your Spark applications (pdf)
        • A useful online tutorial for those who want to install and run Spark locally on their PCs (tested on Linux) – “How to use PySpark on your computer” by Favio Vázquez (link)
    • RDD-based programs
      • RDDs: creation, basic transformations and actions (pdf)
      • Key-value RDDs: transformations and actions on key-value RDDs (pdf)
      • DoubleRDDs (pdf)
      • Advanced Topics: Cache, accumulators, broadcast variables (pdf)
      • Advanced Topics – Part II: Custom partitioners, broadcast join (pdf)
    • Spark SQL and DataFrames
    • Data mining and Machine learning algorithms with Spark
    • GraphX/GraphFrames
      • Introduction to GraphX and GraphFrames (pdf)
      • Graph Algorithms with GraphFrames (pdf)
        • Simple example – Jupyter notebook (GraphFrameExamples.zip)
          • Select the GraphFrames (Yarn) kernel to run it on jupyter.polito.it
          • Run “pyspark --packages graphframes:graphframes:0.8.0-spark2.4-s_2.11” to run it locally on your PC
  • Streaming data analytics
    • Spark Streaming
    • Introduction to other big stream processing frameworks: Apache Storm, Apache Flink, etc. (pdf)



  • No lab activities during the first week
  • TEAM 1: Students from A to L – Friday from 2.30 pm to 4 pm
  • TEAM 2: Students from M to Z – Friday from 4 pm to 5.30 pm


Exam examples

Additional material

  • Slides and screencasts about Java (kindly provided by Prof. Torchiano) (link)
    • Suggested slides/lectures for students who have never used Java
      • OO Paradigm and UML (the UML part is not needed)
      • The Java Environment
      • Java Basic Features
      • Java Inheritance
  • Slides about the relational model and the SQL language (link)
    • Suggested parts
      • Relational data model
      • SQL language:
        • Basics
        • The SELECT statement: basics
        • Nested queries
        • Set operators