Distributed architectures for big data processing and analytics (2020/2021)


General information

  • ECTS: 8
  • Professor: Paolo Garza
  • Teaching assistant: Luca Colomba

Exam rules

Slides

  • Introduction to the course content and exam rules (pdf)
  • Introduction to Big Data (pdf)
  • Big Data Architectures (pdf)
  • Hadoop and MapReduce
    • Introduction to Apache Hadoop and the MapReduce programming paradigm (pdf)
      • Interaction with HDFS and Hadoop by means of the command line (pdf)
    • Hadoop implementation of MapReduce (pdf)
      • Source code of the Word Count Eclipse project (WordCount.zip) – Use the “Import Maven project” option to import it into Eclipse
      • PDF version of the code (i.e., PDF version of the Java files) (WordCountPDF.zip)
      • BigData@Polito environment + Jupyter – How to submit MapReduce jobs on BigData@Polito (pdf)
    • MapReduce – Design patterns – Part 1 (pdf)
    • MapReduce and Hadoop – Advanced Topics: Multiple inputs, Multiple outputs, Distributed cache (pdf)
    • MapReduce – Design patterns – Part 2 (pdf)
    • MapReduce – Relational Algebra/SQL operators (pdf)
  • Spark
    • Introduction to Apache Spark (pdf)
      • How to submit Spark applications (pdf)
      • How to use Jupyter notebooks for your Spark applications (pdf)
        • A useful online tutorial for those who want to install and run Spark locally on their PCs (tested on Linux) – “How to use PySpark on your computer” by Favio Vázquez (link)
    • RDD-based programs
      • RDDs: creation, basic transformations and actions (pdf)
      • Key-value RDDs: transformations and actions on key-value RDDs (pdf)
      • DoubleRDDs (pdf)
      • Advanced Topics: Cache, accumulators, broadcast variables (pdf)
      • Advanced Topics – Part II: Custom partitioners, broadcast join (pdf)
    • Spark SQL and DataFrames
    • Data mining and Machine learning algorithms with Spark
    • GraphX/GraphFrames
      • Introduction to GraphX and GraphFrames (pdf)
      • Graph Algorithms with GraphFrames (pdf)
        • Simple example – Jupyter notebook (GraphFrameExamples.zip)
          • Select the “GraphFrames (Yarn)” kernel to run it on jupyter.polito.it
          • Run “pyspark --packages graphframes:graphframes:0.8.1-spark3.0-s_2.12 --repositories https://repos.spark-packages.org” to run it locally on your PC
            • Use the package graphframes:graphframes:0.8.0-spark2.4-s_2.11 if you installed Spark 2 locally instead of Spark 3
  • Streaming data analytics

Exercises

Practices

  • No lab activities during the first week
  • TEAM 1: Students from A to L – Friday from 2:30 pm to 4:00 pm
  • TEAM 2: Students from M to Z – Friday from 4:00 pm to 5:30 pm

 

  • Lab1: Hadoop and MapReduce (Friday, March 12)
  • Lab2: Frequently bought/reviewed together application with Hadoop MapReduce (Friday, March 19)
  • Lab3: Normalized ratings for product recommendations with Hadoop MapReduce (Friday, March 26)
  • Lab4: Filter data and compute basic statistics with Apache Spark (Friday, April 9)
  • Lab5: Frequently bought/reviewed together application with Apache Spark (Friday, April 16)
  • Lab6: Bike sharing data analysis (Friday, April 23)
    • Problem specification (pdf)
    • Sample data (zip)
    • Example KML file (zip)
    • Another KML viewer that can be used to display the results of your analysis on a map: http://kmlviewer.nsspot.net
    • Solution
      • Lab6_Sol2021.zip – Jupyter notebook (Lab6_DBD2021Sol.ipynb) and Python script (Lab6_DBD2021Sol.py)
  • Lab7: Bike sharing data analysis based on Spark SQL (Friday, April 30 – 14:30-16:00)
    • Problem specification (pdf)
    • Sample data (zip)
    • Solution
  • Lab8: A classification pipeline with MLlib + SparkSQL (Friday, May 7 – 14:30-16:00)
  • Lab9: GraphFrame (Friday, May 14 – 14:30-16:00)
    • Problem specification (pdf)

Exam Examples

Additional material

  • Slides and screencasts about Java (kindly provided by Prof. Torchiano) (link)
    • Suggested slides/lectures for students who have never used Java
      • OO Paradigm and UML (the UML part is not needed)
      • The Java Environment
      • Java Basic Features
      • Java Inheritance
  • Slides about the relational model and the SQL language (link)
    • Suggested parts
      • Relational data model
      • SQL language:
        • Basics
        • The SELECT statement: basics
        • Nested queries
        • Set operators