{"id":11322,"date":"2025-02-20T21:54:43","date_gmt":"2025-02-20T20:54:43","guid":{"rendered":"https:\/\/dbdmg.polito.it\/dbdmg_web\/?p=11322"},"modified":"2026-03-07T17:12:24","modified_gmt":"2026-03-07T16:12:24","slug":"distributed-architectures-for-big-data-processing-and-analytics-2024-2025","status":"publish","type":"post","link":"https:\/\/dbdmg.polito.it\/dbdmg_web\/2025\/distributed-architectures-for-big-data-processing-and-analytics-2024-2025\/","title":{"rendered":"Distributed architectures for big data processing and analytics (2024\/2025)"},"content":{"rendered":"\n<h1 class=\"wp-block-heading has-accent-2-color has-text-color has-link-color eplus-wrapper wp-elements-9afb7d9cb088918f00ecd8287abc56ce\"><strong>Former version of the web page of the course<\/strong><\/h1>\n\n\n\n<p>General Information<\/p>\n\n\n\n<p class=\" eplus-wrapper\"><strong>SSD<\/strong>: ING-INF\/05<\/p>\n\n\n\n<p class=\" eplus-wrapper\"><strong>CFU<\/strong>: 8<\/p>\n\n\n\n<p class=\" eplus-wrapper\"><strong>Professor<\/strong>: Paolo Garza<\/p>\n\n\n\n<p class=\" eplus-wrapper\"><strong>Teaching Assistants<\/strong>: Simone Monaco and Claudio Savelli<\/p>\n\n\n\n<hr class=\" wp-block-separator has-css-opacity eplus-wrapper\"\/>\n\n\n\n<h2 class=\" wp-block-heading eplus-wrapper\">Teaching Material<\/h2>\n\n\n\n<h5 class=\" wp-block-heading eplus-wrapper\">Introduction<\/h5>\n\n\n<ul class=\" wp-block-list eplus-wrapper eplus-styles-uid-fe604d\">\n<li class=\" eplus-wrapper\">Introduction to the course content and exam rules (<a href=\"https:\/\/dbdmg.polito.it\/dbdmg_web\/wp-content\/uploads\/2025\/02\/00_Intro_DistributedBigData_2425.pdf\" target=\"_blank\" rel=\"noreferrer noopener\">pdf<\/a>)<\/li>\n\n\n\n<li class=\" eplus-wrapper\">Introduction to Big Data (<a href=\"https:\/\/dbdmg.polito.it\/dbdmg_web\/wp-content\/uploads\/2024\/02\/01_Intro_BigData_BigDataNB.pdf\" target=\"_blank\" rel=\"noreferrer noopener\">pdf<\/a>)<\/li>\n\n\n\n<li class=\" eplus-wrapper\">Big Data 
Architectures (<a href=\"https:\/\/dbdmg.polito.it\/dbdmg_web\/wp-content\/uploads\/2024\/02\/02_Architectures_BigDataNB.pdf\" target=\"_blank\" rel=\"noreferrer noopener\">pdf<\/a>)<\/li>\n<\/ul>\n\n\n<h5 class=\" wp-block-heading eplus-wrapper\">Hadoop and MapReduce<\/h5>\n\n\n<ul class=\" wp-block-list eplus-wrapper eplus-styles-uid-dc808d\">\n<li class=\" eplus-wrapper\">Introduction to Apache Hadoop and the MapReduce programming paradigm (<a href=\"https:\/\/dbdmg.polito.it\/dbdmg_web\/wp-content\/uploads\/2024\/02\/03_Intro_HadoopAndMapReduce_BigDataNB.pdf\" target=\"_blank\" rel=\"noreferrer noopener\">pdf<\/a>)<ul class=\" wp-block-list eplus-wrapper eplus-styles-uid-6b9945\">\n<li class=\" eplus-wrapper\">Interaction with HDFS and Hadoop using the command line (<a href=\"https:\/\/dbdmg.polito.it\/dbdmg_web\/wp-content\/uploads\/2024\/02\/03b_HDFS_Hadoop_CommandLine_BigDataNB.pdf\" target=\"_blank\" rel=\"noreferrer noopener\">pdf<\/a>)<\/li>\n<\/ul><\/li>\n\n\n\n<li class=\" eplus-wrapper\">Hadoop implementation of MapReduce (<a href=\"https:\/\/dbdmg.polito.it\/dbdmg_web\/wp-content\/uploads\/2024\/02\/04_HadoopImplementationOfMapReduceNB.pdf\" target=\"_blank\" rel=\"noreferrer noopener\">pdf<\/a>)<ul class=\" wp-block-list eplus-wrapper eplus-styles-uid-e00730\">\n<li class=\" eplus-wrapper\">Source code of the Word Count Eclipse project (<a href=\"https:\/\/dbdmg.polito.it\/dbdmg_web\/wp-content\/uploads\/2024\/02\/MapReduceBasicProject.zip\" target=\"_blank\" rel=\"noreferrer noopener\">WordCount.zip<\/a>) \u2013 Use the import Maven project option to import it into Visual Studio Code<\/li>\n\n\n\n<li class=\" eplus-wrapper\">BigData@Polito environment + Jupyter \u2013 How to submit MapReduce jobs on BigData@Polito (<a href=\"https:\/\/dbdmg.polito.it\/dbdmg_web\/wp-content\/uploads\/2024\/02\/04b_ClusterJupyter_BigDataNB.pdf\" target=\"_blank\" rel=\"noreferrer noopener\">pdf<\/a>)<\/li>\n<\/ul><\/li>\n\n\n\n<li class=\" eplus-wrapper\">MapReduce 
\u2013 Design patterns \u2013 Part 1 (<a href=\"https:\/\/dbdmg.polito.it\/dbdmg_web\/wp-content\/uploads\/2024\/02\/05_MapReduce_Patterns_Part1_BigDataNB.pdf\" target=\"_blank\" rel=\"noreferrer noopener\">pdf<\/a>)<\/li>\n\n\n\n<li class=\" eplus-wrapper\">MapReduce and Hadoop \u2013 Advanced Topics: Multiple inputs, Multiple outputs, Distributed cache (<a href=\"https:\/\/dbdmg.polito.it\/dbdmg_web\/wp-content\/uploads\/2024\/02\/06_AdvancedTopicsMapReduce_BigDataNB.pdf\" target=\"_blank\" rel=\"noreferrer noopener\">pdf<\/a>)<\/li>\n\n\n\n<li class=\" eplus-wrapper\">MapReduce \u2013 Design patterns \u2013 Part 2 (<a href=\"https:\/\/dbdmg.polito.it\/dbdmg_web\/wp-content\/uploads\/2024\/02\/07_MapReduce_Patterns_Part2_BigDataNB.pdf\" target=\"_blank\" rel=\"noreferrer noopener\">pdf<\/a>)<\/li>\n\n\n\n<li class=\" eplus-wrapper\">MapReduce \u2013 Relational Algebra\/SQL operators (<a href=\"https:\/\/dbdmg.polito.it\/dbdmg_web\/wp-content\/uploads\/2024\/02\/08_SQLOperators_BigDataNB.pdf\" target=\"_blank\" rel=\"noreferrer noopener\">pdf<\/a>)<\/li>\n<\/ul>\n\n\n<h5 class=\" wp-block-heading eplus-wrapper\">Spark<\/h5>\n\n\n\n<div class=\"wp-block-group eplus-wrapper\"><div class=\"wp-block-group__inner-container is-layout-constrained wp-block-group-is-layout-constrained\"><ul class=\" wp-block-list eplus-wrapper eplus-styles-uid-e760e7\">\n<li class=\" eplus-wrapper\">Introduction to Apache Spark (<a rel=\"noreferrer noopener\" href=\"https:\/\/dbdmg.polito.it\/dbdmg_web\/wp-content\/uploads\/2022\/03\/10_SparkIntroduction_DistributedBigDataNB.pdf\" target=\"_blank\">pdf<\/a>)<ul class=\" wp-block-list eplus-wrapper eplus-styles-uid-f14933\">\n<li class=\" eplus-wrapper\">How to submit Spark applications (<a rel=\"noreferrer noopener\" href=\"https:\/\/dbdmg.polito.it\/dbdmg_web\/wp-content\/uploads\/2022\/03\/10b_SparkSubmit_DistributedBigDataNB.pdf\" target=\"_blank\">pdf<\/a>)<\/li>\n\n\n\n<li class=\" eplus-wrapper\">How to use Jupyter Notebooks for your 
Spark applications (<a rel=\"noreferrer noopener\" href=\"https:\/\/dbdmg.polito.it\/dbdmg_web\/wp-content\/uploads\/2022\/03\/10c_JupyterNotebooks_DistributedBigDataNB.pdf\" target=\"_blank\">pdf<\/a>)<\/li>\n\n\n\n<li class=\" eplus-wrapper\">You can install PySpark and JupyterLab using\u00a0<strong>Conda\/Miniconda\/pip<\/strong>\u00a0(<a rel=\"noreferrer noopener\" href=\"https:\/\/github.com\/dbdmg\/pyspark-install\" target=\"_blank\">instructions here<\/a>)<\/li>\n<\/ul><\/li>\n\n\n\n<li class=\" eplus-wrapper\">RDD-based programs<ul class=\" wp-block-list eplus-wrapper eplus-styles-uid-034388\">\n<li class=\" eplus-wrapper\">RDDs: creation, basic transformations and actions (<a href=\"https:\/\/dbdmg.polito.it\/dbdmg_web\/wp-content\/uploads\/2022\/03\/11_SparkRDDBasedProgramming_DistributedBigDataNB.pdf\" target=\"_blank\" rel=\"noreferrer noopener\">pdf<\/a>)<ul class=\" wp-block-list eplus-wrapper eplus-styles-uid-9e39b1\">\n<li class=\" eplus-wrapper\">Some examples (partially selected from the slides): Examples &#8211; Notebook (<a href=\"https:\/\/dbdmg.polito.it\/dbdmg_web\/wp-content\/uploads\/2024\/04\/ExamplesSlides.zip\">ExamplesFromSlides.zip<\/a>)<\/li>\n<\/ul><\/li>\n\n\n\n<li class=\" eplus-wrapper\">Key-value RDDs: transformations and actions on key-value RDDs (<a href=\"https:\/\/dbdmg.polito.it\/dbdmg_web\/wp-content\/uploads\/2022\/03\/12_SparkPairRDD_DistributedBigDataNB.pdf\" target=\"_blank\" rel=\"noreferrer noopener\">pdf<\/a>)<ul class=\" wp-block-list eplus-wrapper eplus-styles-uid-aaefd6\">\n<li class=\" eplus-wrapper\">Inner join, left outer join, right outer join, full outer join, and &#8220;NOT IN&#8221; with PairRDDs: Examples &#8211; Notebook (<a href=\"https:\/\/dbdmg.polito.it\/dbdmg_web\/wp-content\/uploads\/2024\/04\/JoinsRDD.zip\" target=\"_blank\" rel=\"noreferrer noopener\">JoinsRDD.zip<\/a>)<\/li>\n<\/ul><\/li>\n\n\n\n<li class=\" eplus-wrapper\">DoubleRDDs (<a rel=\"noreferrer noopener\" 
href=\"https:\/\/dbdmg.polito.it\/dbdmg_web\/wp-content\/uploads\/2022\/03\/13_SparkDoubleRDD_DistributedBigDataNB.pdf\" target=\"_blank\">pdf<\/a>)<\/li>\n\n\n\n<li class=\" eplus-wrapper\">Advanced Topics: Cache, accumulators, broadcast variables, custom partitioners, broadcast join (<a rel=\"noreferrer noopener\" href=\"https:\/\/dbdmg.polito.it\/dbdmg_web\/wp-content\/uploads\/2022\/03\/14_SparkRDDBasedProgramming_AdvancedTopics_DistributedBigDataNB.pdf\" target=\"_blank\">pdf<\/a>)<ul class=\" wp-block-list eplus-wrapper eplus-styles-uid-fa52f4\">\n<li class=\" eplus-wrapper\">RDD partition examples (<a rel=\"noreferrer noopener\" href=\"https:\/\/dbdmg.polito.it\/dbdmg_web\/wp-content\/uploads\/2022\/03\/RDDPartitionsExamples.zip\" target=\"_blank\">RDDPartitionsExamples.zip<\/a>)<\/li>\n\n\n\n<li class=\" eplus-wrapper\">Introduction to PageRank (<a rel=\"noreferrer noopener\" href=\"https:\/\/dbdmg.polito.it\/dbdmg_web\/wp-content\/uploads\/2022\/03\/15b_SparkIntroPageRankNB.pdf\" target=\"_blank\">pdf<\/a>) \u2013 Example: PageRank \u201cnaive\u201d implementation (<a rel=\"noreferrer noopener\" href=\"https:\/\/dbdmg.polito.it\/dbdmg_web\/wp-content\/uploads\/2022\/03\/RDDPageRank.zip\" target=\"_blank\">RDDPageRank.zip<\/a>)<\/li>\n<\/ul><\/li>\n<\/ul><\/li>\n\n\n\n<li class=\" eplus-wrapper\">Spark SQL and DataFrames<ul class=\" wp-block-list eplus-wrapper eplus-styles-uid-571431\">\n<li class=\" eplus-wrapper\">Spark SQL (<a href=\"https:\/\/dbdmg.polito.it\/dbdmg_web\/wp-content\/uploads\/2022\/04\/16_SparkSQL_DistributedBigDataNB.pdf\" target=\"_blank\" rel=\"noreferrer noopener\">pdf<\/a>)<ul class=\" wp-block-list eplus-wrapper eplus-styles-uid-a28896\">\n<li class=\" eplus-wrapper\">Simple examples \u2013 Jupyter notebook (<a rel=\"noreferrer noopener\" href=\"https:\/\/dbdmg.polito.it\/dbdmg_web\/wp-content\/uploads\/2022\/04\/SparkSQLSimpleExamples.zip\" target=\"_blank\">SparkSQLSimpleExamples.zip<\/a>)<\/li>\n\n\n\n<li class=\" 
eplus-wrapper\">Spark SQL join examples \u2013 Jupyter notebook (<a rel=\"noreferrer noopener\" href=\"https:\/\/dbdmg.polito.it\/dbdmg_web\/wp-content\/uploads\/2022\/04\/ExamplesSparkSQLJoins.zip\" target=\"_blank\">ExamplesSparkSQLJoins.zip<\/a>)<\/li>\n<\/ul><\/li>\n<\/ul><\/li>\n\n\n\n<li class=\" eplus-wrapper\">Data mining and Machine learning algorithms with Spark MLlib<ul class=\" wp-block-list eplus-wrapper eplus-styles-uid-eb5d46\">\n<li class=\" eplus-wrapper\">Introduction and Preprocessing (<a href=\"https:\/\/dbdmg.polito.it\/dbdmg_web\/wp-content\/uploads\/2022\/04\/18a_SparkMLlib_DistributedBigDataNB.pdf\" target=\"_blank\" rel=\"noreferrer noopener\">pdf<\/a>)<\/li>\n\n\n\n<li class=\" eplus-wrapper\">Classification (<a rel=\"noreferrer noopener\" href=\"https:\/\/dbdmg.polito.it\/dbdmg_web\/wp-content\/uploads\/2022\/04\/18b_SparkMLlib_DistributedBigDataNB.pdf\" target=\"_blank\">pdf<\/a>)<ul class=\" wp-block-list eplus-wrapper eplus-styles-uid-7673ad\">\n<li class=\" eplus-wrapper\">Classification examples \u2013 Jupyter notebooks and sample data (<a rel=\"noreferrer noopener\" href=\"https:\/\/dbdmg.polito.it\/dbdmg_web\/wp-content\/uploads\/2022\/04\/ExampleClassificationMLlib.zip\" target=\"_blank\">ExampleClassificationMLlib.zip<\/a>)<\/li>\n<\/ul><\/li>\n\n\n\n<li class=\" eplus-wrapper\">Clustering (<a rel=\"noreferrer noopener\" href=\"https:\/\/dbdmg.polito.it\/dbdmg_web\/wp-content\/uploads\/2022\/04\/18c_SparkMLlib_DistributedBigDataNB.pdf\" target=\"_blank\">pdf<\/a>)<ul class=\" wp-block-list eplus-wrapper eplus-styles-uid-9ab7da\">\n<li class=\" eplus-wrapper\">Clustering example \u2013 Jupyter notebook and sample data (<a rel=\"noreferrer noopener\" href=\"https:\/\/dbdmg.polito.it\/dbdmg_web\/wp-content\/uploads\/2022\/04\/ExampleClusteringMLlib.zip\" target=\"_blank\">ExampleClusteringMLlib.zip<\/a>)<\/li>\n<\/ul><\/li>\n\n\n\n<li class=\" eplus-wrapper\">Regression (<a rel=\"noreferrer noopener\" 
href=\"https:\/\/dbdmg.polito.it\/dbdmg_web\/wp-content\/uploads\/2022\/04\/18d_SparkMLlib_DistributedBigDataNB.pdf\" target=\"_blank\">pdf<\/a>)<ul class=\" wp-block-list eplus-wrapper eplus-styles-uid-6f8259\">\n<li class=\" eplus-wrapper\">Regression example \u2013 Jupyter notebook and sample data (<a rel=\"noreferrer noopener\" href=\"https:\/\/dbdmg.polito.it\/dbdmg_web\/wp-content\/uploads\/2022\/04\/ExampleRegressionMLlib.zip\" target=\"_blank\">ExampleRegressionMLlib.zip<\/a>)<\/li>\n<\/ul><\/li>\n\n\n\n<li class=\" eplus-wrapper\">Itemset and Association rule mining (<a rel=\"noreferrer noopener\" href=\"https:\/\/dbdmg.polito.it\/dbdmg_web\/wp-content\/uploads\/2022\/04\/18e_SparkMLlib_DistributedBigDataNB.pdf\" target=\"_blank\">pdf<\/a>)<ul class=\" wp-block-list eplus-wrapper eplus-styles-uid-7077e6\">\n<li class=\" eplus-wrapper\">Itemset and Association rule mining example \u2013 Jupyter notebook and sample data (<a rel=\"noreferrer noopener\" href=\"https:\/\/dbdmg.polito.it\/dbdmg_web\/wp-content\/uploads\/2022\/04\/ExampleItemsetMLlib.zip\" target=\"_blank\">ExampleItemsetMLlib.zip<\/a>)<\/li>\n<\/ul><\/li>\n<\/ul><\/li>\n\n\n\n<li class=\" eplus-wrapper\">GraphX\/GraphFrames<ul class=\" wp-block-list eplus-wrapper eplus-styles-uid-f3684d\">\n<li class=\" eplus-wrapper\">Introduction to GraphX and GraphFrames (<a rel=\"noreferrer noopener\" href=\"https:\/\/dbdmg.polito.it\/dbdmg_web\/wp-content\/uploads\/2022\/04\/19_SparkGraphFrame_PartI_DistributedBigDataNB.pdf\" target=\"_blank\">pdf<\/a>)<\/li>\n\n\n\n<li class=\" eplus-wrapper\">Graph Algorithms with GraphFrames (<a rel=\"noreferrer noopener\" href=\"https:\/\/dbdmg.polito.it\/dbdmg_web\/wp-content\/uploads\/2022\/04\/20_SparkGraphFrame_Algorithms_DistributedBigDataNB.pdf\" target=\"_blank\">pdf<\/a>)<ul class=\" wp-block-list eplus-wrapper eplus-styles-uid-bd7993\">\n<li class=\" eplus-wrapper\">Simple example \u2013 Jupyter notebook (<a rel=\"noreferrer noopener\" 
href=\"https:\/\/dbdmg.polito.it\/dbdmg_web\/wp-content\/uploads\/2022\/04\/GraphFrameExamples.zip\" target=\"_blank\">GraphFrameExamples.zip<\/a>)<\/li>\n\n\n\n<li class=\" eplus-wrapper\">Select the GraphFrames (Yarn) kernel to run it on jupyter.polito.it<\/li>\n\n\n\n<li class=\" eplus-wrapper\">Run \u201cpyspark --packages graphframes:graphframes:0.8.1-spark3.0-s_2.12 --repositories https:\/\/repos.spark-packages.org\u201d to run it locally on your PC \u2013 Use package graphframes:graphframes:0.8.0-spark2.4-s_2.11 if you installed Spark 2 locally instead of Spark 3<\/li>\n<\/ul><\/li>\n<\/ul><\/li>\n\n\n\n<li class=\" eplus-wrapper\">Streaming data analytics<ul class=\" wp-block-list eplus-wrapper eplus-styles-uid-1005f5\">\n<li class=\" eplus-wrapper\">Spark Streaming (DStreams) (<a rel=\"noreferrer noopener\" href=\"https:\/\/dbdmg.polito.it\/dbdmg_web\/wp-content\/uploads\/2022\/04\/21_SparkStreaming_DistributedBigDataNB.pdf\" target=\"_blank\">pdf<\/a>)<ul class=\" wp-block-list eplus-wrapper eplus-styles-uid-362f1a\">\n<li class=\" eplus-wrapper\">Simple examples \u2013 Jupyter notebooks (<a rel=\"noreferrer noopener\" href=\"https:\/\/dbdmg.polito.it\/dbdmg_web\/wp-content\/uploads\/2022\/04\/SparkSteamingExamples.zip\" target=\"_blank\">SparkSteamingExamples.zip<\/a>)<\/li>\n<\/ul><\/li>\n\n\n\n<li class=\" eplus-wrapper\">Structured Streaming (<a rel=\"noreferrer noopener\" href=\"https:\/\/dbdmg.polito.it\/dbdmg_web\/wp-content\/uploads\/2022\/04\/22_SparkStructuredStreaming_DistributedBigDataNB.pdf\" target=\"_blank\">pdf<\/a>)<ul class=\" wp-block-list eplus-wrapper eplus-styles-uid-d24d17\">\n<li class=\" eplus-wrapper\">Simple examples \u2013 Jupyter notebooks (<a rel=\"noreferrer noopener\" href=\"https:\/\/dbdmg.polito.it\/dbdmg_web\/wp-content\/uploads\/2022\/04\/ExampleStructutedStreaming.zip\" target=\"_blank\">SparkStructutedStreamingExamples.zip<\/a>)<\/li>\n<\/ul><\/li>\n\n\n\n<li class=\" eplus-wrapper\">Introduction 
to other big stream processing frameworks: Apache Storm, Apache Flink, &#8230; (<a href=\"https:\/\/dbdmg.polito.it\/dbdmg_web\/wp-content\/uploads\/2022\/04\/23_StreamingFrameworks_DistributedBigDataNB.pdf\" target=\"_blank\" rel=\"noreferrer noopener\">pdf<\/a>) &#8211; <strong>Not covered this year<\/strong><\/li>\n<\/ul><\/li>\n<\/ul><\/div><\/div>\n\n\n\n<h2 class=\" wp-block-heading eplus-wrapper\">Exercises<\/h2>\n\n\n\n<h5 class=\" wp-block-heading eplus-wrapper\">MapReduce<\/h5>\n\n\n<ul class=\" wp-block-list eplus-wrapper eplus-styles-uid-3c5c52\">\n<li class=\" eplus-wrapper\">MapReduce Exercises (<a href=\"https:\/\/dbdmg.polito.it\/dbdmg_web\/wp-content\/uploads\/2022\/03\/01_MapReduce_Exercises_BigData_NewStyle.pdf\" target=\"_blank\" rel=\"noreferrer noopener\">slides<\/a>)<ul class=\" wp-block-list eplus-wrapper eplus-styles-uid-5e6255\">\n<li class=\" eplus-wrapper\">Solutions of Exercises 1-29 (<a href=\"https:\/\/dbdmg.polito.it\/dbdmg_web\/wp-content\/uploads\/2021\/09\/SolutionsExMapReduce.zip\" target=\"_blank\" rel=\"noreferrer noopener\">SolutionsExMapReduce.zip<\/a>)<\/li>\n<\/ul><\/li>\n<\/ul>\n\n\n<p class=\" eplus-wrapper\">To configure Visual Studio Code on your laptop, follow <a href=\"#laptop_configuration\" data-type=\"internal\" data-id=\"#laptop_configuration\">these instructions<\/a>.<\/p>\n\n\n\n<h5 class=\" wp-block-heading eplus-wrapper\">Spark<\/h5>\n\n\n<ul class=\" wp-block-list eplus-wrapper eplus-styles-uid-06dc14\">\n<li class=\" eplus-wrapper\">Spark exercises (<a rel=\"noreferrer noopener\" href=\"https:\/\/dbdmg.polito.it\/dbdmg_web\/wp-content\/uploads\/2022\/03\/02_Spark_Exercises_BigDataNB.pdf\" target=\"_blank\">pdf<\/a>)<ul class=\" wp-block-list eplus-wrapper eplus-styles-uid-aee68a\">\n<li class=\" eplus-wrapper\">Example data \u2013 One folder with (few) data for each exercise (<a rel=\"noreferrer noopener\" href=\"https:\/\/dbdmg.polito.it\/dbdmg_web\/wp-content\/uploads\/2022\/03\/ExSparkData30_46.zip\" 
target=\"_blank\">ExSparkData.zip<\/a>)<\/li>\n\n\n\n<li class=\" eplus-wrapper\">RDD-based solutions of Exercises 30-46 \u2013 Jupyter notebooks (<a href=\"https:\/\/dbdmg.polito.it\/dbdmg_web\/wp-content\/uploads\/2022\/03\/SolutionsExSpark30_46.zip\" target=\"_blank\" rel=\"noreferrer noopener\">SparkNotebooksSol30_46.zip<\/a>) &#8211; <strong>Updated on April 8, 2025<\/strong> &#8211; A new solution based on reduceByKey has been uploaded for solving exercise 39bis (ex39bis_v3.ipynb)<ul class=\" wp-block-list eplus-wrapper eplus-styles-uid-357ee2\">\n<li class=\" eplus-wrapper\">Solution of Exercise 44 based on Left Outer Join (<a href=\"https:\/\/dbdmg.polito.it\/dbdmg_web\/wp-content\/uploads\/2024\/04\/ex44LeftOuterJoin.zip\">ex44LeftOuterJoin.zip<\/a>)<\/li>\n\n\n\n<li class=\" eplus-wrapper\">Solution of Exercise 46 based on Spark SQL APIs + RDD.groupByKey() &#8211; Example to show how to create and manage &#8220;static windows&#8221; with almost only Spark SQL APIs (<a href=\"https:\/\/dbdmg.polito.it\/dbdmg_web\/wp-content\/uploads\/2024\/06\/ex46_DF.zip\" target=\"_blank\" rel=\"noreferrer noopener\">ex46_DF.zip<\/a>)<\/li>\n<\/ul><\/li>\n\n\n\n<li class=\" eplus-wrapper\">PySpark Installation Guide<ul class=\" wp-block-list eplus-wrapper eplus-styles-uid-a89385\">\n<li class=\" eplus-wrapper\">How to run PySpark applications on your PC or Google Colab: You can install PySpark and JupyterLab using\u00a0<strong>Conda\/Miniconda\/pip<\/strong>\u00a0(<a href=\"https:\/\/github.com\/dbdmg\/pyspark-install\" target=\"_blank\" rel=\"noreferrer noopener\">instructions here<\/a>)<\/li>\n<\/ul><\/li>\n<\/ul><\/li>\n\n\n\n<li class=\" eplus-wrapper\">Spark SQL exercises (<a href=\"https:\/\/dbdmg.polito.it\/dbdmg_web\/wp-content\/uploads\/2022\/04\/02_Spark_ExerciseSparkSQLNB.pdf\" target=\"_blank\" rel=\"noreferrer noopener\">pdf<\/a>)<ul class=\" wp-block-list eplus-wrapper eplus-styles-uid-025798\">\n<li class=\" eplus-wrapper\">Example data \u2013 One folder 
with (few) data for each exercise (<a rel=\"noreferrer noopener\" href=\"https:\/\/dbdmg.polito.it\/dbdmg_web\/wp-content\/uploads\/2022\/04\/ExSparkSQLData.zip\" target=\"_blank\">ExSparkSQLData.zip<\/a>)<\/li>\n\n\n\n<li class=\" eplus-wrapper\">Solutions of Exercises 47-50 \u2013 Jupyter notebooks (<a href=\"https:\/\/dbdmg.polito.it\/dbdmg_web\/wp-content\/uploads\/2022\/04\/SparkNotebooksSol47_50.zip\" target=\"_blank\" rel=\"noreferrer noopener\">SparkNotebooksSol47_50.zip<\/a>)<\/li>\n<\/ul><\/li>\n\n\n\n<li class=\" eplus-wrapper\">Spark MLlib exercises (<a rel=\"noreferrer noopener\" href=\"https:\/\/dbdmg.polito.it\/dbdmg_web\/wp-content\/uploads\/2022\/04\/03_MLlib_Exercises_BigDataNB.pdf\" target=\"_blank\">pdf<\/a>)<ul class=\" wp-block-list eplus-wrapper eplus-styles-uid-ee1282\">\n<li class=\" eplus-wrapper\">Example data \u2013 One folder with (few) data for each exercise (<a rel=\"noreferrer noopener\" href=\"https:\/\/dbdmg.polito.it\/dbdmg_web\/wp-content\/uploads\/2022\/04\/ExampleMLlibData.zip\" target=\"_blank\">ExampleMLlibData.zip<\/a>)<\/li>\n\n\n\n<li class=\" eplus-wrapper\">Solutions of Exercise 51 (<a rel=\"noreferrer noopener\" href=\"https:\/\/dbdmg.polito.it\/dbdmg_web\/wp-content\/uploads\/2022\/04\/SparkNotebooksSol51.zip\" target=\"_blank\">SparkNotebooksSol51.zip<\/a>)<\/li>\n<\/ul><\/li>\n\n\n\n<li class=\" eplus-wrapper\">GraphFrame exercises (<a rel=\"noreferrer noopener\" href=\"https:\/\/dbdmg.polito.it\/dbdmg_web\/wp-content\/uploads\/2022\/04\/04_GraphFrame_Exercises_BigDataNB.pdf\" target=\"_blank\">pdf<\/a>)<ul class=\" wp-block-list eplus-wrapper eplus-styles-uid-a28a38\">\n<li class=\" eplus-wrapper\">Example data \u2013 One folder with (few) data for each exercise (<a rel=\"noreferrer noopener\" href=\"https:\/\/dbdmg.polito.it\/dbdmg_web\/wp-content\/uploads\/2022\/04\/ExampleGraphFrameData.zip\" target=\"_blank\">ExampleGraphFrameData.zip<\/a>)<\/li>\n\n\n\n<li class=\" eplus-wrapper\">Solutions of Exercises 52-57b 
\u2013 Jupyter notebooks (<a rel=\"noreferrer noopener\" href=\"https:\/\/dbdmg.polito.it\/dbdmg_web\/wp-content\/uploads\/2022\/05\/SparkNotebooksSol52_57b.zip\" target=\"_blank\">SparkNotebooksSol52_57b.zip<\/a>)<\/li>\n<\/ul><\/li>\n\n\n\n<li class=\" eplus-wrapper\">Spark streaming exercises (<a rel=\"noreferrer noopener\" href=\"https:\/\/dbdmg.polito.it\/dbdmg_web\/wp-content\/uploads\/2022\/04\/05_SparkStreaming_Exercises_BigDataNB.pdf\" target=\"_blank\">pdf<\/a>)<ul class=\" wp-block-list eplus-wrapper eplus-styles-uid-dbd8d0\">\n<li class=\" eplus-wrapper\">Example data \u2013 One folder with (few) data for each exercise (<a rel=\"noreferrer noopener\" href=\"https:\/\/dbdmg.polito.it\/dbdmg_web\/wp-content\/uploads\/2022\/04\/ExampleSparkStreamingData-1.zip\" target=\"_blank\">ExampleSparkStreamingData.zip<\/a>)<\/li>\n\n\n\n<li class=\" eplus-wrapper\">Solutions of Exercises 58-65 \u2013 Jupyter notebooks (<a rel=\"noreferrer noopener\" href=\"https:\/\/dbdmg.polito.it\/dbdmg_web\/wp-content\/uploads\/2022\/04\/SparkNotebooksSol58_65.zip\" target=\"_blank\">SparkNotebooksSol58_65.zip<\/a>)<\/li>\n<\/ul><\/li>\n\n\n\n<li class=\" eplus-wrapper\">Spark structured streaming and MLlib exercise (<a rel=\"noreferrer noopener\" href=\"https:\/\/dbdmg.polito.it\/dbdmg_web\/wp-content\/uploads\/2022\/04\/06_SparkStructuredStreamingAndMLlib_ExercisesNB.pdf\" target=\"_blank\">pdf<\/a>)<ul class=\" wp-block-list eplus-wrapper eplus-styles-uid-8cef2f\">\n<li class=\" eplus-wrapper\">Example data \u2013 One folder with (few) data for each exercise (<a rel=\"noreferrer noopener\" href=\"https:\/\/dbdmg.polito.it\/dbdmg_web\/wp-content\/uploads\/2022\/04\/ExampleSparkStructuredMLlibData.zip\" target=\"_blank\">ExampleSparkStructuredMLlibData.zip<\/a>)<\/li>\n\n\n\n<li class=\" eplus-wrapper\">Solution of Exercise 66 \u2013 Jupyter notebooks (<a href=\"https:\/\/dbdmg.polito.it\/dbdmg_web\/wp-content\/uploads\/2022\/04\/SparkNotebooksSol66.zip\" target=\"_blank\" 
rel=\"noreferrer noopener\">SparkNotebooksSol66.zip<\/a>)<\/li>\n<\/ul><\/li>\n\n\n\n<li class=\" eplus-wrapper\">Further exercises focused on Spark (<a href=\"https:\/\/dbdmg.polito.it\/dbdmg_web\/wp-content\/uploads\/2025\/06\/SparkExercisesGruppoStudioHKN.zip\" target=\"_blank\" rel=\"noreferrer noopener\">zip<\/a>) shared by Nunzio Licalzi. These exercises were used during the Study Group organized by IEEE Eta Kappa Nu.<\/li>\n<\/ul>\n\n\n<h2 class=\" wp-block-heading eplus-wrapper\">Laboratory Material<\/h2>\n\n\n\n<p class=\" eplus-wrapper\">No lab activities during the first week.<\/p>\n\n\n\n<p class=\" eplus-wrapper\">Team 1: Students from A to K \u2013 Tuesday from 11:30 to 13:00 (First lab activity \u2013 March 4, 2025) @ <a href=\"https:\/\/www.labinf.polito.it\/\">LABINF<\/a><br>Team 2: Students from L to Z \u2013 Friday from 11:30 to 13:00 (First lab activity \u2013 March 7, 2025) @ <a href=\"https:\/\/www.labinf.polito.it\/\" target=\"_blank\" rel=\"noreferrer noopener\">LABINF<\/a><\/p>\n\n\n\n<h4 class=\" wp-block-heading eplus-wrapper\" id=\"laptop_configuration\">Laptop Configuration<\/h4>\n\n\n<ul class=\" wp-block-list eplus-wrapper eplus-styles-uid-e8dbb9\">\n<li class=\" eplus-wrapper\"><strong>MapReduce Installation Guide: Configure Visual Studio Code on your laptop<\/strong> (\ud83d\udcd8<a href=\"https:\/\/dbdmg.polito.it\/dbdmg_web\/wp-content\/uploads\/2023\/10\/BigData_labs-VSCode_guide.pdf\">guide<\/a>).<ul class=\" wp-block-list eplus-wrapper eplus-styles-uid-aa6657\">\n<li class=\" eplus-wrapper\"><strong>Windows users only:<\/strong>\u00a0You must configure the\u00a0<strong>winutils<\/strong>\u00a0(\ud83d\uddc3\ufe0f<a href=\"https:\/\/dbdmg.polito.it\/dbdmg_web\/wp-content\/uploads\/2021\/09\/winutils.zip\" target=\"_blank\" rel=\"noreferrer noopener\">winutils.zip<\/a>) and set up some environment variables. 
Follow this \ud83d\udcd8<a href=\"https:\/\/dbdmg.polito.it\/dbdmg_web\/wp-content\/uploads\/2021\/09\/ConfigureWindowsEnviroment.pdf\">extra guide<\/a>\u00a0for the complete configuration.<\/li>\n\n\n\n<li class=\" eplus-wrapper\">Laboratory materials are available in multiple versions. The version with libraries is <strong><span style=\"text-decoration: underline;\">the only one<\/span><\/strong> you can use on the LABINF computers. Use it on your laptop if you are not interested in running the applications locally. All the other versions are Maven projects, so you can use them on your laptop to write the code and then run it either locally inside Visual Studio Code or on the BigData@Polito cluster (note that the way you export the JAR is different with Maven!). Legend: \ud83d\udcdalib: Project\/template with libraries, \ud83d\udc27mavU: Maven project for Linux\/MacOS, \ud83e\ude9fmavW: Maven project for Windows (Hadoop projects only).<ul class=\" wp-block-list eplus-wrapper eplus-styles-uid-7daf8a\">\n<li class=\" eplus-wrapper\">Basic project for <strong>MapReduce<\/strong> applications (\ud83d\udcda<strong><a href=\"https:\/\/dbdmg.polito.it\/dbdmg_web\/wp-content\/uploads\/2024\/03\/MapReduceBasicProjectWithLibraries.zip\" target=\"_blank\" rel=\"noreferrer noopener\"><strong>lib<\/strong><\/a><\/strong>, \ud83d\udc27<a href=\"https:\/\/dbdmg.polito.it\/dbdmg_web\/wp-content\/uploads\/2021\/09\/MapReduceBasicProject.zip\" target=\"_blank\" rel=\"noreferrer noopener\"><strong>mavU<\/strong><\/a>, \ud83e\ude9f<a href=\"https:\/\/dbdmg.polito.it\/dbdmg_web\/wp-content\/uploads\/2021\/09\/MapReduceBasicProjectWindows.zip\" target=\"_blank\" rel=\"noreferrer noopener\"><strong>mavW<\/strong><\/a>)<\/li>\n<\/ul><\/li>\n<\/ul><\/li>\n\n\n\n<li class=\" eplus-wrapper\"><strong>PySpark Installation Guide<\/strong>: How to run PySpark applications on your PC or Google Colab: You can install PySpark and JupyterLab 
using\u00a0<strong>Conda\/Miniconda\/pip<\/strong>\u00a0(\ud83d\udcd8<a href=\"https:\/\/github.com\/dbdmg\/pyspark-install\" target=\"_blank\" rel=\"noreferrer noopener\">guide<\/a>).<\/li>\n<\/ul>\n\n\n<h4 class=\" wp-block-heading eplus-wrapper\">Problem specifications and solutions<\/h4>\n\n\n\n\n\n<h2 class=\" wp-block-heading eplus-wrapper\">Previous exam examples<\/h2>\n\n\n\n<figure class=\" wp-block-table eplus-wrapper\"><table><tbody><tr><td>Exams<\/td><td>Solutions<\/td><\/tr><tr><td>Exam July 11, 2025 (<a href=\"https:\/\/dbdmg.polito.it\/dbdmg_web\/wp-content\/uploads\/2025\/07\/DBD_Exam_2025_07_11_v2.pdf\" target=\"_blank\" rel=\"noreferrer noopener\">pdf<\/a>)<\/td><td>Question 1: (c)<br>Question 2: (c)<br>MapReduce and Spark (<a href=\"https:\/\/dbdmg.polito.it\/dbdmg_web\/wp-content\/uploads\/2025\/07\/dbd_20250711.zip\">DBD_Exam20250711Sol.zip<\/a>)<\/td><\/tr><tr><td>Exam June 27, 2025 (<a href=\"https:\/\/dbdmg.polito.it\/dbdmg_web\/wp-content\/uploads\/2025\/07\/DBD_Exam_2025_06_27.pdf\" target=\"_blank\" rel=\"noreferrer noopener\">pdf<\/a>)<\/td><td>Question 1: (a)<br>Question 2: (b)<br>MapReduce and Spark (<a href=\"https:\/\/dbdmg.polito.it\/dbdmg_web\/wp-content\/uploads\/2025\/07\/dbd_20250627.zip\" target=\"_blank\" rel=\"noreferrer noopener\">DBD_Exam20250627Sol.zip<\/a>)<\/td><\/tr><tr><td>Exam February 10, 2025 (<a href=\"https:\/\/dbdmg.polito.it\/dbdmg_web\/wp-content\/uploads\/2025\/06\/DBD_Exam_2025_02_10.pdf\" target=\"_blank\" rel=\"noreferrer noopener\">pdf<\/a>)<\/td><td>Question 1: (b)<br>Question 2: (a)<\/td><\/tr><tr><td>Exam September 6, 2024 (<a href=\"https:\/\/dbdmg.polito.it\/dbdmg_web\/wp-content\/uploads\/2024\/09\/DBD_Exam_2024_09_06.pdf\" target=\"_blank\" rel=\"noreferrer noopener\">pdf<\/a>)<\/td><td>Question 1: (a) &#8211; The three codes are equivalent. They are based on commutative functions\/methods.<br>Question 2: (a) &#8211; There are 3 distinct keys emitted by the map phase. 
Hence, the reduce method is invoked 3 times. It follows that the sum of the values of the three instances of numCitiesD is 3.<br>MapReduce and Spark (<a href=\"https:\/\/dbdmg.polito.it\/dbdmg_web\/wp-content\/uploads\/2024\/09\/dbd_20240906.zip\" target=\"_blank\" rel=\"noreferrer noopener\">DBD_Exam20240906Sol.zip<\/a>)<\/td><\/tr><tr><td>Exam July 19, 2024 (<a href=\"https:\/\/dbdmg.polito.it\/dbdmg_web\/wp-content\/uploads\/2024\/07\/DBD_Exam_2024_07_19.pdf\">pdf<\/a>)<\/td><td>Question 1: (b) &#8211; 2 times &#8211; Three actions are based on the content of the input file, but highTempRDD is cached. Hence, the input file is read once to compute the value of the count action applied to tempRDD and then one more time to compute the content of highTempRDD, which is then used to calculate the results of the actions count and reduce applied to highTempRDD. Globally, due to the cache of highTempRDD, the input file is read twice. <br>Question 2: (d) &#8211; 6 &#8211; There are 6 input lines =&gt; the map method is invoked, overall, 6 times.<br>MapReduce and Spark (<a href=\"https:\/\/dbdmg.polito.it\/dbdmg_web\/wp-content\/uploads\/2024\/07\/dbd_20240719.zip\" target=\"_blank\" rel=\"noreferrer noopener\">DBD_Exam20240719Sol.zip<\/a>)<\/td><\/tr><tr><td>Exam July 5, 2024 (<a href=\"https:\/\/dbdmg.polito.it\/dbdmg_web\/wp-content\/uploads\/2024\/07\/DBD_Exam_2024_07_05.pdf\" target=\"_blank\" rel=\"noreferrer noopener\">pdf<\/a>)<\/td><td>Question 1: (c) &#8211; Application B is not equivalent to A and C because .reduce(lambda v1,v2: min(v1, v2) ).filter(lambda value : value&gt;5) is not equivalent to .filter(lambda value : value&gt;5).reduce(lambda v1,v2: min(v1, v2) ). 
The two functions are not commutative.<br>Question 2: (a) &#8211; Considering all instances of the reducer class, the reduce method is invoked 3 times overall (2 + 1 + 0).<br>MapReduce and Spark (<a href=\"https:\/\/dbdmg.polito.it\/dbdmg_web\/wp-content\/uploads\/2024\/07\/dbd_20240705.zip\" target=\"_blank\" rel=\"noreferrer noopener\">DBD_Exam20240705Sol.zip<\/a>) &#8211; <strong>A more efficient solution based on one single job has been uploaded &#8211; June 3, 2025<\/strong><br>Sketch of a solution based on SQL (<a href=\"https:\/\/dbdmg.polito.it\/dbdmg_web\/wp-content\/uploads\/2024\/07\/DraftSQLBased.pdf\" target=\"_blank\" rel=\"noreferrer noopener\">SQLBasedSolution.pdf<\/a>)<\/td><\/tr><tr><td>Exam February 20, 2024 (<a href=\"https:\/\/dbdmg.polito.it\/dbdmg_web\/wp-content\/uploads\/2024\/06\/DBD_Exam_2024_02_20.pdf\" target=\"_blank\" rel=\"noreferrer noopener\">pdf<\/a>)<\/td><td>Question 1: (a), Question 2: (b)<br>MapReduce and Spark (<a href=\"https:\/\/dbdmg.polito.it\/dbdmg_web\/wp-content\/uploads\/2024\/06\/DraftSolution20240220DBD.zip\" target=\"_blank\" rel=\"noreferrer noopener\">DBD_Exam20240220Sol.zip<\/a>)<\/td><\/tr><tr><td>Exam September 18, 2023 (<a href=\"https:\/\/dbdmg.polito.it\/dbdmg_web\/wp-content\/uploads\/2024\/06\/DBD_Exam20230918.pdf\" target=\"_blank\" rel=\"noreferrer noopener\">pdf<\/a>)<\/td><td>Question 1: (c), Question 2: (c)<br>MapReduce and Spark (<a href=\"https:\/\/dbdmg.polito.it\/dbdmg_web\/wp-content\/uploads\/2024\/07\/Draft_DBD_EXAM_20230918.pdf\" target=\"_blank\" rel=\"noreferrer noopener\">Paper-based sketch of the solution &#8211; No code_ Exam20230918.pdf<\/a>)<\/td><\/tr><tr><td>Exam July 19, 2023 (<a href=\"https:\/\/dbdmg.polito.it\/dbdmg_web\/wp-content\/uploads\/2023\/07\/DBD_Exam20230719.pdf\" target=\"_blank\" rel=\"noreferrer noopener\">pdf<\/a>)<\/td><td>Question 1: (a), Question 2: (b)<br>MapReduce and Spark (<a 
href=\"https:\/\/dbdmg.polito.it\/dbdmg_web\/wp-content\/uploads\/2023\/07\/dbd_20230719.zip\" target=\"_blank\" rel=\"noreferrer noopener\">DBD_Exam20230719Sol.zip<\/a>) &#8211; with an SQL-based solution and some example data &#8211; <strong>Updated on June 2, 2025<\/strong><\/td><\/tr><tr><td>Exam June 26, 2023 (<a href=\"https:\/\/dbdmg.polito.it\/dbdmg_web\/wp-content\/uploads\/2023\/07\/DBD_Exam20230626.pdf\" target=\"_blank\" rel=\"noreferrer noopener\">pdf<\/a>)<\/td><td>Question 1: (b), Question 2: (c)<br>MapReduce and Spark (<a href=\"https:\/\/dbdmg.polito.it\/dbdmg_web\/wp-content\/uploads\/2023\/07\/DBD_Exam20230626Sol.zip\" target=\"_blank\" rel=\"noreferrer noopener\">DBD_Exam20230626Sol.zip<\/a>) &#8211; with an SQL-based solution and some example data<\/td><\/tr><tr><td>Exam September 1, 2022 (<a href=\"https:\/\/dbdmg.polito.it\/dbdmg_web\/wp-content\/uploads\/2022\/09\/DBD_Exam20220901.pdf\" target=\"_blank\" rel=\"noreferrer noopener\">pdf<\/a>)<\/td><td>Question 1: (b), Question 2: (d)<br>MapReduce and Spark (<a href=\"https:\/\/dbdmg.polito.it\/dbdmg_web\/wp-content\/uploads\/2022\/09\/DBD_Exam20220901Sol.zip\" target=\"_blank\" rel=\"noreferrer noopener\">DBD_Exam20220901Sol.zip<\/a>)<\/td><\/tr><tr><td>Exam July 18, 2022 (<a href=\"https:\/\/dbdmg.polito.it\/dbdmg_web\/wp-content\/uploads\/2022\/07\/DBD_Exam20220718.pdf\" target=\"_blank\" rel=\"noreferrer noopener\">pdf<\/a>)<\/td><td>Question 1: (b), Question 2: (b)<br>MapReduce and Spark (<a href=\"https:\/\/dbdmg.polito.it\/dbdmg_web\/wp-content\/uploads\/2022\/07\/DraftSolution20220718.zip\" target=\"_blank\" rel=\"noreferrer noopener\">DBD_Exam20220718Sol.zip<\/a>) &#8211; with an SQL-based solution &#8211; Example related to &#8220;static windows&#8221; and how to manage them with either the RDD or the Spark SQL APIs<\/td><\/tr><tr><td>Exam June 27, 2022 (<a href=\"https:\/\/dbdmg.polito.it\/dbdmg_web\/wp-content\/uploads\/2022\/06\/DBD_Exam20220627.pdf\" target=\"_blank\" rel=\"noreferrer 
noopener\">pdf<\/a>)<\/td><td>Question 1: (c), Question 2: (a)<br>MapReduce and Spark (<a href=\"https:\/\/dbdmg.polito.it\/dbdmg_web\/wp-content\/uploads\/2022\/06\/DBD_Exam20220627Sol.zip\" target=\"_blank\" rel=\"noreferrer noopener\">DBD_Exam20220627Sol.zip<\/a>) &#8211; with an SQL-based solution and some example data<\/td><\/tr><tr><td>Exam February 10, 2022 (<a href=\"https:\/\/dbdmg.polito.it\/dbdmg_web\/wp-content\/uploads\/2022\/04\/DBD_Exam20220210.pdf\" target=\"_blank\" rel=\"noreferrer noopener\">pdf<\/a>)<\/td><td>Question 1: (a), Question 2: (b)<br>MapReduce and Spark (<a href=\"https:\/\/dbdmg.polito.it\/dbdmg_web\/wp-content\/uploads\/2022\/06\/DBD_Exam20220210Sol.zip\" target=\"_blank\" rel=\"noreferrer noopener\">DBD_Exam20220210Sol.zip<\/a>)<\/td><\/tr><tr><td>Exam September 17, 2021 (<a href=\"https:\/\/dbdmg.polito.it\/dbdmg_web\/wp-content\/uploads\/2022\/04\/DBD_Exam20210917.pdf\" target=\"_blank\" rel=\"noreferrer noopener\">pdf<\/a>)<\/td><td>Question 1: (b), Question 2: (a)<br>MapReduce and Spark (<a href=\"https:\/\/dbdmg.polito.it\/dbdmg_web\/wp-content\/uploads\/2022\/06\/DraftSolution20210917.zip\" target=\"_blank\" rel=\"noreferrer noopener\">DBD_Exam20210917.zip<\/a>)<\/td><\/tr><tr><td>Exam July 5, 2021 (<a href=\"https:\/\/dbdmg.polito.it\/dbdmg_web\/wp-content\/uploads\/2025\/04\/DBD_Exam20210705.pdf\" target=\"_blank\" rel=\"noreferrer noopener\">pdf<\/a>)<\/td><td>Question 1: (c), Question 2: (a)<br>MapReduce and Spark (<a href=\"https:\/\/dbdmg.polito.it\/dbdmg_web\/wp-content\/uploads\/2024\/05\/DBD_Exam20210705Sol.zip\" target=\"_blank\" rel=\"noreferrer noopener\">DBD_Exam20210705Sol.zip<\/a>) &#8211; with an SQL-based solution<\/td><\/tr><tr><td>Exam June 21, 2021 (<a href=\"https:\/\/dbdmg.polito.it\/wordpress\/wp-content\/uploads\/2021\/06\/DBD_Exam20210621.pdf\" target=\"_blank\" rel=\"noreferrer noopener\">pdf<\/a>)<\/td><td>Question 1: (b), Question 2: (a)<br>MapReduce and Spark (<a 
href=\"https:\/\/dbdmg.polito.it\/wordpress\/wp-content\/uploads\/2021\/06\/DraftSolutionExam_20210621.zip\">DBD_Exam20210621Sol.zip<\/a>)<\/td><\/tr><tr><td>Exam July 20, 2020 (<a href=\"https:\/\/dbdmg.polito.it\/dbdmg_web\/wp-content\/uploads\/2024\/04\/DBD_Exam20200720.pdf\" target=\"_blank\" rel=\"noreferrer noopener\">pdf<\/a>)<\/td><td>Question 1: (d), Question 2: (b)<br>Question 2 \u2013 Note that there are three actions. Hence, the input file is read three times.<br>MapReduce and Spark (<a href=\"https:\/\/dbdmg.polito.it\/dbdmg_web\/wp-content\/uploads\/2024\/05\/DBD_Exam20200720Sol.zip\" target=\"_blank\" rel=\"noreferrer noopener\">DBD_Exam20200720Sol.zip<\/a>)<\/td><\/tr><tr><td>Exam June 27, 2020 (<a href=\"https:\/\/dbdmg.polito.it\/dbdmg_web\/wp-content\/uploads\/2024\/04\/DBD_Exam20200627.pdf\" target=\"_blank\" rel=\"noreferrer noopener\">pdf<\/a>)<\/td><td>Question 1: (b), Question 2: (a)<br>MapReduce and Spark (<a href=\"https:\/\/dbdmg.polito.it\/wordpress\/wp-content\/uploads\/2020\/06\/DBD_Exam20200627Sol.zip\">DBD_Exam20200627Sol.zip<\/a>)<\/td><\/tr><tr><td>More examples of multiple choice questions (<a href=\"https:\/\/dbdmg.polito.it\/wordpress\/wp-content\/uploads\/2021\/06\/ExamplesMultipleChoiceQuestions.pdf\">pdf<\/a>)<\/td><td>Question 1: (c)<br>Question 2: (d)<br>Question 3: (d)<br>Question 4: (d)<br>Question 5: (b)<br>Question 6: (d)<\/td><\/tr><tr><td>GraphFrame \u2013 Examples of multiple choice questions (<a href=\"https:\/\/dbdmg.polito.it\/dbdmg_web\/wp-content\/uploads\/2022\/05\/ExamplesMultipleChoiceQuestionsGraphFrame.pdf\" target=\"_blank\" rel=\"noreferrer noopener\">pdf<\/a>)<\/td><td>Question 1: (d)<br>Question 2: (c)<\/td><\/tr><\/tbody><\/table><\/figure>\n\n\n\n<h2 class=\" wp-block-heading eplus-wrapper\">Additional material<\/h2>\n\n\n\n<p class=\" eplus-wrapper\">Slides and screencasts about Java (kindly provided by Prof. 
Torchiano) (<a href=\"http:\/\/dbdmg.polito.it\/~paolo\/JavaMaterials\/02JEY%20-%20Object%20Oriented%20Programming.html\">link<\/a>)<br>Focus on the following subset of slides\/lectures (for students who have never used Java):<br>&#8212; OO Paradigm and UML (The UML part is not mandatory)<br>&#8212; The Java Environment<br>&#8212;  Java Basic Features<br>&#8212; Java Inheritance<\/p>\n\n\n\n<p class=\" eplus-wrapper\"><\/p>\n\n\n\n<p class=\" eplus-wrapper\"><\/p>\n","protected":false},"excerpt":{"rendered":"<p>Former version of the web page of the course General Information SSD: ING-INF\/05 CFU: 8 Professor: Paolo Garza Teaching Assistants: Simone Monaco and Claudio Savelli Teaching Material Introduction Hadoop and MapReduce Spark Exercises MapReduce To configure Visual Studio Code on your laptop, follow these instructions. Spark Laboratory Material No lab &hellip;<\/p>\n","protected":false},"author":5,"featured_media":3290,"comment_status":"closed","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"editor_plus_copied_stylings":"{}","footnotes":""},"categories":[37],"tags":[],"class_list":["post-11322","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-courses"],"_links":{"self":[{"href":"https:\/\/dbdmg.polito.it\/dbdmg_web\/wp-json\/wp\/v2\/posts\/11322","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/dbdmg.polito.it\/dbdmg_web\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/dbdmg.polito.it\/dbdmg_web\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/dbdmg.polito.it\/dbdmg_web\/wp-json\/wp\/v2\/users\/5"}],"replies":[{"embeddable":true,"href":"https:\/\/dbdmg.polito.it\/dbdmg_web\/wp-json\/wp\/v2\/comments?post=11322"}],"version-history":[{"count":60,"href":"https:\/\/dbdmg.polito.it\/dbdmg_web\/wp-json\/wp\/v2\/posts\/11322\/revisions"}],"predecessor-version":[{"id":13815,"href":"https:\/\/dbdmg.polito.it\/dbdmg_web\/wp-json\/wp\/v2\/pos
ts\/11322\/revisions\/13815"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/dbdmg.polito.it\/dbdmg_web\/wp-json\/wp\/v2\/media\/3290"}],"wp:attachment":[{"href":"https:\/\/dbdmg.polito.it\/dbdmg_web\/wp-json\/wp\/v2\/media?parent=11322"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/dbdmg.polito.it\/dbdmg_web\/wp-json\/wp\/v2\/categories?post=11322"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/dbdmg.polito.it\/dbdmg_web\/wp-json\/wp\/v2\/tags?post=11322"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}