{"id":12554,"date":"2025-09-19T14:36:56","date_gmt":"2025-09-19T12:36:56","guid":{"rendered":"https:\/\/dbdmg.polito.it\/dbdmg_web\/?p=12554"},"modified":"2026-02-19T09:07:18","modified_gmt":"2026-02-19T08:07:18","slug":"big-data-processing-and-analytics-2025-26","status":"publish","type":"post","link":"https:\/\/dbdmg.polito.it\/dbdmg_web\/2025\/big-data-processing-and-analytics-2025-26\/","title":{"rendered":"Big Data Processing and Analytics (2025\/26)"},"content":{"rendered":"\n<h2 class=\" wp-block-heading eplus-wrapper\" id=\"general-information\">General Information<\/h2>\n\n\n\n<p class=\" eplus-wrapper\"><strong>SSD<\/strong>: ING-INF\/05<\/p>\n\n\n\n<p class=\" eplus-wrapper\"><strong>CFU<\/strong>: 6<\/p>\n\n\n\n<p class=\" eplus-wrapper\"><strong>Professor<\/strong>: Paolo Garza<\/p>\n\n\n\n<p class=\" eplus-wrapper\"><strong>Teaching Assistants<\/strong>: Lorenzo Vaiani and Etibar Vazirov<\/p>\n\n\n\n<h2 class=\" wp-block-heading eplus-wrapper\">Announcements<\/h2>\n\n\n\n<p class=\" eplus-wrapper\">The first lecture is scheduled for Monday, September 22, 2025, at 14:30 in Room R3 (+ streaming of the virtual classroom)<\/p>\n\n\n\n<hr class=\" wp-block-separator has-css-opacity eplus-wrapper\"\/>\n\n\n\n<h3 class=\" wp-block-heading eplus-wrapper\" id=\"teaching-material\">Teaching Material<\/h3>\n\n\n\n<h5 class=\" wp-block-heading eplus-wrapper\">INTRODUCTION<\/h5>\n\n\n<ul class=\" wp-block-list eplus-wrapper eplus-styles-uid-37798b\">\n<li class=\" eplus-wrapper\">Introduction to the course content and exam rules (<a href=\"https:\/\/dbdmg.polito.it\/dbdmg_web\/wp-content\/uploads\/2025\/09\/00_Intro_BigDataProcessing_2526.pdf\" target=\"_blank\" rel=\"noreferrer noopener\">slides<\/a>)<\/li>\n\n\n\n<li class=\" eplus-wrapper\">Introduction to Big Data (<a href=\"https:\/\/dbdmg.polito.it\/dbdmg_web\/wp-content\/uploads\/2024\/09\/01_Intro_BigData_BigData_NewStyle.pdf\" target=\"_blank\" rel=\"noreferrer noopener\">slides<\/a>)<\/li>\n\n\n\n<li 
class=\" eplus-wrapper\">Big Data Architectures (<a href=\"https:\/\/dbdmg.polito.it\/dbdmg_web\/wp-content\/uploads\/2024\/09\/02_Architectures_BigData_NewStyle.pdf\" target=\"_blank\" rel=\"noreferrer noopener\">slides<\/a>)<\/li>\n<\/ul>\n\n\n<h5 class=\" wp-block-heading eplus-wrapper\">HADOOP AND MAPREDUCE<\/h5>\n\n\n<ul class=\" wp-block-list eplus-wrapper eplus-styles-uid-5ecbd8\">\n<li class=\" eplus-wrapper\">Introduction to Apache Hadoop and the MapReduce programming paradigm (<a href=\"https:\/\/dbdmg.polito.it\/dbdmg_web\/wp-content\/uploads\/2024\/09\/03_Intro_HadoopAndMapReduce_BigData_NewStyle.pdf\" target=\"_blank\" rel=\"noreferrer noopener\">slides<\/a>)<ul class=\" wp-block-list eplus-wrapper eplus-styles-uid-93f520\">\n<li class=\" eplus-wrapper\"><em>Interaction with HDFS and Hadoop by means of the command line (<a href=\"https:\/\/dbdmg.polito.it\/dbdmg_web\/wp-content\/uploads\/2021\/09\/03b_HDFS_Hadoop_CommandLine_BigData_NewStyle.pdf\" target=\"_blank\" rel=\"noreferrer noopener\">slides<\/a>) &#8211; Not covered this academic year.<\/em><\/li>\n<\/ul><\/li>\n\n\n\n<li class=\" eplus-wrapper\">Hadoop implementation of MapReduce (<a href=\"https:\/\/dbdmg.polito.it\/dbdmg_web\/wp-content\/uploads\/2024\/09\/04_HadoopImplementationOfMapReduce_NewStyle.pdf\" target=\"_blank\" rel=\"noreferrer noopener\">slides<\/a>)<ul class=\" wp-block-list eplus-wrapper eplus-styles-uid-9ae7fd\">\n<li class=\" eplus-wrapper\">BigData@Polito environment + Jupyter \u2013 How to submit MapReduce jobs on BigData@Polito (<a href=\"https:\/\/dbdmg.polito.it\/dbdmg_web\/wp-content\/uploads\/2025\/09\/04b_ClusterJupyter_BigData_NewStyle.pdf\" target=\"_blank\" rel=\"noreferrer noopener\">slides<\/a>)<\/li>\n<\/ul><\/li>\n\n\n\n<li class=\" eplus-wrapper\">MapReduce and Hadoop \u2013 Advanced Topics: Multiple inputs, Multiple outputs, Distributed cache (<a 
href=\"https:\/\/dbdmg.polito.it\/dbdmg_web\/wp-content\/uploads\/2021\/09\/06_AdvancedTopicsMapReduce_BigData_NewStyle.pdf\" target=\"_blank\" rel=\"noreferrer noopener\">slides<\/a>)<\/li>\n\n\n\n<li class=\" eplus-wrapper\">MapReduce \u2013 Design patterns \u2013 Part 1 (<a href=\"https:\/\/dbdmg.polito.it\/dbdmg_web\/wp-content\/uploads\/2024\/02\/05_MapReduce_Patterns_Part1_BigDataNB.pdf\" target=\"_blank\" rel=\"noreferrer noopener\">slides<\/a>)<\/li>\n\n\n\n<li class=\" eplus-wrapper\">MapReduce \u2013 Design patterns \u2013 Part 2 (<a rel=\"noreferrer noopener\" href=\"https:\/\/dbdmg.polito.it\/dbdmg_web\/wp-content\/uploads\/2021\/09\/07_MapReduce_Patterns_Part2_BigData_NewStyle.pdf\" target=\"_blank\">slides<\/a>)<\/li>\n\n\n\n<li class=\" eplus-wrapper\">MapReduce \u2013 Relational Algebra\/SQL operators (<a rel=\"noreferrer noopener\" href=\"https:\/\/dbdmg.polito.it\/dbdmg_web\/wp-content\/uploads\/2021\/09\/08_SQLOperators_BigData_NewStyle.pdf\" target=\"_blank\">slides<\/a>)<\/li>\n<\/ul>\n\n\n<h5 class=\" wp-block-heading eplus-wrapper\">SPARK<\/h5>\n\n\n<ul class=\" wp-block-list eplus-wrapper eplus-styles-uid-9f0362\">\n<li class=\" eplus-wrapper\">Introduction to Apache Spark (<a href=\"https:\/\/dbdmg.polito.it\/dbdmg_web\/wp-content\/uploads\/2024\/10\/10_SparkIntroduction_BigData_NewStyle.pdf\" target=\"_blank\" rel=\"noreferrer noopener\">slides<\/a>) <ul class=\" wp-block-list eplus-wrapper eplus-styles-uid-d63ce3\">\n<li class=\" eplus-wrapper\">How to submit Spark applications (<a href=\"https:\/\/dbdmg.polito.it\/dbdmg_web\/wp-content\/uploads\/2024\/10\/10b_SparkSubmit_BigData_NewStyle.pdf\" target=\"_blank\" rel=\"noreferrer noopener\">slides<\/a>)<\/li>\n\n\n\n<li class=\" eplus-wrapper\">How to use Jupyter Notebooks for your Spark applications (<a href=\"https:\/\/dbdmg.polito.it\/dbdmg_web\/wp-content\/uploads\/2022\/03\/10c_JupyterNotebooks_DistributedBigDataNB.pdf\" target=\"_blank\" rel=\"noreferrer 
noopener\">pdf<\/a>)<\/li>\n\n\n\n<li class=\" eplus-wrapper\">You can install PySpark and JupyterLab using\u00a0<strong>Conda\/Miniconda\/pip<\/strong>\u00a0(<a href=\"https:\/\/github.com\/dbdmg\/pyspark-install\" target=\"_blank\" rel=\"noreferrer noopener\">instructions here<\/a>)<\/li>\n<\/ul><\/li>\n\n\n\n<li class=\" eplus-wrapper\">RDD-based programs<ul class=\" wp-block-list eplus-wrapper eplus-styles-uid-cf22be\">\n<li class=\" eplus-wrapper\">Creation, basic transformations, and actions (<a href=\"https:\/\/dbdmg.polito.it\/dbdmg_web\/wp-content\/uploads\/2024\/10\/11_SparkRDDBasedProgramming_BigData_NewStyle.pdf\" target=\"_blank\" rel=\"noreferrer noopener\">slides<\/a>) &#8211; Notebook with some examples from the slides (<a href=\"https:\/\/dbdmg.polito.it\/dbdmg_web\/wp-content\/uploads\/2024\/10\/FirstExamplesNotebook.zip\">FirstExamplesNotebook.zip<\/a>)<\/li>\n\n\n\n<li class=\" eplus-wrapper\">Key-value pair RDDs: transformations and actions on PairRDDs (<a href=\"https:\/\/dbdmg.polito.it\/dbdmg_web\/wp-content\/uploads\/2024\/10\/12_SparkPairRDD_BigData_NewStyle.pdf\" target=\"_blank\" rel=\"noreferrer noopener\">slides<\/a>)<ul class=\" wp-block-list eplus-wrapper eplus-styles-uid-a9db5a\">\n<li class=\" eplus-wrapper\">Inner join, left outer join, right outer join, full outer join, and \u201cNOT IN\u201d with PairRDDs: Examples \u2013 Notebook (<a href=\"https:\/\/dbdmg.polito.it\/dbdmg_web\/wp-content\/uploads\/2024\/11\/JoinsRDD.zip\" target=\"_blank\" rel=\"noreferrer noopener\">JoinsRDD.zip<\/a>)<\/li>\n<\/ul><\/li>\n\n\n\n<li class=\" eplus-wrapper\">DoubleRDDs (<a href=\"https:\/\/dbdmg.polito.it\/dbdmg_web\/wp-content\/uploads\/2024\/10\/13_SparkDoubleRDD_BigData_NewStyle.pdf\" target=\"_blank\" rel=\"noreferrer noopener\">slides<\/a>)<\/li>\n\n\n\n<li class=\" eplus-wrapper\">Advanced Topics: Cache, accumulators, broadcast variables (<a 
href=\"https:\/\/dbdmg.polito.it\/dbdmg_web\/wp-content\/uploads\/2024\/10\/14_SparkRDDBasedProgramming_AdvancedTopics_BigData_NewStyle.pdf\" target=\"_blank\" rel=\"noreferrer noopener\">slides<\/a>) &#8211; Notebooks with some examples (<a href=\"https:\/\/dbdmg.polito.it\/dbdmg_web\/wp-content\/uploads\/2024\/11\/ExamplesAccumulatorPython.zip\">ExamplesAccumulatorPython.zip<\/a>)<\/li>\n<\/ul><\/li>\n\n\n\n<li class=\" eplus-wrapper\">Spark SQL and DataFrames (<a href=\"https:\/\/dbdmg.polito.it\/dbdmg_web\/wp-content\/uploads\/2024\/10\/15_SparkSQL_BigData_NewStyle.pdf\" target=\"_blank\" rel=\"noreferrer noopener\">slides<\/a>)<ul class=\" wp-block-list eplus-wrapper eplus-styles-uid-8a6dd7\">\n<li class=\" eplus-wrapper\">Simple examples \u2013 Jupyter notebook (<a href=\"https:\/\/dbdmg.polito.it\/dbdmg_web\/wp-content\/uploads\/2024\/12\/SparkSQLSimpleExamples.zip\" target=\"_blank\" rel=\"noreferrer noopener\">SparkSimpleExamples.zip<\/a>)<\/li>\n\n\n\n<li class=\" eplus-wrapper\">Spark SQL \u2013 Join examples (<a href=\"https:\/\/dbdmg.polito.it\/dbdmg_web\/wp-content\/uploads\/2024\/12\/ExamplesSparkSQLJoins.zip\">ExamplesSparkSQLJoins.zip<\/a>)<\/li>\n<\/ul><\/li>\n\n\n\n<li class=\" eplus-wrapper\">Spark MLlib<ul class=\" wp-block-list eplus-wrapper eplus-styles-uid-537ff9\">\n<li class=\" eplus-wrapper\">Introduction to MLlib (<a href=\"https:\/\/dbdmg.polito.it\/dbdmg_web\/wp-content\/uploads\/2024\/10\/16a_SparkMLlib_BigData_NewStyle.pdf\" target=\"_blank\" rel=\"noreferrer noopener\">slides<\/a>)<\/li>\n\n\n\n<li class=\" eplus-wrapper\">Classification of structured data and textual data (<a href=\"https:\/\/dbdmg.polito.it\/dbdmg_web\/wp-content\/uploads\/2024\/10\/16b_SparkMLlib_BigData_NewStyle.pdf\" target=\"_blank\" rel=\"noreferrer noopener\">slides<\/a>)<ul class=\" wp-block-list eplus-wrapper eplus-styles-uid-2beccd\">\n<li class=\" eplus-wrapper\">Classification example code (<a 
href=\"https:\/\/dbdmg.polito.it\/dbdmg_web\/wp-content\/uploads\/2024\/10\/ExampleClassificationMLlib.zip\" target=\"_blank\" rel=\"noreferrer noopener\">zip<\/a>)<\/li>\n<\/ul><\/li>\n\n\n\n<li class=\" eplus-wrapper\"><em>Regression (<a href=\"https:\/\/dbdmg.polito.it\/dbdmg_web\/wp-content\/uploads\/2024\/10\/18d_SparkMLlib_BigData_NewStyle.pdf\" target=\"_blank\" rel=\"noreferrer noopener\">slides<\/a>)<strong> &#8211; Not covered this academic year<\/strong><\/em><ul class=\" wp-block-list eplus-wrapper eplus-styles-uid-851acb\">\n<li class=\" eplus-wrapper\"><em>Linear regression example code (<a href=\"https:\/\/dbdmg.polito.it\/dbdmg_web\/wp-content\/uploads\/2024\/10\/ExampleRegressionMLlib.zip\" target=\"_blank\" rel=\"noreferrer noopener\">zip<\/a>)<strong> &#8211; Not covered this academic year<\/strong><\/em><\/li>\n<\/ul><\/li>\n\n\n\n<li class=\" eplus-wrapper\"><em>Clustering of structured data (<a href=\"https:\/\/dbdmg.polito.it\/dbdmg_web\/wp-content\/uploads\/2024\/10\/16c_SparkMLlib_BigData_NewStyle.pdf\" target=\"_blank\" rel=\"noreferrer noopener\">slides<\/a>)<strong> &#8211; Not covered this academic year<\/strong><\/em><ul class=\" wp-block-list eplus-wrapper eplus-styles-uid-f0ac3b\">\n<li class=\" eplus-wrapper\"><em>Clustering example code (<a href=\"https:\/\/dbdmg.polito.it\/dbdmg_web\/wp-content\/uploads\/2024\/10\/ExampleClusteringMLlib.zip\" target=\"_blank\" rel=\"noreferrer noopener\">zip<\/a>)<strong> &#8211; Not covered this academic year<\/strong><\/em><\/li>\n<\/ul><\/li>\n\n\n\n<li class=\" eplus-wrapper\"><em>Itemset and Association rule mining (<a href=\"https:\/\/dbdmg.polito.it\/dbdmg_web\/wp-content\/uploads\/2024\/10\/18e_SparkMLlib_BigData_NewStyle.pdf\" target=\"_blank\" rel=\"noreferrer noopener\">slides<\/a>)<strong> &#8211; Not covered this academic year<\/strong><\/em><ul class=\" wp-block-list eplus-wrapper eplus-styles-uid-78a252\">\n<li class=\" eplus-wrapper\"><em>Itemset and Association rule mining example 
code (<a href=\"https:\/\/dbdmg.polito.it\/dbdmg_web\/wp-content\/uploads\/2024\/10\/ExampleItemsetMLlib.zip\" target=\"_blank\" rel=\"noreferrer noopener\">zip<\/a>)<strong> &#8211; Not covered this academic year<\/strong><\/em><\/li>\n<\/ul><\/li>\n<\/ul><\/li>\n\n\n\n<li class=\" eplus-wrapper\">Spark Streaming (<a href=\"https:\/\/dbdmg.polito.it\/dbdmg_web\/wp-content\/uploads\/2024\/10\/21_SparkStreaming_BigData_NewStyle.pdf\" target=\"_blank\" rel=\"noreferrer noopener\">slides<\/a>)<ul class=\" wp-block-list eplus-wrapper eplus-styles-uid-2fb6c0\">\n<li class=\" eplus-wrapper\">Simple examples \u2013 Jupyter notebooks (<a href=\"https:\/\/dbdmg.polito.it\/dbdmg_web\/wp-content\/uploads\/2025\/12\/ExampleSparkStreaming.zip\" target=\"_blank\" rel=\"noreferrer noopener\">SparkStreamingExamples.zip<\/a>)<\/li>\n<\/ul><\/li>\n<\/ul>\n\n\n<h4 class=\" wp-block-heading eplus-wrapper\" id=\"exercise\">Exercises<\/h4>\n\n\n\n<h5 class=\" wp-block-heading eplus-wrapper\">MAP REDUCE<mark style=\"background-color:rgba(0, 0, 0, 0)\" class=\"has-inline-color has-accent-2-color\"> <\/mark><\/h5>\n\n\n<ul class=\" wp-block-list eplus-wrapper eplus-styles-uid-91677a\">\n<li class=\" eplus-wrapper\">MapReduce exercises (<a rel=\"noreferrer noopener\" href=\"https:\/\/dbdmg.polito.it\/dbdmg_web\/wp-content\/uploads\/2021\/09\/01_MapReduce_Exercises_BigData_NewStyle.pdf\" target=\"_blank\">slides<\/a>)<ul class=\" wp-block-list eplus-wrapper eplus-styles-uid-d30017\">\n<li class=\" eplus-wrapper\">Solutions of Exercises 1-29 (<a rel=\"noreferrer noopener\" href=\"https:\/\/dbdmg.polito.it\/dbdmg_web\/wp-content\/uploads\/2021\/09\/SolutionsExMapReduce.zip\" target=\"_blank\">SolutionsExMapReduce.zip<\/a>)<\/li>\n<\/ul><\/li>\n\n\n\n<li class=\" eplus-wrapper\">How to Write and Compile your Java Application using VSCode (<a rel=\"noreferrer noopener\" href=\"https:\/\/dbdmg.polito.it\/dbdmg_web\/wp-content\/uploads\/2023\/10\/BigData_labs-VSCode_guide.pdf\" data-type=\"link\" 
data-id=\"https:\/\/dbdmg.polito.it\/dbdmg_web\/wp-content\/uploads\/2023\/10\/BigData_labs-VSCode_guide.pdf\" target=\"_blank\">pdf<\/a>)<\/li>\n\n\n\n<li class=\" eplus-wrapper\">Linux or Mac: Basic project for MapReduce applications (based on Maven) (<a rel=\"noreferrer noopener\" href=\"https:\/\/dbdmg.polito.it\/dbdmg_web\/wp-content\/uploads\/2021\/09\/MapReduceBasicProject.zip\" target=\"_blank\">MapReduceBasicProject.zip<\/a>)<\/li>\n\n\n\n<li class=\" eplus-wrapper\">Windows: Basic project for MapReduce applications (based on Maven) (<a rel=\"noreferrer noopener\" href=\"https:\/\/dbdmg.polito.it\/dbdmg_web\/wp-content\/uploads\/2021\/09\/MapReduceBasicProjectWindows.zip\" target=\"_blank\">MapReduceBasicProjectWindows.zip<\/a>)\n\n<ul class=\" wp-block-list eplus-wrapper eplus-styles-uid-d05fd1\">\n<li class=\" eplus-wrapper\">How to configure the Windows environment to run MapReduce applications locally on your PC (<a rel=\"noreferrer noopener\" href=\"https:\/\/dbdmg.polito.it\/dbdmg_web\/wp-content\/uploads\/2021\/09\/ConfigureWindowsEnviroment.pdf\" target=\"_blank\">ConfigureWindowsEnviroment.pdf<\/a>)<\/li>\n<\/ul>\n\n<ul class=\" wp-block-list eplus-wrapper eplus-styles-uid-f6f4ec\">\n<li class=\" eplus-wrapper\"><strong>You must also install<\/strong> <strong>JDK 1.8<\/strong> and select it for the imported project inside the IDE. 
If you have already installed the JDK environment but the version is greater than JDK 1.8, you must also install<strong> JDK 1.8<\/strong>.<\/li>\n\n\n\n<li class=\" eplus-wrapper\">Winutils executable (<a rel=\"noreferrer noopener\" href=\"https:\/\/dbdmg.polito.it\/dbdmg_web\/wp-content\/uploads\/2021\/09\/winutils.zip\" target=\"_blank\">winutils.zip<\/a>)<\/li>\n<\/ul><\/li>\n\n\n<li class=\" eplus-wrapper eplus-styles-uid-2689a9\">If you use your PC to write and run your code locally, use the projects based on Maven (those projects can be run locally).<\/li>\n\n<li class=\" eplus-wrapper eplus-styles-uid-2689a9\">If you use the PC available in the LAB, import the projects with libraries as reported in the first lab (those projects cannot be run locally, but only on the cluster by exporting the project jar file).<\/li><\/ul>\n\n\n<h5 class=\" wp-block-heading eplus-wrapper\">SPARK<\/h5>\n\n\n<ul class=\"eplus-ce5B3z wp-block-list eplus-wrapper eplus-styles-uid-b64d15\">\n<li class=\" eplus-wrapper\">Spark RDD-based exercises (<a href=\"https:\/\/dbdmg.polito.it\/dbdmg_web\/wp-content\/uploads\/2024\/10\/02_Spark_Exercises_BigDataNB.pdf\" target=\"_blank\" rel=\"noreferrer noopener\">pdf<\/a>)<ul class=\" wp-block-list eplus-wrapper eplus-styles-uid-5e9efd\">\n<li class=\" eplus-wrapper\">Example data \u2013 One folder with (few) data for each exercise (<a href=\"https:\/\/dbdmg.polito.it\/dbdmg_web\/wp-content\/uploads\/2024\/10\/ExSparkData30_46.zip\" target=\"_blank\" rel=\"noreferrer noopener\">ExSparkData.zip<\/a>)<\/li>\n\n\n\n<li class=\" eplus-wrapper\">RDD-based solutions of Exercises 30-46 \u2013 Jupyter notebooks (<a href=\"https:\/\/dbdmg.polito.it\/dbdmg_web\/wp-content\/uploads\/2022\/03\/SolutionsExSpark30_46.zip\" target=\"_blank\" rel=\"noreferrer noopener\">SparkNotebooksSol30_46.zip<\/a>)<ul class=\" wp-block-list eplus-wrapper eplus-styles-uid-3dfe8c\">\n<li class=\" eplus-wrapper\">Solution of Exercise 44 based on Left Outer Join (<a 
href=\"https:\/\/dbdmg.polito.it\/dbdmg_web\/wp-content\/uploads\/2024\/10\/ex44LeftOuterJoin.zip\">ex44LeftOuterJoin.zip<\/a>)<\/li>\n\n\n\n<li class=\" eplus-wrapper\">Solution of Exercise 46 based on Spark SQL APIs + RDD.groupByKey() \u2013 Example to show how to create and manage \u201cstatic windows\u201d with Spark SQL APIs (<a href=\"https:\/\/dbdmg.polito.it\/dbdmg_web\/wp-content\/uploads\/2024\/12\/ex46_DF.zip\" target=\"_blank\" rel=\"noreferrer noopener\">ex46_DF.zip<\/a>)<\/li>\n<\/ul><\/li>\n\n\n\n<li class=\" eplus-wrapper\">PySpark Installation Guide<ul class=\" wp-block-list eplus-wrapper eplus-styles-uid-6b41d3\">\n<li class=\" eplus-wrapper\">How to run PySpark applications on your PC or Google Colab: You can install PySpark and JupyterLab using\u00a0<strong>Conda\/Miniconda\/pip<\/strong>\u00a0(<a href=\"https:\/\/github.com\/dbdmg\/pyspark-install\" target=\"_blank\" rel=\"noreferrer noopener\">instructions here<\/a>)<\/li>\n<\/ul><\/li>\n<\/ul><\/li>\n\n\n\n<li class=\" eplus-wrapper\">Spark SQL exercises (<a href=\"https:\/\/dbdmg.polito.it\/dbdmg_web\/wp-content\/uploads\/2024\/10\/02_Spark_ExerciseSparkSQLNB.pdf\" target=\"_blank\" rel=\"noreferrer noopener\">pdf<\/a>)<ul class=\" wp-block-list eplus-wrapper eplus-styles-uid-cc8aab\">\n<li class=\" eplus-wrapper\">Example data \u2013 One folder with (few) data for each exercise (<a href=\"https:\/\/dbdmg.polito.it\/dbdmg_web\/wp-content\/uploads\/2024\/10\/ExSparkSQLData.zip\" target=\"_blank\" rel=\"noreferrer noopener\">ExSparkSQLData.zip<\/a>)<\/li>\n\n\n\n<li class=\" eplus-wrapper\">Solutions of Exercises 47-50 \u2013 Jupyter notebooks (<a href=\"https:\/\/dbdmg.polito.it\/dbdmg_web\/wp-content\/uploads\/2024\/10\/SparkNotebooksSol47_50.zip\" target=\"_blank\" rel=\"noreferrer noopener\">SparkNotebooksSol47_50.zip<\/a>)<\/li>\n\n\n\n<li class=\" eplus-wrapper\">Solution of Exercise 50_new \u2013 Jupyter notebooks (<a 
href=\"https:\/\/dbdmg.polito.it\/dbdmg_web\/wp-content\/uploads\/2024\/12\/SparkNotebooksSol50_new.zip\" target=\"_blank\" rel=\"noreferrer noopener\">SparkNotebookSol50_new.zip<\/a>)<\/li>\n<\/ul><\/li>\n\n\n\n<li class=\" eplus-wrapper\">Spark MLlib exercises (<a href=\"https:\/\/dbdmg.polito.it\/dbdmg_web\/wp-content\/uploads\/2024\/10\/03_MLlib_Exercises_BigDataNB.pdf\" target=\"_blank\" rel=\"noreferrer noopener\">pdf<\/a>)<ul class=\" wp-block-list eplus-wrapper eplus-styles-uid-f6ffd6\">\n<li class=\" eplus-wrapper\">Example data \u2013 One folder with (few) data for each exercise (<a href=\"https:\/\/dbdmg.polito.it\/dbdmg_web\/wp-content\/uploads\/2024\/10\/ExampleMLlibData.zip\" target=\"_blank\" rel=\"noreferrer noopener\">ExampleMLlibData.zip<\/a>)<\/li>\n\n\n\n<li class=\" eplus-wrapper\">Solutions of Exercise 51 (<a href=\"https:\/\/dbdmg.polito.it\/dbdmg_web\/wp-content\/uploads\/2024\/10\/SparkNotebooksSol51.zip\" target=\"_blank\" rel=\"noreferrer noopener\">SparkNotebooksSol51.zip<\/a>)<\/li>\n<\/ul><\/li>\n\n\n\n<li class=\" eplus-wrapper\">Spark streaming exercises (<a href=\"https:\/\/dbdmg.polito.it\/dbdmg_web\/wp-content\/uploads\/2024\/10\/05_SparkStreaming_Exercises_BigDataNB.pdf\" target=\"_blank\" rel=\"noreferrer noopener\">pdf<\/a>)<ul class=\" wp-block-list eplus-wrapper eplus-styles-uid-482597\">\n<li class=\" eplus-wrapper\">Example data \u2013 One folder with (few) data for each exercise (<a href=\"https:\/\/dbdmg.polito.it\/dbdmg_web\/wp-content\/uploads\/2024\/10\/ExampleSparkStreamingData.zip\" target=\"_blank\" rel=\"noreferrer noopener\">ExampleSparkStreamingData.zip<\/a>)<\/li>\n\n\n\n<li class=\" eplus-wrapper\">Solutions of Exercises 58-65 \u2013 Jupyter notebooks (<a href=\"https:\/\/dbdmg.polito.it\/dbdmg_web\/wp-content\/uploads\/2024\/10\/SparkNotebooksSol58_65.zip\" target=\"_blank\" rel=\"noreferrer noopener\">SparkNotebooksSol58_65.zip<\/a>)<\/li>\n<\/ul><\/li>\n<\/ul>\n\n\n<hr class=\" wp-block-separator 
has-css-opacity eplus-wrapper\"\/>\n\n\n\n<div style=\"height:20px\" aria-hidden=\"true\" class=\" wp-block-spacer eplus-wrapper\"><\/div>\n\n\n\n<h3 class=\" wp-block-heading eplus-wrapper\" id=\"laboratory-material\">Laboratory Material<\/h3>\n\n\n\n<p class=\" eplus-wrapper\"><strong><mark style=\"background-color:rgba(0, 0, 0, 0);color:#fd0202\" class=\"has-inline-color\">No lab activities during the first week of the course<\/mark><\/strong><\/p>\n\n\n\n<p class=\" eplus-wrapper\">Team 1: Students from A to L &#8211; Friday from 14:30 to 16:00 &#8211; LAIB1<\/p>\n\n\n\n<p class=\" eplus-wrapper\">Team 2: Students from M to Z &#8211; Friday from 16:00 to 17:30 &#8211; LAIB1<\/p>\n\n\n\n<p class=\" eplus-wrapper\"><\/p>\n\n\n\n<figure class=\" wp-block-table eplus-wrapper\"><table class=\"has-fixed-layout\"><tbody><tr><td>Problem specification and input data<\/td><td>Solution (Maven-based for Java)<\/td><\/tr><tr><td><strong>Lab 1<\/strong>: Hadoop and Map Reduce<br>Problem specification (<a href=\"https:\/\/dbdmg.polito.it\/dbdmg_web\/wp-content\/uploads\/2025\/10\/Lab1_BigData_vscode.pdf\" target=\"_blank\" rel=\"noreferrer noopener\">pdf<\/a>)<br>Basic project and small example dataset (<a href=\"https:\/\/dbdmg.polito.it\/dbdmg_web\/wp-content\/uploads\/2023\/04\/Lab1_BigData_with_libraries_vscode.zip\" target=\"_blank\" rel=\"noreferrer noopener\">Lab1_BigData_with_libraries_vscode.zip<\/a>)<br>Basic project based on Maven &#8211; Use this version to run the MapReduce application locally on your own PC (<strong><mark style=\"background-color:rgba(0, 0, 0, 0)\" class=\"has-inline-color has-accent-2-color\">DO NOT USE IT AT LAIB1<\/mark><\/strong>)<br>&#8212; Linux and macOS (<a href=\"https:\/\/dbdmg.polito.it\/dbdmg_web\/wp-content\/uploads\/2021\/10\/Lab1.zip\" target=\"_blank\" rel=\"noreferrer noopener\">Lab1.zip<\/a>)<br>&#8212; Windows (<a href=\"https:\/\/dbdmg.polito.it\/dbdmg_web\/wp-content\/uploads\/2021\/10\/Lab1Windows.zip\" target=\"_blank\" 
rel=\"noreferrer noopener\">Lab1_Windows.zip<\/a>)<br>Bigger dataset: finefoods_text.txt (<a href=\"https:\/\/www.dropbox.com\/s\/fswdiblx15mhmyo\/finefoods_text.zip?dl=0\" target=\"_blank\" rel=\"noreferrer noopener\">zip<\/a>)<\/td><td>Solution: <a href=\"https:\/\/dbdmg.polito.it\/dbdmg_web\/wp-content\/uploads\/2021\/10\/Lab1_SolBonusMvn.zip\" target=\"_blank\" rel=\"noreferrer noopener\">Bonus track Lab1_SolBonusMvn.zip<\/a><\/td><\/tr><tr><td><strong>Lab 2<\/strong>: Filter with Hadoop MapReduce and Frequency Count<br>Problem specification (<a href=\"https:\/\/dbdmg.polito.it\/dbdmg_web\/wp-content\/uploads\/2025\/10\/Lab2_BigData_vscode.pdf\">pdf<\/a>)<br>Skeleton project Hadoop &#8212; MapReduce (<a href=\"https:\/\/dbdmg.polito.it\/dbdmg_web\/wp-content\/uploads\/2023\/10\/Lab2_Skeleton_with_libraries_vscode.zip\">Lab2_Skeleton_with_libraries_vscode.zip<\/a>)<br>Basic project based on Maven &#8212; Use this version to run the MapReduce application locally on your own PC (<strong><mark style=\"background-color:rgba(0, 0, 0, 0)\" class=\"has-inline-color has-accent-2-color\">DO NOT USE IT AT LAIB1<\/mark><\/strong>)<br>&#8212; Linux and macOS (<a href=\"https:\/\/dbdmg.polito.it\/dbdmg_web\/wp-content\/uploads\/2021\/10\/Lab2_Skeleton.zip\">Lab2_Skeleton.zip<\/a>)<br>&#8212; Windows (<a href=\"https:\/\/dbdmg.polito.it\/dbdmg_web\/wp-content\/uploads\/2021\/10\/Lab2Windows_Skeleton.zip\" target=\"_blank\" rel=\"noreferrer noopener\">Lab2Windows_Skeleton.zip<\/a>)<br>Outputs of the first lab (<a href=\"https:\/\/dbdmg.polito.it\/dbdmg_web\/wp-content\/uploads\/2021\/10\/OutputFolderLab1.zip\" target=\"_blank\" rel=\"noreferrer noopener\">OutputFolderLab1.zip<\/a>). 
You can use them to test your application locally on your own PC if you are using Maven<\/td><td>Solution: <a href=\"https:\/\/dbdmg.polito.it\/dbdmg_web\/wp-content\/uploads\/2024\/10\/Lab2_SolTask1_processing_2024_2025.zip\">Task 1<\/a>, <a href=\"https:\/\/dbdmg.polito.it\/dbdmg_web\/wp-content\/uploads\/2024\/10\/Lab2_SolTask2_2024_2025.zip\">Task 2<\/a><\/td><\/tr><tr><td><strong>Lab 3<\/strong>: Frequently bought\/reviewed together with Hadoop and MapReduce<br>Problem specification (<a href=\"https:\/\/dbdmg.polito.it\/dbdmg_web\/wp-content\/uploads\/2025\/10\/Lab3_BigData_updated.pdf\">pdf<\/a>)<br>Skeleton project Hadoop &#8212; MapReduce (<a href=\"https:\/\/dbdmg.polito.it\/dbdmg_web\/wp-content\/uploads\/2023\/10\/Lab3_Skeleton_with_libraries_vscode.zip\" target=\"_blank\" rel=\"noreferrer noopener\">Lab3_Skeleton_with_libraries_vscode.zip<\/a>)<br>Sample data (<a href=\"https:\/\/dbdmg.polito.it\/dbdmg_web\/wp-content\/uploads\/2021\/10\/AmazonTransposedDataset_Sample.txt\" target=\"_blank\" rel=\"noreferrer noopener\">AmazonTransposedDataset_Sample.txt<\/a>)<br>Basic project based on Maven &#8212; Use this version of the project to run the MapReduce application locally on your own PC (<strong><mark style=\"background-color:rgba(0, 0, 0, 0)\" class=\"has-inline-color has-accent-2-color\">DO NOT USE IT AT LAIB1<\/mark><\/strong>)<br>&#8212; Linux and macOS (<a href=\"https:\/\/dbdmg.polito.it\/dbdmg_web\/wp-content\/uploads\/2021\/10\/Lab3_Skeleton.zip\" target=\"_blank\" rel=\"noreferrer noopener\">Lab3_Skeleton.zip<\/a>)<br>&#8212; Windows (<a href=\"https:\/\/dbdmg.polito.it\/dbdmg_web\/wp-content\/uploads\/2021\/10\/Lab3Windows_Skeleton.zip\" target=\"_blank\" rel=\"noreferrer noopener\">Lab3Windows_Skeleton.zip<\/a>)<\/td><td>Solution: <a href=\"https:\/\/dbdmg.polito.it\/dbdmg_web\/wp-content\/uploads\/2021\/10\/Lab3_Sol.zip\" 
target=\"_blank\" rel=\"noreferrer noopener\">Lab3_Sol.zip<\/a> &#8211; <strong><mark style=\"background-color:rgba(0, 0, 0, 0)\" class=\"has-inline-color has-accent-2-color\">The second solution MUST NOT BE USED because it is highly inefficient<\/mark><\/strong>. The second solution has been uploaded to show an inefficient solution that someone implemented in the past.<br>Comments on the three uploaded solutions (<a href=\"https:\/\/dbdmg.polito.it\/dbdmg_web\/wp-content\/uploads\/2021\/10\/Lab3_DraftSolution_BigData_NewStyle.pdf\" target=\"_blank\" rel=\"noreferrer noopener\">slides<\/a>)<\/td><\/tr><tr><td><strong>Lab 4:<\/strong> Normalized ratings for product recommendations with Hadoop MapReduce<br>Problem specification (<a href=\"https:\/\/dbdmg.polito.it\/dbdmg_web\/wp-content\/uploads\/2025\/10\/Lab4_2025_2026-1.pdf\">pdf<\/a>)<br>Skeleton project Hadoop &#8212; MapReduce (<a href=\"https:\/\/dbdmg.polito.it\/dbdmg_web\/wp-content\/uploads\/2023\/10\/Lab4_Skeleton_with_libraries_vscode.zip\" target=\"_blank\" rel=\"noreferrer noopener\">Lab4_Skeleton_with_libraries_vscode.zip<\/a>)<br>Sample file (<a href=\"https:\/\/dbdmg.polito.it\/dbdmg_web\/wp-content\/uploads\/2021\/10\/ReviewsSample.csv\" target=\"_blank\" rel=\"noreferrer noopener\">ReviewsSample.csv<\/a>)<br>Basic project based on Maven &#8212; Use this version of the project to run the MapReduce application locally on your own PC (<strong><mark style=\"background-color:rgba(0, 0, 0, 0)\" class=\"has-inline-color has-accent-2-color\">DO NOT USE IT AT LAIB1<\/mark><\/strong>)<br>&#8212; Linux and macOS (<a href=\"https:\/\/dbdmg.polito.it\/dbdmg_web\/wp-content\/uploads\/2021\/10\/Lab4_Skeleton.zip\" target=\"_blank\" rel=\"noreferrer noopener\">Lab4_Skeleton.zip<\/a>)<br>&#8212; Windows (<a href=\"https:\/\/dbdmg.polito.it\/dbdmg_web\/wp-content\/uploads\/2021\/10\/Lab4Windows_Skeleton.zip\" target=\"_blank\" rel=\"noreferrer noopener\">Lab4Windows_Skeleton.zip<\/a>)<\/td><td>Solution: <a 
href=\"https:\/\/dbdmg.polito.it\/dbdmg_web\/wp-content\/uploads\/2021\/11\/Lab4_Sol.zip\" target=\"_blank\" rel=\"noreferrer noopener\">Lab4_Sol.zip<\/a><\/td><\/tr><tr><td><strong>Lab 5<\/strong>: Filter data and compute basic statistics with Apache Spark<br>Problem specification (<a href=\"https:\/\/dbdmg.polito.it\/dbdmg_web\/wp-content\/uploads\/2025\/11\/Lab5_BigData_2025_2026.pdf\">pdf<\/a>)<br>Sample file (<a href=\"https:\/\/dbdmg.polito.it\/dbdmg_web\/wp-content\/uploads\/2021\/11\/SampleLocalFile.csv\" target=\"_blank\" rel=\"noreferrer noopener\">SampleLocalFile.csv<\/a>)<\/td><td>Solution: <a href=\"https:\/\/dbdmg.polito.it\/dbdmg_web\/wp-content\/uploads\/2025\/11\/Lab5_BigData_2025_2026_sol.zip\" target=\"_blank\" rel=\"noreferrer noopener\">Lab5_Sol.zip<\/a><\/td><\/tr><tr><td><strong>Lab 6<\/strong>: Frequently bought\/reviewed together with Apache Spark<br>Problem specification (<a href=\"https:\/\/dbdmg.polito.it\/dbdmg_web\/wp-content\/uploads\/2025\/11\/Lab6_BigData_2025_2026.pdf\" target=\"_blank\" rel=\"noreferrer noopener\">pdf<\/a>)<br>Sample file (<a href=\"https:\/\/dbdmg.polito.it\/dbdmg_web\/wp-content\/uploads\/2021\/11\/ReviewsSample.csv\" target=\"_blank\" rel=\"noreferrer noopener\">ReviewsSample.csv<\/a>)<br><br>Expected output \u2013 Task 1 (expected output if the input is Reviews.csv) (<a href=\"https:\/\/dbdmg.polito.it\/dbdmg_web\/wp-content\/uploads\/2025\/11\/outputTask1Lab6.zip\" target=\"_blank\" rel=\"noreferrer noopener\">outputTask1Lab6.zip<\/a>)<\/td><td>Solution: <a href=\"https:\/\/dbdmg.polito.it\/dbdmg_web\/wp-content\/uploads\/2025\/11\/Lab6_BD_Sol.zip\" target=\"_blank\" rel=\"noreferrer noopener\">Lab6_Sol.zip<\/a><\/td><\/tr><tr><td><strong>Lab 7<\/strong>: Bike sharing data analysis<br>Problem specification (<a href=\"https:\/\/dbdmg.polito.it\/dbdmg_web\/wp-content\/uploads\/2024\/12\/Lab7_BigData_2024_2025.pdf\">pdf<\/a>)<br>Sample data (<a 
href=\"https:\/\/dbdmg.polito.it\/dbdmg_web\/wp-content\/uploads\/2021\/11\/sampleData.zip\" target=\"_blank\" rel=\"noreferrer noopener\">sampleData.zip<\/a>)<br>Example KML file (<a href=\"https:\/\/dbdmg.polito.it\/dbdmg_web\/wp-content\/uploads\/2021\/11\/exampleKML.zip\" target=\"_blank\" rel=\"noreferrer noopener\">exampleKML.zip<\/a>)<br><br><strong>Expected output<\/strong><br>&#8212; Execution on sample data (sampleData\/registerSample.csv and sampleData\/stations.csv) and minimum criticality threshold = 0.4 (<a href=\"https:\/\/dbdmg.polito.it\/dbdmg_web\/wp-content\/uploads\/2021\/11\/resSampleData0.4-1.txt\" target=\"_blank\" rel=\"noreferrer noopener\">part-00000<\/a>)<br>&#8212; Execution on complete data (\/share\/students\/bigdata\/Dati\/Lab7\/datiCompleti\/register.csv and \/share\/students\/bigdata\/Dati\/Lab7\/datiCompleti\/\/stations.csv) and minimum criticality threshold = 0.6 (<a href=\"https:\/\/dbdmg.polito.it\/dbdmg_web\/wp-content\/uploads\/2021\/11\/resAllData0.6-1.txt\" target=\"_blank\" rel=\"noreferrer noopener\">part-00000<\/a>)<\/td><td>Guidance slides: <a href=\"https:\/\/dbdmg.polito.it\/dbdmg_web\/wp-content\/uploads\/2025\/11\/Lab-7-\u2013-Critical-Timeslot-Detection-.pptx\" data-type=\"attachment\" data-id=\"13168\">Lab7 &#8211; Critical Timeslot detection<\/a><br>Solution: <a href=\"https:\/\/dbdmg.polito.it\/dbdmg_web\/wp-content\/uploads\/2022\/04\/Lab7_DBD_Sol.zip\">Lab7_Sol.zip<\/a><\/td><\/tr><tr><td><strong>Lab 8<\/strong>: Bike sharing data analysis based on Spark SQL<br>Problem specification (<a href=\"https:\/\/dbdmg.polito.it\/dbdmg_web\/wp-content\/uploads\/2025\/11\/Lab8_BigData_2025_2026.pdf\">pdf<\/a>)<br>Sample data (<a href=\"https:\/\/dbdmg.polito.it\/dbdmg_web\/wp-content\/uploads\/2021\/11\/sampleData.zip\" target=\"_blank\" rel=\"noreferrer noopener\">sampleData.zip<\/a>)<\/td><td>Solution: <a 
href=\"https:\/\/dbdmg.polito.it\/dbdmg_web\/wp-content\/uploads\/2022\/04\/Lab8_DBD_Sol.zip\">Lab8_Sol.zip<\/a><\/td><\/tr><tr><td><strong>Lab 9<\/strong>: A classification pipeline with MLlib + Spark SQL<br>Problem specification (<a href=\"https:\/\/dbdmg.polito.it\/dbdmg_web\/wp-content\/uploads\/2025\/12\/Lab9_BigData_2025_2026.pdf\" target=\"_blank\" rel=\"noreferrer noopener\">pdf<\/a>)<br>Sample file with 100 reviews (<a href=\"https:\/\/dbdmg.polito.it\/dbdmg_web\/wp-content\/uploads\/2021\/12\/ReviewsSample.csv\" target=\"_blank\" rel=\"noreferrer noopener\">ReviewsSample.csv<\/a>)<br>Template (<a href=\"https:\/\/dbdmg.polito.it\/dbdmg_web\/wp-content\/uploads\/2025\/12\/Lab9_template25_26.zip\">Lab9_template.zip<\/a>)<br>Bigger file with all reviews (<a href=\"http:\/\/dbdmg.polito.it\/Reviews.zip\">Reviews.zip<\/a>)<\/td><td>Solution: <a href=\"https:\/\/dbdmg.polito.it\/dbdmg_web\/wp-content\/uploads\/2025\/12\/Lab9_BD_Sol.zip\" target=\"_blank\" rel=\"noreferrer noopener\">Lab9_Sol.zip<\/a><\/td><\/tr><tr><td><strong>Lab 10<\/strong>: Exam simulation<br>Select the Moodle &#8220;activity&#8221; <strong>Big data processing and analytics &#8211; Exam Simulation &#8211; December 16, 2025<\/strong>, and try to answer all questions. The content of this exam simulation is taken from an exam of the previous year.<\/td><td>This is the text of the February 21, 2025, exam. 
The solution is available in the Previous exam examples section of this course web page.<br>&#8211; February 21, 2025: <a href=\"https:\/\/dbdmg.polito.it\/dbdmg_web\/wp-content\/uploads\/2025\/03\/BigData_Exam_20250221.pdf\" target=\"_blank\" rel=\"noreferrer noopener\">pdf<\/a><br>&#8211; Question 1: (b)<br>&#8211; Question 2: (a)<br>&#8211; Source code of the solution: <a href=\"https:\/\/dbdmg.polito.it\/dbdmg_web\/wp-content\/uploads\/2025\/06\/DraftSolutionExams202502021.zip\" target=\"_blank\" rel=\"noreferrer noopener\">zip<\/a><\/td><\/tr><\/tbody><\/table><\/figure>\n\n\n\n<hr class=\" wp-block-separator has-css-opacity eplus-wrapper\"\/>\n\n\n\n<h3 class=\" wp-block-heading eplus-wrapper\" id=\"laboratory-material\">Previous exam examples<\/h3>\n\n\n\n<p class=\" eplus-wrapper\">The Spark solutions of some past exams are still written in Java. Apart from the syntax, they rely on the same Spark methods and workflows, so the solution workflow does not depend on the programming language.<\/p>\n\n\n\n<figure class=\" wp-block-table eplus-wrapper\"><table><tbody><tr><td><strong>Exams<\/strong><\/td><td><strong>Solutions<\/strong><\/td><\/tr><tr><td>Spark Streaming \u2013 Examples of multiple choice questions (<a href=\"https:\/\/dbdmg.polito.it\/dbdmg_web\/wp-content\/uploads\/2024\/12\/ExamplesMultipleChoiceQuestions_Python.pdf\" target=\"_blank\" rel=\"noreferrer noopener\">pdf<\/a>) <\/td><td>Question 1: (c)<br>Question 2: (d)<br>Question 3: (b)<\/td><\/tr><tr><td>February 10, 2026: <a href=\"https:\/\/dbdmg.polito.it\/dbdmg_web\/wp-content\/uploads\/2026\/02\/BigData_Exam_20260210.pdf\">pdf<\/a><\/td><td>Question 1: (a)<br>Question 2: (a)<br>Source code: <a href=\"https:\/\/dbdmg.polito.it\/dbdmg_web\/wp-content\/uploads\/2026\/02\/DraftSolution20260210.zip\" target=\"_blank\" rel=\"noreferrer noopener\">zip<\/a><\/td><\/tr><tr><td>January 27, 2026: <a 
href=\"https:\/\/dbdmg.polito.it\/dbdmg_web\/wp-content\/uploads\/2026\/02\/BigData_Exam_20260127_v2.pdf\" target=\"_blank\" rel=\"noreferrer noopener\">pdf<\/a><\/td><td>Question 1: (c) &#8211; Both Application B and Application C return the number of elements in each window<br>Question 2: (a) &#8211; There is one instance of the mapper class for each input block. There are two files, each smaller than the block\/chunk size, so there are two input blocks and two instances of the mapper class.<br>Source code: <a href=\"https:\/\/dbdmg.polito.it\/dbdmg_web\/wp-content\/uploads\/2026\/02\/DraftSolutionExam20260127.zip\" target=\"_blank\" rel=\"noreferrer noopener\">zip<\/a><\/td><\/tr><tr><td>June 12, 2025: <a href=\"https:\/\/dbdmg.polito.it\/dbdmg_web\/wp-content\/uploads\/2025\/06\/BigData_Exam_20250612_v3NoSol.pdf\" target=\"_blank\" rel=\"noreferrer noopener\">pdf<\/a><\/td><td>Question 1: (b)<br>Question 2: (c)<br>Source code: <a href=\"https:\/\/dbdmg.polito.it\/dbdmg_web\/wp-content\/uploads\/2025\/06\/20250612_BigDataExam_DraftSolution.zip\">zip<\/a><\/td><\/tr><tr><td>February 21, 2025: <a href=\"https:\/\/dbdmg.polito.it\/dbdmg_web\/wp-content\/uploads\/2025\/03\/BigData_Exam_20250221.pdf\" target=\"_blank\" rel=\"noreferrer noopener\">pdf<\/a><\/td><td>Question 1: (b)<br>Question 2: (a)<br>Source code: <a href=\"https:\/\/dbdmg.polito.it\/dbdmg_web\/wp-content\/uploads\/2025\/06\/DraftSolutionExams202502021.zip\" target=\"_blank\" rel=\"noreferrer noopener\">zip<\/a><br><strong>A solution based on Spark SQL is available for this exam<\/strong> &#8211; Uploaded on December 13, 2025<\/td><\/tr><tr><td>February 4, 2025: <a href=\"https:\/\/dbdmg.polito.it\/dbdmg_web\/wp-content\/uploads\/2025\/02\/BigData_Exam_20250204_v3.pdf\" target=\"_blank\" rel=\"noreferrer noopener\">pdf<\/a><\/td><td>Question 1: (b)<br>Question 2: (a)<br>Source code: <a 
href=\"https:\/\/dbdmg.polito.it\/dbdmg_web\/wp-content\/uploads\/2025\/02\/20250204_BigDataExam_DraftSolution.zip\">zip<\/a><br><strong>A solution based on Spark SQL is available for this exam<\/strong> &#8211; Uploaded on December 13, 2025<\/td><\/tr><tr><td>September 12, 2024: <a href=\"https:\/\/dbdmg.polito.it\/dbdmg_web\/wp-content\/uploads\/2024\/11\/BigData_Exam_20240912_v2.pdf\" target=\"_blank\" rel=\"noreferrer noopener\">pdf<\/a><\/td><td>Question 1: <strong>(c) &#8211; Each file is read 3 times<\/strong><br>Question 2: (a)<br>Source code: <a href=\"https:\/\/dbdmg.polito.it\/dbdmg_web\/wp-content\/uploads\/2024\/11\/20240912.zip\" target=\"_blank\" rel=\"noreferrer noopener\">zip<\/a> &#8211; <strong>Spark &#8211; Python-based solution<\/strong><\/td><\/tr><tr><td>June 20, 2024: <a href=\"https:\/\/dbdmg.polito.it\/dbdmg_web\/wp-content\/uploads\/2024\/06\/BigData_Exam_20240620.pdf\" target=\"_blank\" rel=\"noreferrer noopener\">pdf<\/a><\/td><td>Question 1: (b)<br>Question 2: (a)<br>Source code: <a href=\"https:\/\/dbdmg.polito.it\/dbdmg_web\/wp-content\/uploads\/2024\/11\/DraftSolPythonExam20240620.zip\" target=\"_blank\" rel=\"noreferrer noopener\">zip<\/a> &#8211; <strong>Spark &#8211; Python-based solution<\/strong><\/td><\/tr><tr><td>February 19, 2024: <a href=\"https:\/\/dbdmg.polito.it\/dbdmg_web\/wp-content\/uploads\/2024\/02\/BigData_Exam_2024-02-19.pdf\" target=\"_blank\" rel=\"noreferrer noopener\">pdf<\/a><\/td><td>Question 1: (a)<br>Question 2: (b)<br>Source code: <a href=\"https:\/\/dbdmg.polito.it\/dbdmg_web\/wp-content\/uploads\/2024\/11\/DraftSolPythonExam20240219.zip\" target=\"_blank\" rel=\"noreferrer noopener\">zip<\/a> &#8211; <strong>Spark &#8211; Python-based solution<\/strong><br><strong>A solution based on Spark SQL is available for this exam<\/strong><br><strong>A second solution, based on a single job, has been uploaded (HadoopV2) &#8211; Updated on January 9, 2026<\/strong><\/td><\/tr><tr><td>February 5, 2024: <a 
href=\"https:\/\/dbdmg.polito.it\/dbdmg_web\/wp-content\/uploads\/2024\/02\/BigData_Exam_2024-02-05.pdf\" target=\"_blank\" rel=\"noreferrer noopener\">pdf<\/a><\/td><td>Question 1: (a)<br>Question 2: (b)<br>Source code: <a href=\"https:\/\/dbdmg.polito.it\/dbdmg_web\/wp-content\/uploads\/2024\/11\/DraftSolPythonExam20240205.zip\" target=\"_blank\" rel=\"noreferrer noopener\">zip<\/a> &#8211; <strong>Spark &#8211; Python-based solution<\/strong> <br><strong>A solution based on Spark SQL is available for this exam<\/strong><\/td><\/tr><tr><td>September 21, 2023: <a href=\"https:\/\/dbdmg.polito.it\/dbdmg_web\/wp-content\/uploads\/2024\/01\/BigData_Exam_20230921.pdf\" target=\"_blank\" rel=\"noreferrer noopener\">pdf<\/a><\/td><td>Question 1: (a)<br>Question 2: (b)<br>Source code: <a href=\"https:\/\/dbdmg.polito.it\/dbdmg_web\/wp-content\/uploads\/2025\/01\/DraftSolPythonExam20230921.zip\" target=\"_blank\" rel=\"noreferrer noopener\">zip<\/a> &#8211; <strong>Spark &#8211; Python-based solution<\/strong><\/td><\/tr><tr><td>June 21, 2023: <a href=\"https:\/\/dbdmg.polito.it\/dbdmg_web\/wp-content\/uploads\/2024\/01\/BigData_Exam_20230621.pdf\" target=\"_blank\" rel=\"noreferrer noopener\">pdf<\/a><\/td><td>Question 1: (c)<br>Question 2: (d)<br>Source code: <a href=\"https:\/\/dbdmg.polito.it\/dbdmg_web\/wp-content\/uploads\/2024\/12\/DraftSolPythonExam20230621.zip\" target=\"_blank\" rel=\"noreferrer noopener\">zip<\/a> &#8211; <strong>Spark &#8211; Python-based solution<\/strong><\/td><\/tr><tr><td>February 15, 2023: <a href=\"https:\/\/dbdmg.polito.it\/dbdmg_web\/wp-content\/uploads\/2023\/06\/BDP_Exam20230215.pdf\" target=\"_blank\" rel=\"noreferrer noopener\">pdf<\/a><\/td><td>Question 1: (c)<br>Question 2: (b)<br>Source code: <a href=\"https:\/\/dbdmg.polito.it\/dbdmg_web\/wp-content\/uploads\/2024\/11\/DraftSolPythonExam20230215.zip\" target=\"_blank\" rel=\"noreferrer noopener\">zip<\/a> &#8211; <strong>Spark &#8211; Python-based solution<\/strong> 
<br><strong>A solution based on Spark SQL is available for this exam<\/strong><\/td><\/tr><tr><td>February 2, 2023: <a rel=\"noreferrer noopener\" href=\"https:\/\/dbdmg.polito.it\/dbdmg_web\/wp-content\/uploads\/2023\/06\/BDP_Exam20230202_finale.pdf\" target=\"_blank\">pdf<\/a><\/td><td>Question 1: (d)<br>Question 2: (d)<br>Source code: <a href=\"https:\/\/dbdmg.polito.it\/dbdmg_web\/wp-content\/uploads\/2024\/12\/DraftSolPythonExam20230202.zip\" target=\"_blank\" rel=\"noreferrer noopener\">zip<\/a> &#8211; <strong>Spark &#8211; Python-based solution<\/strong> <br><strong>A solution based on Spark SQL is available for this exam<\/strong><\/td><\/tr><tr><td>September 6, 2022: <a rel=\"noreferrer noopener\" href=\"https:\/\/dbdmg.polito.it\/dbdmg_web\/wp-content\/uploads\/2022\/09\/BD_Exam20220906.pdf\" target=\"_blank\">pdf<\/a><\/td><td>Question 1: (c)<br>Question 2: (d)<br>Source code: <a href=\"https:\/\/dbdmg.polito.it\/dbdmg_web\/wp-content\/uploads\/2024\/12\/DraftSolPythonExam20220906.zip\">zip<\/a> &#8211; <strong>Spark &#8211; Python-based solution<\/strong> <br><strong>A solution based on Spark SQL is available for this exam<\/strong><\/td><\/tr><tr><td>July 4, 2022: <a href=\"https:\/\/dbdmg.polito.it\/dbdmg_web\/wp-content\/uploads\/2022\/07\/BD_Exam20220704.pdf\" target=\"_blank\" rel=\"noreferrer noopener\">pdf<\/a><\/td><td>Question 1: (c)<br>Question 2: (d)<br>Source code: <a href=\"https:\/\/dbdmg.polito.it\/dbdmg_web\/wp-content\/uploads\/2024\/12\/DraftSolPythonExam20220704.zip\">zip<\/a> &#8211; <strong>Spark &#8211; Python-based solution<\/strong> <\/td><\/tr><tr><td>February 21, 2022: <a href=\"https:\/\/dbdmg.polito.it\/dbdmg_web\/wp-content\/uploads\/2022\/02\/BD_Exam20220221_v1.pdf\" target=\"_blank\" rel=\"noreferrer noopener\">pdf<\/a><\/td><td>Question 1: (d)<br>Question 2: (b)<br>Source code: <a href=\"https:\/\/dbdmg.polito.it\/dbdmg_web\/wp-content\/uploads\/2024\/12\/DraftSolPythonExam20220221.zip\" target=\"_blank\" 
rel=\"noreferrer noopener\">zip<\/a> &#8211; <strong>Spark &#8211; Python-based solution<\/strong> <br><strong>A solution based on Spark SQL is available for this exam<\/strong><\/td><\/tr><tr><td>February 2, 2022: <a rel=\"noreferrer noopener\" href=\"https:\/\/dbdmg.polito.it\/dbdmg_web\/wp-content\/uploads\/2022\/02\/BD_Exam20220202_v1.pdf\" target=\"_blank\">pdf<\/a><\/td><td>Question 1: (b)<br>Question 2: (d)<br>Source code: <a href=\"https:\/\/dbdmg.polito.it\/dbdmg_web\/wp-content\/uploads\/2024\/12\/DraftSolPythonExam20220202.zip\" target=\"_blank\" rel=\"noreferrer noopener\">zip<\/a> &#8211; <strong>Spark &#8211; Python-based solution<\/strong> <\/td><\/tr><tr><td>June 30, 2021: <a rel=\"noreferrer noopener\" href=\"https:\/\/dbdmg.polito.it\/dbdmg_web\/wp-content\/uploads\/2021\/12\/BD_Exam20210630.pdf\" target=\"_blank\">pdf<\/a><\/td><td>Question 1: (a)<br>Question 2: (c)<br>Source code: <a href=\"https:\/\/dbdmg.polito.it\/wordpress\/wp-content\/uploads\/2021\/07\/DraftSolutionExam20210630.zip\">zip<\/a><\/td><\/tr><tr><td>February 5, 2021: <a href=\"https:\/\/dbdmg.polito.it\/dbdmg_web\/wp-content\/uploads\/2023\/10\/BD_Exam20210205.pdf\" target=\"_blank\" rel=\"noreferrer noopener\">pdf<\/a><\/td><td>Question 1: (b)<br>Question 2: (c)<br>Source code: <a href=\"https:\/\/dbdmg.polito.it\/wordpress\/wp-content\/uploads\/2021\/02\/DraftSolutionExam20210205.zip\">zip<\/a><\/td><\/tr><tr><td>September 17, 2020: <a href=\"https:\/\/dbdmg.polito.it\/wordpress\/wp-content\/uploads\/2020\/09\/BD_Exam20200917.pdf\" target=\"_blank\" rel=\"noreferrer noopener\">pdf<\/a><\/td><td>Question 1: (d)<br>Question 2: (c)<br>Source code: <a href=\"https:\/\/dbdmg.polito.it\/wordpress\/wp-content\/uploads\/2020\/09\/DraftSolutionExam20120917.zip\" target=\"_blank\" rel=\"noreferrer noopener\">zip<\/a><\/td><\/tr><tr><td>July 16, 2020: <a href=\"https:\/\/dbdmg.polito.it\/wordpress\/wp-content\/uploads\/2020\/07\/BD_Exam20200716.pdf\" target=\"_blank\" rel=\"noreferrer 
noopener\">pdf<\/a><\/td><td>Question 1: (b)<br>Question 2: (b) \u2013 Note that there are two actions; hence, the input file is read twice.<br>Source code: <a href=\"https:\/\/dbdmg.polito.it\/wordpress\/wp-content\/uploads\/2020\/07\/DraftSolutionExam20120716.zip\" target=\"_blank\" rel=\"noreferrer noopener\">zip<\/a><\/td><\/tr><tr><td>July 2, 2020: <a href=\"https:\/\/dbdmg.polito.it\/wordpress\/wp-content\/uploads\/2020\/07\/BD_Exam20200702.pdf\" target=\"_blank\" rel=\"noreferrer noopener\">pdf<\/a><\/td><td>Question 1: (b)<br>Question 2: (a)<br>Source code: <a href=\"https:\/\/dbdmg.polito.it\/wordpress\/wp-content\/uploads\/2020\/07\/DraftSolutionExam20120702.zip\" target=\"_blank\" rel=\"noreferrer noopener\">zip<\/a><\/td><\/tr><tr><td>July 18, 2019: <a href=\"https:\/\/dbdmg.polito.it\/wordpress\/wp-content\/uploads\/2019\/07\/Exam20190718_v1.pdf\" target=\"_blank\" rel=\"noreferrer noopener\">pdf<\/a><\/td><td>Question 1: (b)<br>Question 2: (b)<br>Source code: <a href=\"https:\/\/dbdmg.polito.it\/wordpress\/wp-content\/uploads\/2019\/07\/DraftSolutionExam20190718_v1.zip\" target=\"_blank\" rel=\"noreferrer noopener\">zip<\/a><\/td><\/tr><tr><td>July 2, 2019: <a href=\"https:\/\/dbdmg.polito.it\/wordpress\/wp-content\/uploads\/2019\/07\/Exam20190702_v1.pdf\" target=\"_blank\" rel=\"noreferrer noopener\">pdf<\/a><\/td><td>Question 1: (a)<br>Question 2: (b)<br>Source code: <a href=\"https:\/\/dbdmg.polito.it\/wordpress\/wp-content\/uploads\/2019\/07\/BozzaSoluzionev1_20190702.zip\" target=\"_blank\" rel=\"noreferrer noopener\">zip<\/a><\/td><\/tr><tr><td>February 15, 2019: <a href=\"https:\/\/dbdmg.polito.it\/wordpress\/wp-content\/uploads\/2019\/03\/Exam20190215_v1.pdf\" target=\"_blank\" rel=\"noreferrer noopener\">pdf<\/a><\/td><td>Question 1: (d)<br>Question 2: (c)<br>Source code: <a href=\"https:\/\/dbdmg.polito.it\/wordpress\/wp-content\/uploads\/2019\/06\/DraftSolutionv1_20190215.zip\" target=\"_blank\" rel=\"noreferrer 
noopener\">zip<\/a><\/td><\/tr><tr><td>September 3, 2018: <a href=\"https:\/\/dbdmg.polito.it\/wordpress\/wp-content\/uploads\/2018\/09\/Exam20180903_v1.pdf\" target=\"_blank\" rel=\"noreferrer noopener\">pdf<\/a><\/td><td>Question 1: (d)<br>Question 2: (c)<br>Source code: <a href=\"https:\/\/dbdmg.polito.it\/dbdmg_web\/wp-content\/uploads\/2024\/12\/DraftSolPythonExam20180903.zip\" target=\"_blank\" rel=\"noreferrer noopener\">zip<\/a> &#8211; <strong>Spark &#8211; Python-based solution<\/strong> <br><strong>A solution based on Spark SQL is available for this exam<\/strong><br>Example to show how to create and manage <strong>\u201cstatic windows\u201d with only Spark SQL APIs<\/strong> <\/td><\/tr><tr><td>July 16, 2018: <a href=\"https:\/\/dbdmg.polito.it\/wordpress\/wp-content\/uploads\/2018\/07\/Exam20180716_v1.pdf\" target=\"_blank\" rel=\"noreferrer noopener\">pdf<\/a><br><\/td><td>Question 1: (d)<br>Question 2: (a)<br>Source code: <a href=\"https:\/\/dbdmg.polito.it\/wordpress\/wp-content\/uploads\/2018\/07\/DraftSolutionv1_20180716.zip\" target=\"_blank\" rel=\"noreferrer noopener\">zip<\/a><\/td><\/tr><tr><td>June 26, 2018: <a href=\"https:\/\/dbdmg.polito.it\/wordpress\/wp-content\/uploads\/2018\/06\/Exam20180626_v1.pdf\" target=\"_blank\" rel=\"noreferrer noopener\">pdf<\/a><\/td><td>Question 1: (c)<br>Question 2: (c)<br>Source code: <a href=\"https:\/\/dbdmg.polito.it\/wordpress\/wp-content\/uploads\/2018\/06\/DraftSolutionv1.zip\" target=\"_blank\" rel=\"noreferrer noopener\">zip<\/a><\/td><\/tr><\/tbody><\/table><\/figure>\n\n\n\n<hr class=\" wp-block-separator has-alpha-channel-opacity eplus-wrapper\"\/>\n\n\n\n<h3 class=\" wp-block-heading eplus-wrapper\" id=\"additional-material\">Additional material<\/h3>\n\n\n<ul class=\" wp-block-list eplus-wrapper eplus-styles-uid-bb009d\">\n<li class=\" eplus-wrapper\">Slides and screencasts about Java (kindly provided by Prof. 
Torchiano) (<a href=\"http:\/\/dbdmg.polito.it\/~paolo\/JavaMaterials\/02JEY%20-%20Object%20Oriented%20Programming.html\">link<\/a>)<ul class=\" wp-block-list eplus-wrapper eplus-styles-uid-8ac08f\">\n<li class=\" eplus-wrapper\">Suggested slides\/lectures for those students who have never used Java<ul class=\" wp-block-list eplus-wrapper eplus-styles-uid-c1ce2d\">\n<li class=\" eplus-wrapper\">OO Paradigm and UML (The UML part can be skipped)<\/li>\n\n\n\n<li class=\" eplus-wrapper\">The Java Environment<\/li>\n\n\n\n<li class=\" eplus-wrapper\">Java Basic Features<\/li>\n\n\n\n<li class=\" eplus-wrapper\">Java Inheritance<\/li>\n<\/ul><\/li>\n<\/ul><\/li>\n<\/ul>\n\n\n<div class=\"wp-block-buttons eplus-wrapper is-layout-flex wp-block-buttons-is-layout-flex\"><\/div>\n","protected":false},"excerpt":{"rendered":"<p>General Information SSD: ING-INF\/05 CFU: 6 Professor: Paolo Garza Teaching Assistants: Lorenzo Vaiani and Etibar Vazirov Announcements The first lecture is scheduled for Monday, September 22, 2025, at 14:30 in Room R3 (+ streaming of the virtual classroom) Teaching Material INTRODUCTION HADOOP AND MAPREDUCE SPARK Exercises MAP REDUCE SPARK Laboratory 
&hellip;<\/p>\n","protected":false},"author":5,"featured_media":4585,"comment_status":"closed","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"editor_plus_copied_stylings":"{}","footnotes":""},"categories":[37],"tags":[],"class_list":["post-12554","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-courses"],"_links":{"self":[{"href":"https:\/\/dbdmg.polito.it\/dbdmg_web\/wp-json\/wp\/v2\/posts\/12554","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/dbdmg.polito.it\/dbdmg_web\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/dbdmg.polito.it\/dbdmg_web\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/dbdmg.polito.it\/dbdmg_web\/wp-json\/wp\/v2\/users\/5"}],"replies":[{"embeddable":true,"href":"https:\/\/dbdmg.polito.it\/dbdmg_web\/wp-json\/wp\/v2\/comments?post=12554"}],"version-history":[{"count":56,"href":"https:\/\/dbdmg.polito.it\/dbdmg_web\/wp-json\/wp\/v2\/posts\/12554\/revisions"}],"predecessor-version":[{"id":13573,"href":"https:\/\/dbdmg.polito.it\/dbdmg_web\/wp-json\/wp\/v2\/posts\/12554\/revisions\/13573"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/dbdmg.polito.it\/dbdmg_web\/wp-json\/wp\/v2\/media\/4585"}],"wp:attachment":[{"href":"https:\/\/dbdmg.polito.it\/dbdmg_web\/wp-json\/wp\/v2\/media?parent=12554"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/dbdmg.polito.it\/dbdmg_web\/wp-json\/wp\/v2\/categories?post=12554"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/dbdmg.polito.it\/dbdmg_web\/wp-json\/wp\/v2\/tags?post=12554"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}