{"id":15371,"date":"2020-02-24T20:05:38","date_gmt":"2020-02-24T19:05:38","guid":{"rendered":"https:\/\/dbdmg.polito.it\/wordpress\/?page_id=15371"},"modified":"2021-02-20T19:17:33","modified_gmt":"2021-02-20T18:17:33","slug":"distributed-architectures-for-big-data-processing-and-analytics-2019-2020","status":"publish","type":"page","link":"https:\/\/dbdmg.polito.it\/wordpress\/teaching\/distributed-architectures-for-big-data-processing-and-analytics-2019-2020\/","title":{"rendered":"Distributed architectures for big data processing and analytics (2019\/2020)"},"content":{"rendered":"<h3 id=\"tinyTOC\">Table of contents<\/h3>\n<ul>\n<li><a href=\"#General-information-1\">General information<\/a><\/li>\n<li><a href=\"#Exam-rules-1\">Exam rules<\/a><\/li>\n<li><a href=\"#Announcements-1\">Announcements<\/a><\/li>\n<li><a href=\"#Slides-1\">Slides<\/a><\/li>\n<li><a href=\"#Exercises-1\">Exercises<\/a><\/li>\n<li><a href=\"#Practices-1\">Practices<\/a><\/li>\n<li><a href=\"#Exam-Examples-1\">Exam Examples<\/a><\/li>\n<li><a href=\"#Additional-material-1\">Additional material<\/a><\/li>\n<\/ul>\n<h3><span style=\"color: #ff0000;\">Please note that this is the web page for the academic year 2019\/2020<\/span><\/h3>\n<h3><span id=\"General-information-1\">General information<\/span><\/h3>\n<ul>\n<li>ECTS: 8<\/li>\n<li>Professor: <a href=\"https:\/\/dbdmg.polito.it\/wordpress\/people\/paolo-garza\/\">Paolo Garza<\/a><\/li>\n<li>Teaching assistant: Martino Trevisan<\/li>\n<\/ul>\n<h3><span id=\"Exam-rules-1\">Exam rules<\/span><\/h3>\n<ul>\n<li><strong>Exam rules Academic Year 2019-2020 &#8211; ONLINE EXAMINATION SESSION<\/strong> <strong>(<a href=\"https:\/\/dbdmg.polito.it\/wordpress\/wp-content\/uploads\/2020\/05\/ExamRulesDistributedBigData2019_20online.pdf\">pdf<\/a>)<\/strong><\/li>\n<li>Exam rules Academic Year 2019-2020 (<a
href=\"https:\/\/dbdmg.polito.it\/wordpress\/wp-content\/uploads\/2020\/02\/ExamRulesDistributedBigData2019_20.pdf\">pdf<\/a>)<\/li>\n<\/ul>\n<h3><span id=\"Announcements-1\">Announcements<\/span><\/h3>\n<ul>\n<li>(24\/02\/2020)\n<ul>\n<li><strong>No lab activities during the first week<\/strong><\/li>\n<\/ul>\n<\/li>\n<\/ul>\n<h3><span id=\"Slides-1\">Slides<\/span><\/h3>\n<ul>\n<li>Video lectures:\n<ul>\n<li>Teaching portal \/ Material\/ Virtual classroom<\/li>\n<li>or<\/li>\n<li>Teaching portal \/ Material\/ Link relativi al corso -&gt; Links to the dropbox copy of the video lectures<\/li>\n<\/ul>\n<\/li>\n<\/ul>\n<p>&nbsp;<\/p>\n<ul>\n<li>Introduction to the course content and exam rules (<a href=\"https:\/\/dbdmg.polito.it\/wordpress\/wp-content\/uploads\/2020\/02\/00_Intro_DistributedBigData_1920_2x.pdf\">2 slides per page<\/a>, <a href=\"https:\/\/dbdmg.polito.it\/wordpress\/wp-content\/uploads\/2020\/02\/00_Intro_DistributedBigData_1920_6x.pdf\">6 slides per page<\/a>) <!-- (<a href=\"http:\/\/dbdmg.polito.it\/wordpress\/wp-content\/uploads\/2017\/02\/00_Intro_BigData_2x.pdf\">2 slides per page<\/a>, <a href=\"http:\/\/dbdmg.polito.it\/wordpress\/wp-content\/uploads\/2016\/02\/00_Intro_BigData_6x.pdf\">6 slides per page<\/a>)-->\n<ul>\n<li>Question session &#8211; March 10, 2020 <\/li>\n<\/ul>\n<\/li>\n<li>Introduction to Big Data (<a href=\"https:\/\/dbdmg.polito.it\/wordpress\/wp-content\/uploads\/2020\/03\/01_Intro_BigData_BigData_2x.pdf\">2 slides per page<\/a>, <a href=\"https:\/\/dbdmg.polito.it\/wordpress\/wp-content\/uploads\/2020\/03\/01_Intro_BigData_BigData_6x.pdf\">6 slides per page<\/a>) <\/li>\n<li>Big Data Architectures (<a href=\"https:\/\/dbdmg.polito.it\/wordpress\/wp-content\/uploads\/2020\/03\/02_Architectures_BigData_2x.pdf\">2 slides per page<\/a>, <a href=\"https:\/\/dbdmg.polito.it\/wordpress\/wp-content\/uploads\/2020\/03\/02_Architectures_BigData_6x.pdf\">6 slides per page<\/a>) <\/li>\n<li>Hadoop and 
MapReduce\n<ul>\n<li>Introduction to Apache Hadoop and the MapReduce programming paradigm (<a href=\"https:\/\/dbdmg.polito.it\/wordpress\/wp-content\/uploads\/2020\/03\/03_Intro_HadoopAndMapReduce_BigData_2x.pdf\">2 slides per page<\/a>, <a href=\"https:\/\/dbdmg.polito.it\/wordpress\/wp-content\/uploads\/2020\/03\/03_Intro_HadoopAndMapReduce_BigData_6x.pdf\">6 slides per page<\/a>) \n<ul>\n<li>Interaction with HDFS and Hadoop by means of the command line (<a href=\"https:\/\/dbdmg.polito.it\/wordpress\/wp-content\/uploads\/2020\/03\/03b_HDFS_Hadoop_CommandLine_BigData_2x.pdf\">2 slides per page<\/a>, <a href=\"https:\/\/dbdmg.polito.it\/wordpress\/wp-content\/uploads\/2020\/03\/03b_HDFS_Hadoop_CommandLine_BigData_6x.pdf\">6 slides per page<\/a>)<\/li>\n<\/ul>\n<\/li>\n<li>Hadoop implementation of MapReduce (<a href=\"https:\/\/dbdmg.polito.it\/wordpress\/wp-content\/uploads\/2020\/03\/04_HadoopImplementationOfMapReduce_2x.pdf\">2 slides per page<\/a>, <a href=\"https:\/\/dbdmg.polito.it\/wordpress\/wp-content\/uploads\/2020\/03\/04_HadoopImplementationOfMapReduce_6x.pdf\">6 slides per page<\/a>) \n<ul>\n<li>Source code of the Word Count Eclipse project (<a href=\"http:\/\/dbdmg.polito.it\/wordpress\/wp-content\/uploads\/2016\/03\/WordCount.zip\">WordCount.zip<\/a>) &#8211; Use the import Maven project option to import it into Eclipse<\/li>\n<li>PDF version of the code (i.e., PDF version of the Java files) (<a href=\"http:\/\/dbdmg.polito.it\/wordpress\/wp-content\/uploads\/2016\/03\/WordCountPDF.zip\">WordCountPDF.zip<\/a>)<\/li>\n<li>BigData@Polito environment + Jupyter &#8211; How to submit MapReduce jobs on BigData@Polito (<a href=\"https:\/\/dbdmg.polito.it\/wordpress\/wp-content\/uploads\/2020\/03\/04b_ClusterJupyter_BigData_2x.pdf\">2 slides per page<\/a>, <a href=\"https:\/\/dbdmg.polito.it\/wordpress\/wp-content\/uploads\/2020\/03\/04b_ClusterJupyter_BigData_6x.pdf\">6 slides per page<\/a>)<\/li>\n<\/ul>\n<\/li>\n<li>MapReduce &#8211; Design patterns
&#8211; Part 1 (<a href=\"https:\/\/dbdmg.polito.it\/wordpress\/wp-content\/uploads\/2020\/03\/05_MapReduce_Patterns_Part1_BigData_2x.pdf\">2 slides per page<\/a>, <a href=\"https:\/\/dbdmg.polito.it\/wordpress\/wp-content\/uploads\/2020\/03\/05_MapReduce_Patterns_Part1_BigData_6x.pdf\">6 slides per page<\/a>) <\/li>\n<li>MapReduce and Hadoop &#8211; Advanced Topics: Multiple inputs, Multiple outputs, Distributed cache (<a href=\"https:\/\/dbdmg.polito.it\/wordpress\/wp-content\/uploads\/2020\/03\/06_AdvancedTopicsMapReduce_BigData_2x.pdf\">2 slides per page<\/a>, <a href=\"https:\/\/dbdmg.polito.it\/wordpress\/wp-content\/uploads\/2020\/03\/06_AdvancedTopicsMapReduce_BigData_6x.pdf\">6 slides per page<\/a>)\n<ul>\n<li><strong>Updated on April 20, 2020<\/strong> with some more details on the Distributed cache topic <\/li>\n<\/ul>\n<\/li>\n<li>MapReduce &#8211; Design patterns &#8211; Part 2 (<a href=\"https:\/\/dbdmg.polito.it\/wordpress\/wp-content\/uploads\/2020\/03\/07_MapReduce_Patterns_Part2_BigData_2x.pdf\">2 slides per page<\/a>, <a href=\"https:\/\/dbdmg.polito.it\/wordpress\/wp-content\/uploads\/2020\/03\/07_MapReduce_Patterns_Part2_BigData_6x.pdf\">6 slides per page<\/a>) <\/li>\n<li>MapReduce &#8211; Relational Algebra\/SQL operators (<a href=\"https:\/\/dbdmg.polito.it\/wordpress\/wp-content\/uploads\/2020\/03\/08_SQLOperators_BigData_2x.pdf\">2 slides per page<\/a>, <a href=\"https:\/\/dbdmg.polito.it\/wordpress\/wp-content\/uploads\/2020\/03\/08_SQLOperators_BigData_6x.pdf\">6 slides per page<\/a>) <\/li>\n<\/ul>\n<\/li>\n<li>Spark\n<ul>\n<li>Introduction to Apache Spark (<a href=\"https:\/\/dbdmg.polito.it\/wordpress\/wp-content\/uploads\/2020\/03\/10_SparkIntroduction_DistributedBigData_2x.pdf\">2 slides per page<\/a>, <a href=\"https:\/\/dbdmg.polito.it\/wordpress\/wp-content\/uploads\/2020\/03\/10_SparkIntroduction_DistributedBigData_6x.pdf\">6 slides per page<\/a>) \n<ul>\n<li>How to submit Spark applications (<a 
href=\"https:\/\/dbdmg.polito.it\/wordpress\/wp-content\/uploads\/2020\/03\/10b_SparkSubmit_DistributedBigData_2x.pdf\">2 slides per page<\/a>, <a href=\"https:\/\/dbdmg.polito.it\/wordpress\/wp-content\/uploads\/2020\/03\/10b_SparkSubmit_DistributedBigData_6x.pdf\">6 slides per page<\/a>)<\/li>\n<li>How to use Jupyter notebooks for your Spark applications (<a href=\"https:\/\/dbdmg.polito.it\/wordpress\/wp-content\/uploads\/2020\/04\/10c_JupyterNotebooks_DistributedBigData_2x.pdf\">2 slides per page<\/a>, <a href=\"https:\/\/dbdmg.polito.it\/wordpress\/wp-content\/uploads\/2020\/04\/10c_JupyterNotebooks_DistributedBigData_6x.pdf\">6 slides per page<\/a>)\n<ul>\n<li>A useful online tutorial for those who want to install and run Spark locally on their PCs (tested on Linux)\n<ul>\n<li>&#8220;How to use PySpark on your computer&#8221; by Favio V\u00e1zquez (<a href=\"https:\/\/towardsdatascience.com\/how-to-use-pyspark-on-your-computer-9c7180075617\">link<\/a>)<\/li>\n<\/ul>\n<\/li>\n<li>Some comments and hints\n<ul>\n<li>Download the following pre-built version of Spark from spark.apache.org: <span id=\"spanDownloadLink\"><a href=\"https:\/\/www.apache.org\/dyn\/closer.lua\/spark\/spark-2.4.5\/spark-2.4.5-bin-hadoop2.7.tgz\">spark-2.4.5-bin-hadoop2.7.tgz<\/a><\/span><\/li>\n<li>Make sure to install and use Python 3<\/li>\n<li>Install Java 8 (Spark 2.4.5 runs on Java 8)\n<ul>\n<li>e.g., sudo apt-get install openjdk-8-jdk<\/li>\n<\/ul>\n<\/li>\n<li>Then set the JAVA_HOME environment variable to the folder containing Java 8\n<ul>\n<li>e.g., export JAVA_HOME=&quot;\/usr\/lib\/jvm\/java-8-openjdk-amd64\/&quot;<\/li>\n<\/ul>\n<\/li>\n<\/ul>\n<\/li>\n<\/ul>\n<\/li>\n<\/ul>\n<\/li>\n<li>RDD-based programs\n<ul>\n<li>RDDs: creation, basic
transformations and actions (<a href=\"https:\/\/dbdmg.polito.it\/wordpress\/wp-content\/uploads\/2020\/03\/11_SparkRDDBasedProgramming_DistributedBigData_2x.pdf\">2 slides per page<\/a>, <a href=\"https:\/\/dbdmg.polito.it\/wordpress\/wp-content\/uploads\/2020\/03\/11_SparkRDDBasedProgramming_DistributedBigData_6x.pdf\">6 slides per page<\/a>) <\/li>\n<li>Key-value RDDs: transformations and actions on key-value RDDs (<a href=\"https:\/\/dbdmg.polito.it\/wordpress\/wp-content\/uploads\/2020\/04\/12_SparkPairRDD_DistributedBigData_2x.pdf\">2 slides per page<\/a>, <a href=\"https:\/\/dbdmg.polito.it\/wordpress\/wp-content\/uploads\/2020\/04\/12_SparkPairRDD_DistributedBigData_6x.pdf\">6 slides per page<\/a>)<\/li>\n<li>DoubleRDDs (<a href=\"https:\/\/dbdmg.polito.it\/wordpress\/wp-content\/uploads\/2020\/04\/13_SparkDoubleRDD_DistributedBigData_2x.pdf\">2 slides per page<\/a>, <a href=\"https:\/\/dbdmg.polito.it\/wordpress\/wp-content\/uploads\/2020\/04\/13_SparkDoubleRDD_DistributedBigData_6x.pdf\">6 slides per page<\/a>)<\/li>\n<li>Advanced Topics: Cache, accumulators, broadcast variables (<a href=\"https:\/\/dbdmg.polito.it\/wordpress\/wp-content\/uploads\/2020\/04\/14_SparkRDDBasedProgramming_AdvancedTopics_DistributedBigData_2x.pdf\">2 slides per page<\/a>, <a href=\"https:\/\/dbdmg.polito.it\/wordpress\/wp-content\/uploads\/2020\/04\/14_SparkRDDBasedProgramming_AdvancedTopics_DistributedBigData_6x.pdf\">6 slides per page<\/a>)<\/li>\n<li>Advanced Topics &#8211; Part II: Custom partitioners, broadcast join (<a href=\"https:\/\/dbdmg.polito.it\/wordpress\/wp-content\/uploads\/2020\/04\/15_SparkRDDBasedProgramming_AdvancedTopicsII_DistributedBigData_2x.pdf\">2 slides per page<\/a>, <a href=\"https:\/\/dbdmg.polito.it\/wordpress\/wp-content\/uploads\/2020\/04\/15_SparkRDDBasedProgramming_AdvancedTopicsII_DistributedBigData_6x.pdf\">6 slides per page<\/a>)\n<ul>\n<li>RDD partition examples (<a 
href=\"https:\/\/dbdmg.polito.it\/wordpress\/wp-content\/uploads\/2020\/04\/RDDPartitionsExamples.zip\">RDDPartitionsExamples.zip<\/a>)<\/li>\n<li>PageRank example (<a href=\"https:\/\/dbdmg.polito.it\/wordpress\/wp-content\/uploads\/2020\/04\/RDDPageRank.zip\">RDDPageRank.zip<\/a>)<\/li>\n<\/ul>\n<\/li>\n<\/ul>\n<\/li>\n<li>Spark SQL and DataFrames\n<ul>\n<li>Spark SQL (<a href=\"https:\/\/dbdmg.polito.it\/wordpress\/wp-content\/uploads\/2020\/04\/16_SparkSQL_DistributedBigData_2x.pdf\">2 slides per page<\/a>, <a href=\"https:\/\/dbdmg.polito.it\/wordpress\/wp-content\/uploads\/2020\/04\/16_SparkSQL_DistributedBigData_6x.pdf\">6 slides per page<\/a>)\n<ul>\n<li>Simple examples &#8211; Jupyter notebook (<a href=\"https:\/\/dbdmg.polito.it\/wordpress\/wp-content\/uploads\/2020\/05\/SparkSQLSimpleExamples.zip\">SparkSQLSimpleExamples.zip<\/a>)<\/li>\n<\/ul>\n<\/li>\n<li>Spark SQL &#8211; Part II (<a href=\"https:\/\/dbdmg.polito.it\/wordpress\/wp-content\/uploads\/2020\/05\/17_SparkSQL_PartII_DistributedBigData_2x.pdf\">2 slides per page<\/a>, <a href=\"https:\/\/dbdmg.polito.it\/wordpress\/wp-content\/uploads\/2020\/05\/17_SparkSQL_PartII_DistributedBigData_6x.pdf\">6 slides per page<\/a>)<\/li>\n<\/ul>\n<\/li>\n<li>Data mining and Machine learning algorithms with Spark\n<ul>\n<li>MLlib\n<ul>\n<li>Introduction and Preprocessing (<a href=\"https:\/\/dbdmg.polito.it\/wordpress\/wp-content\/uploads\/2020\/05\/18a_SparkMLlib_DistributedBigData_2x.pdf\">2 slides per page<\/a>, <a href=\"https:\/\/dbdmg.polito.it\/wordpress\/wp-content\/uploads\/2020\/05\/18a_SparkMLlib_DistributedBigData_6x.pdf\">6 slides per page<\/a>)<\/li>\n<li>Classification (<a href=\"https:\/\/dbdmg.polito.it\/wordpress\/wp-content\/uploads\/2020\/05\/18b_SparkMLlib_DistributedBigData_2x.pdf\">2 slides per page<\/a>, <a href=\"https:\/\/dbdmg.polito.it\/wordpress\/wp-content\/uploads\/2020\/05\/18b_SparkMLlib_DistributedBigData_6x.pdf\">6 slides per page<\/a>)\n<ul>\n<li>Classification examples 
&#8211; Jupyter notebooks and sample data (<a href=\"https:\/\/dbdmg.polito.it\/wordpress\/wp-content\/uploads\/2020\/05\/ExampleClassificationMLlib.zip\">ExampleClassificationMLlib.zip<\/a>)<\/li>\n<\/ul>\n<\/li>\n<li>Clustering (<a href=\"https:\/\/dbdmg.polito.it\/wordpress\/wp-content\/uploads\/2020\/05\/18c_SparkMLlib_DistributedBigData_2x.pdf\">2 slides per page<\/a>, <a href=\"https:\/\/dbdmg.polito.it\/wordpress\/wp-content\/uploads\/2020\/05\/18c_SparkMLlib_DistributedBigData_6x.pdf\">6 slides per page<\/a>)\n<ul>\n<li>Clustering example &#8211; Jupyter notebook and sample data (<a href=\"https:\/\/dbdmg.polito.it\/wordpress\/wp-content\/uploads\/2020\/05\/ExampleClusteringMLlib.zip\">ExampleClusteringMLlib.zip<\/a>)<\/li>\n<\/ul>\n<\/li>\n<li>Regression (<a href=\"https:\/\/dbdmg.polito.it\/wordpress\/wp-content\/uploads\/2020\/05\/18d_SparkMLlib_DistributedBigData_2x.pdf\">2 slides per page<\/a>, <a href=\"https:\/\/dbdmg.polito.it\/wordpress\/wp-content\/uploads\/2020\/05\/18d_SparkMLlib_DistributedBigData_6x.pdf\">6 slides per page<\/a>)\n<ul>\n<li>Regression example &#8211; Jupyter notebook and sample data (<a href=\"https:\/\/dbdmg.polito.it\/wordpress\/wp-content\/uploads\/2020\/05\/ExampleRegressionMLlib.zip\">ExampleRegressionMLlib.zip<\/a>)<\/li>\n<\/ul>\n<\/li>\n<li>Itemset and Association rule mining (<a href=\"https:\/\/dbdmg.polito.it\/wordpress\/wp-content\/uploads\/2020\/05\/18e_SparkMLlib_DistributedBigData_2x.pdf\">2 slides per page<\/a>, <a href=\"https:\/\/dbdmg.polito.it\/wordpress\/wp-content\/uploads\/2020\/05\/18e_SparkMLlib_DistributedBigData_6x.pdf\">6 slides per page<\/a>)\n<ul>\n<li>Itemset and Association rule mining example &#8211; Jupyter notebook and sample data\u00a0 (<a href=\"https:\/\/dbdmg.polito.it\/wordpress\/wp-content\/uploads\/2020\/05\/ExampleItemsetMLlib.zip\">ExampleItemsetMLlib.zip<\/a>)<\/li>\n<\/ul>\n<\/li>\n<\/ul>\n<\/li>\n<\/ul>\n<\/li>\n<li>GraphX\/GraphFrames\n<ul>\n<li>Introduction to GraphX and 
GraphFrames (<a href=\"https:\/\/dbdmg.polito.it\/wordpress\/wp-content\/uploads\/2020\/05\/19_SparkGraphFrame_PartI_DistributedBigData_2x.pdf\">2 slides per page<\/a>, <a href=\"https:\/\/dbdmg.polito.it\/wordpress\/wp-content\/uploads\/2020\/05\/19_SparkGraphFrame_PartI_DistributedBigData_6x.pdf\">6 slides per page<\/a>) <strong>Updated on May 16, 2020<\/strong><\/li>\n<li>Graph Algorithms with GraphFrames (<a href=\"https:\/\/dbdmg.polito.it\/wordpress\/wp-content\/uploads\/2020\/05\/20_SparkGraphFrame_Algorithms_DistributedBigData_2x.pdf\">2 slides per page<\/a>, <a href=\"https:\/\/dbdmg.polito.it\/wordpress\/wp-content\/uploads\/2020\/05\/20_SparkGraphFrame_Algorithms_DistributedBigData_6x.pdf\">6 slides per page<\/a>)\n<ul>\n<li>Simple example &#8211; Jupyter notebook (<a href=\"https:\/\/dbdmg.polito.it\/wordpress\/wp-content\/uploads\/2020\/05\/GraphFrameExamples.zip\">GraphFrameExamples.zip<\/a>)\n<ul>\n<li>Select the GraphFrames (Yarn) kernel to run it on jupyter.polito.it<\/li>\n<li>Run &#8220;pyspark --packages graphframes:graphframes:0.8.0-spark2.4-s_2.11&#8221; to run it locally on your PC<\/li>\n<\/ul>\n<\/li>\n<\/ul>\n<\/li>\n<\/ul>\n<\/li>\n<\/ul>\n<\/li>\n<li>Streaming data analytics\n<ul>\n<li>Spark Streaming\n<ul>\n<li>Spark Streaming (DStreams) (<a href=\"https:\/\/dbdmg.polito.it\/wordpress\/wp-content\/uploads\/2020\/05\/21_SparkStreaming_DistributedBigData_2x.pdf\">2 slides per page<\/a>, <a href=\"https:\/\/dbdmg.polito.it\/wordpress\/wp-content\/uploads\/2020\/05\/21_SparkStreaming_DistributedBigData_6x.pdf\">6 slides per page<\/a>)\n<ul>\n<li>Simple examples &#8211; Jupyter notebooks (<a href=\"https:\/\/dbdmg.polito.it\/wordpress\/wp-content\/uploads\/2020\/05\/SparkSteamingExamples.zip\">SparkSteamingExamples.zip<\/a>)<\/li>\n<\/ul>\n<\/li>\n<li>Structured Streaming (<a href=\"https:\/\/dbdmg.polito.it\/wordpress\/wp-content\/uploads\/2020\/05\/22_SparkStructuredStreaming_DistributedBigData_2x.pdf\">2 slides per page<\/a>, <a
href=\"https:\/\/dbdmg.polito.it\/wordpress\/wp-content\/uploads\/2020\/05\/22_SparkStructuredStreaming_DistributedBigData_6x.pdf\">6 slides per page<\/a>)\n<ul>\n<li>Simple examples &#8211; Jupyter notebooks (<a href=\"https:\/\/dbdmg.polito.it\/wordpress\/wp-content\/uploads\/2020\/05\/ExampleStructutedStreaming.zip\">SparkStructutedStreamingExamples.zip<\/a>)<\/li>\n<\/ul>\n<\/li>\n<\/ul>\n<\/li>\n<li>Introduction to other big stream processing frameworks: Apache Storm, Apache Flink, .. (<a href=\"https:\/\/dbdmg.polito.it\/wordpress\/wp-content\/uploads\/2020\/05\/23_StreamingFrameworks_DistributedBigData_2x.pdf\">2 slides per page<\/a>, <a href=\"https:\/\/dbdmg.polito.it\/wordpress\/wp-content\/uploads\/2020\/05\/23_StreamingFrameworks_DistributedBigData_6x.pdf\">6 slides per page<\/a>)<\/li>\n<\/ul>\n<\/li>\n<li>Relational and Non-relational databases for Big data\n<ul>\n<li>Introduction to relational and non-relational databases for Big data: Hive, HBase <strong>NOT COVERED THIS YEAR<\/strong><\/li>\n<\/ul>\n<\/li>\n<\/ul>\n<h3><span id=\"Exercises-1\">Exercises<\/span><\/h3>\n<ul>\n<li>MapReduce\n<ul>\n<li>MapReduce exercises (<a href=\"https:\/\/dbdmg.polito.it\/wordpress\/wp-content\/uploads\/2020\/03\/01_MapReduce_Exercises_BigData_2x.pdf\">2 slides per page<\/a>, <a href=\"https:\/\/dbdmg.polito.it\/wordpress\/wp-content\/uploads\/2020\/03\/01_MapReduce_Exercises_BigData_6x.pdf\">6 slides per page<\/a>)\n<ul>\n<li>Solutions of Exercises 1-12 (<a href=\"https:\/\/dbdmg.polito.it\/wordpress\/wp-content\/uploads\/2020\/03\/Solutions1_12.zip\">Solutions1_12.zip<\/a>)<\/li>\n<li>Solutions of Exercises 13-22 (<a href=\"https:\/\/dbdmg.polito.it\/wordpress\/wp-content\/uploads\/2020\/03\/Solutions13_22.zip\">Solutions13_22.zip<\/a>)<\/li>\n<li>Solutions of Exercises 23-29 (<a href=\"https:\/\/dbdmg.polito.it\/wordpress\/wp-content\/uploads\/2020\/03\/Solutions23_29.zip\">Solutions23_29.zip<\/a>)\n<ul>\n<li>Solution of Exercise 23 &#8211; Two Jobs &#8211; 
Version 2: Updated version (<a href=\"https:\/\/dbdmg.polito.it\/wordpress\/wp-content\/uploads\/2020\/04\/SolExercise23TwoJobsV2Cluster.zip\">SolExercise23TwoJobsV2Cluster.zip<\/a>). The former version does not find the cached file when it is executed on the cluster.<\/li>\n<\/ul>\n<\/li>\n<\/ul>\n<\/li>\n<li>Basic project\n<ul>\n<li>Linux and macOS\n<ul>\n<li>Basic Eclipse project for MapReduce applications (based on maven) (<a href=\"https:\/\/dbdmg.polito.it\/wordpress\/wp-content\/uploads\/2020\/03\/MapReduceBasicProject.zip\">MapReduceBasicProject.zip<\/a>)<\/li>\n<\/ul>\n<\/li>\n<li>Windows\n<ul>\n<li>Setup instructions (<a href=\"https:\/\/dbdmg.polito.it\/wordpress\/wp-content\/uploads\/2020\/03\/ConfigureWindowsEnviroment.pdf\">ConfigureWindowsEnviroment.pdf<\/a>)<\/li>\n<li>Winutils executable (<a href=\"https:\/\/dbdmg.polito.it\/wordpress\/wp-content\/uploads\/2020\/03\/winutils.zip\">winutils.zip<\/a>)<\/li>\n<li>Basic Eclipse project for MapReduce applications (based on maven) (<a href=\"https:\/\/dbdmg.polito.it\/wordpress\/wp-content\/uploads\/2020\/03\/MapReduceBasicProjectWindows.zip\">MapReduceBasicProjectWindows.zip<\/a>)<\/li>\n<\/ul>\n<\/li>\n<\/ul>\n<\/li>\n<\/ul>\n<\/li>\n<li>Spark\n<ul>\n<li>Spark RDD-, DataFrame-based exercises (<a href=\"https:\/\/dbdmg.polito.it\/wordpress\/wp-content\/uploads\/2020\/03\/02_Spark_Exercises_BigData_2x.pdf\">2 slides per page<\/a>, <a href=\"https:\/\/dbdmg.polito.it\/wordpress\/wp-content\/uploads\/2020\/03\/02_Spark_Exercises_BigData_6x.pdf\">6 slides per page<\/a>)\n<ul>\n<li>Example data &#8211; One folder with (few) data for each exercise (<a href=\"https:\/\/dbdmg.polito.it\/wordpress\/wp-content\/uploads\/2020\/04\/ExSparkData.zip\">ExSparkData.zip<\/a>)<\/li>\n<li>Solutions of Exercises 30-36 &#8211; Jupyter notebooks (<a href=\"https:\/\/dbdmg.polito.it\/wordpress\/wp-content\/uploads\/2020\/04\/SparkNotebooksSol30_36.zip\">SparkNotebooksSol30_36.zip<\/a>)<\/li>\n<li>Solutions of Exercises 37-42 
&#8211; Jupyter notebooks (<a href=\"https:\/\/dbdmg.polito.it\/wordpress\/wp-content\/uploads\/2020\/04\/SparkNotebooksSol37_42.zip\">SparkNotebooksSol37_42.zip<\/a>)<\/li>\n<li>Solutions of Exercises 43-46 &#8211; Jupyter notebooks (<a href=\"https:\/\/dbdmg.polito.it\/wordpress\/wp-content\/uploads\/2020\/05\/SparkNotebooksSol43_46.zip\">SparkNotebooksSol43_46.zip<\/a>)<\/li>\n<li>Spark SQL-based Solutions\n<ul>\n<li>Exercises 37-38 &#8211; Spark SQL-based solutions &#8211; Jupyter notebooks (<a href=\"https:\/\/dbdmg.polito.it\/wordpress\/wp-content\/uploads\/2020\/05\/SparkNotebooksSol37_38DataframeSQL.zip\">SparkNotebooksSol37_38DataframeSQL.zip<\/a>)<\/li>\n<\/ul>\n<\/li>\n<\/ul>\n<\/li>\n<li>Spark SQL exercises (<a href=\"https:\/\/dbdmg.polito.it\/wordpress\/wp-content\/uploads\/2020\/05\/02_Spark_ExerciseSparkSQL_2x.pdf\">2 slides per page<\/a>, <a href=\"https:\/\/dbdmg.polito.it\/wordpress\/wp-content\/uploads\/2020\/05\/02_Spark_ExerciseSparkSQL_6x.pdf\">6 slides per page<\/a>)\n<ul>\n<li>Example data &#8211; One folder with (few) data for each exercise (<a href=\"https:\/\/dbdmg.polito.it\/wordpress\/wp-content\/uploads\/2020\/05\/ExSparkSQLData.zip\">ExSparkSQLData.zip<\/a>)<\/li>\n<li>Solutions of Exercises 47-48 &#8211; Jupyter notebooks (<a href=\"https:\/\/dbdmg.polito.it\/wordpress\/wp-content\/uploads\/2020\/05\/SparkNotebooksSol47_48.zip\">SparkNotebooksSol47_48.zip<\/a>)<\/li>\n<li>Solutions of Exercises 49-50 &#8211; Jupyter notebooks (<a href=\"https:\/\/dbdmg.polito.it\/wordpress\/wp-content\/uploads\/2020\/05\/SparkNotebooksSol49_50.zip\">SparkNotebooksSol49_50.zip<\/a>)\u00a0\u2013 The problem specifications of these two exercises are in\u00a0Spark RDD-, DataFrame-based exercises (<a href=\"https:\/\/dbdmg.polito.it\/wordpress\/wp-content\/uploads\/2020\/03\/02_Spark_Exercises_BigData_2x.pdf\">2 slides per page<\/a>, <a href=\"https:\/\/dbdmg.polito.it\/wordpress\/wp-content\/uploads\/2020\/03\/02_Spark_Exercises_BigData_6x.pdf\">6 
slides per page<\/a>)<\/li>\n<\/ul>\n<\/li>\n<li>Spark MLlib exercises (<a href=\"https:\/\/dbdmg.polito.it\/wordpress\/wp-content\/uploads\/2020\/05\/03_MLlib_Exercises_BigData_2x.pdf\">2 slides per page<\/a>, <a href=\"https:\/\/dbdmg.polito.it\/wordpress\/wp-content\/uploads\/2020\/05\/03_MLlib_Exercises_BigData_6x.pdf\">6 slides per page<\/a>)\n<ul>\n<li>Example data &#8211; One folder with (few) data for each exercise (<a href=\"https:\/\/dbdmg.polito.it\/wordpress\/wp-content\/uploads\/2020\/05\/ExampleMLlibData.zip\">ExampleMLlibData.zip<\/a>)<\/li>\n<li>Solutions of Exercise 51 (<a href=\"https:\/\/dbdmg.polito.it\/wordpress\/wp-content\/uploads\/2020\/05\/SparkNotebooksSol51.zip\">SparkNotebooksSol51.zip<\/a>)<\/li>\n<\/ul>\n<\/li>\n<li>GraphFrame exercises (<a href=\"https:\/\/dbdmg.polito.it\/wordpress\/wp-content\/uploads\/2020\/05\/04_GraphFrame_Exercises_BigData_2x.pdf\">2 slides per page<\/a>, <a href=\"https:\/\/dbdmg.polito.it\/wordpress\/wp-content\/uploads\/2020\/05\/04_GraphFrame_Exercises_BigData_6x.pdf\">6 slides per page<\/a>)\n<ul>\n<li>Example data &#8211; One folder with (few) data for each exercise (<a href=\"https:\/\/dbdmg.polito.it\/wordpress\/wp-content\/uploads\/2020\/05\/ExampleGraphFrameData.zip\">ExampleGraphFrameData.zip<\/a>)<\/li>\n<li>Solutions of Exercises 52-57 &#8211; Jupyter notebooks (<a href=\"https:\/\/dbdmg.polito.it\/wordpress\/wp-content\/uploads\/2020\/05\/SparkNotebooksSol52_57.zip\">SparkNotebooksSol52_57.zip<\/a>)<\/li>\n<\/ul>\n<\/li>\n<li>Spark streaming exercises (<a href=\"https:\/\/dbdmg.polito.it\/wordpress\/wp-content\/uploads\/2020\/05\/05_SparkStreaming_Exercises_BigData_2x.pdf\">2 slides per page<\/a>, <a href=\"https:\/\/dbdmg.polito.it\/wordpress\/wp-content\/uploads\/2020\/05\/05_SparkStreaming_Exercises_BigData_6x.pdf\">6 slides per page<\/a>)\n<ul>\n<li>Example data &#8211; One folder with (few) data for each exercise (<a 
href=\"https:\/\/dbdmg.polito.it\/wordpress\/wp-content\/uploads\/2020\/05\/ExampleSparkStreamingData-1.zip\">ExampleSparkStreamingData.zip<\/a>)<\/li>\n<li>Solutions of Exercises 58-65 &#8211; Jupyter notebooks (<a href=\"https:\/\/dbdmg.polito.it\/wordpress\/wp-content\/uploads\/2020\/05\/SparkNotebooksSol58_65.zip\">SparkNotebooksSol58_65.zip<\/a>)<\/li>\n<\/ul>\n<\/li>\n<\/ul>\n<\/li>\n<\/ul>\n<h3><span id=\"Practices-1\">Practices<\/span><\/h3>\n<ul>\n<li>Lab1: Hadoop and MapReduce (<strong>Wednesday, March 18 &#8211; 13:00-14:30<\/strong>)\n<ul>\n<li>Problem specification (<a href=\"https:\/\/dbdmg.polito.it\/wordpress\/wp-content\/uploads\/2020\/03\/Lab1_BigData.pdf\">pdf<\/a>)<\/li>\n<li>How to import and run locally on your PC a MapReduce program by using Eclipse + maven (<a href=\"https:\/\/www.dropbox.com\/s\/niilmhv6k1130dt\/01_ImportProject_LocalRun.mp4?dl=0\">01_ImportProject_LocalRun.mp4<\/a>)<\/li>\n<li>How to create a jar file and execute your application on the remote cluster BigData@Polito (<a href=\"https:\/\/www.dropbox.com\/s\/65xy3hu9qvqp2oc\/02_Jar_ClusterExecution.mp4?dl=0\">02_Jar_ClusterExecution.mp4<\/a>)<\/li>\n<li>Basic project and small example data set\n<ul>\n<li>Linux and macOS (<a href=\"https:\/\/dbdmg.polito.it\/wordpress\/wp-content\/uploads\/2020\/03\/Lab1.zip\">Lab1.zip<\/a>)<\/li>\n<li>Windows (<a href=\"https:\/\/dbdmg.polito.it\/wordpress\/wp-content\/uploads\/2020\/03\/Lab1Windows.zip\">Lab1Windows.zip<\/a>)<\/li>\n<\/ul>\n<\/li>\n<li>Solution\n<ul>\n<li>Bonus track: <a href=\"https:\/\/dbdmg.polito.it\/wordpress\/wp-content\/uploads\/2020\/03\/Lab1_SolBonus_1920.zip\">Lab1_SolBonus_1920.zip<\/a><\/li>\n<\/ul>\n<\/li>\n<\/ul>\n<\/li>\n<li>Lab2: Filter with Hadoop MapReduce\u00a0 (<strong>Friday, March 20 &#8211; 13:00-14:30<\/strong>)\n<ul>\n<li>Problem specification (<a href=\"https:\/\/dbdmg.polito.it\/wordpress\/wp-content\/uploads\/2020\/03\/Lab2_1920.pdf\">pdf<\/a>)<\/li>\n<li>Skeleton Eclipse project Hadoop \u2013 
MapReduce\n<ul>\n<li>Linux and macOS (<a href=\"https:\/\/dbdmg.polito.it\/wordpress\/wp-content\/uploads\/2020\/03\/Lab2_Skeleton1920.zip\">Lab2_Skeleton1920.zip<\/a>)<\/li>\n<li>Windows (<a href=\"https:\/\/dbdmg.polito.it\/wordpress\/wp-content\/uploads\/2020\/03\/Lab2Windows_Skeleton1920.zip\">Lab2Windows_Skeleton1920.zip<\/a>)<\/li>\n<\/ul>\n<\/li>\n<li>Solution\n<ul>\n<li><a href=\"https:\/\/dbdmg.polito.it\/wordpress\/wp-content\/uploads\/2020\/03\/Lab2_Sol1920.zip\">Lab2_Sol1920.zip<\/a><\/li>\n<\/ul>\n<\/li>\n<\/ul>\n<\/li>\n<li>Lab3: Frequently bought\/reviewed together application with Hadoop MapReduce\u00a0 (<strong>Friday, March 27 &#8211; 13:00-14:30<\/strong>)\n<ul>\n<li>Problem specification (<a href=\"https:\/\/dbdmg.polito.it\/wordpress\/wp-content\/uploads\/2020\/03\/Lab3_1920.pdf\">pdf<\/a>)<\/li>\n<li>Skeleton Eclipse project Hadoop \u2013 MapReduce\n<ul>\n<li>Linux and macOS (<a href=\"https:\/\/dbdmg.polito.it\/wordpress\/wp-content\/uploads\/2020\/03\/Lab3_Skeleton1920.zip\">Lab3_Skeleton1920.zip<\/a>)<\/li>\n<li>Windows (<a href=\"https:\/\/dbdmg.polito.it\/wordpress\/wp-content\/uploads\/2020\/03\/Lab3Windows_Skeleton1920.zip\">Lab3Windows_Skeleton1920.zip<\/a>)<\/li>\n<\/ul>\n<\/li>\n<li>Sample file (<a href=\"https:\/\/dbdmg.polito.it\/wordpress\/wp-content\/uploads\/2020\/03\/AmazonTransposedDataset_Sample.txt\">AmazonTransposedDataset_Sample.txt<\/a>)<\/li>\n<li>Solution\n<ul>\n<li><a href=\"https:\/\/dbdmg.polito.it\/wordpress\/wp-content\/uploads\/2020\/03\/Lab3_Sol1920.zip\">Lab3_Sol1920.zip<\/a>\u00a0\u2013 Three alternative solutions are provided (they differ in efficiency)<\/li>\n<li>Comments on the three uploaded solutions (<a href=\"https:\/\/dbdmg.polito.it\/wordpress\/wp-content\/uploads\/2018\/04\/Lab3_DraftSolution_BigData_2x.pdf\">2 slides per page<\/a>, <a href=\"https:\/\/dbdmg.polito.it\/wordpress\/wp-content\/uploads\/2018\/04\/Lab3_DraftSolution_BigData_6x.pdf\">6 slides per
page<\/a>)<\/li>\n<\/ul>\n<\/li>\n<\/ul>\n<\/li>\n<li>Lab4: Normalized ratings for product recommendations with Hadoop MapReduce \u00a0 (<strong>Friday, April 3 &#8211; 13:00-14:30<\/strong>)\n<ul>\n<li>Problem specification (<a href=\"https:\/\/dbdmg.polito.it\/wordpress\/wp-content\/uploads\/2020\/04\/Lab4_1920.pdf\">pdf<\/a>)<\/li>\n<li>Sample dataset (<a href=\"https:\/\/dbdmg.polito.it\/wordpress\/wp-content\/uploads\/2016\/04\/ReviewsSample.csv\">ReviewsSample.csv<\/a>)<\/li>\n<li>Skeleton Eclipse project Hadoop \u2013 MapReduce\n<ul>\n<li>Linux and macOS (<a href=\"https:\/\/dbdmg.polito.it\/wordpress\/wp-content\/uploads\/2020\/04\/Lab4_Skeleton1920.zip\">Lab4_Skeleton1920.zip<\/a>)<\/li>\n<li>Windows (<a href=\"https:\/\/dbdmg.polito.it\/wordpress\/wp-content\/uploads\/2020\/04\/Lab4Windows_Skeleton1920.zip\">Lab4Windows_Skeleton1920.zip<\/a>)<\/li>\n<\/ul>\n<\/li>\n<li>Solution\n<ul>\n<li><a href=\"https:\/\/dbdmg.polito.it\/wordpress\/wp-content\/uploads\/2020\/04\/Lab4_Sol1920.zip\">Lab4_Sol1920.zip<\/a><\/li>\n<\/ul>\n<\/li>\n<\/ul>\n<\/li>\n<li>Lab5: Filter data and compute basic statistics with Apache Spark (<strong>Friday, April 17 &#8211; 13:00-14:30<\/strong>)\n<ul>\n<li>Problem specification (<a href=\"https:\/\/dbdmg.polito.it\/wordpress\/wp-content\/uploads\/2020\/04\/Lab5_1920.pdf\">pdf<\/a>)<\/li>\n<li>Sample file (<a href=\"https:\/\/dbdmg.polito.it\/wordpress\/wp-content\/uploads\/2018\/04\/SampleLocalFile.csv\">SampleLocalFile.csv<\/a>)<\/li>\n<li>Solution\n<ul>\n<li><a href=\"https:\/\/dbdmg.polito.it\/wordpress\/wp-content\/uploads\/2020\/04\/Lab5DBD_Sol1920.zip\">Lab5_Sol1920.zip<\/a> &#8211; Jupyter notebook (Lab5_Sol.ipynb) and Python script (Lab5_Sol.py)<\/li>\n<\/ul>\n<\/li>\n<\/ul>\n<\/li>\n<li>Lab6: Frequently bought\/reviewed together application with Apache Spark (<strong>Friday, April 24 &#8211; 13:00-14:30)<\/strong>\n<ul>\n<li>Problem specification (<a 
href=\"https:\/\/dbdmg.polito.it\/wordpress\/wp-content\/uploads\/2020\/04\/Lab6_1920.pdf\">pdf<\/a>)<\/li>\n<li>Sample dataset (<a href=\"https:\/\/dbdmg.polito.it\/wordpress\/wp-content\/uploads\/2018\/05\/ReviewsSample.csv\">ReviewsSample.csv<\/a>)<\/li>\n<li>Solution\n<ul>\n<li><a href=\"https:\/\/dbdmg.polito.it\/wordpress\/wp-content\/uploads\/2020\/04\/Lab6_1920Sol.zip\">Lab6_Sol1920.zip<\/a> &#8211; Jupyter notebook (Lab6_1920Sol.ipynb) and Python script (Lab6_1920Sol.py)<\/li>\n<\/ul>\n<\/li>\n<\/ul>\n<\/li>\n<li>Lab7: Bike sharing data analysis (<strong>Wednesday, April 29 &#8211; 13:00-14:30<\/strong>)\n<ul>\n<li>Problem specification (<a href=\"https:\/\/dbdmg.polito.it\/wordpress\/wp-content\/uploads\/2020\/04\/Lab7_1920.pdf\">pdf<\/a>)<\/li>\n<li>Sample data (<a href=\"https:\/\/dbdmg.polito.it\/wordpress\/wp-content\/uploads\/2017\/05\/sampleData.zip\">zip<\/a>)<\/li>\n<li>Example KML file (<a href=\"https:\/\/dbdmg.polito.it\/wordpress\/wp-content\/uploads\/2017\/05\/example.zip\">zip<\/a>)<\/li>\n<li>Another KML visualizer that can be used to visualize on a map the result of your analysis: <a href=\"http:\/\/kmlviewer.nsspot.net\/\">http:\/\/kmlviewer.nsspot.net<\/a><\/li>\n<li>Solution\n<ul>\n<li><a href=\"https:\/\/dbdmg.polito.it\/wordpress\/wp-content\/uploads\/2020\/05\/Lab7_1920Sol.zip\">Lab7_Sol1920.zip<\/a> &#8211; Jupyter notebook (Lab7_1920Sol.ipynb) and Python script (Lab7_1920Sol.py)<\/li>\n<\/ul>\n<\/li>\n<\/ul>\n<\/li>\n<li>Lab8: Bike sharing data analysis based on Spark SQL (<strong>Friday, May 8 &#8211; 13:00-14:30<\/strong>)\n<ul>\n<li>Problem specification (<a href=\"https:\/\/dbdmg.polito.it\/wordpress\/wp-content\/uploads\/2020\/05\/Lab8_1920.pdf\">pdf<\/a>)<\/li>\n<li>Sample data (<a href=\"https:\/\/dbdmg.polito.it\/wordpress\/wp-content\/uploads\/2017\/05\/sampleData.zip\">zip<\/a>)<\/li>\n<li>Solution\n<ul>\n<li><a 
href=\"https:\/\/dbdmg.polito.it\/wordpress\/wp-content\/uploads\/2020\/05\/Lab8_Sol1920.zip\">Lab8_Sol1920.zip<\/a> &#8211; Jupyter notebook and Python script<\/li>\n<\/ul>\n<\/li>\n<\/ul>\n<\/li>\n<li>Lab9: A classification pipeline with MLlib + SparkSQL (<strong>Friday, May 15 &#8211; 13:00-14:30<\/strong>)\n<ul>\n<li>Problem specification (<a href=\"https:\/\/dbdmg.polito.it\/wordpress\/wp-content\/uploads\/2020\/05\/Lab9_1920.pdf\">pdf<\/a>)<\/li>\n<li>Template (<a href=\"https:\/\/dbdmg.polito.it\/wordpress\/wp-content\/uploads\/2020\/05\/lab9_1920template.zip\">zip<\/a>)<\/li>\n<li>Solution\n<ul>\n<li><a href=\"https:\/\/dbdmg.polito.it\/wordpress\/wp-content\/uploads\/2020\/05\/Lab9_Sol1920.zip\">Lab9_Sol1920.zip<\/a> &#8211; Jupyter notebooks<\/li>\n<\/ul>\n<\/li>\n<\/ul>\n<\/li>\n<li>Lab10: GraphFrame (<strong>Friday, May 22 &#8211; 13:00-14:30<\/strong>)\n<ul>\n<li>Problem specification (<a href=\"https:\/\/dbdmg.polito.it\/wordpress\/wp-content\/uploads\/2020\/05\/Lab10_1920.pdf\">pdf<\/a>)<\/li>\n<li>Solution\n<ul>\n<li><a href=\"https:\/\/dbdmg.polito.it\/wordpress\/wp-content\/uploads\/2020\/05\/Lab10_Sol1920.zip\">Lab10_Sol1920.zip<\/a> &#8211; Jupyter notebook<\/li>\n<\/ul>\n<\/li>\n<\/ul>\n<\/li>\n<li>Lab11: Tweet analysis \u2013 Spark streaming (<strong>Friday, May 29 &#8211; 13:00-14:30<\/strong>)\n<ul>\n<li>Problem specification (<a href=\"https:\/\/dbdmg.polito.it\/wordpress\/wp-content\/uploads\/2020\/05\/Lab11_1920.pdf\">pdf<\/a>)<\/li>\n<li>Example files \u2013 tweets (<a href=\"https:\/\/dbdmg.polito.it\/wordpress\/wp-content\/uploads\/2017\/05\/exampledata_tweets.zip\">exampledata_tweets.zip<\/a>)<\/li>\n<li>Solution\n<ul>\n<li><a href=\"https:\/\/dbdmg.polito.it\/wordpress\/wp-content\/uploads\/2020\/05\/Lab11_Sol1920.zip\">Lab11_Sol1920.zip<\/a> &#8211; Jupyter notebook<\/li>\n<\/ul>\n<\/li>\n<\/ul>\n<\/li>\n<\/ul>\n<h3><span id=\"Exam-Examples-1\">Exam Examples<\/span><\/h3>\n<ul>\n<li>Exam Example #1 (<a 
href=\"https:\/\/dbdmg.polito.it\/wordpress\/wp-content\/uploads\/2020\/03\/DistrBD_ExamExample1.pdf\">pdf<\/a>)\n<ul>\n<li>Solution\n<ul>\n<li>Question 1: (d)<\/li>\n<li>Question 2: (c)<\/li>\n<li><a href=\"https:\/\/dbdmg.polito.it\/wordpress\/wp-content\/uploads\/2020\/06\/SolutionExamExample1.zip\">SolutionExamExample1.zip<\/a><\/li>\n<\/ul>\n<\/li>\n<\/ul>\n<\/li>\n<li>Exam Example #2 (<a href=\"https:\/\/dbdmg.polito.it\/wordpress\/wp-content\/uploads\/2020\/03\/DistrBD_ExamExample2.pdf\">pdf<\/a>)\n<ul>\n<li>Solution\n<ul>\n<li>Question 1: (d)<\/li>\n<li>Question 2: (c)<\/li>\n<li><a href=\"https:\/\/dbdmg.polito.it\/wordpress\/wp-content\/uploads\/2020\/06\/SolutionExamExample2.zip\">SolutionExamExample2.zip<\/a><\/li>\n<\/ul>\n<\/li>\n<\/ul>\n<\/li>\n<li>Exam Example #3 (<a href=\"https:\/\/dbdmg.polito.it\/wordpress\/wp-content\/uploads\/2020\/06\/DistrBD_ExamExample3.pdf\">pdf<\/a>)\n<ul>\n<li>Solution\n<ul>\n<li>Question 1: (c)<\/li>\n<li>Question 2: (c)<\/li>\n<li><a href=\"https:\/\/dbdmg.polito.it\/wordpress\/wp-content\/uploads\/2020\/06\/SolutionExamExample3.zip\">SolutionExamExample3.zip<\/a><\/li>\n<\/ul>\n<\/li>\n<\/ul>\n<\/li>\n<li>Exam Example #4 (<a href=\"https:\/\/dbdmg.polito.it\/wordpress\/wp-content\/uploads\/2020\/06\/DistrBD_ExamExample4.pdf\">pdf<\/a>)\n<ul>\n<li>Solution\n<ul>\n<li>Question 1: (d)<\/li>\n<li>Question 2: (c)<\/li>\n<li><a href=\"https:\/\/dbdmg.polito.it\/wordpress\/wp-content\/uploads\/2020\/06\/SolutionExamExample4.zip\">SolutionExamExample4.zip<\/a><\/li>\n<\/ul>\n<\/li>\n<\/ul>\n<\/li>\n<li>Exam Example #5 (<a href=\"https:\/\/dbdmg.polito.it\/wordpress\/wp-content\/uploads\/2020\/06\/DistrBD_ExamExample5.pdf\">pdf<\/a>)\n<ul>\n<li>Solution\n<ul>\n<li>Question 1: (b)<\/li>\n<li>Question 2: (b)<\/li>\n<li><a href=\"https:\/\/dbdmg.polito.it\/wordpress\/wp-content\/uploads\/2020\/06\/SolutionExamExample5.zip\">SolutionExamExample5.zip<\/a><\/li>\n<\/ul>\n<\/li>\n<\/ul>\n<\/li>\n<li>Exam June 27, 2020 (<a 
href=\"https:\/\/dbdmg.polito.it\/wordpress\/wp-content\/uploads\/2020\/06\/DBD_Exam20200627.pdf\">pdf<\/a>)\n<ul>\n<li>Solution\n<ul>\n<li>Question 1: (b)<\/li>\n<li>Question 2: (a)<\/li>\n<li>Part II: MapReduce and Spark (<a href=\"https:\/\/dbdmg.polito.it\/wordpress\/wp-content\/uploads\/2020\/06\/DBD_Exam20200627Sol.zip\">DBD_Exam20200627Sol.zip<\/a>)<\/li>\n<\/ul>\n<\/li>\n<\/ul>\n<\/li>\n<li>Exam July 20, 2020 (<a href=\"https:\/\/dbdmg.polito.it\/wordpress\/wp-content\/uploads\/2020\/07\/DBD_Exam20200720.pdf\">pdf<\/a>)\n<ul>\n<li>Solution\n<ul>\n<li>Question 1: (d)<\/li>\n<li>Question 2: (b) \u2013 Note that there are three actions and hence the input file is read three times.<\/li>\n<li>Part II: MapReduce and Spark (<a href=\"https:\/\/dbdmg.polito.it\/wordpress\/wp-content\/uploads\/2020\/07\/DBD_Exam20200720Sol.zip\">DBD_Exam20200720Sol.zip<\/a>)<\/li>\n<\/ul>\n<\/li>\n<\/ul>\n<\/li>\n<li>Exam September 14, 2020 (<a href=\"https:\/\/dbdmg.polito.it\/wordpress\/wp-content\/uploads\/2020\/09\/DBD_Exam20200914.pdf\">pdf<\/a>)\n<ul>\n<li>Solution\n<ul>\n<li>Question 1: (d)<\/li>\n<li>Question 2: (c)<\/li>\n<li>Part II: MapReduce and Spark (<a href=\"https:\/\/dbdmg.polito.it\/wordpress\/wp-content\/uploads\/2020\/09\/DBD_Exam20200914Sol.zip\">DBD_Exam20200914Sol.zip<\/a>)<\/li>\n<\/ul>\n<\/li>\n<\/ul>\n<\/li>\n<\/ul>\n<h3><span id=\"Additional-material-1\">Additional material<\/span><\/h3>\n<ul>\n<li>Slides and screencasts about Java (kindly provided by prof. 
Torchiano) (<a href=\"http:\/\/dbdmg.polito.it\/~paolo\/JavaMaterials\/02JEY%20-%20Object%20Oriented%20Programming.html\">link<\/a>)\n<ul>\n<li>Suggested slides\/lectures for those students who have never used Java\n<ul>\n<li>OO Paradigm and UML (The UML part is not mandatory)<\/li>\n<li>The Java Environment<\/li>\n<li>Java Basic Features<\/li>\n<li>Java Inheritance<\/li>\n<\/ul>\n<\/li>\n<\/ul>\n<\/li>\n<\/ul>\n<p>&nbsp;<\/p>\n<br class=\"fixfloat\" \/>","protected":false},"excerpt":{"rendered":"<p>Table of content General information Exam rules Announcements Slides Exercises Practices Exam Examples Additional material Pay attention that this page is the web page for\u00a0 to the academic year 2019\/2020 General information ECTS: 8 Professor: Teaching assistant: Martino Trevisan Exam rules Exam rules Academic Year 2019-2020 &#8211; ONLINE EXAMINATION SESSION (pdf) Exam rules Academic Year<a href=\"https:\/\/dbdmg.polito.it\/wordpress\/teaching\/distributed-architectures-for-big-data-processing-and-analytics-2019-2020\/\">[&#8230;]<\/a><\/p>\n","protected":false},"author":3,"featured_media":0,"parent":96,"menu_order":0,"comment_status":"closed","ping_status":"closed","template":"","meta":{"footnotes":""},"class_list":["post-15371","page","type-page","status-publish","hentry"],"_links":{"self":[{"href":"https:\/\/dbdmg.polito.it\/wordpress\/wp-json\/wp\/v2\/pages\/15371","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/dbdmg.polito.it\/wordpress\/wp-json\/wp\/v2\/pages"}],"about":[{"href":"https:\/\/dbdmg.polito.it\/wordpress\/wp-json\/wp\/v2\/types\/page"}],"author":[{"embeddable":true,"href":"https:\/\/dbdmg.polito.it\/wordpress\/wp-json\/wp\/v2\/users\/3"}],"replies":[{"embeddable":true,"href":"https:\/\/dbdmg.polito.it\/wordpress\/wp-json\/wp\/v2\/comments?post=15371"}],"version-history":[{"count":100,"href":"https:\/\/dbdmg.polito.it\/wordpress\/wp-json\/wp\/v2\/pages\/15371\/revisions"}],"predecessor-version":[{"id":17722,"href":"https:\/\/dbd
mg.polito.it\/wordpress\/wp-json\/wp\/v2\/pages\/15371\/revisions\/17722"}],"up":[{"embeddable":true,"href":"https:\/\/dbdmg.polito.it\/wordpress\/wp-json\/wp\/v2\/pages\/96"}],"wp:attachment":[{"href":"https:\/\/dbdmg.polito.it\/wordpress\/wp-json\/wp\/v2\/media?parent=15371"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}