{"id":342,"date":"2010-12-28T14:19:21","date_gmt":"2010-12-28T13:19:21","guid":{"rendered":"http:\/\/dbdmg.polito.it\/wordpress\/?page_id=342"},"modified":"2011-01-10T13:52:57","modified_gmt":"2011-01-10T12:52:57","slug":"index-support-for-itemset-mining-in-a-rdbms","status":"publish","type":"page","link":"https:\/\/dbdmg.polito.it\/wordpress\/research\/index-support-for-itemset-mining-in-a-rdbms\/","title":{"rendered":"Index Support for Itemset Mining in a RDBMS"},"content":{"rendered":"<h2>Motivation<\/h2>\n<p style=\"text-align: justify;\">Research activity in association rule mining has been initially focused  on defining efficient algorithms to perform the computationally  intensive knowledge extraction task (i.e., frequent itemset mining). The  data to be analyzed was (possibly) extracted from the DBMS and stored  into binary files (i.e., flat files). Many algorithms, both memory-based  and disk-based, are focused on specialized data structures and buffer  management strategies to efficiently extract the desired type of  knowledge from a flat dataset. However, no attempt to achieve complete  integration into a true relational DBMS of these external structures has  been proposed so far. DBMSs exploit indices to improve the performance  on complex queries. The intuition that the same approach could be  &#8220;exported&#8221; to the data mining domain is the driving force of this  project. A true integration of novel data mining indices into the PostgreSQL open source DBMS is proposed and advantages and disadvantages of the  proposed disk-based data structures are highlighted. For each index,  DBMS kernel primitives have been designed and implemented to materialize  the index structure on disk and to perform many frequent itemset  extraction sessions.<\/p>\n<h2>I-Mine index<\/h2>\n<p style=\"text-align: justify;\">The I-Mine index is a general and compact structure which  provides tight integration of itemset extraction in a relational DBMS.  Since no constraint is enforced during the index creation phase, I-Mine provides a complete representation of the original database. To  reduce the I\/O cost, data accessed together during the same extraction  phase are clustered on the same disk block. The I-Mine index structure  can be efficiently exploited by different itemset extraction algorithms.  In particular,  I-Mine data access methods currently support the  FP-growth and LCM v.2 algorithms, but they can straightforwardly support  the enforcement of various constraint categories. The I-Mine index has  been integrated into the PostgreSQL DBMS and exploits its physical level access methods.<\/p>\n<p style=\"text-align: justify;\">Experiments, run for both sparse and dense data distributions, show the  efficiency of the proposed index and its linear scalability also for  large datasets. Itemset mining supported by the I-Mine index shows  performance always comparable with, and often (especially for low  supports) better than, state of the art algorithms accessing data on  flat file.<\/p>\n<h2>I-Forest index<\/h2>\n<p style=\"text-align: justify;\">A technique for the incremental update of the index, suitable for  association rule extraction on evolving databases, has been proposed and  implemented. The content of evolving databases (e.g., transactional  data from large retail chains, web server logs, call-detail records) is  periodically updated through insertion (or deletion) of data blocks.  Data can be described as a sequence of incoming data blocks, where new  blocks arrive periodically. Different kinds of analysis could be  performed over such data like (i) mining all available data, (ii) mining  only the most recent data (e.g., last month data), (iii) mining  periodical data (e.g., quarterly data) and (iv) mining selected data  blocks (e.g., data related to the first month of last year and the first  month of this year). Itemset mining on sequences of incoming data  blocks may require a significant amount of main memory during the  extraction process. By means of the proposed index, called  I-Forest, incoming data blocks are stored on disk in appropriate  (compact) structures. During the mining phase, only the data required by  the current mining process is actually loaded in main memory. To allow  different kinds of analysis and easy incremental insertion of new blocks  (or deletion of obsolete ones), each data block is represented  separately and independently of all others. The proposed structure is a  covering index that represents transactional blocks in a succinct form.  During the creation phase no support constraint is enforced (i.e., the  index provides a complete representation of the evolving data). Hence,  item, support and time constraints may be enforced during the extraction  phase. Experiments have been run for both sparse and dense data  distributions. Experimental results show that frequent itemset mining  based on the proposed index is efficient both in terms of extraction  time and memory usage. The performance of the proposed algorithm  exploiting the index structure is always comparable with and for low  support threshold faster than the best implementation algorithm  accessing static data on flat file.<\/p>\n<h2>Publications<\/h2>\n<div class=\"teachpress_pub_list\"><form name=\"tppublistform\" method=\"get\"><a name=\"tppubs\" id=\"tppubs\"><\/a><\/form><div class=\"teachpress_message_error\"><p>Sorry, no publications matched your criteria.<\/p><\/div><\/div>\n<br class=\"fixfloat\" \/>","protected":false},"excerpt":{"rendered":"<p>Motivation Research activity in association rule mining has been initially focused on defining efficient algorithms to perform the computationally intensive knowledge extraction task (i.e., frequent itemset mining). The data to be analyzed was (possibly) extracted from the DBMS and stored into binary files (i.e., flat files). Many algorithms, both memory-based and disk-based, are focused on<a href=\"https:\/\/dbdmg.polito.it\/wordpress\/research\/index-support-for-itemset-mining-in-a-rdbms\/\">[&#8230;]<\/a><\/p>\n","protected":false},"author":2,"featured_media":425,"parent":98,"menu_order":0,"comment_status":"closed","ping_status":"closed","template":"","meta":{"footnotes":""},"class_list":["post-342","page","type-page","status-publish","has-post-thumbnail","hentry"],"_links":{"self":[{"href":"https:\/\/dbdmg.polito.it\/wordpress\/wp-json\/wp\/v2\/pages\/342","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/dbdmg.polito.it\/wordpress\/wp-json\/wp\/v2\/pages"}],"about":[{"href":"https:\/\/dbdmg.polito.it\/wordpress\/wp-json\/wp\/v2\/types\/page"}],"author":[{"embeddable":true,"href":"https:\/\/dbdmg.polito.it\/wordpress\/wp-json\/wp\/v2\/users\/2"}],"replies":[{"embeddable":true,"href":"https:\/\/dbdmg.polito.it\/wordpress\/wp-json\/wp\/v2\/comments?post=342"}],"version-history":[{"count":12,"href":"https:\/\/dbdmg.polito.it\/wordpress\/wp-json\/wp\/v2\/pages\/342\/revisions"}],"predecessor-version":[{"id":1058,"href":"https:\/\/dbdmg.polito.it\/wordpress\/wp-json\/wp\/v2\/pages\/342\/revisions\/1058"}],"up":[{"embeddable":true,"href":"https:\/\/dbdmg.polito.it\/wordpress\/wp-json\/wp\/v2\/pages\/98"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/dbdmg.polito.it\/wordpress\/wp-json\/wp\/v2\/media\/425"}],"wp:attachment":[{"href":"https:\/\/dbdmg.polito.it\/wordpress\/wp-json\/wp\/v2\/media?parent=342"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}