Research activity in association rule mining has been initially focused on defining efficient algorithms to perform the computationally intensive knowledge extraction task (i.e., frequent itemset mining). The data to be analyzed was (possibly) extracted from the DBMS and stored into binary files (i.e., flat files). Many algorithms, both memory-based and disk-based, are focused on specialized data structures and buffer management strategies to efficiently extract the desired type of knowledge from a flat dataset. However, no attempt to achieve complete integration into a true relational DBMS of these external structures has been proposed so far. DBMSs exploit indices to improve the performance on complex queries. The intuition that the same approach could be “exported” to the data mining domain is the driving force of this project. A true integration of novel data mining indices into the PostgreSQL open source DBMS is proposed and advantages and disadvantages of the proposed disk-based data structures are highlighted. For each index, DBMS kernel primitives have been designed and implemented to materialize the index structure on disk and to perform many frequent itemset extraction sessions.
The I-Mine index is a general and compact structure which provides tight integration of itemset extraction in a relational DBMS. Since no constraint is enforced during the index creation phase, I-Mine provides a complete representation of the original database. To reduce the I/O cost, data accessed together during the same extraction phase are clustered on the same disk block. The I-Mine index structure can be efficiently exploited by different itemset extraction algorithms. In particular, I-Mine data access methods currently support the FP-growth and LCM v.2 algorithms, but they can straightforwardly support the enforcement of various constraint categories. The I-Mine index has been integrated into the PostgreSQL DBMS and exploits its physical level access methods.
Experiments, run for both sparse and dense data distributions, show the efficiency of the proposed index and its linear scalability also for large datasets. Itemset mining supported by the I-Mine index shows performance always comparable with, and often (especially for low supports) better than, state of the art algorithms accessing data on flat file.
A technique for the incremental update of the index, suitable for association rule extraction on evolving databases, has been proposed and implemented. The content of evolving databases (e.g., transactional data from large retail chains, web server logs, call-detail records) is periodically updated through insertion (or deletion) of data blocks. Data can be described as a sequence of incoming data blocks, where new blocks arrive periodically. Different kinds of analysis could be performed over such data like (i) mining all available data, (ii) mining only the most recent data (e.g., last month data), (iii) mining periodical data (e.g., quarterly data) and (iv) mining selected data blocks (e.g., data related to the first month of last year and the first month of this year). Itemset mining on sequences of incoming data blocks may require a significant amount of main memory during the extraction process. By means of the proposed index, called I-Forest, incoming data blocks are stored on disk in appropriate (compact) structures. During the mining phase, only the data required by the current mining process is actually loaded in main memory. To allow different kinds of analysis and easy incremental insertion of new blocks (or deletion of obsolete ones), each data block is represented separately and independently of all others. The proposed structure is a covering index that represents transactional blocks in a succinct form. During the creation phase no support constraint is enforced (i.e., the index provides a complete representation of the evolving data). Hence, item, support and time constraints may be enforced during the extraction phase. Experiments have been run for both sparse and dense data distributions. Experimental results show that frequent itemset mining based on the proposed index is efficient both in terms of extraction time and memory usage. The performance of the proposed algorithm exploiting the index structure is always comparable with and for low support threshold faster than the best implementation algorithm accessing static data on flat file.