DataBase and Data Mining Group

Large-scale itemset mining

Itemset mining focuses on extracting useful knowledge from huge quantities of data. Many application domains must cope with ever-growing volumes of collected data (e.g., biological data, network traffic data, textual data, sensor network streams, spatio-temporal data). Traditional in-core mining algorithms, which keep their data structures entirely in main memory, do not scale well to large data volumes: they exhaust main memory and incur long execution times. Scalable alternative approaches are therefore needed to perform large-scale data mining efficiently. This research activity investigates innovative approaches that exploit disk-based data structures and memory-efficient algorithms to extract frequent itemsets.
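To make the notion of a frequent itemset concrete, the following is a minimal Python sketch of a naive, fully in-memory counting approach. It is not the disk-based technique studied in this research activity; the function name `frequent_itemsets`, the `max_size` cap, and the toy transactions are assumptions introduced purely for illustration. An itemset is considered frequent when it occurs in at least `min_support` transactions.

```python
from collections import Counter
from itertools import combinations


def frequent_itemsets(transactions, min_support, max_size=2):
    """Count every itemset of up to max_size items and keep the frequent ones.

    Illustrative brute-force sketch: all candidate counts live in one
    in-memory Counter, which is exactly what stops scaling on large data.
    """
    counts = Counter()
    for transaction in transactions:
        items = sorted(set(transaction))  # ignore duplicate items in a transaction
        for size in range(1, max_size + 1):
            for candidate in combinations(items, size):
                counts[frozenset(candidate)] += 1
    # Keep only itemsets that meet the minimum support threshold.
    return {itemset: n for itemset, n in counts.items() if n >= min_support}


# Hypothetical toy dataset: each transaction is a list of items.
transactions = [
    ["bread", "milk"],
    ["bread", "butter", "milk"],
    ["butter", "milk"],
    ["bread", "butter"],
]

result = frequent_itemsets(transactions, min_support=2)
for itemset, support in sorted(result.items(), key=lambda kv: sorted(kv[0])):
    print(sorted(itemset), support)
```

Because the counter holds one entry per distinct candidate itemset, its size grows combinatorially with the number of items per transaction. This is the kind of main-memory pressure that motivates disk-based data structures and memory-efficient algorithms.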

Technical reports

TR-2-2012: Large scale itemset mining, co-authored by Elena Baralis, Tania Cerquitelli, Silvia Chiusano, and Alberto Grand

Datasets

Real datasets

Wikipedia dataset (tar.gz archive)

Synthetic datasets

Script to generate synthetic datasets by means of the IBM Data Generator (tar.gz archive)

Publications

Elena Baralis, Tania Cerquitelli, Silvia Chiusano: A persistent HY-Tree to efficiently support itemset mining on large datasets. SAC 2010: 1060-1064

Master thesis

Alberto Grand. Master Thesis. Index support for itemset mining. Joint double-degree program between Politecnico di Torino and University of Illinois at Chicago. Master of Science in Electrical and Computer Engineering. November 2009 (pdf)