Large-scale itemset mining



Itemset mining focuses on extracting useful knowledge from huge quantities of data. Many application domains must cope with ever-growing amounts of collected data (e.g., biological data, network traffic data, textual data, sensor network streams, spatio-temporal data). Traditional in-core mining algorithms do not scale to such volumes: they are hindered by main-memory exhaustion and long execution times. Scalable alternative approaches must therefore be devised to perform large-scale data mining efficiently. This research activity investigates innovative approaches that exploit disk-based data structures and memory-efficient algorithms to extract frequent itemsets.
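As an illustration of what extracting frequent itemsets involves, the sketch below uses a generic Apriori-style level-wise search. It is not the disk-based approach investigated in this activity. The names frequent_itemsets, scan and min_support are hypothetical and chosen only for this example. The assumption is that scan() returns a fresh iterator over the transactions each time it is called (for instance, by re-reading a file), so that only the candidate counters, and never the whole dataset, need to stay in main memory.

```python
from collections import Counter
from itertools import combinations


def frequent_itemsets(scan, min_support):
    """Return {itemset: support count} for all itemsets with count >= min_support.

    scan is a hypothetical callable returning a fresh iterator over
    transactions; each call stands for one sequential pass over the data.
    """
    # Pass 1: count the single items.
    counts = Counter()
    for transaction in scan():
        for item in set(transaction):
            counts[frozenset([item])] += 1
    frequent = {s: c for s, c in counts.items() if c >= min_support}
    result = dict(frequent)

    k = 2
    while frequent:
        # Build size-k candidates whose (k-1)-subsets are all frequent.
        prev = set(frequent)
        candidates = set()
        for a in prev:
            for b in prev:
                union = a | b
                if len(union) == k and all(
                    frozenset(sub) in prev for sub in combinations(union, k - 1)
                ):
                    candidates.add(union)
        if not candidates:
            break

        # One more pass over the data to count the candidates.
        counts = Counter()
        for transaction in scan():
            items = set(transaction)
            for candidate in candidates:
                if candidate <= items:
                    counts[candidate] += 1
        frequent = {s: c for s, c in counts.items() if c >= min_support}
        result.update(frequent)
        k += 1
    return result


# Toy data held in a list; a real scan() would stream transactions from disk.
transactions = [["a", "b", "c"], ["a", "b"], ["a", "c"], ["b", "c"], ["a", "b", "c"]]
result = frequent_itemsets(lambda: iter(transactions), min_support=3)
```

Each level costs one full pass over the data, and the candidate set can itself outgrow main memory on large datasets, which is the kind of limitation that motivates disk-based data structures.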

Technical reports

  • TR-2-2012: Large scale itemset mining, co-authored by Elena Baralis, Tania Cerquitelli, Silvia Chiusano, and Alberto Grand

Datasets

Real datasets

  • Wikipedia dataset (tar.gz archive)

Synthetic datasets

  • Script to generate synthetic datasets by means of the IBM Data Generator (tar.gz archive)

 

Publications

Elena Baralis, Tania Cerquitelli, Silvia Chiusano: A persistent HY-Tree to efficiently support itemset mining on large datasets. SAC 2010: 1060-1064

Master's thesis

Alberto Grand. Master Thesis. Index support for itemset mining. Joint double-degree program between Politecnico di Torino and University of Illinois at Chicago. Master of Science in Electrical and Computer Engineering. November 2009 (pdf)