Thanks to the rapid growth of social networks and online communities, large social data collections are becoming increasingly common, prompting the need for scalable and innovative data analysis solutions. Several data analytics tools rely on data mining algorithms to gain interesting insights into large data volumes. Generalized itemset mining is a well-known exploratory data mining technique used to discover interesting high-level data correlations. Because it copes effectively with sparse datasets, its application to the user-generated content published on Twitter is an appealing research issue. However, since patterns discovered at different abstraction levels may contrast in terms of correlation type (positive, negative, or null), their manual inspection may become particularly interesting when a large number of specific (descendant) itemsets show correlation type changes with respect to their common ancestor.
This work presents a novel data mining approach to effectively support Twitter data analysis by means of generalized itemsets. A novel kind of pattern, namely the Strong Flipping Generalized Itemset (SFGI), is extracted from Twitter post content and contextual information supplied with taxonomy hierarchies. Each SFGI is composed of a frequent generalized itemset X and the set of its descendants showing a correlation type change with respect to X. Hence, SFGIs highlight contrasting situations in the analyzed data, which are usually associated with interesting information. An algorithm to mine SFGIs on top of traditional generalized itemsets is also proposed.
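To make the SFGI structure more concrete, the following minimal Python sketch checks, for a frequent generalized 2-itemset X, which of its direct descendants show a different correlation type. It is an illustration only, not the proposed mining algorithm. Several elements are assumptions that do not come from the text above: the toy taxonomy and tweets, the use of the Kulczynski measure to assess correlation, the thresholds max_neg and min_pos, the restriction to 2-itemsets and to descendants one taxonomy level below X, and all function names.

```python
from itertools import product

# Hypothetical toy taxonomy (child -> parent); not taken from the evaluated datasets.
PARENT = {"Turin": "Italy", "Rome": "Italy", "rain": "Weather", "sun": "Weather"}


def extend(transaction):
    """Return the transaction enriched with every ancestor of its items."""
    extended = set()
    for item in transaction:
        extended.add(item)
        while item in PARENT:
            item = PARENT[item]
            extended.add(item)
    return extended


def support(itemset, db):
    """Fraction of extended transactions that contain every item of the itemset."""
    return sum(1 for t in db if itemset <= t) / len(db)


def correlation_type(pair, db, max_neg=0.4, min_pos=0.6):
    """Classify a 2-itemset as positive, negative, or null.

    Assumption: correlation is measured with the Kulczynski measure, i.e. the
    average of P(a|b) and P(b|a); the thresholds are illustrative values.
    """
    a, b = sorted(pair)
    s_a, s_b = support({a}, db), support({b}, db)
    if s_a == 0 or s_b == 0:
        return "negative"  # an item that never occurs is treated as Kulc = 0
    s_ab = support(pair, db)
    kulc = 0.5 * (s_ab / s_a + s_ab / s_b)
    if kulc >= min_pos:
        return "positive"
    if kulc <= max_neg:
        return "negative"
    return "null"


def children(item):
    """Direct children of an item in the taxonomy."""
    return [c for c, p in PARENT.items() if p == item]


def flipping_descendants(pair, db, min_sup):
    """Return (type of X, {descendant: type}) for descendants whose type differs.

    Returns None when X itself is not frequent. Only descendants obtained by
    replacing both items with a direct child are considered (a simplification).
    """
    if support(pair, db) < min_sup:
        return None
    base = correlation_type(pair, db)
    a, b = sorted(pair)
    flips = {}
    for child_a, child_b in product(children(a), children(b)):
        descendant = frozenset({child_a, child_b})
        kind = correlation_type(descendant, db)
        if kind != base:
            flips[descendant] = kind
    return base, flips


# Hypothetical tweets described by a (location, weather) pair.
tweets = [{"Turin", "rain"}] * 3 + [{"Rome", "sun"}] * 3
db = [extend(t) for t in tweets]
base, flips = flipping_descendants(frozenset({"Italy", "Weather"}), db, min_sup=0.5)
```

In this toy example the ancestor {Italy, Weather} is positively correlated, while the descendants {Turin, sun} and {Rome, rain} never co-occur and are therefore classified as negative. The sketch does not model what makes a flipping itemset "strong"; it only collects the descendants whose correlation type differs from that of X.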
EVALUATED REAL TWITTER DATASETS AND TAXONOMIES
The collection of evaluated Twitter datasets and the corresponding taxonomies is available here.
SYNTHETIC DATA AND TAXONOMY GENERATOR
The synthetic data generator is available here (this is the Linux version of the standard IBM generator). Use the tax option to generate both the data and the taxonomy.