Document summarization


GraphSum: discovering correlations among multiple terms for graph-based summarization

A general-purpose, graph-based summarizer that exploits association rules to consider correlations among multiple terms during the summarization process.
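
As a rough illustration of the idea, the sketch below builds a small term graph from frequent term pairs and ranks the terms on it. It is not the published GraphSum algorithm: the function names (`term_graph`, `rank_terms`, `sentence_score`), the use of lift as edge weight, and the PageRank-style ranking are all assumptions made for this example.

```python
from collections import Counter
from itertools import combinations


def term_graph(transactions, min_support=2):
    """Link two terms when they co-occur in at least min_support sentences.

    Edge weights use lift, a standard association-rule interest measure
    (an assumption here; the text above does not name the measure).
    """
    n = len(transactions)
    single = Counter(t for terms in transactions for t in terms)
    pairs = Counter(
        p for terms in transactions for p in combinations(sorted(terms), 2)
    )
    graph = {}
    for (a, b), count in pairs.items():
        if count >= min_support:
            lift = (count / n) / ((single[a] / n) * (single[b] / n))
            graph.setdefault(a, {})[b] = lift
            graph.setdefault(b, {})[a] = lift
    return graph


def rank_terms(graph, damping=0.85, iterations=50):
    """Weighted PageRank by power iteration over the undirected term graph."""
    nodes = list(graph)
    rank = {v: 1.0 / len(nodes) for v in nodes}
    for _ in range(iterations):
        updated = {}
        for v in nodes:
            # Each neighbour u passes on its rank in proportion to edge weight.
            incoming = sum(
                rank[u] * w / sum(graph[u].values()) for u, w in graph[v].items()
            )
            updated[v] = (1 - damping) / len(nodes) + damping * incoming
        rank = updated
    return rank


def sentence_score(terms, rank):
    """Score a sentence by the summed rank of the terms it contains."""
    return sum(rank.get(t, 0.0) for t in terms)
```

Each sentence is passed in as a set of terms; sentences whose terms sit in strongly correlated regions of the graph receive higher scores and would be preferred when building the summary.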

Download the original news collections here and the summary examples here.

Multilingual document summarization

News collections (crawled from Sept. 2011 to May 2012): download them here.

Working Capital Grant Project “Web Summaries”

Brief description (English version):
“Web summaries” is a research project that develops a platform for automatically generating succinct, easily manageable summaries of large collections of Web documents. Summary generation relies on data mining techniques and is driven by the analysis of data retrieved from social networks and online communities. More specifically, the goal is to select the topics and contexts (of usage or publication) that are of particular interest to users, and to tailor the Web document summaries to their actual needs and expectations.

Brief description (translated from the Italian version):
“Riassunti Web” (“Web summaries”) is a research project aimed at building a platform for the automatic generation of short, and therefore easily accessible, summaries of large collections of Web documents. Their generation, based on data analysis techniques, is oriented toward integrating content published on social networks. The goal is to select the topics and the contexts of use or publication of content that are of particular interest to users, and thus to adapt the content of the summaries to their actual needs and expectations.

Related material:

Short video presentation in .mp4 format (in Italian): Video link for download

News document summarization

A summary is a succinct and informative description of a data collection. In multi-document summarization, selecting the most relevant, non-redundant sentences from a collection of textual documents is a challenging task. Frequent itemset mining is a well-established data mining technique for discovering correlations among data.

Recent research effort has gone into developing a multi-document summarizer, PatTexSum (Pattern-based Text Summarizer), built on a pattern-based model, i.e., a model composed of frequent itemsets extracted from the document collection. It automatically selects the most representative, non-redundant sentences to include in the summary by considering both sentence coverage, measured against a concise and highly informative itemset-based model, and a sentence relevance score based on tf-idf statistics.

The effectiveness of the proposed summarizer has been validated on real-life collections of on-topic news documents.
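
The combination of itemset coverage and tf-idf relevance described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the PatTexSum implementation: the helper names, the brute-force itemset enumeration, the use of sentences as the tf-idf unit, and the greedy selection loop are all hypothetical choices made for the example.

```python
from collections import Counter
from itertools import combinations
import math


def tokenize(sentence):
    """Lowercase, strip punctuation, and keep terms longer than two characters."""
    words = (t.strip(".,;:!?()\"'").lower() for t in sentence.split())
    return {w for w in words if len(w) > 2}


def frequent_itemsets(transactions, min_support, max_size=2):
    """Brute-force enumeration of itemsets found in >= min_support sentences."""
    counts = Counter()
    for terms in transactions:
        for size in range(1, max_size + 1):
            for itemset in combinations(sorted(terms), size):
                counts[itemset] += 1
    return {frozenset(i) for i, c in counts.items() if c >= min_support}


def tfidf_relevance(transactions):
    """Mean idf of each sentence's terms (tf is 1 because terms form a set)."""
    n = len(transactions)
    df = Counter(t for terms in transactions for t in terms)
    scores = []
    for terms in transactions:
        total = sum(math.log(n / df[t]) for t in terms)
        scores.append(total / len(terms) if terms else 0.0)
    return scores


def summarize(sentences, min_support=2, max_sentences=3):
    """Greedily pick sentences that cover the most uncovered itemsets.

    Ties on coverage are broken by the tf-idf relevance score.
    """
    transactions = [tokenize(s) for s in sentences]
    uncovered = frequent_itemsets(transactions, min_support)
    relevance = tfidf_relevance(transactions)
    chosen = []
    while uncovered and len(chosen) < max_sentences:
        def gain(i):
            return sum(1 for itemset in uncovered if itemset <= transactions[i])

        candidates = [i for i in range(len(sentences)) if i not in chosen]
        if not candidates:
            break
        best = max(candidates, key=lambda i: (gain(i), relevance[i]))
        if gain(best) == 0:
            break
        chosen.append(best)
        uncovered = {it for it in uncovered if not it <= transactions[best]}
    # Return the selected sentences in their original order.
    return [sentences[i] for i in sorted(chosen)]
```

Selection stops as soon as every frequent itemset is covered, which is what keeps the summary non-redundant in this sketch: a sentence that only repeats already covered patterns has zero gain and is never added.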

Biological document summarization

The BioSumm project tackles the problem of managing and exploiting the huge mass of information contained in ever-growing text repositories such as PubMed Central. BioSumm is a flexible and modular framework that analyzes large collections of unclassified biomedical texts and produces ad hoc summaries oriented to biological information. Summary generation is driven by a novel grading function, which biases sentence selection by means of an appropriate domain dictionary. In the current version of BioSumm, in order to focus on a biological target, the dictionary contains gene and protein names and aliases. BioSumm is neither a traditional summarizer nor an extractor of dictionary terms. It is designed to be a summarizer oriented to the biological domain. Thus, its summaries have both the expressive power of traditional summaries and the domain specificity of documents produced by a dictionary entry extractor. The final goal of the project is to provide a powerful automatic instrument that supports both knowledge inference from scientific papers and the biological validation of gene/protein interactions obtained in other ways (e.g., with other data mining techniques).
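
One simple way to picture a dictionary-biased grading function is shown below. The sketch is an assumption made for illustration and not the BioSumm grading function: the names `dictionary_hits` and `grade`, the multiplicative `boost` parameter, and the sample dictionary are all hypothetical.

```python
import re


def dictionary_hits(sentence, dictionary):
    """Count dictionary entries mentioned in the sentence by name or alias.

    `dictionary` maps a canonical gene/protein name to a set of lowercase
    names and aliases (a hypothetical layout chosen for this example).
    """
    tokens = set(re.findall(r"[a-z0-9\-]+", sentence.lower()))
    return sum(1 for aliases in dictionary.values() if aliases & tokens)


def grade(sentence, base_score, dictionary, boost=0.5):
    """Raise a generic sentence score for each dictionary entry mentioned."""
    return base_score * (1.0 + boost * dictionary_hits(sentence, dictionary))


# Hypothetical dictionary with names and aliases.
genes = {"TP53": {"tp53", "p53"}, "BRCA1": {"brca1"}}
```

Because the generic score is only scaled and never replaced, a sentence with no dictionary term can still enter the summary, while sentences naming genes or proteins are favored.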

More information may be found in the BioSumm specific page.

News summarization driven by User-Generated Content

The outstanding growth of the Internet has made available to analysts a huge and increasing amount of Web documents (e.g., news articles) and user-generated content (e.g., social network posts) coming from social networks and online communities, which are worth considering together. On the one hand, the need for novel and more effective approaches to summarizing Web document collections makes the application of data mining techniques established in other research contexts increasingly appealing. On the other hand, to generate appealing summaries, the data mining and knowledge discovery process cannot disregard the main interests of Web users.

An interesting research issue is the study of novel news document summarization systems that generate succinct, non-redundant, yet appealing summaries by means of a data mining and knowledge discovery process driven by messages posted on social networks. Sentence relevance evaluators that take into account term significance in a collection of social network posts covering the same news topics may be exploited to improve the performance of existing summarization systems. This approach avoids discarding sentences whose terms rarely occur in the news collection but are deemed relevant by Web users.
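
The effect described above can be sketched by blending term weights computed on the news collection with term weights computed on the social network posts. The sketch is illustrative only: the function names, the normalized-frequency weighting, and the mixing parameter `alpha` are assumptions, not part of any system described on this page.

```python
import re
from collections import Counter


def words(text):
    """Split text into lowercase alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


def term_weights(texts):
    """Term frequency over a collection, normalized by the most frequent term."""
    counts = Counter(w for text in texts for w in words(text))
    top = max(counts.values(), default=1)
    return {w: c / top for w, c in counts.items()}


def relevance(sentence, news_w, social_w, alpha=0.5):
    """Average blended weight of the distinct terms in a sentence.

    alpha = 1.0 uses the news collection only; lower values give more
    influence to term significance in the social network posts.
    """
    terms = set(words(sentence))
    if not terms:
        return 0.0
    blended = (
        alpha * news_w.get(t, 0.0) + (1 - alpha) * social_w.get(t, 0.0)
        for t in terms
    )
    return sum(blended) / len(terms)
```

With this blending, a term that is rare in the news articles but frequent in the posts contributes more to a sentence's relevance than it would under a news-only score, so the sentence is less likely to be dropped.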

Publications