3.3.3. Summarizing
This block is the core of the BioSumm framework. It provides, separately for each cluster, an ad hoc summary containing the sentences that are potentially more useful for inferring knowledge of gene/protein relationships. The summary is an extract that reports the highest-scoring sentences of the original document collection.
This block is a multi-document summarizer based on the Open Text Summarizer (OTS). Our summarizer scans each document and assigns a score to each sentence by means of a grading function. The sentences with the highest scores are selected to build a summary containing a given percentage of the original text. This percentage, which is set by the user, is called the summarization ratio.
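The selection step can be illustrated with a minimal Python sketch. It is not the OTS or BioSumm implementation: the function name select_extract is hypothetical, the summarization ratio is assumed to apply to the number of sentences (the text only says "percentage of the original text"), and the scores are assumed to have been computed already.

```python
import math


def select_extract(sentences, scores, ratio):
    """Return the highest-scoring sentences, kept in their original order.

    Assumption: `ratio` is the fraction of sentences to keep, in (0, 1].
    """
    if not 0 < ratio <= 1:
        raise ValueError("ratio must be in (0, 1]")
    if len(sentences) != len(scores):
        raise ValueError("sentences and scores must have the same length")
    if not sentences:
        return []

    # Number of sentences to keep; always keep at least one.
    k = max(1, math.ceil(ratio * len(sentences)))

    # Rank indices by descending score; ties go to the earlier sentence.
    ranked = sorted(range(len(sentences)), key=lambda i: (-scores[i], i))

    # Restore document order so the extract reads naturally.
    keep = sorted(ranked[:k])
    return [sentences[i] for i in keep]
```

For example, with four sentences and a ratio of 0.5, the two highest-scoring sentences are returned in the order in which they appear in the source text.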
The BioSumm summarizer is designed to look for the presence of domain-specific words. Therefore, we define a dictionary G which stores the domain words (i.e. gene and protein names). We built the dictionary by querying the publicly available BioGRID database, which contains over 200,000 interactions from six different species. The user can, however, supply a different database containing terms from any other subject area.
The grading function establishes a score for each sentence, taking into account the presence of the domain-specific words contained in the dictionary. See the grading function parameters section for details on how sentence scores are calculated.
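To make the role of the dictionary G concrete, the following Python sketch shows one simple way a dictionary-aware score could be computed. It is an illustration only and not the BioSumm grading function, whose actual form and parameters are given in the grading function parameters section. The names count_domain_terms and grade, the additive formula, and the parameters base_score and bonus are all assumptions introduced here.

```python
import re


def count_domain_terms(sentence, dictionary):
    """Count tokens of `sentence` that occur in `dictionary`.

    Assumption: `dictionary` is a set of lower-cased gene/protein names.
    """
    tokens = re.findall(r"[a-z0-9-]+", sentence.lower())
    return sum(1 for token in tokens if token in dictionary)


def grade(sentence, dictionary, base_score=1.0, bonus=0.5):
    """Hypothetical score: a base value plus a bonus per dictionary hit."""
    return base_score + bonus * count_domain_terms(sentence, dictionary)


# Toy dictionary G; a real one would be built from a database such as BioGRID.
G = {"brca1", "tp53"}

with_genes = grade("BRCA1 interacts with TP53.", G)
without_genes = grade("No gene names appear here.", G)
```

Under these assumptions, a sentence mentioning two dictionary entries scores higher than one mentioning none, so it is more likely to be retained in the extract.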