
Pattern extraction from textual document collections using heterogeneous networks

Grant number: 11/12823-6
Support Opportunities: Scholarships in Brazil - Doctorate
Effective date (Start): October 01, 2011
Effective date (End): September 30, 2015
Field of knowledge: Physical Sciences and Mathematics - Computer Science
Principal Investigator: Solange Oliveira Rezende
Grantee: Rafael Geraldeli Rossi
Host Institution: Instituto de Ciências Matemáticas e de Computação (ICMC). Universidade de São Paulo (USP). São Carlos, SP, Brazil


Due to the large number of textual document collections available today, there is a need for techniques that automatically extract knowledge from these collections and organize them. Documents are normally represented in a vector space model, in which each document is a vector and each position of the vector corresponds to a feature of the document, for example the frequency of a word. Pattern extraction methods that use this representation assume that the documents in a collection, as well as their features, are independent. However, this assumption can lead to erroneous results. To avoid this error, some representations model textual documents as networks. In this type of representation, however, traditional algorithms assume that the network is composed of objects of a single type and relations of a single type, i.e., that the network is homogeneous. This limitation can be overcome by representing text with heterogeneous networks, i.e., by representing documents with different types of objects, such as the document's terms or authors. Different types of relationships among these objects can also be represented. However, relationships between objects of the same type are rarely used in heterogeneous networks. Our hypothesis is that this kind of relationship can also help pattern extraction. To test this hypothesis, this PhD project proposes a representation of textual document collections using heterogeneous networks, and will study which ways of relating objects of the same type in a heterogeneous network produce better results for classification and clustering of textual documents. Algorithms will be adapted or developed for pattern extraction using the proposed representation. (AU)
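As a rough illustration of the two representations contrasted in the abstract, the sketch below builds term-frequency vectors for a toy collection and then a bipartite document-term network from them. It is a minimal example under stated assumptions, not the project's method: the function names (`term_frequencies`, `bipartite_network`), the whitespace tokenization, the edge weighting by raw frequency, and the two sample documents are all hypothetical choices made here for illustration.

```python
from collections import Counter, defaultdict


def term_frequencies(docs):
    """Vector space model: one term-frequency Counter per document.

    Tokenization by lowercase whitespace split is an assumption of this sketch.
    """
    return {doc_id: Counter(text.lower().split()) for doc_id, text in docs.items()}


def bipartite_network(tf):
    """Heterogeneous (bipartite) network with two object types: documents and terms.

    Each document-term edge is weighted by the term's frequency in the document
    (an illustrative weighting, assumed here).
    """
    edges = defaultdict(dict)
    for doc_id, counts in tf.items():
        for term, freq in counts.items():
            edges[("doc", doc_id)][("term", term)] = freq
            edges[("term", term)][("doc", doc_id)] = freq
    return dict(edges)


# Hypothetical toy collection.
docs = {
    "d1": "networks model text",
    "d2": "text classification with networks",
}

tf = term_frequencies(docs)
net = bipartite_network(tf)

# Terms of d1 that also link to d2: in the network, the two documents are no
# longer independent, because they are connected through shared term nodes.
shared = sorted(
    term for (_, term) in net[("doc", "d1")] if ("doc", "d2") in net[("term", term)]
)
print(shared)
```

In this sketch the only relations are between objects of different types (documents and terms). Relations between objects of the same type, which the project proposes to study, would appear as additional document-document or term-term edges.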


Scientific publications (5)
(References retrieved automatically from Web of Science and SciELO through information on FAPESP grants and their corresponding numbers as mentioned in the publications by the authors)
ROSSI, RAFAEL GERALDELI; LOPES, ALNEU DE ANDRADE; REZENDE, SOLANGE OLIVEIRA. Using bipartite heterogeneous networks to speed up inductive semi-supervised learning and improve automatic text categorization. KNOWLEDGE-BASED SYSTEMS, v. 132, p. 94-118, . (11/12823-6, 14/08996-0, 15/14228-9)
SOUZA, VINICIUS M. A.; ROSSI, RAFAEL G.; BATISTA, GUSTAVO E. A. P. A.; REZENDE, SOLANGE O. Unsupervised active learning techniques for labeling training sets: An experimental evaluation on sequential data. Intelligent Data Analysis, v. 21, n. 5, p. 1061+, . (14/08996-0, 11/12823-6, 11/17698-5)
ROSSI, RAFAEL GERALDELI; LOPES, ALNEU DE ANDRADE; REZENDE, SOLANGE OLIVEIRA. Optimization and label propagation in bipartite heterogeneous networks to improve transductive classification of texts. INFORMATION PROCESSING & MANAGEMENT, v. 52, n. 2, p. 217-257, . (11/12823-6, 11/22749-8, 14/08996-0)
FALEIROS, THIAGO DE PAULO; ROSSI, RAFAEL GERALDELI; LOPES, ALNEU DE ANDRADE. Optimizing the class information divergence for transductive classification of texts using propagation in bipartite graphs. PATTERN RECOGNITION LETTERS, v. 87, n. SI, p. 127-138, . (11/12823-6, 11/22749-8, 15/14228-9)
ROSSI, RAFAEL GERALDELI; LOPES, ALNEU DE ANDRADE; FALEIROS, THIAGO DE PAULO; REZENDE, SOLANGE OLIVEIRA. Inductive Model Generation for Text Classification Using a Bipartite Heterogeneous Network. JOURNAL OF COMPUTER SCIENCE AND TECHNOLOGY, v. 29, n. 3, p. 361-375, . (11/12823-6, 11/23689-9, 11/19850-9)
Academic Publications
(References retrieved automatically from State of São Paulo Research Institutions)
ROSSI, Rafael Geraldeli. Text automatic classification through machine learning based on networks. 2015. Doctoral Thesis - Universidade de São Paulo (USP). Instituto de Ciências Matemáticas e de Computação (ICMC/SB) São Carlos.
