Text can carry timely information that aids business decisions, and it can express information that is difficult to represent in other formats. This information can be hard to interpret because: 1. natural language can be ambiguous or contradictory, 2. natural language can contain euphemisms and invented language, and 3. a single concept can be represented by many words or terms.

The field of Natural Language Processing provides a number of methodologies to address this problem. Topic detection and sentiment classification are the most relevant methodologies for extracting information from textual sources, but each has flaws. Topic detection can identify latent topics in a document collection, but it has no mechanism for applying that information to a specific problem. Sentiment classification, and in particular feature-based sentiment analysis, allows the emotions expressed in text to be attributed to a specific feature of a product or target entity. However, feature-based sentiment analysis does not determine the importance or rank of a feature. For example, it cannot determine whether the performance of a car is more or less important than its fuel consumption.

This project seeks to advance the state of the art in extracting and ranking information in text to make inferences about an external problem. The external problem is the prediction of yields of future crop harvests. Agricultural news contains information from which an informed forecast of harvest yields can be made, for example weather reports or pest numbers. Textual information contains topics, which are groups of statistically related words. The information contained in these topics and its direct relationship with crop yields are currently unknown. The project will seek to model these relationships by constructing a structured model of the specific domains from topics contained in agricultural news.
The topics will contain features, together with events or sentiment that can be assigned to those features, so that each topic can be labelled as negative or positive. The structured model will allow the prediction of relationships between topics. For example, a certain sequence of weather conditions may increase the pest population. The structured model and its inferred relations can be used to make inferences specific to crop yields.

This proposed approach addresses some weaknesses in the application of structured methods to predict crop yields. The literature review conducted for this project concluded that Bayesian Networks for crop prediction are constructed from previously known factors, for example weather or pesticide spraying regimes. The proposed project seeks to learn Bayesian Networks directly from news text through the identification of topics and their interrelations. This unsupervised or semi-supervised approach will allow the identification of latent factors which may improve the predictive capability of a Bayesian Network.
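As an illustration only, the idea of labelling a topic as negative or positive from the sentiment attached to its words might be sketched as follows. This is a minimal sketch and not the project's method: the lexicon, its polarity scores, the topic names and word lists, and the `label_topic` helper are all hypothetical, and the mean-polarity rule is an assumed simplification.

```python
# Hypothetical sentiment lexicon: word -> polarity score (assumed values).
LEXICON = {
    "drought": -1.0,
    "frost": -1.0,
    "pest": -1.0,
    "rain": 0.5,
    "healthy": 1.0,
    "record": 1.0,
}


def label_topic(topic_words, lexicon=LEXICON):
    """Label a topic from the mean polarity of its words found in the lexicon."""
    scores = [lexicon[word] for word in topic_words if word in lexicon]
    if not scores:
        # No word in the topic carries sentiment in this lexicon.
        return "neutral"
    mean = sum(scores) / len(scores)
    if mean > 0:
        return "positive"
    if mean < 0:
        return "negative"
    return "neutral"


# Hypothetical topics, each a group of words assumed to co-occur in news text.
topics = {
    "weather_stress": ["drought", "frost", "temperature", "soil"],
    "good_season": ["rain", "healthy", "record", "harvest"],
}

labels = {name: label_topic(words) for name, words in topics.items()}
```

In a fuller pipeline the labelled topics, rather than hand-picked factors, would presumably serve as candidate nodes of the Bayesian Network described above.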