Data streams are data sources of undefined, potentially infinite size that generate examples whose statistical distribution may change over time. These characteristics pose additional challenges to the design and use of knowledge extraction algorithms and prevent the direct use of traditional machine learning algorithms. To extract useful knowledge in dynamic environments, machine learning methods must be adapted to take new data into account continuously. Over the last decade, a growing number of methods that learn from data streams have appeared. Since many streams lack label information, clustering algorithms are of great interest in this context. Among these, two approaches can be identified: those based on the on-line/off-line framework and those based on data chunks. Recent work by the research group has shown that the flexibility provided by fuzzy clustering can benefit knowledge extraction from data streams, although it has been scarcely explored in the literature. Following the group's line of research, this work deals with knowledge extraction from data streams by means of fuzzy clustering algorithms based on the data chunks approach. The objective is to study, implement and evaluate selected algorithms in order to analyze their behavior, supporting comparisons with algorithms developed by other members of the research group to which this work belongs. The candidate for this scholarship is already carrying out an undergraduate research (scientific initiation) project, in which basic algorithms of the adopted approach are being implemented.
The work proposed here continues the work already under way, focusing on methods that specifically address two issues: responding to changes in the data distribution that can occur along the stream, and dynamically defining the number of clusters. The most common strategy for handling changes in the data distribution is to lower the influence of the oldest data on the current clustering, which allows the clustering to adapt to the most recent changes. For this reason, two of the algorithms selected for study in this work have a decay factor that defines the rate at which old data are forgotten. The other issue to be considered is the number of clusters, since the extended algorithms are partitional algorithms that require this number to be defined in advance. The third algorithm selected for this work presents a simple proposal for defining the number of clusters dynamically. The implementations will be done in the R language, and the experiments will use datasets from public repositories as well as from the research group's repository, including labeled, unlabeled, stationary and non-stationary data. The comparative analysis will include the algorithms studied in this work, those of the same approach implemented in the previous stage, and those developed by other group members. Besides the implementation of the algorithms and a set of experimental runs, the expected results include meaningful and useful quantitative and qualitative analyses for this and for other works of the group. This work will also train the student in a recent and relevant research field, machine learning for data streams, broadening the student's opportunities for professional practice and growth.
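The decay-based forgetting strategy described above can be illustrated with a minimal sketch. It is not one of the algorithms selected for this work, which the abstract does not name: it is a generic weighted fuzzy c-means applied chunk by chunk, written in Python for illustration only (the project itself will use R). The function names `weighted_fcm` and `cluster_stream`, the `decay` parameter, and the choice to carry each cluster forward as a single weighted centroid are all assumptions made for this example.

```python
import math


def weighted_fcm(points, weights, centers, m=2.0, iters=100, eps=1e-9):
    """Weighted fuzzy c-means on one batch; returns (centers, cluster_weights)."""
    k, dim = len(centers), len(points[0])
    memberships = []
    for _ in range(iters):
        # Membership update: u_ij is proportional to d_ij^(-2 / (m - 1)).
        memberships = []
        for p in points:
            d2 = [max(sum((a - b) ** 2 for a, b in zip(p, c)), eps)
                  for c in centers]
            inv = [d ** (-1.0 / (m - 1.0)) for d in d2]
            total = sum(inv)
            memberships.append([v / total for v in inv])
        # Center update: mean of the points weighted by w_i * u_ij^m.
        new_centers = []
        for j in range(k):
            coef = [w * u[j] ** m for w, u in zip(weights, memberships)]
            denom = sum(coef)
            new_centers.append(
                [sum(c * p[d] for c, p in zip(coef, points)) / denom
                 for d in range(dim)]
            )
        shift = max(math.dist(a, b) for a, b in zip(centers, new_centers))
        centers = new_centers
        if shift < 1e-6:
            break
    # Total (weighted) membership mass captured by each cluster.
    cluster_weights = [sum(w * u[j] for w, u in zip(weights, memberships))
                       for j in range(k)]
    return centers, cluster_weights


def cluster_stream(chunks, init_centers, decay=0.5, m=2.0):
    """Process chunks in order; old centroids re-enter with decayed weight.

    The number of clusters is fixed by init_centers (assumption of this sketch).
    """
    centers = [list(c) for c in init_centers]
    summary_weights = [0.0] * len(centers)
    for chunk in chunks:
        # New examples get weight 1; each previous centroid is treated as one
        # point whose weight is its accumulated mass multiplied by decay.
        points = [list(p) for p in chunk] + centers
        weights = [1.0] * len(chunk) + [decay * w for w in summary_weights]
        centers, summary_weights = weighted_fcm(points, weights, centers, m)
    return centers, summary_weights
```

With `decay` close to 0 the previous centroids carry almost no weight, so the clustering follows the most recent chunk; with `decay` equal to 1 nothing is forgotten and the centroids move more slowly when the distribution drifts. The number of clusters is fixed by the initial centroids here, which is exactly the limitation that the third selected algorithm aims to remove.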