
Polymer featurization strategies for machine learning tasks in the small-data regime

Grant number: 22/13536-5
Support Opportunities: Scholarships abroad - Research Internship - Doctorate
Effective date (Start): February 28, 2023
Effective date (End): February 27, 2024
Field of knowledge: Physical Sciences and Mathematics - Computer Science - Computing Methodologies and Techniques
Principal Investigator: Marcos Gonçalves Quiles
Grantee: Gabriel Augusto Lins Leal Pinheiro
Supervisor: Cory Simon
Host Institution: Instituto de Ciência e Tecnologia (ICT). Universidade Federal de São Paulo (UNIFESP). Campus São José dos Campos. São José dos Campos, SP, Brazil
Research place: Oregon State University (OSU), United States
Associated to the scholarship: 21/08852-2 - Molecular property prediction with high accuracy: a semi-supervised learning approach, BP.DR


Machine learning (ML) methods for reducing the time and cost of materials discovery have achieved remarkable results. Most of these advances, however, concern small molecules. Fields such as polymer informatics, which involves designing macromolecules, are therefore still in the early stages of exploring such techniques to learn structure-property relationships. Among the challenges of ML for polymers are the small-data regime and the lack of a natural machine-readable representation. The featurization process, in particular, plays an essential role in the success of an ML algorithm: a proper featurization strategy can reduce the amount of data and time needed to train it. To accomplish this task, researchers can employ domain knowledge and algorithms that learn from generic priors. One strategy for designing featurization methods for polymers is to use information about their constitutional repeat units (CRUs) and topology. In this context, this project aims to tackle these challenges by first developing a handcrafted, graphlet-based descriptor that describes polymers by their composition in terms of CRUs and by the precise connectivity of those CRUs. The main advantage of this approach is that it builds a fingerprint that (1) contains a representative set of substructures for the data set; and (2) can encode both vertex and edge labels, rather than vertex labels alone, as is commonly done in the literature. Second, we plan to extend our recent contrastive learning framework, SMICLR, to perform representation learning for polymers. Two critical challenges that the proposed framework will address are (1) adopting a neural network encoder that can learn patterns from large molecules; and (2) reducing faulty negative examples in contrastive tasks via our graphlet-based method.
In doing so, we hope that both proposed approaches will enable accurate ML on polymers in the small-data regime and thus accelerate the discovery of new polymers. (AU)
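
As a rough illustration of the graphlet idea described in the abstract, the sketch below counts small labeled substructures in a toy polymer graph and turns the counts into a fixed-length fingerprint. It is not the project's descriptor. The function names (`canonical`, `graphlet_counts`, `fingerprint`), the restriction to path-shaped graphlets of at most three vertices, and the toy labels ("A", "B", "s", "d") are all assumptions made for this example.

```python
from collections import Counter
from itertools import combinations


def canonical(seq):
    """Return the smaller of a label sequence and its reverse,
    so a path read from either end maps to the same key."""
    seq = tuple(seq)
    return min(seq, seq[::-1])


def graphlet_counts(vertex_labels, edges):
    """Count labeled path graphlets with 1, 2, and 3 vertices.

    vertex_labels: dict mapping vertex id -> vertex label (e.g. a CRU type)
    edges: list of (u, v, edge_label) for a simple undirected graph
    """
    adjacency = {v: {} for v in vertex_labels}
    for u, v, label in edges:
        adjacency[u][v] = label
        adjacency[v][u] = label

    counts = Counter()
    # 1-vertex graphlets: vertex labels only
    for label in vertex_labels.values():
        counts[(label,)] += 1
    # 2-vertex graphlets: vertex label, edge label, vertex label
    for u, v, label in edges:
        counts[canonical((vertex_labels[u], label, vertex_labels[v]))] += 1
    # 3-vertex paths: every pair of neighbours around a centre vertex
    for centre, nbrs in adjacency.items():
        for a, b in combinations(sorted(nbrs), 2):
            key = (vertex_labels[a], nbrs[a], vertex_labels[centre],
                   nbrs[b], vertex_labels[b])
            counts[canonical(key)] += 1
    return counts


def fingerprint(counts, vocabulary):
    """Project graphlet counts onto a fixed, ordered vocabulary."""
    return [counts.get(g, 0) for g in vocabulary]


# Toy data set (hypothetical): chain A-B-A with "s" links, and dimer A-A with a "d" link
poly1 = graphlet_counts({0: "A", 1: "B", 2: "A"}, [(0, 1, "s"), (1, 2, "s")])
poly2 = graphlet_counts({0: "A", 1: "A"}, [(0, 1, "d")])

# The vocabulary is the set of graphlets observed in the data set
vocabulary = sorted(set(poly1) | set(poly2))
fp1 = fingerprint(poly1, vocabulary)
fp2 = fingerprint(poly2, vocabulary)
```

Because the edge label is part of each graphlet key, two polymers with the same CRUs but different linkage types receive different fingerprints, which is the property the abstract attributes to encoding both vertex and edge labels.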
