Product2Vec: semantic representation for e-commerce products using machine learning


Consumers are conducting longer and longer exploratory research, drawing on a growing number of sources and types of information. E-commerce websites, YouTube channels, offer aggregators, news outlets, discussion forums, manufacturer sites and social networks are examples of the sources consumers can use to decide on a purchase. These sources present different types of information, such as product descriptions, specifications, consumer and expert reviews, demonstration videos, product images, and questions and answers. This large amount of information, scattered across many places, has made the search and purchase decision journey increasingly long and left consumers increasingly insecure. Recent advances in multi-view (multimodal) data representations in machine learning and deep learning have supported new applications that facilitate the personalization and exploitation of this information. Applying these advances to the e-commerce product domain (and its related content) is a scientific and technical challenge and the main objective of this research project. Birdie has already invested technical and scientific research effort in collecting and structuring data from this domain in order to develop technologies that enable applications to help consumers, such as semantic search and the aggregation and customization of different types of information. The company has created a database with more than 50 million records, including offers, reviews, images, and questions and answers from 420 different sources, and more than 5 million prices are monitored and stored daily. These data are being used to create applications and to implement and evaluate traditional machine learning tasks such as classifying these records into categories, sentiment analysis of reviews, and structuring product descriptions, among others.
These efforts have resulted in some products, such as the module for automatic matching (de-duplication) of different offers of the same product, and the consolidation and aggregation of different product information. A demonstration of both as a consumer end product for smartphones has been made available. To build them, traditional machine learning methods that rely heavily on human validation, such as sample labeling, dictionary creation, and reference lists (brands, categories and models), were used. This reliance limits the scalability of the solution and makes it difficult to expand into several product categories. On the other hand, recent advances in machine learning that use concepts from deep learning, similarity regularization and heterogeneous network models require a large volume of data to work properly but need only a few labeled examples (semi-supervised learning), allowing for less human labeling effort in the learning process and greater generality and scalability. Following these assumptions, the main objective of this research project is to adapt and incorporate recent machine learning methods that deal with heterogeneous data in order to structure the large amount of information contained in the e-commerce domain. The final result of structuring this information is named Product2Vec in this proposal: a new representation of e-commerce products that integrates product specs, reviews, comments, and several other metadata. This new data representation can directly correlate different types of information and provide greater flexibility and scalability for creating applications in the company's domain. (AU)
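
The semi-supervised idea mentioned above, where a few labeled examples are spread over a heterogeneous network, can be illustrated with a minimal sketch. This is not the project's method: the product names, terms, labels and the `propagate_labels` function are all hypothetical, and the technique shown is plain label propagation over a bipartite product-term graph, chosen only because it is a simple instance of learning from few labels.

```python
import numpy as np

# Hypothetical toy data: each product is linked to terms that could come
# from different information types (specs, reviews, questions and answers).
products = {
    "phone_a": ["android", "dual_sim", "great_camera"],
    "phone_b": ["android", "great_camera", "battery_life"],
    "phone_c": ["battery_life", "dual_sim", "bluetooth"],
    "tv_a": ["4k", "hdmi", "smart_tv"],
    "tv_b": ["4k", "smart_tv", "wall_mount"],
    "tv_c": ["hdmi", "wall_mount", "bluetooth"],
}
# Only two products carry a human-provided label.
labels = {"phone_a": "smartphone", "tv_a": "television"}


def propagate_labels(products, labels, n_iter=20):
    """Spread the few known labels over a bipartite product-term graph."""
    names = list(products)
    terms = sorted({t for ts in products.values() for t in ts})
    classes = sorted(set(labels.values()))

    # Product-term adjacency matrix of the bipartite graph.
    adjacency = np.zeros((len(names), len(terms)))
    for i, name in enumerate(names):
        for term in products[name]:
            adjacency[i, terms.index(term)] = 1.0

    # Row-normalized averaging matrices for both directions.
    product_from_terms = adjacency / adjacency.sum(axis=1, keepdims=True)
    term_from_products = adjacency.T / adjacency.T.sum(axis=1, keepdims=True)

    # Seed matrix: one-hot rows for labeled products, zeros elsewhere.
    seeds = np.zeros((len(names), len(classes)))
    for name, cls in labels.items():
        seeds[names.index(name), classes.index(cls)] = 1.0
    is_labeled = seeds.sum(axis=1) > 0

    scores = seeds.copy()
    for _ in range(n_iter):
        term_scores = term_from_products @ scores   # terms average their products
        scores = product_from_terms @ term_scores   # products average their terms
        scores[is_labeled] = seeds[is_labeled]      # keep known labels fixed

    predicted = {n: classes[int(np.argmax(scores[i]))] for i, n in enumerate(names)}
    return predicted, scores


predicted, scores = propagate_labels(products, labels)
for name, cls in predicted.items():
    print(name, cls)
```

In this toy graph the shared term `bluetooth` connects the two groups, yet each unlabeled product still receives more score from the labeled product it shares most terms with. The rows of `scores` can also be read as a very small vector representation of each product, in the spirit of the representation described above.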
