
QUEST - A Zero-Shot Information Retrieval and Summarization System

Abstract

In phase 1 of the PIPE project, we developed a system for analyzing jurisprudence based on deep learning models, which have proven to be significantly superior to classic models such as BM25. Despite its quality, this search system is limited, like competing systems, to showing the answer as a list of "10 blue links". This requires the user to read and consolidate texts from different sources, which are often unrelated or even contradictory. A frequent market demand is therefore for systems that consolidate and elaborate complex answers on different topics in a concise way. In this project, we propose the development of a system for searching and consolidating information using deep learning models based on the zero-shot paradigm. More specifically, given a topic (i.e., a query or question) provided by the user, the proposed system performs three stages: 1) search and retrieval of a large number of documents possibly relevant to the topic; 2) discovery of subtopics; 3) mapping of representative documents to each subtopic. Finally, the system shows the user a document in a format similar to a Wikipedia article, with multiple sections and links to representative documents for each of them. We have solid experience in the development of the first and third stages: in addition to the positive results obtained in phase 1 of the PIPE project, our team won more than six international information retrieval competitions between 2020 and 2021 (Nogueira et al., 2020; Pradeep et al., 2021), all using the same system with small adaptations. The second stage is responsible for grouping and summarizing information: it detects the most common subtopics covered by the documents returned in the first stage, and the most representative documents of each subtopic are then selected and shown to the user. This stage will use Corpus2Question (Surita et al., 2020), a model developed in partnership with students we supervise at UNICAMP to detect topics and trends with higher quality than classic models such as Latent Dirichlet Allocation (LDA). One of the main challenges in the development of this system is establishing an objective methodology for evaluating the quality of the results. To this end, we will create, with the help of specialists, a validation dataset that contains examples of: 1) topics (queries/questions) of interest; 2) their respective subtopics; 3) the documents relevant to each subtopic. Another challenge is the lack of training data. Our strategy here is to use zero-shot models, that is, models trained on data from domains (or languages) different from those used at inference time. Because they require no training in the specific domain, they can be readily applied to new domains. Recent results from the scientific literature and our own experiments in phase 1 of the PIPE project demonstrate that these models perform better than models trained for a final task. (AU)
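
The three stages described in the abstract can be illustrated with a toy sketch. This is not the QUEST implementation: term overlap stands in for the neural ranker of stage 1, document-frequency counting stands in for Corpus2Question in stage 2, and every name here (`CORPUS`, `retrieve`, `discover_subtopics`, `build_report`) is hypothetical and chosen only for illustration.

```python
from collections import Counter

# Hypothetical toy corpus; a real system would search a large document collection.
CORPUS = [
    "tax exemption for exported goods",
    "tax penalty for late payment",
    "tax exemption for nonprofit entities",
    "labor contract termination rules",
    "tax penalty appeal deadline",
]

STOPWORDS = {"the", "a", "of", "in", "for", "and", "to", "is", "on"}


def tokenize(text):
    """Lowercase, split on whitespace, and drop stopwords."""
    return [w for w in text.lower().split() if w not in STOPWORDS]


def retrieve(query, corpus, k=100):
    """Stage 1 (assumed stand-in): rank documents by term overlap with the query."""
    q = set(tokenize(query))
    scored = [(len(q & set(tokenize(doc))), i) for i, doc in enumerate(corpus)]
    # Highest overlap first; ties are broken by document index.
    ranked = sorted((s for s in scored if s[0] > 0), key=lambda t: (-t[0], t[1]))
    return [i for _, i in ranked[:k]]


def discover_subtopics(query, corpus, doc_ids, n_subtopics=2):
    """Stage 2 (assumed stand-in): the most frequent non-query terms act as subtopics."""
    q = set(tokenize(query))
    df = Counter()
    for i in doc_ids:
        df.update(set(tokenize(corpus[i])) - q)
    # Sort by document frequency, then alphabetically, so the result is deterministic.
    ordered = sorted(df.items(), key=lambda t: (-t[1], t[0]))
    return [term for term, _ in ordered[:n_subtopics]]


def build_report(query, corpus, n_subtopics=2, docs_per_subtopic=3):
    """Stage 3: map each subtopic to its representative documents, in rank order."""
    doc_ids = retrieve(query, corpus)
    report = {}
    for topic in discover_subtopics(query, corpus, doc_ids, n_subtopics):
        matches = [i for i in doc_ids if topic in tokenize(corpus[i])]
        report[topic] = matches[:docs_per_subtopic]
    return report


if __name__ == "__main__":
    for topic, ids in build_report("tax", CORPUS).items():
        print(topic, [CORPUS[i] for i in ids])
```

Each key of the returned dictionary plays the role of one section of the Wikipedia-like output, and the document indices are the links shown under that section. Since none of the three functions is trained on the corpus, the sketch also mirrors, very loosely, the zero-shot setting in which nothing is fitted to the target domain.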

