Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks

Der Steppenwolf

Lewis, Patrick, Ethan Perez, Aleksandra Piktus, Fabio Petroni, Vladimir Karpukhin, Naman Goyal, Heinrich Küttler, et al. "Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks." Advances in Neural Information Processing Systems 33 (2020): 9459–9474.

What problem does this paper address?

Combining parametric and non-parametric memory for knowledge-intensive tasks.

What is the motivation of this paper?

  • Pre-trained LMs
    • have limited ability to access and precisely manipulate knowledge
    • cannot update the knowledge stored in their memory
    • cannot provide provenance for their predictions and may produce hallucinations
  • Hybrid models with a differentiable access mechanism
    • have so far only been investigated for extractive open-domain QA

Research Questions:

  • How can non-parametric memory be integrated into pre-trained language models that possess parametric memory?
  • Which approach is better: conditioning all tokens in the generated sequence on the same retrieved passage (“per-output basis”) or conditioning each token on different retrieved passages (“per-token basis”)?

What are knowledge-intensive tasks?

Tasks that humans could not reasonably be expected to perform without access to an external knowledge base (KB).

What are the main contributions of this paper?

  • This paper proposes Retrieval-Augmented Generation (RAG), a new general-purpose fine-tuning approach.
  • This new approach combines the model’s parametric knowledge obtained during pre-training with non-parametric knowledge stored in an external KB.
  • This paper proposes an end-to-end training paradigm for RAG.
  • This paper evaluates RAG on generative knowledge-intensive tasks.

How does RAG work?

Given a natural language query $x$, the retriever encodes $x$ into a query embedding $\mathbf{q}(x)$ using a query encoder, and retrieves the indices of the top-$k$ most similar documents from a fixed document index built with a Dense Passage Retriever (DPR) document encoder. The query $x$ is then concatenated with each retrieved document $z$ to form the input to the generator. The generator produces the output $y$ token by token, conditioned on the query $x$, the document $z$ (or documents $z_1, \dots, z_k$), and all previously generated tokens $y_{1:i-1}$.
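As a concrete illustration, here is a minimal end-to-end sketch using the Hugging Face `transformers` implementation of RAG (not the paper's original code release); the checkpoint name `facebook/rag-sequence-nq` and the `use_dummy_dataset` flag follow that library's documented example and are assumptions about your environment:

```python
from transformers import RagTokenizer, RagRetriever, RagSequenceForGeneration

tokenizer = RagTokenizer.from_pretrained("facebook/rag-sequence-nq")
# use_dummy_dataset=True loads a tiny toy index instead of the full Wikipedia index.
retriever = RagRetriever.from_pretrained(
    "facebook/rag-sequence-nq", index_name="exact", use_dummy_dataset=True
)
model = RagSequenceForGeneration.from_pretrained("facebook/rag-sequence-nq", retriever=retriever)

inputs = tokenizer("who wrote the novel Steppenwolf?", return_tensors="pt")
# The model internally retrieves top-k passages, concatenates them with the query,
# and marginalizes over them while generating the answer.
generated_ids = model.generate(input_ids=inputs["input_ids"])
print(tokenizer.batch_decode(generated_ids, skip_special_tokens=True))
```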

RAG-Sequence model vs. RAG-Token model

The generation process can rely on two different types of models: RAG-Token and RAG-Sequence.

RAG-Token:

  • can be viewed as a standard auto-regressive sequence-to-sequence generator whose per-token transition probability marginalizes over documents
  • generates each target token in the output conditioned on a different latent document
  • the generator produces a next-token distribution for each retrieved document, then marginalizes over documents before emitting the token
  • decoding uses a single standard beam search over the marginalized token distribution
  • advantage: can draw information from multiple documents when composing a single answer (see the sketch after this list)
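A minimal sketch of the per-token marginalization, with toy numpy arrays standing in for real retriever scores and generator outputs (the shapes and values are illustrative assumptions, not the paper's implementation):

```python
import numpy as np

# Toy setup: k = 3 retrieved documents, vocabulary of 5 tokens.
doc_scores = np.array([2.0, 1.0, 0.5])                 # retriever scores before softmax
p_z_given_x = np.exp(doc_scores) / np.exp(doc_scores).sum()   # p_eta(z | x) over top-k docs

# Next-token distribution produced by the generator for each document: shape (k, vocab).
p_token_given_doc = np.array([
    [0.70, 0.10, 0.10, 0.05, 0.05],
    [0.10, 0.60, 0.10, 0.10, 0.10],
    [0.20, 0.20, 0.40, 0.10, 0.10],
])

# RAG-Token transition probability: marginalize over documents for each candidate token.
p_next_token = p_z_given_x @ p_token_given_doc          # shape (vocab,)
print(p_next_token, p_next_token.sum())                 # sums to 1; feeds a standard beam search
```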

RAG-Sequence:

  • generates the complete output sequence conditioned on a single document
  • the generator runs beam search once per document and yields a score $p_\theta(y \mid x, z)$ for each hypothesis $y$
  • some hypotheses may not have appeared in the beams of all documents
  • Thorough Decoding:
    • run an additional forward pass for each document whose beam does not contain the hypothesis,
    • compute $p_\theta(y \mid x, z)$ for it, and
    • weight each score by $p_\eta(z \mid x)$ and sum the probabilities across documents
    • suitable for shorter outputs
  • Fast Decoding:
    • assume $p_\theta(y \mid x, z_i) \approx 0$ whenever $y$ was not generated during beam search from $z_i$
    • avoid the need to run additional forward passes once the candidate set of hypotheses has been generated
    • suitable for longer outputs (the marginalization formulas for both models are written out below)
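For reference, the two marginalizations from the paper, with retriever distribution $p_\eta(z \mid x)$ and generator distribution $p_\theta$:

$$
\begin{aligned}
p_{\text{RAG-Sequence}}(y \mid x) &\approx \sum_{z \in \operatorname{top-}k\left(p_\eta(\cdot \mid x)\right)} p_\eta(z \mid x) \prod_{i=1}^{N} p_\theta(y_i \mid x, z, y_{1:i-1}) \\
p_{\text{RAG-Token}}(y \mid x) &\approx \prod_{i=1}^{N} \sum_{z \in \operatorname{top-}k\left(p_\eta(\cdot \mid x)\right)} p_\eta(z \mid x)\, p_\theta(y_i \mid x, z, y_{1:i-1})
\end{aligned}
$$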

How is the retriever implemented?

  • a Dense Passage Retriever (DPR) with a bi-encoder architecture
  • a query encoder $\text{BERT}_q$ and a frozen document encoder $\text{BERT}_d$, both implemented with $\text{BERT}_{\text{BASE}}$
  • the query encoder takes the query $x$ as input and produces a dense query representation $\mathbf{q}(x)$
  • the document encoder takes a document $z$ as input and produces a dense document representation $\mathbf{d}(z)$
  • use Maximum Inner Product Search (MIPS) to retrieve the indices of the top-$k$ documents most similar to the query, i.e. with the highest $\mathbf{d}(z)^\top \mathbf{q}(x)$
  • document index == non-parametric memory
  • DPR is pre-trained to retrieve documents containing answers to TriviaQA and Natural Questions questions (see the retrieval sketch after this list)
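A minimal sketch of the bi-encoder retrieval step, with random vectors standing in for the BERT embeddings and a brute-force dot product standing in for the FAISS MIPS index (both stand-ins are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)
dim, n_docs, k = 768, 10_000, 5

# Stand-in for d(z) = BERT_d(z): a frozen, pre-computed document index.
doc_index = rng.standard_normal((n_docs, dim)).astype(np.float32)

# Stand-in for q(x) = BERT_q(x): the query embedding (this encoder stays trainable).
query_emb = rng.standard_normal(dim).astype(np.float32)

# Maximum Inner Product Search, here as a brute-force dot product
# (the paper uses a FAISS index for fast approximate search).
scores = doc_index @ query_emb
topk_ids = np.argpartition(-scores, k)[:k]
topk_ids = topk_ids[np.argsort(-scores[topk_ids])]   # sort the k winners by score
print(topk_ids, scores[topk_ids])
```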

How is the generator implemented?

  • encoder-decoder architecture
  • implemented with BART-large
  • parameters of BART == parametric memory (a small input-formatting sketch follows this list)
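A hedged sketch of the generation step, assuming the Hugging Face `facebook/bart-large` checkpoint; the way query and passage are joined here (simple concatenation with a `" // "` separator) is an illustrative assumption, not necessarily the paper's exact input formatting:

```python
from transformers import BartForConditionalGeneration, BartTokenizer

tokenizer = BartTokenizer.from_pretrained("facebook/bart-large")
model = BartForConditionalGeneration.from_pretrained("facebook/bart-large")

query = "Who wrote the novel Steppenwolf?"
passage = "Steppenwolf is a novel by Hermann Hesse, first published in 1927."
# Concatenate retrieved passage and query into a single encoder input.
inputs = tokenizer(passage + " // " + query, return_tensors="pt", truncation=True)

# Note: a raw BART checkpoint is not fine-tuned for QA; this only illustrates
# the input/output plumbing of the encoder-decoder generator.
generated_ids = model.generate(**inputs, num_beams=4, max_length=32)
print(tokenizer.batch_decode(generated_ids, skip_special_tokens=True))
```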

How are the retriever and generator trained?

  • Jointly train the retriever and generator end-to-end, with no direct supervision on which documents to retrieve.
  • training data: pairs $(x_j, y_j)$, where $x_j$ is the query and $y_j$ is the ground-truth response
  • training objective: minimize the negative marginal log-likelihood of each target, $\sum_j -\log p(y_j \mid x_j)$
  • stochastic gradient descent with Adam optimizer
  • Freeze the document encoder (and its pre-built index); fine-tune only the query encoder and the generator (see the training-loss sketch after this list).
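A minimal sketch of the RAG-Sequence marginal NLL for a single training example, with random tensors standing in for the retriever scores and generator logits (the shapes and the use of PyTorch are assumptions for illustration):

```python
import torch

torch.manual_seed(0)
k, seq_len, vocab = 5, 8, 50265                    # top-k docs, target length, BART vocab size

doc_logits = torch.randn(k, requires_grad=True)                    # retriever scores for top-k docs
gen_logits = torch.randn(k, seq_len, vocab, requires_grad=True)    # generator logits per doc
target = torch.randint(vocab, (seq_len,))                          # ground-truth token ids y

log_p_z = torch.log_softmax(doc_logits, dim=-1)        # log p_eta(z | x) over the top-k docs
log_p_tokens = torch.log_softmax(gen_logits, dim=-1)   # log p_theta(y_i | x, z, y_<i)

# log p_theta(y | x, z): sum the target-token log-probs over the sequence, per document.
log_p_y_given_z = log_p_tokens.gather(
    -1, target.view(1, -1, 1).expand(k, -1, 1)
).squeeze(-1).sum(-1)

# RAG-Sequence marginal: log sum_z p(z|x) p(y|x,z), negated to give the loss.
loss = -torch.logsumexp(log_p_z + log_p_y_given_z, dim=0)
loss.backward()                                        # gradients flow to retriever and generator
print(loss.item())
```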

On which tasks and datasets is RAG evaluated?

Task 1: Open-domain Question Answering

  • Input: a question
  • Output: a generated answer string that should exactly match a reference answer (answers are short spans of text)
  • Task type: open-domain generative question answering
  • Datasets:
    • Natural Questions (NQ)
    • TriviaQA (TQA)
    • WebQuestions (WQ)
    • CuratedTrec (CT)
  • Evaluation metric: exact match (EM; a minimal sketch follows this list)
  • Baselines:
    • an extractive QA paradigm using REALM and DPR
      • text spans extracted from retrieved documents as answer
      • rely primarily on non-parametric knowledge
    • a Closed-Book QA approach using T5
      • no retrieval
      • rely purely on parametric knowledge
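A minimal sketch of the exact-match metric, using the standard SQuAD-style answer normalization (the normalization details are a common convention and an assumption here, not spelled out in these notes):

```python
import re
import string

def normalize_answer(s: str) -> str:
    """Lowercase, strip punctuation and articles, collapse whitespace (SQuAD-style)."""
    s = s.lower()
    s = "".join(ch for ch in s if ch not in set(string.punctuation))
    s = re.sub(r"\b(a|an|the)\b", " ", s)
    return " ".join(s.split())

def exact_match(prediction: str, references: list[str]) -> bool:
    """A prediction counts as correct if it matches any reference after normalization."""
    return any(normalize_answer(prediction) == normalize_answer(r) for r in references)

print(exact_match("The Eiffel Tower", ["Eiffel Tower"]))   # True
```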

Task 2: Abstractive Question Answering

  • Input: a question
  • Output: a generated text that should meaningfully answer the question
  • Task type: free-form, open-domain abstractive text generation
  • Dataset: MSMARCO NLG Task v2.1
  • Evaluation metrics: ROUGE-L, BLEU-1
  • Baseline: BART

Task 3: Jeopardy Question Generation

  • Input: an answer entity
  • Output: a Jeopardy-style question about that entity, i.e., a precise factual statement whose answer is the input entity
  • Task type: open-domain question generation
  • Dataset: SearchQA
  • Evaluation metrics:
    • Q-BLEU-1
    • human evaluation:
      • factuality: whether a statement can be corroborated by trusted external sources
      • specificity: mutual dependence between the input and output
  • Baseline: BART

Task 4: Fact Verification

  • Input: a natural language claim + a Wikipedia dump as KB
  • Output: a label from {supports, refutes, not enough info}
  • Task type: multi-class classification (2-way & 3-way)
  • Dataset: FEVER
  • Evaluation metric: label accuracy
  • Baseline: BART

What are the results and conclusions?

Task 1: Open-domain Question Answering

  • Result: RAG achieves SOTA on all four datasets.
  • Conclusion:
    • RAG combines the generation flexibility of the “closed-book” (parametric only) approaches and the performance of “open-book” retrieval-based approaches.
    • Unlike REALM and T5+SSM, RAG enjoys strong results without expensive, specialized “salient span masking” pre-training.
    • Neither a re-ranker nor extractive reader is necessary for SOTA performance.
    • Advantages of generative QA over extractive QA:
      • Documents that contain clues about the answer but do not contain it verbatim can still contribute to generating a correct answer, allowing more effective marginalization over documents.
      • RAG can generate correct answers even when they are not in any retrieved document.

Task 2: Abstractive Question Answering

  • Result: Palm (SOTA) > RAG > BART
  • Conclusions:
    • Models adopting RAG hallucinate less and generate factually correct text more often than BART.
    • RAG generations are more diverse.

Task 3: Jeopardy Question Generation

  • Result: RAG-Token > RAG-Sequence > BART
  • Conclusions:
    • RAG generations are more factual most of the time.
    • RAG generations are more specific by a large margin.
    • The non-parametric memory helps to guide the generation, drawing out specific knowledge stored in the parametric memory.

Task 4: Fact Verification

  • Result: SemGraph (FEVER3) / EWC (FEVER2) > RAG > BART
  • Conclusion: RAG is also competitive on classification tasks, despite using no retrieval supervision or task-specific architecture.

What are the main advantages and limitations of this paper?

Advantages:

  1. RAG combines the flexibility of parametric memory and the performance of non-parametric memory.
  2. No specialized pre-training (e.g., salient span masking) is needed for the model to retrieve and use non-parametric knowledge.

Limitations:

  1. DPR was trained on Natural Questions and TriviaQA, so there is potential train-test overlap with the QA evaluation sets.
  2. Chunking is crucial, since a poor chunking scheme can cause information loss and inconsistency; the paper applies only a single scheme (fixed-length Wikipedia chunks), and alternative chunking strategies remain to be investigated.