Learning to Retrieve Reasoning Paths over Wikipedia Graph for Question Answering
keywords: reasoning, question answering, knowledge graph, path retrieval, BERT
This post will walk through an interesting paper published at ICLR 2020 by Akari Asai, Kazuma Hashimoto, Hannaneh Hajishirzi, Richard Socher, and Caiming Xiong: [arXiv]
This paper presents a new recurrent retrieval approach that learns to retrieve reasoning paths over the Wikipedia graph to answer multi-hop open-domain questions. The authors show how the interplay between a retriever and a reader model enabled them to outperform the state-of-the-art by 14 points on HotpotQA. The setting consists of two models: a recurrent retriever that retrieves each evidence document conditioned on the sequence of previously retrieved evidence documents, generating several reasoning paths, followed by a reading comprehension model that ranks the retrieved reasoning paths and extracts the answer span from the best path.

Background
Open-domain Question Answering is the task of answering a question given a large collection of text documents or a knowledge graph. In recent years we have seen neural open-domain QA systems [such as Chen et al. (2017)] with a typical two-step setup: first, leverage a non-parameterized retriever based on TF-IDF or BM25 to retrieve a set of documents, and then apply a recurrent reader model to extract the answer span from this reduced set of retrieved documents. Since it is not feasible to run the recurrent model over all documents, the non-parameterized first step keeps the candidate set of documents small in such two-step Question Answering approaches. The performance of these settings is bounded by the performance of the retriever, since there is no interaction or interplay between the two models.
Even though such settings are widely used, they are mostly suitable for questions that can be answered from a single paragraph; they fail on questions that require multiple hops, i.e. looking at multiple paragraphs, to arrive at the answer. This is the characteristic of multi-hop questions: answering them requires finding the connections between entities (usually called bridge entities) across paragraphs and then locating the most relevant paragraph that contains the answer. Some of the paragraphs that need to be read may have little or no lexical overlap or semantic relationship with the original question. This is why settings that use a non-parameterized retriever as the first step, followed by a recurrent reader model, usually fail at multi-hop Question Answering.
Constructing the Wikipedia paragraph graph
The method presented in this paper learns to retrieve reasoning paths (or, you could say, graph walks) over a graph structure. Since the original question might have little or no lexical overlap with the paragraphs along a reasoning path or with the answer paragraph, a graph structure is needed for the model to find such reasoning paths. This graph is created from Wikipedia paragraphs: each node in the Wikipedia graph is a single paragraph p_i.
Internal hyperlinks on Wikipedia are used to construct the edges between articles in this graph. Links within the same document, i.e. one paragraph linking to another paragraph of the same article, are also leveraged when creating the Wikipedia paragraph graph. The resulting graph is densely connected and covers a wide range of topics that provide useful evidence for open-domain questions.
Framework

Graph based Recurrent Retriever
The retriever is a recurrent neural network that scores reasoning paths in the Wikipedia paragraph graph by maximizing the likelihood of selecting the correct evidence paragraph at each timestep, while at the same time fine-tuning the BERT paragraph encodings it leverages.

Each paragraph p_i is encoded together with the question q using BERT, and the [CLS] token representation is taken as its embedding w_i. An RNN is then used to retrieve each node of the reasoning path (i.e. the paragraph selected at a given timestep). At the t-th timestep, the model selects a paragraph p_i among the candidate set C_t given the current hidden state h_t of the RNN: the probability P(p_i|h_t) indicates that paragraph p_i is selected at this timestep. Conditioning on the paragraph selection history allows the RNN to capture relationships between paragraphs in reasoning paths. Termination of a reasoning path is indicated by the [EOE] end-of-evidence symbol, which allows the model to explore reasoning paths of arbitrary length.
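To make the recurrence concrete, here is a minimal PyTorch sketch of a single retrieval step. It assumes the paper's sigmoid scoring P(p_i|h_t) = sigmoid(w_i . h_t + b) and uses a GRU cell as the RNN; the exact RNN parameterization in the paper may differ.

```python
import torch
import torch.nn as nn

class RetrieverStep(nn.Module):
    """One timestep of the graph-based recurrent retriever (a sketch)."""

    def __init__(self, d=768):
        super().__init__()
        self.rnn = nn.GRUCell(d, d)             # updates the hidden state h_t
        self.bias = nn.Parameter(torch.zeros(1))

    def score(self, h_t, cand_embs):
        """cand_embs: (num_cands, d) BERT [CLS] embeddings w_i of each
        candidate paragraph encoded together with the question.
        Returns P(p_i | h_t) = sigmoid(w_i . h_t + b) for every candidate."""
        return torch.sigmoid(cand_embs @ h_t + self.bias)

    def advance(self, h_t, selected_emb):
        """Feed the selected paragraph's embedding back into the RNN."""
        return self.rnn(selected_emb.unsqueeze(0), h_t.unsqueeze(0)).squeeze(0)
```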


Once a paragraph is selected by the RNN at the current timestep, the candidate set for the next timestep, C_{t+1}, includes all the paragraphs that have an edge from the selected paragraph node in the Wikipedia graph. In addition, to give the model the flexibility to retrieve multiple paragraphs from the current candidate set C_t, the K best paragraphs from C_t (by probability) are also added to C_{t+1}. Thus in Fig. 2, after paragraph B is selected, the candidate paragraphs for the next timestep include [C, D] even though a direct link between B and D doesn't exist.
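A small sketch of how the next candidate set could be assembled; the `graph` adjacency mapping and `probs` dictionary are illustrative names, not taken from the authors' code.

```python
def next_candidates(graph, selected, C_t, probs, K=1):
    """Build C_{t+1}: paragraphs linked from the selected one, plus the K
    highest-probability remaining candidates from C_t. This is what keeps
    D as a candidate after B is selected in Fig. 2, despite no B -> D edge."""
    neighbors = set(graph.get(selected, ()))
    k_best = sorted((p for p in C_t if p != selected),
                    key=lambda p: probs[p], reverse=True)[:K]
    return neighbors | set(k_best)
```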
Since the number of paragraphs in the Wikipedia graph is on the order of millions, it is computationally infeasible to encode every question-paragraph pair with BERT to obtain the [CLS] representations. Beam search is therefore used to explore only a limited number of the most probable reasoning paths at any given timestep. The initial candidate set C_1 is initialized with the paragraphs having the highest TF-IDF scores with respect to the input question; for t > 1, C_t is expanded with the paragraph nodes connected to the paragraph selected at the previous timestep. For each paragraph on the beam, that paragraph is passed through BERT and then to the RNN, i.e. we are finding the top B most likely sequences (reasoning paths) with beam size B. The score of a reasoning path E = [p_i, …, p_k] is the product of the probabilities of selecting those paragraphs as the RNN unrolls. Finally, the top B reasoning paths with the highest scores are passed on to the reader model.
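The path-level search can be sketched as a standard beam search in which a path's score is the product of its step probabilities (log-probabilities are summed here for numerical stability). `score_candidates` and `expand` are assumed wrappers around the BERT + RNN scorer and the candidate expansion above.

```python
import math

def beam_search(init_cands, score_candidates, expand, B=8, max_steps=3):
    """Top-B reasoning paths. score_candidates(path, cands) returns
    {cand: P(cand | history)}; expand(path) returns the next candidate
    set; selecting the [EOE] symbol terminates a path."""
    beams = [([], 0.0)]                        # (path, sum of log-probs)
    finished = []
    for _ in range(max_steps):
        new_beams = []
        for path, logp in beams:
            cands = init_cands if not path else expand(path)
            for cand, p in score_candidates(path, cands).items():
                new_logp = logp + math.log(max(p, 1e-12))
                if cand == "[EOE]":
                    finished.append((path, new_logp))   # path terminates
                else:
                    new_beams.append((path + [cand], new_logp))
        # Keep only the B most probable partial paths.
        beams = sorted(new_beams, key=lambda x: x[1], reverse=True)[:B]
    finished.sort(key=lambda x: x[1], reverse=True)
    return finished[:B]
```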
Multi-task Reader model
The role of the reader model is to take the top B reasoning paths and output the answer span from the most likely path. The reader is trained via multi-task learning over two tasks:
- Reading comprehension: extracting the answer span from the most likely reasoning path using BERT, where the probability of each token being the start or end of the span is computed and the answer span is picked accordingly.
- Reasoning path reranking: ranking the reasoning paths by the probability that the path includes the answer.

For the reading comprehension task, the input to BERT is the question text, a separator, and the concatenated text of all paragraphs in the reasoning path. For both tasks, the [CLS] token representation is used as the encoding of the question and reasoning path pair.
The probability of reasoning path E given the question q is given by:

$$
P(E \mid q) = \sigma(\mathbf{w}_n \cdot \mathbf{u})
$$

where u is the BERT [CLS] representation of the concatenated question and reasoning path, and w_n is a learned weight vector.
The best reasoning path E_best and the output answer span S_read are given by:

$$
E_{best} = \arg\max_{E \in \mathbf{E}} P(E \mid q), \qquad
S_{read} = \arg\max_{i \le j} P_i^{start} P_j^{end}
$$
where P_i^start and P_j^end denote the probabilities that the i-th and j-th tokens in E_best are the start and end positions of the answer span. They are computed in the standard BERT QA fashion, i.e. a linear layer over the token representations of E_best followed by a softmax across tokens:

$$
P^{start} = \mathrm{softmax}(U \mathbf{w}_{start}), \qquad
P^{end} = \mathrm{softmax}(U \mathbf{w}_{end})
$$

where U is the matrix of BERT token representations.
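Putting the two heads together, below is a minimal sketch of the reader on top of a shared BERT encoder using the Hugging Face transformers library; the head wiring is illustrative, and the paper uses the larger whole-word-masking BERT.

```python
import torch
import torch.nn as nn
from transformers import BertModel

class MultiTaskReader(nn.Module):
    """Span extraction plus reasoning-path re-ranking on one encoder."""

    def __init__(self, name="bert-base-uncased"):
        super().__init__()
        self.bert = BertModel.from_pretrained(name)
        d = self.bert.config.hidden_size
        self.span_head = nn.Linear(d, 2)       # per-token start / end logits
        self.rerank_head = nn.Linear(d, 1)     # P(E | q) from the [CLS] token

    def forward(self, input_ids, attention_mask):
        out = self.bert(input_ids=input_ids, attention_mask=attention_mask)
        tokens = out.last_hidden_state                      # (batch, n, d)
        start_logits, end_logits = self.span_head(tokens).unbind(dim=-1)
        path_prob = torch.sigmoid(self.rerank_head(tokens[:, 0]))
        return start_logits, end_logits, path_prob
```

Softmaxing start_logits and end_logits over the tokens of E_best yields P^start and P^end, and the span maximizing P_i^start * P_j^end with i <= j is returned as S_read.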
Model Training
The recurrent retriever is trained in a supervised manner using the evidence paragraphs annotated for each question: single-hop QA has a single gold paragraph per question, while multi-hop QA has multiple.
Data augmentation is used to generate more training samples for the recurrent retriever. First, the ground-truth reasoning path g = [p_1, …, p_|g|] is derived from the annotations, and the [EOE] end-of-evidence token is appended to indicate termination of the path. To stabilize the training process, the authors augment the training data with additional reasoning paths that can still derive the answer but are not necessarily the shortest ones, i.e. additional paragraph nodes are added to the ground-truth paths. For example, [p_r, p_1, …, p_|g|] is a new training path formed by adding a paragraph p_r that has a high TF-IDF score with respect to the question and links to p_1, the first paragraph of g. These additional training paths enable the model to handle cases at inference time when the first paragraph of the true reasoning path does not appear in the initial candidate set built from TF-IDF scores.
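A sketch of this augmentation, where `tfidf_ranked` (paragraphs ordered by TF-IDF score against the question) and the `graph` adjacency mapping are assumed helpers:

```python
def augment_training_paths(gold_path, graph, tfidf_ranked):
    """Return the gold path plus a longer variant prepended with a
    paragraph p_r that scores high on TF-IDF and links to the first gold
    paragraph. Every path is terminated with the [EOE] symbol."""
    p1 = gold_path[0]
    paths = [gold_path + ["[EOE]"]]
    for p_r in tfidf_ranked:                   # highest TF-IDF first
        if p_r not in gold_path and p1 in graph.get(p_r, ()):
            paths.append([p_r] + gold_path + ["[EOE]"])
            break                              # one augmented variant here
    return paths
```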
To make the recurrent retriever differentiate between relevant and irrelevant paragraphs, negative paragraph examples are used at training time along with the ground-truth paragraphs. The authors use two types of negative examples (see the sketch after this list):
- (1) TF-IDF based: distractor paragraphs retrieved by TF-IDF that are not part of the ground-truth path.
- (2) Hyperlink based negative examples. TF-IDF based examples are useful for single-hop QA; for multi-hop QA both types are used, but (2) adds more value since it trains the model not to get distracted by reasoning paths that do not contain the answer span.
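A sketch of how the two kinds of negatives could be collected; the helper names and the exact selection heuristics (e.g. treating paragraphs linked from gold ones but off the gold path as hyperlink negatives) are assumptions, not the authors' code.

```python
def make_negatives(tfidf_ranked, gold_paragraphs, graph, n=5):
    """(1) TF-IDF negatives: retrieved paragraphs that are not gold.
    (2) Hyperlink negatives: paragraphs linked from gold paragraphs but
    not on the gold path, i.e. plausible-looking hops that lead nowhere."""
    tfidf_neg = [p for p in tfidf_ranked if p not in gold_paragraphs][:n]
    link_neg = [p for g in gold_paragraphs
                for p in graph.get(g, ()) if p not in gold_paragraphs][:n]
    return tfidf_neg, link_neg
```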
The multi-task reader model uses the same ground-truth evidence paragraphs as the retriever. In addition, it uses distantly supervised examples from the TF-IDF retriever, as this approach is known to be effective (Chen et al. 2017). To let the reader discriminate between relevant and irrelevant reasoning paths, data augmentation with additional negative examples is used: for single-hop QA, the single ground-truth paragraph is replaced with a TF-IDF based negative example; for multi-hop QA, the ground-truth paragraph containing the actual answer span is swapped with a TF-IDF based negative example. At training time, we essentially want to maximize the likelihood of the ground-truth reasoning path given the question while minimizing the likelihood of the distorted reasoning paths.
Loss function
For the retriever model, a binary cross-entropy loss is used to maximize the probability of all the possible correct reasoning paths. The loss for the ground-truth reasoning path g at the t-th timestep is given by:

$$
L_{retr}(p_r, t) = -\log P(p_r \mid h_t) - \sum_{\tilde{p} \in \tilde{C}_t} \log \left( 1 - P(\tilde{p} \mid h_t) \right)
$$

where p_r is the gold paragraph selected at the t-th step and C̃_t is the set of negative examples at that step.
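In code, the per-timestep loss above might look as follows (probabilities as produced by the sigmoid scorer; a sketch, batching omitted):

```python
import torch

def retriever_step_loss(pos_prob, neg_probs, eps=1e-12):
    """Binary cross-entropy for one timestep: maximize P(p_r | h_t) for
    the gold paragraph, minimize P(p | h_t) for every negative in C~_t.
    pos_prob: scalar tensor; neg_probs: 1-D tensor of negatives' probs."""
    return (-torch.log(pos_prob + eps)
            - torch.log(1.0 - neg_probs + eps).sum())
```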
The objective function for the reader model is the sum of cross-entropy losses for the reranking and span prediction tasks. The loss for the question q and an evidence candidate E is given by:

$$
L_{read} = -\log P_{y^{start}}^{start} - \log P_{y^{end}}^{end} + L_{no\_answer}
$$

where y^start and y^end are the ground-truth start and end indices. L_no_answer is the loss of the re-ranking model for discriminating distorted paths with no answers, computed from P^r = P(E|q): it equals -log P^r when E is the ground-truth evidence path and -log(1 - P^r) otherwise.
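A corresponding sketch of the reader loss (batching omitted); how the span term is handled for distorted paths is simplified here, and y_start / y_end are shape-(1,) LongTensors.

```python
import torch
import torch.nn.functional as F

def reader_loss(start_logits, end_logits, y_start, y_end,
                path_prob, is_gold_path, eps=1e-12):
    """Span cross-entropy plus the no-answer (re-ranking) term. The span
    loss is applied only to paths that contain the answer; the re-ranking
    term is -log P^r for gold paths and -log(1 - P^r) otherwise."""
    l_span = torch.tensor(0.0)
    if is_gold_path:
        l_span = (F.cross_entropy(start_logits.unsqueeze(0), y_start)
                  + F.cross_entropy(end_logits.unsqueeze(0), y_end))
    l_no_answer = (-torch.log(path_prob + eps) if is_gold_path
                   else -torch.log(1.0 - path_prob + eps))
    return l_span + l_no_answer
```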
Metrics, Experiments & Results
The metrics reported in this paper are F1 and EM (Exact Match) scores for HotpotQA and SQuAD Open, and the EM score for Natural Questions Open, to evaluate overall QA accuracy in finding the correct answers.
- For HotpotQA, they also report Supporting Fact F1 (SP F1) and Supporting Fact EM (SP EM) to evaluate sentence-level supporting fact retrieval accuracy.
- To evaluate paragraph-level accuracy, they report Answer Recall (AR) [recall of the answer string among the top retrieved paragraphs], Paragraph Recall (PR) [whether at least one of the ground-truth paragraphs is included among the retrieved paragraphs], and Paragraph Exact Match (P EM) [whether both of the ground-truth paragraphs for multi-hop reasoning are included among the retrieved paths].
The approach proposed in this paper is evaluated on three open-domain QA datasets sourced from Wikipedia: HotpotQA, SQuAD Open, and Natural Questions Open. The authors use pre-trained BERT models: the uncased base configuration (d=768) for the retriever and the whole-word-masking uncased large (wwm) configuration (d=1024) for the reader. They follow the same TF-IDF based retriever as Chen et al. (2017). Hyper-parameter tuning of the number of initial TF-IDF based paragraphs (F) and the beam size (B) is done on the HotpotQA dev set.

The method presented in this paper achieves 14.5 F1 and 14.0 EM gains compared to the state-of-the-art Semantic Retrieval (Nie et al., 2019) and 10.9 F1 gains over the concurrent Transformer-XH model (Zhao et al., 2020).

On SQuAD, this model outperforms the concurrent state-of-the-art model (Wang et al., 2019b) by 2.9 F1 and 3.5 EM scores as shown in Table 3. You can find more details on Paragraph Exact Match and Answer Recall numbers in the paper.
Ablation Study
One interesting aspect of this paper is the ablation study, which shows how the various components and strategies interplay to beat the state-of-the-art and how much each design choice contributes to the presented setting.
The retriever is ablated in the following three ways:
- No recurrent module: there is no recurrence; the model computes the probability of each paragraph being included in a reasoning path independently and selects the path with the highest joint probability on the graph.
- No beam search: a greedy choice is made at each timestep.
- No link-based negative examples: the retriever is trained with only TF-IDF based negative examples, without the hyperlink-based ones.
The reader is ablated in the following two ways:
- No reasoning path re-ranking: the answer is output using only the best reasoning path from the retriever model.
- No negative examples: the model is trained only with gold paragraphs. At inference time, it reads all paths and outputs the answer with the highest probability.

- No recurrent module: the most critical component of the retriever; removing it drops EM by 17.4 points. Conditioning on the paragraphs retrieved at previous timesteps is therefore important.
- No link-based negative examples: training without hyperlink-based negative examples results in the second largest performance drop, indicating the model could otherwise be distracted by reasoning paths that do not contain the correct answer.
- No beam search: this results in an EM drop of 4 points, indicating that exploring several candidate reasoning paths at each timestep, rather than committing greedily, is important.
- No reasoning path re-ranking: the performance drop from removing re-ranking indicates the importance of verifying reasoning paths in the reader model.
- No negative examples: not using negative examples to train the reader model degrades EM by more than 16 points.
Retriever and Reader interplay
To show the interplay between the retriever and reader models, the authors share two examples: the first shows a case where re-ranking makes a mistake while the approach as a whole still retrieves the correct answer span; the second shows how re-ranking helps the approach recover the correct answer span even though the retriever selects a wrong paragraph as the best reasoning path.

In Figure 3, the approach successfully retrieves the correct reasoning path and answers correctly, while the re-ranking approach fails. The top two paragraphs next to the graph are the introductory paragraphs of the two entities on the reasoning path, and the paragraph at the bottom is the wrong paragraph selected by re-ranking. As we can see, “Millwall F.C.” has little lexical overlap with the question, and the bridge entity “Millwall” is not stated in the question. In Figure 4, a comparison between the reasoning paths ranked highest by the retriever and by the reader is shown. Although the gold path is included in the top 8 paths selected by beam search, the retriever selects a wrong paragraph as the best reasoning path. By re-ranking the reasoning paths, the reader eventually selects the correct one (“2017–18 Wigan Athletic F.C. season” -> “EFL Cup”).