# Bias and Knowledge Conflicts in Retrieval-Augmented Language Models (RALM)
https://arxiv.org/html/2405.15739v2 (2024): Large Language Models Reflect Human Citation Patterns with a Heightened Citation Bias
- not super relevant or helpful
https://arxiv.org/html/2403.08319v1 (2024): Knowledge Conflicts for LLMs: A Survey
- summary of some surrounding research + commentary
https://arxiv.org/abs/2310.14393 (2023): Merging Generated and Retrieved Knowledge for Open-Domain QA
- The researchers found they can leverage the two sources of information (parametric and external knowledge) more effectively in scenarios with a high degree of knowledge conflict by first matching LLM-generated passages with retrieved counterparts into compatible pairs; a reader model then processes the passage pairs to produce the final answer.
- A passage pair is COMPATIBLE if both passages contain the proper evidence to support answering the question correctly.
- How do they pair passages without duplicates? From the paper: "evaluating the compatibility of all possible pairwise combinations of the retrieved and LLM-generated passages of a question" / "the Hungarian algorithm... This matching is optimal in the sense that it covers all passages while maximizing the sum of their compatibility scores." (See the matching sketch below.)
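A minimal sketch of that pairing step, assuming a `compatibility_score(generated, retrieved)` function as a placeholder for whatever compatibility model the paper trains, and using SciPy's Hungarian-algorithm solver:

```python
# Sketch of the pairing step: score every (generated, retrieved) passage pair
# with a compatibility model, then use the Hungarian algorithm to pick a
# one-to-one matching that maximizes the total compatibility score.
# `compatibility_score` is a placeholder, not the paper's API.
import numpy as np
from scipy.optimize import linear_sum_assignment

def pair_passages(generated, retrieved, compatibility_score):
    # Full pairwise score matrix (rows: generated, cols: retrieved).
    scores = np.array([[compatibility_score(g, r) for r in retrieved]
                       for g in generated])
    # linear_sum_assignment minimizes cost, so negate to maximize total score.
    row_idx, col_idx = linear_sum_assignment(-scores)
    return [(generated[i], retrieved[j], scores[i, j])
            for i, j in zip(row_idx, col_idx)]
```

`linear_sum_assignment` accepts rectangular matrices, so unequal numbers of generated and retrieved passages still yield a one-to-one matching over the smaller set.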
https://arxiv.org/abs/2310.00935 (2024): Resolving Knowledge Conflicts in Large Language Models
- knowledge conflicts: scenarios where discrepancy arises between the internal parametric knowledge of LLMs and non-parametric information provided in the prompt context.
- while LLMs perform well in identifying the existence of knowledge conflicts, they struggle to determine the specific conflicting knowledge and produce a response with distinct answers amidst conflicting information
- [Xie et al. (2024)](https://arxiv.org/abs/2305.13300) (Adaptive Chameleon or Stubborn Sloth: Revealing the Behavior of Large Language Models in Knowledge Conflicts) found that when large language models (LLMs) are presented with both supportive and contradictory evidence to their existing parametric knowledge, they tend to exhibit a strong confirmation bias and prioritize their parametric knowledge
- The findings revealed that, on the one hand, LLMs can be highly receptive to external knowledge even when it conflicts with their parametric memory, provided the external knowledge is coherent and convincing. On the other hand, LLMs also demonstrate a strong confirmation bias when the external evidence contains some information consistent with their parametric memory, despite being presented with conflicting evidence at the same time. In short:
- They found LLMs demonstrate a deficiency in information integration
- We find that LLMs are highly receptive to counter-memory when it is the only evidence presented in a coherent way. However, LLMs also demonstrate a strong confirmation bias toward parametric memory when both supportive and contradictory evidence to their parametric memory are present. In addition, we show that LLMs’ evidence preference is influenced by the popularity, order, and quantity of evidence, none of which may be a desired property for tool-augmented LLMs.
- [Qian et al. (2023)](https://arxiv.org/abs/2309.08594) ("Merge Conflicts!" Exploring the Impacts of External Distractors to Parametric Knowledge Graphs) investigated how Large Language Models (LLMs) handle knowledge conflicts, particularly when external knowledge is introduced that contradicts the model's inherent parametric knowledge. The findings revealed that LLMs tend to deviate from their parametric knowledge to produce responses that resolve direct conflicts with external knowledge, even if they lead to factual errors.
- We argue that LLMs should 1) identify knowledge conflicts, 2) pinpoint conflicting information segments, and 3) provide distinct answers in conflicting scenarios.
https://arxiv.org/abs/2506.08500 (2025): DRAGged into Conflicts: Detecting and Addressing Conflicting Sources in Search‑Augmented LLMs
- Provides a new taxonomy of RAG conflicts + solutions
- Presents **CONFLICTS**, a benchmark dataset annotated for real-world conflicting-source scenarios
- They found that LLMs often struggle to appropriately resolve conflicts between sources.
- Prompting LLMs to explicitly reason about the potential conflict in the retrieved documents significantly improves the quality and appropriateness of their responses.
- On building the dataset: To ensure coverage across different types of knowledge conflicts, we curated seed queries ... Following Wan et al. (2024), we then extract the most relevant 512-token segments from each document by applying the TAS-B model (Hofstätter et al., 2021) across overlapping 512-token windows with a 256-token stride, and calculate the dot product between the window's embedding and the query's embedding. (A rough code sketch of this windowing step follows the nested notes below.)
- https://arxiv.org/abs/2402.11782 (2024): What Evidence Do Language Models Find Convincing?
- Created ConflictingQA dataset for yes/no queries + conflicting answers
- Overall, we find that current models rely heavily on the relevance of a website to the query, while largely ignoring stylistic features that humans find important, such as whether a text contains scientific references or is written in a neutral tone.
- Our work highlights the gap between human and model judgements of text credibility. The solution to this, however, is not clear cut. For one, it is not clear the level of discretion models should have when making predictions. For example, one solution may be to limit retrieval to a set of trustworthy sources.
- https://arxiv.org/abs/2104.06967 (2021): Efficiently Teaching an Effective Dense Retriever with Balanced Topic Aware Sampling
- "neural IR community"
- Instead of relying on more compute capability, we introduce an efficient topic-aware query and balanced margin sampling technique, called TAS-Balanced
- We cluster queries once before training and sample queries out of a cluster per batch. We train our lightweight 6-layer DR model with a novel dual-teacher supervision that combines pairwise and in-batch negative teachers. Our method is trainable on a single consumer-grade GPU in under 48 hours (as opposed to a common configuration of 8x V100s). We show that our TAS-Balanced training method achieves state-of-the-art low-latency (64ms per query) results on two TREC Deep Learning Track query sets. Evaluated on NDCG@10, we outperform BM25 by 44%, a plainly trained DR by 19%, docT5query by 11%, and the previous best DR model by 5%. Additionally, TAS-Balanced produces the first dense retriever that outperforms every other method on recall at any cutoff on TREC-DL and allows more resource intensive re-ranking models to operate on fewer passages to improve results further.
- In plain terms (as best I can tell): this is the training recipe behind the TAS-B embedding model used above to score the 512-token windows. It trains a small, fast dense retriever (bi-encoder) by clustering queries by topic, sampling balanced batches from those clusters, and distilling from two teachers (pairwise and in-batch negatives); the result beats BM25 and prior dense retrievers at low latency (64ms per query).
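A rough sketch of the 512-token windowing described above, assuming the public sentence-transformers TAS-B checkpoint and that windows are cut with the model's own tokenizer; the paper's exact preprocessing may differ:

```python
# Split a document into overlapping windows (512-token width, 256-token
# stride), embed windows and the query with a TAS-B dual encoder, and keep
# the window with the highest dot product to the query. Illustrative only.
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("sentence-transformers/msmarco-distilbert-base-tas-b")

def best_segment(query, document, width=512, stride=256):
    tokens = model.tokenizer.tokenize(document)
    windows = [
        model.tokenizer.convert_tokens_to_string(tokens[i:i + width])
        for i in range(0, max(len(tokens) - width, 0) + 1, stride)
    ]
    q_emb = model.encode(query)        # (dim,)
    w_embs = model.encode(windows)     # (n_windows, dim)
    scores = w_embs @ q_emb            # dot-product scoring, as in TAS-B
    best = int(np.argmax(scores))
    return windows[best], float(scores[best])
```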
https://arxiv.org/abs/2505.17762 (2025): Resolving Conflicting Evidence in Automated Fact-Checking: A Study on Retrieval-Augmented LLMs
- Studies how LLMs augmented with retrieval should reconcile conflicting evidence from sources of varying credibility.
- Build a new dataset CONFACT
- Our results show that effectively incorporating source credibility significantly enhances the ability of RAG models to resolve conflicting evidence and improve fact-checking performance.
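The notes don't say exactly how credibility is injected; as one illustrative (not the paper's) strategy, here is a sketch that tags and orders retrieved evidence by a hypothetical per-source credibility score before prompting the reader:

```python
# Illustrative only: surface source credibility to a RAG fact-checking reader
# by tagging each snippet with its source's credibility and presenting more
# credible sources first. This is a generic sketch, not CONFACT's method.
# `credibility` is an assumed mapping from source domain -> score in [0, 1].
def build_prompt(claim, evidence, credibility):
    lines = [f"Claim: {claim}", "", "Evidence (with source credibility):"]
    ranked = sorted(evidence, key=lambda e: credibility.get(e["domain"], 0.0),
                    reverse=True)
    for i, e in enumerate(ranked, 1):
        label = "high" if credibility.get(e["domain"], 0.0) >= 0.7 else "low"
        lines.append(f"[{i}] (credibility: {label}, source: {e['domain']}) {e['text']}")
    lines.append("")
    lines.append("When sources conflict, weigh higher-credibility sources more "
                 "heavily. Give a verdict (SUPPORTED or REFUTED) with a brief rationale.")
    return "\n".join(lines)
```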
https://arxiv.org/abs/2504.14905 (2025): CRAVE: A Conflicting Reasoning Approach for Explainable Claim Verification Using LLMs
- Studies "automatic claim verification"
- Develop CRAVE
- Seems pretty complicated, but it involves using LLMs' reasoning abilities to capture subtle inconsistencies in complex claims, improving both the accuracy and transparency of claim verification.
https://arxiv.org/abs/2404.10198 (2025): ClashEval: Quantifying the tug-of-war between an LLM's internal prior and external evidence
- "Retrieval augmented generation (RAG) is frequently used to mitigate hallucinations and provide up-to-date knowledge for large language models (LLMs). However, given that document retrieval is an imprecise task and sometimes results in erroneous or even harmful content being presented in context, this raises the question of how LLMs handle retrieved information: If the provided content is incorrect, does the model know to ignore it, or does it recapitulate the error? Conversely, when the model's initial response is incorrect, does it always know to use the retrieved information to correct itself, or does it insist on its wrong prior response?"
- They created a new benchmark dataset (ClashEval) of questions paired with retrieved content that is perturbed away from the truth to varying degrees
- "We benchmark six top-performing LLMs, including GPT-4o, on this dataset and find that LLMs are susceptible to adopting incorrect retrieved content, overriding their own correct prior knowledge over 60% of the time."
- "However, the more unrealistic the retrieved content is (i.e. more deviated from truth), the less likely the model is to adopt it. Also, the less confident a model is in its initial response (via measuring token probabilities), the more likely it is to adopt the information in the retrieved content. We exploit this finding and demonstrate simple methods for improving model accuracy where there is conflicting retrieved content."
- LLMs are more likely to accept **external evidence when uncertain**, and become **more resistant to bad context** when they’re confident in their prior. These two probability-based methods operationalize that behavior to **improve arbitration** between conflicting sources.
- Token Probability Correction (+7.8% Accuracy over baseline):
- Compute the **average token probability** of the model’s answer _without_ context (i.e., the model's internal or “prior” response).
- Compute the same for the model’s answer _with_ the retrieved context.
- **Choose the answer** with the **higher token probability** (i.e., higher model confidence).
- Calibrated Token Probability Correction (+13.9% Accuracy over baseline):
- Rank the token probabilities (as percentiles) within their own distributions (prior vs. context).
- Choose the response with the **higher percentile confidence**.
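A compact sketch of both corrections, assuming per-token probabilities are available for the answer generated without context (the prior) and with the retrieved context; all names are placeholders, not the paper's code:

```python
# Sketch of the two ClashEval-style corrections described above.
import numpy as np
from scipy.stats import percentileofscore

def mean_prob(token_probs):
    # Average token probability = the model's confidence in that answer.
    return float(np.mean(token_probs))

def token_prob_correction(prior_answer, prior_probs, ctx_answer, ctx_probs):
    # Pick whichever answer the model was more confident about on average.
    return prior_answer if mean_prob(prior_probs) > mean_prob(ctx_probs) else ctx_answer

def calibrated_correction(example, prior_scores_all, ctx_scores_all):
    # Compare confidences as percentiles within their own distributions
    # (prior vs. context), since the two score distributions differ.
    p_prior = percentileofscore(prior_scores_all, mean_prob(example["prior_probs"]))
    p_ctx = percentileofscore(ctx_scores_all, mean_prob(example["ctx_probs"]))
    return example["prior_answer"] if p_prior > p_ctx else example["ctx_answer"]
```

Here `prior_scores_all` and `ctx_scores_all` are the mean-probability scores collected over the whole evaluation set, which supply the distributions for the percentile ranking.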
Summary table (GPT-4o); the corrections raise accuracy and reduce context bias at the cost of a small increase in prior bias:
|Method|Accuracy|Context Bias|Prior Bias|
|---|---|---|---|
|No Correction (Baseline)|61.5%|30.4%|2.1%|
|Token Probability Correction|69.3%|19.4%|4.3%|
|**Calibrated** Correction|**75.4%**|**10.7%**|**8.5%**|
- "A key finding is that even the most advanced LLMs like GPT-4o exhibit a strong context bias, overriding their own correct prior knowledge over 60% of the time when presented with incorrect information in the retrieved documents. However, this bias is not absolute - the degree to which the retrieved content deviates from truth negatively correlates with the context preference rate. Interestingly, each LLM exhibits a different prior distribution over truthfulness across domains, such that the same perturbation level affects each model differently. For instance, for a given magnitude of deviation, Claude Opus adheres to incorrect contextual information 30% less often than GPT-4o. While GPT-4o achieves state-of-the-art results on general-purpose tasks, it exhibits higher context bias compared to smaller models like Claude Sonnet. This finding suggests that performance on knowledge-based benchmarks may not automatically mean it is most suitable for RAG settings. Additionally, we find that LLMs are calibrated to selectively defer to external evidence when they are less certain about a given query. However, each model differs in how well-calibrated they are. While strong priors are not inherently problematic, the lack of explicit expectations around how models will decide to use contextual information remains a risk. We propose a simple method for improving models under ClashEval, and hope that future work can improve upon this baseline."
https://arxiv.org/abs/2311.11482 (2025): Meta Prompting for AI Systems
- Introduce Meta Prompting (MP)—a prompting framework that shifts focus from content-specific details to the structural and patterned aspects of problems and solutions
- In particular, we show that Meta Prompting can decompose intricate reasoning tasks into simpler sub-problems, thereby improving token efficiency and enabling fairer comparisons with conventional few-shot techniques.
- Seems aimed particularly at math, and it's basically just a structured prompt (illustrative template below).
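Since it is "basically just a prompt", here is an illustrative structure-focused template in the spirit of Meta Prompting (my wording, not the authors' template):

```python
# Illustrative meta-prompt: it specifies the *structure* of the solution
# rather than content-specific worked examples (unlike few-shot prompting).
META_PROMPT = """You are solving a math problem. Follow this structure exactly:
1. Restate the problem in your own words.
2. List the given quantities and what is asked.
3. Decompose the problem into smaller sub-problems.
4. Solve each sub-problem step by step, showing intermediate results.
5. Combine the sub-results and state the final answer as: Answer: <value>.

Problem: {problem}
"""

def render(problem: str) -> str:
    return META_PROMPT.format(problem=problem)
```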
https://arxiv.org/abs/2402.14409 (2024): Tug-of-War Between Knowledge: Exploring and Resolving Knowledge Conflicts in Retrieval-Augmented Language Models
- Unique finding: among conflicting pieces of external evidence ("truthful, irrelevant and misleading evidence"), "RALMs [retrieval augmented language models] follow the principle of majority rule, leaning towards placing trust in evidence that appears more frequently."
https://arxiv.org/abs/2410.07176 (2025): Astute RAG: Overcoming Imperfect Retrieval Augmentation and Knowledge Conflicts for Large Language Models
- Knowledge conflicts between LLM-internal knowledge and external retrieved knowledge are a bottleneck to overcome in the post-retrieval stage of RAG.
- To render LLMs resilient to imperfect retrieval, we propose ASTUTE RAG, a novel RAG approach that adaptively elicits essential information from LLMs’ internal knowledge, iteratively consolidates internal and external knowledge with source-awareness, and finalizes the answer according to information reliability
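A very rough sketch of that three-stage flow (elicit internal knowledge, consolidate with source-awareness, finalize by reliability); the prompts and iteration logic below are stand-ins, not the paper's:

```python
# Rough sketch of the ASTUTE RAG flow as summarized above. `llm` is a
# placeholder callable (prompt -> text), not any specific API.
def astute_rag(question, retrieved_passages, llm, rounds=2):
    # 1) Adaptively elicit the model's own (internal / parametric) knowledge.
    internal = llm(
        "Write up to 2 short passages of what you already know that could "
        f"answer the question, or say NONE if unsure.\nQuestion: {question}"
    )
    # Tag each passage with its source so consolidation stays source-aware.
    sources = [("internal", internal)] + [("external", p) for p in retrieved_passages]
    notes = "\n".join(f"[{src}] {text}" for src, text in sources)

    # 2) Iteratively consolidate: merge consistent information, flag conflicts.
    for _ in range(rounds):
        notes = llm(
            "Consolidate the passages below: merge consistent information, "
            "separate conflicting claims into distinct groups, and note each "
            "group's sources (internal vs. external).\n" + notes
        )

    # 3) Finalize the answer from the most reliable consolidated group.
    return llm(
        "Based on the consolidated groups below, answer the question using the "
        "most reliable group of information, and briefly justify the choice.\n"
        f"Question: {question}\n{notes}"
    )
```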
(not strictly RAG)
https://arxiv.org/abs/2406.13997 (2024): "Global is Good, Local is Bad?": Understanding Brand Bias in LLMs
- Investigates brand bias in LLMs for product recommendations
- This study reveals that LLMs exhibit brand biases, particularly favoring global brands over local ones, which could affect consumer behavior and brand perception.
- Additionally, socio-economic biases affect LLM recommendations between luxury and non-luxury brands, correlating with a country’s wealth.
- Finally, we found global brand bias may reverse due to country-of-origin effects, revealing potential complications in making broad generalizations about LLM behavior.
---
Information Retrieval expert. ML researcher. Distill these notes (pasted) into a concise synthesis (with citations) about:
How RALMs [retrieval augmented language models] handle knowledge conflicts between parametric vs external knowledge, including cases of supportive and contradictory external sources.
This should include concise definitions of each term (RALM, knowledge conflict, parametric, external) that are consistent with the notes.
---
The research reveals that while RALMs can identify knowledge conflicts, they struggle to pinpoint specific conflicting segments and provide appropriately nuanced responses, highlighting the need for more sophisticated conflict resolution mechanisms in retrieval-augmented systems.
---
https://arxiv.org/abs/2504.07080