Understanding the Effect of Noise in LLM Training Data with Algorithmic Chains of Thought

6 February 2024
Alex Havrilla, Maia Iyer
arXiv:2402.04004 · PDF · HTML
Abstract

During both pretraining and fine-tuning, Large Language Models (LLMs) are trained on trillions of tokens of text of widely varying quality. Both phases of training typically involve heuristically filtering out "low-quality" or noisy training samples, yet little is known quantitatively about how the type or intensity of noise affects downstream performance. In this work, we study how noise in chain of thought (CoT) impacts task performance in the highly controlled setting of algorithmically solvable tasks. First, we develop the Traced Integer (TInt) framework to generate highly customizable noised execution traces for any arithmetic function on lists of integers. We then define two types of noise: static noise, a local form of noise applied after the CoT trace is computed, and dynamic noise, a global form of noise that propagates errors through the trace as it is computed. We evaluate the test performance of pretrained models, both prompted and fine-tuned, on noised datasets with varying levels of dataset contamination and intensity. We find that fine-tuned models are extremely robust to high levels of static noise but struggle significantly more with lower levels of dynamic noise. In contrast, few-shot prompted models appear more sensitive to even static noise. We conclude with a discussion of how our findings affect noise-filtering best practices, in particular emphasizing the importance of removing samples containing destructive dynamic noise with global errors.
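The contrast between the two noise types is easy to see on a toy example. The sketch below is a minimal illustration under stated assumptions, not the paper's TInt implementation; the function names and trace format are hypothetical. It builds a chain-of-thought trace of running sums over a list of integers, then corrupts it in the two ways the abstract describes: static noise perturbs already-computed trace lines, so each error stays local and the final answer is untouched, while dynamic noise perturbs intermediate values during computation, so a single error propagates into every later step and the answer.

```python
# Minimal sketch of static vs. dynamic trace noise (illustrative, not TInt).
import random

def traced_sum(xs):
    """Return a chain-of-thought trace of partial sums plus the final answer."""
    trace, total = [], 0
    for x in xs:
        total += x
        trace.append(f"add {x} -> {total}")
    return trace, total

def add_static_noise(trace, answer, p=0.3, rng=random):
    """Static noise: perturb already-computed trace lines after the fact.
    Each error is local to one line; the final answer is left intact."""
    noised = []
    for line in trace:
        if rng.random() < p:
            op, arrow, val = line.rsplit(" ", 2)
            line = f"{op} {arrow} {int(val) + rng.randint(1, 9)}"
        noised.append(line)
    return noised, answer

def traced_sum_dynamic_noise(xs, p=0.3, rng=random):
    """Dynamic noise: perturb intermediate values *while* computing, so a
    single error is carried into every later step and the final answer."""
    trace, total = [], 0
    for x in xs:
        total += x
        if rng.random() < p:
            total += rng.randint(1, 9)  # error propagates from here onward
        trace.append(f"add {x} -> {total}")
    return trace, total

if __name__ == "__main__":
    random.seed(0)
    xs = [3, 1, 4, 1, 5]
    clean_trace, ans = traced_sum(xs)
    print("clean:  ", clean_trace, "=>", ans)
    print("static: ", *add_static_noise(clean_trace, ans))
    print("dynamic:", *traced_sum_dynamic_noise(xs))
```

Under this reading, the paper's finding is intuitive: a model can learn to ignore locally corrupted trace lines when the answer remains consistent with the clean computation, but dynamically propagated errors change the answer itself, making the supervision signal globally wrong.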
