**Team:** Yunzhi Yao@UCLA; Canyu Chen@Northwestern; Jia-Chen Gu@UCLA; Shumin Deng@NUS; Manling Li@Northwestern; Nanyun Peng@UCLA
This is only a progress report.
Knowledge editing has attracted much attention in recent years because of its potential to update the knowledge in LLMs efficiently and robustly. Many strong methods have appeared, and their competitive performance on several benchmarks demonstrates the capability of current techniques. Despite this success, knowledge editing is still in its infancy and has a long way to go. In this blog, we trace the development of knowledge editing from small language models to the current era of large reasoning models (LRMs), highlight important aspects for further investigation, and discuss issues that have been neglected by the community. The main contents include:
Emerging around 2020 and accelerating with the rise of Large Language Models (LLMs), model editing and knowledge editing have attracted significant attention due to their potential to update model knowledge efficiently. These research fields have witnessed remarkable breakthroughs, with impressive works demonstrating strong performance on benchmarks like ZsRE$^{[1]}$ and CounterFact$^{[2]}$.
Despite this progress, significant challenges remain before these techniques can fully support downstream applications. In today's era of large reasoning models (LRMs), which demonstrate exceptional performance across many real-world tasks, it is time to consider the next step for the editing field. In this blog, we dissect these emerging challenges and scrutinize the limitations of existing evaluation methods and editing techniques.
We hope this blog offers insights to guide future research in this critical domain as we move toward more capable and reliable AI systems.
Traditional benchmarks like ZsRE and CounterFact primarily consist of single-hop factual questions with simple prompts directly querying the edited fact (e.g., "The capital of France is...", "Steve Jobs was born in..."). Even the "generalization" subsets within these benchmarks often rely heavily on simple paraphrases of the original prompt.
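For concreteness, the snippet below sketches what a single CounterFact-style edit case roughly looks like. The field names and values are illustrative placeholders, not the exact benchmark schema; the point is simply that both the efficacy and "generalization" prompts query the same single-hop triple.

```python
# Illustrative CounterFact-style edit case (field names are approximate,
# not the exact benchmark schema). The edit replaces the true object with
# a counterfactual target; evaluation mostly checks prompts that query the
# same triple directly or as light paraphrases.
edit_case = {
    "requested_rewrite": {
        "subject": "Steve Jobs",
        "prompt": "{} was born in",      # template filled with the subject
        "target_true": "San Francisco",
        "target_new": "Tokyo",           # counterfactual edit target
    },
    "paraphrase_prompts": [              # "generalization" split
        "The birthplace of Steve Jobs is",
    ],
    "neighborhood_prompts": [            # "locality" / specificity split
        "Steve Wozniak was born in",
    ],
}
```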
While useful for initial validation, success on these tasks is insufficient for gauging robust knowledge integration. AlphaEdit$^{[3]}$, one of the most popular editing methods, achieves high scores on paraphrase generalization tests (e.g., 91.13% accuracy in sequential editing on Llama3-8B). Yet in practice, the edited model's real-world generation remains unsatisfactory despite these impressive benchmark numbers. Real-world knowledge application is far more complex, often requiring multi-step reasoning, combining multiple pieces of information, resolving ambiguous references, and understanding implications. For example, after editing a model to "know" a new company CEO, it may answer a templated prompt correctly ("Who is the CEO of X?") yet fail to apply that knowledge in a different context. If the same fact is asked indirectly, say as part of a story or a multi-hop question, the model often lapses back to the old answer or gets confused.
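This gap is easy to probe. The sketch below assumes a hypothetical edited checkpoint at `path/to/edited-llama3-8b` (any causal LM produced by an editing method would slot in) and uses made-up prompts around the CEO example to contrast a direct, templated query with an indirect one about the same edited fact.

```python
# Minimal sketch of probing an edited model beyond the templated prompt.
# EDITED_MODEL_PATH and the prompts are placeholders, not a specific method's output.
from transformers import AutoModelForCausalLM, AutoTokenizer

EDITED_MODEL_PATH = "path/to/edited-llama3-8b"  # hypothetical edited checkpoint
tokenizer = AutoTokenizer.from_pretrained(EDITED_MODEL_PATH)
model = AutoModelForCausalLM.from_pretrained(EDITED_MODEL_PATH)

def complete(prompt: str, max_new_tokens: int = 20) -> str:
    """Greedy-decode a short continuation and return only the new tokens."""
    inputs = tokenizer(prompt, return_tensors="pt")
    output = model.generate(**inputs, max_new_tokens=max_new_tokens, do_sample=False)
    return tokenizer.decode(output[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True)

# Direct, templated query (efficacy): editing methods usually get this right.
print(complete("The CEO of company X is"))

# Indirect query embedding the same fact in a narrative, multi-hop-style context:
# this is where edited models often revert to the pre-edit answer.
print(complete(
    "At the product launch, company X's chief executive walked on stage. Their name is"
))
```

Comparing the two continuations against the edited target is exactly the kind of check that templated benchmarks do not exercise.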
A growing body of evidence suggests that current methods fall short in these more demanding situations. Recently, the community has begun to introduce more challenging evaluation sets: