

Large language models (LLMs) are remarkably good at generating text and code, but this success comes with a growing concern: these models can sometimes reproduce copyrighted material from their training data. As LLMs are deployed at scale, this behavior raises both legal and practical questions about how to mitigate copyright infringement while preserving model utility.

Most existing solutions tackle the problem during training or data curation. Training-time defenses are often costly and can degrade model quality, while aggressive data filtering is imperfect and may remove valuable examples. Inference-time methods offer an appealing alternative, but many existing approaches rely on heuristics or fail to scale to real-world language models.

In this work, we focus on a practical inference-time approach to copyright protection. Our goal is a method that:

  • operates directly during decoding, without retraining,
  • reduces verbatim and near-verbatim regurgitation of protected material, and
  • preserves the quality of generated text and code.

To this end, we introduce Copyright-Protecting Model Fusion (CP-Fuse), an inference-time method that adaptively combines the outputs of multiple language models during decoding (see Figure 1 for an intuitive illustration of the method). Across a range of benchmarks and memorization metrics, CP-Fuse reduces the reproduction of protected material by more than \(25\times\), consistently outperforming existing baselines while maintaining generation quality.


Figure 1. Illustration of the CP-Fuse algorithm, with a toy example.




We consider standard autoregressive language models that, given a prompt \(x\), generate a sequence of tokens \(y = (y_0, \ldots, y_T)\) according to

\[p(y|x) = \prod_{t=0}^T p(y_t|y_{<t},x).\]
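For concreteness, here is a minimal sketch of token-by-token sampling from this factorization. It assumes a Hugging Face-style causal language model (`AutoModelForCausalLM`); the model name and sampling loop are purely illustrative and not part of CP-Fuse itself.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Illustrative base model; any causal LM exposing next-token logits works.
tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

def sample(prompt: str, max_new_tokens: int = 50) -> str:
    ids = tokenizer(prompt, return_tensors="pt").input_ids
    for _ in range(max_new_tokens):
        logits = model(ids).logits[0, -1]        # unnormalized p(y_t | y_<t, x)
        probs = torch.softmax(logits, dim=-1)
        next_id = torch.multinomial(probs, 1)    # sample the next token
        ids = torch.cat([ids, next_id.view(1, 1)], dim=-1)
    return tokenizer.decode(ids[0])
```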

CP-Fuse is built on a simple assumption: copyrighted material can be separated across data sources. We train two language models, \(p^{(1)}\) and \(p^{(2)}\), on datasets associated with disjoint sets of copyrighted content. As a result, any specific protected sequence cannot be memorized by both models.

The goal of CP-Fuse is to construct, at inference time, a new model whose predictions remain close to both base models. At each decoding step \(t\), CP-Fuse constructs the next-token distribution by minimizing the worst-case divergence to the two models,

\[\min_{p^{*}} \; \max_{i \in \{1,2\}} \mathrm{KL}\!\left( p^{*}(\cdot) \,\|\, p^{(i)}(\cdot|y_{<t},x) \right),\]

where \(p^{*}\) ranges over valid probability distributions on the next token. This objective is closely related to the \(k\)-NAF guarantees of [2].

Rather than solving this optimization directly, which would be computationally expensive, CP-Fuse exploits the structure of the objective. The optimal solution can be written in a model-fusion form (Lemma 3.1 in our paper [1]). At each token position \(t\), CP-Fuse samples from a distribution

\[\log p^{*}(y_t|y_{<t},x) = \alpha_t \log p^{(1)}(y_t|y_{<t},x) + \beta_t \log p^{(2)}(y_t|y_{<t},x) + c_t,\]

where \(\alpha_t, \beta_t \ge 0\) are adaptive coefficients and \(c_t\) is a normalizing constant.
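To make the fusion step concrete, the sketch below grid-searches a small set of candidate coefficients \((\alpha, \beta)\) and keeps the fused next-token distribution whose worse KL divergence to the two base models is smallest, matching the 2D grid search mentioned in the FAQ. The grid range and resolution are assumptions made for illustration; the paper [1] derives the exact form of the optimal coefficients.

```python
import torch
import torch.nn.functional as F

def fuse_step(logp1: torch.Tensor, logp2: torch.Tensor, grid_size: int = 21) -> torch.Tensor:
    """One illustrative CP-Fuse-style decoding step.

    logp1, logp2: log-probabilities over the vocabulary from the two base
    models at the current position, i.e. log p^(i)(. | y_<t, x).
    Returns the fused log-distribution that minimizes the worse of the two
    KL divergences over a coarse 2D grid of (alpha, beta) values.
    """
    best, best_logq = float("inf"), None
    grid = torch.linspace(0.0, 2.0, grid_size)   # assumed search range
    for alpha in grid:
        for beta in grid:
            if alpha == 0 and beta == 0:
                continue
            # alpha * log p1 + beta * log p2 + c_t, with c_t handled by normalization
            logq = F.log_softmax(alpha * logp1 + beta * logp2, dim=-1)
            q = logq.exp()
            kl1 = (q * (logq - logp1)).sum()     # KL(q || p1)
            kl2 = (q * (logq - logp2)).sum()     # KL(q || p2)
            worst = torch.maximum(kl1, kl2)
            if worst < best:
                best, best_logq = worst.item(), logq
    return best_logq

# Usage at each decoding step:
#   next_id = torch.multinomial(fuse_step(logp1, logp2).exp(), 1)
```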


Why CP-Fuse Reduces Regurgitation


The effectiveness of CP-Fuse stems from a simple mechanism: it prevents any single model from dominating the generation. Consider a prompt \(x\) and a partially generated sequence \(y_{<t}\). If one model explains the history much better than the other, for example \(p^{(1)}(y_{<t}|x) \gg p^{(2)}(y_{<t}|x)\), this imbalance can signal memorization. CP-Fuse then either reweights the next-token distribution so that both models assign comparable likelihood to the sequence, or falls back to the less dominant model for the next token. In both cases, generation is steered away from continuations that are strongly preferred by only one model.

This adaptive behavior induces a balancing effect (Lemma 3.2 in our paper [1]): throughout the generated sequence, the cumulative log-likelihoods under the two base models remain close. This property is crucial under the assumption that copyrighted material is separable across training datasets. If a protected sequence is memorized by only one base model, the other assigns it low probability, and CP-Fuse exploits this disagreement to suppress regurgitation.
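One rough way to observe this balancing effect is to compare the cumulative log-likelihood of a generated prefix under the two base models: under CP-Fuse the gap should stay small, whereas a regurgitating model drives the two apart. The diagnostic below is an illustrative check, not part of the algorithm; it assumes Hugging Face-style causal LMs.

```python
import torch
import torch.nn.functional as F

def loglik_gap(model1, model2, input_ids: torch.Tensor) -> float:
    """Absolute gap between the cumulative log-likelihoods of a prefix under
    the two base models. A large gap means one model explains the prefix far
    better than the other, which can signal memorization; under CP-Fuse the
    gap is expected to remain small (Lemma 3.2 in [1])."""
    def seq_loglik(model):
        logits = model(input_ids).logits[0, :-1]          # predicts tokens 1..T
        logp = F.log_softmax(logits, dim=-1)
        targets = input_ids[0, 1:]
        return logp.gather(-1, targets.unsqueeze(-1)).sum().item()
    return abs(seq_loglik(model1) - seq_loglik(model2))
```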

In the next section, we show that this balancing property translates into large empirical reductions in both exact and approximate memorization, while preserving the utility of generated text and code.


Results


In our experiments, we split each dataset into two non-overlapping subsets and fine-tune a separate model on each subset. To evaluate the copyright protection capabilities of CP-Fuse, we simulate an extreme scenario by deliberately overfitting the models through fine-tuning for many epochs.

Memorization. CP-Fuse substantially reduces both verbatim and approximate memorization across text and code tasks (Figure 2). Compared to the overfitted models, CP-Fuse reduces exact regurgitation by more than \(25\times\) and consistently outperforms all baselines. System-prompt self-reminders fail to prevent memorization, while MemFree shortens copied segments but still produces sequences roughly twice as long as those generated by CP-Fuse, often avoiding exact matches only through superficial edits such as spacing or spelling changes (see Appendix A.10 in [1]). Token-wise CP-\(\Delta\) [2] is also prone to reproducing long memorized segments.

Approximate memorization metrics reinforce these findings. On code-generation tasks, CP-Fuse achieves near non-plagiarism: automated plagiarism detectors flag almost no copying, while all baselines exhibit clear infringement.


Figure 2. Copyright-infringement metrics at the 95th percentile, averaged across fine-tuning splits, for the Python instructions, MathAbstracts, and WritingPrompts datasets. We present results for the overfitted models, Self-reminder prompting (SystemPrompt), MemFree, CP-\(\Delta\), and CP-Fuse. Metrics include Exact Matching (EM), BLEU score (BLE), Levenshtein Distance (LEV), and the JPlag code plagiarism score (JP).


Utility. Reducing memorization does not come at the expense of utility (Figure 3). On held-out test data, CP-Fuse matches the performance of the overfitted models on downstream tasks. For code generation, it achieves comparable pass@1 accuracy. For story generation, the outputs show the same level of fluency as those produced by models that still memorize their training data. All baselines perform similarly in terms of utility, with the notable exception of MemFree, which consistently underperforms due to typos and hallucinations introduced by its filtering mechanism (see Appendix A.10 in [1]).


Figure 3. Utility metrics across datasets. Pass@1 is reported for APPS, MBPP, and HumanEval (HE); Fluency is reported for WritingPrompts (WP).


Overall, these results show that CP-Fuse effectively suppresses both exact and near-verbatim reproduction of training data while preserving the quality of generated text and code.


FAQ


What is the computational overhead of CP-Fuse?

Solving the optimization problem described above is inexpensive: it reduces to a grid search over a low-dimensional (2D) space. However, because CP-Fuse combines multiple models at inference time, it requires running forward passes for each model. These computations can be parallelized, and the additional communication overhead—when multiple devices are involved—is often negligible in practice. See Appendix D.1 of [1] for further discussion.
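As an illustration of how the extra forward pass can be hidden, the snippet below dispatches the two base models' forward passes concurrently from separate threads, assuming each model sits on its own device; the device placement and threading approach are assumptions for this sketch, not a prescription from the paper.

```python
from concurrent.futures import ThreadPoolExecutor

def parallel_next_token_logits(model1, model2, ids1, ids2):
    """Run both base models' forward passes concurrently.
    Assumes model1/ids1 live on one device and model2/ids2 on another, so the
    two passes can overlap; PyTorch launches GPU kernels asynchronously."""
    with ThreadPoolExecutor(max_workers=2) as pool:
        f1 = pool.submit(lambda: model1(ids1).logits[0, -1])
        f2 = pool.submit(lambda: model2(ids2).logits[0, -1])
        return f1.result(), f2.result()
```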

How should the base models be trained?

Our recommended strategy is to duplicate datasets for tasks that are not copyright-sensitive and include them in the training data of all models. For copyright-sensitive content, datasets should be partitioned so that each sensitive sample appears in only one model’s training set. This ensures that each model can independently perform well on the target tasks, so fusing them with CP-Fuse does not lead to a loss of utility.
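A minimal sketch of this recipe, where `non_sensitive` and `sensitive` are hypothetical lists of training examples: the copyright-insensitive data is duplicated into both training sets, while the copyright-sensitive data is partitioned disjointly.

```python
import random

def build_training_splits(non_sensitive, sensitive, seed=0):
    """Duplicate copyright-insensitive examples into both splits and partition
    the copyright-sensitive examples so that each protected sample appears in
    only one model's training set."""
    rng = random.Random(seed)
    shuffled = list(sensitive)
    rng.shuffle(shuffled)
    half = len(shuffled) // 2
    split1 = non_sensitive + shuffled[:half]
    split2 = non_sensitive + shuffled[half:]
    return split1, split2
```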

How does CP-Fuse compare to training-time approaches such as the Goldfish loss?

The Goldfish loss [3] is a training-time strategy and is therefore orthogonal to CP-Fuse, which operates at inference time. In Section 4.3 of the paper [1], we also demonstrate a practical setting in which CP-Fuse is applied on top of models trained with the Goldfish loss, yielding additional protection and illustrating the seamless integration of the two approaches.

Is CP-Fuse robust to adversarial attacks?

Section 4.4 of the paper [1] evaluates the robustness of CP-Fuse against prefix-prompt extraction attacks. In this setting, an adversary has black-box access to CP-Fuse, which wraps two potentially copyright-infringing base models. The adversary is given the prompts used during fine-tuning as well as a prefix of an original story, and aims to induce CP-Fuse to regurgitate a memorized continuation. We show that CP-Fuse provides strong protection in this scenario: even for very long prefixes, it does not reproduce long memorized continuations. Nonetheless, robustness to more sophisticated jailbreak strategies remains an important direction for future work.

What happens if the separability assumption does not hold?

If the separability assumption is violated, the balancing property that underpins CP-Fuse’s protection can no longer be guaranteed. That said, Appendix B of the paper [1] includes ablation studies across varying degrees of overlap between training datasets, demonstrating robustness to moderate violations of this assumption. Nevertheless, CP-Fuse provides no formal guarantees in this setting, and highly overlapping data may still lead to long verbatim reproductions. A theoretical analysis of worst-case behavior—such as bounds on the length of copied segments as overlap increases—remains an important direction for future work.

We emphasize that real-world deployments of CP-Fuse may be subject to legal and contractual constraints related to dataset and model usage. Copyright protection for language models remains a complex and evolving area, with no clear legislative consensus on what constitutes infringement. Accordingly, we refrain from providing legal guidance and encourage practitioners to consult qualified legal experts when deploying such systems.


References

[1] Javier Abad, Konstantin Donhauser, Francesco Pinto, and Fanny Yang. “Copyright-Protected Language Generation via Adaptive Model Fusion.” International Conference on Learning Representations (ICLR) (2025).

[2] Nikhil Vyas, Sham Kakade, and Boaz Barak. “On Provable Copyright Protection for Generative Models.” International Conference on Machine Learning (ICML) (2023).

[3] Abhimanyu Hans, Yuxin Wen, Neel Jain, John Kirchenbauer, Hamid Kazemi, Prajwal Singhania, Siddharth Singh, Gowthami Somepalli, Jonas Geiping, Abhinav Bhatele, and Tom Goldstein. “Be like a Goldfish, Don’t Memorize! Mitigating Memorization in Generative LLMs.” Neural Information Processing Systems (NeurIPS) (2024).