Merging LLMs and Testing with Private Datasets

Dwarakanath Rao
4 min read · Jan 8, 2024


What is LLM merging, and why do we need it?

Model merging is a technique that combines two or more LLMs into a single model. It is a relatively new and experimental method to create new models cheaply (no GPU required). Model merging works surprisingly well and has produced many state-of-the-art models on the Open LLM Leaderboard.

Here is my take on the mergekit library.

Credit to the creators

I have played around with some of the merged LLMs on Hugging Face and found them very useful. I then tested them with IIT JEE exam questions and mock exams, and the merged models performed noticeably better.

Merge algorithms

In this section, we will focus on four methods currently implemented in mergekit. Note that there are other methods, such as linear and Task Arithmetic.

1. SLERP

Spherical Linear Interpolation (SLERP) is a method used to smoothly interpolate between two vectors. It maintains a constant rate of change and preserves the geometric properties of the spherical space in which the vectors reside.

There are several reasons to prefer SLERP over a traditional linear interpolation. For example, in high-dimensional spaces, linear interpolation can lead to a decrease in the magnitude of the interpolated vector (i.e., it reduces the scale of weights). Moreover, the change in direction of the weights often represents more meaningful information (like feature learning and representation) than the magnitude of change.

SLERP is implemented using the following steps:

1. Normalize the input vectors to unit length, ensuring they represent directions rather than magnitudes

2. Calculate the angle between these vectors using their dot product.

3. If the vectors are nearly collinear, it defaults to linear interpolation for efficiency. Otherwise, SLERP computes scale factors based on the interpolation factor t (t = 0 gives 100% of the first model, t = 1 gives 100% of the second model) and the angle between the vectors.

4. These factors are used to weigh the original vectors, which are then summed to obtain the interpolated vector.
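For reference, writing θ for the angle between the normalized vectors v0 and v1, the scale factors in steps 3 and 4 come from the standard SLERP formula:

SLERP(v0, v1; t) = (sin((1 − t)·θ) / sin θ) · v0 + (sin(t·θ) / sin θ) · v1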

SLERP is currently the most popular merging method, but it is limited to combining only two models at a time. It is still possible to hierarchically combine multiple models. Here is an example SLERP configuration:

slices:
  - sources:
      - model: psmathur/orca_mini_v3_13b
        layer_range: [0, 40]
      - model: garage-bAInd/Platypus2-13B
        layer_range: [0, 40]
# or, the equivalent "models:" syntax:
# models:
#   - model: psmathur/orca_mini_v3_13b
#   - model: garage-bAInd/Platypus2-13B
merge_method: slerp
base_model: psmathur/orca_mini_v3_13b
parameters:
  t: # interpolation factor, varying across layer groups
    - filter: self_attn
      value: [0, 0.5, 0.3, 0.7, 1]
    - filter: mlp
      value: [1, 0.5, 0.7, 0.3, 0]
    - value: 0.5 # fallback for rest of tensors
dtype: float16

2. TIES

TIES-Merging is designed to efficiently merge multiple task-specific models into a single multitask model. To do so, it addresses two main challenges:

  • Redundancy in model parameters: It identifies and eliminates redundant parameters within task-specific models. This is achieved by focusing on the changes made during fine-tuning, identifying the top-k% most significant changes, and discarding the rest.
  • Disagreement between parameter signs: Conflicts arise when different models suggest opposing adjustments to the same parameter. TIES-Merging resolves these conflicts by creating a unified sign vector that represents the most dominant direction of change across all models.

TIES-Merging is divided into the following three steps:

  1. Trim: Reduces redundancy in task-specific models by retaining only a fraction of the most significant parameters (the density parameter) and resetting the rest to zero.
  2. Elect Sign: Resolves sign conflicts across different models by creating a unified sign vector based on the most dominant direction (positive or negative) in terms of cumulative magnitude.
  3. Disjoint Merge: Averages parameter values that align with the unified sign vector, excluding zero values.

Unlike SLERP, TIES can merge multiple models at a time.
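As an illustration only (not a configuration from this article), a TIES merge of the two 13B models used in the SLERP example might be written as follows. The base model ID (meta-llama/Llama-2-13b-hf) and the density/weight values are assumptions for this sketch:

models:
  - model: meta-llama/Llama-2-13b-hf   # assumed common base of both fine-tunes
    # no parameters necessary for the base model
  - model: psmathur/orca_mini_v3_13b
    parameters:
      density: 0.5   # keep the top 50% of fine-tuned deltas (Trim step)
      weight: 0.5
  - model: garage-bAInd/Platypus2-13B
    parameters:
      density: 0.5
      weight: 0.3
merge_method: ties
base_model: meta-llama/Llama-2-13b-hf
parameters:
  normalize: true
dtype: float16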

3. DARE

DARE uses an approach similar to TIES with two main differences:

  • Pruning: DARE randomly resets fine-tuned weights to their original values (those of the base model).
  • Rescaling: DARE rescales the weights to keep the expectations of model outputs approximately unchanged. It adds the rescaled weights of both (or more) models to the weights of the base model with a scale factor.

Mergekit’s implementation of this method has two flavors: with the sign election step of TIES (dare_ties) or without (dare_linear).
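As a sketch under the same assumptions as the TIES example above (the base model ID and the hyperparameter values are illustrative, not from this article), a dare_ties merge mainly changes the merge_method while reusing similar density/weight settings:

models:
  - model: meta-llama/Llama-2-13b-hf   # assumed base model
  - model: psmathur/orca_mini_v3_13b
    parameters:
      density: 0.5   # probability of keeping each fine-tuned delta
      weight: 0.4
  - model: garage-bAInd/Platypus2-13B
    parameters:
      density: 0.5
      weight: 0.3
merge_method: dare_ties
base_model: meta-llama/Llama-2-13b-hf
dtype: float16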

4. Passthrough

The passthrough method differs significantly from the previous ones. By concatenating layers from different LLMs, it can produce models with an exotic number of parameters (e.g., 9B with two 7B parameter models). These models are often referred to as “frankenmerges” or “Frankenstein models” by the community.

This technique is very experimental, but it has managed to create impressive models, like goliath-120b, built from two Llama 2 70B models. The recently released SOLAR-10.7B-v1.0 also uses the same idea, called depth up-scaling in their paper.
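To show the shape of such a configuration, here is a minimal passthrough sketch; the models and layer ranges below are placeholders chosen only to illustrate how layer slices from different models are stacked:

slices:
  - sources:
      - model: psmathur/orca_mini_v3_13b
        layer_range: [0, 32]    # first 32 layers of one model
  - sources:
      - model: garage-bAInd/Platypus2-13B
        layer_range: [24, 40]   # last 16 layers of the other
merge_method: passthrough
dtype: float16

Since each of these 13B models has 40 layers, the result would have 48 layers, which is what gives frankenmerges their unusual parameter counts.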

💻 Merge your own models

In this section, we will use mergekit to load a merge configuration, run it, and upload the resulting model to the Hugging Face Hub.

First of all, we install mergekit directly from source as follows:

!git clone https://github.com/cg123/mergekit.git
!cd mergekit && pip install -q -e .

In the following block, we load the merge configuration in a YAML format. We also specify the name of the merged model for future use. You can copy/paste any configuration from the previous section here.

This time, we will use two different models, Marcoroni-7B-v3 and Mistral-7B-Merge-14-v0.1, and merge them with the SLERP method. We save the configuration as a YAML file, shown below, to be used as input to the merge command.
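As a sketch, the configuration saved to config.yaml could look like the following. The repository owners shown before the model names are assumptions here, so verify the exact model IDs on the Hugging Face Hub before running the merge:

slices:
  - sources:
      - model: AIDC-ai-business/Marcoroni-7B-v3          # verify the repo owner on the Hub
        layer_range: [0, 32]
      - model: EmbeddedLLM/Mistral-7B-Merge-14-v0.1      # verify the repo owner on the Hub
        layer_range: [0, 32]
merge_method: slerp
base_model: AIDC-ai-business/Marcoroni-7B-v3
parameters:
  t:
    - filter: self_attn
      value: [0, 0.5, 0.3, 0.7, 1]
    - filter: mlp
      value: [1, 0.5, 0.7, 0.3, 0]
    - value: 0.5
dtype: float16

The Python code below reads this file via the CONFIG_YML path.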

import torch
import yaml

from mergekit.config import MergeConfiguration
from mergekit.merge import MergeOptions, run_merge

# Example values; adjust the paths and options for your setup
CONFIG_YML = "config.yaml"   # the merge configuration saved above
OUTPUT_PATH = "./merge"      # folder for the merged model
LORA_MERGE_CACHE = "/tmp"    # directory for temporary LoRA merges
COPY_TOKENIZER = True        # copy the base model's tokenizer into the output
LAZY_UNPICKLE = False        # experimental lazy unpickler for lower memory use
LOW_CPU_MEMORY = False       # keep tensors on GPU to save system RAM (needs enough VRAM)

# Load and validate the merge configuration
with open(CONFIG_YML, "r", encoding="utf-8") as fp:
    merge_config = MergeConfiguration.model_validate(yaml.safe_load(fp))

# Run the merge and write the result to OUTPUT_PATH
run_merge(
    merge_config,
    out_path=OUTPUT_PATH,
    options=MergeOptions(
        lora_merge_cache=LORA_MERGE_CACHE,
        cuda=torch.cuda.is_available(),
        copy_tokenizer=COPY_TOKENIZER,
        lazy_unpickle=LAZY_UNPICKLE,
        low_cpu_memory=LOW_CPU_MEMORY,
    ),
)

print("Done!")

That's it. You can now load and use your new merged model from the output folder.

Credit to https://github.com/cg123/mergekit#merge-methods
