
LLM finetuning

Example: fine-tuning a language model (GPT, GPT-2, CTRL, OPT, etc.) on a text dataset.

Large chunks of the code here are taken from this example script in the transformers GitHub repository.

If you haven't already, you should definitely check out this walkthrough of that script from the HuggingFace docs.

NetworkConfig #

Configuration options related to the choice of network.

When instantiated by Hydra, this calls the target function passed to the decorator. In this case, this pulls the pretrained network weights from the HuggingFace model hub.
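A hedged sketch of that instantiation pattern, assuming a plain-dict config for illustration: the config stores a `_target_` dotted path plus keyword arguments, and instantiating it imports and calls the target. The `instantiate` helper below is a toy stand-in for `hydra.utils.instantiate` (which additionally handles nested attributes such as classmethods), and the demo target is illustrative, not the project's actual config.

```python
# Toy stand-in for hydra.utils.instantiate: resolve the `_target_` dotted
# path, import it, and call it with the remaining config keys as kwargs.
import importlib


def instantiate(cfg: dict):
    """Import cfg['_target_'] and call it with the remaining keys."""
    module_name, _, attr_name = cfg["_target_"].rpartition(".")
    target = getattr(importlib.import_module(module_name), attr_name)
    kwargs = {k: v for k, v in cfg.items() if k != "_target_"}
    return target(**kwargs)


# For NetworkConfig, the target would be something like
# "transformers.AutoModelForCausalLM.from_pretrained", which downloads
# pretrained weights from the HuggingFace hub when instantiated.
# Here we use a harmless stdlib target instead:
cfg = {"_target_": "fractions.Fraction", "numerator": 3, "denominator": 4}
frac = instantiate(cfg)
```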

TokenizerConfig #

Configuration options for the tokenizer.

DatasetConfig dataclass #

Configuration options related to the dataset preparation.

dataset_path instance-attribute #

dataset_path: str

Name of the dataset "family".

For example, to load "wikitext/wikitext-103-v1", this would be "wikitext".

dataset_name class-attribute instance-attribute #

dataset_name: str | None = None

Name of the specific dataset within that family.

For example, to load "wikitext/wikitext-103-v1", this would be "wikitext-103-v1".

validation_split_percentage class-attribute instance-attribute #

validation_split_percentage: int = 10

Percentage of the train dataset to use for validation if there isn't already a validation split.
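A sketch of how these three fields could map onto `datasets.load_dataset` arguments, carving a validation split out of `train` when the dataset lacks one. The helper name is hypothetical; the percent-slice split syntax is the one supported by the `datasets` library.

```python
# Build (train, validation) split specifications for datasets.load_dataset,
# reserving the first `validation_split_percentage` percent of `train`
# for validation when no validation split exists.
def build_split_specs(validation_split_percentage: int) -> tuple[str, str]:
    val = f"train[:{validation_split_percentage}%]"
    train = f"train[{validation_split_percentage}%:]"
    return train, val


# With the default of 10%, loading "wikitext/wikitext-103-v1" would look like:
#   load_dataset("wikitext", "wikitext-103-v1", split=train_spec)
train_spec, val_spec = build_split_specs(10)
```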

LLMFinetuningExample #

Bases: LightningModule

Example of a lightning module used to fine-tune a huggingface model.

setup #

setup(stage: str)

Hook from Lightning that is called at the start of training, validation and testing.

TODO: Later perhaps we could do the preprocessing in a distributed manner like this: https://discuss.huggingface.co/t/how-to-save-datasets-as-distributed-with-save-to-disk/25674/2
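The referenced `run_clm`-style preprocessing concatenates all tokenized texts and slices them into fixed-length blocks. Below is a dependency-free sketch of that "group_texts" step, assumed (not taken from this module) to be what the hook applies via `dataset.map(..., batched=True)`:

```python
# Concatenate a batch of token-id lists and cut the result into
# fixed-size blocks, dropping the trailing remainder -- the standard
# grouping step in HuggingFace's causal-LM example script.
def group_texts(token_ids_batch: list[list[int]], block_size: int) -> list[list[int]]:
    concatenated = [t for ids in token_ids_batch for t in ids]
    total = (len(concatenated) // block_size) * block_size  # drop remainder
    return [concatenated[i : i + block_size] for i in range(0, total, block_size)]


blocks = group_texts([[1, 2, 3], [4, 5, 6, 7]], block_size=3)
```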

configure_optimizers #

configure_optimizers()

Prepare optimizer and schedule (linear warmup and decay)
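The warmup-and-decay shape can be sketched as a learning-rate multiplier, matching the curve of `transformers.get_linear_schedule_with_warmup`: the rate ramps linearly from 0 to the base value over the warmup steps, then decays linearly back to 0. How it is wired into the optimizer here is an assumption, not taken from the source.

```python
# Learning-rate multiplier for linear warmup followed by linear decay,
# suitable for use with torch.optim.lr_scheduler.LambdaLR.
def linear_warmup_decay(step: int, warmup_steps: int, total_steps: int) -> float:
    if step < warmup_steps:
        return step / max(1, warmup_steps)  # ramp up: 0 -> 1
    # decay: 1 -> 0 over the remaining steps
    return max(0.0, (total_steps - step) / max(1, total_steps - warmup_steps))
```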