
LLM finetuning

Example: fine-tuning a language model (GPT, GPT-2, CTRL, OPT, etc.) on a text dataset.

Large chunks of the code here are taken from this example script in the transformers GitHub repository.

If you haven't already, you should definitely check out this walkthrough of that script from the HuggingFace docs.

NetworkConfig #

Configuration options related to the choice of network.

When instantiated by Hydra, this calls the target function passed to the decorator. In this case, this pulls the pretrained network weights from the HuggingFace model hub.
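A hedged sketch of that instantiation pattern, assuming a plain-dict config for illustration: the config stores a `_target_` dotted path plus keyword arguments, and instantiating it imports and calls the target. The `instantiate` helper below is a toy stand-in for `hydra.utils.instantiate` (which additionally handles nested attributes such as classmethods), and the demo target is illustrative, not the project's actual config.

```python
# Toy stand-in for hydra.utils.instantiate: resolve the `_target_` dotted
# path, import it, and call it with the remaining config keys as kwargs.
import importlib


def instantiate(cfg: dict):
    """Import cfg['_target_'] and call it with the remaining keys."""
    module_name, _, attr_name = cfg["_target_"].rpartition(".")
    target = getattr(importlib.import_module(module_name), attr_name)
    kwargs = {k: v for k, v in cfg.items() if k != "_target_"}
    return target(**kwargs)


# For NetworkConfig, the target would be something like
# "transformers.AutoModelForCausalLM.from_pretrained", which downloads
# pretrained weights from the HuggingFace hub when instantiated.
# Here we use a harmless stdlib target instead:
cfg = {"_target_": "fractions.Fraction", "numerator": 3, "denominator": 4}
frac = instantiate(cfg)
```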

TokenizerConfig #

Configuration options for the tokenizer.

DatasetConfig dataclass #

Configuration options related to the dataset preparation.

dataset_path instance-attribute #

dataset_path: str

Name of the dataset "family".

For example, to load "wikitext/wikitext-103-v1", this would be "wikitext".

dataset_name class-attribute instance-attribute #

dataset_name: str | None = None

Name of the specific dataset within that family.

For example, to load "wikitext/wikitext-103-v1", this would be "wikitext-103-v1".

validation_split_percentage class-attribute instance-attribute #

validation_split_percentage: int = 10

Percentage of the train dataset to use for validation if there isn't already a validation split.
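A sketch of how these three fields could map onto `datasets.load_dataset` arguments, carving a validation split out of `train` when the dataset lacks one. The helper name is hypothetical; the percent-slice split syntax is the one supported by the `datasets` library.

```python
# Build (train, validation) split specifications for datasets.load_dataset,
# reserving the first `validation_split_percentage` percent of `train`
# for validation when no validation split exists.
def build_split_specs(validation_split_percentage: int) -> tuple[str, str]:
    val = f"train[:{validation_split_percentage}%]"
    train = f"train[{validation_split_percentage}%:]"
    return train, val


# With the default of 10%, loading "wikitext/wikitext-103-v1" would look like:
#   load_dataset("wikitext", "wikitext-103-v1", split=train_spec)
train_spec, val_spec = build_split_specs(10)
```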

LLMFinetuningExample #

Bases: LightningModule

Example of a lightning module used to fine-tune a huggingface model.

setup #

setup(stage: str)

Hook from Lightning that is called at the start of training, validation and testing.

TODO: Later perhaps we could do the preprocessing in a distributed manner like this: https://discuss.huggingface.co/t/how-to-save-datasets-as-distributed-with-save-to-disk/25674/2
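The referenced `run_clm`-style preprocessing concatenates all tokenized texts and slices them into fixed-length blocks. Below is a dependency-free sketch of that "group_texts" step, assumed (not taken from this module) to be what the hook applies via `dataset.map(..., batched=True)`:

```python
# Concatenate a batch of token-id lists and cut the result into
# fixed-size blocks, dropping the trailing remainder -- the standard
# grouping step in HuggingFace's causal-LM example script.
def group_texts(token_ids_batch: list[list[int]], block_size: int) -> list[list[int]]:
    concatenated = [t for ids in token_ids_batch for t in ids]
    total = (len(concatenated) // block_size) * block_size  # drop remainder
    return [concatenated[i : i + block_size] for i in range(0, total, block_size)]


blocks = group_texts([[1, 2, 3], [4, 5, 6, 7]], block_size=3)
```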

configure_optimizers #

configure_optimizers()

Prepare optimizer and schedule (linear warmup and decay)
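The warmup-and-decay shape can be sketched as a learning-rate multiplier, matching the curve of `transformers.get_linear_schedule_with_warmup`: the rate ramps linearly from 0 to the base value over the warmup steps, then decays linearly back to 0. How it is wired into the optimizer here is an assumption, not taken from the source.

```python
# Learning-rate multiplier for linear warmup followed by linear decay,
# suitable for use with torch.optim.lr_scheduler.LambdaLR.
def linear_warmup_decay(step: int, warmup_steps: int, total_steps: int) -> float:
    if step < warmup_steps:
        return step / max(1, warmup_steps)  # ramp up: 0 -> 1
    # decay: 1 -> 0 over the remaining steps
    return max(0.0, (total_steps - step) / max(1, total_steps - warmup_steps))
```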