{"id":606,"date":"2024-12-05T05:37:17","date_gmt":"2024-12-05T05:37:17","guid":{"rendered":"https:\/\/www.alanknipmeyer.phd\/?p=606"},"modified":"2024-12-05T05:37:58","modified_gmt":"2024-12-05T05:37:58","slug":"adding-a-2nd-p40","status":"publish","type":"post","link":"https:\/\/www.alanknipmeyer.phd\/index.php\/2024\/12\/05\/adding-a-2nd-p40\/","title":{"rendered":"Adding a 2nd P40"},"content":{"rendered":"\n<p>Having already added a 2nd P100 to the other GPU server, it was time to fill the 2nd R720&#8217;s empty PCI slot and add a 2nd P40. This GPU, whilst old, still performs pretty well; the main draw for me is the amount of available VRAM for working on larger models &#8211; now a combined 48GB of VRAM dedicated to the Tesla-generation GPUs. I didn&#8217;t have a script that could run a simple GPU\/memory stress test, so I wrote one to incrementally verify that the memory and dual-GPU operation worked correctly. This is on Ubuntu 24.04, but it should work equally well on any CUDA-based Python 3 platform.<\/p>\n\n\n\n<figure class=\"wp-block-image size-large\"><img fetchpriority=\"high\" decoding=\"async\" width=\"768\" height=\"1024\" src=\"https:\/\/www.alanknipmeyer.phd\/wp-content\/uploads\/2024\/12\/Number2_GPU_P40-768x1024.jpg\" alt=\"\" class=\"wp-image-605\" srcset=\"https:\/\/www.alanknipmeyer.phd\/wp-content\/uploads\/2024\/12\/Number2_GPU_P40-768x1023.jpg 768w, https:\/\/www.alanknipmeyer.phd\/wp-content\/uploads\/2024\/12\/Number2_GPU_P40-225x300.jpg 225w, https:\/\/www.alanknipmeyer.phd\/wp-content\/uploads\/2024\/12\/Number2_GPU_P40-1153x1536.jpg 1153w, https:\/\/www.alanknipmeyer.phd\/wp-content\/uploads\/2024\/12\/Number2_GPU_P40-1537x2048.jpg 1537w, https:\/\/www.alanknipmeyer.phd\/wp-content\/uploads\/2024\/12\/Number2_GPU_P40.jpg 1816w\" sizes=\"(max-width: 768px) 100vw, 768px\" \/><\/figure>\n\n\n\n<p>Create a venv and install the necessary pip packages:<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>python3 -m venv work\nsource work\/bin\/activate\npip install \\
torch transformers\n\n<\/code><\/pre>\n\n\n\n<p>PyTorch script to stress test the available GPUs &#8211; adjust BATCH_SIZE to match the amount of available memory<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>import torch\nfrom torch import nn\nfrom torch.utils.data import DataLoader, Dataset\nfrom transformers import AutoModelForCausalLM, AutoTokenizer\n\n# Parameters for stress test\nMODEL_NAME = \"gpt2\"  # Any causal LM from the Hugging Face hub works here\nNUM_GPUS = torch.cuda.device_count()\nBATCH_SIZE = 16  # Adjust to push GPU memory limits\nSEQ_LEN = 128    # Sequence length for dummy data\nSTEPS = 200      # Number of steps for the stress test\n\n# Dummy dataset for training\nclass DummyDataset(Dataset):\n    def __init__(self, size, seq_len, tokenizer):\n        self.size = size\n        self.seq_len = seq_len\n        self.tokenizer = tokenizer\n\n    def __len__(self):\n        return self.size\n\n    def __getitem__(self, idx):\n        text = \"This is a dummy sentence for stress testing GPUs.\"\n        tokens = self.tokenizer(text, max_length=self.seq_len, padding=\"max_length\", truncation=True, return_tensors=\"pt\")\n        return tokens&#91;\"input_ids\"].squeeze(0), tokens&#91;\"attention_mask\"].squeeze(0)\n\n# Create the model and move it to GPUs\ndef create_model():\n    model = AutoModelForCausalLM.from_pretrained(MODEL_NAME)\n    if NUM_GPUS &gt; 1:\n        model = nn.DataParallel(model)  # For multi-GPU usage\n    model = model.to(\"cuda\")\n    return model\n\n# Training loop\ndef train(model, dataloader, optimizer, steps):\n    model.train()\n    for step, (input_ids, attention_mask) in enumerate(dataloader):\n        if step &gt;= steps:\n            break\n        input_ids, attention_mask = input_ids.to(\"cuda\"), attention_mask.to(\"cuda\")\n        \n        # For a causal LM the labels are the input ids; the model shifts them internally\n        labels = input_ids.clone()\n        outputs = model(input_ids, attention_mask=attention_mask,
labels=labels)\n        loss = outputs.loss\n        \n        optimizer.zero_grad()\n        loss = loss.mean()  # DataParallel returns one loss per GPU; average them\n        loss.backward()\n        optimizer.step()\n\n        if step % 10 == 0:\n            print(f\"Step {step}\/{steps}, Loss: {loss.item()}\")\n\n# Main\ndef main():\n    model = create_model()\n    print(f\"Using {NUM_GPUS} GPUs for stress test.\")\n    tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)\n    if tokenizer.pad_token is None:\n        tokenizer.add_special_tokens({'pad_token': '&#91;PAD]'})\n        # Unwrap nn.DataParallel (if used) before resizing the embeddings for the new token\n        base_model = model.module if isinstance(model, nn.DataParallel) else model\n        base_model.resize_token_embeddings(len(tokenizer))\n\n    dataset = DummyDataset(size=1000, seq_len=SEQ_LEN, tokenizer=tokenizer)\n    dataloader = DataLoader(dataset, batch_size=BATCH_SIZE, shuffle=True)\n    \n    optimizer = torch.optim.AdamW(model.parameters(), lr=5e-5)\n    \n    train(model, dataloader, optimizer, steps=STEPS)\n\nif __name__ == \"__main__\":\n    main()<\/code><\/pre>\n\n\n\n<p>I then logged the results, adjusting the batch size until the GPUs ran out of memory, using nvtop to capture stats in real time.<\/p>\n\n\n\n<figure class=\"wp-block-image size-large\"><img decoding=\"async\" width=\"1024\" height=\"517\" src=\"https:\/\/www.alanknipmeyer.phd\/wp-content\/uploads\/2024\/12\/image-1024x517.png\" alt=\"\" class=\"wp-image-607\" srcset=\"https:\/\/www.alanknipmeyer.phd\/wp-content\/uploads\/2024\/12\/image-1024x517.png 1024w, https:\/\/www.alanknipmeyer.phd\/wp-content\/uploads\/2024\/12\/image-300x151.png 300w, https:\/\/www.alanknipmeyer.phd\/wp-content\/uploads\/2024\/12\/image-768x388.png 768w, https:\/\/www.alanknipmeyer.phd\/wp-content\/uploads\/2024\/12\/image-1536x775.png 1536w, https:\/\/www.alanknipmeyer.phd\/wp-content\/uploads\/2024\/12\/image-2048x1034.png 2048w\" sizes=\"(max-width: 1024px) 100vw, 1024px\" \/><\/figure>\n\n\n\n<figure class=\"wp-block-table alignleft\"><table
class=\"has-fixed-layout\"><tbody><tr><td>BATCH_SIZE<\/td><td>Peak memory use (both GPUs)<\/td><td>Step Output<\/td><\/tr><tr><td>16<\/td><td>8GB<\/td><td><code>Step 0\/200, Loss: 13.902944564819336<br>Step 10\/200, Loss: 10.364654541015625<br>Step 20\/200, Loss: 9.427839279174805<br>Step 30\/200, Loss: 8.033140182495117<br>Step 40\/200, Loss: 6.651190280914307<br>Step 50\/200, Loss: 5.285730361938477<br>Step 60\/200, Loss: 3.734560966491699<\/code><\/td><\/tr><tr><td>32<\/td><td>19GB<\/td><td><code>Step 0\/200, Loss: 14.08126449584961<br>Step 10\/200, Loss: 10.370573043823242<br>Step 20\/200, Loss: 9.409626007080078<br>Step 30\/200, Loss: 8.021991729736328<\/code><\/td><\/tr><tr><td>64<\/td><td>21GB<\/td><td><code>Step 0\/200, Loss: 14.005016326904297<br>Step 10\/200, Loss: 10.352890014648438<\/code><\/td><\/tr><tr><td>128<\/td><td>OOM<\/td><td><code>torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 3.07 GiB. <\/code><br><code>GPU 0 has a total capacity of 22.38 GiB of which 1.89 GiB is free. <\/code><br><code>Process 2856 has 250.00 MiB memory in use. Including non-PyTorch memory, this process has 20.24 GiB memory in use.<\/code><\/td><\/tr><tr><td>110<\/td><td>32GB<\/td><td><code>Step 0\/200, Loss: 13.981659889221191<\/code><\/td><\/tr><tr><td>115<\/td><td>34GB<\/td><td><code>Step 0\/200, Loss: 13.970483779907227<\/code><\/td><\/tr><tr><td>120<\/td><td>OOM<\/td><td><code>torch.OutOfMemoryError: CUDA out of memory. <\/code><br><code>Tried to allocate 2.88 GiB. <\/code><br><code>GPU 0 has a total capacity of 22.38 GiB of which 2.16 GiB is free.<\/code><\/td><\/tr><tr><td>119<\/td><td>OOM<\/td><td><\/td><\/tr><tr><td>118<\/td><td>OOM<\/td><td><\/td><\/tr><tr><td>116<\/td><td>OOM<\/td><td><\/td><\/tr><\/tbody><\/table><\/figure>\n\n\n\n<p>Following the stress testing, I&#8217;m confident that these are both genuine NVIDIA GPUs and will be able to handle the workloads I use them for!
I&#8217;m doing some more &#8216;lab tidying&#8217; and will take you through the lab in an upcoming post.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>Having already added a 2nd P100 to the other GPU server, it was time to fill the 2nd R720&#8217;s empty PCI slot and add a 2nd P40. This GPU, whilst old, still performs pretty well; the main draw for me is the amount of available VRAM for working on larger models &#8211; now a combined 48GB [&hellip;]<\/p>\n","protected":false},"author":1,"featured_media":0,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[1],"tags":[],"class_list":["post-606","post","type-post","status-publish","format-standard","hentry","category-uncategorised"],"_links":{"self":[{"href":"https:\/\/www.alanknipmeyer.phd\/index.php\/wp-json\/wp\/v2\/posts\/606","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/www.alanknipmeyer.phd\/index.php\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.alanknipmeyer.phd\/index.php\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.alanknipmeyer.phd\/index.php\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/www.alanknipmeyer.phd\/index.php\/wp-json\/wp\/v2\/comments?post=606"}],"version-history":[{"count":2,"href":"https:\/\/www.alanknipmeyer.phd\/index.php\/wp-json\/wp\/v2\/posts\/606\/revisions"}],"predecessor-version":[{"id":609,"href":"https:\/\/www.alanknipmeyer.phd\/index.php\/wp-json\/wp\/v2\/posts\/606\/revisions\/609"}],"wp:attachment":[{"href":"https:\/\/www.alanknipmeyer.phd\/index.php\/wp-json\/wp\/v2\/media?parent=606"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.alanknipmeyer.phd\/index.php\/wp-json\/wp\/v2\/categories?post=606"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.alanknipmeyer.phd\/index.php\/wp-json\/wp\/v2\/tags?post=606"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","te
mplated":true}]}}