DeadSimple: PyTorch + CUDA on Your Laptop
Can it just f*cking work, please?
Getting PyTorch and CUDA working together is ANNOYING. Hopefully this gets you set up really quickly.
Will CUDA run on CPU/Integrated Graphics?
You can install the CUDA runtime, but PyTorch will fall back to the CPU if you don't have a dedicated NVIDIA graphics card. In Task Manager, you should see an NVIDIA device:
- I recommend making sure you're running the latest NVIDIA drivers (a quick check is sketched below).
- If you have an AMD card, you can’t use CUDA.
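If you'd rather check the driver side from Python than squint at Task Manager, here's a tiny sketch. It assumes nvidia-smi, which ships with the NVIDIA driver, is on your PATH.

# driver-check.py -- quick sanity check that the NVIDIA driver is installed
import shutil
import subprocess

if shutil.which("nvidia-smi") is None:
    print("nvidia-smi not found -- no NVIDIA driver on the PATH, so CUDA has no GPU to use.")
else:
    # nvidia-smi prints the driver version, GPU name, and current utilization
    subprocess.run(["nvidia-smi"], check=True)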
Don't Install CUDA Directly
So you can easily install CUDA onto your laptop, but if you're using it with PyTorch, then guess what:
“PyTorch doesn’t use the system’s CUDA library. When you install PyTorch using the precompiled binaries using either pip or conda it is shipped with a copy of the specified version of the CUDA library which is installed locally. In fact, you don't even need to install CUDA on your system to use PyTorch with CUDA support.” ~ SO post
Now, obviously, you can still install CUDA directly if you're going to use it outside of a Python dev env.
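If you're curious which CUDA version your PyTorch install was actually built against (the bundled one, not whatever toolkit might be on your system), a couple of standard torch attributes will tell you:

import torch

# The CUDA version this PyTorch binary was compiled against (shipped with the wheel);
# it can differ from any CUDA toolkit installed system-wide. None means a CPU-only build.
print(torch.version.cuda)
# The bundled cuDNN build number
print(torch.backends.cudnn.version())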
Install CUDA via PyTorch
There’s a great configurator at https://pytorch.org/get-started/locally/. It lets you choose your OS, package manager (i.e. conda or pip), and CUDA version.
This will output the command to install both PyTorch and CUDA, pulling a PyTorch build compiled against the CUDA version you picked. Here are the exact commands I needed to get a working PyTorch setup on Windows:
conda create -n cuda-pytorch-12.1
conda activate cuda-pytorch-12.1
conda install pytorch torchvision torchaudio pytorch-cuda=12.1 -c pytorch -c nvidia
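If you'd rather use pip than conda, the same configurator spits out an equivalent command. At the time of writing, for CUDA 12.1 it looks something like this (double-check the selector for your platform):

pip3 install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu121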
It takes a while to download all the packages, but once it's done, test it simply with:
conda activate cuda-pytorch-12.1
python
>>> import torch
>>> torch.cuda.is_available()
True
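If you want a slightly more convincing check than that one-liner, here's a short sketch that also prints the GPU's name and runs a small matrix multiply on it:

# cuda-smoke-test.py -- a fuller check than torch.cuda.is_available()
import torch

assert torch.cuda.is_available(), "CUDA is not available -- check your install"
print("GPU:", torch.cuda.get_device_name(0))

# A small matmul on the GPU confirms kernels actually launch
a = torch.randn(1024, 1024, device="cuda")
b = torch.randn(1024, 1024, device="cuda")
c = a @ b
torch.cuda.synchronize()
print("Matmul OK, result lives on", c.device)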
Or Just Pull a Docker Container
There are many Docker containers with PyTorch pre-configured. One that I like is from Chainguard, which has zero vulnerabilities.
docker pull cgr.dev/chainguard/pytorch-cuda12:latest
You can shell into it as follows. Note the --gpus flag, which binds your host GPUs to the container. As mentioned above, CUDA will ignore integrated graphics, so only your dedicated card will be used.
docker run --rm -it --gpus all cgr.dev/chainguard/pytorch-cuda12:latest
If you have multiple GPUs, you can specify one or more:
docker run --rm -it --gpus '"device=0"' cgr.dev/chainguard/pytorch-cuda12:latest
From within the container, you can run the same torch.cuda.is_available() check.
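If more than one GPU is visible (in the container or on the host), you can also pick a specific card from PyTorch itself; a minimal sketch:

import torch

# List every CUDA device PyTorch can see
print("Visible GPUs:", torch.cuda.device_count())
for i in range(torch.cuda.device_count()):
    print(f"  cuda:{i} ->", torch.cuda.get_device_name(i))

# Pin work to a specific card by indexing the device
device = torch.device("cuda:0")
x = torch.ones(3, 3, device=device)
print(x.device)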
If you want to run an example program, here’s the one I use.
# gpu-test.py
import torch


class IntenseModel(torch.nn.Module):
    def __init__(self):
        super(IntenseModel, self).__init__()
        # Increasing the layer size
        self.linear1 = torch.nn.Linear(1000, 2000)
        self.activation1 = torch.nn.ReLU()
        self.batchnorm1 = torch.nn.BatchNorm1d(2000)  # Adding batch normalization
        self.linear2 = torch.nn.Linear(2000, 1000)
        self.activation2 = torch.nn.ReLU()
        self.batchnorm2 = torch.nn.BatchNorm1d(1000)
        self.linear3 = torch.nn.Linear(1000, 500)
        self.activation3 = torch.nn.ReLU()
        self.linear4 = torch.nn.Linear(500, 10)
        self.softmax = torch.nn.Softmax(dim=1)  # Specify the dimension for Softmax

    def forward(self, x):
        x = self.linear1(x)
        x = self.activation1(x)
        x = self.batchnorm1(x)
        x = self.linear2(x)
        x = self.activation2(x)
        x = self.batchnorm2(x)
        x = self.linear3(x)
        x = self.activation3(x)
        x = self.linear4(x)
        x = self.softmax(x)
        return x


if not torch.cuda.is_available():
    raise RuntimeError("CUDA is not available. This script requires a GPU.")

device = torch.device("cuda")  # Directly set device to CUDA since we've checked its availability

intensemodel = IntenseModel().to(device)

print('The model:')
print(intensemodel)

print('\n\nModel params:')
for param in intensemodel.parameters():
    print(param)
Running this locally in conda or in the Docker container should result in spikes in your GPU utilization. To run it in the Docker container, either copy it in and run it from a shell, or do something like this:
docker run --rm -it -v /path/to/gpu-test.py:/gpu-test.py --gpus all cgr.dev/chainguard/pytorch-cuda12:latest -c "python /gpu-test.py"
Where /path/to/gpu-test.py is the path to the script on your local machine.
When I run this, it results in quick bursts of utilization, and running it repeatedly shows up as pulses in Task Manager.
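If you'd rather see sustained load than quick bursts, you can append something like this to the end of gpu-test.py. It just pushes random batches through the model defined above (the batch size and iteration count here are arbitrary):

# Optional: generate sustained GPU load with repeated forward passes
x = torch.randn(4096, 1000, device=device)  # 4096 random 1000-dim inputs
with torch.no_grad():
    for _ in range(200):
        y = intensemodel(x)
torch.cuda.synchronize()
print("Output shape:", y.shape)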
Bonus — Kill My Laptop!
Here’s a program that will chew through a ton of GPU memory and probably trigger an OOM error.
# gpu-ultra-test.py
import torch
import torch.nn as nn
import torch.nn.functional as F
import torchvision.transforms as transforms
from torch.utils.data import DataLoader, TensorDataset


class UltraIntenseModel(nn.Module):
    def __init__(self):
        super(UltraIntenseModel, self).__init__()
        # Enhanced convolutional blocks with more layers and features
        self.features = nn.Sequential(
            nn.Conv2d(3, 128, kernel_size=5, padding=2),
            nn.ReLU(),
            nn.Conv2d(128, 256, kernel_size=5, padding=2),
            nn.ReLU(),
            nn.MaxPool2d(2, 2),
            nn.Conv2d(256, 512, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.Conv2d(512, 512, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.MaxPool2d(2, 2),
            nn.Dropout(0.5)
        )
        # Classifier with large fully connected layers
        self.classifier = nn.Sequential(
            nn.Linear(512 * 56 * 56, 4096),
            nn.ReLU(),
            nn.Dropout(0.5),
            nn.Linear(4096, 1024),
            nn.ReLU(),
            nn.Linear(1024, 10),
            nn.Softmax(dim=1)
        )

    def forward(self, x):
        x = self.features(x)
        x = torch.flatten(x, 1)
        x = self.classifier(x)
        return x


# Check if CUDA is available, and crash if not
if not torch.cuda.is_available():
    raise RuntimeError("CUDA is not available. This script requires a GPU.")

device = torch.device("cuda")  # Directly set device to CUDA since we've checked its availability

# Generate random large high-resolution data
input_tensor = torch.randn(64, 3, 224, 224)  # 64 images of 224x224 resolution
target_tensor = torch.randint(0, 10, (64,))

# Data loader
dataset = TensorDataset(input_tensor, target_tensor)
loader = DataLoader(dataset, batch_size=16, shuffle=True)

# Model
model = UltraIntenseModel().to(device)

# Optimizer
optimizer = torch.optim.Adam(model.parameters(), lr=0.001)

# Train the model
model.train()
for epoch in range(10):  # Run more epochs to see sustained GPU usage
    for data, target in loader:
        data, target = data.to(device), target.to(device)
        optimizer.zero_grad()
        output = model(data)
        loss = F.cross_entropy(output, target)
        loss.backward()
        optimizer.step()
    print(f"Epoch {epoch+1}, Loss: {loss.item()}")

print("Training complete")
Useful for really hammering your GPU memory.
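If you want to watch the memory actually climb (or see how close you got before the OOM), torch exposes a few counters you can drop into the training loop:

# Report CUDA memory use for the current device (values in GB)
print(f"Allocated:      {torch.cuda.memory_allocated() / 1e9:.2f} GB")
print(f"Peak allocated: {torch.cuda.max_memory_allocated() / 1e9:.2f} GB")
print(f"Reserved:       {torch.cuda.memory_reserved() / 1e9:.2f} GB")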