As I’ve recently been using AI to generate static graphics with Stable Diffusion, I was interested to see whether my setup could also generate video. As I’m running an older series of GPUs (Tesla P100 & P40), both memory and CUDA compute capability could be limiting factors.
I searched around and found ‘CogVideo’, a fully open-source text-to-video model that should run on my series of GPUs, as it is built on PyTorch, which supports my GPUs’ CUDA compute capability.
I followed the tutorial by Aleksandar, and below I’ve included the download snippet and the test code I used to render.
from huggingface_hub import snapshot_download

# Download the CogVideoX-2b model files from Hugging Face into a local directory
snapshot_download(repo_id="THUDM/CogVideoX-2b", local_dir="/home/alan/localdev/CogVideo/modelFiles")
You’ll need around 13 GB of disk space and some decent bandwidth to download the files. I have roughly 900 Mb/s of download bandwidth available here, and it took about 10 minutes to get all the files.
The pip install of the requirements took some time as well on the Dell R720 with its 10K SAS drives, but it worked first time. You’ll see below the code, which includes torch.cuda.empty_cache() calls to help with memory management. I had to stop everything else that was using the GPUs so the P100 could render the video, but it did work!
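For reference, here’s a quick sketch (my own addition, not from the tutorial) of how to make sure a run only sees the P100 and to check how much VRAM is actually free before starting; the device index 0 is an assumption for this host:

import os

# Assumption: the P100 is device 0 on this host; hide any other GPUs from this process
os.environ["CUDA_VISIBLE_DEVICES"] = "0"

import torch

# Report how much VRAM is actually free before kicking off a render
free_bytes, total_bytes = torch.cuda.mem_get_info(0)
print(f"{torch.cuda.get_device_name(0)}: "
      f"{free_bytes / 1024**3:.1f} GiB free of {total_bytes / 1024**3:.1f} GiB")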
import torch
torch.cuda.empty_cache()

from diffusers import CogVideoXPipeline
from diffusers.utils import export_to_video

#prompt = "A panda, dressed in a small, red jacket and a tiny hat, sits on a wooden stool in a serene bamboo forest. The panda's fluffy paws strum a miniature acoustic guitar, producing soft, melodic tunes. Nearby, a few other pandas gather, watching curiously and some clapping in rhythm. Sunlight filters through the tall bamboo, casting a gentle glow on the scene. The panda's face is expressive, showing concentration and joy as it plays. The background includes a small, flowing stream and vibrant green foliage, enhancing the peaceful and magical atmosphere of this unique musical performance."
#prompt = "A female humanoid robot in an industrial complex standing and operating a machine producing other male humanoid robots. Other robots pass in the background looking at what is going on. In the distance we see a series of skyscrapers and offices with other people observing the robots' work"
prompt = "a cartoon rabbit trying to pull a huge carrot out of a field surrounded by other rabbits watching and waiting. The sky is blue and the sun is out and the rabbit is getting hotter and hotter, sweating in the sun trying to pull the carrot out"

# Load the pipeline from the locally downloaded model files in half precision
pipe = CogVideoXPipeline.from_pretrained(
    "/home/alan/localdev/CogVideo/modelFiles",
    torch_dtype=torch.float16
)
torch.cuda.empty_cache()

# Offload model components to the CPU to keep VRAM usage within the card's limits
pipe.enable_model_cpu_offload()
pipe.enable_sequential_cpu_offload()
torch.cuda.empty_cache()

# Further memory savers for the VAE decode, left disabled for this run
#pipe.vae.enable_slicing()
#pipe.vae.enable_tiling()

# Generate 49 frames with a fixed seed so the result is reproducible
video = pipe(
    prompt=prompt,
    num_videos_per_prompt=1,
    num_inference_steps=40,
    num_frames=49,
    guidance_scale=6,
    generator=torch.Generator(device="cuda:0").manual_seed(42),
).frames[0]

export_to_video(video, "output3.mp4", fps=8)
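For context, 49 frames at 8 fps works out to just over six seconds of video per render. If memory is still tight, the commented-out enable_slicing() and enable_tiling() calls on the VAE are worth trying; they reduce peak VRAM during the decode step at the cost of a little speed.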
P100 time to render:
Loading checkpoint shards: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:00<00:00, 2.51it/s]
Loading pipeline components...: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 5/5 [00:01<00:00, 2.66it/s]
100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 40/40 [30:55<00:00, 46.39s/it]
I was really impressed that the video rendered, and not in an unreasonable amount of time. The P40 didn’t fare quite so well, with an estimate of around 6 hours to complete the render. I will investigate this further: the code is identical and the P40 actually has more memory, so the slowdown must be a limitation of the GPU itself (most likely the P40’s much lower FP16 throughput compared to the P100, which matters here because the pipeline runs in torch.float16).

P40 time to render:
Loading checkpoint shards: 100%|██████████████████| 2/2 [00:00<00:00, 2.50it/s]
Loading pipeline components...: 100%|█████████████| 5/5 [00:01<00:00, 2.63it/s]
28%|██████████▍ | 11/40 [1:54:15<5:00:50, 622.44s/it]^C^C^C^CKilled
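A quick way to sanity-check the FP16 theory is to time a half-precision matrix multiply on each visible card; this little benchmark is my own addition rather than part of the tutorial:

import time
import torch

# Time a large half-precision matmul on every CUDA device PyTorch can see
for i in range(torch.cuda.device_count()):
    device = f"cuda:{i}"
    a = torch.randn(4096, 4096, dtype=torch.float16, device=device)
    b = torch.randn(4096, 4096, dtype=torch.float16, device=device)
    torch.cuda.synchronize(device)
    start = time.time()
    for _ in range(20):
        a @ b
    torch.cuda.synchronize(device)
    print(f"{torch.cuda.get_device_name(i)}: {(time.time() - start) / 20:.4f} s per fp16 matmul")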
I’ve since ordered another P100 for this server and an additional server for the P40; I will be adding these to the Kubernetes & Slurm clusters I have built to test things out on. Once I have the second P100 in the same server, I will see if I can modify the code to use both GPUs and reduce the render time.
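Splitting a single denoising run across two cards isn’t straightforward, so my first experiment will probably be simpler: run one pipeline per GPU and render different prompts in parallel, which at least doubles throughput. A rough sketch of that idea, where the prompts and output filenames are just placeholders:

import torch
import torch.multiprocessing as mp
from diffusers import CogVideoXPipeline
from diffusers.utils import export_to_video

MODEL_DIR = "/home/alan/localdev/CogVideo/modelFiles"

def render(gpu_id, prompt, outfile):
    # Each worker loads its own copy of the pipeline and offloads to its own GPU
    # (assumes diffusers' enable_sequential_cpu_offload accepts a gpu_id argument)
    pipe = CogVideoXPipeline.from_pretrained(MODEL_DIR, torch_dtype=torch.float16)
    pipe.enable_sequential_cpu_offload(gpu_id=gpu_id)
    video = pipe(
        prompt=prompt,
        num_inference_steps=40,
        num_frames=49,
        guidance_scale=6,
        generator=torch.Generator(device=f"cuda:{gpu_id}").manual_seed(42),
    ).frames[0]
    export_to_video(video, outfile, fps=8)

if __name__ == "__main__":
    # Placeholder prompts: one render per GPU, running at the same time
    jobs = [
        (0, "a cartoon rabbit trying to pull a huge carrot out of a field", "gpu0.mp4"),
        (1, "a panda playing an acoustic guitar in a bamboo forest", "gpu1.mp4"),
    ]
    mp.set_start_method("spawn")
    procs = [mp.Process(target=render, args=job) for job in jobs]
    for p in procs:
        p.start()
    for p in procs:
        p.join()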
Undoubtedly, maximizing the utilization and computational capabilities of these older GPUs will help my own research, as I will be looking at scalable and symmetric processing with algorithms designed to work in that way, so ‘text to video’ is both an interesting and technically sound place to start ahead of actual distributed cryptanalysis.
Here are the videos the script produced; I hope you like them!
Big thanks to Aleksandar as well; I’ve since subscribed to his YouTube channel, as it has a lot of really interesting content on it!