Install NVIDIA driver in an ESXi VM
Recently, I wanted to set up a server for AI in my homelab. For I only have one motherboard that have enough space for a RTX 3090, and it is too expensive to buy a new set up, I decided to install EXSi on that server and use GPU passthrough.
The VM is configured on:
- EPYC 7302 * 48
- Based on ESXi-7.0u3
- NVIDIA GeForce RTX 3090
- 128GB memory
- Debian GNU/Linux 12 (bookworm) x86_64
edited at Nov 8 2024: I don’t know why I use so much memory and cpu cores, actually that doesn’t matter at all, AI models don’t use CPU and memory.
Progress
Firstly, create the VM in ESXi and enable the ability of passthrough NVIDIA GPU:
- pre-alloc all memory
- set
hypervisor.cpuid.v0=FALSE
in VM config file - set
pciPassthru0.msiEnabled=FALSE
in VM config file
If you create the VM without any extra config, GPU will not work. But anyway, you can find these operation in many otherwhere. Without any further ado, here is the installation progress:
Config apt to include non-free-firmware
. In /etc/apt/source.list
, add non-free-firmware
, like
deb https://deb.debian.org/debian/ bookworm main contrib non-free non-free-firmware
Then, update apt source.
sudo apt update
apt search ^nvidia-driver
If you nothing goes wrong, should be able to find the nvidia driver package like
nvidia-driver/unknown 545.23.06-1 amd64
NVIDIA metapackage
Issue
In normal physical machine, just run sudo apt install nvidia-driver
is enough. But in VM, that doesn’t work.
After installed cuda with
sudo apt-get install software-properties-common
wget https://developer.download.nvidia.com/compute/cuda/12.3.0/local_installers/cuda-repo-debian12-12-3-local_12.3.0-545.23.06-1_amd64.deb
sudo dpkg -i cuda-repo-debian12-12-3-local_12.3.0-545.23.06-1_amd64.deb
sudo cp /var/cuda-repo-debian12-12-3-local/cuda-*-keyring.gpg /usr/share/keyrings/
sudo add-apt-repository contrib
sudo apt-get update
sudo apt-get -y install cuda-toolkit-12-3
I run sudo apt install nvidia-driver
and reboot.
If trying to find the GPU with nvidia-smi
, there is nothing found.
So, I need to ensure the GPU is detected.
sudo apt install nvidia-detect
nvidia-detect
I got this confusing error
Detected NVIDIA GPUs:
1b:00.0 VGA compatible controller [0300]: NVIDIA Corporation GA102 [GeForce RTX 3090] [10de:2204] (rev a1)
Checking card: NVIDIA Corporation GA102 [GeForce RTX 3090] (rev a1)
Uh oh. Your card is not supported by any driver version up to 545.23.06.
A newer driver may add support for your card.
Newer driver releases may be available in backports, unstable or experimental.
There is no way that the latest driver doesn’t support RTX 3090. I check whether it supports or not in NVIDIA driver download page. Of course, is supports.
Solution
When I was confused and reboot the VM again and again, I found the key of the issue. I usually use ssh to connect the VM, but once I connected with VNC (VMRC), I see this error when booting:
[ 12.699654] NVRM: loading NVIDIA UNIX x86_64 Kernel Module 530.41.03 Thu Mar 16 19:48:20 UTC 2023
[ 12.762447] nvidia-modeset: Loading NVIDIA Kernel Mode Setting Driver for UNIX platforms 530.41.03 Thu Mar 16 19:23:04 UTC 2023
[ 12.871331] [drm] [nvidia-drm] [GPU ID 0x00000b00] Loading driver
[ 12.972022] ACPI Warning: \_SB.PCI0.PE50.S1F0._DSM: Argument #4 type mismatch - Found [Buffer], ACPI requires [Package] (20210730/nsarguments-61)
[ 13.732645] NVRM: GPU 0000:0b:00.0: RmInitAdapter failed! (0x26:0x56:1474)
[ 13.732697] BUG: unable to handle page fault for address: 0000000000004628
[ 13.732784] NVRM: GPU 0000:0b:00.0: rm_init_adapter failed, device minor number 0
I found out the solution of this issue at nvidia forum.
In a word, I need to install the open kernel version not the default version. The answer in the forum show I can install the driver with .run
file with argument -m=kernel-open
.
edit at Nov 8 2024:
It is also possible to install the open kernel version in apt.
So, I cleaned the process installation.
sudo nvidia-uninstall
sudo apt purge -y '^nvidia-*' '^libnvidia-*'
sudo rm -r /var/lib/dkms/nvidia
sudo apt -y autoremove
sudo update-initramfs -c -k `uname -r`
sudo update-grub2
sudo reboot
And install the driver with open kernel.
sudo ./NVIDIA-Linux-x86_64-525.116.04.run -m=kernel-open
sudo update-initramfs -u
sudo reboot
Unfortunately, it still doesn’t solve the problem, Still nothing can be found in nvidia-smi
. But is does make some effect, there is no that error when booting.
After further searching, I finally find the solution in the forum.
The solution is add one line of config in /etc/modprobe.d/nvidia.conf
(if not exists, create the file).
options nvidia NVreg_OpenRmEnableUnsupportedGpus=1
Reboot, issue solved.