On-premise vs. cloud for 70B fine-tuning in India
Indian ML engineers increasingly face a pragmatic decision: rent H100 instances on AWS or Azure at ₹500–1,500 per GPU-hour, or build an on-premise workstation. For a team fine-tuning models regularly (weekly or monthly), on-premise breaks even in 3–6 months versus cloud. For researchers running occasional experiments, cloud is more cost-effective. This guide covers the on-premise path for teams who have crossed that break-even threshold.
GPU: dual RTX 5090 — VRAM capacity planning
Why VRAM is the primary constraint
LLM fine-tuning is entirely VRAM-bound. The model weights, optimizer state (for AdamW — the standard optimizer — this is 2× the model size in mixed precision), and activation gradients must all fit in VRAM simultaneously. A 70B parameter model in BF16 (16-bit floating point, the standard training precision) requires approximately 140 GB VRAM for fine-tuning. Two RTX 5090s provide 64 GB total VRAM — insufficient for full-precision 70B fine-tuning. However, QLoRA (Quantized Low-Rank Adaptation) loads the base model in 4-bit precision (~35 GB for 70B) while training only small LoRA adapter weights in 16-bit precision, bringing the total VRAM requirement to approximately 50–60 GB for 70B QLoRA fine-tuning. Two RTX 5090s handle this with 4–14 GB headroom depending on batch size.
RTX 5090 in India — pricing and sourcing
The NVIDIA RTX 5090 (Blackwell GB202, 32 GB GDDR7, 575W TDP) is priced approximately ₹2,00,000–2,30,000 per card in India through authorised ASUS, MSI, and Gigabyte AIB partners. Two cards total ₹4,00,000–4,60,000. The PSU must supply at least 200W headroom above dual-GPU TDP — a 1600W PSU (Seasonic TX-1600 or Corsair AX1600i at ₹30,000–45,000) is the correct choice for a dual-RTX-5090 system.
NVLink alternatives for consumer GPU multi-GPU training
Since RTX 5090 lacks NVLink, multi-GPU training uses PCIe 5.0 ×16 per GPU (via the TRX50 platform's ample PCIe lanes). PCIe 5.0 ×16 provides approximately 128 GB/s bidirectional bandwidth. For data-parallel training (each GPU processes different mini-batches and gradients are averaged across GPUs via all-reduce) this is sufficient — gradient synchronization during a typical training step is a fraction of total training time. For model-parallel training (the model is split across GPUs, with each layer on a different GPU) — which is required for models larger than a single GPU's VRAM — PCIe bandwidth becomes a bottleneck. The workaround for on-premise model-parallel 70B training is QLoRA combined with gradient checkpointing (recomputing intermediate activations during backpropagation instead of storing them, trading compute time for VRAM).
CPU: Threadripper 7980X for PCIe lane count
The AMD Threadripper 7980X on the TRX50 platform provides 88 usable PCIe 5.0 lanes. Two RTX 5090s each running at PCIe 5.0 ×16 occupy 32 lanes, leaving 56 lanes for NVMe arrays (training scratch), 10GbE networking, and future expansion. Consumer platforms (AM5 Intel Core Ultra 9) provide 24–44 PCIe lanes total — two GPUs would share bandwidth, with one or both dropping to ×8, reducing effective training throughput by 15–25% in bandwidth-sensitive workloads. The Threadripper 7980X costs approximately ₹2,20,000–2,60,000 in India.
RAM: 256 GB DDR5 for data preprocessing pipelines
Training pipeline preprocessing (tokenisation — converting raw text to integer token IDs, data augmentation, batch assembly) runs on the CPU while the GPU trains. For datasets above 50 GB (a typical instruction fine-tuning dataset for a 70B model runs 20–100 GB), 256 GB DDR5 enables loading the full dataset into memory once rather than re-reading from disk each epoch. This reduces epoch time significantly on datasets with slow NVMe I/O saturation. 256 GB (4×64 GB DDR5-4800) costs approximately ₹1,30,000–1,60,000 on the TRX50 platform through authorised distributors.
Thermal management for 24/7 ML training
Dual RTX 5090s at full training load produce approximately 1,150W of heat inside the case. At Indian summer ambient temperatures (38–42°C), this creates severe thermal stress. A 480mm or 420mm AIO for the CPU (to free case airflow for the GPUs), positive case pressure (front intake fans exceeding rear exhaust), and a full-tower case with mesh front panel are mandatory. GPU temperatures must stay below 83°C for sustained compute throughput — above this, NVIDIA GPUs throttle clock speeds to protect themselves, extending training time. If the training room does not have air conditioning, a dedicated portable AC unit pointed at the workstation intake is a practical solution during summer training runs. Our workstation repair service handles thermal diagnosis for ML systems with unexpectedly long training times — in many cases, thermal paste replacement and dust clearing restore training throughput without hardware replacement.