
Why 4 GPUs Can Be Slower Than 1 on Budget Clouds

I rented a 4× RTX 4090 server to learn distributed training hands-on. The cost seemed reasonable: €0.58/hour, versus €0.15/hour for a single GPU. I expected more GPUs to mean better performance. The results surprised me.

2 GPUs trained 3.4× slower than 1 GPU.

This is a story about discovering an infrastructure limitation through hands-on experience—something I couldn’t have learned from reading documentation or running on properly configured servers.

The Discovery

I was training a 1.7B parameter language model using PyTorch DDP. Single GPU: smooth 15.4 samples/second. Two GPUs: 4.5 samples/second. Four GPUs: 14.0 samples/second—barely matching single GPU performance.

Something was fundamentally broken.

The Investigation

I profiled the training loop. Forward and backward passes were fine—same speed on 1 or 2 GPUs. But the optimizer step, which includes gradient synchronization, exploded from 19ms to 400ms on 2 GPUs. A 21× slowdown.
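For reference, phase timings like these can be captured with CUDA events; below is a minimal sketch of that kind of measurement, where model, batch, labels, optimizer, and loss_fn stand in for the real training objects:

import torch

def time_phases(model, batch, labels, optimizer, loss_fn):
    """Time forward, backward, and optimizer step separately, in milliseconds."""
    def timed(fn):
        start = torch.cuda.Event(enable_timing=True)
        end = torch.cuda.Event(enable_timing=True)
        start.record()
        result = fn()
        end.record()
        torch.cuda.synchronize()   # make sure the recorded events have completed
        return result, start.elapsed_time(end)

    loss, t_fwd = timed(lambda: loss_fn(model(batch), labels))
    _, t_bwd = timed(lambda: loss.backward())
    _, t_opt = timed(lambda: (optimizer.step(), optimizer.zero_grad()))  # includes waiting on gradient sync
    return {"forward_ms": t_fwd, "backward_ms": t_bwd, "optimizer_ms": t_opt}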

I checked the GPU topology:

$ nvidia-smi topo -m
        GPU0    GPU1    GPU2    GPU3
GPU0     X      PHB     SYS     SYS
GPU1    PHB      X      SYS     SYS
GPU2    SYS     SYS      X      PHB
GPU3    SYS     SYS     PHB      X

GPU pairs 0-1 and 2-3 were connected through a PCIe Host Bridge (PHB); the remaining pairs crossed the CPU interconnect (SYS). Not ideal, but it should still work. I tested different GPU pairs. Same result.

Then I found it in the NCCL logs:

NCCL INFO P2P is disabled between connected GPUs

Repeated 12 times, once for each ordered GPU pair.
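These lines only appear with NCCL's debug logging enabled. A minimal way to turn it on from Python, assuming the variables are set before the process group (and therefore the NCCL communicators) is created; the SUBSYS filter is optional:

import os

import torch.distributed as dist

# NCCL reads these when the communicators are created, so set them before init.
os.environ["NCCL_DEBUG"] = "INFO"             # print topology, transport, and P2P decisions
os.environ["NCCL_DEBUG_SUBSYS"] = "INIT,P2P"  # keep the output focused on init and P2P

dist.init_process_group(backend="nccl")       # run under torchrun so rank/world size are set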

The Root Cause

P2P (Peer-to-Peer) GPU communication was disabled. Despite the GPUs being physically connected via PCIe, the system prevented direct GPU-to-GPU data transfer.

Instead of:

GPU 0 ──[PCIe Direct P2P]──> GPU 1
~50-80ms latency

The data routed through system memory:

GPU 0 ──[PCIe]──> RAM ──[PCIe]──> GPU 1
~400ms observed

I measured the bandwidth between GPU pairs:

# P2P bandwidth test results
GPU 0 -> GPU 1: 32 Gbps
GPU 0 -> GPU 2: 105 Gbps  
GPU 1 -> GPU 3: 87 Gbps

PCIe 4.0 x16 has ~200 Gbps practical bandwidth. My measurements showed 0.15-0.5× of this capability—significantly degraded by routing through system memory.
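Per-pair numbers like these can be reproduced with nothing more than timed device-to-device copies in PyTorch; a rough sketch (buffer size, iteration count, and the pair list are arbitrary choices):

import time
import torch

def gpu_pair_bandwidth_gbps(src: int, dst: int, size_mb: int = 256, iters: int = 20) -> float:
    """Measure effective copy bandwidth from GPU `src` to GPU `dst` in Gbps."""
    x = torch.empty(size_mb * 1024 * 1024, dtype=torch.uint8, device=f"cuda:{src}")
    y = torch.empty_like(x, device=f"cuda:{dst}")
    y.copy_(x)                      # warm-up: initialize the transfer path once
    torch.cuda.synchronize(src)
    torch.cuda.synchronize(dst)

    t0 = time.perf_counter()
    for _ in range(iters):
        y.copy_(x)                  # direct P2P if enabled, staged through host RAM if not
    torch.cuda.synchronize(src)
    torch.cuda.synchronize(dst)
    elapsed = time.perf_counter() - t0

    bits_moved = size_mb * 1024 * 1024 * 8 * iters
    return bits_moved / elapsed / 1e9

for src, dst in [(0, 1), (0, 2), (1, 3)]:
    print(f"GPU {src} -> GPU {dst}: {gpu_pair_bandwidth_gbps(src, dst):.0f} Gbps")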

Verification on Second Provider

I validated this on RunPod to confirm it wasn’t Vast.ai-specific:

import torch
# Check P2P capability
for i in range(4):
    for j in range(4):
        if i != j:
            can_access = torch.cuda.can_device_access_peer(i, j)
            print(f"GPU {i} -> GPU {j}: {'✓' if can_access else '✗'}")

Result: All pairs returned False. P2P disabled across the board.

I didn’t run full training benchmarks on RunPod (cost), but the P2P check and bandwidth tests confirmed the same infrastructure limitation.

Why Is P2P Disabled?

Budget GPU cloud providers (Vast.ai, RunPod, etc.) run multi-tenant infrastructure. Customer A rents GPU 0+1, Customer B rents GPU 2+3, on the same physical machine.

P2P would allow direct GPU-to-GPU memory access across customer boundaries, so multi-tenant security requirements mean disabling it globally.

This is a reasonable security trade-off for shared infrastructure. It just has significant performance implications for distributed training.

The Performance Impact

Here’s what disabled P2P actually costs:

GPUs    Samples/sec    Speedup    Efficiency    Cost/hour
1       15.4           1.00×      100%          €0.15
2       4.5            0.29×      15%           €0.30
4       14.0           0.91×      23%           €0.58

Speedup: Actual throughput relative to single GPU
Efficiency: Speedup ÷ number of GPUs (ideal = 100%)
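As a quick sanity check, the speedup and efficiency columns follow directly from the measured throughput:

# Speedup and efficiency derived from the measured throughput (samples/sec).
baseline = 15.4                       # single-GPU throughput
for gpus, throughput in [(1, 15.4), (2, 4.5), (4, 14.0)]:
    speedup = throughput / baseline
    efficiency = speedup / gpus
    print(f"{gpus} GPU(s): speedup {speedup:.2f}x, efficiency {efficiency:.0%}")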

Working multi-GPU setups typically achieve 80-90% efficiency. With disabled P2P, you get 15-25% efficiency.

FSDP Impact

FSDP (Fully Sharded Data Parallel) suffers even more because it requires frequent all-gather operations:

Strategy         Samples/sec    vs SHARD_GRAD_OP
SHARD_GRAD_OP    4.13           1.00× (baseline)
NO_SHARD         3.88           0.94×
FULL_SHARD       2.88           0.70×

FULL_SHARD must all-gather the sharded weights before every forward pass. Without P2P, that communication dominates training time, making the most memory-efficient strategy the slowest.
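For reference, the three strategies correspond directly to FSDP's sharding_strategy argument; a minimal sketch, where model stands in for whatever module you are wrapping:

import torch
import torch.distributed as dist
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP, ShardingStrategy

dist.init_process_group(backend="nccl")   # run under torchrun

# FULL_SHARD:    shard parameters, gradients, and optimizer state;
#                all-gathers weights before forward and backward (most communication).
# SHARD_GRAD_OP: shard gradients and optimizer state; weights stay gathered
#                between forward and backward.
# NO_SHARD:      replicate everything, roughly equivalent to DDP.
fsdp_model = FSDP(
    model,                                          # the module being trained (defined elsewhere)
    sharding_strategy=ShardingStrategy.SHARD_GRAD_OP,
    device_id=torch.cuda.current_device(),
)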

The Implications

For Individual Learners

I wanted to learn distributed training through hands-on experimentation. Budget clouds are the only affordable option—premium providers charge 4-8× more.

Multi-GPU on budget clouds has significant limitations for distributed training workloads due to disabled P2P. The hardware exists, but the infrastructure constraints prevent efficient use.

For the Industry

What this says about ML infrastructure in practice: the middle tier (2-4 GPUs with working P2P) rarely exists on budget clouds. Companies do use 2-4 GPU setups for research, but typically on their own infrastructure or on premium clouds where P2P works.

Budget clouds serve single-GPU workloads effectively, but multi-GPU configurations face fundamental constraints from the multi-tenant architecture. I still use these providers for single-GPU experiments and dev environments; the price-to-performance of a single GPU is unbeatable.

What I Learned

Technical Lessons

  1. Infrastructure constraints matter as much as code - Perfect DDP implementation can’t overcome disabled P2P
  2. Topology alone doesn’t guarantee performance - Need P2P enabled + good topology
  3. FSDP amplifies communication overhead - More frequent all-gathers make P2P even more critical
  4. Always measure before scaling - More hardware can mean worse performance

Strategic Lessons

  1. Always verify infrastructure assumptions - Don’t assume multi-GPU means working distributed training
  2. Budget clouds optimize for specific use cases - Single-GPU workloads work great, multi-GPU has trade-offs
  3. Test infrastructure early - Simple checks reveal constraints before significant investment

Personal Lessons

  1. Initial investigation yields exponential insights - First €20 of experiments taught more than the next €500 would
  2. Understanding constraints is practical knowledge - Knowing what won’t work is as valuable as knowing what will
  3. Hands-on experimentation reveals hidden assumptions - Reading docs can’t replace actual testing

Recommendations

For training on budget clouds: if your model fits on a single GPU (e.g., within 80 GB of memory), use one high-end GPU.

For multi-GPU training: premium providers (Lambda Labs, AWS, GCP, Azure, Nebius) typically have working P2P, but cost 4-8× more. I didn't test these for this investigation.

Conclusion

Four GPUs aren’t slower than one GPU because of physics. They’re slower because of infrastructure trade-offs—security requirements that disable direct GPU-to-GPU communication.

These trade-offs make sense from the provider’s perspective. They just have significant performance implications worth understanding upfront.

The diagnostic check from earlier in the post is the first thing worth running after spinning up a multi-GPU instance. Most providers don’t list P2P status in their specifications, so verifying this early can save time and money.
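A second cheap pre-flight check is timing a bare all-reduce, since that is the collective DDP issues every step. A minimal sketch to launch with torchrun (the tensor size, iteration count, and script name are arbitrary choices):

import os
import time

import torch
import torch.distributed as dist

# Launch with: torchrun --nproc_per_node=<num_gpus> allreduce_check.py
dist.init_process_group(backend="nccl")
torch.cuda.set_device(int(os.environ["LOCAL_RANK"]))

# 256M fp16 elements (~0.5 GB), large enough to expose a slow interconnect.
tensor = torch.ones(256 * 1024 * 1024, dtype=torch.float16, device="cuda")

for _ in range(3):                 # warm-up iterations
    dist.all_reduce(tensor)
torch.cuda.synchronize()

iters = 10
t0 = time.perf_counter()
for _ in range(iters):
    dist.all_reduce(tensor)
torch.cuda.synchronize()
per_call_ms = (time.perf_counter() - t0) / iters * 1000

if dist.get_rank() == 0:
    print(f"all_reduce of {tensor.numel() * 2 / 1e9:.1f} GB: {per_call_ms:.0f} ms per call")

dist.destroy_process_group()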


Full technical details, reproduction instructions, and raw data: GitHub Repository