
Atlas

Transformer Inference Server

  • 10x Better Performance/$ than DGX-H100

Available May ’24

Software & Systems Overview

  • Chat interface example

  • GitHub

    OpenAI-compatible LLM API

  • Load balancer and scheduler

  • Transformer engine

  • Configurable accelerator
    with field updates
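Because the server exposes an OpenAI-compatible LLM API, a standard OpenAI client or plain HTTP request should work against it. A minimal sketch of building such a request follows; the base URL and model name are placeholders, not documented Atlas endpoints.

```python
import json

# Hypothetical endpoint; an actual Atlas deployment exposes its own host/port.
BASE_URL = "http://atlas.example.internal:8000/v1"

def chat_request(model: str, prompt: str, max_tokens: int = 256) -> str:
    """Build the JSON body for an OpenAI-compatible /chat/completions call."""
    return json.dumps({
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    })

body = chat_request("mixtral-8x7b", "Summarize the Atlas architecture.")
# POST this body to f"{BASE_URL}/chat/completions" with any HTTP client.
```

Because the wire format matches OpenAI's, existing client libraries can be pointed at the server simply by overriding their base URL.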

[System diagram: network switch and system software server connected to eight Atlas units]

Positron Atlas Hardware

[Hardware block diagram: host CPU with system memory; network and scale-up I/O feeding the transformer engine, which pairs an AI math accelerator with accelerator memory]

Positron Performance and Efficiency Advantages

| Model | Inference Server | Performance (batch = 8, tokens/sec/user) | Price | Power | Performance per Watt Advantage | Performance per Dollar Advantage |
|---|---|---|---|---|---|---|
| Llama-2 (70B) | Positron Atlas | 151.9 | $175K | 1,800W | 6.9x | 6.5x |
| Llama-2 (70B) | NVIDIA DGX-H100 | 46.8 | $309K | 3,800W | | |
| Mixtral (8x7B) | Positron Atlas | 319.4 | $175K | 1,800W | 10.3x | 8.7x |
| Mixtral (8x7B) | NVIDIA DGX-H100 | 73.4 | $309K | 3,800W | | |
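As a back-of-envelope check, assume the advantage is simply the throughput ratio scaled by the power ratio; that assumption reproduces the published Llama-2 per-watt figure (the other columns may reflect a different measurement methodology).

```python
# Figures from the comparison table above (Llama-2 70B, batch = 8).
atlas_tps, dgx_tps = 151.9, 46.8   # tokens/sec/user
atlas_w, dgx_w = 1_800, 3_800      # power draw in watts

# Performance-per-watt advantage = (throughput ratio) x (power ratio).
perf_per_watt_advantage = (atlas_tps / dgx_tps) * (dgx_w / atlas_w)
print(f"{perf_per_watt_advantage:.1f}x")  # -> 6.9x
```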

Estimate Performance and Cost

I want to host a ___ model at batch size ___ across ___ tensor-parallel threads, with sequence length ___.

Estimate:

| Model | Prefill (input tokens/sec) | Output tokens/sec per user | Aggregate output tokens/sec | Price |
|---|---|---|---|---|
| Mixtral (8x7B) | 15,031.3082 | 126.1405 | 8,072.9942 | $175,000.00 |
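The aggregate and per-user output rates in the estimate are internally consistent: dividing one by the other gives the implied number of concurrent users (batch size), which appears to be 64 here. This is an inference from the published numbers, not a stated configuration.

```python
# Figures from the estimate above (Mixtral 8x7B).
per_user_tps = 126.1405     # output tokens/sec per user
aggregate_tps = 8072.9942   # aggregate output tokens/sec

# Implied concurrency = aggregate throughput / per-user throughput.
implied_batch = aggregate_tps / per_user_tps
print(round(implied_batch))  # -> 64
```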