Software & Systems Overview

[Figure: Positron software stack. A chat interface example sits on top of an OpenAI-compatible LLM API, a load balancer and scheduler, the transformer engine, and a configurable accelerator with field updates.]

[Figure: Positron Atlas hardware. A switch connects multiple Atlas servers running the system software; each server combines host CPUs, system memory, network and scale-up IO, and transformer-engine accelerators with AI math units and local memory.]
Positron Performance and Efficiency Advantages

| Model | Inference Server | Performance (batch = 8, tokens/sec/user) | Price | Power | Performance per Watt Advantage | Performance per Dollar Advantage |
|---|---|---|---|---|---|---|
| Llama-2 (70B) | Positron Atlas | 151.9 | $175K | 1,800W | 6.9x | 6.5x |
| Llama-2 (70B) | NVIDIA DGX-H100 | 46.8 | $309K | 3,800W | | |
| Mixtral (8x7B) | Positron Atlas | 319.4 | $175K | 1,800W | 10.3x | 8.7x |
| Mixtral (8x7B) | NVIDIA DGX-H100 | 73.4 | $309K | 3,800W | | |
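The per-watt and per-dollar comparisons above can be sketched as simple ratios of the listed specs. This is a minimal illustration using only the table's numbers; the published advantage multipliers may rest on additional measurement assumptions (for example, measured wall power under load or delivered pricing), so the naive ratios computed here will not match every column exactly.

```python
# Recompute efficiency ratios from the published table numbers.
# Figures are taken directly from the comparison table above.
systems = {
    # (model, server): (tokens/sec/user at batch 8, price in USD, power in W)
    ("Llama-2 70B", "Positron Atlas"):   (151.9, 175_000, 1_800),
    ("Llama-2 70B", "NVIDIA DGX-H100"):  (46.8,  309_000, 3_800),
    ("Mixtral 8x7B", "Positron Atlas"):  (319.4, 175_000, 1_800),
    ("Mixtral 8x7B", "NVIDIA DGX-H100"): (73.4,  309_000, 3_800),
}

def efficiency(perf, price, power):
    """Return (tokens/sec per watt, tokens/sec per dollar)."""
    return perf / power, perf / price

for model in ("Llama-2 70B", "Mixtral 8x7B"):
    atlas = efficiency(*systems[(model, "Positron Atlas")])
    dgx = efficiency(*systems[(model, "NVIDIA DGX-H100")])
    print(f"{model}: {atlas[0]/dgx[0]:.1f}x per watt, "
          f"{atlas[1]/dgx[1]:.1f}x per dollar")
```

For Llama-2 (70B) the naive per-watt ratio rounds to the table's 6.9x; the remaining multipliers presumably fold in assumptions beyond the four listed specs.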
Estimate Performance and Cost

I want to host a model at batch size ___, across ___ tensor-parallel threads, with sequence length ___.

Estimate:

| Model | Prefill Input Tokens per Second | Output Tokens per Second per User | Aggregate Output Tokens per Second | Price |
|---|---|---|---|---|
| Mixtral (8x7B) | 15,031.3082 | 126.1405 | 8,072.9942 | $175,000.00 |
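The estimator's arithmetic can be sketched under simple assumptions: aggregate throughput is per-user throughput multiplied by batch size, and a rough hardware cost per token follows from amortizing the purchase price over a service life. The `batch_size` and `years` values below are hypothetical illustration inputs, not parameters taken from the estimate above, and power and operating costs are excluded.

```python
def aggregate_tokens_per_sec(per_user_tps: float, batch_size: int) -> float:
    # Each concurrent user stream produces per_user_tps output tokens/sec,
    # so batch_size concurrent streams scale throughput linearly.
    return per_user_tps * batch_size

def dollars_per_million_tokens(price_usd: float, agg_tps: float,
                               years: float) -> float:
    # Amortize the hardware price over its service life at full utilization.
    total_tokens = agg_tps * years * 365 * 24 * 3600
    return price_usd / total_tokens * 1e6

# Hypothetical: the table's per-user rate served at batch size 8,
# amortized over a 3-year service life.
agg = aggregate_tokens_per_sec(126.1405, batch_size=8)
print(f"{agg:.1f} aggregate output tokens/sec")
print(f"${dollars_per_million_tokens(175_000, agg, years=3):.3f} per 1M tokens")
```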