
Cloud

Managed Transformer Inference

  • High-performance, low-latency model inference.

Now Available

Positron Performance and Efficiency Advantages in Software V1.x

Software Release | Models Benchmarked | Relative Performance | Performance per Watt Advantage | Performance per $ Advantage | Confidence
V1.0 (August 2024) | Mixtral 8x7B | 0.65* | 2.1 | 1.5 | Measured
V1.1 (September 2024) | Mixtral 8x7B, Llama 3.1 70B | 1.1* | 3.9 | 2.6 | In development, measured
* Nvidia performance is based on vLLM 0.5.4 for Mixtral 8x7B, Llama 3.1 8B, and Llama 3.1 70B.
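
Read together, the relative-performance and per-watt columns imply a power figure. A rough back-of-the-envelope reading, assuming "Relative Performance" means throughput normalized to the Nvidia baseline named in the footnote:

# V1.0 row: implied power draw relative to the Nvidia baseline
# (assumption: "Relative Performance" = Positron throughput / Nvidia throughput)
relative_performance = 0.65        # Positron / Nvidia throughput
perf_per_watt_advantage = 2.1      # (Positron perf/W) / (Nvidia perf/W)
power_ratio = relative_performance / perf_per_watt_advantage
print(round(power_ratio, 2))       # ~0.31, i.e. roughly a third of the baseline power

The V1.1 row (1.1 and 3.9) gives about 0.28 by the same arithmetic.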

Software & Systems Overview

  • Chat interface example

  • GitHub

  • OpenAI-compatible LLM API (see the request sketch at the end of this section)

  • Load balancer and scheduler

  • Transformer engine

  • Configurable accelerator with field updates

[System diagram: network switch, system software server, and eight Atlas systems]
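
Because the LLM API follows the OpenAI chat-completions convention, it can be exercised over plain HTTP as well as through the official SDK. A minimal sketch using Python's requests library; the host matches the client example later on this page, while the "/v1/chat/completions" path and bearer-token authentication are assumptions based on the OpenAI convention:

import requests

# OpenAI-style chat completion against Positron's endpoint
# (URL path and auth scheme assumed, not confirmed by this page)
resp = requests.post(
    "https://api.positron.ai/v1/chat/completions",
    headers={"Authorization": "Bearer YOUR_API_KEY"},
    json={
        "model": "mixtral8x7b",
        "messages": [{"role": "user", "content": "Hello!"}],
    },
    timeout=60,
)
resp.raise_for_status()
print(resp.json()["choices"][0]["message"]["content"])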

Positron Atlas Hardware

[Atlas block diagram: host CPU with system memory, network and scale-up IO, and a transformer engine comprising an AI math accelerator with dedicated accelerator memory]

Every Transformer Runs on Positron

Supports all Transformer models seamlessly, with zero porting time and zero effort

Model Deployment on Positron in 4 Easy Steps

Positron maps any trained HuggingFace Transformers Library model directly onto hardware for maximum performance and ease of use

  1. Develop or procure a model using the Hugging Face Transformers library.

  2. Upload or link the trained model file (.pt or .safetensors) to the Positron Model Manager (see the export sketch after these steps).

  3. Update client applications to use Positron's OpenAI API-compliant endpoint.

  4. Issue API requests and get maximum performance.
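
For steps 1 and 2, any model built with the Hugging Face Transformers library can be exported as .safetensors weights for upload. A minimal sketch; the Mixtral checkpoint mirrors the Model Manager example below, and the output directory name is arbitrary:

from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "mistralai/Mixtral-8x7B-Instruct-v0.1"

# Load (or fine-tune) the model with the Transformers library...
model = AutoModelForCausalLM.from_pretrained(model_id)
tokenizer = AutoTokenizer.from_pretrained(model_id)

# ...then save the weights in .safetensors format for the Positron Model Manager
model.save_pretrained("mixtral-8x7b-instruct", safe_serialization=True)
tokenizer.save_pretrained("mixtral-8x7b-instruct")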

[Model Manager diagram: models arrive as .pt or .safetensors files via drag-and-drop or file-browse upload, Google Cloud Storage (GCS), or Amazon S3, or by Hugging Face model ID (e.g., "mistralai/Mixtral-8x7B-Instruct-v0.1"); a REST API fronts the Model Manager, Model Loader, and HF Model Fetcher]

from openai import OpenAI

# Standard OpenAI SDK pointed at Positron's OpenAI-compatible endpoint
# (the "/v1" path follows the OpenAI convention and may differ)
client = OpenAI(base_url="https://api.positron.ai/v1", api_key="YOUR_API_KEY")

response = client.chat.completions.create(
    model="mixtral8x7b",
    messages=[{"role": "user", "content": "Hello!"}],
)
print(response.choices[0].message.content)

OpenAI-compatible Python client
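
Because the endpoint is OpenAI-compatible, standard SDK features such as streaming should work unchanged. A minimal sketch; the base URL and model name are carried over from the example above, and server-side streaming support is assumed:

from openai import OpenAI

client = OpenAI(base_url="https://api.positron.ai/v1", api_key="YOUR_API_KEY")

# Stream tokens as they are generated (assumes the endpoint honors stream=True)
stream = client.chat.completions.create(
    model="mixtral8x7b",
    messages=[{"role": "user", "content": "Summarize the Transformer attention mechanism."}],
    stream=True,
)
for chunk in stream:
    delta = chunk.choices[0].delta.content
    if delta:
        print(delta, end="", flush=True)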
