Groq provides ultra-fast AI inference through its custom LPU™ (Language Processing Unit) architecture, which is purpose-built for inference rather than adapted from training hardware. Groq hosts open-source models from providers including OpenAI, Meta, DeepSeek, and Moonshot AI.
Website: https://groq.com/
Getting an API Key
Sign Up/Sign In: Go to Groq and create an account or sign in.
Navigate to Console: Go to the Groq Console to access your dashboard.
Create a Key: Navigate to the API Keys section and create a new API key. Give your key a descriptive name (e.g., “Cline”).
Copy the Key: Copy the API key immediately. You will not be able to see it again. Store it securely.
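Once the key is stored, you can sanity-check it outside of Cline. Below is a minimal sketch (not an official snippet) that assumes the key is exported as the GROQ_API_KEY environment variable and queries Groq's OpenAI-compatible REST endpoint to list the models the key can access.

```python
# Quick sanity check for a new Groq API key.
# Assumes the key is exported as GROQ_API_KEY; never hard-code it in source.
import os

import requests

api_key = os.environ["GROQ_API_KEY"]

resp = requests.get(
    "https://api.groq.com/openai/v1/models",
    headers={"Authorization": f"Bearer {api_key}"},
    timeout=30,
)
resp.raise_for_status()  # raises if the key is invalid or the request fails

# Print the model IDs available to this key.
print([m["id"] for m in resp.json()["data"]])
```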
Supported Models
Cline supports the following Groq models:
llama-3.3-70b-versatile (Meta) - Balanced performance with 131K context
llama-3.1-8b-instant (Meta) - Fast inference with 131K context
openai/gpt-oss-120b (OpenAI) - Featured flagship model with 131K context
openai/gpt-oss-20b (OpenAI) - Featured compact model with 131K context
moonshotai/kimi-k2-instruct (Moonshot AI) - 1 trillion parameter model with prompt caching
deepseek-r1-distill-llama-70b (DeepSeek/Meta) - Reasoning-optimized model
qwen/qwen3-32b (Alibaba Cloud) - Enhanced for Q&A tasks
meta-llama/llama-4-maverick-17b-128e-instruct (Meta) - Latest Llama 4 variant
meta-llama/llama-4-scout-17b-16e-instruct (Meta) - Latest Llama 4 variant
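As a hedged sketch of what Cline does under the hood, the request below sends a single chat completion to one of the model IDs listed above via Groq's OpenAI-compatible endpoint. The model ID and prompt are illustrative; swap in any model from the list.

```python
# Minimal chat completion against a Groq-hosted model.
# Assumes GROQ_API_KEY is set; model ID and prompt are illustrative.
import os

import requests

resp = requests.post(
    "https://api.groq.com/openai/v1/chat/completions",
    headers={"Authorization": f"Bearer {os.environ['GROQ_API_KEY']}"},
    json={
        "model": "llama-3.1-8b-instant",  # any model ID from the list above
        "messages": [{"role": "user", "content": "Say hello in one sentence."}],
    },
    timeout=60,
)
resp.raise_for_status()
print(resp.json()["choices"][0]["message"]["content"])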
Configuration in Cline
Open Cline Settings: Click the settings icon (⚙️) in the Cline panel.
Select Provider: Choose “Groq” from the “API Provider” dropdown.
Enter API Key: Paste your Groq API key into the “Groq API Key” field.
Select Model: Choose your desired model from the “Model” dropdown.
Groq’s Speed Revolution
Groq’s LPU architecture delivers several key advantages over traditional GPU-based inference:
LPU Architecture
Unlike GPUs that are adapted from training workloads, Groq’s LPU is purpose-built for inference. This eliminates architectural bottlenecks that create latency in traditional systems.
Unmatched Speed
Sub-millisecond latency that stays consistent across traffic, regions, and workloads
Static scheduling with pre-computed execution graphs eliminates runtime coordination delays
Tensor parallelism optimized for low-latency single responses rather than high-throughput batching
Quality Without Tradeoffs
TruePoint numerics reduce precision only in areas that don’t affect accuracy
100-bit intermediate accumulation ensures lossless computation
Strategic precision control maintains quality while achieving 2-4× speedup over BF16
Memory Architecture
SRAM as primary storage (not cache) with hundreds of megabytes on-chip
Eliminates DRAM/HBM latency that plagues traditional accelerators
Enables true tensor parallelism by splitting layers across multiple chips
Learn more about Groq’s technology in their LPU architecture blog post.
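If you want to see the latency characteristics described above for yourself, the following rough sketch measures time-to-first-token over Groq's streaming API. It assumes GROQ_API_KEY is set and parses the server-sent-event stream manually; the model ID is illustrative.

```python
# Rough time-to-first-token measurement over Groq's streaming API.
# Assumes GROQ_API_KEY is set; results vary with model, prompt, and network.
import json
import os
import time

import requests

start = time.perf_counter()
with requests.post(
    "https://api.groq.com/openai/v1/chat/completions",
    headers={"Authorization": f"Bearer {os.environ['GROQ_API_KEY']}"},
    json={
        "model": "llama-3.1-8b-instant",  # illustrative model ID
        "messages": [{"role": "user", "content": "Explain LPUs in one paragraph."}],
        "stream": True,  # server-sent events: one "data: {...}" line per chunk
    },
    stream=True,
    timeout=60,
) as resp:
    resp.raise_for_status()
    for line in resp.iter_lines():
        if line.startswith(b"data: ") and line != b"data: [DONE]":
            chunk = json.loads(line[len(b"data: "):])
            if chunk["choices"][0]["delta"].get("content"):
                print(f"time to first token: {time.perf_counter() - start:.3f}s")
                break
```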
Special Features
Prompt Caching
The Kimi K2 model supports prompt caching, which can significantly reduce costs and latency for repeated prompts.
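The sketch below shows the request pattern that prompt caching is designed to benefit: a long, stable prefix (for example, a system prompt plus project context) repeated verbatim across calls. It assumes Groq applies caching automatically to repeated prefixes; the model ID and prompts are illustrative.

```python
# Structuring requests so the leading messages repeat verbatim across calls,
# the pattern prompt caching is built to reward.
# Assumes GROQ_API_KEY is set and caching is applied automatically by Groq.
import os

import requests

STABLE_PREFIX = {
    "role": "system",
    # Keep this content byte-identical across requests to maximize cache reuse.
    "content": "You are a coding assistant. Project context: ...",
}

def ask(question: str) -> str:
    resp = requests.post(
        "https://api.groq.com/openai/v1/chat/completions",
        headers={"Authorization": f"Bearer {os.environ['GROQ_API_KEY']}"},
        json={
            "model": "moonshotai/kimi-k2-instruct",
            "messages": [STABLE_PREFIX, {"role": "user", "content": question}],
        },
        timeout=60,
    )
    resp.raise_for_status()
    return resp.json()["choices"][0]["message"]["content"]

print(ask("Summarize the project context."))
```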
Vision Support
Select models support image inputs and vision capabilities. Check the model details in the Groq Console for specific capabilities.
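For models that do accept images, requests use the OpenAI-compatible image_url content format. The sketch below assumes the chosen Llama 4 variant supports vision (confirm in the Groq Console first); the image URL is illustrative.

```python
# Image-input request using the OpenAI-compatible content format.
# Assumes GROQ_API_KEY is set and the chosen model supports vision.
import os

import requests

resp = requests.post(
    "https://api.groq.com/openai/v1/chat/completions",
    headers={"Authorization": f"Bearer {os.environ['GROQ_API_KEY']}"},
    json={
        "model": "meta-llama/llama-4-scout-17b-16e-instruct",  # assumed vision-capable
        "messages": [{
            "role": "user",
            "content": [
                {"type": "text", "text": "What is in this image?"},
                {"type": "image_url", "image_url": {"url": "https://example.com/photo.jpg"}},
            ],
        }],
    },
    timeout=60,
)
resp.raise_for_status()
print(resp.json()["choices"][0]["message"]["content"])
```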
Reasoning Models
Some models like DeepSeek variants offer enhanced reasoning capabilities with step-by-step thought processes.
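A hedged sketch of calling a reasoning-oriented model follows. Depending on the model and API settings, the intermediate reasoning may appear inline in the response (for example inside <think> tags); treat that as an assumption and check Groq's documentation for the current behavior. The model ID and prompt are illustrative.

```python
# Calling a reasoning-oriented model hosted on Groq.
# Assumes GROQ_API_KEY is set; the response may interleave step-by-step
# reasoning with the final answer, depending on the model's output format.
import os

import requests

resp = requests.post(
    "https://api.groq.com/openai/v1/chat/completions",
    headers={"Authorization": f"Bearer {os.environ['GROQ_API_KEY']}"},
    json={
        "model": "deepseek-r1-distill-llama-70b",
        "messages": [{"role": "user", "content": "Is 1009 prime? Think step by step."}],
    },
    timeout=120,
)
resp.raise_for_status()
print(resp.json()["choices"][0]["message"]["content"])
```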
Tips and Notes
Model Selection: Choose models based on your specific use case and performance requirements.
Speed Advantage: Groq excels at single-request latency rather than high-throughput batch processing.
OSS Model Provider: Groq hosts open-source models from multiple providers (OpenAI, Meta, DeepSeek, etc.) on their fast infrastructure.
Context Windows: Most models offer large context windows (up to 131K tokens) for including substantial code and context.
Pricing: Groq offers competitive pricing alongside its speed advantages. Check the Groq Pricing page for current rates.
Rate Limits: Groq has generous rate limits, but check their documentation for current limits based on your usage tier.
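If you do hit a rate limit, the usual remedy is to back off and retry. The sketch below handles HTTP 429 responses with simple exponential backoff; the Retry-After header and the limits themselves depend on your usage tier, so check Groq's rate-limit documentation for specifics.

```python
# Simple retry-with-backoff around a Groq chat completion for 429 responses.
# Assumes GROQ_API_KEY is set; limits and headers depend on your usage tier.
import os
import time

import requests

def chat_with_retry(payload: dict, max_retries: int = 5) -> dict:
    for attempt in range(max_retries):
        resp = requests.post(
            "https://api.groq.com/openai/v1/chat/completions",
            headers={"Authorization": f"Bearer {os.environ['GROQ_API_KEY']}"},
            json=payload,
            timeout=60,
        )
        if resp.status_code != 429:
            resp.raise_for_status()
            return resp.json()
        # Honour Retry-After if present, otherwise back off exponentially.
        wait = float(resp.headers.get("retry-after", 2 ** attempt))
        time.sleep(wait)
    raise RuntimeError("rate limited after retries")

result = chat_with_retry({
    "model": "llama-3.1-8b-instant",  # illustrative model ID
    "messages": [{"role": "user", "content": "Hello!"}],
})
print(result["choices"][0]["message"]["content"])
```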