💻⚡ Taalas bakes AI models into silicon: 17,000 tokens per second
24 February 2026. Inside this issue:
Canadian startup hardwires Llama 3.1 8B directly into a chip, no external memory
HC1 delivers 17,000 tokens/second - nearly 10x faster than today's fastest inference services
Production cost 20x lower, power consumption 10x lower than GPU inference
✍️ Essentials
Taalas, founded in Toronto by former Tenstorrent co-founders, came out of stealth on 19 February 2026. Its approach: etch model weights directly into transistors. No HBM (high-bandwidth memory), no liquid cooling, no advanced packaging.
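A rough mental model of why that matters (an analogy, not Taalas's published circuit design): a conventional accelerator pays an external-memory fetch for every weight on every forward pass, while a hardwired part fixes the weights as constants in the logic itself. A toy Python sketch:

```python
# Conceptual contrast only - an analogy, not Taalas's actual design.
def dot_streamed(weights, x):
    # GPU-style path: every weight is fetched from memory (here, a list)
    # before it can be multiplied - bandwidth-bound by construction.
    return sum(w * xi for w, xi in zip(weights, x))

def dot_hardwired(x):
    # Hardwired path: the weights are literals baked into the function,
    # the software analogue of constants etched into transistors.
    return 0.5 * x[0] - 1.25 * x[1] + 2.0 * x[2]

x = [1.0, 2.0, 3.0]
assert dot_streamed([0.5, -1.25, 2.0], x) == dot_hardwired(x)  # same result
```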
Its first product, HC1, is a hardwired Llama 3.1 8B on TSMC 6nm - 53 billion transistors. It generates 17,000 tokens/second per user; NVIDIA's H200 manages roughly 230. The trade-off: one chip runs exactly one model. LoRA fine-tuning and configurable context windows are still supported (a sketch of why, below), and the turnaround from new model to finished silicon is two months.
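Why LoRA survives having the weights etched in: LoRA never modifies the base matrix; it learns a small low-rank additive correction on the side. A minimal NumPy sketch of that structure (shapes and names illustrative - Taalas hasn't published how HC1 implements adapters):

```python
import numpy as np

d_out, d_in, rank = 4096, 4096, 8
W = np.random.randn(d_out, d_in).astype(np.float32)  # base weights: on HC1, fixed in silicon
A = 0.01 * np.random.randn(rank, d_in).astype(np.float32)  # trainable, tiny
B = np.zeros((d_out, rank), dtype=np.float32)  # zero-init: adapter starts as a no-op
alpha = 16.0  # standard LoRA scaling hyperparameter

def forward(x: np.ndarray) -> np.ndarray:
    # y = Wx + (alpha/rank) * B(Ax): the frozen path plus a low-rank delta.
    return W @ x + (alpha / rank) * (B @ (A @ x))

x = np.random.randn(d_in).astype(np.float32)
print(forward(x).shape)  # (4096,)
# Swapping a fine-tune means swapping only A and B (~65K floats here),
# not the ~16.8M-float base matrix baked into the chip.
```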
The company has 25 employees and over $200 million in total funding. A mid-sized reasoning model is expected this spring; a frontier model on HC2 silicon is planned for winter.
🐻 Bear’s take
If inference costs dominate your AI budget, a chip that is 10x faster and 20x cheaper matters - even one locked to a single model. For any business running a stable model at scale, this could radically cut costs. The two-month model-to-silicon turnaround is the real unlock.
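To put the headline figures side by side, a back-of-envelope pass using only the numbers quoted in this issue:

```python
# Back-of-envelope arithmetic from the figures above - no outside data.
hc1_tps = 17_000   # tokens/second per user (Taalas claim for HC1)
h200_tps = 230     # tokens/second per user (NVIDIA H200, per this issue)
print(f"Per-user speedup vs H200: {hc1_tps / h200_tps:.0f}x")  # ~74x

cost_factor, power_factor = 20, 10  # "20x lower cost, 10x lower power"
print(f"Production cost cut: {100 * (1 - 1 / cost_factor):.0f}%")   # 95%
print(f"Power draw cut:      {100 * (1 - 1 / power_factor):.0f}%")  # 90%
```

Note the two baselines differ: the headline "nearly 10x" is against the fastest current inference services, while the ~74x figure is against a single H200.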
🚨 Bear in mind
GPU cloud providers face margin pressure. HC1 runs a mid-2024 model - this is a demonstrator, not a replacement yet. If you operate high-volume inference, request early access and benchmark it on your own workloads.


