💻⚡ Taalas bakes AI models into silicon: 17,000 tokens per second
24 February 2026. Inside this issue:
Canadian startup hardwires Llama 3.1 8B directly into a chip, no external memory
HC1 delivers 17,000 tokens/second - nearly 10x faster than today's fastest inference services
Production cost 20x lower, power consumption 10x lower than GPU inference
✍️ Essentials
Taalas, founded in Toronto by former Tenstorrent co-founders, came out of stealth on 19 February 2026. Its approach: etch model weights directly into transistors. No HBM (high-bandwidth memory), no liquid cooling, no advanced packaging.
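A rough mental model of why that matters (an analogy, not Taalas's published circuit design): a conventional accelerator pays an external-memory fetch for every weight on every forward pass, while a hardwired part fixes the weights as constants in the logic itself. A toy Python sketch:

```python
# Conceptual contrast only - an analogy, not Taalas's actual design.
def dot_streamed(weights, x):
    # GPU-style path: every weight is fetched from memory (here, a list)
    # before it can be multiplied - bandwidth-bound by construction.
    return sum(w * xi for w, xi in zip(weights, x))

def dot_hardwired(x):
    # Hardwired path: the weights are literals baked into the function,
    # the software analogue of constants etched into transistors.
    return 0.5 * x[0] - 1.25 * x[1] + 2.0 * x[2]

x = [1.0, 2.0, 3.0]
assert dot_streamed([0.5, -1.25, 2.0], x) == dot_hardwired(x)  # same result
```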
Its first product, HC1, is a hardwired Llama 3.1 8B on TSMC 6nm - 53 billion transistors. It generates 17,000 tokens/second per user; NVIDIA's H200 manages roughly 230. The trade-off: one chip runs exactly one model. LoRA fine-tuning and configurable context windows are still supported (a sketch of why, below), and the turnaround from new model to finished silicon is two months.
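Why LoRA survives having the weights etched in: LoRA never modifies the base matrix; it learns a small low-rank additive correction on the side. A minimal NumPy sketch of that structure (shapes and names illustrative - Taalas hasn't published how HC1 implements adapters):

```python
import numpy as np

d_out, d_in, rank = 4096, 4096, 8
W = np.random.randn(d_out, d_in).astype(np.float32)  # base weights: on HC1, fixed in silicon
A = 0.01 * np.random.randn(rank, d_in).astype(np.float32)  # trainable, tiny
B = np.zeros((d_out, rank), dtype=np.float32)  # zero-init: adapter starts as a no-op
alpha = 16.0  # standard LoRA scaling hyperparameter

def forward(x: np.ndarray) -> np.ndarray:
    # y = Wx + (alpha/rank) * B(Ax): the frozen path plus a low-rank delta.
    return W @ x + (alpha / rank) * (B @ (A @ x))

x = np.random.randn(d_in).astype(np.float32)
print(forward(x).shape)  # (4096,)
# Swapping a fine-tune means swapping only A and B (~65K floats here),
# not the ~16.8M-float base matrix baked into the chip.
```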
The company has 25 employees and over $200 million in total funding. A mid-sized reasoning model is expected this spring; a frontier model on HC2 silicon is planned for winter.
🐻 Bear’s take
If inference costs dominate your AI budget, a chip that is 10x faster and 20x cheaper matters - even one locked to a single model. For any business running a stable model at scale, this could radically cut costs. The two-month model-to-silicon turnaround is the real unlock.
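To put the headline figures side by side, a back-of-envelope pass using only the numbers quoted in this issue:

```python
# Back-of-envelope arithmetic from the figures above - no outside data.
hc1_tps = 17_000   # tokens/second per user (Taalas claim for HC1)
h200_tps = 230     # tokens/second per user (NVIDIA H200, per this issue)
print(f"Per-user speedup vs H200: {hc1_tps / h200_tps:.0f}x")  # ~74x

cost_factor, power_factor = 20, 10  # "20x lower cost, 10x lower power"
print(f"Production cost cut: {100 * (1 - 1 / cost_factor):.0f}%")   # 95%
print(f"Power draw cut:      {100 * (1 - 1 / power_factor):.0f}%")  # 90%
```

Note the two baselines differ: the headline "nearly 10x" is against the fastest current inference services, while the ~74x figure is against a single H200.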
🚨 Bear in mind
GPU cloud providers face margin pressure. HC1 runs a mid-2024 model - this is a demonstrator, not a replacement yet. If you operate high-volume inference, request early access and benchmark it on your own workloads.


