The explosion of machine learning and its many applications has motivated a variety of new
domain-specific architectures to accelerate these deep learning workloads. The Groq Tensor Streaming Processor (TSP) is built on a deterministic instruction set architecture (ISA) with a single large core. The ISA exposes temporal information: the number of cycles an instruction requires to produce its output stream, known as its functional latency. Determinism allows the compiler to reason about program correctness and to track the exact spatial and temporal position of every tensor on the chip. Events in a deterministic system cannot be permuted by the underlying hardware; that is, the total program order is the interleaving of the individual instruction queues of the functional units. This total ordering is entirely software controlled: the underlying hardware cannot reorder events, and every event must complete in a fixed amount of time. This has implications for hardware design: it eliminates hardware interlocks for coordinating between functional units, as well as any "reactive" components in the data path such as arbiters, caching agents, and replay or retransmission mechanisms. It also has several consequences for system design: zero-variance latency; low latency and high throughput at batch size 1; reduced total cost of ownership (TCO) for data centers with diverse service level agreements (SLAs); and the ability to scale to large training and inference systems without traversing network switches. In this talk we discuss the TSP and the design implications of its architecture.
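The idea that known functional latencies let the compiler track exactly when every result is available can be illustrated with a minimal sketch. The instruction names, latency values, and toy program below are illustrative assumptions, not the Groq ISA; the point is only that with fixed, exposed latencies, scheduling reduces to arithmetic at compile time, with no run-time interlocks or arbitration:

```python
# Minimal sketch of deterministic compile-time scheduling, assuming
# hypothetical instructions and functional latencies (not the real Groq ISA).
# Because each instruction's latency is fixed and known, the compiler can
# compute the exact cycle at which every output stream becomes available.

FUNCTIONAL_LATENCY = {  # cycles from issue to output stream (assumed values)
    "READ": 5,
    "MATMUL": 20,
    "ADD": 2,
    "WRITE": 4,
}

def schedule(program):
    """Assign an issue cycle to each instruction so every operand stream
    has been produced before it is consumed.

    program: list of (opcode, dest_stream, [source_streams]).
    Returns (issue_at, ready): issue cycle per instruction index, and
    completion cycle per stream name.
    """
    ready = {}      # stream name -> cycle its producer finishes
    issue_at = {}   # instruction index -> issue cycle
    for i, (op, dst, srcs) in enumerate(program):
        # Earliest legal issue: all source streams must already be produced.
        start = max((ready[s] for s in srcs), default=0)
        issue_at[i] = start
        ready[dst] = start + FUNCTIONAL_LATENCY[op]
    return issue_at, ready

# A toy program: load two tensors, multiply, add a bias, store the result.
program = [
    ("READ",   "a",   []),
    ("READ",   "b",   []),
    ("MATMUL", "c",   ["a", "b"]),
    ("ADD",    "d",   ["c", "b"]),
    ("WRITE",  "out", ["d"]),
]

issue, ready = schedule(program)
for i, (op, dst, srcs) in enumerate(program):
    print(f"cycle {issue[i]:3d}: {op:<6} -> {dst} (ready at cycle {ready[dst]})")
```

The schedule is the total program order: nothing in it depends on run-time events, so the hardware never needs to reorder, stall, or retry.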