Taiwanese firm announces 384GB PCIe AI accelerator running 700B LLMs locally at 240W.
Fitting 384GB of memory into a single 240W PCIe envelope drastically alters the hardware requirements for local LLM inference. By eliminating the need for multi-GPU clusters to handle 700B parameter models, this architecture significantly lowers both the thermal and financial barriers for enterprise AI deployments. If the memory bandwidth can keep up with the capacity, this could commoditize on-premise inference.
A Taiwanese hardware manufacturer has unveiled a new PCIe AI accelerator card designed to run massive Large Language Models (LLMs) locally. The standout specifications include 384GB of onboard memory and a remarkably low power draw of just 240W—less than half the power consumption of Nvidia's RTX PRO 6000 Blackwell workstation GPU.
Technical Details

To run a 700B parameter model locally, memory capacity is the primary bottleneck. At 4-bit quantization, a 700B model demands roughly 350-400GB of memory. Traditionally, achieving this memory pool requires networking 4 to 8 high-end GPUs, which introduces immense power requirements (kilowatts), complex interconnect overhead, and massive thermal output. Packing 384GB of memory onto a single 240W PCIe card suggests the use of high-density LPDDR or an innovative unified memory architecture rather than traditional power-hungry HBM or GDDR6, trading raw bandwidth for massive capacity and power efficiency.
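To make the capacity math concrete, here is a minimal back-of-envelope sketch in Python. The parameter count, quantization width, and the ~10% runtime overhead factor are assumptions for illustration, not figures published by the manufacturer.

```python
# Back-of-envelope memory estimate for a quantized LLM.
# All inputs are illustrative assumptions, not vendor specifications.

def model_memory_gb(params_billions: float, bits_per_weight: int, overhead: float = 1.1) -> float:
    """Approximate memory footprint in GB: weights plus a rough ~10% allowance
    for KV cache, activations, and runtime buffers (assumed, not measured)."""
    bytes_per_weight = bits_per_weight / 8
    return params_billions * 1e9 * bytes_per_weight * overhead / 1e9

weights_only = model_memory_gb(700, 4, overhead=1.0)  # ~350 GB for the weights alone
with_runtime = model_memory_gb(700, 4)                # ~385 GB including assumed overhead

print(f"weights only: {weights_only:.0f} GB")
print(f"with ~10% runtime overhead: {with_runtime:.0f} GB")
```

The output lands in the 350-400GB range cited above, which is why a single 384GB card sits right at the edge of what a 4-bit 700B model requires.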
Why It Matters

From an engineering standpoint, this is a disruptive shift for enterprise and edge AI. The ability to deploy frontier-class LLMs on a standard workstation chassis without specialized cooling or dedicated server room power delivery lowers the barrier to entry for on-premise AI. It shifts the paradigm from "compute-constrained" to "capacity-enabled," allowing organizations with strict data privacy requirements to run massive models entirely offline at a fraction of the hardware cost.
What to Watch Next

The critical missing metrics here are memory bandwidth and the resulting tokens-per-second (TPS) generation speed. While the 384GB capacity allows the model to fit in memory, a 240W power envelope heavily restricts the memory bus speed. Engineers should watch for independent benchmarks detailing the actual inference latency and throughput. If the TPS is viable for production workloads, this class of accelerator could seriously threaten the lower end of the multi-GPU workstation market.
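As a rough way to reason about why bandwidth is the open question, the sketch below estimates the single-stream decode ceiling under the common memory-bound assumption that each generated token streams the full set of quantized weights once (dense model, batch size 1, no speculative decoding). The bandwidth figures are hypothetical; no bandwidth specification has been published for this card.

```python
# Rough decode-throughput ceiling for memory-bandwidth-bound inference.
# Assumes a dense model where generating one token reads all weights once;
# bandwidth values below are assumed for illustration, not published specs.

def max_tokens_per_second(bandwidth_gb_s: float, model_size_gb: float) -> float:
    """Upper bound on single-stream decode speed: tokens/s <= bandwidth / bytes read per token."""
    return bandwidth_gb_s / model_size_gb

model_gb = 350  # ~700B parameters at 4-bit quantization

for bw in (200, 400, 800):  # plausible LPDDR-class bandwidths in GB/s (assumed)
    print(f"{bw} GB/s -> at most {max_tokens_per_second(bw, model_gb):.1f} tok/s")
# 200 GB/s -> ~0.6 tok/s, 400 GB/s -> ~1.1 tok/s, 800 GB/s -> ~2.3 tok/s
```

Under these assumed bandwidths, single-stream throughput would sit in the low single digits of tokens per second, which is exactly why independent benchmarks are the number to watch before judging whether this card displaces multi-GPU workstations.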