Nvidia today announced the release of TensorRT 8, the latest version of its software development kit (SDK) designed for AI and machine learning inference. Built for deploying AI models that can power search engines, ad recommendations, chatbots, and more, Nvidia claims that TensorRT 8 cuts inference time in half for language queries compared with the previous release of TensorRT.
Models are growing increasingly complex, and demand is on the rise for real-time deep learning applications. According to a recent O’Reilly survey, 86.7% of organizations are now considering, evaluating, or putting into production AI products. And Deloitte reports that 53% of enterprises adopting AI spent more than $20 million in 2019 and 2020 on technology and talent.
TensorRT essentially dials a model’s mathematical coordinates to a balance of the smallest model size with the highest accuracy for the system it’ll run on. Nvidia claims that TensorRT-based apps perform up to 40 times faster than CPU-only platforms during inference, and that TensorRT 8-specific optimizations allow BERT-Large — one of the most popular Transformer-based models — to run in 1.2 milliseconds.