Accelerate AI applications using VITIS AI on Xilinx ZynqMP UltraScale+ FPGA

VITIS is a unified software platform for developing SW (BSP, OS, Drivers, Frameworks, and Applications) and HW (RTL, HLS, Ips, etc.) using Vivado and other components for Xilinx FPGA SoC platforms like ZynqMP UltraScale+ and Alveo cards. The key component of VITIS SDK, the VITIS AI runtime (VART), provides a unified interface for the deployment of end ML/AI applications on Edge and Cloud.

Vitis™ AI components:

  • Optimized IP cores
  • Tools
  • Libraries
  • Models
  • Example Reference Designs

Inference in machine learning is computation-intensive and requires high memory bandwidth and high performance compute to meet the low-latency and high-throughput requirements of various end applications.

Vitis AI Workflow

Xilinx Vitis AI provides an innovative workflow to deploy deep learning inference applications on Xilinx Deep Learning Processing Unit (DPU) using a simple process:

Source: Xilinx

  • The Deep Processing Unit (DPU) is a configurable computation engine optimized for convolution neural networks for deep learning inference applications and placed in programmable logic (PL). DPU contains efficient and scalable IP cores that can be customized to meet many different applications’ needs. The DPU defines its own instruction set, and the Vitis AI compiler generates instructions.
  • VITIS AI compiler schedules the instructions in an optimized manner to get the maximum performance possible.
  • Typical workflow to run any AI Application on Xilinx ZynqMP UltraScale+ SoC platform comprises:
  1. Model Quantization
  2. Model Compilation
  3. Model Optimization (Optional)
  4. Build DPU executable
  5. Build software application
  6. Integrate VITIS AI Unified APIs
  7. Compile and link the hybrid DPU application
  8. Deploy the hybrid DPU executable on FPGA
AI Quantizer

AI Quantizer is a compression tool for the quantization process by converting 32-bit floating-point weights and activations to fixed point INT8. It can reduce the computing complexity without losing accurate information for the model. The fixed point model needs less memory, thus providing faster execution and higher power efficiency than floating-point implementation.

AI Quantizer

AI Compiler

AI compiler maps a network model to a highly efficient instruction set and data flow. Input to the compiler is Quantized 8-bit neural network, and output is DPU kernel, the executable which will run on the DPU. Here, the unsupported layers need to be deployed in the CPU OR model can be customized to replace and remove those unsupported operations. It also performs sophisticated optimizations such as layer fusion, instruction scheduling and reuses on-chip memory as much as possible.

Once we get Executable for the DPU, one needs to use Vitis AI unified APIs to Initialize the data structure, initialize the DPU, implement the layers not supported by the DPU on CPU & Add the pre-processing and post-processing on a need basis on PL/PS.

AI Compiler

AI Optimiser

With its world-leading model compression technology, AI Optimizer can reduce model complexity by 5x to 50x with minimal impact on accuracy. This deep compression takes inference performance to the next level.

We can achieve desired sparsity and reduce runtime by 2.5x.

AI Optimizer

AI Profiler

AI Profiler can help profiling inference to find caveats causing a bottleneck in the end-to-end pipeline.

Profiler gives a designer a common timeline for DPU/CPU/Memory. This process doesn’t change any code and can also trace the functions and do profiling.

AI Profiler

AI Runtime

VITIS AI runtime (VART) enables applications to use unified high-level runtime APIs for both edge and cloud deployments, making it seamless and efficient. Some of the key features are:

  • Asynchronous job submission
  • Asynchronous job collection
  • C++ and Python implementations
  • Multi-threading and multi-process execution

Vitis AI also offers DSight, DExplorer, DDump, & DLet, etc., for various task execution.

DSight & DExplorer
DPU IP offers a number of configurations to specific cores to choose as per the network model. DSight tells us the percentage utilization of each DPU core. It also gives the efficiency of the scheduler so that we could tune user threads. One can also see performance numbers like MOPS, Runtime, memory bandwidth for each layer & each DPU node.

Softnautics have a wide range of expertise on various edge and cloud platforms, including vision and image processing on VLIW SIMD vector processor, FPGA, Linux kernel driver development, platform and power management multimedia development. We provide end-to-end ML/AI solutions from dataset preparation to application deployment on edge and cloud and including maintenance.

We chose the Xilinx ZynqMP UltraScale+ platform for high-performance to compute deployments. It provides the best application processing, highly configurable FPGA acceleration capabilities, and  VITIS SDK to accelerate high-performance ML/AI inferencing. One such application we targeted was face-mask detection for Covid-19 screening. The intention was to deploy multi-stream inferencing for Covid-19 screening of people wearing masks and identify non-compliance in real time, as mandated by various governments for Covid-19 precautions guidelines.

We prepared a dataset and selected pre-trained weights to design a model for mask detection and screening. We trained and pruned our custom models via the TensorFlow framework. It was a two-stage deployment of face detection followed by mask detection. The trained model thus obtained was passed through VITIS AI workflow covered in earlier sections. We observed 10x speed in inference time as compared to CPU. Xilinx provides different debugging tools and utilities that are very helpful during initial development and deployments. During our initial deployment stage, we were not getting detections for mask and Non-mask categories. We tried to match PC-based inference output with the output from one of the debug utilities called Dexplorer with debug mode & root-caused the issue to debug this further. Upon running the quantizer, we could tune the output with greater calibration images and iterations and get detections with approximation. 96% accuracy on the video feed. We also tried to identify the bottleneck in the pipeline using AI profiler and then taking corrective actions to remove the bottleneck by various means, like using HLS acceleration to compute bottleneck in post-processing.

Face Detection via AI

Read our success stories related to Machine Learning expertise to know more about our services for accelerated AI solutions.

Contact us at for any queries related to your solution or for consultancy.


Author: Vaibhav Kothari

Vaibhav is Associate Principal Engineer at Softnautics and has been key member in various Embedded Software projects across different domains for over nine years and last 3-4 years for machine learning and deep learning field. He has worked on audio/video/wireless domain solutions for Audience(Now Knowles), Xilinx, Lattice Semiconductor, Microsemi, etc, chipsets. He is passionate for enabling practical & real-world AI solutions on Embedded Edge platforms and various FPGA devices. He enjoys watching Netflix as well as playing keyboard and guitar in his free time. You can also find him on Twitter

Scroll to Top