Business | June 10, 2011

High-level synthesis walks the talk

Staff Editor

The shift to a higher level of abstraction is becoming mandatory to address today’s ASIC and SoC design challenges. Just as design teams transitioned from gates to RTL in the mid-90s, new thresholds in design complexity are calling for the move from RTL to C++ and SystemC-based modeling, verification, and synthesis.

Consequently, during the past couple of years, high-level synthesis (HLS) has become much more prevalent in design flows, widened its applicability, and entered the mainstream of hardware design. However, designers need the know-how to put it into practice in the best possible way. In this paper, we will show how this is done by describing how a complete graphics processing pipeline was implemented using an HLS methodology. We will demonstrate how today’s mature HLS technologies handle the complex mix of control logic, datapaths, interfaces, and hierarchy. We will share the best coding style and suitable abstractions for each of these parts of the design, compare the modeling requirements for the various portions of the system, and provide guidelines for choosing abstraction levels. First, we’ll review the primary objectives of designing at higher levels of abstraction.

Guiding principles

The various abstraction levels serve different design needs; for this reason they complement each other to great advantage in a “full-chip” HLS flow. But how does one choose the proper modeling style and the most efficient abstraction level for a specific design task? The answer lies in the reason HLS flows are being adopted in the first place.

The goal of HLS is to increase design and verification productivity. This primary objective must be kept in mind when making modeling decisions at higher levels of abstraction. To help with design productivity, models must be kept as abstract as possible. This makes them simpler to write (fewer lines of code, fewer chances of errors), easier to debug (fewer details to worry about), and faster to simulate (less simulation overhead). To help with verification productivity, enough detail must be kept where it matters so that design behavior is predictable and consistent throughout the flow. As a result, the RTL will be guaranteed to match the high-level specification, greatly reducing the burden on the RTL verification team.

The principles of simplicity and sufficient detail depend on two essential parameters that can be abstracted when moving up to a higher level: timing and structure. When determining how much timing and structural information to code in the source, one should keep these two productivity principles in mind and answer two basic questions:

- Is the functionality time-dependent or not, and if so, to what extent?
- Do I want to lock down hierarchy and parallelism, or do I want to be able to explore different solutions?

In the following sections, we will show how to answer these questions for the various parts of a complete imaging pipeline and how to write the code most efficiently.

An image signal processor

With the emergence of smart phones and broadband wireless networks, cameras have quickly evolved from niche features to mandatory functionality for handheld devices. Tightly coupled to the CMOS image sensor, the image signal processor (ISP) defines the image quality of the handheld camera subsystem. In this very dynamic market, differentiation is achieved through proprietary algorithms for defect correction and image improvement.

Our reference design implements canonical ISP functions, such as pixel defect correction, white balancing, color filter array (CFA) interpolation, and resizing, as well as various lens artifact correction functions, such as those for pincushion and barrel distortion. Our design also provides a standard AMBA AHB interface to transfer the image from the ISP to the rest of the system (Fig. 1).

In the rest of this article, we will focus on two particular blocks: the image resizer and the AHB bus. These two blocks exhibit the different properties and requirements of algorithmic units and control-logic blocks. As such, they are representative and pedagogical examples.

The image resizer

The resizer block takes an input image and resizes it to a new height and width. The algorithm performs a 4x4 bicubic interpolation; it estimates the color of a pixel in the resized image based on 16 pixels surrounding the closest corresponding pixel in the source image. Line buffers are used to cache the incoming image data and provide the appropriate 16 pixels in parallel to the bicubic kernel (Fig. 2). This allows the resizer to sustain a throughput of 1 pixel per clock on the output. The inputs and outputs of this block are in the form of point-to-point (P2P) pixel streams.
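
As a rough illustration of the 4x4 bicubic kernel described above, the plain C++ sketch below computes one output pixel from a 16-pixel window. It is not taken from the reference design: the Catmull-Rom coefficients are an assumption (the article does not state which bicubic weights are used), floating point stands in for whatever fixed-point format the hardware would use, and the names `cubic_weights`, `Window4x4`, and `bicubic_4x4` are hypothetical.

```cpp
#include <array>
#include <cassert>

// Cubic convolution weights for a fractional offset t in [0, 1).
// ASSUMPTION: Catmull-Rom coefficients (a = -0.5); the article does not say
// which bicubic coefficients the reference design uses.
std::array<double, 4> cubic_weights(double t) {
    assert(t >= 0.0 && t < 1.0);
    const double t2 = t * t;
    const double t3 = t2 * t;
    return {{
        -0.5 * t3 + 1.0 * t2 - 0.5 * t,   // tap at offset -1
         1.5 * t3 - 2.5 * t2 + 1.0,       // tap at offset  0
        -1.5 * t3 + 2.0 * t2 + 0.5 * t,   // tap at offset +1
         0.5 * t3 - 0.5 * t2              // tap at offset +2
    }};
}

// The 16 source pixels surrounding the sample point (hypothetical type name).
using Window4x4 = std::array<std::array<double, 4>, 4>;

// Separable 4x4 interpolation: filter each row horizontally with the fx
// weights, then combine the four row results vertically with the fy weights.
double bicubic_4x4(const Window4x4& win, double fx, double fy) {
    const std::array<double, 4> wx = cubic_weights(fx);
    const std::array<double, 4> wy = cubic_weights(fy);
    double acc = 0.0;
    for (int r = 0; r < 4; ++r) {
        double row = 0.0;
        for (int c = 0; c < 4; ++c) row += wx[c] * win[r][c];
        acc += wy[r] * row;
    }
    return acc;
}
```

With `fx = fy = 0` the weights collapse to a single tap and the function returns `win[1][1]`; with these assumed weights, a window holding a linear ramp is interpolated linearly.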

Structure of the high-level model

In RTL, a similar block would be decomposed into several sub-blocks and many processes, corresponding to the line buffers and scaling function. The same structural decomposition using dedicated modules and processes is possible in a language like SystemC. With the SC_MODULE macro, SystemC lets designers model hierarchy explicitly in a form that is easy to understand, looks like Verilog, and is familiar to hardware designers.

However, the decomposition of the design into sub-modules becomes counterproductive when taken too far because, by hard-coding structure and parallelism in the source, the potential to explore different implementation alternatives is severely restricted. Moreover, adding superfluous processes and threads in a SystemC model significantly slows down simulation due to increased context switching. All of this greatly diminishes some of the major benefits of an HLS flow.

Instead of hard-coding structure and parallelism, modern HLS tools allow users to create arbitrary hierarchical design boundaries from an abstract model. The HLS tool leverages user constraints to partition loops or functions into separate concurrent blocks. Because they are not hard-coded in the source, the boundaries for hierarchy are much more flexible than if expressed with SystemC modules. This flexibility is an advantage when optimizing for performance and area in order to improve quality of results (QoR).

Following our guiding principles, we kept our model as abstract as possible, avoiding any unnecessary details. The entire resizer is modeled as a single SC_MODULE with only one SC_THREAD implementing both the line buffering and the bicubic interpolation.
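
To give a feel for what such a single-process model might look like, here is a minimal sketch in plain C++ rather than SystemC, so that it compiles without the SystemC library. One loop performs both the line buffering and the interpolation, with no structure or parallelism coded in the source. Everything in it is an assumption made for illustration: the function name `resizer_model`, the use of `std::vector` in place of the P2P pixel streams, and the fixed half-pixel phase with Catmull-Rom weights. The coordinate stepping that actually changes the image size is left out, so the sketch only shows how three line buffers and a sliding window supply 16 pixels for each output.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <vector>

// Untimed, single-process sketch (hypothetical, not the article's source code).
// Input: row-major pixels of an image that is `width` pixels wide.
// Output: one interpolated value per fully populated 4x4 window.
std::vector<int> resizer_model(const std::vector<int>& in, std::size_t width) {
    assert(width >= 4 && in.size() % width == 0);

    // ASSUMPTION: Catmull-Rom weights at phase 0.5, scaled by 16.
    static const int w[4] = {-1, 9, 9, -1};

    std::vector<std::array<int, 3>> line_buf(width);  // three previous rows, per column
    int win[4][4] = {};                               // 4x4 sliding window
    std::vector<int> out;

    for (std::size_t i = 0; i < in.size(); ++i) {
        const std::size_t x = i % width;
        const std::size_t y = i / width;
        const int pix = in[i];

        // Line buffering: fetch the three pixels above the incoming one,
        // then age the buffers so the new pixel becomes the most recent row.
        const int col[4] = {line_buf[x][0], line_buf[x][1], line_buf[x][2], pix};
        line_buf[x] = {col[1], col[2], col[3]};

        // Shift the window one column to the left and append the new column.
        for (int r = 0; r < 4; ++r) {
            for (int c = 0; c < 3; ++c) win[r][c] = win[r][c + 1];
            win[r][3] = col[r];
        }

        // Interpolation: once four rows and four columns are available,
        // all 16 window pixels contribute to one output pixel.
        if (y >= 3 && x >= 3) {
            int acc = 0;
            for (int r = 0; r < 4; ++r)
                for (int c = 0; c < 4; ++c) acc += w[r] * w[c] * win[r][c];
            out.push_back(acc / 256);  // the 2-D weights sum to 16 * 16
        }
    }
    return out;
}
```

In this style, any split into a line-buffer block and an interpolation block would be a constraint given to the HLS tool rather than something fixed in the source, which is the flexibility argued for above.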

The full whitepaper can be found on the Mentor Graphics website.

-----

Author: Thomas Bollaert, Mentor Graphics

Note: All images © Mentor Graphics