ACM TECS CAPA’08 Special Issue
Configuring Algorithms, Processes and Architecture
Guest Editor: London, UK
Associate Guest Editors:
University of Bretagne Occidentale, France
Queensland University, Australia
Issue I: Configuring Algorithms and Processes
Part I: Configuring Algorithms
Reconfiguration of IIR Filters in Response to Computer Resource Availability
Georgia Tech., USA
This paper explores methods to reconfigure Infinite Impulse Response (IIR) filters in processes that utilize computer resource management. A high performance mode uses a full-length IIR filter, while low resources or a desire to conserve power triggers a low performance mode where a smaller length filter is used. Two alternatives for algorithms are explored: switching between filter banks of varying lengths, and anytime filters where the later stages can be skipped if computational resources are scarce. Two main challenges are addressed in this paper. The first is how to configure an IIR filter as an anytime algorithm, and the second is how to initialize the states of the filters in order to mitigate transients induced by reconfiguration. The performance and feasibility of several common transient management schemes are investigated for use in this application. It is found that an anytime IIR filter outperforms the filter banks in almost every case studied due to its inherently smaller transients and the fact that the reconfiguration happens at the speed of resource management and not at a lower speed typically associated with reconfiguration due to physical system properties.
A Truly Two Dimensional Systolic Array FPGA Implementation of QR Decomposition
*Airvana, **Northeastern University, USA
We have implemented a two dimensional systolic array QR decomposition on a Xilinx Virtex-5 FPGA using the Givens rotation algorithm. QR decomposition is a key step in many DSP applications including sonar beamforming, channel equalization, and 3G wireless communication. Compared to previous work that implements Givens rotations using a one dimensional systolic array, our implementation uses a truly two dimensional systolic array architecture. As a result, latency scales well for larger matrices. In addition, prior work avoids divide and square root operations in the Givens rotation algorithm by using special operations such as CORDIC or special number systems such as the logarithmic number system (LNS). In contrast, our design uses straightforward floating-point divide and square root implementations which makes it easier to be used within a larger system. In our design, the input matrix size can be configured at compile-time to many different sizes, making it easily scalable to future large FPGAs or over multiple FPGAs. The QR module is fully pipelined with a throughput of over 130MHz for the IEEE single-precision floating-point format. The peak performance for a 12 × 12 input matrix is approximately 35 GFLOPs
Application Development with the FlexWAFE Real-time Stream Processing Architecture for FPGAs
Technical University of Braunschweig, Germany
The challenges posed by complex real-time digital image processing at high resolutions cannot be met by current state-of-the-art general purpose or DSP processors, due to the lack of processing power. On the other hand, large arrays of FPGA-based accelerators are too inefficient to cover the needs of cost sensitive professional markets. We present a new architecture composed of a network of configurable flexible weakly-programmable processing elements, Flexible Weakly-programmable Advanced Film Engine (FlexWAFE). This architecture delivers both programmability and high efficiency when implemented on an FPGA basis. We demonstrate these claims using a professional next generation noise reducer with more than 170G image operations/s at 80% FPGA area utilization on four Virtex II-Pro FPGAs. This paper will focus on the FlexWAFE architecture principle and implementation on a PCI-Express board.
Part II: Configuring Processes
An Approximation Algorithm for Scheduling on Heterogeneous Reconfigurable Resources
*UCLA, **EPFL, ***UC Davis, USA
Dynamic reconfiguration imposes significant penalties in terms of performance and energy. Scheduling the execution of tasks on a dynamically reconfigurable device is therefore of critical importance. Likewise, other application domains have cost models that are effectively the same as dynamic reconfiguration; examples include: data transmission across multiprocessor systems; dynamic code updating and reprogramming of motes in sensor networks; and module allocation, wherein the sharing of resources effectively eliminates inherent reconfiguration costs. This paper contributes a fully polynomial time approximation algorithm for the problem of scheduling independent tasks onto a fixed number of heterogeneous reconfigurable resources, where each task has a different hardware and software latency on each device; the reconfiguration latencies can also vary between resources. The notion of solution dominance is used to consider all, or approximately all, assignments of tasks to resources. This approach can also extend to task graph scheduling in which one task may be dependent on several others. A general-purpose processor and an FPGA were used to experimentally validate the proposed technique using a pair of encryption algorithms. The latencies of the schedules obtained by the approximation scheme were at most 1.1x longer than the optimal solution, which was found using integer linear programming; this result is better than the theoretical worst-case guarantee of the approximation algorithm, which wa 1.999x. The length of the schedules obtained using list scheduling, a well-known polynomialtime heuristic, were at most 2.6x longer than optimal, demonstrating that the proposed approximation algorithm is useful both in theory and practice.
Slot-less Module-based Reconfiguration of Embedded FPGAs
Virginia Tech., USA
The dificult aspect of hardware reconfiguration is not creating the computational blocks, which are generally available from FPGA vendors and third parties, but linking the blocks in a manner that suits each application's unique connectivity, bandwidth and latency requirements. Our approach uses the standard Xilinx implementation tools to generate dynamic module partial bitstreams, but choosing the module's coordinates and completing connections to other modules are run-time operations. Scripts automatically add interface wrappers to dynamic modules and generate a library of relocatable partial bitstreams. The library is used by an eficient run-time system that completes application requests for instancing and connecting modules, efectively insulating the designer from FPGA reconfiguration complexities. In this way, a large sandbox may be allocated to dynamic modules rather than fixed module slots and interconnect. Application engineers interact with the Wires on Demand (WoD) system through a run-time software API, and do not have to master hardware description languages and implementation tools.
A Packet-Switched Network Architecture for Reconfigurable Computing
Brigham Young University, USA
A packet-switched network architecture named Qnet and programming interface is presented that simplifies the integration of reconfigurable computing modules within a Field-Programmable Gate Array (FPGA). Qnet provides an abstraction layer to the designer of FPGA accelerator modules that hides the complexities of the system, while supporting a high degree of parallelism and performance. The architecture facilitates system design with pluggable, reusable modules. A network protocol is described that supports a 3-party communication scheme between an initiator, a sender and a receiver. This protocol allows a master device to manage data flow and other devices within the system.
ReconOS: Multithreaded Programming for Reconfigurable Computers
University of Paderborn, Germany
Rising logic densities together with the inclusion of dedicated processor cores push reconfigurable devices from being applied for glue logic and prototyping towards implementing complete reconfigurable systems-on-chip. The mix of fast CPU cores and fine-grained reconfigurable logic allows to map both sequential, control-dominated code and highly parallel data-centric computations onto one platform. However, traditional design techniques which view specialized hardware circuits as passive coprocessors are ill-suited for programming these reconfigurable computers. In particular, the programming models for software — running on an embedded operating system — and digital hardware — synthesized to an FPGA — lack commonalities which hinders design space exploration and severely impairs the potential for code re-use. In this paper, we present ReconOS, an execution environment based on existing embedded operating systems which extends the multithreaded programming model established in the software domain to reconfigurable hardware. Using threads and common synchronization and communication services as an abstraction layer, ReconOS allows for the creation of portable and flexible multithreaded applications targeting CPU/FPGA systems. This work discusses the ReconOS programming model and its execution environment, presents implementations based on modern platform FPGAs and the operating systems eCos and Linux, evaluates time and area overheads of the proposed mechanisms and, finally, demonstrates the feasibility of the multithreading design approach on several case studies.
Scalable FPGA-based Architecture for DCT Computation Using Dynamic Partial Reconfiguration
University of Central Florida, USA
In this paper, we propose FPGA-based scalable architecture for DCT computation using dynamic partial reconfiguration. Our architecture can achieve quality scalability using dynamic partial reconfiguration. This is important for some critical applications that need continuous hardware servicing. Our scalable architecture has thre features. First, the architecture can perform DCT computations for eight diferent zones, i.e., from 1£1 DCT to 8£8 DCT. Second, the architecture can change the configuration of processing elements to trade off the precisions of DCT coeficients with computational complexity. Third, unused PEs for DCT can be used for motion estimation computations. Using dynamic partial reconfiguration with 2.3 MB bitstreams, 80 distinct hardware architectures can be implemented. We show the experimental results and comparisons between diferent configurations using both partial reconfiguration and non-partial reconfiguration process. The detailed trade-offs among visual quality, power consumption, processing clock cycles, and reconfiguration overhead are analyzed in the paper.
Issue II: Configuring Hardware Architecture
Part III. Configuring Hardware Architecture
REDEFINE: Runtime Reconfigurable Polymorphic ASIC
Indian Institute of Science, SERC
Emerging embedded applications are based on evolving standards (ex. MPEG2/4, H.264/265, IEEE802.11a/b/g/n). Since most of these applications run on handheld devices, there is an increasing need for a single chip solution that can dynamically interoperate between diferent standards and their derivatives. In order to achieve high resource utilization and low power dissipation, we propose REDEFINE, a Polymorphic ASIC in which specialized hardware units are replaced with basic hardware units that can create the same functionality by runtime recomposition. It is a \future-proof" custom hardware solution for multiple applications and their derivatives in a domain. In this paper, we describe a compiler framework and supporting hardware comprising compute, storage and communication resources.
Applications described in High Level Language (for ex: C) are compiled into application substructures. For each application substructure a set of Compute Elements on the hardware are interconnected during runtime to form a pattern that closely matches the communication pattern of that particular application. The advantage is that the bounded CEs are neither processor cores nor logic elements as in FPGAs. Hence REDEFINE o_ers the power and performance advantage of an ASIC and the hardware recon_gurability and programmability of that of an FPGA/Instruction Set Processor.
In addition, the hardware supports Custom Instruction pipelining. Existing Custom Instruction extensions determine a sequence of instructions that repeatedly occur within the application and create Custom Instructions at design time to speed up the execution of this sequence. We extend this scheme further, where a kernel is compiled into Custom Instructions that bear strong producerconsumer relationship (and not limited to frequently occurring sequences of instructions). Custom Instructions realized as hardware compositions e_ected at runtime, allow several instances of the same to be active in parallel. A key distinguishing factor in majority of the emerging embedded applications is stream processing. To reduce the overheads of data transfer between Custom Instructions direct communication paths are employed among Custom Instructions. In this paper, we present the overview of the hardware-aware compiler framework which determines the NoC-aware schedule of transports of the data exchanged between the Custom Instructions on the interconnect. The results for the FFT kernel indicate a 25% reduction in the number of loads/stores and throughput improves by log(n) for n-point FFT when compared to sequential implementation. Overall REDEFINE o_ers exibility and a runtime recon_gurability at the expense of 1:16_ in power and 8_ in area overhead of that of an ASIC. REDEFINE implementation consumes 0.1_ the power of an FPGA implementation. In addition the con_guration overhead of the FPGA implementation is 1000_ more than that of REDEFINE.
FPGA Placement using Space Filling Curves: Theory Meets Practice
*Indian Statistical Institute, India; **Synopsis (India) Pvt. Ltd., India
Research in VLSI placement, an NP-hard problem, has branched in two di®erent directions. The first one employs iterative heuristics with many tunable parameters to produce near-optimal solution but without theoretical guarantee on its quality. The other one considers placement as graph embedding and designs approximation algorithms with provable bounds on the quality of the solution. In this paper, we aim at unifying the above two directions. First, we extend the existing approximation algorithms for graph embedding in 1D and 2D grid to those for hypergraphs, which typically model circuits to be placed on a FPGA. We prove an approximation bound of O(dplog n log log n) for 1D, i.e., linear arrangement and O(d log n log log n) for the 2D grid, where d is the maximum degree of hyperedges and n, the number of vertices in the hypergraph. Next, we propose an e±cient method based on linear arrangement of the CLBs, and the notion of space ¯lling curves, for placing the con¯gurable logic blocks (CLBs) of a netlist on island-style FPGAs with an approximation guarantee of O( 4plog npkd log log n), where k is the number of nets. For the set of FPGA placement benchmarks, the running time is near-linear in the number of CLBs, thus allowing for scalability towards large circuits. We obtained on an average a 33£ speedup with only 1:31£ degradation in the quality of solution compared to that produced by the popular FPGA tool VPR, thereby demonstrating the suitability of this very fast method for FPGA placement, with a provable performance guarantee.
Power Scalability in a Mesh-connected Reconfigurable Architecture
RMIT University, Australia
We analyze power-area-performance tradeoffs within a hypothetical mesh-connected reconfigurable architecture. A new analytic model relating area, power and performance, based on a simple VLSI complexity metric, is used to determine the behavior of some computing functions mapped to the platform. Although it might reasonably be expected that entirely local connectivity in the array would impose severe delay overheads, thus making performance-power tradeoffs more dificult, it was found that the flexibility of the reconfigurable platform, in which logic and interconnect are (mostly) interchangeable can result in compact layouts which tends to offset the impact of the interconnect delay.
Spin Transfer Torque (STT)-MRAM based Run Time Reconfiguration FPGA circuit
*Université Paris Sud, IEF, France; **STMicroelectronics
As the minimum fabrication technology of CMOS transistor shrink down to 90nm or below, the high standby power has become one of the major critical issues for the SRAM based FPGA circuit due to the increasing leakage currents in the configuration memory. The integration of MRAM in FPGA in stead of SRAM is one of the most promising solutions to overcome this issue, because its non-volatility and high write/read speed allow to power down completely the logic blocks in .idle. states in the FPGA circuit. MRAM based FPGA promises as well as some advanced reconfiguration methods such as Run Time Reconfiguration and multi-context configuration. However, the conventional MRAM technology based on Field Induced Magnetic Switching (FIMS) writing approach consumes very high power, large circuit surface and produces high disturbance between memory cells. These drawbacks prevent FIMS-MRAM.s further development in memory and logic circuit. Spin Transfer Torque (STT) based MRAM is then evaluated to address these issues, some design techniques and novel computing architecture for FPGA logic circuits based on STT-MRAM technology are presented in this paper. By using STMicroelectronics CMOS 90nm technology and a STT-MTJ spice model, some chip characteristic results as the programming latency and power have been calculated and simulated to demonstrate the expected performance of STT-MRAM based FPGA logic circuits.