FPGA – Hardware Development best practices

FPGAs have emerged as a strong choice for systems that require high-performance DSP functionality. They support parallel computation, which makes it easy to parallelize the computationally intensive portions of a design. For example, in an application built around a DSP processor, the FPGA can take over critical filtering functions, freeing the DSP processor to perform algorithmically complex operations.

FPGAs have embedded memory, DSP blocks, embedded multipliers & embedded processors (soft or hard) that are well suited to implementing DSP functions such as FIR filters, FFTs, Encoders/Decoders and Arithmetic functions.

Embedded DSP blocks (or DSP slices) provide functions such as Multiply and Accumulate (MAC), Multiply & Add (MultAdd) and Addition/Subtraction, which are common arithmetic operations in DSP functions.
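To make the MAC operation concrete, here is a minimal behavioural sketch in Python. It is not HDL and does not model any particular DSP slice. The function names `mac` and `dot_product` are hypothetical, and the 48-bit wrap-around is an assumption chosen to mirror the accumulator width mentioned in this post.

```python
def mac(acc, a, b, width=48):
    """One multiply-accumulate step: acc + a*b, wrapped to a signed `width`-bit value."""
    mask = (1 << width) - 1
    result = (acc + a * b) & mask
    # Reinterpret the wrapped bit pattern as a two's complement number.
    if result >= 1 << (width - 1):
        result -= 1 << width
    return result


def dot_product(samples, coeffs):
    """Accumulate sample*coefficient products one pair at a time."""
    acc = 0
    for x, c in zip(samples, coeffs):
        acc = mac(acc, x, c)
    return acc
```

For example, `dot_product([1, 2, 3], [4, 5, 6])` returns 32, and `mac(2**47 - 1, 1, 1)` wraps around to `-(2**47)`, which shows why accumulator width matters when many products are summed.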

The 7 series FPGA DSP48E1 slice is functionally equivalent to, and fully compatible with, the Virtex-5/6 DSP slices. It contains a 25-bit Pre-adder, a 25x18-bit two’s complement Multiplier, a 48-bit Accumulator/Logic Unit, a Pattern Detector & Pipeline registers. Two DSP48E1 slices and dedicated interconnect form a DSP48E1 tile. Each 36K block RAM in the 7 series can be split into two 18K block RAMs.

The DSP slice within the UltraScale architecture is defined by the DSP48E2 primitive, which is backwards compatible with the 7 series FPGA DSP48E1 slice. It contains a 27-bit Pre-adder, an improved 27x18-bit Multiplier, a 48-bit Accumulator/Logic Unit, a Pattern Detector & Pipeline registers. One 36K Block RAM (which can be split into two 18K Block RAMs), CLBs, dedicated Interconnect & two DSP48E2 slices form each DSP tile. The DSP48E2 blocks use a signed arithmetic implementation, so to best match the resource capabilities and get the most efficient mapping, the HDL source should be written using signed values. For migration purposes, designs created for the 25×18 multiplier in the 7 series FPGAs may need to be sign extended for the 27×18 multiplier in the UltraScale architecture.
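The sign extension mentioned for migration can be illustrated with a short Python sketch. It shows only the arithmetic idea of widening a 25-bit two's complement operand to 27 bits. In a real design this would be done in the HDL source, and the helper names `sign_extend` and `to_signed` are hypothetical.

```python
def sign_extend(value, from_bits, to_bits):
    """Sign-extend a two's complement bit pattern from `from_bits` to `to_bits`."""
    if to_bits < from_bits:
        raise ValueError("to_bits must be >= from_bits")
    value &= (1 << from_bits) - 1          # keep only the source bits
    sign = value >> (from_bits - 1)        # 1 if the source value is negative
    if sign:
        # Replicate the sign bit into the new upper bits.
        value |= ((1 << to_bits) - 1) & ~((1 << from_bits) - 1)
    return value


def to_signed(pattern, bits):
    """Interpret a `bits`-wide bit pattern as a signed integer."""
    return pattern - (1 << bits) if pattern >> (bits - 1) else pattern
```

As a check, the 25-bit pattern for -5 is `0x1FFFFFB`. `sign_extend(0x1FFFFFB, 25, 27)` gives `0x7FFFFFB`, and `to_signed(0x7FFFFFB, 27)` is still -5, so the value is preserved across the wider multiplier input.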

Applications of the DSP slice include: Fixed- and floating-point FFT functions, Systolic FIR filters, MultiRate FIR filters, CIC filters & Wide real/complex multipliers/accumulators.

In a typical filter application, incoming data samples are combined with filter coefficients through carefully synchronized mathematical operations, which depend on the filter type and implementation strategy, before moving on to the next processing stage. If the data source and destination are analog signals, the input must first pass through an ADC and the results must be fed through a DAC. An Analog LPF can be placed before the ADC to remove unwanted signal content above the Nyquist frequency (fs/2) and so avoid aliasing. Various design tools are available to help select the ideal length of a filter, such as a Digital LPF, and its coefficient values. Decimation (or downsampling) at the output of the ADC, if required, can be carried out inside the FIR filter (Digital LPF) itself.
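The way samples combine with coefficients, and how decimation can be folded into the filter, can be sketched in a few lines of Python. This is a plain direct-form model for illustration only, not an FPGA implementation. The function name `fir_decimate` and its parameters are hypothetical. The point is that only every `factor`-th output is ever computed, so no work is spent on samples that would be thrown away.

```python
def fir_decimate(samples, coeffs, factor=1):
    """Direct-form FIR filter that computes only every `factor`-th output sample."""
    if factor < 1:
        raise ValueError("factor must be >= 1")
    outputs = []
    for n in range(0, len(samples), factor):
        acc = 0
        for k, c in enumerate(coeffs):
            if n - k >= 0:                 # samples before the start count as zero
                acc += c * samples[n - k]
        outputs.append(acc)
    return outputs
```

With a two-tap moving sum, `fir_decimate([1, 2, 3, 4, 5, 6], [1, 1])` returns `[1, 3, 5, 7, 9, 11]`, while `fir_decimate([1, 2, 3, 4, 5, 6], [1, 1], factor=2)` returns `[1, 5, 9]`, the same sequence with every second output kept.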

The goal is to select the appropriate parameters to achieve the required filter performance. A popular design tool for choosing these parameters is MATLAB. [ For further information, see my post:
https://www.kevnugent.com/2020/10/18/matlab-blogpost_005/ ]
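To give a feel for what a design tool produces, the sketch below computes low-pass FIR coefficients with the window method (a Hamming-windowed sinc). This is not the MATLAB workflow from the linked post, only a generic textbook technique written in Python. The function name `lowpass_coeffs` and the example numbers are assumptions.

```python
import math


def lowpass_coeffs(num_taps, cutoff_hz, sample_rate_hz):
    """Hamming-windowed sinc low-pass coefficients, normalized to unity DC gain."""
    if num_taps < 2:
        raise ValueError("num_taps must be at least 2")
    if not 0 < cutoff_hz < sample_rate_hz / 2:
        raise ValueError("cutoff must lie between 0 and the Nyquist frequency")
    fc = cutoff_hz / sample_rate_hz        # normalized cutoff in cycles per sample
    mid = (num_taps - 1) / 2
    coeffs = []
    for n in range(num_taps):
        t = n - mid
        # Ideal low-pass impulse response (sinc), with the t == 0 limit handled.
        ideal = 2 * fc if t == 0 else math.sin(2 * math.pi * fc * t) / (math.pi * t)
        window = 0.54 - 0.46 * math.cos(2 * math.pi * n / (num_taps - 1))
        coeffs.append(ideal * window)
    total = sum(coeffs)
    return [c / total for c in coeffs]
```

Calling `lowpass_coeffs(31, 1000, 8000)` returns 31 symmetric coefficients that sum to 1. The filter length and cutoff are exactly the parameters that have to be traded off against the required filter performance.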

The embedded memory in FPGAs can satisfy many of a design's storage requirements and, in certain cases, eliminates the need for external memory devices.

Embedded processors in FPGAs provide overall system integration & flexibility while partitioning the system between hardware & software. Designers can implement the system software components in the Embedded Processing System (PS) & implement hardware components in the FPGA Programmable Logic resources (PL).

FPGAs can implement Hardware Accelerators tailored to each application, allowing the designer to get the best performance from Hardware Acceleration. These blocks can be built from parameterizable IP blocks or designed from scratch in HDL.

IP cores for Hardware acceleration could be: General cores (eg FIR, IIR), Modulation cores (eg QPSK, Equalizer etc), Encryption cores (eg DES), Error Correction cores (eg Convolutional Encoder/Viterbi Decoder, CRC, Reed Solomon Encoder/Decoder etc).
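As an illustration of the kind of function such a core performs, here is a behavioural Python sketch of a convolutional encoder. The default parameters (rate 1/2, constraint length 3, generator polynomials 7 and 5 in octal) are a common textbook choice and an assumption here, not the configuration of any vendor IP core. The name `conv_encode` is hypothetical.

```python
def conv_encode(bits, generators=(0b111, 0b101), constraint_length=3):
    """Convolutional encoder producing len(generators) output bits per input bit."""
    state = 0                              # shift register, newest bit in the MSB
    mask = (1 << constraint_length) - 1
    out = []
    for b in bits:
        # Shift the register and insert the new input bit at the top.
        state = ((state >> 1) | (b << (constraint_length - 1))) & mask
        for g in generators:
            # Each output bit is the parity (XOR) of the register bits tapped by g.
            out.append(bin(state & g).count("1") % 2)
    return out
```

Encoding the input `[1, 0, 1, 1]` gives `[1, 1, 1, 0, 0, 0, 0, 1]`. Each output bit is just a few XORs of shift-register bits, which maps naturally onto FPGA logic.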

Parameterized IP provides maximum flexibility, allowing designers to customize IP without changing a design's source code. Designers can integrate a parameterized IP core in HDL and can also port the IP to new FPGA families, leading to higher performance & lower cost.

If the system sample rate is below a few kHz and the design is a single-channel implementation, the DSP processor may be the obvious choice. However, as sample rates increase beyond a couple of MHz, or if the system requires more than one channel, FPGAs become more and more attractive. In a multiple-channel or high-speed system, we can take advantage of the parallelism within the FPGA device to maximize performance.
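A rough back-of-the-envelope calculation shows why sample rate and channel count drive this choice. The Python sketch below estimates the MAC rate an FIR filter needs and how many MAC units must run in parallel to sustain it. The function names, the 64-tap filter and the 250 MHz clock are all hypothetical figures chosen for illustration.

```python
def required_mac_rate(sample_rate_hz, num_taps, num_channels=1):
    """MAC operations per second needed for a num_taps FIR filter on every channel."""
    return sample_rate_hz * num_taps * num_channels


def parallel_macs_needed(mac_rate, clock_hz):
    """Minimum number of MAC units, each doing one MAC per clock, to sustain mac_rate."""
    return -(-mac_rate // clock_hz)        # ceiling division on integers
```

A single channel at 2 kHz with 64 taps needs `required_mac_rate(2_000, 64)`, which is 128,000 MACs per second and easily handled by one sequential MAC unit. Four channels at 10 MHz need 2,560,000,000 MACs per second, and `parallel_macs_needed(2_560_000_000, 250_000_000)` returns 11, so the work has to be spread across parallel hardware.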

The DSP processor falls behind in raw data processing power for certain data-intensive DSP functions such as Convolutional Encoding/Viterbi Decoding & FIR filters. At high data rates, the DSP may struggle to capture, process and output the data without any loss. This is due to the many shared resources and buses, and even the shared core, within the processor. The FPGA, however, can dedicate resources to each of these functions.

DSPs are instruction based, not clock based. Typically 3-4 instructions are required for any mathematical operation on a single sample. The data must first be captured at the input, then forwarded to the processing core, cycled through that core for each operation and then released through the output.

DSP processor devices also offer no customization for the needs of a design. Hardware Accelerator (Co-processor) blocks, such as a Viterbi Co-processor, Turbo Co-processor or Enhanced Filter Co-processor, are fixed. DSP processors provide only a limited number of multipliers. Many DSP applications use external memory devices to manage large amounts of data. The MAC operation is usually the performance bottleneck in most DSP applications.

In contrast, the FPGA is clock based, so every clock cycle can potentially perform a mathematical operation on the incoming data stream.
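The difference between the two models can be put into numbers with a simple, idealized estimate. The Python sketch below ignores memory access, pipelining latency and I/O overhead. The function name `max_sample_rate`, the clock frequencies and the tap count are hypothetical, and the figure of 4 cycles per operation follows the "3-4 instructions" rule of thumb given above.

```python
def max_sample_rate(clock_hz, num_taps, cycles_per_op=1, parallel_units=1):
    """Idealized upper bound on sample rate for a num_taps filter."""
    # Each unit handles its share of the taps, one operation after another.
    ops_per_unit = -(-num_taps // parallel_units)   # ceiling division
    return clock_hz / (ops_per_unit * cycles_per_op)
```

For a 64-tap filter, a sequential processor at 1 GHz with 4 cycles per operation gives `max_sample_rate(1_000_000_000, 64, cycles_per_op=4)`, which is 3,906,250 samples per second. An FPGA at 250 MHz with one multiplier per tap gives `max_sample_rate(250_000_000, 64, parallel_units=64)`, which is 250,000,000 samples per second: one result per clock despite the slower clock.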