IMPLEMENTATION OF DSP FFT BASICS + FFT PIPLELINES

\_

# PIPELINE IMPLEMENTATIONS OF THE FAST FOURIER TRANSFORM (FFT)

#### • Consulted work:

Chiueh, T.D. and P.Y. Tsai, *OFDM Baseband Receiver Design for Wireless Communications*, John Wiley and Sons Asia, (2007). Second edition of 2012 available as e-book via UT Library.

| © Sabih H. Gerez, University of Twente, The Netherlands |   |               |
|---------------------------------------------------------|---|---------------|
| IMPLEMENTATION OF DSP                                   | • | 3             |
| UT. FFT BASICS + FFT PIPLELINES                         |   | April 6, 2018 |

### **DISCRETE FOURIER TRANSFORM (DFT)**

• Consider a block of *N* samples of a (possibly complex-valued) data stream:

 $x[n], n = 0, 1, \dots N - 1$ 

• The discrete Fourier transform of this block is given by:

$$X[k] = \sum_{n=0}^{N-1} x[n] e^{-j\frac{2\pi k}{N}n}, k = 0, 1, \dots N-1$$

UT. IMPLEMENTATION OF DSP

April 6, 2018

## TOPICS

- Discrete Fourier Transform
- Fast Fourier Transform
  - Decimation in time
  - Decimation in frequency
- FFT pipelines:
  - Radix-2 multi-path delay commutator
  - Radix-2 single-path delay feedback
  - Delay buffer implementation
  - Radix-4 algorithms



# INVERSE DFT (IDFT)

• The inverse DFT is almost the same computation as the DFT:

$$x[n] = \frac{1}{N} \sum_{k=0}^{N-1} X[k] e^{j\frac{2\pi n}{N}k}, n = 0, 1, \dots N-1$$

IMPLEMENTATION OF DSP

**JI.** FFT BASICS + FFT PIPLELINES

5 April 6, 2018

#### TWIDDLE FACTORS

- Define:  $W_N = e^{-j\frac{2\pi}{N}}$
- Then DFT becomes:  $X[k] = \sum_{n=0}^{N-1} x[n] W_N^{kn}$
- $W_N^{kn}$  is called a *twiddle factor* (it is a number on the unit circle in the complex plane).
- Multiplying with a twiddle factor is a *vector rotation*. How to implement?



# DFT VISUALIZATION

- The DFT basically matches frequencies.
- The following picture originates from Wikipedia (entry: "DFT

matrix"):





#### April 6, 2018

#### MATRIX REPRESENTATION OF DFT

• The DFT can be expressed in matrix form:



 Number of complex multiplications involved (including trivial ones with 1, j, etc.): N<sup>2</sup>



# FAST FOURIER TRANSFORM

- Reduces the number of calculations in a DFT from O(N<sup>2</sup>) to O(N log N).
- First published by Cooley and Tukey in 1965.
- Check e.g. Wikipedia page for historical background dating back to Gauss in 1805 (earlier than Fourier!).
- Two variants:
  - Decimation in time
  - Decimation in frequency.





IMPLEMENTATION OF DSP FFT BASICS + FFT PIPLELINES

April 6, 2018

17

# **DECIMATION-IN-TIME FFT (8)**

- One *decimation-in-time* step defines an *N-point* DFT in terms of two N/2-point DFTs. They are combined by means of butterflies.
- Applying the principle *recursively*, results in a computation that consists of butterflies only.



# **COMPLEXITY REDUCTION**

- Number of computations has been reduced: ٠ – From  $\mathcal{O}(N^2)$  for the DFT
  - To  $\mathcal{O}(N \log N)$  for the FFT



#### IMPLEMENTATION OF DSP

FFT BASICS + FFT PIPLELINES

April 6, 2018

## IMPLEMENTATION ISSUES

- Buffering of input is needed because of block-based nature.
- One could e.g. use a *ping-pong memory*:
  - While the input is filling one memory, the FFT could consume samples of the other memory.
  - After processing one block of samples, the memories change roles.
- Note the *bit reversal* of the addresses:
  - Address order can be found from increasing binary addresses read in reverse order  $4 = 100_2$  e.g. becomes  $1 = 001_2$ .

UT. IMPLEMENTATION OF DSP

\_\_\_\_

21 April 6, 2018

# **DECIMATION-IN-FREQUENCY FFT (1)**

• Split input block in first and second half and consider the outputs with even index:

$$X[2m] = \sum_{\substack{n=0\\N-1\\\sum\\n=N/2}}^{N/2-1} x[n]W_N^{2mn} + \sum_{\substack{n=N/2\\n=N/2}}^{N-1} x[n]W_N^{2mn} + \sum_{\substack{n=N/2\\m=0,1,\dots,\frac{N}{2}-1}}^{N-1}$$

$$m = 0, 1, \dots, \frac{N}{2} - 1$$

$$MPLEMENTATION OF DSP = 23$$

$$MPLEMENTATION OF DSP = 23$$

$$April 6, 2018$$

#### **DECIMATION-IN-FREQUENCY FFT (3)**

• Using simplification rules mentioned earlier:

$$X[2m] = \sum_{n=0}^{N/2-1} (x[n] + x[n + N/2]) W_{N/2}^{mn}$$
$$m = 0, 1, \dots, \frac{N}{2} - 1$$

$$\underbrace{\text{PFT BASICS + FFT PIPLELINES}}_{\text{FFT BASICS + FFT PIPLELINES}} = 22$$

$$\underbrace{\text{April 6, 2018}}_{\text{April 6, 2018}}$$

$$\underbrace{\text{DECIMATION-IN-FREQUENCY FFT (2)}}_{\text{C} = N/2 - 1} x[n]W_N^{2mn} + N/2]W_N^{2mn} + N/2 W_N^{2mn} + N/$$

• This is a half-size DFT applied to an input stream consisting of pairs of the original input stream:

$$X[2m] = \sum_{n=0}^{N/2-1} (x[n] + x[n + N/2]) W_{N/2}^{mn}$$
$$m = 0, 1, \dots, \frac{N}{2} - 1$$

© Sabih H. Gerez, University of Twente, The Netherlands

IMPLEMENTATION OF DSP FFT BASICS + FFT PIPLELINES

April 6, 2018

25

#### **DECIMATION-IN-FREQUENCY FFT (5)**

• Now consider the outputs with odd index:

$$X[2m+1] = \sum_{n=0}^{N/2-1} x[n]W_N^{n(2m+1)} + \sum_{\substack{N/2-1\\ \sum \\ n=0}}^{N/2-1} x[n+N/2]W_N^{(2m+1)(n+N/2)}$$
$$m = 0, 1, \dots, \frac{N}{2} - 1$$
  
• estable H. Gerez, University of Twente, The Netherlands  
• estable H. Gerez, University of Twente, The Netherlands  
• estable H. Gerez, University of Twente, The Netherlands  
• estable H. Gerez, University of Twente, The Netherlands  
• estable H. Gerez, University of Twente, The Netherlands  
• estable H. Gerez, University of Twente, The Netherlands  
• estable H. Gerez, University of Twente, The Netherlands  
• estable H. Gerez, University of Twente, The Netherlands  
• estable H. Gerez, University of Twente, The Netherlands  
• estable H. Gerez, University of Twente, The Netherlands  
• estable H. Gerez, University of Twente, The Netherlands  
• estable H. Gerez, University of Twente, The Netherlands  
• estable H. Gerez, University of Twente, The Netherlands  
• estable H. Gerez, University of Twente, The Netherlands  
• estable H. Gerez, University of Twente, The Netherlands  
• estable H. Gerez, University of Twente, The Netherlands  
• estable H. Gerez, University of Twente, The Netherlands  
• estable H. Gerez, University of Twente, The Netherlands  
• estable H. Gerez, University of Twente, The Netherlands  
• estable H. Gerez, University of Twente, The Netherlands  
• estable H. Gerez, University of Twente, The Netherlands  
• estable H. Gerez, University of Twente, The Netherlands  
• estable H. Gerez, University of Twente, The Netherlands  
• estable H. Gerez, University of Twente, The Netherlands  
• estable H. Gerez, University of Twente, The Netherlands  
• estable H. Gerez, University of Twente, The Netherlands  
• estable H. Gerez, University of Twente, The Netherlands  
• estable H. Gerez, University of Twente, The Netherlands  
• estable H. Gerez, University of Twente, The Netherlands  
• estable H. Gerez, University of Twente, The Netherlands  
• estable H. Gerez, University of Twente, The Netherlands  
• estable H. Gerez, University of Twente, The Netherlands  
• estable H. Gerez, Univ



26 April 6, 2018

#### **DECIMATION-IN-FREQUENCY FFT (6)**

• With the usual type of simplifications:

$$X[2m+1] = \sum_{n=0}^{N/2-1} (x[n] - x[n+N/2]) W_N^n W_{N/2}^m$$
$$m = 0, 1, \dots, \frac{N}{2} - 1$$

• This is a half-size DFT applied on a sequence obtained by taking the difference of pairs of the input and multiplying them with factor  $W_N^n$ .



© Sabih H. Gerez, University of Twente, The Netherlands

 $W_{N}^{N/2-1}$ 

N/2-point

DFT

X[N-1]

Ξ

x[N-1]





Originally proposed in:

Rabiner, L.R. and B. Gold, *Theory and Application of Digital Signal Processing*, Prentice-Hall, Englewood Cliffs, New Jersey, (1975).

UT. IMPLEMENTATION OF DSP

33 April 6, 2018

#### **R2MDC FOR 8-POINT FFT**

- · Consists of
  - Commutators, switches that send the data either straight ahead or crisscross
  - Butterfly blocks (either for DIT or DIF)
  - Delay buffers



UT. FFT BASICS + FFT PIPLELINES

April 6, 2018

# **R2MDC EFFICIENCY**

- If implemented as presented, the pipeline has a 50% hardware utilization:
  - The first butterfly is idle half of the time until it can process new pairs of inputs.
  - All hardware operates at input sample frequency
- Situation can be easily improved by duplicating input buffer and operating it in ping-pong fashion:
  - Hardware utilization jumps to 100%.
  - All hardware operates at half the sample frequency.
- As there are no feedback paths, hardware can be further pipelined to cope with possibly long computational delays.





• FFT BASICS + FFT PIPLELINES

April 6, 2018

36

### R2MDC WITH PING-PONG INPUT BUFFER





· Alternative pipeline solution, with optimized memory size

Originally proposed in:

٠

1970).

**RADIX-2 SINGLE-PATH DELAY** 

**FEEDBACK (R2SDF)** 

Groginsky, H.L. and G.A. Works, A Pipeline Fast Fourier Transform, IEEE Transactions on Computers, Vol.C-19(11), pp. 1015-1019, (November

37

38 April 6, 2018

#### **R2SDF ELEMENT**

Element:

UT.

- Either shifts first half of input in delay buffer and second half of output out of delay buffer (blue);

IMPLEMENTATION OF DSP

FFT BASICS + FFT PIPLELINES

- Or shifts second half of input into butterfly together with first half from delay buffer, first half of output to next stage and second half output into delay buffer (red).







© Sabih H. Gerez, University of Twente, The Netherlands



40



Dual-port RAM

44

April 6, 2018

Write

data

Read

data

42

April 6, 2018





© Sabih H. Gerez, University of Twente, The Netherlands

IMPLEMENTATION OF DSP

\_

53 April 6, 2018

## **RADIX-2<sup>2</sup> BUTTERFLY EVALUATION**

- Radix-2<sup>2</sup> has the same number of multipliers as radix-4.
- It has fewer additions: N log<sub>2</sub> N

FFT BASICS + FFT PIPLELINES

 In a similar way, a radix-8 butterfly can be decomposed in a radix-2<sup>3</sup> butterfly.



April 6, 2018

54

# RADIX-4 MULTI-PATH DELAY COMMUTATOR (R4MDC) PIPELINE FOR 16-POINT FFT



#### © Sabih H. Gerez, University of Twente, The Netherlands IMPLEMENTATION OF DSP 55 UT. FFT BASICS + FFT PIPLELINES April 6, 2018 **RADIX-4 SINGLE-PATH DELAY FEEDBACK (R4SDF) PIPELINE FOR 16-POINT FFT** 4D 1D 4D 1D 1D 4D Butterfly Butterfly

IMPLEMENTATION OF DSP

UT. FFT BASICS + FFT PIPLELINES

April 6, 2018

56

# **EVALUATION RADIX-4 PIPELINES**

- The multi-path structure:
  - Has a high memory overhead;
  - Has 100% multiplier utilization;
  - Can operate at one quarter of sample frequency.
- The single-path structure:
  - Has minimal memory;
  - Has a 75% multiplier utilization (multiplier drawn outside butterfly!)
  - Operates at sample frequency.