

\* *functional units:* they can perform one or more computations, e.g. addition, multiplication, comparison, ALU.

Sabih H. Gerez, University of Twente



February 12, 2012

put.

February 12, 2012

 $o = i_k, \ k = 2c_1 + c_0$ 



## **ONE-TO-ONE MAPPING ON HARDWARE**

*One-to-one mapping:* a direct mapping from data-flow graph (DFG) onto hardware.

- \* There is a separate *functional unit* (FU) in the hardware for each computational node in the DFG.
- \* Connections between FUs in the hardware are directly derived from the edges of the DFG.
- \* Specification style is close to RTL.
- \* The mapping basically results in a *data path*; *controller* is relatively simple.

FUNCTION MULTIPLEXING AND TIME

FOLDING (2)

\* When n > 1, one can say that the DFG needs to be cut in parts in such a way that each part should be mapped on one and the same

Each part of the DFG can be considered a separate function, hence the name *function multiplexing*. Mapping multiple functions on the same hardware will in general require hardware multiplexers in the

Because one traverses the data path multiple times for one iteration, one could say that the "linear" time in the DFG folds back several



IMPLEMENTATION OF DIGITAL SIGNAL PROCESSING ARCHITECTURAL SYNTHESIS

### FUNCTION MULTIPLEXING AND TIME FOLDING (1)

- \* The signal samples to be processed arrive every  $T_0$  time units, the *iteration period*.
- \* Call the clock period of the hardware clock (assume that there is only one)  $T_c$ ,  $T_0 = nT_c$ , *n* being an integer.
- \* If n = 1, one-to-one mapping is the only option.
- \* If n > 1, there are opportunities to save hardware by reusing the same FUs for multiple computations in the DFG (one can e.g. perform 3 additions per iteration on a single adder when n = 3).
- \* *n* is called the *multiplex factor* or *reuse factor*.

 Sabih H. Gerez, University of Twente
 February 12, 2012

 Sabih H. Gerez, University of Twente
 February 12, 2012

## FOLDING AN FIR FILTER



How would this DFG fold on hardware with a single *multiply-add-accumulate* (MAC)?

#### time on the hardware, hence the name time folding.

\* The controller becomes more complex.

\* The issue is, of course: how to do it optimally?

data path.

data path.





#### **OPTIMIZATION CRITERIA**

Most commonly used:

- \* *Time-constrained synthesis:* given the iteration period  $T_0$ , use as few processors as possible or as little hardware as possible (typical for DSP).
- \* *Resource-constrained synthesis:* given a multiprocessor configuration or a set of hardware resources on chip, minimize  $T_{0r}$

Another important issue:

\* Minimization of power.

ARCHITECTURAL SYNTHESIS

## SCHEDULING TERMINOLOGY

- \* *Static* scheduling means: mapping to time and processor (functional unit, register, etc.) is identical in all iterations.
- \* A static schedule is either *overlapped* (exploiting *interiteration* parallelism) or *nonoverlapped*.



Parhi, K.K. and D.G. Messerschmitt, "Static Rate-Optimal Scheduling of Iterative Data-Flow Programs via Optimum Unfolding", IEEE Transactions on Computers, Vol. 40(2), pp 178–195, (February 1991).



13

February 12, 2012

IMPLEMENTATION OF DIGITAL SIGNAL PROCESSING ARCHITECTURAL SYNTHESIS

# TERMINOLOGY

Subtasks in high-level synthesis:

- \* *Scheduling:* determine for each operation the time at which it should be performed such that no precedence constraint is violated.
- \* Allocation: specify the hardware resources that will be necessary.
- \* Assignment: provide a mapping from each operation to a specific functional unit and from each variable to a register.

Remarks:

- \* The subproblems are strongly interrelated; they are, however, often solved separately.
- \* Traditionally, scheduling precedes assignment. However, starting with assignment may give better results (more regular data path).
- \* Scheduling (except for a few versions) is NP-complete ⇒ heuristics have to be used.

#### Sabih H. Gerez, University of Twente

February 12, 2012

**1**6

6

IMPLEMENTATION OF DIGITAL SIGNAL PROCESSING ARCHITECTURAL SYNTHESIS

## SCHEDULING TERMINOLOGY (Ctd.)

- \* Overlapped scheduling is also called: *loop folding, software pipelin-ing.*
- \* The delay between consumption of input and production of output is called the *latency*  $\lambda$ . In general  $\lambda \neq T_0$ .
- \* An overlapped schedule may allow shorter iteration period or hardware utilization, but:
- \* the search space is larger and finding optimal solutions harder.

Not covered in this presentation:

- \* cyclostatic schedules
- \* *dynamic* schedules (requires a run-time scheduler).



A cyclostatic schedule



### SOFTWARE PIPELINING EXAMPLE

```
without software pipeline: with
for (i=1; i<n; i++) {
    a[i] = f(x[i]); for (i=
    y[i] = g(a[i]); a[i]
    y[i-1]
}
```

```
with software pipeline:
a[1] = f(x[1]);
for (i=2; i<n; i++) {
    a[i] = f(x[i]);
    y[i-1] = g(a[i-1]);
}
y[n-1] = g(a[n-1]);
```

- \* In the situation without software pipeline, there is a data dependency between the two statements; they cannot be parallelized.
- \* In the situation with software pipeline, the loop body can be parallelized at the expense of some initial and termination code. When the parallelization has been implemented, iterations of the original loop are overlapping in time.

Sabih H. Gerez, University of Twente

February 12, 2012

IMPLEMENTATION OF DIGITAL SIGNAL PROCESSING ARCHITECTURAL SYNTHESIS **1**9

17

### **EXAMPLE SCHEDULE**

The schedule and operation assignment with an allocation of one adder and one multiplier:





## **EXAMPLE DATA-FLOW GRAPH**

The second-order filter section made acyclic (delay elements replaced by output-input pairs):



Sabih H. Gerez, University of Twente

February 12, 2012

20



IMPLEMENTATION OF DIGITAL SIGNAL PROCESSING ARCHITECTURAL SYNTHESIS

### **EXAMPLE DATA PATH**

The resulting data path after register assignment (the specification of a controller completes the design):





February 12, 2012

February 12, 2012



#### **CRITICAL LOOP**

For a general DFG,  $T_{0_{\rm min}}$  is given by:

$$T_{0_{\min}} = \max_{all\ loops\ l} \left| \begin{array}{c} \sum_{c \in l} \delta(c) \\ \sum_{d \in l} \mu(d) \end{array} \right|$$

- Provide the ACM, Vol. 15(4), pp 590–599, (October 1968)
- IF Renfors, M. and Y. Neuvo, "The Maximum Sampling Rate of Digital Filters Under Hardware Speed Constraints", IEEE Transactions on Circuits and Systems, Vol. CAS-28(3), pp 196-202, (March 1981).

- Many polynomial-time algorithms have been published; survey in: \*
- Dasdan, A., S.S. Irani and R.K. Gupta, "Efficient Algorithms for Optimum Cycle Mean and Optimum F Cost to Time Ratio Problems", 36th Design Automation Conference, (1999).
- \* An easy to understand but not very efficient method is based on "matrix multiplication".
- F Gerez, S.H., S.M. Heemstra de Groot and O.E. Herrmann, "A Polynomial-Time Algorithm for the Computation of the Iteration-Period Bound in Recursive Data-Flow Graphs", IEEE Transactions on Circuits and Systems I: Fundamental Theory and Applications, Vol. 39(1), pp 49-52, (January 1992).



#### EXAMPLE



February 12, 2012

28

### JLATION (1)

- putational nodes ( $C \subset V$ ); D
- + a library F of functional units (FUs) or resource types.
- \*  $\Omega$  is the set of all operations; the operation type of a computational node is given by  $\gamma : C \rightarrow \Omega$ .
- \* The delay for the fastest execution of an operation is given by  $\delta: C \to \mathbb{N}.$
- \* The multiplicity of a delay element is given by  $\mu: D \to \mathbb{N}$ .
- An FU can execute one or more operations given by  $\Gamma: F \to 2^{\Omega}$  (2<sup> $\Omega$ </sup> \* is the *power set* of  $\Omega$ , the set of all its subsets).



#### **PROBLEM FORMULATION (2)**

- \* Goal is to find:
  - + a scheduling  $\sigma: C \rightarrow \mathbb{Z}$ , where  $\mathbb{Z}$  is the set of integers (time instants).
  - + an assignment  $a: C \rightarrow F$ , such that  $\gamma(c) \subset \Gamma(a(c))$ .
- \* Such that:
  - + all precedence relations are satisifed;
  - + the cost of the resulting hardware is minimal for a given iteration period, or
  - + the iteration period is minimal for a given hardware configuration.



IPLEMENTATION OF DIGITAL SIGNAL PROCESSING ARCHITECTURAL SYNTHESIS

#### FORWARD PRECEDENCE RELATIONS

\* Inequalities for starting time Example: derived from edges connecting two computational nodes.

> $\forall c_i, c_j \in C, (c_i, c_j) \in E:$  $\sigma(c_i) \ge \sigma(c_i) + \delta(c_i)$

 $\begin{array}{c} \underbrace{\overset{c_2}{\leftarrow} \overset{c_7}{\leftarrow} \sigma(c_7) \geq \sigma(c_2) + \delta(c_2)}_{c_8} \\ \underbrace{\overset{c_9}{\leftarrow} \overset{c_8}{\leftarrow} \sigma(c_7) \geq \sigma(c_8) + \delta(c_8)}_{\sigma(c_8) \geq \sigma(c_9) + \delta(c_9)} \\ \underbrace{\overset{c_{10}}{\leftarrow} \sigma(c_8) \geq \sigma(c_{10}) + \delta(c_{10})}_{\sigma(c_8) \geq \sigma(c_{10}) + \delta(c_{10})} \end{array}$ 

Freemstra de Groot, S.M., S.H. Gerez and O.E. Herrmann, "Range-Chart-Guided Iterative Data-Flow-Graph Scheduling", IEEE Transactions on Circuits and Systems I: Fundamental Theory and Applications, Vol. 39(5), pp 351-364, (May 1992).

Sabih H. Gerez, University of Twente

February 12, 2012

32

Sabih H. Gerez, University of Twente February 12, 2012 MPLEMENTATION OF DIGITAL SIGNAL PROCESSING 31 ARCHITECTURAL SYNTHESIS

## BACKWARD PRECEDENCE

For all paths *R* connecting two nodes  $c_i$ ,  $c_i$  via delay nodes:

$$\sigma(c_j) \ge \sigma(c_i) + \delta(c_i) - T_0 \sum_{d \in R} \mu(d)$$

Example:



29

IMPLEMENTATION OF DIGITAL SIGNAL PROCESSING ARCHITECTURAL SYNTHESIS

## THE INEQUALITY GRAPH

\* Given  $T_0$ , all inequalities can be written as:

$$\sigma(c_a) - \sigma(c_b) \ge \kappa_{ab}$$

where  $\kappa_{ab}$  is a constant.

- \* For all input nodes,  $i \in I$ , set  $\sigma(i) = 0$ .
- \* For all output nodes,  $o \in O$ , set  $\sigma(o) = \lambda$ , where  $\lambda$  is the *latency*.
- \* Create an *inequality graph* with a vertex for each computational, input and output node in the DFG.
- \* All inequalities can be represented in the graph with edges having weight  $\kappa_{ab}$ .

30



The inequality graph of the second-order filter section



## SOLVING THE INEQUALITIES

- Lower bounds for the node scheduling times  $\tilde{\sigma}_{-}(v)$  are found by solving the longestpath problem, with the input nodes as the sources.
- \* Upper bounds  $\tilde{\sigma}_{+}(v)$  are found analogously, by computing the longest paths traversing the graph in the reverse direction starting from the output nodes. \*
- \* Scheduling all operations on their lower bounds, gives an

ASAP (as soon as possible) schedule and the upper bounds lead to an ALAP (as late as possible) schedule.

- \* A partial schedule  $\tilde{\sigma}: V \rightarrow [\mathbb{Z}, \mathbb{Z}]$  assigns a scheduling range or time *frame* to each  $v \in V$ ;  $\tilde{\sigma}(v) = [\tilde{\sigma}_{-}(v), \tilde{\sigma}_{+}(v)].$
- The initial scheduling range is given by ASAP/ALAP schedule.

| Sabih H. Gerez, University of Twente                                   | February 12, 2012 | Sabih H. Gerez, University of Twente                                   | February 12, 2012 |
|------------------------------------------------------------------------|-------------------|------------------------------------------------------------------------|-------------------|
| IMPLEMENTATION OF DIGITAL SIGNAL PROCESSING<br>ARCHITECTURAL SYNTHESIS | ▶ 35              | IMPLEMENTATION OF DIGITAL SIGNAL PROCESSING<br>ARCHITECTURAL SYNTHESIS | ▶ 36              |

33

### MOBILITY-BASED SCHEDULING

- \* The difference  $\tilde{\sigma}_{+}(v) \tilde{\sigma}_{-}(v)$  is called *v*'s *mobility* (or *freedom*).
- \* It can be used as the basis of many scheduling heuristics (most scheduling problems are NP-complete).
- \* Finding a schedule can be seen as the generation of a sequence of partial schedules until all mobility has diappeared.

Principles of a general mobility-based heuristic:

- \* determine scheduling ranges of all operations;
- decide to shorten scheduling range of at least one operation;
- update ranges of all affected operations;
- \* repeat until all operations have mobility zero.

## **EXAMPLE HEURISTIC**

Hybrid genetic algorithm:

- \* Design a simple heuristic that can be parametrized by a permutation of all operations and that guickly generates a schedule.
- \* Use a genetic algorithm to find the permutation that leads to the best schedule.
- IF Heijligers, M.J.M. and J.A.G. Jess, "High-Level Synthesis Scheduling and Allocation Using Genetic Algorithms Based on Constructive Topological Scheduling Techniques", International Conference on Evolutionary Computation, Perth, Australia, (1995).

IMPLEMENTATION OF DIGITAL SIGNAL PROCESSING ARCHITECTURAL SYNTHESIS

### **UPDATING THE RANGES**

The best known method requires O(|C|) time per update.

MPLEMENTATION OF DIGITAL SIGNAL PROCESSING

**ARCHITECTURAL SYNTHESIS** 

*v*<sub>6</sub>(1)

v<sub>5</sub>(2)

 $V_{7}(1)$ 

 $V_{4}(2)$ 

+

v<sub>3</sub>(4)

v<sub>2</sub>(3)

v₁(5)

- \* Precompute *all-pairs longest paths* in inequality graph using e.g. the Floyd-Warshall or Johnson algorithm and store the path lengths in a *distance matrix*  $D_{T_0}$ .
- \* Suppose that operation *v* has been scheduled at time  $\sigma(v)$ ; then for all  $u \in C$ :

 $\tilde{\sigma}_{-}(u) \leftarrow \max(\tilde{\sigma}_{-}(u), \sigma(v) + D_{T_0}[v, u])$  $\tilde{\sigma}_{+}(u) \leftarrow \min(\tilde{\sigma}_{+}(u), \sigma(v) - D_{T_0}[u, v])$ 

IP Heijligers, M.J.M., "The Application of Genetic Algorithms to High-Level Synthesis", Ph.D. Thesis, Eindhoven University of Technology, Department of Electrical Engineering, (October 1996).

LIST SCHEDULING EXAMPLE

ALU #2  $|V_4| V_2$ 

 $V_1$ 

ALU #1



## LIST SCHEDULING

- \* A resource-constrained scheduling method.
- \* Start at time zero and increase time until all operations have been scheduled.
- \* The *ready list*  $L_t$  contains all operations that can start their execution at time t or later.
- \* If more operations are ready than there are resources available, use some priority function to choose, e.g. the longest-path to the output node ⇒ *critical-path list scheduling*.

Sabih H. Gerez, University of Twente

February 12, 2012

38

MPLEMENTATION OF DIGITAL SIGNAL PROCESSING ARCHITECTURAL SYNTHESIS

#### **4**0

#### THE ASSIGNMENT PROBLEM

Subtasks in assignment:

- \* operation-to-FU assignment
- \* value grouping
- value-to-register assignment
- \* transfer-to-wire assignment
- \* wire to FU-port assignment

In general: task-to-agent assignment

Sabih H. Gerez, University of Twente

37

February 12, 2012

 $v_6 | v_7$ 

 $V_5$ 

 $V_3$ 

23

**3**9



Sabih H. Gerez, University of Twente

February 12, 2012

Sabih H. Gerez, University of Twente

February 12, 2012

