# ED29116 <br> DESIGNING <br> WITH THE Am29116 <br> 16-BIT <br> BIPOLAR MICROPROCESSOR 

Barbara Albert<br>Arndt Bode, Dr. rer. nat.<br>J.W. Locke, Ph.D.

March 1983

Customer Education Center
Advanced Micro Devices, Inc.

## Copyright by

ADVANCED MICRO DEVICES, INC.
$9 \not 1$ Thompson Place
Sunnyvale
California ..... $94 \emptyset 86$
1983

INDEX to VOLUME II
Day 2 ..... page
3. Exercises on the Am29116, part 1 ..... 3.10
Exercises ..... $3.2 \emptyset$
Solutions ..... $3.8 \emptyset$
4. Am29116 Bit-Mapped Graphics Controllers ..... 4.10
Introduction ..... 4.20
Drawing a Vector ..... 4.30
Fast Vector-Plotting Algorithm ..... $4.5 \emptyset$
Am29116 Microcode for Vector Generation ..... $4.8 \emptyset$

- Improving the Vector Algorithm ..... 4.110

5. Application of the Am29116 for intelligent controllers ..... $5.1 \varnothing$
Intelligent controller structures ..... 5.15

- low speed version ..... $5.3 \emptyset$
- high speed version ..... 5.110
- comparison with an Am2901 based solution ..... $5.17 \emptyset$
- very high speed solution ..... $5.22 \emptyset$
- Am952ø burst error processor ..... $5.24 \emptyset$
- Microprogramming the controller ..... 5.425

6. Application of the Am29116 for general purpose CPUS ..... $6.1 \varnothing$
A Microprogrammed CPU using Am29116 ..... 6.20
System overview ..... $6.3 \emptyset$
System Organization ..... $6.4 \emptyset$
Instruction Formats ..... 6.60
Timing analysis ..... $6.13 \emptyset$
Pipelining at the Macro level ..... 6.19ø
Comparison with Super-16 ..... $6.27 \emptyset$

- Macro Instruction Execution ..... $6.34 \emptyset$
Comparison for 29ø1-292ø3-29116 solutions ..... $6.43 \varnothing$
Performance Analysis ..... $6.45 \emptyset$

7. Exercises on the Am29116, part 2 ..... 7.10
Exercises (Microprogramming the Am29116) ..... $7.2 \emptyset$
Solutions ..... 7.30

## DAY 2

## CHAPTER 3

Exercises, Part 1

## Exercises - Part 1

True or false:

1. The Am29116 is externally TTL compatible, but uses ECL ciruitry internally.
2. The Am29116 is expandable (i.e. two can be hooked together).
3. The Am29116 is for 8 -bit or 16-bit intelligent controllers.
4. The Am29116 can perform conditional testing on its status register.
5. The barrel shifter rotates 1 to 15 bits up or down in one microcycle.
6. The Am29116 must be used with an Am29ø4.
7. The Am29116 can perform immediate operations.
8. The Am29116 has a choice of four input sources to its data MUX's which in turn provide three ALU inputs, $R, S, U$.
9. The Am29116 can perform three-address instructions.
10. Fast clock speed is synonymous with high throughput.
11. The Am29116 can generate remainders up to 16 bits long from CRC polynomials.

## Exercises - Part 1 (continued)

12. The Am29116 always has its ALU output at $Y_{i}$.
13. The ALU destinations are RAM, ACC, D-Latch.
14. Single-operand instructions are PASS, COMPLEMENT, INCREMENT, and TWO's COMPLEMENT.
15. $D(D E)$ ( $D$ with zero extend) is used for two's-complement arithmetic.
16. " $\bar{R} \quad-\quad$ Dest" calculates one's-complement, and " $\bar{R}+1-->$ Dest" calculates two's complement.
17. The Am29116 can perform NAND, NOR, EXOR in one microcycle.
18. Shift up can use $\emptyset, 1$ or the QLINK bit as input to the LSB.
19. Shift down uses $\emptyset, 1$ or the QLINK bit as the only input choices to the MSB.

2Ø. Rotate operates in byte or word mode.
21. Rotate uses the U-input to the ALU.
22. Load $2^{n}$ causes a mask ( 1 in a field of $\emptyset$ 's) to be generated and can be used for loading RAM, ACC.
23. Read $Y$-bus, change a bit, output to $Y$-bus is possible in one microcycle with the Am29116.
24. If you perform a bit-oriented instruction on the ACC, the destination is the ACC or the RAM.
25. There are 17 possible results when priority encoding a word.
26. Byte-mode prioritizing ignores bits 8-15.

## Exercises - Part 1 (continued)

27. The Am29116 can perform operations on polynomials of degree 16 or less.
28. $95 \%$ of CRC calculations use polynomials with 16 -bit remainders.
29. The CRC calculations can be done in a forward or reverse mode. (i.e. either transmit bit $\varnothing$ first or transmit bit 15 first)
30. CRC remainders can be calculated in byte or word mode.
31. The status word can be loaded from D, RAM, or ACC.
32. The $Z, C, N$ and $O V R$ status bits can be loaded without affecting LINK or the user flags.
33. You can set or reset the entire status word.
34. You can set or reset the ALU flags individually.
35. On the Am29116, you can load both the status register and the ACC in the same microcycle if the RAM is the source.
36. If the status-register enable is high, the status register is "frozen". That is, no operation can alter status.
37. All conditional testing is performed on the stored values in the status register.

## Exercises - Part 1 (continued)

Fill in or answer:
38. How many bits are in the Am29116 status register?
39. How many conditional tests can be made?
40. Can you perform a conditional test during another instruction? If so, how?
41. Can byte operations be performed in either the upper byte or the lower byte of a register?
42. Can the Am29116 support a $1 \varnothing \emptyset$ ns microcycle time?
43. List three possible sources for single operand instructions.
44. Can you load a byte into the D-latch?
45. List three possible source pairs for two-operand instructions.
46. Show the content of this register after a word-mode rotate with $n=2$.

| 15 | 14 | 13 | 12 | 11 | $1 \varnothing$ | 9 | 8 | 7 | 6 | 5 | 4 | 3 | 2 | 1 | $\emptyset$ |
| ---: | ---: | ---: | ---: | ---: | ---: | ---: | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- |
| 1 | 1 | $\emptyset$ | $\emptyset$ | 1 | $\emptyset$ | 1 | $\emptyset$ | $\emptyset$ | $\emptyset$ | 1 | 1 | $\emptyset$ | 1 | $\emptyset$ | 1 |

47. Does the Am29116 allow you to rotate R3 and place the result back into R3?
48. Does the Am29116 allow you to rotate R3 and place the result in R7?
49. If there is no active priority request, what result is produced by a prioritize instruction in word mode?

## Exercises - Part 1 (continued)

5ø. If bit 15 is active, what result is produced by a prioritize instruction in word mode?
51. If bit 15 is active, what result is produced by a prioritize instruction in byte mode?
52. Is it true that the Am29116 is an order of magnitude faster than the Am29ø1 which in turn is an order of magnitude faster than the AmZ8ФФФ for certain controller-oriented operations?
53. Can the Am29116 be used to do multiply? If so, outline the required code.
54. Can the Am29116 be used for bit operations?
55. Can the Am29116 be used for rotate operations?
56. Does the Am29116 have an ALU?
57. Can the Am29116 be used to build a CPU?
58. If a mask bit is zero in a ROTATE-and-MERGE instruction, from which operand is the corresponding bit passed to the destination?
59. If $U=\emptyset \emptyset 11 \quad \varnothing \varnothing \emptyset 1 \quad \emptyset 1 \emptyset 1 \quad \emptyset 11 \emptyset$
$R=1 \phi 1 \phi \quad 1 \phi 1 \phi \quad 1 \phi 1 \phi \quad 1 \phi 1 \phi$
mask $=\varnothing 1 \emptyset 1 \quad 1 \varnothing 1 \varnothing \quad \varnothing 11 \varnothing 1 \varnothing \emptyset 1$
what bit pattern is produced by a word-mode ROTATE-and-MERGE instruction with $n=4$ ?

6Ø. If the highest bit position with a one is position 7 and the mask is $1 \varnothing 1 \emptyset$ HEX, what is the result of a word-mode prioritize instruction?
61. If the highest bit position with a one is position 7 and the mask is $1 \varnothing 1 \varnothing$ HEX, what is the result of a byte-mode prioritize instruction?

## Exercises - Part 1 (continued)

62. What is the result of loading the complement of $2^{n}$ ? What type of instruction allows you to do this?
63. Suppose you want to do a word-rotate down five bit positions. How do you do this on an Am29116?
64. What is QLINK?
65. Can you set and reset the ALU status bits individually as you can with the Am29ø4 micro-status register?
66. Can you set \& reset the LINK and FLAG status bits individually?
67. In byte mode, are all 8 bits of the status register loaded?
68. Which instructions do not cause the ALU status bits to be updated?
69. When are the upper four status bits (LINK, FLAG1, FLAG2, FLAG3) changed?

SOLUTIONS

## Solutions for Exercises - Part 1 (continued)

1. True
2. False
3. True - almost every instruction operates in byte or word mode.
4. True - using either the instruction lines $I_{1}-I_{4}$ or the $T_{1}-T_{4}$ lines.
5. False - the barrel shifter only rotates up 1 to 15 bits. An effective down rotate is achieved by choosing an appropriate number of bits to rotate up such that the result is equivalent to the desired down rotation.
(16-i in word mode; 8-i in byte mode)
6. False - but an Am29ø4 is useful for emulations because of its bit-settable status register.
7. True - requires two microwords and two microcycles.
8. True - ACC, D-Latch, RAM
and the instruction lines (for immediate data).
9. False - the Am29116 can be extended only to a two-register-address structure (using an external MUX, an additional 5-bit microinstruction field and extended timing).
10. False! - recall the problems that occur on branching with a double-pipelined CPU based on Am291ø \& Am29ø1 or Am29ø3 and the need for extra NOP instructions.
11. True
Solutions for Exercises - Part 1 (continued)
12. True if $\overline{O E}_{y}$ is enabled. Otherwise, false.
13. True with reservations - the D-1atch can be used butrequires some tricky timing. You could generate a racecondition if the D-latch is also a source for theoperation. The D-latch is not intended to be adestination, and its use as such is not recommended."Normal" destinations are RAM, ACC or NONE.
14. True. (PASS is another term for MOVE).
15. False - $D(S E)$ ( $D$ sign-extended from bit 7) is used for two's complement arithmetic
16. True
17. True
18. True - shift uses the $\emptyset, 1$ or QLINK. The inputs to the carry MUX should not be confused with the inputs to the shift MUX.
19. False - QC, QN $\because Q O V R$ and QLINK are also available. Do not confuse the use of Q in the Am29116 context (designating status-register contents) with the Q-shifter or Q-register of Am29ø1, Am29ø3 and Am292ø3.
2ø. True
20. True
21. True - also goes to internal Y-bus
22. True - be careful with the timing
23. False - ACC only.
BONR instruction has common source/destination field.
24. True - none and 1 thru 16.
25. True

## Solutions for Exercises - Part 1 (continued)

27. True
28. True - according to an AMD survey.
29. True
30. False - Word mode only. But short polynomials that produce an 8 -bit remainder can be used.
31. True - the status can also be loaded from immediate data as well.
32. True
33. True
34. False - if you need this use an Am29ø4.
35. False - this is the one source which precludes loading both the ACC and the status register.
36. True
37. True

Answers to Fill-in or Answer Section:
38. Eight - $Z, C, N, O V R$, LINK, FLAG1, FLAG2, FLAG3
39. There are 12 condition code test signals -

You can test all 8 status bits individually. You can force a LOW. And you can test 3 combinations: $Z+\bar{C}, N \oplus O V R,(N \oplus O V R)+Z$.
40. Yes - by using the T-bus lines as input. This requires a wider microword to control the T-bus.

## Solutions for Exercises - Part 1 (continued)

41. In byte mode, instructions alter only the lower byte.
42. Yes - Very carefully!

This requires a register between the sequencer and the control store as well as the pipeline register at the output of the control store. 125 ns is more easily achieved.
43. RAM, ACC, D, I, Zero.
44. No.
45. RAM - ACC, RAM - I, D - RAM, D - ACC, ACC - I, D - I
46. Rotate is up only -

$$
11 \varnothing \emptyset 1 \emptyset 1 \varnothing \quad \emptyset \emptyset 11 \quad \emptyset 1 \emptyset 1 \quad n=2 \text { becomes }
$$

$$
\emptyset \emptyset 1 \emptyset \quad 1 \varnothing 0 \emptyset \quad 11 \emptyset 1 \quad \emptyset 111
$$

47. Yes
48. Yes - with an external MUX and care with timing.
49. Zero.
50. One.
51. Bit 15 does not participate in the byte mode.

The result will depend on the content of the lower byte.
52. Yes it is true. Especially when multi-bit rotation, priority determination or CRC remainder calculations are needed.

## Solutions for Exercises - Part 1 (continued)

53. Yes - but it was not optimized for this application.

Consider a $16 \times 16$-bit unsigned multiplication:
i) Initialize the partial product (PP) and a counter to zero.
ii) Increment the counter.
iii) Test the MSB of the multiplier -

If the MSB is a one then: PP+multiplicand $-->P P$
iv) If count equals 16 then END else upshift the PP and the multiplier. (remember to carry from the LSH to the MSH of the PP)
v) Goto ii).

Look at the Am29116 instruction set! We can write a better procedure for the Am29116 than the above.

With the use of the prioritize and the rotate instructions you can reduce the number of microcycles.
(But now the number of microcycles depends on the multiplier!)
An improved algorithm is as follows:
i) Initialize PP and counter to zero.
ii) Prioritize the multiplier \& call the result 'prio'. If (prio $=\emptyset$ ) $O$ R (prio > ( 16 -counter)) then rotate PP by ( 16 -counter) and END.
iii) count+prio $->$ count
iv) Rotate PP and multiplier by prio.
v) PP+multiplicand --> PP
vi) Goto ii).

## Solutions for Exercises - Part 1 (continued)

54. Yes
55. Yes - The barrel shifter works in the byte or word mode.
56. Yes - The Am29116 has a 16-bit three-operand ALU.
57. Yes - but it was not optimized for this application.
58. The R-operand bit is passed to the destination.
59. The result is $1 \varnothing 11 \varnothing \varnothing \emptyset 111 \varnothing \varnothing \varnothing 11$.
60. The result is 9 .

Note: <mask bit $i$ equal to zero> allows participation of the source bit i in the priority determination.
61. The result is one.
62. This clears the $n^{\text {th }}$ bit and sets all of the other bits in the destination. The instruction is: 'BOR1, LDC2NR' or 'BONR, LDC2NA' or 'BONR, LDC2NY'.
63. Rotate up with $n=11$. ... (16-5=11)
example: $\varnothing \varnothing \varnothing \emptyset 1111 \quad \varnothing 1 \varnothing 1 \quad 1 \varnothing 1 \varnothing$
$1101 \varnothing 0 \emptyset \varnothing \quad \emptyset 1111 \varnothing 1 \varnothing$

## Solutions for Exercises - Part 1 (continued)

64. QLINK is the linkage bit for shift operations. It is also used by CRC instructions to bring in each bit of serial data.
65. You cannot set/reset the ALU status bits one by one. If you need this feature, use the Am29ø4 as an additional external status logic.
66. Yes
67. In byte mode, the lower 4 bits (ALU status) on the status register are loaded.
68. NOOP, save-status, test-status or any instruction if either of $\overline{\text { IEN }}$ or $\overline{\text { SRE }}$ are high.
69. The upper four status bits (LINK, FLAG 1, FLAG 2, FLAG 3) are changed during status set/reset; status load (word mode only); plus QLINK is updated after each shift

## CHAPTER 4

Am29116 Bit-Mapped Graphics Controllers

## Am29116 Bit-Mapped Graphics Controllers

The Am29116 has proved to be very popular amongst designers of graphics controllers. This is a natural result of the fact that the Am29116 has a very suitable instruction set for the kinds of algorithms that arise in graphics applications.

We will first discuss this subject in general using the article* by Chu and Miller as a reference. Then we will examine in detail one of the most useful procedures for bit-mapped raster scan graphics, the efficient drawing of a straight vector, and show how this is coded for the Am29116. By taking both an overall look at the requirements of bit-mapped graphics and by following a particular algorithm in detail, we are going to demonstrate why the Am29116 has been so successful as a graphics controller.

[^0]Display Memory (logical)


| $1 \varnothing$-bit | $1 \emptyset$-bit |
| :--- | :--- |
| address | address |


| $Y$ | $X$ |
| :--- | :--- |

row\# column\#


Bit-Map Memory (physical)

16-bit 4-bit address address

| $Y^{\prime}$ | $X^{\prime}$ |
| :--- | :--- |

word\# bit\#

## Am29116 Bit-Mapped Graphics Controllers (continued)

Logical Address to Physical Address Mapping (continued)

The mapping process:


Drawing a Vector on a Two-Dimensional Raster Grid


We would like to find an efficient procedure to draw a straight vector between two points. When given the coordinates of two end points, $P_{A}$ and $P_{B}$, located on a raster grid, the procedure should set the bits in screen memory associated with these end points and with a set of intermediate points closely approximating a straight line. We will describe a procedure that works for lines of unit or less slope that selects the closer of the two candidate points for each increment of ' $x$ '. That is, for each ' $i$ ' in the above figure, the procedure will compare the lengths of ' $s_{i}$ ' and ' $t_{i}$ ' and will select the point associated with the shorter length. This procedure can be generalized to handle lines with slope exceeding unity.

Drawing a Vector on a Two-Dimensional Raster Grid (continued)


- of course at $x=x_{A}$ and $x=x_{B}, Y=\varnothing$

In the above figure we see a successful application of such a procedure. Since the intermediate points are selected from a finite grid, they do not lie precisely on the ideal straight line linking $P_{A}$ with $P_{B}$. A measure of the success of the procedure is the value of the $Y_{i}{ }^{\prime} s$, the differences between the $y$-coordinates of the intermediate points and the corresponding vertical position of the ideal line.

## An Algorithm for Fast Vector Plotting with an Am29116

For a straight line linking two points $P_{A}\left(x_{A}, y_{A}\right)$ and $P_{B}\left(x_{B}, x_{B}\right)$ :

$$
y=y(x)=\frac{\left(y_{B}-y_{A}\right) x+x_{B} y_{A}-x_{A} y_{B}}{x_{B}-x_{A}}
$$

Let us define $d y=\left(y_{B}-y_{A}\right)$ and $d x=\left(x_{B}-x_{A}\right)$. For an arbitrary point $P\left(x_{p}, y_{p}\right)$, not necessarily on this line:

$$
Y_{p}=\left(y_{p}-y\left(x_{p}\right)\right) \cdot d x=-x_{p} \cdot d y+y_{p} \cdot d x-x_{B} y_{A}+x_{A} y_{B}
$$

is proportional to the vertical distance of $P$ from the line. We multiplied the distance by 'dx' to obtain a measure of vertical distance that can be calculated without performing a division. In all further discussion we will use this scaled measure for vertical distances. Since we will apply the same scale factor, 'dx', to all vertical distances, this tactic will not affect our ability to determine which of two points is closer to an ideal line.

A simple vector-drawing procedure is as follows:

1. Plot a point at $x_{A}, y_{A}$.
2. Increment $x$ by one.
3. If $x>x_{B}$ then quit.
4. Calculate $Y$, the vertical distance error, scaled by ' $d x$ ', for two points: $x, y$ and $x, y+1$.
5. If $\operatorname{ABS}(Y(x, y))<\operatorname{ABS}(Y(x, y+1))$ then goto 6 else $y=y+1$.
6. Plot a point at $x, y$.
7. Goto 2.

This procedure may be referred to as a digital differential analyser by analogy with numerical methods for solving differental equations. This procedure has avoided division and multiplication but it still has redundant arithmetic. Two additions are required for the two $Y^{\prime} s$, we may have to perform a two's-complement operation on either or both of the $Y^{\prime}$ 's to produce absolute values and finally a subtraction is required to compare the Y's. We can compress this procedure further.

We have been choosing the next point based on the sign of

$$
d=A B S\left(Y_{10 w e r}\right)-A B S\left(Y_{\text {upper }}\right)
$$

(We will now call 'd' the "discriminant" for our problem).

But in the situation shown below, $Y_{\text {lower }}$ is negative and $Y_{\text {upper }}$ is positive. In this case:

$$
d=-Y_{\text {lower }}-Y_{\text {upper }}
$$

and the absolute-value operation is not required.


Further the $i^{\text {th }}$ discriminant can be obtained efficiently from the (i-1) th discriminant. To discover how, consider the Y's involved:

$$
\begin{aligned}
& Y_{i-1}=-x_{i-1} \cdot d y+\quad y_{i-1} \cdot d x-x_{B} y_{A}+x_{A} y_{B} \\
& Y_{\text {lower }_{i}}=-\left(x_{i-1}+1\right) \cdot d y+\quad y_{i-1} \cdot d x-x_{B} y_{A}+x_{A} y_{B}=y_{i-1}-d y \\
& Y_{\text {upper }}=-\left(x_{i-1}+1\right) \cdot d y+\left(y_{i-1}+1\right) \cdot d x-x_{B} y_{A}+x_{A} y_{B}=y_{i-1}-d y+d x
\end{aligned}
$$

Therefore:

$$
d_{i}=-Y_{\text {lower }}^{i}-Y_{\text {upper }}^{i}=-2 Y_{i-1}+2 \cdot d y-d x
$$

And since $Y_{\emptyset}=\emptyset: \quad d_{1}=2 \cdot d y-d x$

## An Algorithm for Fast Vector Plotting with an Am29116 (continued)

An efficient algorithm for plotting a straight vector from given end points with minimal arithmetic can now be based on the use of a sequence of values of the above discriminant, $d$. We start with $x=x_{A}, y=y_{A}, d_{1}=2 \cdot d y-d x$. For each point we check to see if $x>x_{B}$. If so we terminate the procedure. Otherwise we set the bit corresponding to $x, y$ in the screen memory. Then we check the sign of $d_{i}$ and proceed to select the next point and the next discrimininant, $d_{i+1}$ as follows:

$$
\begin{array}{cc}
\text { If } d_{i} \text { is -ve } & \text { If } d_{i} \text { is +ve } \\
x_{i}=x_{i}+1 & x_{i}=x_{i}+1 \\
y_{i}=y_{i} & y_{i}=y_{i}+1
\end{array} \text { (i.e. select upper point) }
$$

Thus for each point plotted, we require only one addition plus one increment for ' $x$ ' and possibly one increment for ' $y$ '. This improved procedure is Bresenham's ${ }^{1}$ algorithm for a straight line. Variations on this algorithm can be devised to plot circular and elliptical arcs.

Reference: Bresenham, J.E. "Algorithm for Computer Control of a Digital Plotter" IBM Syst. J., Vol. 4, No. 1 (1965), pp.25-3ø

## Am29116 Microcode for Vector Generation

Let us show the code that implements Bresenham's algorithm in two sections. First let us calculate the various quantities needed before the main loop begins.

| We start with: | Reg | Content |
| :---: | :---: | :---: |
|  | $R \not \subset \emptyset$ | ${ }^{x_{A}}$ |
|  | Rø1 | $y_{\text {A }}$ |
|  | $R \emptyset 2$ | ${ }^{\text {B }}$ |
|  | $R \emptyset 3$ | $y_{B}$ |
| We want: | Reg | Content |
|  | $R \not \subset \emptyset$ | ${ }^{x_{A}}$ |
|  | $R \emptyset 1$ | $y_{A}$ |
|  | Rø2 | $d x=($ pixel count) -1 |
|  | Rø3 | incrl $=2 \cdot d y$ |
|  | Rø4 | incr2 $=2 \cdot d y-2 \cdot d x$ |
|  | ACC | $d_{1}=2 \cdot d y-d x$ |

The code required to achieve this result is:

SOR
TOR1
SOR
SOR
SOR
TOR1
SHFTNR
SOR
TOR1
TOR1

| W,MOVE, SORA, RDФ | \& CONT | $\mathrm{x}_{\text {A }}$ | --> AC |  |
| :---: | :---: | :---: | :---: | :---: |
| W, SUBS, TORAA, Rø2 | \& CONT | $d x=x_{B}-x_{A}$ | $\cdots$ AC |  |
| W, MOVE, SOAR, Rø2 | \& CONT | dx | --> $\mathrm{R} \emptyset$ | final dx |
| W, MOVE, SOAR, Rø4 | \& CONT | dx | --> Rø |  |
| W, MOVE, SORA, Rø1 | \& CONT | ; $y_{A}$ | $\cdots$ AC |  |
| W, SUBS, TORAA, Rø3 | \& CONT | ; $\mathrm{dy}^{\text {m }} \mathrm{y}_{\mathrm{B}}-\mathrm{y}_{A}$ | $\rightarrow$ AC |  |
| W, SHA, SHUPZ, NRA | \& CONT | $2 \cdot d y$ | --> AC |  |
| W, MOVE, SOAR, Rø3 | \& CONT | $2 \cdot d y$ | $\rightarrow$ RØ | final incrl |
| W, SUBR, TORAA, RФ2 | \& CONT | $2 \cdot d y-d x$ | --> | final $d_{1}$ |
| W, SUBR, TORAR, Rø4 | \& CONT | $2 \cdot d y-2 \cdot d x$ | --> Rø | final incr2 |

* Note: "SHFTR" is used to multiply by two.

The alternative of adding RAM to RAM is not available.
** Note: The particular implementation of the main loop we are about to discuss requires that ' $d_{1}$ ' be adjusted further.

## Am29116 Microcode for Vector Generation (continued)

Program Flow Chart for Main Loop of Bresenham's Vector Algorithm:


## Am29116 Microcode for Vector Generation (continued)

The main loop section of the code is as follows:

| TOR1 | $\begin{aligned} & \text { W, SUBR, TORAA, Rø3 } \\ & \& \text { CONT } \end{aligned}$ | ; -dx --> ACC (final $d_{1}$ ) |
| :---: | :---: | :---: |
| SOR | W, INC, SORY, Rø2 | ; dx+1 --> Am2910 counter |
|  | \& IFNOT CT16 \& CT LOW \& LDCT JUNK | Forced pass loads counter |
|  | \& JMPI \& OEY | ; via "JUMP INDIRECT" path. |

DNEG: ; d is -ve
YYYY Instruction \#1 of subroutine PLOT ; avoid waste of 1 cycle \& IFNOT CT16 \& CT LOW \& CJS PLOT+1 ; unconditional CALL

TOR W,ADD,TORAA,RØ3
; d+incrl --> d \& RPCT DNEGØ ; Last pt plotted? Jmp if not.

AAAA Any instruction \& IFNOT CT16 \& CT LOW \& CJP BRXIT ; unconditional jmp to exit

DNEGD: SOR W,INC,SORR,RDD
; $x+1$--> $x$
\& IF CT16 \& CT N \& CJP DNEG ; Test sign of d, branch if -ve, ; continue if +ve.

DPOS: ; d is +ve
SOR W, INC, SORR,RØ1 ; y+1 --> y \& IFNOT CT16 \& CT LOW \& CJS PLOT ; unconditional CALL

TOR W,ADD,TORAA,R $\varnothing 4$; d+incr2 $-->d$ \& RPCT DNEG $\emptyset$; Last pt plotted? Jmp if not.

BRXIT: XXXX ; The vector is now completely drawn. Exit from Bresenham.
PLOT: ; Hardware-dependant routine to plot point at RめФ, RØ1
YYYY First instruction of PLOT subroutine \& CONT

PLOT+1:ZZZZ Second instruction of PLOT subroutine
....
....

LLLL Last instruction of PLOT subroutine \& IFNOT CT16 \& CT LOW \& CRTN ; return to caller of PLOT/PLOT+1

## Am29116 Microcode for Vector Generation (continued)

Improving the Vector Algorithm:

The above implementation of Bresenham's algorithm can still be improved. As it stands, the code manipulates the logical addresses, ' $x$ ' and ' $y$ ', and then plots each point by a procedure ("PLOT") that must convert these address coordinates to physical coordinates. It would be better to work with the physical addresses directly and hence shorten "PLOT".

Suppose the physical address consists of a word address,' $W$ ', and a bit address, 'b'. Suppose further that ' $b$ ' is represented by a word, ' $B$ ', that has a single bit set in the bit position corresponding to ' $b$ '. Then the algorithm can be modified as follows:

1. Replace ' $y+1$--> $y$ ' by:

W+(\# of pixels per line) --> W ... usually add $2^{\text {n }}$ via 'BOR2 A2NR'
('b' does not need to be altered)
2. Replace ' $x+1$--> $x$ ' by:

Rotate B up one.
Then if this rotation sets the sign bit, conditionally: 'W+1 --> W'

If the code is altered to manipulate ' $W$ ' and ' $B$ ' in this manner, then the subroutine, "PLOT", receives the address of the word in the screen memory to be modified, along with the bit bit mask with which to set that word. This shortens "PLOT" by at least 4 cycles. The implementation of this improvement is left as an exercise for the reader.

## CHAPTER 5

Intelligent Controllers Based on Am29116

## A Typical Peripheral Controller



BLOCK DIAGRAM OF A TYPICAL PERIPHERAL CONTROLLER

## Peripheral Controllers

The functions of a peripheral controller may include:

- Parallel transmission of data and address information between the host computer and the device being controlled.
- Detection and execution of commands from the host.
- Provision of status information to the host indicating the state of the controller and the controlled device.
- Serial transmission of data to and from the controlled device.
- Generation and testing of status, command and timing bits that coordinate controller/controlled-device interaction.
- Execution of calculations or algorithms related to the control of the peripheral or to processing data from it.

Intelligent Controller
Low Speed Version
Intelligent Controller - Low Speed
Devices Needed for the Minimum Configuration:
To control the Am29116:

- Am291ø sequencer
- Microprogram Memory
- Pipeline Register (could be incorporated in a registeredPROM that contains the microprogram)
To control the sequencer:
- $11 / 2$ octal D-type flip-flops (Am292め's) forming a 12-bit register acting as a latch for the direct input to the Am291ø sequencer
To interface with the host:
- 2 8-bit bidirectional I/0 ports (Am2950's.) connected to the data bus and providing the data path to and from the host
2 DMA address generators (Am294Ф's) to drive the host address bus
- 1 bidirectional I/O port (Am295ø)
interfacing with the host control bus
To interface with the peripheral device:
- 2 bidirectional I/0 ports (Am2950's) conveying status and command signals to and from the peripheral device
(4) serial-to-parallel converter for the serial data stream to and from the peripheral device
- 1 scratchpad or buffer RAM
And, of course the Am29116


## Low Speed Disk Controller



## Intelligent Controller - Low Speed

Brief Description of Some of the Elements

## Am2910 Sequencer

- an address sequencer intended for controlling the sequence of the execution of microinstructions stored in microprogram memory
- capabilities:
- fixed width of 12 address bits
- simple sequential access
- conditional branching to any microinstruction within its 12 -bit $4 \emptyset 96$-word address range
- incorporates a 12-bit counter for loop control
- provides a 5-level stack for microsubroutine linkage
- can accept an address from one of 4 sources:
- its microprogram address counter
- a direct input from an external source
- its internal register
- its stack
- executes 16 instructions as specified by 4 instruction inputs
- has 5 control pins to control branching, loading of its register, incrementing of its program counter and enabling its output bus drivers

```
Intelligent Controller - Low Speed (continued)
```

Brief Description of Some of the Elements (continued)

Am292ø Octal D-Type Flip-Flop
(2) provides eight edge-triggered D-type flip-flops with

- a buffered common clock
- a buffered common clock enable
- a buffered common asynchronous clear input
- three-state output control
- the clear input, $\overline{C L R}$, resets all eight flip-flops independent of all other inputs
- With the three-state output-enable LOW, all eight outputs appear as normal TTL outputs. Otherwise the outputs are in a high impedance state.
- The clock-enable input, $\bar{E}_{2}$ is used to selectively load data into the register. When $E$ is HIGH the register will retain its current data. When $\bar{E}$ is LOW, new data is entered on the LOW-to-HIGH transition of the clock input.


# Intelligent Controller - Low Speed (continued) 

Brief Description of Some of the Elements (continued)

Am2950 Eight-Bit Bidirectional I/0 Port

- designed for use as a parallel data I/O port
- provides 2 back-to-back registers to store data moving in both directions between 2 bidirectional three-state busses
- provides a handshake-flag flip-flop for each data direction to allow coordination of demand-response data transfer:
- Each flag flip-flop is set automatically when a register is loaded.
- Each flag flip-flop has an edge-sensitive clear input.
- provides for each register:
- a clock input
- a clock enable
- a three-state output enable
Intelligent Controller - Low Speed (continued)
Brief Description of Some of the Elements (continued)
The DMA Address Generator - Am2940
- High speed, cascadable, eight-bit wideDirect Memory Access address generator slice
- Generates sequential memory addresses for use in the sequential transfer of data to or from a memory
- Equipped with
- address counter (increment/decrement)
- address register (saves the initial address)
- word counter (increment/decrement)
- word counter register (terminal count)
- three-state address ouput buffers
- 4 control modes
- 8 different instructions (3 instruction pins)
- write/read control register (control mode)
- read word/address counter
- reinitialize counters
- load address/word count
- enable counters
- 3 control pins


# Brief Description of Some of the Elements (continued) 

The Scratch Pad Memory

- Not always necessary for a peripheral controller
- Improves system performance by allowing the host CPU and the peripheral device to respond at different rates.
- May also be useful in improving the execution speed of a controller algorithm by providing quick-access storage for variables or look-up tables. This traffic can thus be kept off of the main bus.


## Intelligent Controller - Low Speed (continued)

## Microword

- 16 instruction bits for the Am29116 that are shared with the 12 data bits to the Am2910 that provide for loading the counter and providing branch addresses
- 1 bit to determine the destination of the above lines (to either the Am29116 or the Am2910). Use the IEN input of the Am29116 to disable it on those microcycles when the data on the instruction lines is intended for the Am2910.
- 4 instruction bits for the Am291ø
- 3 instruction bits for the Am294ø
- Using this technique of sharing microinstruction fields, a microword of 28 bits is possible. Of course, there is a speed penalty caused by the need to halt the Am29116 when data is passed to the sequencer.



# Intelligent Controller - High Speed (continued) 

## Additional Elements

- 2 multiplexers:
- to define different Am29116 source and destination registers
- to select the branch condition for the Am291ø
- Am2925 clock generator and microcycle length controller. For extended timing required by Am29116 two-address operation.
- Am29ø4 status and shift control unit:
- to add more flexibility to the Am29116 status tests
- use only the Am2904 micro status register and its condition code instructions


## Intelligent Controller - High Speed (continued)

## Brief Description of the Additional Elements

## Am2925 Clock Generator and Microcycle Length Controller

- General purpose crystal-controlled clock generator/driver
- Has a microprogrammable clock cycle length:
- provides significant speed-up over fixed clock cycle
- meets a variety of system speed requirements
- Generates four different simultaneous clock output waveforms
- One of eight cycle lengths can be selected by the microprogram
- System control functions include:
- run
- halt
- single-step
- initialize
- ready/wait
- inputs can determine: - where a halt occurs
- the end point timing of wait cycles
(2) Up to 12 pins can be controlled by the microword.


## Intelligent Controller - High Speed (continued)

> Brief Description of the Additional Elements (continued)

Am2904 Status and Shift Control Unit

Designed to perform all the miscellaneous functions which are usually performed in MSI around an ALU

- It contains three nearly independent blocks of logic:
- multiplexer to generate the carry-in
- 4 three-state multiplexers for shift linkage
- 2 status registers for storing carry, overflow, zero and negative status flags. These status registers control a condition-test output via a condition code multiplexer. A wide selection of condition-code test logic is provided.

In our application only the status registers and condition-test logic is used.

## Intelligent Controller - High Speed (continued)

## Brief Description of the Additional Elements (continued)

Am2904 Condition Code Output
$I_{3}-I_{\emptyset} \quad$ Condition Code Output
$\emptyset \emptyset \emptyset \emptyset \quad(N \oplus O V R)+Z$
$\emptyset \emptyset \emptyset 1 \quad \overline{(N \oplus O V R)} \cdot \bar{z}$ *
$\emptyset \emptyset 1 \emptyset \quad$ N*OVR
$\emptyset \emptyset 11 \overline{N \oplus O V R}$ *
$\emptyset 1 \varnothing \emptyset \quad$ Z
$\emptyset 1 \emptyset 1 \quad \bar{Z}$
$\emptyset 11 \varnothing \quad$ OVR
$\emptyset 111$ OVR *
$1 \emptyset \emptyset \emptyset \quad C+Z \quad$ *
$1 \nsupseteq 1 \quad \bar{C} \cdot \bar{Z}$ *
1010 C
$1 \emptyset 11 \quad \bar{C}$
*
$11 \emptyset \emptyset \quad \bar{C}+Z$
$11 \not 1 \quad c+\bar{Z}$ *
$111 \varnothing N$
$1111 \quad \bar{N}$ *

* 9 Condition codes not available at the Am29116 CT-output.


## Reasons for the Better Performance of the Second Solution

- Instruction inputs of the Am29116 and the $D_{\emptyset-11}$ inputs of the Am291ф are driven from separate microcode bits.
- allows simultaneous instruction execution in the Am29116 and direct-address branching in the Am291ø
- requires an additional 12 bits in the microinstruction
- Multiplexer at the CC-input of the Am291D (controlled by 2 bits)
- allows testing of conditions without loading the signals into the Am29116
- T1-4 inputs of the Am29116 driven by 4 additional microword bits
- allows simultaneous testing and execution of an Am29116 instruction
- Use of the Am29ø4 (controlled by 4 bits)
- improved flexibility in status testing
- Multiplexer at the $I_{\emptyset-4}$ inputs of the Am29116 (needs 6 bits)
- allows different source and destination addresses in RAM in the same microcycle
- The Am2925 clock generator/driver
- able to dynamically alter the length of the microcycle

Intelligent Controller
Comparison with an Am29ø1 - Based Solution

## Alternate Means of Obtaining Am29116 Features

## Am29116

- 16-bit data path
- 32 registers
- 16-bit barrel shifter


## Am29ø1-Based Solution

- Use 4 Am29ø1 microprocessor slices and

1 Am2902 carry-look-ahead generator.

- Am29ø1 has 16 registers but is not readily expandable. For additional registers use Am29ø3 or Am292ø3 processor with 4 Am297Ф5 $16 \times 4$-bit register file extenders.
- Use n cycles to shift data by n bits.
- Status register and condition code generator/MUX
- Byte/Word Mode

Am29ø1-Based Solution

- Use 2 Am29ø4 status and shift control units
- One Am29ø4 for the ALU flags ( $C, N, O V R, Z$ ) and condition code generation
- One Am29ø4 for the 3 user-definable flags and the linkage flag
- Use two output enable bits ( $\overline{0 E_{Y}}$ )

Then in byte operations force $\overline{0 E}_{\gamma}$ high for slices 3 and 4.

- Multiplex status lines at slices 2 and 4

Note: This will permit external byte-mode functions. Operations on internal registers will still alter all 16-bits. You should choose Am292ø3 rather than Am29ø1 if Byte/Word operations are important. When used with a status-flag MUX, Am292ø3 fully supports Byte/Word operations.

## Am29ø1-Based Solution

- Built-in 16-bit priority encoder
- Masking capability of the ALU
- Bit oriented instructions
- CRC instructions

Use two 8-bit Am25LS2513 priority encoders

- Use three microcycles:
- mask the first operand
- mask the second operand
- operate on the masked operands
- Emulate with several microinstructions
- Can be emulated by a lengthy microcode sequence. Will execute too slowly for many applications.
Intelligent Controller - Using Am29ø1
Brief Description of the Additional Devices
Am29ø1 4-bit Microprocessor Slice
- High-speed cascadable element for use in
- CPU's
- peripheral controllers
- programmable microprocessors
- numerous other applications
- Consists of:
- a 16-word by 4-bit two-port RAM
- a high-speed 8-function ALU
plus shifting, decoding and multiplexing sections
Cascadable with either:
- simple ripple carry propagation
- the Am29ø2 look-ahead carry generator
- Produces 4 status flags (N, OVR, C, Z)
- accepts 9-bit microinstructions:
- 3 bits select the ALU operand source
- 3 bits select the ALU function
- 3 bits select the ALU destination
- accepts 8 bits to select the two RAM addresses
- accepts 2 control bits ( $\overline{O E}$ and carry-in)


## Brief Description of the Additional Devices

## Am2902 High-Speed Look-Ahead Carry Generator

- Provides anticipated carries across a group of four binary ALU's
- Accepts up to 4 pairs of carry Propagate and carry Generate signals from an ALU and one carry input.

With ALU operands $A$ and $B$ and an addition operation:

$$
\begin{array}{ll}
P=\overline{P_{3} \cdot P_{2} \cdot P_{1} \cdot P_{\emptyset}} & G=\overline{G_{3}+P_{3} \cdot G_{2}+P_{3} \cdot P_{2} \cdot G_{1}+P_{3} \cdot P_{2} \cdot P_{1} \cdot G_{\emptyset}} \\
P_{\emptyset}=A_{\emptyset}+B_{\emptyset} & G_{\emptyset}=A_{\emptyset} \cdot B_{\emptyset} \\
P_{1}=A_{1}+B_{1} & G_{1}=A_{1} \cdot B_{1} \\
P_{2}=A_{2}+B_{2} & G_{2}=A_{2} \cdot B_{2} \\
P_{3}=A_{3}+B_{3} & G_{3}=A_{3} \cdot B_{3}
\end{array}
$$

- Also provides carry-propagate and carry-generate signals to use for further levels of look-ahead.
- Logic results provided at the outputs are:

$$
\begin{aligned}
& C_{n+x}=\overline{G_{\emptyset}+P_{\emptyset} \cdot C_{n}} \\
& C_{n+y}=\overline{G_{1}+P_{1} \cdot G_{\emptyset}+P_{1} \cdot P_{\emptyset} \cdot C_{n}} \\
& C_{n+z}=\overline{G_{2}+P_{2} \cdot G_{1}+P_{2} \cdot P_{1} \cdot G_{\emptyset}+P_{2} \cdot P_{1} \cdot P_{\emptyset} \cdot C_{n}} \\
& \bar{G}=G_{3}+P_{3} \cdot G_{2}+P_{3} \cdot P_{2} \cdot G_{1}+P_{3} \cdot P_{2} \cdot P_{1} \cdot G_{\emptyset} \\
& \bar{P}= \\
& =P_{3} \cdot P_{2} \cdot P_{1} \cdot P_{\emptyset}
\end{aligned}
$$

## Intelligent Controller - Using Am29ø1

Brief Description of the Additional Devices

## Am25LS2513 Three-State Priority Encoder

- Encodes eight lines to three-line binary
- Three-state outputs
- controlled by three active LOW and two active HIGH inputs
- Cascadable
- provides an input enable and an output enable to permit cascading without additional circuitry


## Building a 16-bit Priority Encoder from 2 AM25LS2513



Intelligent Controller - Using Am2901

# Thus we have seen that it is rather hard to build a controller with the same features as the AM29116 from parts with a lower level of integration. 

On the next page we see a simple disk controller based on a pair of Am29ø1A's.


MPR-430

```
The Microword for the 2901 - Based Solution
- 48 bits wide
    . 19 bits Am2901 instruction (M47 - M37, M35 - M28)
    - 3 bits function select (FCN)
    - 3 bits source (SRC)
    - 3 bits destination (DST)
    - 4 bits A port register address
    - 4 bits B port register address
    -1 bit carry in ( (CN)
    - 1 bit output enable (OE)
    . 8 bits M-bus control (M27 - M20)
    - 4 bits bus source (BUS SRC)
    - 4 bits bus destination (BUS DST)
    . 11 bits sequencer (M36, M17 - M8)
    - 4 bits Am2911 instruction (SEQ INSTR)
    - 4 bits condition code select (TEST)
    - 1 bit polarity of CC (POL)
    - 2 bits to determine the page of microprogram memory
    . 8 bits data (M7 - M0)
    . 2 bits additional control (M19, M18)
    - 1 bit increments MAR (INC MA)
    - 1 bit initiate data transfer (TRAN)
```


## Intelligent Controller - Very High Speed

## Some General Considerations

- Interface-signal names, polarities and functions used here are similar to those used in the current ANSI standard for hard-disk drives.
- The methods and functions discussed here can be used for most current hard- or flexible-disk drives.
- With minimal external logic, this controller uses an Am29116 and an Am952ø burst error processor to perform all of the functions needed to:
- write and read at 30 Mbits per second
- format a disk

Including:

- searching a track for a specific header and sector
- managing data flow through a high-speed buffer memory
- generating modified Fire-code check bits while writing
- detecting and correcting single and burst errors on reading
- generating and checking of CRC's in sector headers


## Intelligent Controller - Very High Speed (continued)

## Am952ф Burst Error Processor

Distinctive Characteristics:

- Provides for detection and correction of burst errors:
- detects errors in serial-data blocks up to 585 K bits long
- allows correction of error bursts of up to 12 bits long
- Effective data rates up to $2 \emptyset$ Mbits per second:
- fast enough for high-performance hard disk systems
- Four selectable industry-standard polynomials:
- popular IBM 56- and 48-bit polynomials
- also 35-bit and 32-bit polynomials
- Three correction algorithms provide flexibility:
- full-period clock-around method compatible with current practice
- Chinese Remainder Theorem method reduces correction time by orders of magnitude
- reciprocal polynomial method for correction with the 48-bit IBM code
- Designed for use in disk controllers and communication systems based on fixed-instruction-set or microprogrammed processors.



# Intelligent Controller - Very High Speed (continued) 

Am952め Burst Error Processor (continued)

Functional Description:

- Register Array
- consists of 56 flip-flops used for
- check-bit computation during write operations
- syndrome computation during read operations
- error pattern extraction during error correction operations
- Polynomial-Divide Matrix:
- establishes interconnections and feedback for a group of shift registers such that an entire byte of data is divided in a parallel operation by the selected polynomial
- the matrix is controlled by 2 Polynomial Select and 2 Function Select inputs
- the data is presented to the matrix a byte at a time on 8 data lines
- When correction operations are complete, the error pattern is available on 12 outputs:
- eight bits on the $Q_{\emptyset}-Q_{7}$ outputs
- four bits on the $L P_{\emptyset}-L P_{3}$ outputs


## Intelligent Controller - Very High Speed (continued)

## Am952ø Burst Error Processor (cont inued)

- Write:
- While the data is being written, the Am952ø is in the Compute-Check-Bits mode, calculating the polynomial remainder without affecting the flow of data to the disk
- After the last data byte, the Am952ø is switched into the Write-Check-Bits mode, outputting the $4,5,6$ or 7 check bytes.
- These check bytes constitute additional information to be appended to the data stream to allow detection and correction of errors on reading.
- Read:
- Two modes are available when reading: Normal and High Speed
- The two modes use different correction algorithms.
- After the last information byte has been read, the state of the ER output signal indicates whether an error has occured.


# Intelligent Controller - Very High Speed (continued) 

Am952ø Burst Error Processor (continued)

- Correction:
- After the read operation, the syndrome in the register array contains information specifying:
- the location of the error
- the bit pattern of the error
- Normal Correction Mode:
- The error location is found by counting the number of clock pulses required to make the EP output go HIGH.
- The error pattern available on $L P_{\emptyset}-L P_{3}$ and $Q_{\emptyset}-Q_{7}$ can be Exclusive ORed with the data to effect the correction.
* High-Speed Correction Mode:
- The error location is found by counting the number of clock pulses required to generate an indicator for each of the 2 or 4 factors of the polynomial.
- The error pattern available on $L P_{\varnothing}-L P_{3}$ and $Q_{\emptyset}-Q_{7}$ can be Exclusive ORed with the data to effect the correction.
- While there are more steps to the high-speed-correction procedure than are required by the "Normal" procedure, the correction is accomplished far faster.
- The high-speed correction method is not available for the 48-bit polynomial.
- For the 56-, 35- and 32-bit polynomials, the high-speed method should be preferred over the "normal method".


## Intelligent Controller - Very High Speed (continued)

Am9520 Burst Error Processor (continued)
Computing Check Bits:

- The Polynomial Divide Matrix and the Register Array implement the familiar serial form of feedback shift register arrangement in an 8-bit parallel form.
(3) The $S_{1}$ and $S_{\emptyset}$ inputs select one of 4 polynomials:

| Polynomial | Degree \& Number of Check Bits | Period <br> (Bits) | Correctable Burst Error Length (Bits) |
| :---: | :---: | :---: | :---: |
| $\begin{aligned} & \left(x^{22}+1\right) \cdot\left(x^{11}+x x^{7}+x^{6}+x+1\right) \\ & \left(x^{12}+x^{11}+x^{10_{+}}\right. \\ & \left(x^{11}+x^{9}+x^{7}+x^{6}+x^{\circ}+x+1\right) \end{aligned}$ | 56 | 585,442 | 11 |
| $\left(x^{21}+1\right) \cdot\left(x^{11}+x^{2}+1\right)$ | 32 | 42,987 | 11 |
| $\left(x^{23}+1\right) \cdot\left(x^{12}+x^{11}+x^{8}+x^{7}+x^{3}+x+1\right)$ | 35 | 94,185 | 12 |
| $\left(x^{13}+1\right) \cdot\left(x^{35}+x^{23}+x^{8}+x^{2}+1\right)$ | 48 | $\begin{gathered} 13 \cdot\left(2^{35}-1\right) \\ =4.466 \ldots \times 1 \emptyset^{11} \end{gathered}$ | 7 |

- When the last data byte has been read or written, the Register Array contains the check bits.
- Remember to select the same polynomial for reading as was used for writing.


## Intelligent Controller - Very High Speed (continued)

Am952ø Burst Error Processor (continued)

Computing Check Bits (continued)

- Sequence of events to compute the check bits
i) The clock input, $C P$, should be in the quiescent HIGH state.
ii) Initialize by activating the master reset input, $\overline{M R}$, LOW and return it to HIGH.
iii) Specify the desired polynomial via $S_{\emptyset}, S_{1}$.

Apply zeroes to $C_{2}-C_{\emptyset}$ to select Compute-Check-Bits mode.
iv) Establish a byte of data on $D_{\varnothing}-D_{7}$ inputs.
v) Make the clock input go LOW then HIGH.
vi) Repeat from step iv) until all data bytes are entered.

# Intelligent Controller - Very High Speed (continued) 

## Am952ф Burst Error Processor (continued)

## Write Check Bits:

- When the Write-Check-Bits mode is established, the feedback paths of the register array are disabled and the check bits may be shifted out.
- Checkbits are available on the $Q_{\varnothing}-Q_{7}$ outputs one byte at a time.
- Sequence of events to obtain the check bits:
i) The clock input, $C P$, should be in the quiescent HIGH state.
ii) Select the polynomial via the $S_{\emptyset}, S_{1}$ inputs.
iii) Force $C_{2}, C_{1}, C_{\emptyset}$ to LOW LOW HIGH (Write Check Bits).
iv) After a propagation delay the $Q_{\emptyset}-Q_{7}$ outputs will contain the, first check byte.
v) Make the clock input go LOW then HIGH. The next check byte will be available on the $Q_{\varnothing}-Q_{7}$ outputs.
vi) Repeat from step v) until all check bytes are read out.


# Intelligent Controller - Very High Speed (continued) 

Am952め Burst Error Processor (continued)

Read - General:

- The input stream (data and check bytes) is divided by the selected polynomial to obtain the syndrome.
- A non-zero syndrome indicates an error has been detected. If the syndrome is not zero the ER output will be HIGH.
- Two methods for error correction are available:
- full-period clock-around ("Normal")
- Chinese Remainder Theorem ("High Speed Method")
- There is a different read procedure for each of these methods:
- Read Normal: produces one syndrome
- Read High Speed: produces as many syndromes as the number of factors in the polynomial.

The input stream is simultaneously divided by all of the factors of the polynomial. The ER output indicates whether or not all syndromes are zero.

- Read High Speed is not available for the 48-bit polynomial


## Intelligent Controller - Very High Speed (continued)

## Am952ø Burst Error Processor (continued)

## Read Normal:

- Sequence of events for Read Normal:
i) The clock input should be in the quiescent HIGH state.
ii) Initialize by pulling the master reset, $\overline{M R}$, LOW and then returning it to HIGH.
iii) Select the required polynomial via $S_{\phi}, S_{1}$ inputs.
iv) Apply LHL to $C_{2}, C_{1}, C_{\emptyset}$ to select Read Normal.
v) Present a byte of information as read from the disk to the $\mathrm{D}_{\varnothing}-\mathrm{D}_{7}$ inputs.
vi) Make the clock go LOW then HIGH.
vii) Repeat from step v) until the last check byte read from the disk is processed.
viii) Test the ER output:
- ER HIGH: an error has been detected.
- ER LOW: no error has been detected.

```
Am952\emptyset Burst Error Processor (continued)
```

High Speed Read:

- Sequence of events for Read High Speed:
i) The clock input, $C P$, should be in the quiescent HIGH state.
ii) Specify the polynomial via the $S_{\emptyset}, S_{1}$ inputs.
iii) Apply LHH to $C_{2}, C_{1}, C_{\emptyset}$ to select Read High Speed.
iv) Initialize by pulling the master reset, $\overline{M R}$, LOW and then returning it to HIGH.
v) Present a byte as read from the disk to the $D_{\emptyset}-D_{7}$ inputs.
vi) Make the clock go LOW then HIGH.
vii) Repeat from step v) until the last check byte has been read.
viii) Test the ER output:
- ER HIGH: an error has been detected.
- ER LOW: no error has been detected.


## Intelligent Controller - Very High Speed (continued)

## Am952ø Burst Error Processor (continued)

## Function Select Codes

| $C_{2}$ | $C_{1}$ | $C_{\emptyset}$ | Function |
| :--- | :--- | :--- | :--- |
| L | L | L | Compute check bits |
| L | L | H | Write check bits |
| L | H | L | Read normal |
| L | H | H | Read high speed |
| $H$ | L | L | Load |
| $H$ | L | H | Reserved |
| $H$ | $H$ | L | Correct normal |
| $H$ | $H$ | $H$ | Correct high speed |

# Intelligent Controller - Very High Speed (continued) 

## Am952ø Burst Error Processor (continued)

Correct Normal:

- The Am952ø manipulates the syndrome to yield:
- error pattern (at $Q_{\emptyset}-Q_{7}$ and $L P_{\emptyset}-L P_{3}$ outputs)
- error location (needs further external computation)
- Syndrome is repeatedly divided by the polynomial until the error pattern is located:
- done by repeatedly clocking without regard to the $D_{\varnothing}-D_{7}$ inputs until EP goes HIGH
- If the AE output goes HIGH while the EP output remains LOW, an alignment exception has been detected.
- Count clock cycles (\#C) until EP goes HIGH.
- If \#C > (period of polynomial) then the error is uncorrectable.

Intelligent Controller - Very High Speed (continued)

Am952ø Burst Error Processor (continued)

Correct Normal: (continued)

- Use two external counters (R1, R2)
- RI counts the number of cycles until AE goes HIGH
- R2 counts the number of cycles from AE going HIGH to EP going HIGH. (= $\varnothing$ if no alignment exception)
- If R1+R2 > (period of polynomial) then the error is uncorrectable.
- ( $N \cdot K-8 \cdot R 1-R 2$ ) gives the location of the first bit in the error burst counting from the last check bit of the record for the 56-bit and 32-bit polynomials.
( $N \cdot K-8 \cdot R 1-R 2+5$ ) gives the location of the first bit in the error burst counting from the last check bit of the record for the 35 -bit polynomial.
where:
$K$ is the smallest +ve integer that makes this expression +ve. and $N$ is the period of the polynomial.

Note: The 48-bit polynomial uses another correction algorithm. See specification sheet for details.

# Am9520 Burst Error Processor (continued) 

Burst Error Processor (continued)
Correct High Speed:

- This mode allows you to determine the error pattern in far fewer clock cycles than does the "Normal" mode.
- A polynomial with $m$ factors with periods $P_{1}, P_{2}, \ldots, P_{m}$ will correct in no more than the following number of of clock cycles:
- In Normal mode: $\quad P_{1} \cdot P_{2} \cdot \ldots \cdot P_{m}$ i.e. the product of the $P^{\prime} s$
- In High Speed mode: $P_{1}+P_{2}+\ldots+P_{m}$ i.e. the sum of the $P^{\prime} s$
- Number of syndromes equals the number of factors of the polynomial.
- Refer to the table a few pages back which shows the factorization of the polynomials.

As we have written the factors, the first factor has a special signifigance. The degree of this first factor determines the maximum length of an error burst that is still correctable.

## Intelligent Controller - Very High Speed (continued)

## Am952ø Burst Error Processor (continued)

- The error location is given by:
$L=N \cdot K-\left(A 1 M 1+A 2 M 2+\ldots+A_{m} M_{m}\right)$

Where $N$ is the natural period of the polynomial.
K is the smallest integer that makes $L$ positive.
$M_{i}$ are the numbers of clock cycles required to match the error pattern of each factor.
and $A_{i}$ are the Chinese Remainder Theorem coefficients:

| Polynomials | A1 | A2 | A3 | A4 |
| :--- | :---: | :---: | :---: | :---: |
| 56 -bit | 452,387 | 578,864 | $2,521,9 \varnothing 4$ | $2,647,216$ |
| 32-bit | 311,144 | $32,76 \emptyset$ | -- | -- |
| 35-bit | $32,76 \emptyset$ | $72 \emptyset, 728$ | -- | - |

# Intelligent Controller - Very High Speed (continued) 

Am952ø Burst Error Processor (cont inued)

Correct High Speed (continued)

- To determine $M_{i}$ and the error pattern use:
- $P_{\phi}-P_{3}$ inputs to select the register section to be clocked
- $\mathrm{PM}_{2}-\mathrm{PM}_{4}$, the "pattern match" outputs:

When $\mathrm{PM}_{\mathbf{i}}$ is HIGH then the syndrome ${ }_{j}$ shows a match with the error pattern.
( For M1: use the polynomial-shift controls to select register-section \#1 ( $P_{\phi}, P_{1}, P_{2}, P_{3}$ - HLLL).

Use the same procedure as in the correct normal mode.

- If R1+R2 > (period of factor \#1) then error is uncorrectable.
- If error is correctable then $M 1=8 R 1+R 2$
- The $M_{i}$ will be determined after M1 but slightly differently:
- Select the $i^{\text {th }}$ register section via $P_{3}-P_{\emptyset}$
- While $P M_{j}$ is LOW clock the $952 \emptyset$ and increment a counter
- If the count exceeds the period of this factor the error is not correctable.
- When EP and all $\mathrm{PM}_{\mathrm{i}}$ 's associated with this polynomial are HIGH, then the error pattern and error location are determined.


# Intelligent Controller - Very High Speed (continued) 

Am952ø Burst Error Processor (continued)

Am952ø Polynomial Periods

| Polynomial | Period <br> Factor 1 | Period <br> Factor 2 | Period <br> Factor 3 | Period <br> Factor 4 | Composite <br> Period (N) |
| :---: | :---: | :---: | :---: | :---: | :---: |
| 56 -bit | 22 | 13 | 89 | 23 | 585,442 |
| 32 -bit | 21 | $2 \emptyset 47$ | -- | -- | 42,987 |
| 35 -bit | 23 | $4 \varnothing 95$ | -- | -- | 94,185 |

Note: The 48-bit polynomial requires the use of a different correction procedure and is not shown here.

Error Pattern Format for 56-Bit, 35-Bit and 32-Bit Polynomials


## Data Path for Very High Speed Controller



# Intelligent Controller - Very High Speed (continued) 

## System Organisation:

- Interface to the disk drives:
- bit-serial data paths for read and write
- byte-parallel paths for
- commands
. disk addresses
. disk status
- Am291ø sequencer and Am27535 registered PROM microprogram memory drives 76 -bit control bus:
- Data flow is:
- asynchronous \& serial at $3 \varnothing$ Mbits/sec from drive to a FIFO array
- 16-bit clocked \& parallel from the FIF0 array to the buffer memory at $2 \emptyset$ Mbits per second via the internal data bus.
- The BEP is located on the internal data bus.
- Disk read:
- Data is read into the FIFO array at 30 Mbit per second.
- Concurrently, the data is transferred from the buffer memory to the BEP at a rate of 15 MHz .
- Disk write:
- The BEP pre-calculates the check bits before the write. There is not enough time to overlap BEP/buffer and buffer/FIFO transfers.


## Intelligent Controller - Very High Speed (continued)

## Microinstruction Format

- 80-bit wide microinstruction
- This is not a minimum width, but it demonstrates microcoding in a straightforward manner.
- Sample microcode for uncompressed sector read/write operations is available including:
- header and sector searching
- error checking of the header via Am29116 CRC instructions
- error checking and correction of the data segments via Am952ø and its 56-bit modified Fire code polynomial
- The sector input/output microroutine (SECTIO) performs input or output of a single 256 -byte sector.

| Microinstruction for the Very High Speed Controller |  |  |  |
| :---: | :---: | :---: | :---: |
| Microinstruction Field |  |  |  |
| Bits | Width | Mnemonic | Comment |
| 79－64 | 16 | I $15-\mathrm{I} \varnothing$ | Am29116 Instruction |
| 63－60 | 4 | $\mathrm{T}_{4} \mathrm{~T}_{1}$ | Am29116 Conditional Test Select |
| 59 | 1 | $\overline{\text { SRE }}$ | Am29116 Status Register Enable |
| 58 | 1 | $\overline{0 E}_{Y}$ | Am29116 Output Enable Y－Bus |
| 57 | 1 | $\overline{\text { IEN }}$ | Am29116 Instruction Enable |
| 56 | 1 | DLE | Am29116 Data Latch Enable |
| 55－52 | 4 | $I_{3}-I_{\emptyset}$ | Am291ø Instruction |
| 51－42 | 10 | $D_{9}-D_{\emptyset}$ | Am2910 Direct Input |
| 41－36 | 6 | －－－ | Condition－Code MUX Selection \＆Polarity |
| 35 | 1 | $\overline{\text { ADMC }}$ | Address Mark Control |
| 34 | 1 | $\overline{B F C B}$ | （Enable Memory）Bus from（Disk Drive）Control Bus |
| 33 | 1 | $\overline{\text { BFTP }}$ | （Enable Memory）Bus from Translate PROM <br> （Translate PROM needed in data compression operations） |
| 32 | 1 | $\overline{\text { BFø }}$ | （Enable Memory）Bus from 94ø3A FIFO array |

## Microinstruction for the Very High Speed Controller (continued)

| Microinstruction Field |  |  |  |
| :---: | :---: | :---: | :---: |
| Bits | Width | Mnemonic | Comment |
| 31 | 1 | $\overline{\text { BF16 }}$ | (Enable Memory) Bus from Am29116 Y-Bus |
| $3 \varnothing$ | 1 | $\overline{\text { BF2L }}$ | (Enable Memory) Bus Lower Byte from Am952ø Q-Bus |
| 29 | 1 | BF2U | (Enable Memory) Bus Upper Byte from Am952ø Q-Bus |
| 28 | 1 | $\overline{\text { BOUT }}$ | Bus Direction OUT |
| 27 | 1 | BTФ3 | (Enable Memory) Bus to 94D3A FIFO Array |
| 26 | 1 | $\overline{\text { BT16 }}$ | (Enable Memory) Bus to Am29116 Y-Bus |
| 25 | 1 | $\overline{\text { BT2L }}$ | (Enable Memory) Bus Lower Byte to Am952ø D-Bus |
| 24 | 1 | $\overline{\text { BT2U }}$ | (Enable Memory) Bus Upper Byte to Am952¢ D-Bus |
| 23 | 1 | $\overline{\text { BT2 }}$ | (Enable Memory) Bus to Am9520 REP, P3-PФ, C2-CD |
| 22 | 1 | $\overline{\text { CE2L }}$ | Clock Enable Am952ø to Lower-Byte Bus Interface Register |
| 21 | 1 | $\overline{\text { CE2D }}$ | Clock Enable Memory Bus to Am952¢ Interface Registers |
| $2 \emptyset$ | 1 | CP2¢ | Clock Pulse for 952¢ (Microcoded Waveform) |
| 19 | 1 | $\overline{\text { CREQ }}$ | Command Request |
| 18 | 1 | $\overline{\text { INPT }}$ | (Enable Serial Data) Input to 94ø3A FIFO Array |

Microinstruction for the Very High Speed Controller (continued)
Microinstruction Field

| Bits | Width | Mnemonic | Comment |
| :---: | :---: | :---: | :---: |
| 17-16 | 2 | $\overline{\text { JMPI }}$ | (Enable) Jump Indirect Am29116 Y |
| 15 | 1 | $\overline{\text { MADR }}$ | (Enable Loading of Buffer) Memory Address Re |
| 14 | 1 | $\overline{\text { MREA }}$ | (Enable Buffer) Memory Read |
| 13 | 1 | MWRT | (Enable) Memory Write Operation |
| 12 | 1 | $\overline{\text { OUPT }}$ | (Enable Serial Data) Output from 94Ø3A FIFO array |
| 11 | 1 | $\overline{\text { PENB }}$ | Parameter Enable |
| $1 \varnothing$ | 1 | $\overline{\text { PFPM }}$ | (Enable Setting of Am952ø) P Bits from Am952ø PM Bi |
| 9 | 1 | РFФ3 | (Enable) Parallel Fetch from 9403A FIFO array |
| 8 | 1 | $\overline{\text { PL®3 }}$ | (Enable) Parallel Load of 9403A FIF0 array |
| 7 | 1 | $\overline{\text { PREQ }}$ | Parameter Request |
| 6 | 1 | $\overline{\text { RDGA }}$ | Read Gate |
| 5 | 1 | $\overline{\mathrm{RFIF}}$ | Reset 94Ø3A FIF0 array |
| 4 | 1 | $\overline{\text { SAST }}$ | Select/Attention Strobe |
| 3 | 1 | $\overline{\text { WRGA }}$ | Write Gate |
| 2-ø | 3 | XLAT | Translate Table Select for Data Compression PROM |

# Intelligent Controller - Very High Speed (continued) 

## Conclusion

What makes this controller so fast?

- The Am29116 microprocessor has been combined with the Am952ø burst error processor.
- This provides the powerful Am29116 instruction set and the very effective hardware elements of the Am29116:
. its CRC logic
- 32 registers
- barrel shifter
- priority encoder
- The Am952ø is a relevant specialized device which:
- generates the checksum
- checks the data together with the checksum and produces both the error pattern and error location
- is very fast (due to its special hardware).

You can calculate a CRC remainder -
on an Am952ø at $5 \not 0 \mathrm{nsec}$ per data byte. on an Am29116 at $2 \not \emptyset$ nsec per data bit.
(The Am952ø is $32 x$ faster than the Am29116!).

- The Am952ø is unique in supporting the very fast Chinese Remainder Theorem method that greatly speeds the correction of a faulty sector.

This method is fully supported by hardware for use while reading and in the correction calculation itself.

# Intelligent Controller - Very High Speed (continued) 

## Conclusion (continued)

- We can write special microcode: for example for packing ASCII fields.
This is not possible with a fixed-instruction-set processor or a specialized LSI disk-controller chip.
- We are making use of the parallel hardware of the $952 \emptyset$. For example, the Am952ø can perform up to four simultaneous polynomial division operations and produce four independant syndromes concurrently.
- We are concurrently reading from the disk and generating the polynomial remainders for error detection and correction.
(We were not able to support checkbit calculation in parallel with writing, however).
- We have used a FIFO array to allow us to read a data stream that is faster than even an Am952ø can check it.
- We have incorporated a buffer memory that:
- holds images of the last eight sectors read from or written to disk
- holds I/ $\varnothing$ request queues to maximize throughput
- holds additional house keeping tables
- We have used a PROM to translate from EBCDIC data to packed-BCD or ASCII

Application of Am29116 to General Purpose CPUs

## A Microprogrammed CPU Using Am29116

## Introduction and System Overview

- Am29116 is optimized for peripheral controller applications
- However, Am29116 is also an ideal choice for CPU's as well.
- it has a powerful instruction set for:
. arithmetic operations
. data movement
. multiple-bit shifts
- bit manipulation
- status manipulation
- it has high speed (1め nsec cycle time)
- it can reduce power requirements
- it save area on PC boards
- We will describe a CPU built with the Am29116 which maintains architectural and software compatibility with the Super-16.*
- This CPU incorporates pipelining at
- the microprogram level
- the macroinstruction level
* As described in detail in
"Bitslice Microprocessor Design" by Mick and Brick, Chapter 9


## Central Processing Unit Block Diagram



This is a simple system comprised of:

- 16-bits wide main memory built from static RAM chips
- Am29116 processor and CCU
- A simple bus structure:
- can be modified to accomodate interface signals
- but to add other I/ $\varnothing$ devices, a bus controller is needed

- handshaking over three busses
- 16-bit-wide address bus
- 16-bit-wide bidirectional data bus
- 7-bit-wide control bus
- memory request
- read/write
- address accepted
- data strobe
- data synch
- interrupt control lines



# A Microprogrammed CPU Using Am29116 (continued) 

## Memory Read (continued)

- To use the data in the $n+1^{\text {th }}$ cycle, the Am29116 generates the main memory address during the $(n-1)^{\text {th }}$ cycle.
- Data is read during the $n^{\text {th }}$ cycle.
- The $n^{\text {th }}$ cycle must be stretched to accomodate main-memory READ timing.
- The signal to stretch the $n^{\text {th }}$ cycle is provided to the Am2925 clock generator during the $(n-1)^{\text {th }}$ cycle.



## A Microprogrammed CPU Using Am29116 (continued)

## Instruction Formats

(same as for Super-16)

- One-word instructions (16 bits):
- register to register RR
$15 \quad 8743 \quad \emptyset$

| $0 P$ | $R 1$ | $R 2$ |
| :--- | :--- | :--- |

- register storage

RS

| OP | R 1 | $\times 2$ |
| :--- | :--- | :--- |

- storage to storage SS

| $O P$ | $X 1$ | $X 2$ |
| :--- | :--- | :--- |

- Two-word instructions (32 bits):
$15 \quad \emptyset 15$
$\emptyset$
- register indexed storage $\quad R X$

| $O P$ | $R 1$ | $X 2$ | $d$ |
| :--- | :--- | :--- | :--- |

- register storage immediate RX

| $O P$ | $R 1$ | $X 2$ | $d$ |
| :--- | :--- | :--- | :--- |

4-bit register address defines one of 16 registers:

- uses lower half ( $\mathrm{R} \emptyset$ - R15) of the 32 registers in the Am29116 as user registers
- upper half (R16 - R31) used by the operating system (stack pointer, counter, etc.)


## A Microprogrammed CPU Using Am29116 (continued)

Instruction Formats (continued)

- 8-bit opcode specifies 256 instructions
- includes information about the addressing mode
- PUSH/POP operations
- I/0 instructions
- decimal and binary integer arithmetic
- Data types
- bit- nibble
- byte
- word

A Microprogrammed CPU Using Am29116 (continued)
Central Processing Unit


## A Microprogrammed CPU Using Am29116 (continued)

CPU - Architecture

- Internal data transfers use 16-bit wide internal bus
- Data is transferred between the system bus and the internal CPU bus on three paths:
- data register
- address register
- instruction lookahead register (Z-latch)
- Pipelining
- microlevel: pipeline register at the output of the microprogram memory
(for overlapped instruction fetch and execution)
- macrolevel: instruction register (IR) and Z-latch
. decodes the macroinstruction in the IR
- next macroinstruction, displacement field or data in the Z-latch
- Macroinstruction can move from main memory to
- the $Z$-latch or
- immediately to the IR (by making the Z-latch transparent)
- Load IR directiy:
- during pipeline-fill operation
- on instruction after a two-word instruction
- Decoding of (IR) determines the meaning of ( $Z$ ):
- next instruction if (IR) is a RR, RS or SS instruction
- displacement if (IR) is a RX instruction:

A displacement is moved into the Am29116 via its Y -bus in the cycle in which the operand address is formed.

- immediate data (used in the execution cycle)
- Am29116 can input and output data in the same microcycle:
- data passes into the D-latch in $1^{\text {st }}$ half of cycle
- In the second half of the cycle, the D-latch is disabled \& the ALU result goes out to internal bus via the Y -bus.
- N-register can be used for:
- $N$-way branching or for normalization (use prioritize instr' $n$ )
- "n" in the rotate-by-n instruction


## A Microprogrammed CPU Using Am29116 (continued)

## Microprogram Control

- Am2910 sequencer generates address for next microinstruction
- branch address sources:
- pipeline register
- interrupt vector decoder
- macroinstruction decoder (mapping PROM)
- N-register
- Conditions for branching can come from:
. Am29116 condition-test output
- Am29ø4 condition-test output. Provides:
- for extended set of branching conditions
- for saving conditions generated previously (a useful result of having 2 status registers)
- Am29116 T-bus (carry, overflow, negative, zero)
- Interrupt request accepted by an Am2914 interrupt controller request

A Microprogrammed CPU Using Am29116 (continued)
Timing Analyses


## A Microprogrammed CPU Using Am29116 (continued)

## Timing Analyses (continued)

Path Computations

```
Path 1 (IR --> pipeline)
```

From To

| Instruction register | $C P$ | $Q$ | 13 ns |
| :--- | :--- | :--- | :--- |
| Mapping PROM | ADD | $Y$ | $4 \emptyset \mathrm{~ns}$ |
| Sequencer | $D$ | $Y$ | $2 \emptyset \mathrm{~ns}$ |
| Micromemory | ADD | $Y$ | $4 \emptyset \mathrm{~ns}$ |
| Pipeline register | set up |  | 5 ns |

Path 2 (CC-flip flop --> CC-MUX control)

|  | From | To |  |
| :--- | :--- | :--- | :--- |
|  | CP | Q | 13 ns |
| CC flip flop | CC | $Y$ | 43 ns |
| Sicromemory | ADD | $Y$ | $4 \emptyset \mathrm{~ns}$ |
| CC-MUX | sel | $Y$ | 15 ns |
| CC-flip-flop | set up |  | 5 ns |

A Microprogrammed CPU Using Am29116 (continued)

| Timing Analysis (continued) |  |  |  |
| :---: | :---: | :---: | :---: |
| Path Computations (continued) |  |  |  |
| Path 3 (pipeline register --> pipeline register) |  |  |  |
|  | From | To |  |
| Pipeline register | CP | Y | 13 ns |
| Sequencer | I | Y | 70 ns |
| Micromemory | ADD | $Y$ | 40 ns |
| Pipeline register | set up |  | 5 ns |
|  |  |  | 128 ns |
| Path 4 (pipeline register --> CC-flip-flop) |  |  |  |
|  | from | to |  |
| Pipeline register | CP | $\gamma$ | 13 ns |
| Three-state gate | enable | $\gamma$ | 29 ns |
| Am29116 (preliminary) | I | CT | 47 ns |
| CC - MUX | $\mathrm{D}_{\text {in }}$ | $\gamma$ | 15 ns |
| CC flip flop | set up |  | 5 ns |

A Microprogrammed CPU Using Am29116 (continued)

Timing Analysis (continued)

Path Computations (continued)

Path 5 (pipeline register --> data register)
From To

| Pipeline register | $C P$ | $Y$ | 13 ns |
| :--- | :--- | :--- | ---: |
| Three-state gate | enable | $Y$ | 29 ns |
| Am29116 (preliminary) | I | $Y$ | 88 ns |
| Data register | set up |  | 5 ns |

Path 6 (PC and STACK --> pipeline register)
From To

Am291ø (PC and STACK) CP Y 1めめ ns*
Micromemory $\quad$ ADD $\quad Y \quad 4 \emptyset \mathrm{~ns}$
Pipeline register set up 5 ns

145 ns

[^1]
## A Microprogrammed CPU Using Am29116 (continued)

Timing Analysis (continued)

- Most critical path was

From To

| Pipeline register | CP | $Y$ | 13 ns |
| :--- | :--- | :--- | :--- |
| Am29116 | $I$ | $C T$ | 47 ns |
| CC MUX | $D$ | $Y$ | 15 ns |
| Sequencer | $\overline{C C}$ | $Y$ | 43 ns |
| Micromemory | ADD | $Y$ | $4 \emptyset \mathrm{~ns}$ |
| Pipeline register | set up |  | 5 ns |
|  |  |  | $\frac{163 \mathrm{~ns}}{}$ |

- Introducing the CC-Flip-Flop separates the cycle time for that path into two non-critical paths (4 and 2).
- Since the CC-FF delays the $\overline{C C}$ signal by one clock, the CC-MUX-select lines are driven directly from the microprogram memory rather than the pipeline register. This tactic re-aligns the selection of the condition code with the execution of the microinstruction.


## A Microprogrammed CPU Using Am29116 (continued)

## Macroinstruction Execution

- 4 basic sequences of operations:
- Form Memory Address of Instruction: (1 microcycle)
$P C, M A R<--P C+2$
- Fetch Instruction: (1 microcycle)
generate Main-Memory-Request and Read-Strobe
bus <-- ((MAR))
IR <-- (bus) at the next rising edge
- Decode Instruction: (1 microcycle)

IR <-- Z-latch

PROM generates starting address for the microprogram

- Execute:

Am29116 performs the specified operation on the operands. The number of microcycles depends upon the operation.

- 2 extra steps are needed for instructions with memory operands
- Form Operand Address: (1 microcycle)

MAR <-- $(X)+d$ using the ' $d$ ' in the Z-latch

- Fetch Operand: (1 microcycle)
$Z<--((M A R))$ or $D<--((M A R))$


## Pipelining at the Macrolevel

- There is very little scope in this configuration for additional macrolevel pipelining:
- You could use a second Am29116 as the basis for a Program Control Unit. That is, you would use one Am29116 as an ALU and the second Am29116 as a PCU.
- You could use some other processor as a basis for this PCU.
- A PCU would increase throughput by introducing parallelism in the advancing of the program counter, and in stack operations.
- The present configuration already incorporates some concurrency, however.


## A Microprogrammed CPU Using Am29116 (continued)

## Pipelining at the Macrolevel (continued)

Overlapping of Register to Register Instructions

|  | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 | 14 | 15 | 16 | 47 | 18 | 19 | 20 | 21 |
| :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: |
| Form Instruction Address | A | B |  |  |  | 0 |  |  | D |  |  | E |  |  |  |  |  |  |  |  | $\begin{aligned} & \mathrm{PC}+2 \rightarrow \mathrm{PC} \\ & \mathrm{PC}+2 \rightarrow \mathrm{MAR} \end{aligned}$ |
| Fetch Instruction |  | $\begin{aligned} & 1 R \\ & A \end{aligned}$ | $\begin{aligned} & \mathrm{Z} \\ & \mathrm{~B} \end{aligned}$ |  |  |  | 2 $C$ |  |  | $Z$ $D$ |  |  | $\begin{aligned} & Z \\ & E \end{aligned}$ |  |  |  |  |  |  |  | PC $+2 \rightarrow$ MAR and Load $\mathbf{Z}$ Latch or IR |
| Decode |  |  | A |  |  | $\begin{aligned} & \text { IR } \\ & \text { B } \end{aligned}$ |  |  | $\begin{aligned} & \mathrm{IR} \\ & \mathrm{C} \end{aligned}$ |  |  | $\begin{gathered} \mathrm{IR} \\ \mathrm{D} \end{gathered}$ |  |  |  |  |  |  |  |  | Decode Instruction and Load Plpeline Register |
| Form Operand Address |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  | $\begin{aligned} & z+\text { Index } \\ & \text { Register } \rightarrow \text { MAR } \end{aligned}$ |
| Fetch Operand |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  | Load Data Register |
| Execute |  |  |  | A | A |  | B | E |  | 0 | C |  | D | D |  |  |  |  |  |  |  |
|  | $Y$ | $Y$ |  | Y | $Y$ | Y | $Y$ | $Y$ | $\gamma$ | $Y$ | $Y$ | Y | Y | Y |  |  |  |  |  |  | The Am29116 Usage |
|  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |

$A, B, C, D$ are Register to Register type instructions.
$Z=Z$ Latch
IR = Instruction Register

- Cycle 2: pipefill operation (directly load IR) from memory while Am29116 forms next instruction address
- Cycle 4,5: in single-port Am29116, RR instr. needs 2 cycles
- Cycle 6: IR <-- (Z) and map into microaddress with a PROM while Am29116 forms next instruction address
- Cycle 3 is the only cycle in which the Am29116 is idle.
- After the first 5 cycles, every third cycle produces a result.

A Microprogrammed CPU Using Am29116 (continued)

## Pipelining at the Macrolevel (continued)

Overlapping of Register to Index Storage Instructions

|  | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 | 14 | 15 | 16 | 17 | 18 | 19 | 20 | 21 |
| :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: |
| Form Instruction Address | A | $A_{D}$ |  |  | B |  | Bo |  |  | C |  | $C_{0}$ |  |  |  |  |  |  |  |  | $\begin{aligned} & P C+2 \rightarrow \text { AAR } \\ & P C+2 \rightarrow P C . \end{aligned}$ |
| Fetch Insiruction |  | IR <br> A | Z <br> ${ }_{A D}$ |  |  | IR $\mathrm{B}$ |  | z <br> $B_{0}$ |  |  | IR <br> C |  | $z$ <br> $C_{D}$ |  |  |  |  |  |  |  | $\begin{aligned} & (P C+2 \rightarrow \text { MAR } \\ & \text { and PC)* } \\ & \text { Load IR or } Z \text { Latch } \end{aligned}$ |
| Decode |  |  | A |  |  |  | B |  |  |  |  | C |  |  |  |  |  |  |  |  | $\mathrm{PC}+2 \rightarrow \mathrm{MAR}$ and PC Decode and Load Pipeline Fegister |
| Form Operand Address |  |  |  | A |  |  |  |  | B |  |  |  |  | C |  |  |  |  |  |  | $\begin{aligned} & z+\text { Index } \\ & \text { Register } \rightarrow \text { NAR } \end{aligned}$ |
| Fetch Operand |  |  |  |  | A |  |  |  |  | B. |  |  |  |  | C |  |  |  |  |  | $P C+2 \rightarrow P C$ and HAR Load Operand in Data Register |
| Execute |  |  |  |  |  | A |  |  |  |  | B |  |  |  |  | C |  |  |  |  |  |
|  | $Y$ | Y |  | $Y$ | $Y$ | $Y$ | $Y$ |  | Y | $Y$ | $Y$ | $\gamma$ |  | $Y$ | $Y$ | $Y$ |  |  |  |  | The Am29116 Usage |
|  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |

*For pipefill operation only.
A, BC are Register to Index storage type instructions.
$A_{D}, B_{D}, C_{D}$ are displacement.
$Z=Z$ Latch
IR = Instruction Register

- Cycle 2: Form $A_{D}$ creates address to fetch displacement, 'd'.
- Cycle 3: Decoding determines if $Z$ contains a displacement.
- Cycle 6: Two-word instructions use both Z-latch and IR:
- Instruction is directly loaded into IR.
- Execution of A prevents formation of address of displacement, 'd'. (a PCU would save a cycle here).
- Every fifth cycle the Am29116 is idle.
- Every fifth cycle produces a result


## A Microprogrammed CPU Using Am29116 (continued)

Pipelining of the Macrolevel (continued)
Overlapping of Branch-on-Condition RX Type Instructions

| $O P$ | $M$ | $\times 2$ | $d$ |
| :--- | :--- | :--- | :--- |

$M$ is the condition $(\mathrm{X} 2)+d$ is the address

|  | 1 | 2 | 3 | 4 | 5 | 5 | 7 | 8 | 9 | 10 | 11 | 12 | 13 | 14 | 45 | 16 | 17 | 18 | 19 | 20 | 21 |
| :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: |
| Form instruction Address | A | $A_{D}$ |  |  | $\mathrm{B}_{\mathrm{K}} \text { or }$ | $\begin{aligned} & B_{D} \\ & K_{D} \end{aligned}$ |  |  | $\begin{aligned} & B+2 \\ & K+2 \end{aligned}$ |  | $\left\|\begin{array}{l} \mathrm{B}+2 \mathrm{D} \\ \mathrm{~K}+2 \mathrm{D} \end{array}\right\|$ |  |  |  |  |  |  |  |  |  | $\begin{aligned} & P C+2 \rightarrow \text { MAA } \\ & P C+2 \rightarrow P C \end{aligned}$ |
| Fetch Instruction |  | $\begin{aligned} & \text { IR } \\ & \text { A } \end{aligned}$ | $\begin{gathered} z \\ A_{D} \end{gathered}$ |  |  | $\begin{aligned} & \text { IR } \\ & \text { Bor } \\ & \text { K } \end{aligned}$ | $\begin{aligned} & z \\ & \mathrm{BD}_{\mathrm{D}} \end{aligned}$ |  |  | 18 B $\mathrm{~B}+2$ $\mathrm{~K}+2$ |  | $\left\|\begin{array}{c} Z \\ B+2 D \\ K+2 D \end{array}\right\|$ |  |  |  |  |  |  |  |  | $\begin{aligned} & (P C+2 \rightarrow \text { MAR } \\ & \text { and PCC)* } \\ & \text { Load } 1 R \text { or } 2 \text { Latch } \end{aligned}$ |
| Decode |  |  | A |  |  |  | $\begin{aligned} & B \\ & K \end{aligned}$ |  |  |  | $\begin{aligned} & \mathrm{B}+2 \\ & \mathrm{~K}+2 \end{aligned}$ |  |  |  |  |  |  |  |  |  | $\mathrm{PC}+2 \rightarrow \mathrm{MAR}$ and PC Decode and Load Pipeline Register |
| Form Operand Address |  |  |  |  |  |  |  | $\begin{aligned} & \mathrm{B} \\ & \mathrm{~K} \end{aligned}$ |  |  |  |  | $\begin{aligned} & \mathrm{B}+2 \\ & \mathrm{~K}+2 \end{aligned}$ |  |  |  |  |  |  |  | $\begin{aligned} & Z+\text { Index } \\ & \text { Register } \rightarrow \text { MAR } \end{aligned}$ |
| Fetch Operand |  |  |  |  |  |  |  |  | $\begin{aligned} & \mathrm{B} \\ & \mathrm{~K} \end{aligned}$ |  |  |  |  | $\begin{aligned} & \mathrm{B}+2 \\ & \mathrm{k}+2 \end{aligned}$ |  |  |  |  |  |  | $P C+2 \rightarrow P C$ and MAR Load Operand in Data Register |
| Execute |  |  |  | $A^{*}$ |  |  |  |  |  | B |  |  |  |  | $\begin{aligned} & \mathrm{B}+2 \\ & \text { or } \\ & \mathrm{K}+2 \end{aligned}$ |  |  |  |  |  |  |
|  | Y | Y |  | Y | $Y$ | $Y$ |  | $Y$ | $\gamma$ | Y | Y |  | $Y$ | Y | $Y$ |  |  |  |  |  | The Am29116 Usage |

*During this cycle decision to branch takes place
If condition is true, Address $=K=$ Index Reg $+A_{D}$ If condition is false, Address $=B=A+1$

$$
\begin{aligned}
& Z=Z \text { Latch } \\
& I R=\text { Instruction Register }
\end{aligned}
$$

- Cycle 5: the result of the execution in cycle 4 held by the Am29ø4 determines whether the Am29116 will issue the inline or the branch address

A Microprogrammed CPU Using Am29116 (continued)
Microword Format

- 78-bit wide microword Control bits for each functional unit are grouped

| Field Width | Mnemonic | Description |
| :---: | :---: | :---: |
| ALU |  |  |
| 16 | $\mathrm{I}_{\emptyset}-\mathrm{I}_{15}$ | 29116 Instruction |
| 1 | DLE | 29116 Data Latch Enable |
| 1 | IEN | 29116 Instruction Enable |
| 1 | OEY | 29116 Output Enable Y-bus |
| 1 | SRE | 29116 Status Register Enable |
| 1 | OET | 29116 Output Enable T-bus |
| 3 | RAMSRC | $29116 \mathrm{I}_{\emptyset}-\mathrm{I}_{4}$ Source Select |
| 2 | NSRC | $29116 \mathrm{I}_{9}-\mathrm{I}_{12}$ Source Select |
| Data Path |  |  |
| 4 | DSEL | Data Register Source/Destination Select |
| 1 | DLD | Data Register Enable |
| 1 | MARLD | Memory Address Enable |
| 1 | IRLD | Instruction Register Enable |
| 1 | ZLD | Z-Latch Enable |
| 1 | NLD | $N$-Register Enable |
| 1 | MAP | Mapping PROM Output Enable |
| 1 | VECT | Interrupt Vector PROM Output Enable |

A Microprogrammed CPU Using Am29116 (continued)
Microword Format (continued)

| Field Width | Mnemonic | Description |
| :---: | :---: | :---: |
| Memory Control |  |  |
| 1 | R/W | Memory READ/WRITE Pulse |
| 1 | WREO | Wait Request |
| 1 | DATASTB | Data Strope |
| 1 | MEMREQ | Memory Request |
| Interrupt Control $4$ | $\mathrm{I}_{\varnothing}-\mathrm{I}_{3}$ | 2914 Instruction |
| 1 | INTD | 2914 Interrupt Disable |
| Clock Select $3$ | $L_{1}-L_{3}$ | 2925 Clock Length Select |
| Status |  |  |
| 1 | EZ | 2904 Enable Zero |
| 1. | EC | 2904 Enable Carry |
| 1 | ES | $29 \emptyset 4$ Enable Sign |
| 1 | EOVR | $29 \emptyset 4$ Enable Overflow |
| 1 | OEM | $29 \varnothing 4$ Enable Machine Status |
| 1 | OEMICRO | 2904 Enable Micro Status |

## A Microprogrammed CPU Using Am29116 (continued)

Microword Format (continued)

| Field Width | Mnemonic | Description |
| :---: | :---: | :---: |
| Test |  |  |
| 4 | $T_{1}-T_{4}$ | 29116 or 2904 Test Status Instruction |
| CCSEL |  |  |
| 3 | CCSEL | Condition Code MUX Select |
| Sequence Control |  |  |
| 4 | $I_{\emptyset}-I_{3}$ | $291 \varnothing$ Instruction |
| Branch Address |  |  |
| 12 | $B A$ | Next Micro Address |
| 78 |  |  |

## A Microprogrammed CPU Using Am29116 (continued)

## Conclusion

- Microprogrammability makes for easy and quick design of a customized architecture.
- The powerful instruction set of the Am29116 is very suitable for CPU applications:
- bit mampulation
- multiple bit rotate
- rotate and merge
- rotate and compare
- prioritize functions
- We have shown a minimal configuration:
- a PCU unit based on another Am29116, Am2901's
or Am293ø PCU slices could increse the throughput.

A Microprogrammed CPU Using Am29116 (continued)

## A Microprogrammed CPU Using Am29116 (continued)

The Super-16
A 16-bit computer built from Am29øø Family parts to illustrate design techniques. Incorporates pipelining at the macro- and microprogram level.


## A Microprogrammed CPU Using Am29116 (continued)

Super-16 (cont inued)

## Processor Organization

- Distinct sections
- program control unit (PCU)
- arithmetic and logic unit (ALU)
- computer control unit (CCU)
- data paths
- memory control, clock control
- I/0 interface
- interrupt section

There is one important distinguishing feature as compared with the Am29116 CPU described in the last section:

- The ALU and the PCU are separate.

We will restrict our attention to this point.

## Comparison with the Super-Sixteen

The PCU:

- Incorporates:
- MAR (latch)
- 4 Am2901 microprocessor slices
- a 16-bit transfer register
- a 16-bit bidirectional buffer ... called a transfer driver
- Controls the macroprogram sequencing
- Is controlled by the microprogram
- Register assignments

Rø- program counter PC
R1- stack pointer
R2- stack lower limit
R3- stack upper limit
R4- +2 ... a useful constant
R5- +4 ... a useful constant
R6,R7- not used but are available
R8,R15- not used (wired to be disabled)

## A Microprogrammed CPU Using Am29116 (continued)

## Comparison with the Super-Sixteen (continued)

## PCU (continued)

- Function
- updating of the PCU
- MAR <-- (PC) for reading instructions or data from main memory
- updating of the stack pointer checking stack limits
- Communication
- data to PCU from ALU via transfer register
- data from PCU to $Y$-bus of the ALU via PCU transfer drivers

ALU

- Designed from
- 4 Am29ø3 superslices
- Am29ø4 status and shift control logic
- PSW register
- 3 buses
$D A: A L U$ input from $Z_{\emptyset}$ register (memory data)
IMMD gate (immediate field of microcode)
DB: ALU input from PSW
Y: ALU output, input to RAM registers
- RAM register selection from instruction register (I)
- $I_{\emptyset-3}$ : A address on Am29ø3
- $I_{4-7}: B$ address on $A m 29 \varnothing 3$ or control for CCMUX (in Am29ø4) in conditional branch instructions (macrolevel!)


## A Microprogrammed CPU Using Am29116 (continued)

Comparison with the Super-Sixteen (continued)

ALU (continued)

- Byte-wide operations are possible as well as word-wide:
- only the lower 8 bits are affected
- by disabling write-enable \& output-enable for slices 3,4
- word/byte MUX selects C,N,OVR from slice 2 (MSS)
- force $Y_{8-15}$ to zero for Z-signal
- Control for the ALU
- A \& B addresses, Am29ø4 CCMUX from macroleve $1 \mathrm{I}_{7-\emptyset}$
- ALU operations from microlevel M78-86
- WORD signal Mgø (byte/word)


## (Macro) Instruction Execution

- Form instruction address (PCU)
- publish bus request at the beginning of the cycle
$-P C$, MAR $<--(P C)+2$
- address bus <-- (MAR) $5 \emptyset n s$ prior to beginning of next cycle
- Instruction fetch (main memory)
- the main memory is fast enough to all reading in one cycle
- Decode (mapping. PROM)
- fetched instruction is routed through $Z$ and $Z 1$ registers to the instruction decoder (mapping PROM)
- 8-bit opcode is address for the PROM giving the starting address for the microprogram to execute this instruction
- Displacement fetch (main memory)
- every instruction fetch is followed by another read cycle:
- next instruction fetch for one word instructions
- displacement fetch for two word instructions
- decoding of the previous instruction determines how this data should be interpreted.

A Microprogrammed CPU Using Am29116 (continued)

# Comparison with the Super-Sixteen (continued) 

(Macro) Instruction Execution (continued)

- Form operand address (ALU)
- fetched displacement is sent through the $Z$ and $Z \emptyset$ register to the ALU (29ø3)
- MAR $<--\left(X_{2}\right)+d \quad$ ( 50 ns before the end of the cycle)
(See note 1)
- Operand fetch (memory)
- Execute (ALU) (See note 2)
- 29ø3's perform specified operation on the operand's
- simultaneous with the last execution cycle the instruction decoder is enabled

Notes: 1) Remember the instruction format for the CPU in the last section. This application is an emulation of the Super-16, so that the instruction formats match.
2) Execution may take more than one microcycle.

## Pipelining at the Macrolevel (continued)

## Register-to-Register Pipeline Operation



- Cycle 1,2: Compare with pipelining of the CPU in the last section
- Cycle 3: Address of a third instruction is formed (Super-16: $Z, Z_{\phi}, Z_{1}$-registers; Other: Z-register)
- Cycle 4: Simultaneous instruction execution and formation of instruction address

Execution of $A$ needs:

- one microcycle for most instructions because of simultaneous decoding of $B$
- more than one microcycle for a few instructions (e.g. I/0 instructions). The pipeline stops.
- After the first three cycles, every cycle produces a result.


## Pipelining at the Macrolevel (continued)

Register-to-Indexed-Storage Pipeline Operation

| Action | A, B, C, D are RX instructions |  |  |  |  |  |  |  |  |  |  |  |  |
| :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: |
| Form Instruction Address | A |  | B |  |  | c |  |  |  |  |  |  |  |
| Fetch Instruction |  | A |  | B |  |  | c |  |  |  |  |  |  |
| Decode |  |  | A |  |  | B |  |  | c |  |  |  |  |
| Fetch Displacement |  |  | A |  |  | B |  |  | c |  |  |  |  |
| Form Operand Address |  |  |  | A |  |  | B |  |  | c |  |  |  |
| Fetch Operand |  |  |  |  | A |  |  | B |  |  | C |  |  |
| Execute |  |  |  |  |  | A |  |  | B |  |  | c |  |
|  |  |  |  |  |  |  |  |  |  |  |  |  |  |
| Cycle | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 |  |

- Cycle 2: The same actions as in the RR pipeline operation occurs (fetch instruction $A, \quad$ MAR $<--P C+2$ )
- Cycle 3: Form instruction address, memory read and decode are performed simultaneously

The system doesn't "know" whether a displacement or the next instruction is fetched from memory.

After the decode of instruction $A$ it is determined to be a displacement

- Cycle 4: Simultaneous form instruction address (PCU) and form operand address (ALU). This was impossible in the CPU of the last section.
- Cycle 5: Only one pipeline stage is active: memory read

The operand is fetched
Decoder is not enabled yet and therefore the upper part of the pipeline is stopped

# Pipelining at the Macrolevel (continued) <br> Register-to-Indexed-Storage Pipeline Operation (continued) 

- Cycle 6: Compare with Cycle 3

Additional execute A cycle (ALU)
In this cycle you see again the advantage of two separate devices for the PCU and the ALU. You can form instruction address and execute in the same microcycle.

- The pipeline needs 6 cycles from the beginning to produce the first result.
- Every third cycle a result is produced.


## Pipelining at the Macrolevel (continued)

Branch on Condition RX Pipeline Operation

| Action | $A=R X$ Branch Instruction <br> $B=$ Next RXI Instruction if branch is not taken <br> $\mathrm{K}=$ next RX Instruction if branch is taken |  |  |  |  |  |  |  |  |  |  |  |  |
| :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: |
| Form Instruction Address | A |  | 8 | K | $\begin{aligned} & B+2 \\ & K+2 \end{aligned}$ |  |  |  |  |  |  |  |  |
| Fetch Instruction |  | A |  | B | K | $\begin{aligned} & \mathrm{B}+2 \\ & \mathrm{~K}+2 \end{aligned}$ |  |  |  |  |  |  |  |
| Decode |  |  | A |  |  | K | etc. |  |  |  |  |  |  |
| Fetch Displacement |  |  | A |  |  |  | 8 $K$ |  |  |  |  |  |  |
| Form Operand Address |  |  |  |  |  |  |  | K |  |  |  |  |  |
| Fetch Operand |  |  |  |  |  |  |  |  | B K |  |  |  |  |
| Execute |  |  |  | $A_{1}$ | $A_{2}$ | $A_{3}$ |  |  |  | B K |  |  |  |
|  | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 |  |  |  |
|  |  |  |  |  |  |  |  |  |  |  |  |  |  |

Cycle 3: Form instruction-address (next instruction if branch is not taken)
Fetch displacement of next instruction
Decoding A detects a branch instruction

## Pipelining at the Macrolevel (continued)

Branch on Condition RX Pipeline Operation (continued)

- Cycle 4: Execution of the branch instruction starts
- Determination whether the condition is true or false is not yet possible.
- Sequencing of the program is temporarily unknown

To place the correct instruction in the Z-register when decoding of the next instruction ( $B$ or $K$ ) is enabled, form the instruction addresses of the two alternative instructions:

- Form instruction address $K$ (where $\left.K=\left(X_{2}\right)+d\right)$
- Instruction address $B$ was formed in the last cycle and B can be fetched in this cycle.
- Cycle 5: Fetch instruction K

Executing $A_{2}$ determines whether the condition is true or false

The cycle is long enough to form the correct instruction address ( $\mathrm{K}+2$ or $\mathrm{B}+2$ )

- Cycle 6: The last execution cycle $A_{3}$ enables the decoder. Decode either $K$ or $B$ as determined in cycle 5.
- You see: The generation of the two alternative instruction addresses prevents the pipeline from flushing out totally.

Super-16 Microword Format

| ROUTE TO B | $\overline{\text { RTB }}$ | $\stackrel{0}{\circ}$ |  |  |  |
| :---: | :---: | :---: | :---: | :---: | :---: |
| TRANSFER Z TO ZI Am2910 | $\frac{(\mathrm{BP})}{\mathrm{CCEN}} Z \rightarrow Z 1$ | $\begin{aligned} & 0 \\ & 0 \\ & \end{aligned}$ | $\bigcirc$ |  |  |
| Am2903 IEU WORD/BYTE <br> Am2903 <br> Am2903 <br> Am2903 <br> Am2903 <br> Am2903 <br> Am2903 <br> Am2903 <br> Am2903 <br> Am2903 <br> Am2903 <br> Am2903 <br> Am2903 | $\begin{aligned} & \hline \overline{W O R D} \\ & \frac{\overline{E A}}{\overline{O E Y}} \\ & \hline O E B \\ & I_{8} \\ & I_{7} \\ & I_{6} \\ & I_{5} \\ & I_{4} \\ & I_{3} \\ & I_{2} \\ & I_{1} \\ & I_{0} \end{aligned}$ | 90898887868584838281807978 | $\underset{\underset{\omega}{c}}{\stackrel{\rightharpoonup}{c}}$ |  |  |
| ENABLE TRANSFER REG. <br> LOAD TRANSFER REG. <br> I-REG EN CTR <br> H-REG INC/DEC <br> PCU TRANS CHIP DISABLE <br> PCU TRANSFER REG. <br> LOAD MEMORY ADDA. REG. <br> LOAD D-REG. <br> LOAD ZI INTO I REG. <br> ENABLE ZO $\rightarrow$ DA <br> ENABLE PSW <br> SHIFT CNT Am2910 ADDR. <br> BRANCH INSTR. EN | ENTREG <br> LDTREG <br> ENCTR <br> INC <br> PCUCD <br> PCU $\rightarrow Y$ <br> LDMAR <br> LDD <br> $\mathrm{ZI} \rightarrow$ I <br> ENZO <br> PSW <br> $\frac{\text { SHTCNTEN }}{\text { BRIEN }}$ |  |  |  |  |
| Am2901 $F \rightarrow B / \bar{Q}$ Am2901 Am2901 Am2901 Am2901 Am2901 Am2901 Am2901 Am2901 Am2901 Am2901 | $\mathrm{PCUH}_{7}$ $\mathrm{PCUl}_{3}$ $\mathrm{PCUl}_{2}$ PCUI PCUI 0 $\mathrm{PCUA}_{2}$ $\mathrm{PCUA}_{1}$ PCUA 0 $\mathrm{PCUB}_{2}$ $\mathrm{PCUB}_{1}$ $\mathrm{PCUB}_{0}$ |  | $\begin{array}{r} 0 \\ =9 \\ =3 \\ =3 \\ 0 \\ 0 \end{array}$ |  |  |
| BUS REQUEST <br> MEMORY REQUEST <br> HOLD REQUEST <br> MEMORY WRITE/ $\overline{R E A D}$ <br> MEMORY WORD/BYTE | REQB <br> MREQ <br> HREO <br> WRITE <br> MWORD | $\begin{aligned} & \text { cg } \\ & \text { M } \\ & \text { G } \\ & \text { G } \\ & 0 \\ & 0 \end{aligned}$ | 으울 |  |  |

Super－16 Microword Format（continued）

| EN IMMEDIATE $\rightarrow$ DA BUS ROMIREGEN <br> UO CONTROL REG．EN <br> Am2914 INTERRUPTS DISABLE <br> Am2914 $\mathrm{EN}_{\mathrm{o}}$－$E \mathrm{EN}_{3}$ <br> Am2904 SHIFT EN | $\overline{\text { MMD }}$ <br> ROM／I <br> IOEN <br> INTDIS <br> $\frac{\text { INTRIEN }}{\text { SHFTEN }}$ |  |  |  |  |
| :---: | :---: | :---: | :---: | :---: | :---: |
| GENERAL USE CONTROL BITS | $\mathrm{CNTLB}_{7}$ $\mathrm{CNTLB}_{6}$ $\mathrm{CNTLB}_{5}$ $\mathrm{CNTLB}_{4}$ $\mathrm{CNTLB}_{3}$ $\mathrm{CNTLB}_{2}$ CNTLB $0_{0}$ |  | $\widehat{\Phi}$ |  |  |
| Am2904 OUT EN CONDITIONAL TEST <br> Am2904 EN ZERO <br> Am2904 EN CARRY <br> Am2904 EN SIGN <br> Am2904 EN OVERFLOW <br> Am2904 EN MACHINE STATUS <br> Am2904 EN MICRO STATUS <br> Am2904 $\mathrm{I}_{12}$ CARRY OUT CNTL <br> Am2904 $\mathrm{I}_{11}$ CARRY OUT CNTL | $\overline{O E C T}$ <br> $\frac{E Z}{}$ <br> $E C$ <br> $E S$ <br> $\frac{E O V A}{}$ <br> $C E M$ <br> $C E \mu$ <br> $I_{12}$ <br> $I_{11}$ |  | 鸟荡 |  | $\begin{aligned} & \stackrel{\rightharpoonup}{\dot{\phi}} \\ & \text { 音 } \\ & \mathbf{o} \end{aligned}$ |
| Am2904 <br> Am2904 <br> Am2904 <br> Am2904 \＆Am25LS251 <br> Am2904 \＆Am25LS251 <br> Am2904 \＆Am25LS251 | $\begin{aligned} & \text { TEST }_{5} \\ & \text { TEST }_{4} \\ & \text { TEST }_{3} \\ & \text { TESST }_{2} \\ & \text { TEST }_{1} \end{aligned}$ |  | 요 |  | $\begin{aligned} & \frac{3}{y} \\ & \stackrel{1}{c} \\ & \text { n } \end{aligned}$ |
| $\mathrm{Am}^{2910 I_{3}}$ <br> Am2910 $\mathrm{I}_{2}$ <br> Am2910 $I_{1}$ <br> Am2910 Io | $\mathrm{NAC}_{3}$ <br> $\mathrm{NAC}_{2}$ <br> $\mathrm{NAC}_{1}$ <br> $\mathrm{NAC}_{0}$ |  |  | 6 |  |
|  | $M_{15}$ <br> $\mathrm{M}_{14}$ <br> $M_{13}$ <br> $M_{12}$ <br> $M_{11}$ <br> $M_{10}$ <br> $\mathrm{M}_{9}$ <br> $\mathrm{M}_{8}$ <br> $\mathrm{M}_{7}$ <br> $M_{6}$ <br> $M_{5}$ <br> $M_{4}$ <br> $\mathrm{M}_{3}$ <br> $\mathrm{M}_{2}$ <br> $M_{1}$ <br> $M_{0}$ |  |  |  |  |

Comparison of Features as General Purpose CPU's

| Operation | Am2901C | Am292ø3 | Am29116 |
| :---: | :---: | :---: | :---: |
| 16-bit ADD | 83 ns (max) | $\sim 12 \emptyset \mathrm{~ns}$ (max) | $1 \varnothing \emptyset-12 \emptyset \mathrm{~ns}$ (max) |
| 16-bit MULT | $\sim 16 \not \subset \operatorname{ns~}^{\text {n (max) }}$ | $\sim 18 \varnothing \emptyset$ ns (max) | $\sim 9 \varnothing \varnothing \varnothing$ ns (max) <br> $\sim 3 \varnothing D$ ns (max with Am29516/7) |
| BCD ADD | $\sim 1 め \varnothing$ ns (max) | $\sim 12 \emptyset \mathrm{~ns}$ (max) | $\sim 100 \square$ ns (max) |
| Data Word Width | 4-256 Bits | 4-256 Bits | 8 or 16 Bits |
| \# Registers | 16 only | 16,32,48,64 | 32 only |
| 8-Bit Rotate | $\sim 8 \emptyset \emptyset \mathrm{~ns} \mathrm{(max)}$ | $\sim 1000 \mathrm{~ns}$ (max) | $\sim 10 \emptyset-12 \emptyset$ ns (max) |
| 8-Bit Rotate \& Merge | $\sim 1 \varnothing \varnothing \emptyset$ ns (max) | $\sim 13 \not 0$ ns (max) | $\sim 1 \phi \phi-12 \phi$ ns (max) |
| Devices for | $4 \times$ Am29ø1C | $4 \times$ Am292ø3 | $1 \times$ Am29116 |
| 16-bit ALU | $1 \times \mathrm{Am} 29 \emptyset 2$ | $1 \times$ Am29ø2 |  |
|  | $\sim \frac{1}{} \times \mathrm{Am} \times \mathrm{Am} 29 \varnothing 4$ | $1 \times$ Am2904 |  |
| Typical power for 16-bit ALU | 5.2 W | 5.6 W | 2.2 W |

Performance Analysis


## Performance Analysis (continued)



Performance Analysis (continued)


## Performance Analysis (continued)

Are you familiar with the Am25S1ø?

How does it help speed up shifts and rotates on Am2901 systems?

It is a four-bit high-speed shifter:

- It shifts four bits of data $\emptyset, 1,2$ or 3 places.
- Several devices can be connected to:
- perform shifts of $\emptyset, 1,2$ or 3 places
on words of any length. (\# of 25S1D's equals \# of bits/4)
- perform a complete end-around barrel shift.
(\# of 25S1ф's needed equals \# of bits/2)


# CHAPTER 7 <br> Exercises - Part 2 

## Exercises - Part 2

1. You have 8 ASCII characters that normally occupy 64 bits of memory. Since each byte has in fact only 7 active bits, it may be useful to pack these bytes into only 56 bits for storage on a disk. In writing these bytes to a disk you wish to pack them into a 56-bit contiguous frame by discarding the parity bit from each byte (i.e. the bit in the most signifigant position). Write the microinstructions for the Am29116 for packing these 64 bits into 56 bits. Assume that the characters are initially in Rø thru R3 and are to be packed into R4 thru R7.
2. Write the code to perform a $16 \times 16$-bit unsigned integer multiplication. Write your code on the assumption that the operands are already in the RAM or on the assumption that the operands are to be supplied by your code as immediate values.
3. If you have time, try the following exercise:

The Am29116 can be very effective in arbitrating requests for service from several different sources. Suppose there are eight sources of service requests and the Am29116 has already serviced all requests from sources $\mathrm{S7}, \mathrm{~S} 6$ and S 5 . Write the code to cause the Am29116 to branch to the service routine associated with the highest priority source of the group $S 4, S 3, S 2, S 1, S \emptyset$ while ignoring $S 7, S 6$ and $S 5$.

## Solutions for Exercises - Part 2

Problem \#1: Packing ASCII characters


$$
U: \varnothing 1 \varnothing \emptyset \emptyset \emptyset 1 \emptyset \emptyset 1 \varnothing \emptyset \emptyset \varnothing \emptyset 1
$$

rot U: $1 \emptyset 1 \varnothing \emptyset \varnothing \varnothing 1 \varnothing \emptyset 1 \varnothing \emptyset \varnothing \emptyset \emptyset$

mask: $\varnothing \varnothing 1111111 \varnothing \emptyset \emptyset \emptyset \varnothing \emptyset \emptyset$
$A C C: ~ Ф \emptyset 1 \varnothing \varnothing \varnothing \varnothing 1 \quad \varnothing 1 \varnothing \varnothing \emptyset \varnothing \varnothing 1$
B without A without parity parity

ROTM, W, 14, MRAI, R1 IMME H\#CDФD

U: $\emptyset 1 \varnothing \emptyset \emptyset 1 \varnothing \emptyset 11 \emptyset \emptyset \emptyset \emptyset 11$
rot U: $11 \varnothing 1 \emptyset \emptyset \emptyset 1 \emptyset \emptyset 11 \emptyset \varnothing \emptyset \emptyset$
$R: \emptyset \emptyset 1 \varnothing \emptyset \varnothing \emptyset 1 \emptyset 1 \varnothing \varnothing \emptyset \varnothing \emptyset 1$
mask: 11ФФ ФФФФ ФФФФ ФФФФ

etc.

## Solutions for Exercises - Part 2 (continued)

## Microinstructions for Packing ASCII Characters

SOR B, MOVE, SORA, R $\emptyset$
ROTM W, 15, MRAI, R $\emptyset$ IMME H \# $3 \mathrm{~F} 8 \emptyset$
ROTM W, 14, MRAI, R1 IMME H \# CøФø
SOR W, MOVE, SOAR, R4
ROTR1 W, 14, RTRA, R1
ROTM W, 13, MRAI, R1 IMME H \# ØFED
ROTM W, 12, MRAI, R2 IMME H \# Føøø
SOR W, MOVE, SOAR, R5
ROTR1 W, 12, RTRA, R2
ROTM W, 11, MRAI, R2 IMME H \# Ø3F8
ROTM W, 1ø, MRAI, R3 IMME H \# FCDø
SOR W, MOVE, SOAR, R6
ROTR1 $W, 1 \emptyset$, RTRA, R3
ROTM W, 9, RAI, R3 IMME H \# ØøFE
SOR B, MOVE, SOAR, R7

## Solutions for Exercises - Part 2 (continued)

## Problem \#2: Unsigned Integer Multiplication

Some observations...
o Result is a 32-bit value.
The Am29116 does not provide a double length shift directly.
o Remember that the Am29116 has a single port RAM architecture.

- try to minimize operations which need two register operands
i.e. - shift the result, not the multiplicand (only the result needs two registers)
- use the ACC or D-latch for the multiplicand
- in the double-precision addition use zero as as an immediate value

$$
\text { Unsigned Multiplication (16 x } 16 \text { bit) }
$$

| Symbolic Address | Branch Address | Sequencer | CC | $\begin{gathered} \text { Am29116 } \\ \text { Instruction } \end{gathered}$ | $\overline{\text { SRE }}$ | Comment |
| :---: | :---: | :---: | :---: | :---: | :---: | :---: |
| LOOP: | 16 | cont | PASS | SOR, MOVE, SOIR, RDФ | $x$ | Multiplier --> RøD |
|  |  | cont |  | Multiplier | $x$ | Immediate value |
|  |  | cont |  | SONR, MOVE, SOI, NRA | X | Multiplicand --> ACC |
|  |  | cont |  | Multiplicand | x | Immediate value |
|  |  | cont |  | SOR, MOVE, SOZR, Rø2 | $x$ | Initialize MSH of product to $\emptyset$ |
|  |  | PUSH |  | SOR, MOVE, SOZR, Rø3 | $x$ | Initialize LSH of product to $\varnothing$ Load counter \& push loop address |
|  |  | cont |  | SHFTR, SHRR, SHUPZ, R $\emptyset 3$ | $\emptyset$ | Shift LSH of product by one. Load LSB with zero. |
|  |  | cont |  | SHFTR, SHRR, SHUPL, Rø2 | $x$ | Shift MSH of product by one. Load LSB with QLINK. |
|  |  | cont |  | SHFTR, SHRR, SHUPZ, RФФ | $\emptyset$ | Shift multiplier. Load LSB with zero. |
|  | SKIP | CJP | $\overline{C T}(29116)$ | TEST, TL | $x$ | Branch to SKIP if MSB was zero. |
|  |  | cont |  | TOR1, TORAR, ADD, Rø3 | $\emptyset$ | ```Double precision ADD to accumulate partial product ... LSH: R\emptyset3+ACC --> R\emptysetं``` |
|  |  | cont |  | TOR1, TORIR,ADDC, Rø2 | $x$ | MSH: $\quad \mathrm{Q} \emptyset 2+\emptyset+\mathrm{C} \rightarrow \mathrm{R} \emptyset 2$ |
|  |  | cont |  | $\emptyset$ | x | Immediate zero |
| SKIP: |  | RFCT |  |  | x | Loop back to "LOOP" |

\#3 Exercise Solution


## APPENDIX A

A BRIEF REVIEH OF NUMBER THEORY

## A Brief Review of Number Theory

Number theory has application in relation to several products of Advanced Micro Devices, including the Am295ФФ Family of signal and array-processing devices and the Am952ø/Am8065 Burst Error Processor. The theory also is important in cryptography and in random-number generation. What follows is a brief summary, without proofs, of selected number theory results and related notation.

For proofs and much more material see:
J.V. Uspensky and M.A. Heas let, Elementary Number Theory (McGraw-Hill, New York 1939)
H.J. Nussbaumer, Fast Fourier Transform and Convolution Algorithms (Springer-Verlag, New York 1982)

If 'a' and 'b' are integers, with ' $b$ ' positive, the division of 'a' by 'b' is defined by:

$$
a=b q+r, \quad \phi \leq r<b,
$$

where ' $q$ ' is called the quotient and ' $r$ ' is called the remainder. When ' $r$ ' is zero, ' $b$ ' and ' $q$ ' are factors or divisors of 'a'. To indicate that ' $b$ ' is $a$ factor of 'a' we write "bla", which we can read as "'b' divides 'a'". When 'a' has no divisors other than 1 and 'a', 'a' is a prime. In all other cases, 'a' is composite.

All integers which yield the same remainder when divided by 'b' are said to be congruent modulo 'b'. To say that ' $c$ ' is congruent to 'd' modulo 'b' we write
$c \equiv d$ modulo b. (but we will write = for $\equiv$ from here on)
This condition will be true if bl(c-d).

Modulus arithmetic is similar to ordinary arithmetic:
$\left(a_{1} \pm a_{2}\right)$ modulo $b=\left(\left(a_{1} \operatorname{modulo} b\right) \pm\left(a_{2} \operatorname{modulo} b\right)\right)$ modulo $b$
$\left(a_{1} \cdot a_{2}\right)$ modulo $b=\left(a_{1}\right.$ modulo $\left.b\right) \cdot\left(a_{2}\right.$ modulo $\left.b\right)$ modulo $b$
However, division is a little more complicated. If
$n a_{1}=n a_{2}$ modulo $b$
we cannot always cancel the n's and conclude
$a_{1}=a_{2}$ modulo $b$.
What is true is that

$$
a_{1}=a_{2} \text { modulo }(b / d),
$$

where $d$ is the greatest common divisor of ' $n$ ' and ' $b$ ' which we write as

$$
d=(n, b) .
$$

Only if $d=1$ can we cancel the n's. That is, we can only divide by numbers that are "relatively prime" to the modulus.

It proves useful to define a function which designates the number of integers that are smaller than a given integer, ' $m$ ', and are relatively prime to 'm'. We call this Euler's totient function and write it as " $\varphi(m)$ ".

If ' $m$ ' is a prime then $\varphi(m)=m-1$.

If 'm' is composite such that

$$
m=p_{1}^{a 1} \cdot p_{2}^{a 2} \cdot p_{3}^{a 3} . \ldots \cdot p_{s}^{a s} \text { for } p_{1}, p_{2}, p_{3}, \ldots, p_{s} \text { all prime, }
$$

then

$$
\varphi(m)=m \cdot \frac{p_{1}-1}{p_{1}} \cdot \frac{p_{2}-1}{p_{2}} \cdot \frac{p_{3}-1}{p_{3}} \cdot \ldots \cdot \frac{p_{s^{-1}}}{p_{s}}
$$

We now have enough notation to state Euler's Theorem:

$$
\begin{aligned}
& \text { If }(a, m)=1 \text { then } \\
& { }_{a} \mathscr{P}(m)=1 \text { modulo } m .
\end{aligned}
$$

Euler's Theorem allows us to perform division with respect to a modulus in a more general way than we discussed above. Suppose we want to solve for ' $x$ ' given this linear congruence with one unknown:
$a x=c$ modulo $m$

Using Euler's Theorem

$$
a x=c \cdot a \varphi(m) \text { modulo } m
$$

and provided $(a, m)=1$ we may divide both sides by 'a' so that

$$
x=c \cdot a \varphi(m)-1 \text { modulo } m
$$

and we can see that $a \varphi(m)-1$ is the reciprocal of 'a' with respect to modulus 'm'。

One of the uses of Euler's Theorem relates to its application to the Chinese Remainder Theorem.

The Chinese Remainder Theorem states that simultaneous linear congruences in one unknown can be solved:

Let $m_{j}$ be $k$ positive integers greater than 1 and relatively prime in pairs. The set of linear congruences $x=r_{\mathfrak{j}}$ modulo $m_{j}$ has a unique solution modulo $M$, where $M=m_{1}{ }^{\circ} m_{2}{ }^{\circ} m_{3}{ }^{\circ} \ldots{ }^{\circ} m_{k}$.

The calculation of $x$ from the $r_{i}{ }^{\prime} s$ and $m_{j}^{\prime} s$ is called the Chinese remainder reconstruction. It can be shown that

$$
x=\left(\frac{M}{M_{1}}\right)^{\varphi\left(m_{1}\right)} r_{1}+\left(\frac{M}{M_{2}}\right)^{\varphi\left(m_{2}\right)} r_{2}+\ldots+\left(\frac{M}{M_{k}}\right)^{\varphi\left(m_{k}\right)} r_{k} \quad \text { modulo } M .
$$

That is, ' $x$ ' is obtained from a linear combination of the remainders:

$$
x=A_{1} r_{1}+A_{2} r_{2}+\ldots+A_{k} r_{k} \text { modulo } M
$$

We can note that for $x=1$, all $r_{i}$ 's also equal 1 . Hence we see that the coefficients, $A_{j}$, also sum to 1 :

$$
A_{1}+A_{2}+\ldots+A_{k}=1 \text { modulo } \mathrm{M}
$$

which is a useful check on the correctness of a set of such Chinese remainder reconstruction coefficients which we may have calculated.

It will often be easier to calculate the coefficients of the $r_{i}$ 's by knowing that

$$
\left(\frac{M}{m_{i}}\right)^{\varphi\left(m_{i}\right)} \text { modulo } M=\frac{M}{m_{i}} \cdot\left[\left(\frac{M}{m_{i}}\right)^{\varphi\left(m_{i}\right)-1} \text { modulo } m_{i}\right]
$$

where the equivalent expression on the RHS above can be calculated on a machine with a shorter capacity for integers than is required for the LHS.

## An Example of Chinese-Remainder Reconstruction

Let us apply the above material to the problem of calculating the location of an error burst detected by an Am952ø/Am8ø65 Burst Error Processor using its 32-bit polynomial. In this case, the location in bits will be given modulo 21 and 2ø47. These moduli have no common factors and hence are relatively prime. We may expect from the Chinese Remainder Theorem that we can calculate the location modulo $M=21 \cdot 2 \varnothing 47=42,987$.

We will need the totients for the two factors:

$$
21 \text { is composite, thus } \varphi(21)=\varphi(3 \cdot 7)=21 \cdot \frac{3-1}{3} \cdot \frac{7-1}{7}=12
$$

$2 \emptyset 47$ is also composite, thus $\varphi(2 \emptyset 47)=\varphi(23 \cdot 89)=2 \emptyset 47 \cdot \frac{23-1}{23} \cdot \frac{89-1}{89}=1936$

To understand how to proceed to calculate the Chinese remainder reconstruction coefficients, let us begin with the second coefficient:


But $21^{1936}=$
$655472694777778177572 \emptyset \emptyset 96972 \emptyset 48 \emptyset 8 \emptyset 799674685 \emptyset 6621451973812186245842 \emptyset 27219418995 \emptyset 7$ $214238972 \emptyset 23878 \emptyset 5933 \emptyset 947 \emptyset 646 \emptyset 5277363 \emptyset 4494713254745339 \emptyset 813736 \emptyset \emptyset 44141323884312 \emptyset 193$ $325858275143793338597612331177125335 \emptyset 95514594439112 \emptyset 858 \emptyset 632578963327992769384375$ $426 \emptyset 3 \emptyset 36 \emptyset 5797 \emptyset 228687681 \emptyset 1123274979188 \emptyset \emptyset 7286483256198 \emptyset 27628227259161757426 \emptyset 963299$ $1 \emptyset 79353828654476 \emptyset 6531 \varnothing \emptyset 39978695543136316162542942814716679981146 \emptyset 1395 \emptyset 224728 \emptyset 241$ $4842199 \emptyset 224184 \emptyset 32422659 \emptyset \emptyset 195295 \emptyset 12588 \emptyset 85462 \emptyset 4351761786765363 \emptyset 4 \emptyset 17189886727728577$ $664787628 \emptyset 863253372 \emptyset 31654163 \emptyset 42966532392362882377 \emptyset 688465926358 \emptyset 82614916364897819$ $952996 \emptyset 899572134113382953519568833 \not \subset 579297981894772229974859 \emptyset 5 \emptyset 98799382 \emptyset 371499794$ $8139595475454 \emptyset 59872 \emptyset 417165654121 \emptyset 78 \emptyset 3394484 \emptyset 12311848669835656 \emptyset 1 \emptyset 398876 \not 08858 \emptyset 6716$ $494 \emptyset 17 \emptyset 8128661849 \emptyset 4914556373 \emptyset 46948353 \varnothing 51418 \emptyset 43247 \emptyset 512369426690573967 \emptyset 32574682856$ $195 \emptyset 57585513584 \emptyset 15 \emptyset 37 \emptyset 25585339 \emptyset 728992993721 \emptyset 32 \emptyset 29965844431643939374743856 \emptyset 146636$ $68 \emptyset 396498292567426522 \not \subset 57 \emptyset \emptyset \emptyset 8871 \emptyset 43947894813614346273 \emptyset 13137842 \emptyset 168689745 \emptyset 3 \emptyset 3 \emptyset 5135$ $3 \emptyset 3815 \emptyset 65 \emptyset 983 \emptyset 2794868 \emptyset 9578859 \emptyset 37463252657776488671886 \emptyset 1294749167 \emptyset 732595286632623$ $512388812648924579766 \emptyset 67432961 \emptyset 83362672852792835 \emptyset 3 \emptyset 6 \emptyset 928469489247666663467542882$ $428727743929 \emptyset 87782119276 \emptyset 4 \emptyset 4428 \emptyset 5556 \not \emptyset 13 \emptyset 65288869 \emptyset 7275399764 \emptyset 2131 \emptyset 8 \emptyset 69 \emptyset 579575162$ $33 \emptyset 26218677122 \not \subset 44778 \emptyset 4162 \emptyset 3879851 \emptyset 65849987775591 \emptyset 9963128759282 \varnothing 81825133237 \emptyset 66491$ $1436 \emptyset 2 \emptyset \emptyset 47886 \emptyset \emptyset 3619963826332615863961359379331328363447352586257876 \emptyset 7268 \emptyset 7117983$ $835 \not 47864131856848895867772437764182668646 \emptyset 3 \emptyset 84497 \emptyset 27156863545962997632912488928$ $2438266623823344 \emptyset 27 \emptyset 65744 \emptyset 53916733 \emptyset \emptyset \emptyset 49977683952663871397985248 \emptyset 9164723581467953$ $7 \emptyset 1228675 \emptyset 11166497323384469477 \emptyset 79129219382 \emptyset 8517168298 \emptyset 67784 \emptyset 4339225531874454551 \emptyset$ $4898782734119 \varnothing \emptyset 938541 \varnothing 5444 \emptyset 954 \emptyset 153766489875664 \emptyset 8983 \emptyset 1 \emptyset 15 \emptyset 5881273123135229 \emptyset 137586$ $725619481 \emptyset 49 \emptyset 8 \emptyset 8971 \emptyset 6183422 \emptyset 55421 \emptyset 724 \emptyset 25165974464 \emptyset 8 \emptyset 5787879248125748563346533914$ $49216499 \emptyset 479226585182216298832 \emptyset 4672382218 \emptyset 6259886 \emptyset \emptyset 971981213 \emptyset 951971 \emptyset 5148 \emptyset 373964 \emptyset$ $47732 \varnothing 626 \emptyset 985899954219 \varnothing 6575692181371886751746336 \emptyset \emptyset 21949232411693267 \emptyset 268519947893$ $951547824227912523391 \varnothing 313193735366165959485134173 \emptyset 6943873441978578856179187457 \emptyset 7$ $5156713418321522426793595578 \emptyset 8884189974117666779836 \emptyset 4449691 \emptyset 49545327112 \not 1836 \not 127$ $56862721 \emptyset 1 \varnothing 965881665681274 \emptyset \emptyset 8574228137556218 \emptyset \emptyset 715747234727688231 \emptyset 366231153287428$ $9 \emptyset 393414227 \emptyset 9 \emptyset 838945939699975151 \emptyset 362815 \emptyset \emptyset \emptyset \emptyset 4 \emptyset 754387 \emptyset 547718976156277794 \emptyset 953711229$ $2 \emptyset 34472 \varnothing 97857 \emptyset 2499411957534948562224963353551133415 \emptyset \emptyset \emptyset 7 \emptyset 9149 \varnothing \emptyset 2 \emptyset 648383765 \emptyset 114413$ $194863 \emptyset 8794 \emptyset 26419 \emptyset \emptyset 355 \emptyset 913 \emptyset 6459187 \emptyset \emptyset 27487527693765 \emptyset 738 \emptyset 26546828639 \emptyset \emptyset 464126535 \emptyset 37$ $46847 \emptyset 3 \emptyset 3988469375491874 \emptyset 98949598312912928829138 \emptyset 945913994 \emptyset \emptyset 658273311 \emptyset 5367368931$ $145226154528261622 \emptyset 711634827481795557294195265761671415 \emptyset 55512486226365192239 \emptyset 721$

No kidding! ... 2,56Ø digits.

It is inconvenient to deal with such large integers. Thus it is clear that doing the multiplications to get the 1936 th power then reducing this large result modulo 42,987 is a difficult route to get $A_{2}$. Since the power is a relatively small number, we could alternately do one multiplication, reduce this partial result modulo 42,987 and then repeat this process until we have reached the 1936th power. By this route we will never have to produce a partial result that exceeds ( $42,987-1$ ) $\times 21=9 \varnothing 2,706$. This 6 -digit integer is within the range of a pocket calculator. However, multiplying 1936-1=1935 times and reducing the result modulo 42,987 each time is still more tedious than necessary.

At this point it is useful to note that we can reach large powers rapidly by repeated squaring. That is, to raise a number 'a' to the power $2{ }^{b}$ we just square 'a' 'b' times:

$$
a^{16}=a^{\left(2^{4}\right)}=\left(\left(\left(a^{2}\right)^{2}\right)^{2}\right)^{2}
$$

That is, we can get the 16 th power by 4 multiplications rather than by $16-1=15$ multiplications.

$$
\begin{aligned}
& 1936 \text { in binary is } \quad \begin{array}{rrrrrrrrrrr}
10 & 9 & 8 & 7 & 6 & 5 & 4 & 3 & 2 & 1 & \emptyset \\
& 1 & 1 & 1 & 1 & 0 & 0 & 1 & 0 & 0 & 0
\end{array} 0 \\
& \text { Thus } a^{1936}=a^{20} \cdot a^{2} \cdot a^{2} \cdot a^{2} \cdot a^{2^{4}}
\end{aligned}
$$

which we can calculate by only $(10+9+8+7+4)+4=42$ multiplications and by far fewer multiplications if we can use the calculation of the smaller powers as a start towards the larger powers.

All of this can be consolidated in a short BASIC language program that calculates integer powers by converting the power to binary and accumulating the power by the squaring method while building the larger terms from the smaller:
$2 \emptyset \emptyset \emptyset$ 'Integer Power Subroutine
$2 \emptyset 2 \emptyset$ 'Calculates $A=B A S E-P W R$ mod $M$
$2 \emptyset 4 \emptyset$ 'Alters BASE \& PWR
$2 \emptyset 6 \emptyset$ '
$2 \emptyset 8 \emptyset A=1$ ' initial partial result
$21 \emptyset \emptyset$ IF PWR MOD $2=1$ THEN $A=$ (A*BASE) MOD $M$
' update partial result if next bit of PWR is one
$212 \emptyset$ PWR=INT(PWR/2)' right shift pWr
$214 \emptyset$ IF PWR= $\emptyset$ THEN RETURN' done if all bits of PWR have been used
$216 \emptyset$ BASE $=(B A S E A B A S E)$ MOD M' square again
$218 \emptyset$ GOTO $21 \varnothing \emptyset$

By the use of this program or other means we obtain

$$
A_{2}=4 \emptyset 95 \quad * *
$$

And similarly we get $A_{1}=\left(\frac{42,987}{21}\right)^{12}$ modulo 42,987

$$
=2047^{12} \text { modu } 1042,987
$$

$$
=38,893 \quad * *
$$

Note that $A_{1}+A_{2}=42,988=1$ modu 1042,987 (as is expected for valid coefficients). Of course, if we were absolutely sure of the value for $A_{2}$, we could have obtained $A_{1}$ from $42,988-A_{2}$.

Thus, if the error location was known to be $16 \bmod 21$ and $1128 \bmod 2 \varnothing 47$ then the location is

$$
\begin{aligned}
L & =(38,893 * 16+4 \emptyset 95 * 1128) \text { modulo } 42,987 \\
& =\underline{\left.4 \emptyset, \emptyset 21 \quad \text { (Checking: } \quad \begin{array}{rl}
4 \emptyset, \emptyset 21 & =1,9 \emptyset 5 * 21+16 \\
\& 4 \emptyset, \emptyset 21 & =19 * 2 \emptyset 47+1128
\end{array}\right)}
\end{aligned}
$$

** Note: Am952ø/Am8065 data sheets show $A_{1}$ and $A_{2}$ as above but multiplied by 8 for convenience in using byte counts rather than bit counts in calculating error locations.


[^0]:    * "Microprocessor Architecture Suits Bit-Mapped Graphics"
    by Paul Chu and Warren Miller, Electronic Design, Jan.2ø,1983 p. 143

[^1]:    * It is assumed that the previous instruction could produce no change in the counter or could only decrement the counter

