# Inside the StarCore SC140



A Technical Evaluation by the staff of Berkeley Design Technology, Inc.

The following is excerpted and abridged from BDTI's report, *Inside the StarCore SC140*.

Contents of this excerpt include:

- Introduction
- Scope
- The SC140 Core
- Benchmark Performance:
  - Sample Execution Time Results
  - Sample Memory Usage Results
- Summary

The complete report may be ordered from BDTI. Details are on the order form at the end of this excerpt.

## Introduction

In 1998, Lucent Technologies and Motorola announced the formation of a joint development center dubbed "StarCore." StarCore's primary stated goal is to develop next-generation DSP processor cores which will be used by both Lucent and Motorola in their own chip-level products. In 1999, StarCore announced its first new architecture: the SC100. The first implementation of the SC100 architecture is the SC140 core, and this core is the focus of *Inside the StarCore SC140*, a technical report published by BDTI in May 2000.

StarCore states that the SC100 architecture is scalable, and that it expects to create other cores based on the SC100 architecture with different complements of execution units than are included in the SC140 core. These cores may be assembly-code compatible with the SC140, providing an upgrade path for customers.

At the time this report was published, only Motorola had announced a product based on the SC140. This product, Motorola's MSC8101, was announced in late 1999. *Inside the StarCore SC140* evaluates the SC140 core and the MSC8101 chip.

In *Inside the StarCore SC140*, the technical staff of BDTI evaluates the DSP performance of the SC140 (and MSC8101) and explores how the SC140 architecture addresses the needs of DSP applications. The report includes both a detailed qualitative analysis of the SC140's architecture, and a quantitative evaluation based on the performance of the SC140 and MSC8101 on a series of DSP benchmarks developed by BDTI.

At the time this report was written, in mid-2000, initial MSC8101 devices based on the SC140 were expected to run at speeds of 300 MHz using a 1.5-volt supply. StarCore has fabricated an SC140-based evaluation chip that executes at 300 MHz, on which BDTI verified its benchmark timing.

The SC140 is notable for its extremely high level of parallelism, even in comparison to other VLIW-based processors. The core provides this parallelism while targeting very low power consumption; it is the first VLIW-based DSP processor to attempt to combine low power consumption with very high performance. At its 300 MHz clock rate, the SC140 is currently the fastest general-purpose DSP processor to be demonstrated in silicon. The SC140 core is targeted at high-performance applications, such as cellular base stations and gateways, and at portable applications such as cellular terminal devices.

## Scope

*Inside the StarCore SC140* is intended for anyone interested in understanding the DSP performance and capabilities of the SC140 or SC140-based products. It assumes a basic knowledge of DSP processor concepts and terms, both of which are covered in BDTI's text, *DSP Processor Fundamentals*. *Inside the StarCore SC140* is especially useful for electronic system designers, hardware and software engineers, processor designers, engineering managers, and product marketing managers. It will aid in the assessment of the SC140's suitability for a given application, and will allow engineers and systems designers to make informed decisions when considering the SC140 for their latest designs.

For comparison purposes, this report includes brief analyses of several other processors: Lucent Technologies' DSP16xxx and Texas Instruments' TMS320C54xx and TMS320C62xx. These processors have been included to give the reader insight into how the SC140 compares to other well-known DSP architectures.

## About BDTI

Berkeley Design Technology, Inc. (BDTI) was founded in 1991 to assist companies in creating, selecting, and using DSP technology. The technical staff of BDTI has extensive experience in the development of DSP-intensive software and hardware for commercial applications. BDTI offers a variety of technical products and services, including:

- Published reports on DSP processors and technology
- DSP software development services
- Technical advisory services
- Training

## The SC140 Processor Core

The SC140 is a VLIW architecture, and can execute up to six instructions at a time. Instructions that are grouped for parallel execution are referred to as an "execution set" by StarCore. Instructions are scheduled for parallel execution at compile time by code-generation tools or by the assembly-language programmer.

The SC140 contains four 16-bit data paths, each of which contains a combined ALU/MAC/bit-field unit (BFU). The BFU contains a 40-bit barrel shifter. All of the data paths are identical, and share a common set of 16 source and destination registers.

The MAC units, ALUs, and BFUs that comprise each of the four data paths are not independent (in contrast to other VLIW processors, which typically have independent MAC, ALU, and shifter units). Hence, it is not possible, for example, to issue a set of instructions that uses all four MAC units and also one of the BFUs. For this reason, in each group of six instructions executed in parallel, only four can use the data paths. The remaining two instructions in an execution set can use the address generation unit to perform data moves, pointer arithmetic, or bit mask operations; or they can specify program flow-control instructions.

The SC140's four data paths can each perform single-cycle 16×16-bit multiplications. The multipliers support all combinations of signed and unsigned operands, and support fractional and integer formats (both operands must use the same format: both integer or both fractional). Each data path supports SIMD-style addition and subtraction (using the ADD2 and SUB2 instructions) by treating values in registers as packed pairs of 16-bit operands. Across its four data paths, the SC140 can therefore perform eight 16-bit additions per instruction cycle using SIMD operations.
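The packed-pair idea behind ADD2 can be illustrated with a small model. This is a sketch under stated assumptions, not the SC140's defined instruction behavior: it models a register as a 32-bit value holding two 16-bit lanes, lets each lane wrap on overflow, and ignores any saturation or flag behavior the real instruction may have. The function name `add2` is hypothetical.

```python
# Hypothetical model of an ADD2-style packed add. A 32-bit value holds two
# 16-bit lanes; each lane is added independently, carries never cross from
# the low lane into the high lane, and overflow simply wraps. Saturation or
# flag behavior of the real SC140 instruction is NOT modeled.
MASK16 = 0xFFFF

def add2(a: int, b: int) -> int:
    """Add the low and high 16-bit halves of a and b independently."""
    lo = ((a & MASK16) + (b & MASK16)) & MASK16
    hi = (((a >> 16) & MASK16) + ((b >> 16) & MASK16)) & MASK16
    return (hi << 16) | lo

DATA_PATHS = 4       # from the text: four identical data paths
LANES_PER_ADD2 = 2   # two packed 16-bit operands per register

def additions_per_cycle() -> int:
    """One packed add per data path per cycle, two lanes each."""
    return DATA_PATHS * LANES_PER_ADD2

# Low lane: 0xFFFF + 0x0001 wraps to 0x0000; high lane: 1 + 2 = 3.
print(hex(add2(0x0001FFFF, 0x00020001)))  # 0x30000
print(additions_per_cycle())              # 8
```

The key property is lane isolation: a carry out of the low 16 bits is discarded instead of incrementing the high lane, which is what lets one register hold two independent operands.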

## High Data Memory Bandwidth

The SC140 has two 32-bit address buses and two 64-bit data buses for transferring data. Instructions are fetched via a 32-bit address bus and a 128-bit data bus. Program and data memory is unified; any address can contain either instructions or data.

The SC140 can perform two data reads, two data writes, or one read and one write per instruction cycle. Each read or write can access contiguous groups of data up to 64 bits wide. On a 300 MHz SC140, the maximum on-core data memory bandwidth is therefore 2400 million 16-bit words/second. The SC140 has much higher on-chip data memory bandwidth than most other DSP processors. Its memory bandwidth should be sufficient to keep the execution units supplied with data and avoid data memory bottlenecks when the processor uses data in on-chip memory.
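The 2400 million words/second figure follows directly from the numbers above: two accesses per cycle, each up to 64 bits (four 16-bit words), at 300 MHz. A minimal sketch of that arithmetic (the function name is illustrative):

```python
# Peak on-core data bandwidth from the figures quoted in the text:
# two data accesses per cycle, each up to 64 bits wide, at 300 MHz.
def peak_data_bandwidth_words(clock_hz: int, accesses_per_cycle: int = 2,
                              access_bits: int = 64, word_bits: int = 16) -> int:
    """Return peak bandwidth in word_bits-wide words per second."""
    words_per_access = access_bits // word_bits  # 64 / 16 = 4 words
    return clock_hz * accesses_per_cycle * words_per_access

bw = peak_data_bandwidth_words(300_000_000)
print(f"{bw / 1_000_000:.0f} million 16-bit words/s")  # 2400 million 16-bit words/s
```

This is a peak figure: it assumes every access uses the full 64-bit width, which in turn depends on data being laid out contiguously.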

## Instruction Set

The SC140 can fetch eight 16-bit instruction words per cycle and can execute up to six instructions in parallel (the remaining two words can be used for prefixes, described below, or for immediate values). Each instruction in the execution set uses one execution unit.

Two different methods are used for specifying which instructions will be included in an execution set: serial grouping and prefix grouping.

Serial grouping uses the two most significant bits in the instruction to determine the end of an execution set.

Prefix grouping adds a one-word or two-word prefix to an execution set. The prefix defines how many instructions are included in the execution set, and also contains information used for conditional execution and looping. Prefix grouping must be used if instructions are to be executed conditionally.
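Serial grouping amounts to scanning the instruction stream and closing an execution set whenever an end-of-set marker is seen. The excerpt does not describe how the SC140 actually encodes its two grouping bits, so the sketch below **assumes, purely for illustration,** that a hypothetical flag in bit 15 marks the last word of each set. Only the partitioning idea carries over to the real processor.

```python
from typing import List

# Illustrative only: the real SC140 encoding of its two grouping bits is not
# described here. We ASSUME (hypothetically) that bit 15 of a 16-bit word is
# set on the last word of an execution set.
END_OF_SET = 0x8000  # hypothetical end-of-set marker bit

def split_execution_sets(words: List[int]) -> List[List[int]]:
    """Partition a stream of 16-bit words into execution sets."""
    sets: List[List[int]] = []
    current: List[int] = []
    for w in words:
        current.append(w)
        if w & END_OF_SET:  # hypothetical marker closes the current set
            sets.append(current)
            current = []
    if current:
        raise ValueError("instruction stream ends mid-execution-set")
    return sets

print(len(split_execution_sets([0x0001, 0x8002, 0x8003])))  # 2
```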

Instructions within the same execution set always start execution at the same time. A new execution set begins execution only after all instructions belonging to previous execution sets are completed. Therefore, the time required to complete an execution set is determined by the instruction in the set that requires the most time.
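The timing rule above can be written as a simplified cost model: each execution set costs as many cycles as its slowest instruction, and sets run back to back. This sketch ignores pipelining overlap and stalls, and the per-instruction cycle counts are hypothetical inputs, not SC140 data.

```python
from typing import Sequence

# Simplified model of the rule in the text: all instructions in a set start
# together, and the next set starts only after every instruction in the
# current set has finished. Per-instruction cycle counts are hypothetical.
def set_cycles(instr_cycles: Sequence[int]) -> int:
    """A set takes as long as its slowest instruction."""
    return max(instr_cycles)

def program_cycles(execution_sets: Sequence[Sequence[int]]) -> int:
    """Sets execute sequentially, so their costs add."""
    return sum(set_cycles(s) for s in execution_sets)

# First set: four 1-cycle instructions and one 2-cycle instruction.
print(program_cycles([[1, 1, 1, 1, 2], [1, 1]]))  # 3
```

A practical consequence is that grouping one slow instruction with several fast ones stretches the whole set.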

The SC140 instruction set is quite orthogonal, because most of the instructions are simple and specify a single operation. Unlike some VLIW processors, the SC140's instruction set is composed of relatively short (16-bit) instructions whose functionality can be extended using prefixes or extensions. Short instructions often require processor architects to place restrictions on, e.g., register usage; the SC140 avoids the need for such restrictions by using prefix words where needed. With the exception of special instructions for Viterbi decoding, all SC140 instructions can use all registers without restriction. There are, however, a number of restrictions on grouping instructions into execution sets, and these complicate assembly language programming.

## Pipeline

The SC140 processor uses a five-stage pipeline consisting of pre-fetch, fetch, dispatch, address generation, and execute stages. The SC140 pipeline is not interlocked; however, the assembler detects pipeline hazards and issues warnings. In comparison to other VLIW-based DSP processors, such as Carmel and the TMS320C62xx, the SC140's pipeline is quite short. The SC140's five-stage pipeline is benign and does not seriously complicate programming; pipeline hazards can be avoided with very little programming effort.

## Addressing

The SC140 provides one address generation unit (AGU) that contains two address arithmetic units (AAUs), a bit mask unit (BMU), and a set of addressing registers. The AGU can generate two addresses per instruction cycle and provides 16 primary address registers (R0-R15). Like many DSP processors, the SC140 can reach its maximum memory bandwidth only when data is arranged in groups of contiguous words in memory, since it can generate only two addresses at a time.
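Why contiguity matters can be shown with a rough cycle count. The sketch assumes only the figures given in this excerpt (two addresses per cycle, up to four 16-bit words per 64-bit access) and ignores alignment rules, bank conflicts, and other real constraints.

```python
# Rough model: the AGU issues two addresses per cycle, and each access can
# return up to four contiguous 16-bit words (64 bits). Scattered words need
# one access each. Alignment and memory-bank effects are ignored.
ADDRESSES_PER_CYCLE = 2
WORDS_PER_ACCESS = 4  # 64-bit access / 16-bit words

def ceil_div(a: int, b: int) -> int:
    return -(-a // b)

def load_cycles(n_words: int, contiguous: bool) -> int:
    """Cycles needed to read n_words 16-bit words under this model."""
    words_per_address = WORDS_PER_ACCESS if contiguous else 1
    accesses = ceil_div(n_words, words_per_address)
    return ceil_div(accesses, ADDRESSES_PER_CYCLE)

print(load_cycles(64, contiguous=True))   # 8
print(load_cycles(64, contiguous=False))  # 32
```

Under this model, scattered data costs four times as many cycles as contiguous data, because each generated address then brings in one word instead of four.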

## Benchmark Performance

*Inside the StarCore SC140* includes extensive benchmark results, used to quantitatively evaluate the processor's DSP performance. For each benchmark, BDTI reports cycle counts, execution times, energy consumption, cost-performance, and memory usage. BDTI also provides extensive analysis of why the processors perform as they do. In this section, we present sample execution time and memory usage results excerpted from the complete set of results in the report.

## Execution Time

The execution time for a BDTI Benchmark function is defined as the amount of time required by the processor to complete the benchmark's initialization, kernel, and termination sections. To determine the execution time of a particular benchmark on a given processor, the number of instruction cycles the processor requires to execute the benchmark is multiplied by the processor's instruction cycle time. *Inside the StarCore SC140* includes tables and charts illustrating the number of cycles required by each processor to execute each benchmark, and uses these results to generate corresponding tables and charts of execution times at a specified clock speed. For the SC140, the clock speed used for determining execution times is the projected speed for Motorola's first SC140-based chip, the MSC8101.
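The conversion from cycle counts to execution time is a single multiplication by the cycle time (equivalently, a division by the clock rate). In the sketch below, the 3,000-cycle count is a hypothetical input, not a BDTI benchmark result.

```python
def execution_time_us(cycles: int, clock_mhz: float) -> float:
    """Execution time = cycles x instruction cycle time (1 / clock).

    With the clock given in MHz, cycles / clock_mhz yields microseconds.
    """
    return cycles / clock_mhz

# Hypothetical 3,000-cycle kernel at the MSC8101's projected 300 MHz clock.
print(execution_time_us(3000, 300.0))  # 10.0
```

Because clock speed enters directly, two processors with equal cycle counts can differ widely in execution time, which is why the report presents both metrics.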

## About the BDTI Benchmarks™

The BDTI Benchmarks are a set of DSP software functions that BDTI has independently designed to provide an objective basis for comparing processor performance characteristics such as speed and memory use for DSP applications. The BDTI Benchmark functions are implemented in assembly language to allow a realistic assessment of processors' DSP performance. The resulting software is then verified for functional correctness, optimality, and adherence to the BDTI Benchmark specifications. Benchmark performance results are obtained through manual analysis and careful, detailed simulation, or by measurement on sample devices.

## Sample Benchmark Results

The execution time results for BDTI's Viterbi decoder benchmark are shown in the accompanying figure. As illustrated in this figure, the MSC8101 (Motorola's SC140-based chip) is significantly faster on this benchmark than any of the other processors shown. The SC140 core has instructions dedicated to Viterbi decoding (described in more detail in the full report), and is able to efficiently implement the addition-intensive portion of the algorithm via its support for eight 16-bit additions per cycle. The processor's use of multiple execution units and specialized instructions results in an extremely low cycle count for the Viterbi benchmark: the MSC8101 consumes fewer than one-third of the cycles required by the TMS320C6203. Its architectural efficiency combined with its high clock rate enables the MSC8101 to achieve a very strong result on this benchmark.
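The addition-intensive core of a Viterbi decoder is the add-compare-select (ACS) step. The sketch below is a generic textbook ACS butterfly with a symmetric branch metric, as commonly used for rate-1/2 codes. It is not BDTI's benchmark code and does not model the SC140's dedicated Viterbi instructions. It shows why plentiful parallel 16-bit additions help: each butterfly needs four independent additions before its two compares.

```python
from typing import Tuple

# Generic add-compare-select (ACS) butterfly, a common textbook formulation.
# NOT BDTI's benchmark code and NOT a model of SC140 Viterbi instructions.
def acs_butterfly(pm0: int, pm1: int, bm: int) -> Tuple[int, int, int, int]:
    """Update two path metrics from two predecessor states.

    pm0, pm1: path metrics of the two predecessor states
    bm: branch metric; its negation is used on the complementary branches
    Returns (new_a, new_b, decision_a, decision_b); the larger metric wins.
    """
    a0, a1 = pm0 + bm, pm1 - bm  # four independent additions ...
    b0, b1 = pm0 - bm, pm1 + bm
    new_a, dec_a = (a0, 0) if a0 >= a1 else (a1, 1)  # ... then compare/select
    new_b, dec_b = (b0, 0) if b0 >= b1 else (b1, 1)
    return new_a, new_b, dec_a, dec_b

print(acs_butterfly(10, 4, 3))  # (13, 7, 0, 0)
```

A full decoder repeats this butterfly across all state pairs for every received symbol, so the additions dominate the arithmetic workload.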

## Memory Usage

Speed is often the first metric designers use to compare processors. Memory use is also of interest, however, for several reasons. For example, memory use may have a significant impact on overall system cost. Memory use can also affect processors' performance; if application software and data cannot fit entirely in on-chip memory, significant performance degradation may occur on many processors. Because of these and other factors, memory use is an important metric for processor selection. For each benchmark, BDTI reports each processor's program, constant data, and nonconstant data memory use.

Most of the BDTI Benchmarks are optimized first for maximum speed, then for minimum memory usage, because this is usually the order of priorities in DSP applications. The exception to this rule is the Control benchmark, described below.

## Control Benchmark

The BDTI Benchmarks include one benchmark function specifically designed to evaluate memory use for control-oriented programs. Control-oriented code usually takes up the bulk of a DSP application's memory requirements but only a small fraction of the application's processing time. Thus, in control-oriented code, memory use is usually a more serious concern than execution speed.

BDTI's Control benchmark is designed to be representative of control-oriented code. The primary goal for programmers implementing the Control benchmark is minimum memory use, and the secondary goal is execution speed. This optimization hierarchy mirrors the hierarchy generally followed by control-code programmers. Note that memory-use results on the Control benchmark are not necessarily indicative of processors' memory use in signal-processing-intensive code.

## Sample Benchmark Results

The memory usage results for BDTI's Control benchmark are shown in the figure at right. The SC140's memory usage is surprisingly low, considering that VLIW processors often have significantly higher memory usage than conventional DSP processors. The SC140 achieves its memory efficiency via its use of short 16-bit instruction words coupled with its ability to use prefix words when needed. In addition, the SC140 has a more diverse, powerful, and orthogonal instruction set compared to the other processors benchmarked here, which results in a reduction in the number of instructions used to perform a task.

## Summary

Based on the processor's results on BDTI's DSP benchmark suite, the SC140 has succeeded on several fronts. At its projected clock speed of 300 MHz, the MSC8101 will provide execution-time performance beyond that of any mainstream DSP processor currently available. Energy consumption results included in the report (based on Motorola's power projections for the MSC8101) indicate that the processor should also achieve surprisingly strong energy efficiency. And finally, memory usage results for the Control benchmark indicate that the SC140's control code density is likely to be very good, particularly for a VLIW processor.

As with the TMS320C62xx, compiler quality will likely prove to be a key issue in determining whether the SC140 succeeds in the market. TI, for example, has a three-year head start with its TMS320C62xx compiler, and offers mature software development tools. As with any new architecture, developing a rich palette of tools and application software, including third-party offerings, will be a major challenge for StarCore.

**Order Form**

*Inside the StarCore SC140: A BDTI Technical Evaluation*

| Description | Qty | Price | Amount |
|---|---|---|---|
| First copy | × | \$950 | = |
| Additional copies | × | \* | = |
| Tax (for CA orders) | | | = |
| Shipping & Handling (for int'l, add \$60) | | | |
| TOTAL | | | |

\*Additional copies may be purchased at a substantial discount. Contact BDTI for details.

**Payment**

International orders must be prepaid in US dollars.

- Check enclosed, payable to Berkeley Design Technology, Inc.
- Purchase order attached, number: ________ (credit approval required)
- Wire transfer (contact BDTI for instructions)

**Customer information**

- Name:
- Title, Division:
- Company:
- Address:
- City, State, Zip, Country:
- Tel: / Fax:
- Email:

Mail this form along with a check, or fax it with a purchase order, to:

Berkeley Design Technology, Inc.
2107 Dwight Way, Second Floor
Berkeley, CA 94704 USA
Tel: (510) 665-1600
Fax: (510) 665-1680
Email: info@BDTI.com