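An article on the criteria for evaluating DSP chips. I found it interesting, so I'm copying it here for the FET folks to read, especially anyone who wants to dig deeper into DSP. It's a bit long, so those who are interested will read it all; otherwise, just consider it me spamming four more posts.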
Benchmarking microprocessors for high-end signal processing
By Stephen Paavola
The advent of the PowerPC microprocessor with AltiVec technology made general-purpose microprocessors into viable candidates for high-end signal-processing applications. The original PowerPC, which Motorola and IBM had jointly designed, combined relatively low power for a general-purpose microprocessor with computational performance well beyond what its clock frequencies would suggest. However, it was the G4 generation of AltiVec microprocessors, with its superscalar architecture and integrated Single Instruction, Multiple Data (SIMD) capabilities, that made the architecture competitive for digital signal processing (DSP).
Today there are a number of general-purpose microprocessor architectures that, while not designed for high-end signal processing, might provide the processing performance required for complex radars, leading-edge semiconductor inspection systems, and other such demanding applications. But how well does each really perform as a digital signal processor? And how do these contenders compare for different kinds of operations on vectors of various lengths running out of different levels of the memory subsystem?
To answer these questions, engineers from SKY Computers developed simple benchmarks to gauge the memory bandwidth and computational performance of the Motorola 7447 PowerPC, IBM 970 PowerPC, AMD Opteron, and Broadcom MIPS-based 1250 chips for common types of radar and signal-intelligence processing.
The bottom line of the benchmarks completed to date is that the PowerPC with AltiVec produces impressive computational performance compared to the other processors considered. Motorola, now Freescale, has updated the performance of the 74xx several times since its introduction. Apple uses the 74xx in its G4 systems, as well as in its laptop computers. Apple laptops currently on the market run nearly five times faster than the first G4 systems, due to a redesign of the PowerPC 7400’s pipeline for the PowerPC 7445/7455. Further, IBM is now also shipping a PowerPC with AltiVec capability based on its Power architecture.
Yet, despite the strengths of AltiVec, the benchmarks revealed that the alternative processors offer some interesting capabilities for particular types of signal processing. These include applications where memory bandwidth, for example, may be more important than sheer speed, or where parts count is limited.
The basic benchmarks run were:
+ Memory-read bandwidth
+ Vector multiply
+ Fast Fourier Transform (FFT)
+ Basic signal processing
Table 1 lists the basic characteristics of the benchmarked microprocessors. The following figures indicate a sampling of the results. Some of the results were exactly as expected, but some surprises also emerged.
Benchmark #1: Memory-read bandwidth
To measure the basic memory-read bandwidth of the processors, engineers used a trivial vector-sum computation on vectors ranging from 1 KB to 8 MB in length. In this simple benchmark, as in the others, all of the processors suffered a definite step down in bandwidth when the vector length exceeded the L1 cache size, requiring access to L2 cache. Likewise, performance declined further when a vector overran the L2 cache, requiring access to DRAM main memory.
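As a rough illustration of the size sweep described above, the following Python sketch generates vector lengths from 1 KB to 8 MB and names the memory level each one would run out of. The power-of-two step and the cache sizes in the usage note (32 KB L1, 512 KB L2) are assumptions made for the example; the article does not state the step, and the real cache sizes belong to Table 1.

```python
def sweep_sizes(min_bytes=1024, max_bytes=8 * 1024 * 1024):
    """Return vector lengths in bytes from min_bytes to max_bytes inclusive.

    Doubling the size at each step is an assumption; the article only
    gives the 1 KB and 8 MB endpoints.
    """
    sizes = []
    size = min_bytes
    while size <= max_bytes:
        sizes.append(size)
        size *= 2
    return sizes


def memory_level(vector_bytes, l1_bytes, l2_bytes):
    """Name the fastest memory level that can hold the whole vector."""
    if vector_bytes <= l1_bytes:
        return "L1"
    if vector_bytes <= l2_bytes:
        return "L2"
    return "DRAM"
```

With the placeholder sizes, `memory_level(64 * 1024, 32 * 1024, 512 * 1024)` returns `"L2"`, which is the kind of boundary where the step down in bandwidth would be expected to appear.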
The benchmark operation consisted of performing a vector sum on the first byte of every 32-byte cache line and storing the result in a register, discarding most of the data from the cache line. Engineers chose this “for-loop” methodology because the benchmark was intended to measure bandwidth, not computational performance.
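The access pattern just described can be sketched in a few lines of Python. This is only an illustration of the loop, not the SKY Computers code: the function names are hypothetical, and counting every touched cache line as a full 32 bytes transferred is an assumption about how the bandwidth figure would be derived. In an interpreted language the loop overhead dominates, so the number it produces says nothing about the hardware; a real measurement would use a compiled loop.

```python
import time

CACHE_LINE_BYTES = 32  # cache-line size stated in the article


def strided_sum(buf, stride=CACHE_LINE_BYTES):
    """Sum the first byte of every cache line, ignoring the rest of the line."""
    total = 0
    for i in range(0, len(buf), stride):
        total += buf[i]
    return total


def read_bandwidth_mb_per_s(vector_bytes, repeats=10):
    """Estimate read bandwidth in MB/s for a vector of the given length.

    Assumption: each touched line counts as a full line fetched, so one
    pass over the buffer is credited with vector_bytes of traffic.
    """
    buf = bytearray(vector_bytes)
    start = time.perf_counter()
    for _ in range(repeats):
        strided_sum(buf)
    elapsed = time.perf_counter() - start
    return (vector_bytes * repeats) / elapsed / 1e6
```

Because only one byte per line is added, the arithmetic cost stays small relative to the memory traffic, which is why this kind of loop suits a bandwidth test rather than a compute test.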
As expected (refer to Figure 1), all of the processors provided their best bandwidth when they ran out of L1 cache, with different degrees of slowdown when they accessed L2 cache and then DRAM. Not surprisingly, the 800-MHz Broadcom BCM1250, with the lowest operating frequency of the group, also had the lowest memory bandwidth, whether the access was to L1 or L2 cache. Even though this dual-processor chip has integrated memory controllers, it still lagged behind the other processors when it accessed DRAM in this particular benchmark. Nevertheless, the 1250 suffered less of a slowdown than several of the other processors as it moved from L2 cache to main memory.
Benchmark engineers tested several PowerPC 744x microprocessors, although for clarity, Figure 1 shows results for only one of them. Performance when operating out of cache was largely as expected, with memory bandwidth to cache being directly proportional to the clock speed of the processor. The drop in performance was quite clear when the vector size overflowed the L1 cache. The drop was almost as dramatic when the L2 cache overflowed, although performance for the 512-KB vectors was lower than expected. The surprises in this benchmark were in the behaviors of the Opteron and PowerPC 970 processors, both 1.8-GHz parts.
The Opteron chip, for example, had by far the best bandwidth of the group for this benchmark when it operated out of L1 cache, but its DRAM bandwidth was only marginally better than the alternatives, and its L2 bandwidth lagged all of the other processors except the Broadcom BCM1250.
The biggest surprise, however, was in the behavior of the 970, which had a very fast clock. In this benchmark, the 970 demonstrated the second-slowest L1 bandwidth of the group, despite running at almost twice the operating frequency of, for example, the 1-GHz PowerPC 7447.
On the other hand, the benchmark results clearly showed the superior efficiency of the 970’s L2 cache and automatic pre-fetch engines for this particular test. The bandwidth fall-off between the L1 and L2 caches of this processor was quite minor, whereas the bandwidth of all the other processors in the group fell substantially when vector length forced an L2 access. The 970’s pre-fetch engines analyze the memory-access behavior of the application and, if the accesses are regular enough, start fetching data from memory before the application requests it.
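To make the idea of "regular enough" accesses concrete, here is a toy Python model of a stride detector. It is a sketch under stated assumptions, not a description of the 970's actual pre-fetch hardware, whose details the article does not give: the function name, the `confirm` threshold, and the single-stream design are all hypothetical.

```python
def simulate_stride_prefetch(addresses, confirm=2):
    """Count accesses that a simple stride detector would have pre-fetched.

    Toy model (an assumption, not the 970's real logic): once the same
    stride has been seen `confirm` times in a row, the next address at
    that stride is pre-fetched. An access counts as a hit when it matches
    the address pre-fetched after the previous access.
    """
    hits = 0
    last = None        # previous address seen
    stride = None      # most recent address delta
    streak = 0         # how many times in a row that delta has repeated
    predicted = None   # address currently pre-fetched, if any
    for addr in addresses:
        if predicted is not None and addr == predicted:
            hits += 1
        if last is not None:
            new_stride = addr - last
            if new_stride == stride:
                streak += 1
            else:
                stride = new_stride
                streak = 1
        last = addr
        predicted = addr + stride if streak >= confirm else None
    return hits
```

A cache-line-strided walk such as `range(0, 3200, 32)` is predicted on all but its first three accesses in this model, while an irregular address sequence never confirms a stride and gets no pre-fetch hits. That matches the article's observation that the benefit depends on how regular the access pattern is.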