Topic: Benchmarking microprocessors for high-end signal processing

  1. #1
    nothingtolose (Executive committee member, Electronics and Telecommunications faculty board)
    Join date: Sep 2004

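    An article on benchmarks for evaluating DSP chips. I thought it was good, so I'm copying it here for the FET folks to read, especially anyone who wants to dig deeper into DSP. It's a bit long, so those who are interested will read it all; otherwise, just consider it me spamming four more posts.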

    Benchmarking microprocessors for high-end signal processing
    By Stephen Paavola

    The advent of the PowerPC microprocessor with AltiVec technology made general-purpose microprocessors into viable candidates for high-end signal-processing applications. The original PowerPC, which Motorola and IBM had jointly designed, combined relatively low power for a general-purpose (GP) microprocessor with computational performance well beyond what its clock frequencies would indicate. However, the G4 generation of AltiVec microprocessors, with its super-scalar architecture and integrated Single Instruction, Multiple Data (SIMD) capabilities, is what made the architecture competitive for digital signal processing (DSP).

    Today a number of general-purpose microprocessor architectures, while not designed for high-end signal processing, might provide the processing performance required for complex radars, leading-edge semiconductor inspection systems, and other demanding applications. But how well does each really perform as a digital signal processor? And how do these contenders compare for different kinds of operations on vectors of various lengths running out of different levels of the memory subsystem?

    To answer these questions, engineers from SKY Computers developed simple benchmarks to gauge the memory bandwidth and computational performance of the Motorola 7447 PowerPC, IBM 970 PowerPC, AMD Opteron, and Broadcom MIPS-based 1250 chips for common types of radar and signal-intelligence processing.

    The bottom line of the benchmarks completed to date is that the PowerPC with AltiVec produces impressive computational performance compared to the other processors considered. Motorola, now Freescale, has updated the performance of the 74xx several times since its introduction. Apple uses the 74xx in its G4 systems, as well as in its laptop computers. Apple laptops currently on the market are running nearly five times faster than the first G4 systems, due to a redesign of the PowerPC 7400’s pipeline for the PowerPC 7445/7455. Further, IBM is now also shipping a PowerPC with AltiVec capability based on its POWER architecture.

    Yet, despite the strengths of AltiVec, the benchmarks revealed that the alternative processors offer some interesting capabilities for particular types of signal processing. These include applications where memory bandwidth, for example, may be more important than sheer speed, or where parts count is limited.

    The basic benchmarks run were:

    + Memory-read bandwidth
    + Vector multiply
    + Fast Fourier Transform (FFT)
    + Basic signal processing

    Table 1 lists the basic characteristics of the benchmarked microprocessors. The following figures indicate a sampling of the results. Some of the results were exactly as expected, but some surprises also emerged.

    Benchmark #1: Memory-read bandwidth

    To measure the basic memory-read bandwidth of the processors, engineers used a trivial vector-sum computation for vectors ranging from 1 KB to 8 MB in length. In this simple benchmark, as well as in the others, all of the processors suffered a definite step down in bandwidth when the vector length exceeded the L1 cache size, requiring access to L2 cache. Likewise, performance declined further when a vector overran the size of the L2 cache, requiring access to DRAM main memory.

    The benchmark operation consisted of performing a vector sum on the first byte of every 32-byte cache line and storing the result in a register, discarding most of the data from the cache line. Engineers chose this “for-loop” methodology because the benchmark was intended to measure bandwidth, not computational performance.
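
    As a rough illustration of the loop described above, here is a minimal C sketch. It is not SKY's benchmark code: the function name is hypothetical, and only the 32-byte cache-line size and the "sum the first byte of each line" idea come from the article.

```c
#include <stddef.h>
#include <stdint.h>

#define CACHE_LINE 32  /* cache-line size quoted in the article */

/* Sum the first byte of every cache line in buf[0..len).
   Only one byte per line is used; the rest of each fetched line is discarded,
   so the loop stresses memory reads rather than arithmetic. */
uint32_t sum_first_bytes(const uint8_t *buf, size_t len)
{
    uint32_t sum = 0;
    for (size_t i = 0; i < len; i += CACHE_LINE)
        sum += buf[i];
    return sum;
}
```

    Timing repeated calls over buffers from 1 KB to 8 MB and dividing the bytes scanned by the elapsed time would give a bandwidth curve of the kind the article plots. The timing harness itself is left out here because the article does not describe it.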

    As expected (refer to Figure 1), all of the processors provided their best bandwidth when they ran out of L1 cache, with different degrees of slowdown when they accessed L2 cache and then DRAM. As might have been expected, the 800 MHz Broadcom BCM1250, with the lowest operating frequency of the group, also had the lowest memory bandwidth, whether the access was to L1 or L2 cache. Despite the fact that this dual-processor chip had integrated memory controllers, it still lagged behind the other processors when it accessed DRAM in this particular benchmark. Nevertheless, the 1250 suffered less of a slowdown as it moved from L2 cache to main memory than several of the other processors suffered.

    Benchmark engineers tested several PowerPC 744x microprocessors, although for clarity, Figure 1 shows results for only one of them. Performance when operating out of cache was pretty much as expected, with memory bandwidth to cache being directly proportional to the clock speed of the processor. The change in performance was quite clear once the vector size overflowed the L1 cache. The change was almost as dramatic when the L2 cache overflowed, although performance for the 512-KB vectors was lower than expected. The surprises in this benchmark were in the behaviors of the Opteron and PowerPC 970 processors, both 1.8-GHz parts.

    The Opteron chip, for example, had by far the best bandwidth of the group for this benchmark when it operated out of L1 cache, but its DRAM bandwidth was only marginally better than the alternatives, and its L2 bandwidth lagged all of the other processors except the Broadcom BCM1250.

    The biggest surprise, however, was in the behavior of the 970, which had a very fast clock. In that benchmark, the 970 demonstrated the second-slowest L1 bandwidth of the group, despite having almost twice the operating frequency of the 1-GHz PowerPC 7447.

    On the other hand, the benchmark results clearly showed the superior efficiency of the 970’s L2 cache and automatic pre-fetch engines for that particular test. The bandwidth fall-off between L1 and L2 caches of this processor was quite minor, whereas the bandwidth of all the other processors in the group fell substantially when vector length forced an L2 access. The 970’s pre-fetch engines analyzed the memory access behavior of the application and started fetching data from memory before the application requested it, if the accesses were regular enough.
    Last edited by nothingtolose; 25-06-2005 at 10:12 AM

  2. #2
    nothingtolose (Executive committee member, Electronics and Telecommunications faculty board)
    Join date: Sep 2004

    Benchmark #2: Memory-read bandwidth unrolled

    Why did the 970 fare so poorly in the simple memory-read benchmark? Perhaps it was because, with all of the 970’s functional units and deep pipeline, the test simply did not provide enough data for it to work on and apply all the optimizations it is capable of performing.

    To test this theory, engineers developed a second benchmark to account for this possibility (see Figure 2). The modified benchmark unrolled the loop by a factor of two, with the index bumped to every other cache line (instead of every cache line) and the loop picking up data out of two cache lines (instead of one), giving the processors more work to handle with each loop iteration.
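
    The unrolling described above might look like the following C sketch. The function name, the two separate accumulators, and the handling of a leftover line are assumptions made for illustration; the article only states that each iteration reads from two cache lines and that the index advances by two lines.

```c
#include <stddef.h>
#include <stdint.h>

#define CACHE_LINE 32  /* cache-line size quoted in the article */

/* Unrolled-by-two variant: each iteration reads the first byte of two
   consecutive cache lines and advances the index by two lines. */
uint32_t sum_first_bytes_unrolled2(const uint8_t *buf, size_t len)
{
    uint32_t sum0 = 0, sum1 = 0;  /* independent accumulators (an assumption) */
    size_t i = 0;
    for (; i + CACHE_LINE < len; i += 2 * CACHE_LINE) {
        sum0 += buf[i];
        sum1 += buf[i + CACHE_LINE];
    }
    if (i < len)                  /* odd number of lines: one line left over */
        sum0 += buf[i];
    return sum0 + sum1;
}
```

    The result is the same as in the one-line-per-iteration loop; only the amount of independent work per iteration changes.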

    In this second benchmark, all of the microprocessors in the group ran faster than in the previous benchmark, but the 970 ran dramatically faster. Running two times faster than in the earlier test, the 970 became the second-fastest processor in the group for the unrolled memory read when operating from L1 cache, trailing only the Opteron.

    As for the other processors, the 7447 bandwidth increased just a bit when the loop was unrolled, while the Opteron bandwidth increased a reasonable amount. The effect of the benchmark modification on the bandwidth of the Broadcom BCM1250 was mixed. This chip’s performance changed little when running out of L1 cache, but its bandwidth essentially doubled when running out of L2 and DRAM. This indicated that the BCM1250 had memory-optimization capabilities that the first bandwidth benchmark did not exercise.

    Benchmark #3: Memory-read bandwidth with pre-fetch

    All of the processor architectures selected for this benchmark had some programmable pre-fetch capabilities. This feature allows the application to predict future data requests and issue touch instructions to ask the processor to fulfill the requests in advance. Engineers arbitrarily selected a pre-fetch factor of 3 for this third benchmark (see Figure 3). In this test, the for-loop was not unrolled.
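
    A hedged sketch of what software pre-fetch with a factor of 3 could look like in C. It uses the GCC/Clang builtin `__builtin_prefetch` as a stand-in for the processor-specific touch instructions; the article does not say which instructions or intrinsics the engineers actually used, and the function name is hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

#define CACHE_LINE     32  /* cache-line size quoted in the article */
#define PREFETCH_LINES 3   /* pre-fetch factor chosen in the article */

/* Same scan as the basic benchmark, but each iteration first asks the
   processor to start loading the line that will be needed three
   iterations later. */
uint32_t sum_first_bytes_prefetch(const uint8_t *buf, size_t len)
{
    uint32_t sum = 0;
    for (size_t i = 0; i < len; i += CACHE_LINE) {
        size_t ahead = i + (size_t)PREFETCH_LINES * CACHE_LINE;
        if (ahead < len)
            __builtin_prefetch(buf + ahead, 0, 0);  /* read access, low reuse */
        sum += buf[i];
    }
    return sum;
}
```

    Changing `PREFETCH_LINES` to 7 would correspond to the deeper pre-fetch setting the article mentions for the BCM1250.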

    This benchmark modification had little effect on the behavior of the 970, which had built-in engines for predicting memory requests and always did a lot of work to optimize its memory bandwidth. Nor did the modification much affect the performance of the Opteron. The 7447s, however, suffered a serious slowdown when they operated out of cache.

    The third benchmark dramatically illustrated a major strength of the BCM1250 chip, as its bandwidth in some instances increased substantially when pre-fetching. In Benchmark #1, the chip’s bandwidth started to drop as vector lengths reached 16 KB, and once it began accessing L2 cache, bandwidth dropped precipitously. With Benchmark #3, in stark contrast, L2 cache and DRAM bandwidth suffered far less of a fall-off.

    For 64-KB vectors, for example, the BCM1250 demonstrated more than four times the memory-read bandwidth with three-line pre-fetch than without; and with 512-KB vectors, it achieved nearly a sixfold improvement. The upshot is that, with pre-fetch, the BCM1250 architecture was competitive with the 7447 architectures in DRAM bandwidth. Moreover, increasing pre-fetch to a factor of 7 provided a further increase in bandwidth for the architecture.

    As for the 7447, although there were some dependencies on the system controller chip, the general lesson is that a designer needs to be careful with pre-fetch in this architecture. The advisability of pre-fetching will depend on the algorithm.

    Benchmarks #4 and #5: Complex vector multiplies

    The next tests involved quantifying CPU performance of the microprocessor architectures on complex data types containing both real and imaginary data components. Engineers chose a vector multiply operation of two complex vectors of various lengths, with the data type being a complex, single-precision floating point. The complex data was stored in memory in interleaved form, with the real and imaginary components for each element stored in adjacent memory locations.

    Engineers chose the interleaved format because it is the most common for this class of processors and more challenging than the split format. Split format stores a complex vector as two vectors in memory: the first vector with all of the real values, the second vector with all of the imaginary values. While split format can produce better performance on processors with SIMD capability, engineers chose not to benchmark it.
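
    To make the two layouts concrete, here is a small scalar C sketch of an element-wise complex multiply in each format. The function names are made up for illustration, and this is plain for-loop code rather than the optimized library routines the article refers to.

```c
#include <stddef.h>

/* Interleaved layout: a[2k] is the real part and a[2k+1] the imaginary
   part of element k. Computes c[k] = a[k] * b[k] for n complex elements. */
void cmul_interleaved(const float *a, const float *b, float *c, size_t n)
{
    for (size_t k = 0; k < n; k++) {
        float ar = a[2 * k], ai = a[2 * k + 1];
        float br = b[2 * k], bi = b[2 * k + 1];
        c[2 * k]     = ar * br - ai * bi;
        c[2 * k + 1] = ar * bi + ai * br;
    }
}

/* Split layout: real and imaginary parts live in separate arrays. */
void cmul_split(const float *ar, const float *ai,
                const float *br, const float *bi,
                float *cr, float *ci, size_t n)
{
    for (size_t k = 0; k < n; k++) {
        cr[k] = ar[k] * br[k] - ai[k] * bi[k];
        ci[k] = ar[k] * bi[k] + ai[k] * br[k];
    }
}
```

    In the split version every array is traversed with unit stride and no element mixes real and imaginary data, which is why it tends to map more directly onto SIMD registers; the interleaved version needs the real and imaginary parts separated or shuffled before the multiplies.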

    To understand where the performance steps occur in the graphs, one must understand the dataset sizes. For the complex vector multiplies, each data point or element in a vector is 8 bytes long, so a 1-K vector contains 8 KB of data. A complex vector multiply requires three vectors: thus, 24 KB, equivalent to 3/4 of the PowerPC processors’ L1 cache, is required for a 1-K operation.
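
    The arithmetic above can be written out as a tiny C helper. The function names are hypothetical; the 8 bytes per element and the three vectors come from the article, and the 32-KB L1 size used in the note below is the value implied by "24 KB is 3/4 of the L1 cache".

```c
#include <stddef.h>

#define BYTES_PER_COMPLEX    8u  /* 4-byte real + 4-byte imaginary */
#define VECTORS_PER_MULTIPLY 3u  /* two inputs and one output */

/* Bytes touched by a complex vector multiply on n_elements elements. */
size_t cmul_working_set_bytes(size_t n_elements)
{
    return n_elements * BYTES_PER_COMPLEX * VECTORS_PER_MULTIPLY;
}

/* Nonzero if the whole working set fits in a cache of cache_bytes bytes. */
int fits_in_cache(size_t n_elements, size_t cache_bytes)
{
    return cmul_working_set_bytes(n_elements) <= cache_bytes;
}
```

    For a 1-K operation this gives 24,576 bytes, which fits in a 32-KB cache; a 2-K operation needs 49,152 bytes and does not.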

    Testers ran two variations on this benchmark: one using vanilla C scalar computations with for-loops (see Figure 4); and one using optimized libraries that leverage the processors’ native SIMD capabilities.

    Engineers compiled the C code used in Benchmark #4 with the GNU GCC compiler. For Benchmark #5, they used SKY’s proprietary assembly code to optimize the AltiVec operation for the PowerPC architecture. GNU compiler extensions generated the SSE2 assembly code for the Opteron chip and the packed-single code for the BCM1250.

    Benchmark #4 demonstrated that the two 1.8-GHz chips, the PowerPC 970 and the Opteron, delivered essentially identical performance when they ran scalar code out of L2 cache or DRAM, but the Opteron outpaced the 970 for L1 cache operations. When the code was optimized for SIMD functions, however (Figure 5), the 970 was much faster than the Opteron when they were operating out of L1 cache. For longer vector lengths that required access to L2 cache or DRAM, however, the performance of the two processors was again essentially identical.

    Results indicated that for complex vector multiplies using scalar C code, the PowerPC G4 was much slower than the Opteron or 970 for L1 cache operations, but it beat them both for L2 accesses. Once a 7447 had to go out to main memory, however, its performance dropped precipitously, dipping below that of the BCM1250, which was the worst L1 and L2 performer of the group.

    The 7447 performance improved dramatically when optimized for AltiVec SIMD functions. Operating out of L1 cache on a 1-K complex vector multiply, for example, the 7447’s performance was nearly as good as that of the 970 and better than that of the Opteron. At a 2-K vector length, the larger L1 cache of the Opteron gave it the performance advantage over the 7447. Once the vector size overflowed the L2 cache and moved to DRAM, the 7447 suffered a big decline to the level of the BCM1250 chip.

    It is interesting that the performance of the BCM1250 was as good as it was when operating from memory. Providing the extra work of a complex vector multiply enabled it to take better advantage of its memory architecture than the simple memory-bandwidth benchmark did.

    The performance gap between the Opteron and the PowerPC processors was significant. Despite the Opteron’s fast clock, its SSE2 instruction set lacked AltiVec’s flexibility to rearrange data within the 128-bit SIMD processing element. In addition, the PowerPC was able to rearrange data simultaneously with computation, while the Opteron was not.

  3. #3
    nothingtolose (Executive committee member, Electronics and Telecommunications faculty board)
    Join date: Sep 2004

    Benchmark #6: Working with FFTs

    The complex Fast Fourier Transform (FFT) is among the most common digital signal processing functions. Engineers ran a sixth benchmark (refer to Figure 6) to evaluate the different processors in this realm. They used optimized libraries along with the processors’ SIMD capabilities in all cases. The Broadcom chip doesn’t appear in this chart because it had no optimized library at the time of the test.

    Engineers used an FFT from AMD for the Opteron. They used an FFT by SKY, optimized for the 7447 pipeline, on the PowerPC processors. This benchmark provided an even more dramatic demonstration of the advantage of the PowerPC AltiVec over SSE2. As for the Opteron, it didn’t become competitive until memory bandwidth became the limitation, even in comparison to the much slower 7447.

    Benchmark #7: Digital signal processing

    The final, seventh example in this suite of benchmarks (Figure 7) reveals how three of the processors performed when they ran a simplistic signal processing application. For this test, the assumed source was a digitized sensor such as a radar receiver, which provided 16-bit integer data.

    The benchmark converted the data to float and then performed a forward FFT, followed by a vector multiply and an inverse FFT. This sequence resembles pulse compression in radar, which performs a convolution on the input data, or frequency-domain filtering in signal intelligence. The shape of these curves and the relative performance of the processors were very similar to those in the FFT chart.

    SKY Computers continues to benchmark these and other leading-edge processing solutions to determine their applicability for high-end signal-processing applications. The role of general-purpose microprocessors in digital signal processing will evolve over time. The Broadcom processor roadmap, for example, contains a future quad-processor device. Additionally, other architectures, and perhaps some new entries yet to come, will not be standing still.

    ---------- About author -----------------------------------

    Stephen Paavola is chief technical officer at SKY Computers, Inc. of Chelmsford, Massachusetts. He has held a number of positions at SKY, including director of advanced development, director of marketing, software quality manager, engineering manager, and applications support manager. Paavola spent two years at CETIA (now Thales Computers) as technical support manager as part of the team establishing a U.S. office for this French company. Before joining SKY, Paavola was at Digital Equipment Corporation in home-office software support running an engineering group, was product manager for the RSX-11 real-time operating systems and a personal computer program, and was a product marketing manager. Paavola holds a BS from Caltech in Pasadena, California.

    SKY Computers, Inc., a wholly owned subsidiary of Analogic Corporation, is a leading supplier of standards-based, high-performance embedded computer systems. SKY’s broad range of commercial-off-the-shelf products meets the requirements of high-performance embedded computing applications. SKY combines standards and open-source software with advanced technologies to provide solutions for demanding military and defense electronics, industrial inspection, medical imaging, and security applications. For additional information, visit
