Asynchronous Multiprocessing on CompactPCI

CompactPCI has impressive speed capabilities: a 64-bit data path at 66 MHz gives a peak transfer rate of 528 Mbytes per second. But steep growth in the size and power of real-time machine control, industrial automation, data acquisition, military systems, and telecom applications has created demand for even more bandwidth and more speed. Customers were asking us for a CompactPCI card with a high-performance CPU for real-time, I/O-intensive applications, high throughput for hard disks, "huge" memory, and powerful graphics. The existing cards just didn't have the horsepower needed to meet these demands.
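
The 528 Mbytes/sec figure follows directly from the bus width and the clock rate. A quick check, assuming one full-width transfer per clock and ignoring arbitration and protocol overhead (the function name is ours, for illustration only):

```python
def peak_mbytes_per_sec(bus_width_bits: int, clock_mhz: float) -> float:
    """Theoretical peak: one transfer of bus_width_bits on every clock."""
    bytes_per_transfer = bus_width_bits / 8
    return bytes_per_transfer * clock_mhz  # bytes x MHz = Mbytes/sec


print(peak_mbytes_per_sec(64, 66))  # 64-bit, 66 MHz CompactPCI: 528.0
print(peak_mbytes_per_sec(32, 33))  # 32-bit, 33 MHz baseline PCI: 132.0
```

Sustained throughput on a real backplane will be lower than this peak.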

Meeting these requirements on a 6U (about 9 x 6 inches) CompactPCI board was challenging, to say the least. It would take an array of components that normally go on a full-size ATX motherboard, and we needed to squeeze them all onto an extremely small form factor. Compounding the challenge, we wanted to use two CPUs on the board. In the targeted environments, one CPU was needed to run the system and a secondary processor was needed to handle real-time tasks -- something no one had done before on a CompactPCI board. We wanted true real-time asymmetric multiprocessing -- two or more processes running independently of one another on the same memory subsystem.

In symmetric multiprocessing (SMP) as practiced today, a second CPU on the same board typically adds only 30 to 40 percent more processing power (depending on the operating system) over a single CPU. With asymmetric multiprocessing, the increase in instruction-execution speed is approximately 98 percent -- just about the equivalent of two CPU boards.

There were other performance gains realized using this approach. For example, in CompactPCI telecom applications, the big problem is executing a deterministic function with low latency in a high-performance application. The ATM controller is very I/O intensive -- it takes a lot of a processor's bandwidth to perform its functions. So in a single-board computer application, one processor spends 90-100 percent of its time just handling data transfers to and from the ATM controller. Once the data comes in, there is no determinism or computational capacity left to work with the data stream in real time. In an average ATM application the controller is carrying a lot of calls. The processor must bring this data in from the controller, process it, and put it back onto some other medium, whether disk, Ethernet, or another I/O channel. Historically, this could not be done on one processor card because of the bandwidth required. With a second CPU, one processor handles data from the ATM controller while the other processes calls and all other system functions.

A primary objective was to make the second CPU virtually independent of the primary processor, carrying the "loosely coupled" concept a step further to what we like to call "barely coupled." Our idea was to provide the ability to hot-swap the second CPU's kernels. That is, we wanted the second CPU to be independent enough that we could boot the operating system on the main processor, run a microkernel or application on the second processor, and then swap it out for a different kernel or application without having to reboot the system. We knew this would be of supreme value to always-on applications such as you find in telecom.

We began by choosing the Pentium III for our CPUs, not only because of its 550 MHz speed, but because it provides an integrated Advanced Programmable Interrupt Controller (APIC). This controller lets multiple Pentiums communicate with each other over a high-speed serial bus, with minimal latency and in parallel with program execution.

The Pentium is also available with a companion chip, the 82093AA IOAPIC, that provides multiprocessor interrupt management with a dynamic interrupt distribution scheme. In systems with multiple I/O subsystems, each subsystem can have its own set of interrupts. Together, the APIC and IOAPIC overcome the communications bottlenecks and lack of determinism that often plague both SMP and distributed multiprocessing systems.

However, on a CompactPCI board processor height is a big issue since Slot 1 Pentiums normally are installed vertically. We solved that by using our own right-angle connector, the Slot Saver, which lays the Pentiums down flat. This solution also allowed us to design a heat sink that meets the needs of the Pentium processors.

A bigger problem we faced was that of putting more than one CPU board in a CompactPCI system. The primary system controller sets up the memory and I/O allocations throughout the system. If you add another CPU board with its own local resources -- such as Ethernet, SCSI, or a PCI-to-ISA bridge -- and you put a standard PCI-to-PCI bridge on that board, the system controller will come across the bridge and reconfigure the memory and I/O of the local devices. The local CPU will also attempt to configure these same devices, causing many system conflicts. For that reason, a standard PCI-to-PCI bridge could not be used.

In order for the local CPU to keep control of its own resources, you have to use a "blocking" bridge so that neither CPU can launch configuration cycles on the downstream side. We used the DEC 21554 Drawbridge chip, a PCI-to-PCI bridge that will not forward Type 1 configuration cycles. This allows for more than one CPU board in a system, each one configuring its own local devices without interfering with any of the other CPU boards or intelligent peripherals in the system. As far as we know, this is another first on a CompactPCI board.
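
To make the difference concrete, here is a simplified model of how a bridge reacts when it sees a Type 1 configuration cycle on its primary side. A conventional (transparent) PCI-to-PCI bridge converts the cycle to Type 0 if it targets the bus directly behind the bridge, and passes it along as Type 1 if it targets a bus further downstream; a blocking bridge does neither. The class and method names are ours and purely illustrative; this is not register-level code for the 21554, and the bus-number fields on the blocking bridge exist only so the two cases can be compared side by side.

```python
from dataclasses import dataclass
from enum import Enum


class Action(Enum):
    IGNORE = "not claimed by this bridge"
    CONVERT_TO_TYPE0 = "run as Type 0 on the secondary bus"
    FORWARD_TYPE1 = "pass downstream as Type 1"


@dataclass(frozen=True)
class Bridge:
    secondary_bus: int    # bus directly behind the bridge
    subordinate_bus: int  # highest-numbered bus behind the bridge
    transparent: bool     # False models a "blocking" bridge

    def on_type1_config(self, target_bus: int) -> Action:
        if not self.transparent:
            # Blocking bridge: configuration cycles never cross it, so
            # the host cannot reconfigure the local CPU's devices.
            return Action.IGNORE
        if target_bus == self.secondary_bus:
            return Action.CONVERT_TO_TYPE0
        if self.secondary_bus < target_bus <= self.subordinate_bus:
            return Action.FORWARD_TYPE1
        return Action.IGNORE


standard = Bridge(secondary_bus=1, subordinate_bus=3, transparent=True)
blocking = Bridge(secondary_bus=1, subordinate_bus=3, transparent=False)
for bus in (1, 2, 4):
    print(bus, standard.on_type1_config(bus).name,
          blocking.on_type1_config(bus).name)
```

With the standard bridge the host's enumeration reaches buses 1 through 3; with the blocking bridge every cycle stops at the bridge, which is what leaves the local devices to the local CPU.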

We answered the call for "huge memory" with a gigabyte of PC-100-compliant synchronous DRAM and a megabyte of L2 cache. Both processors run out of shared memory.

To achieve high I/O throughput on the hard disks, we chose a 40-Mbytes/sec UltraSCSI interface. For networking, we added two autodetecting, autoswitching 100BaseT Ethernet devices that can be used for redundancy or routing applications. We also included an Ultra-DMA 33 IDE interface, a pair of USB ports, dual serial I/O with optional RS422 drivers, and a parallel port.

To meet the demand for powerful graphics, we used the Intel 740 Accelerated Graphics Port (AGP) chip, which gave us over 7 million triangles per second. AGP provides a direct connection between the display adapter and memory and doesn't use up PCI bus bandwidth, making it another element in providing fast I/O.

We chose Intel's Front Side Bus because of its 100-MHz capability, giving us the speed we needed for our CPUs. Formerly the limit for this type of architecture was 66 MHz. The processors, cache and memory are linked by this bus, which not only supports current 500-MHz processors, but will support next-generation processors as well.

The finished board, designated the C2P3, runs a variety of desktop and real-time OSes, including Windows NT, Solaris x86, QNX, and VxWorks, with multiprocessing CompactPCI drivers that treat the CompactPCI bus as a network and communicate via TCP/IP over the backplane. This is in conformance with a new CompactPCI standard, currently in draft, that will use TCP/IP-type packet sending as a method of communication between multiple CPUs within a system.

When economy is an issue, the board can accommodate two Celeron PPGA 370 processors that currently run at 466 MHz, with a future clock speed of 667 MHz on the drawing board. The Celeron version also includes 256K of on-die cache with one-to-one clocking that dramatically improves the performance of the cache. This configuration provides Pentium III performance at a fraction of the cost.

In tandem with designing the C2P3, we needed to write the code to make it jump through hoops. As mentioned above, one of our primary challenges was to get the secondary processor configured and running on a system where the primary processor is already running. To do this we created RAMP (Real-time Asynchronous Multiprocessor), which consists of a board-support program that runs on the primary processor under the OS; a microkernel that runs on the secondary processor; libraries used to compile user-created kernels or applications; and an API. We configured the microkernel to handle eight real-time tasks scheduled in its task list. These can be executed independently of the tasks on the primary processor.

But this wasn't simple. When a CPU comes up under VxWorks it has control over the system memory, all the hardware ports, and interrupts. Bringing up another kernel in that environment is difficult because it can't touch any of the memory or hardware resources on its own. It has to be set up so that resources are requested from the host processor on an as-needed basis. We solved this by bringing up our kernel in a do-nothing loop; when we give it a task, it requests the memory or port allocations it needs from the host processor.
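
The request-on-demand idea can be modeled in a few lines. The sketch below is a toy simulation, not RAMP code: the `Host` and `SecondaryKernel` classes, their methods, and the addresses are all hypothetical. It only shows the ordering described above, in which the secondary kernel holds nothing while idle and asks the host for memory and ports when a task arrives.

```python
from collections import deque


class Host:
    """Stands in for the host OS, which owns all memory and I/O ports."""

    def __init__(self, pool_base: int, pool_size: int):
        self._next_free = pool_base
        self._pool_end = pool_base + pool_size
        self._ports_in_use = set()

    def request(self, mem_bytes: int, ports=()):
        """Grant the memory and ports together, or grant nothing."""
        ports = set(ports)
        if mem_bytes < 0 or self._next_free + mem_bytes > self._pool_end:
            return None  # not enough memory left in the pool
        if ports & self._ports_in_use:
            return None  # a requested port already belongs to someone
        base = self._next_free
        self._next_free += mem_bytes
        self._ports_in_use |= ports
        return {"base": base, "size": mem_bytes, "ports": ports}


class SecondaryKernel:
    """Starts out owning nothing; asks the host only when given a task."""

    def __init__(self, host: Host):
        self._host = host
        self._inbox = deque()
        self.running = {}  # task name -> grant received from the host

    def give_task(self, name: str, mem_bytes: int, ports=()):
        self._inbox.append((name, mem_bytes, ports))

    def idle_pass(self) -> bool:
        """One trip around the do-nothing loop; True if a task started."""
        if not self._inbox:
            return False  # idle: nothing requested, nothing touched
        name, mem_bytes, ports = self._inbox.popleft()
        grant = self._host.request(mem_bytes, ports)
        if grant is None:
            return False  # host refused; this sketch simply drops the task
        self.running[name] = grant
        return True


host = Host(pool_base=0x100000, pool_size=0x2000)
kernel = SecondaryKernel(host)
kernel.idle_pass()                                # nothing to do yet
kernel.give_task("atm_rx", 0x1000, ports=[0x300])
kernel.idle_pass()                                # now the request is made
print(hex(kernel.running["atm_rx"]["base"]))
```

A real kernel would also have to return resources when a task ends and report refusals back to the caller; both are left out here to keep the ordering visible.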

Finally, we optimized the kernel for high-speed context switching and interrupt response. We utilized an event-driven, priority-based, multitasking scheduler to make it capable of supporting up to eight tasks.
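
For readers unfamiliar with this style of scheduler, here is a minimal model of an event-driven, priority-based dispatcher capped at eight tasks. It is our own illustration, not the RAMP microkernel: the names are hypothetical, each priority level holds one task, and handlers run to completion rather than being preempted. One bit per task in a ready mask keeps the "find the highest-priority ready task" step down to a couple of bit operations.

```python
MAX_TASKS = 8  # matches the eight-task limit described in the article


class Scheduler:
    def __init__(self):
        self._handlers = [None] * MAX_TASKS  # index = priority, 0 is highest
        self._ready = 0  # bit i set => the task at priority i has an event

    @staticmethod
    def _check(priority: int) -> None:
        if not 0 <= priority < MAX_TASKS:
            raise ValueError(f"priority must be 0..{MAX_TASKS - 1}")

    def register(self, priority: int, handler) -> None:
        self._check(priority)
        if self._handlers[priority] is not None:
            raise ValueError(f"priority {priority} is already taken")
        self._handlers[priority] = handler

    def post_event(self, priority: int) -> None:
        """Called when an interrupt or message makes a task runnable."""
        self._check(priority)
        if self._handlers[priority] is None:
            raise ValueError(f"no task registered at priority {priority}")
        self._ready |= 1 << priority

    def dispatch(self) -> bool:
        """Run the highest-priority ready task; False if none is ready."""
        if self._ready == 0:
            return False
        lowest_bit = self._ready & -self._ready  # isolate lowest set bit
        self._ready &= ~lowest_bit
        self._handlers[lowest_bit.bit_length() - 1]()
        return True


order = []
sched = Scheduler()
sched.register(5, lambda: order.append("logging"))
sched.register(2, lambda: order.append("atm_rx"))
sched.post_event(5)
sched.post_event(2)
while sched.dispatch():
    pass
print(order)  # ['atm_rx', 'logging']
```

Although the logging event was posted first, the task at priority 2 runs first because it outranks priority 5.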

We wrote the API to provide calls for uploading, downloading, starting and stopping tasks, allocating memory and hardware resources, establishing protected memory regions, and facilitating communications between tasks. The programmer can use API calls from within the main OS to distribute tasks and resources to slave processors and to coordinate slave processor activity. Programmers can also add their own API calls to address application-specific requirements. Designers can guarantee worst-case latencies and take full advantage of available hardware resources by assigning particular tasks to particular processors.
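
The article does not list the actual call names, so the following is only a hypothetical stand-in showing the kind of task lifecycle such an API implies: upload an image, start it, stop it, and replace it without restarting anything else. Every identifier here (`SecondaryCpu`, `upload`, `start`, `stop`, `state`) is invented for illustration and should not be read as the RAMP API.

```python
from enum import Enum


class TaskState(Enum):
    LOADED = "loaded"
    RUNNING = "running"
    STOPPED = "stopped"


class SecondaryCpu:
    """Host-side handle for one secondary processor (hypothetical)."""

    def __init__(self):
        self._tasks = {}  # task name -> [image bytes, TaskState]

    def upload(self, name: str, image: bytes) -> None:
        entry = self._tasks.get(name)
        if entry is not None and entry[1] is TaskState.RUNNING:
            raise RuntimeError(f"stop {name!r} before replacing it")
        self._tasks[name] = [image, TaskState.LOADED]

    def start(self, name: str) -> None:
        self._tasks[name][1] = TaskState.RUNNING

    def stop(self, name: str) -> None:
        self._tasks[name][1] = TaskState.STOPPED

    def state(self, name: str) -> TaskState:
        return self._tasks[name][1]


cpu = SecondaryCpu()
cpu.upload("filter", b"v1 image")
cpu.start("filter")
cpu.stop("filter")                 # the host OS keeps running throughout
cpu.upload("filter", b"v2 image")  # swap in a new image, no reboot
cpu.start("filter")
print(cpu.state("filter").name)    # RUNNING
```

Refusing to overwrite a running task is a design choice of this sketch; it forces the stop-then-replace order that a swap without reboot relies on.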

RAMP is loaded onto the primary processor along with the OS and will support however many processors are on the board. In the case of the C2P3, that's one secondary processor, but theoretically it will support as many as 32 processors. Each processor can run code in its own cache concurrently with the other processors, eliminating a large number of bus accesses. The system is deterministic, and the user can configure exactly how much load each processor will handle.

Under RAMP both processors operate out of the same memory and utilize common interprocessor and interrupt communications protocols and pathways. In addition, RAMP’s single-processor programming model enables complex multiprocessing programs to be developed just as they would for a single processor. The master processor runs a full-featured RTOS, such as VxWorks, that is primarily responsible for resource allocations and housekeeping. The slave processors are free to run independent real-time tasks.

RAMP is royalty free, so designers need only purchase a full-featured RTOS for the master processor. An efficient way to handle board-to-board communications is with packages like VxMP, with RAMP used to handle on-board Real-time Asynchronous Multiprocessing.

The RAMP board support package, including API and microkernel, is provided free of charge with General Micro Systems’ single-board computers.
