# Parallel Processing-Parallel Memory Approach for Super Fast Design of Future Microprocessor

Yaakob Karomy. Hanna Al-Mustansiriyah University, College of Engineering Department of Computer & Software Engineering Baghdad, Iraq

## Abstract

The early design of the microprocessor  $(\mu P)$  used a single ALU with a single unit of memory. The development of the microprocessor design generates a multi-ALUs microprocessor that is a parallel processing with multi-units of memory.

The parallel processing approach will increase the speed of the processing but this speed up is non-linear with increasing the number of processors that are used in the system  $(\mu P)$ . However, the efficiency of the parallel processing is non-linear and depends on some factors such as the parallel processing type, the overall design, the programming approach and the applications, yet in general the parallel processing efficiency will decrease by the increase of the number of processors in the system.

The history of the Intel  $\mu$ P's will be used as an example to trace and analyze the growth of the  $\mu$ P. This tracing will disclose that the main future problem in the  $\mu$ P is the storage not the processing of the data. This problem is generated because the shared memory in the parallel processing will capture the processors in the system. The processor in this parallel processing system is not free to use the memory but it shares a single memory with other processors in this system.

This paper propose a novel approach designing a parallel memory that gives the processors in the parallel processing system a higher freedom to use the memory and eventually increases the efficiency of each processor, that end result will increase the total speed of the parallel processing system because it will become a parallel processing parallel memory (PPPM) system.

This approach will apply to the Intel processor  $P^{\xi}$ , which will show that it is able to increase the speed of the  $P^{\xi}$  processor for more than four times. These results are used to propose a future design strategy as a first step to implement a super fast processor and then a super fast PC.

The proposed processor is PPPM system with  $\uparrow \circ \uparrow ALUs$ , this processor is expected to enhance the strategy of the management and control units to become a successful super fast processor with speed up to  $\uparrow \cdot$  times over the Intel P  $\pounds$ .

**Keyword:** Intel  $\mu P$ , parallel processing, parallel memory, cache memory, DMA, memory cycle,  $P \leq$ , prediction, fetching, Net Burst, shared memory, delay, latency.

الخلاصة

التصاميم الاولية للمعالجات الدقيقة استخدمت وحدة واحدة فقط من وحدات الحساب والمنطق (ALU) مع ذاكرة واحدة. التطوير في تصميم المعالج الدقيق خلق معالجات دقيقة بعدة وحدات من الـ (ALU) تعمل بصورة متوازية مع بعضها البعض مع استخدام عدة وحدات من الذاكرة ( multi-units of memory).

طريقة المعالجة المتوازية تزيد من سرعة المعالجة ولكن هذه الزيادة لاتكون زيادة خطية مع الزيادة في عدد المعالجات المستخدمة في النظام. ان كفاءة نظام المعالجة المتوازية ليست خطية بسبب اعتمادها على عوامل اخرى مثل ، نوع المعالجة المتوازية، التصميم النهائي للنظام، نوع البرمجيات مع التطبيقات،اكن بصورة عامة ان كفاءة نظام المعاجة المتوازية تقل كلما ازداد عدد المعالجات في النظام.

ان تاريخ تطور المعالجات من نوع (Intel) ستستخدم كمثال لبحث وتحليل نمو وتطور المعالج الدقيق. هذا البحث سيكشف ان المشكلة المستقبلية الرئيسية في المعالج الدقيق ستكون في خزن المعلومات وليس في معالجتها. هذه المشكلة تظهر بسبب ان الذاكرة المشتركة في نظام المعالجة المتوازية ستحدد عمل المعالجات في النظام حيث ان المعالج في هذا النظام سيكون ليس حرا في استخدام الذاكرة ولكنه سيتقاسم تلك الذاكرة مع المعالجات الاخرى في النظام.

البحث يقترح نهجا جديدا لتصميم الذاكرة المتوازية التي تعطي المعالجات في نظام المعالجة المتوازية حرية اكبر في استخدام الذاكرة مما يزيد من كفاءة كل معالج ، ونتيجة لذلك ستزداد السرعة الكلية للنظام بسبب نظام المعالجة المتوازية والذاكرة المتوازية (PPPM).

هذا النهج سيطبق على المعالج الدقيق نوع ( P f Intel )، والذي سوف يزيد سرعة المعالج f f لأكثر من أربع مرات. هذه النتائج تستخدم لاقتراح استراتيجية تصميم مستقبلية كخطوة أولى لتنفيذ معالج بسرعة فائقة ومن ثم حاسوب شخصي (PC) بمثل هذه السرعة.

المعالج المقترح يستخدم نظام معالجة متوازية مع ذاكرة متوازية (PPPM) ب ALUs ٢٥٦ وحدة حساب ومنطق، ومن المتوقع ان هذا المعالج سيعزز استراتيجية وحدات الادارة والسيطرة لتصبح بنجاح معالج فائق السرعة تفوق سرعة معالج (٤ Intel P) ب (٢٠) مرة.

## **1.** Introduction

What are the requirements to implement a fast PC? The basic answer is the implementation of a fast PC requires a fast  $\mu$ P; but this answer is not complete because the fast  $\mu$ P requires to a suitable components to help it. The fast PC required at least to:

- ۱. Fast µP.
- <sup>7</sup>. Large size of memory with suitable heretically implementation.
- ۳. Flexible I/O controller.

This answer arise a new question, what is the fast  $\mu P$ ? It's a  $\mu P$  that has at least:

- 1. Fast clock.
- ۲. Wide data bus (ALU).
- ۳. Wide address bus.

- ٤. Suitable control signals.
- °. Large set of instructions.
- J. Good control circuits.

The application of all the above points in the design of the  $\mu P$  is not enough to implement a fast  $\mu P$ ; the key of the next step to increase the  $\mu P$  speed is the parallel processing with extra help units such as fetching unit, MMX, SSE, Net Burst,..... [`]

This paper will trace the steps of the processor growing from the history of the Intel processors Section  $\gamma$ .

Section  $\mathcal{T}$  shows the types of the parallel processing while the efficiency measurement of the shared parallel processing system will be summarized in section  $\mathcal{E}$ .

Section  $\circ$  shows the design of the P<sup> $\epsilon$ </sup> processor as suitable example for Intel  $\mu$ P that uses the parallel processing approach.

Section  $\forall$  will give the key of the novel method of the parallel memory, while the effect of the PPPM approach will present in examples in section  $\lor$  that will estimate the time delays using a simple basic equations then simulate the simple parallel processing system for much better results.

Section  $\wedge$  gives the main lines to design a super processor in the future, the conclusions and a simple future work of this paper are in sections  $\uparrow$  and  $\uparrow$ .

### **\*.** Growing History of Intel microprocessors

The history of the Intel microprocessors will be summarized in tables ( $\uparrow$ ,  $\uparrow$  and  $\uparrow$ ); these tables show how the growing of the  $\mu$ P is joined with the other factors in the PC. The light points for this history are noted in bold lines in the tables like the use of DMA in  $\land \cdot \land \mu$ P.

The first row in table () shows the  $\xi \cdot \cdot \xi \mu P$  in 1971 which is the first general IC (the early form of  $\mu P$ ) used to implement micro-controllers such as the simple calculator. this IC mange  $1\xi$ . Byte of memory and uses the  $\xi \cdot \cdot 9$  IC as I/O manager. The expansion of the memory size to 11 KB in  $\wedge \cdot \wedge$  will require a special memory management and I/O terminals. This unit is called Direct Memory Access (DMA). The  $\wedge \cdot \wedge$  IC has the same architecture of the  $\xi \cdot \cdot \xi$  IC with twice number of bytes for data bus and address bus. The 11 bit address bus was presented in  $\wedge \cdot \wedge$  that expands the memory to  $1\xi$  KB. The  $\wedge \cdot \wedge \circ$  in 1971 is a glaring point in the Intel processor history; this IC is a real  $\mu P$  that has a complex set of instructions 11 bit address bus with new input (maskable interrupt) controller, so it has three internally data-buses with all necessary registers set. The weakness in this  $\mu P$  is the limited memory ( $1\xi$  KB) that is not suitable to save the real programs and the data for the real problems.

Expansion of the address bus to  $\checkmark$  bit expands the memory size to  $\checkmark$  MB. The new size for memory requires a large size memory word and registers to capture the addresses of the instructions and the data in the memory, this problem is solved by using the *segmentation* approach. The electronic growth helps to implement fast and large ICs, the designer use this

advantage to increase the data bus to 17 bit, and hence to implement the first element for the successful family of processor that called 1.47 in 1.474. The 1.471 µP has speed up to 1.474 MHz, maskable interrupt and management 1.474 MB memory will be used to realize the dream to implement the early PC that called IBM PS/7.

| Year | μP         | Data<br>bus | Address<br>bus | Clock<br>(Hz) | MIPS | Memory<br>(Byte)               | I/O                   | Used             |
|------|------------|-------------|----------------|---------------|------|--------------------------------|-----------------------|------------------|
| 1971 | £<br>£     | £           | ٤              | ٧٤·K          | ۰,۰۷ | 75.                            | 59                    | Calculator       |
| 1971 | ٤٠٤<br>•   | ٤           | ٤              | νε・Κ          | ۰,۰۷ | 1.72                           | £ Y • V               | Calculator       |
| 1977 | ۸<br>۸     | ٨           | ^              | ۰۰۰K          | •,•0 | ۱٦K                            | DMA                   |                  |
| 1975 | ۸۰۸<br>۰   | ٨           | 17             | ۲Μ            | ٠,٦٤ | ٦٤K                            | DMA                   | Traffic<br>light |
| ۱۹۷٦ | Λ • Λ<br>ο | ٨           | 17             | ۲Μ            | •,٣٧ | ٦٤K                            | maskable<br>interrupt |                  |
| ۱۹۷۸ | ۸۰۸<br>٦   | ١٦          | ۲.             | ۸M            | ٠,٦٦ | <sup>ヽ</sup> M<br>segmentation | maskable<br>interrupt | IBM PS/Y         |

Table ( $^{1}$ ). The early generations Intel  $\mu$ Ps.

The  $\wedge \cdot \uparrow \wedge \uparrow \mu P$  that has  $\uparrow \not \in$  bit in address bus and  $\uparrow \uparrow MB$  memory with MSDOS is able to implement a useful PC that can be used for long time. The new problem in that time was the limited memory size in comparison with the requirements for solved problems in PC. The solution comes with  $\wedge \cdot \uparrow \wedge \uparrow DX$  in  $\uparrow \uparrow \wedge \circ$  that had  $\uparrow \uparrow \uparrow$  bit for data bus and  $\uparrow \uparrow$  bits for address bus to expand the memory to  $\not \in$  GB with additional feature called *virtual memory* that expanded the external memory of the PC (hard disk) to  $\uparrow \not \in$  TB.

The large memory size in  $\wedge \cdot \uparrow \wedge \neg DX$  required a long read and write times. this problem was solved by adding a small ( $\wedge$  KB) intermediate memory called *cache* memory in  $\wedge \cdot \notin \wedge \neg DX$ .

The Pentium technology opens a new gate to implement a very large and very high speed ICs ( $\mathcal{T},\mathcal{V}$  million transistors with  $\mathcal{T}\mathcal{V}\mathcal{T}$  pin), the data bus became  $\mathcal{T}\mathcal{E}$  bit with  $\mathcal{V}\mathcal{T}$  KB cache in the first generation of these processors which called P° or PI. This processor with all the next generations uses two new technologies, the first is the *prediction* unit which represents the early class of the parallel processing, while the second is the imbedded processor to increase the speed independent of the memory and I/O terminals manger. The new technology (Pentium) with the

two new features in the processor gives the PC a high speed that makes the PCs able to run the advances operating systems (AOS) programs such as windows.

The  $\checkmark$  ALUs with two levels cache memory in Pentium Processor represent the early attempt to use the *parallel* processing approach that increases the power of processor. The additional important unit emerged in 199% with PII used MMX internally unit that made the processor able to run a large set of instructions that called MMX set instruction, therefore sometimes this processor is called Pentium MMX. The PIII expands to run a new set of instructions called SSE also it used  $\pounds$  ALUs.

|      | Iable                | (), "       | ie praoti      | our ger       |                   | inter µr s i                                                   | 0110.                 |                  |
|------|----------------------|-------------|----------------|---------------|-------------------|----------------------------------------------------------------|-----------------------|------------------|
| Year | μΡ                   | Data<br>bus | Address<br>bus | Clock<br>(Hz) | Cache<br>(Byte)   | Memory<br>(Byte)                                               | I/O                   | Other<br>factors |
| 1972 | ٨.١٨٦                | ١٦          | ۲.             | ۸M            | no                | ۱ M seg.                                                       | DMA<br>Interrupt      | MSDOS            |
| 1971 | ٨٠٢٨٦                | ١٦          | ۲ ٤            | ١٠Μ           | no                | い M seg.                                                       | DMA<br>Interrupt      | MSDOS            |
| 1970 | л.тлт <u>р</u><br>Х  | ۳۲          | ۳ ۲            | Mגו           | no                | <sup>٤</sup> G virtual<br><sup>ヽ</sup> <sup>٤</sup> T<br>paged | DMA<br>Interrupt      | MSDOS            |
| ١٩٨٩ | л. £ЛЪ<br>Х          | ٣٢          | ٣٢             | ۲٥Μ           | L \ ^K            | ÉG TÉT                                                         | DMA<br>Interrupt      | MSDOS            |
| 1997 | P° (PI)              | ٦ ٤         | ٣٢             | ۲۰M           | L) אין א          | ٤G ٦٤T                                                         | Imbedded<br>processor | Predict<br>+AOS  |
| 1990 | Pentium<br>Processor | ٦٤          | ٣٢             | ١°•M          | L1 17K<br>L7 707K | ٤G ٦٤T                                                         | Imbedded<br>processor | ۲ ALUs           |
| ١٩٩٧ | PII                  | 75          | ٣٢             | ۳۰۰Μ          | L1 87K<br>L1 87K  | ές τέτ                                                         | Imbedded<br>processor | MMX              |
| ١٩٩٩ | PIII                 | ٦٤          | ٣٢             | ۰۰۰M          | L) 707K<br>L7 7M  | ٤G ٦٤T                                                         | Imbedded<br>processor | ・ALUs<br>SSE     |

Table (<sup>†</sup>). The practical generations Intel µPs for PC.

The ALUs in Pentium Processor., PII and PIII are half parallel processing because they are connected as a parallel input serial output, the real parallel processing starts with *Net Burst* in  $P^{\xi}$  as in table (<sup>r</sup>) it has  $\xi$  ALUs work as parallel input and parallel output units.

Table ( $\mathcal{T}$ ) shows how the micro-architecture of the classic  $\mu P$  reaches saturation and the improve start to use the other factors as in the last column in table ( $\mathcal{T}$ ). P<sup> $\xi$ </sup> satisfies the parallel processing using Net Burst and enhances the set of instructions using *SSE/SSE*  $\mathcal{T}$  sets.

Itanum (Dual core) grouped two  $P \notin \mu Ps$  internally and expands its instruction set to new set called SSE<sup>r</sup>. The growing generates a new technology called Intel  $\forall \notin$  that is used to implement the Intel Core  $\forall$  type Kentfield which has two dual cores internally and expands it to the Intel Core  $\forall$  type Yorkfield that has  $\forall$  cores internally with additional instructions set called SSSE<sup>r</sup>.

Core i° technology (Nehalem type Lynnfield as example) used  $\xi$ - physical cores with  $\gamma$ channel DDR<sup> $\gamma$ </sup> memory also its designer separates the cache memory L<sup> $\gamma$ </sup> to two parts one for data and the other for instructions, expand  $L^{\gamma}$  to  $\xi$  sections each one is  $\gamma \circ \gamma$  KB adding the third common cache level  $L^{\gamma}$  that has  $\wedge$  MB capacity.

The Dual core, Two Dual core, Core  $\checkmark$  Quad,  $\ddagger$ -physical cores with  $\checkmark$ -channel DDR $\checkmark$ and in the last the  $\urcorner$ -physical cores with  $\ddagger$ -channel DDR $\checkmark$  and the *Turbo Boost Technology* in Pentium Core<sup>TM</sup> i $\lor$  (Westmere type Gulftown)  $\mu$ P strategies in March  $\curlyvee$ . $\lor$ . enhanced the parallel processing but all the new technologies will not add high features for the total power of the  $\mu$ P because the limitations of the parallel processing that will be discussed later. For more detail about the history of microprocessor see Ref [!-"].

| Year         µP         Data         Address         Clock         Cache         Memory         Other factorial |                           |     |     |      |                            |        |                               |
|-----------------------------------------------------------------------------------------------------------------|---------------------------|-----|-----|------|----------------------------|--------|-------------------------------|
| Year                                                                                                            | μP                        |     |     |      |                            | Memory | Other factors                 |
|                                                                                                                 |                           | bus | bus | (Hz) | (Byte)                     | (Byte) |                               |
| ۲۲                                                                                                              | P٤                        | 7 £ | 37  | ۲,٤G | $\Gamma$ ) $\Gamma$ K      | ٤G     | Net Burst                     |
|                                                                                                                 |                           |     |     |      | L7 017 K                   | ٦٤Τ    | SSE/SSE Y                     |
| 77                                                                                                              | ۲ Itanum                  | 75  | 37  | ۲,٤G | $\Gamma$ ) $\Gamma$ K      | ٤G     | Dual core                     |
|                                                                                                                 |                           |     |     |      | L۲ ٤ M                     | ٦٤Τ    | SSE <sup>r</sup>              |
| 77                                                                                                              | Intel ٦٤,                 | 75  | ۳۲  | ۳G   | $\Gamma$ ) $\Gamma$ K      | ٤G     | Two Dual Core                 |
|                                                                                                                 | Intel Core <sup>۲</sup> , |     |     |      | L۲ ۲x٤ M                   | ٦٤Τ    | SSSE                          |
|                                                                                                                 | Kentfield                 |     |     |      |                            |        |                               |
| ۲۸                                                                                                              | Intel ٦٤,                 | 75  | ۳۲  | ۳G   | L1 "7K                     | ٤G     | Core <sup>v</sup> Quad        |
|                                                                                                                 | Intel Core ۲,             |     |     |      | L۲ ۲x٦M                    | ٦٤Τ    | SSSE                          |
|                                                                                                                 | Yorkfield                 |     |     |      |                            |        |                               |
| ۲۹                                                                                                              | Nehalem,                  | 75  | ۳۲  | ٣,٢G | L) $f_X$ $f_+$ $f_+$ $f_+$ | ٤G     | <sup>٤</sup> - physical cores |
|                                                                                                                 | Core i°,                  |     |     |      | К                          | ٦٤Τ    | <sup>r</sup> -channel         |
|                                                                                                                 | Lynnfield                 |     |     |      | LY fxyon K                 |        | DDR۳                          |
|                                                                                                                 | -                         |     |     |      | L۳ ۸ M                     |        |                               |
| ۲۰۱۰                                                                                                            | Westmere,                 | 75  | ۳۲  | ٦,٤G | L) $f_{X}$ $f_{+}$ $f_{+}$ | ٤G     | ۶- physical cores             |
|                                                                                                                 | Core™ i <sup>∨</sup> ,    |     |     |      | К                          | ٦٤Τ    | ٤ × DDR۳                      |
|                                                                                                                 | Gulftown                  |     |     |      | Lt VXLOJ K                 |        | Turbo Boost Technology        |
|                                                                                                                 |                           |     |     |      | L۳ ኘ፥ M                    |        |                               |

Table (<sup>r</sup>). The modern generations of Intel µPs.

## **\***. Parallel Processing

There are two main types of the parallel processing, each one has advantages and disadvantages over the other type. These two types are the Distributed Memory and the Shared Memory. Also these types are improved to become hybrid to generate a large amount of parallel processing systems. For more detail about the parallel processing see Ref [ $\xi$ , °].

## **".**\. Distributed Memory

Distributed memory systems vary widely but share a common characteristic. It requires a communication network to connect inter-processor memory as in Fig (1).

The processors have their own local memory. Memory addresses in one processor do not map to another processor, so there is no concept of global address space across all processors. Because each processor has its own local memory, it operates independently. Changes that made to any local memory have no effect on the memory of other processors. Hence, the concept of cache coherency does not apply. [ $\xi$ ]

When a processor needs access to a data in another processor, it is usually the task of the programmer to explicitly define how and when data is communicated. Synchronization between tasks is likewise the programmer's responsibility. The network "fabric" used for data transfer varies widely, though it can be as simple as Ethernet.



Fig (1): Distributed memory systems parallel processing.

The advantage of this method is each processor can rapidly access its own memory without interference and without the overhead incurred with trying to maintain cache coherency. While the disadvantage of this method, it is used for networks and for independent processors. [2]

#### ۳,۲. Shared Memory

The shared memory parallel computers Fig  $(\)$  vary widely, but generally have in common the ability for all processors to access all memory as global address space. Multiple processors can operate independently but share the same memory resources. Changes in a memory location effected by one processor are visible to all other processors.

The advantages of this method are Global address space provides a user-friendly programming perspective to memory and the data Sharing between tasks is both fast and uniform due to the proximity of memory to CPUs.

The disadvantages of this method are the lack of scalability between memory and CPUs. Adding more CPUs can geometrically increase traffic on the shared memory-CPU path and for cache coherent systems and geometrically increase traffic associated with cache/memory management, programmer responsibility for synchronization constructs that insure "correct" access of global memory. [ $\xi$ ]

This paper will use the shared memory technique for all the next sections and call for simplicity "parallel processing".

### ",". Multi Levels Parallel Processing

The shared memory method is active for a small number of processors (max.  $\land$  or  $\lor$  processors). The system with a large number of processors will use a multi levels parallel processing as shown in Fig ( $\checkmark$ ). The total result is like a parallel inside parallel processing. [ $\circ$ ]

The number of processors in each group is m; this number in default is equal for all groups because that gives optimal results and simple control. These (n) groups will be connected to a common memory in form of shared memory, this system is indicated to as m/n. Theoretically there is no condition that limits the number of levels or the number of groups in each level or the number of processors in the higher level, but practically the symmetry saves the best case.



Fig (<sup>†</sup>): Shared memory systems parallel processing.





## ۳, ٤. Virtual Processor

Some types of processors such as  $P^{\xi}$  use multiple levels of memories (L<sup>1</sup> and L<sup>7</sup>) to interchange the data. Hence, the relation between them is not memory to memory but it is like the

relations between processor and the memory, therefore  $L^{\gamma}$  will regard the  $L^{\gamma}$  as a processor; hence, the memory unit  $L^{\gamma}$  in this case will be called a *virtual processor*.

## 4. Efficiency Measurement of the Parallel Processing System

There are many factors that affect the efficiency of the parallel processing system like the number of processors in the system, the distribution form of these processors, the design strategy, type of connection, the processor characteristics, the ratio of used memory, memory speed, the size of the memory and other different factors that depend on the system and problems. [7, Y]

However, for simplicity let:

- 1. All the processors are of the same type.
- <sup>Y</sup>. All processors and memory have same speed (equal clocks time).
- <sup> $\gamma$ </sup>. All processors have same memory used ratio (P<sub>mi</sub>).
- $\xi$ . The value of P<sub>mi</sub> is less than 1/n where n is the number of processors in the system.
- °. Neglect all other factors.

Under these conditions the total speed of the system can be written as in the following equation (1). [7, Y]

$$S_s \leq S_p \times \frac{n+1}{2}$$
 if  $\sum P_{mi} \leq 1$  (1)

Where  $S_s$  is the speed of the system, Sp is the speed of a single processor.

Define a new factor Rs as the maximum rational system speed that is given in equation (7) as:

$$R_s = \frac{n+1}{2} \qquad if \ \sum P_{mi} \le 1 \tag{(7)}$$

The solutions for equation (1) show that the use of two parallel processors will give a system of a single processor with virtual speed less than 1,0 times of the original speed of a single processor ( $S_p$ ). Table ( $\epsilon$ ) shows some results of a simple simulation (enhanced with some practical results) for low rate data systems ( $P_{mi} < \cdot, 1$ ).

From table ( $\xi$ ) we can note that, the efficient case is the system of two parallel processors with low traffic rate in memory. Another important note is the increase of the number of parallel processor will decrease the efficiency of the system for all cases and forms. The maximum practical number of processor is 17 processor and maximum acceptable levels is % levels. So any specifications over this range will reduce the efficiency of the processor to not useful range.

The practical results are lower than these values except for some special problems that accept the parallel computation. For example the efficiency of a  $\checkmark$  parallel processors full in rang from  $\checkmark \cdot \cancel{2}$  to  $\land \circ \%$  and the efficiency of a  $\ddagger$  parallel processors full in rang from  $\pounds \circ \cancel{2}$  to  $\neg \cdot \%$ . [ $\checkmark$ ]

## o. P<sup>t</sup> Intel Processor

The designs of Intel processors are very different as we noted in section , which means that there is no common design for these processors. Hence, we will select the design of the P<sup> $\xi$ </sup> processor as a good example to show the simple design of a modern parallel processor.

The architecture of  $P^{\xi}$  and its working strategy is very complex but we will simplify the main important points of the architecture and focus on the parallel processing features. Then we discuss its main algorithm of the memory cycle that will help to estimate the speed and efficiency of the processor.

|                     | -<br>-            |                |            |
|---------------------|-------------------|----------------|------------|
| Number of processor | Form              | Rational speed | Efficiency |
| (shared memory)     |                   |                | Rs/n*)     |
| ۲                   | Single level      | ١,٨٥           | ٩٣٪        |
| ٤                   | Single level      | 7,70           | ٦٩٪        |
| ٦                   | Single level      | ٣,٥            | ٥٨٪        |
| ٨                   | Single level      | ٤              | 0.%        |
| ٨                   | ٤/٢ or ٢/٤        | ٤,٦            | ٥٨٪        |
| ١٦                  | Single level      | ٤,٩            | ۳١٪        |
| ١٦                  | ۸/۲               | ٦,٣            | ٤ • ٪.     |
| ١٦                  | ۲/۸               | ٦,٣            | ٤ • ٪.     |
| ١٦                  | ٤/٢/٢             | ٧,٦            | ٤ ٨٪       |
| ١٦                  | ٤/٤               | ٧,١            | ٤٤%        |
| ٦٤                  | $\Lambda/\Lambda$ | ١٤.٤           | ۲۳٪        |
| ٦ ٤                 | ٤/٤/٤             | 17.9           | ۲٦%        |

| Table ( | ٤).  | Rational s  | svstem s         | peeds for | parallel | processing | (shared  | memorv). |
|---------|------|-------------|------------------|-----------|----------|------------|----------|----------|
|         | · /• | itational c | <i>y</i> otonn o |           | paranor  | proceeding | 10110100 |          |

## •. 1. The Architecture of P<sup>1</sup> Processor

The architecture of the Intel P<sup> $\xi$ </sup> processor [<sup> $\Lambda$ </sup>] is very complex as in Fig ( $\xi$ ) but the main important advantages are:

- ). It has  $\xi$  ALU's,  $\xi$  ways allocation registers and Net Burst unit to work as  $\xi$  parallel processors.
- <sup> $\gamma$ </sup>. The four ALU's are not symmetric but each one has special properties as indicated in Fig ( $\xi$ ).
- $\tilde{}$ . The allocation and the  $\tilde{}$  entry station unit garnet the parallel in the input side of the  $\epsilon$  ALU's.
- The Net Burst (impeded circuit in the control unit) garnet the parallel in the output side of the <sup>£</sup> ALU's.
- •. There are three auxiliary ALU's used for auxiliary missions such as memory load and store, these ALU's do not appear in the calculation of the efficiency.
- It has two levels of internally caches, the first L' has A-wayes and the second L' has Mayes. The size of these caches is different from type to another, so the sizes of their blocks are different in these types. But the default values for these factors are that used in table (Y) and in the example later.
- <sup>V</sup>. These four processors are connected as a shared memory with L<sup>1</sup>. The L<sup>1</sup> memory connected with L<sup> $\gamma$ </sup> (as virtual processor) the final form is like *two levels* ( $\frac{\epsilon}{1}$ ) shared memory systems parallel processing.

<sup>A</sup>. The processor will be connected with a single main memory in PC. The result will become like *three levels* ( $\frac{\epsilon}{1}$ ) *shared memory systems parallel processing.* 

## •,<sup>۲</sup>. The Memory Cycle of P<sup>€</sup> Processor

- The traditional memory cycle [<sup>9</sup>] is as in Fig (°) summarized in four main steps:
- Start with the data request from instructions fetch unit to L<sup>1</sup> the hit in this point will return the successful order to read/write order in step <sup>r</sup>, while the miss will freeze the processing operations and go to step <sup>r</sup>.





Fig (1): micro architecture of Intel P<sup>1</sup> processor.

- <sup> $\gamma$ </sup>. Cheek the write instruction algorithm to move the new data in L<sup> $\gamma$ </sup> to L<sup> $\gamma$ </sup> and later to main memory in the ideal memory time.
- <sup> $\Upsilon$ </sup>. Start the search in L<sup> $\Upsilon$ </sup> cache memory, if L<sup> $\Upsilon$ </sup> hit move the wanted data block from L<sup> $\Upsilon$ </sup> to L<sup> $\Upsilon$ </sup> and go to step <sup> $\Upsilon$ </sup>, while the miss sends processor to step <sup> $\xi$ </sup>.
- 5. Start the search in the main memory, if the main hit moves the data from the main to L<sup>Y</sup> to L<sup>Y</sup> and go to step <sup>Y</sup>, while the miss sends processor to step <sup>o</sup>.
- •. Defreeze the processor to start processing with the next instruction and send order to DMA unit to start the search in the hard disk of PC.
- <sup>7</sup>. When DMA find the wanted data start to move the data from the hard disk to the main memory and indicate the processor.
- <sup>V</sup>. Refreeze the processor and move the data from main to  $L^{\gamma}$  to  $L^{\gamma}$  and go to step  $\gamma$ .

Note that the write instruction algorithm to move the new data from  $L^{\uparrow}$  to  $L^{\uparrow}$  and to main memory will add a latency time because it works in the ideal memory time.

The miss in L<sup>1</sup> and L<sup>7</sup> causes a delay in the processor, because of the freeze in processing in this time. The freeze order is compulsory because the two caches are busy to move the wanted data to L<sup>1</sup>.

The miss in the main memory is latency because the defreeze order, but this time will become delay if the next instruction depends on the present instruction or if it requires data at the same block as we will discuss in section  $\vee$ .

## **5. Proposed Parallel Memory Technique**

The deep analysis for the P<sup> $\xi$ </sup> processor shows that, the weakness point in this system is the *crowding* of the virtual processors (ALU's and L<sup> $\gamma$ </sup>) on the L<sup> $\gamma$ </sup> cache. This problem is solved partially in P<sup> $\xi$ </sup> by using  $^{\Lambda}$  way cache L<sup> $\gamma$ </sup>.

The  $\land$  way cache L $\land$  [ $\land$ ] means the L $\land$  is divided into  $\land$  independent parts but this solution will reduce the problem by factor less than  $\land \land \checkmark$  because the stream of instructions is stored in the same block so their data are also stored in the same block generally.

At this time we need to *crumble* the single block to independent bytes form, this solution (no blocking) is very complex and impractical for large memories such as in  $P^{\xi}$ .

This paper proposes a simple solution with a small additional cost and a small additional complexity for the control circuit in the  $\mu P$  but eliminates the problem of crowding in L<sup>1</sup> in about 9.%.

The basic idea depends on the use of *multi independent internally channels*, each channel has *data bus and address bus* inside the memory as in Fig (7 a) [??] that shows the architecture for the control of the single cell in the present memory while Fig (7 b) shows the architecture for the control of the single cell in the proposed memory with two channels. The channel that uses

the memory doesn't care to the other channels. Hence, each channel is free to use the memory, so that each processor will become *free* to use the memory.

The collision is found when two channels need the same byte, this problem has very low probability so it will be solved using simple controller inside the memory giving a priority for channels. The controller will delay the lower priority in one cycle (very short time). The low probability and very short time delay means the effect of this case can be neglected in the calculations of the time processing.



### Fig (°). The memory cycle in two levels caches Intel processor.

Hence, this approach will be used in all the memories of the  $\mu$ P and PC. The effect of this idea will show in the processor example in the next section.

## **v**. Time Delay Calculations and Simulation

## V, V. Delay and Latency

If we have a queue of instructions under execution in a  $\mu$ P, so any frozen for an instruction in this queue because, generally; it requires data as the miss in L<sup>1</sup> case in memory

cycle. This freeze will be solved by stopping all the next instructions. This stopping time is called *Delay* and all the next instructions results will be delayed. Sometime this delay is large as in case of miss in the main memory, the solution to illuminate this delay is by putting aside this instruction and executing the next instructions until the required data is prepared which is necessary to run this instruction, this time delay is called *Latency*. The latency delays the result of



Fig (<sup>1</sup>). Memory cell architecture, (a) single channel (b) proposed two channels.

the present instruction while it passes the results of the next instructions. Some times one (or more) instruction from the next instructions queue is depends on the aside instruction or sometimes it share the same block with it. In theses two cases all the latency of these instruction will become a delay.

#### **V, Y- Processor Example**

The practical speed calculations of the processor with and without the parallel memory require practical values. These proposed values are the default values of  $P^{\xi}$  processor [ $^{, 1}$ ] in documents and statistics. They will be classified for simplicity into three groups as follows:

Group A: memory specifications

- ). The block size in L) (bi) is  $7 \xi$  B.
- <sup> $\gamma$ </sup>. The block size in L<sup> $\gamma$ </sup> (Bj) is <sup> $\gamma$ </sup> kB.
- ". The page size in main memory (Pk) is  $\forall \xi$  kB.

Group B: processor times

- ). t is the clock time = the time of executing one instruction in fast ALU in the processor and let t=).
- <sup> $\gamma$ </sup>. The time to read or write any data from L<sup> $\gamma$ </sup> is <sup> $\gamma$ </sup> (<sup> $\gamma$ </sup>t).
- $^{\circ}$ . The time to read or write any data from L<sup> $\gamma$ </sup> is  $^{\circ}$  ( $^{\circ}$ t).
- $\xi$ . The time to read or write any data from main memory is  $\gamma \cdot (\gamma \cdot t)$ .
- °. The time to transfer one block from (to) L<sup> $\gamma$ </sup> to (from) L<sup> $\gamma$ </sup> is Tb=<sup> $\gamma$ </sup> · (<sup> $\gamma$ </sup> · t).

- 7. The time to transfer one block from (to) L<sup> $\gamma$ </sup> to (from) main memory is TB=1... (1...t).
- V. The time to transfer one page from (to) hard disk to (from) main memory is Tp=10...1

**Group C**: probabilities of the work

- 1. The number of run instruction in the CPU is 1.....
- See table (°) for the used probabilities. These values are the most used in P<sup>£</sup> processor.

| Symbo | Value | No. of        | Definition                                                 | Note                      |
|-------|-------|---------------|------------------------------------------------------------|---------------------------|
| 1     |       | instructions  |                                                            |                           |
| Pm    | ۰,۲   | 7             | Processor need to external data                            | Pm + Pp =                 |
| Рр    | ۰,۸   | A             | Processor not need to data                                 | $r_{III} + r_{P} = r_{P}$ |
| Ph    | ۰,۹   | ۱۸۰,۰۰۰       | Hit in L                                                   | Ph                        |
| Pmh   | ۰,۰۹  | 1             | Miss in L <sup>1</sup> Hit in L <sup>7</sup>               | + Pmh                     |
| Pmmh  | ۰,۰۰۹ | ۱،۸۰۰         | Miss in L <sup>1</sup> Miss in L <sup>7</sup> Hit in Main  | + Pmmh                    |
| Pmmm  | ۰,۰۰۱ | ۲             | Miss in L <sup>1</sup> Miss in L <sup>7</sup> Miss in Main | + Pmmm =                  |
| Pr    | ۰,٥   | 1 • • • • • • | Read data                                                  | Pr + Pw = 1               |
| Pw    | ۰,٥   | 1 • • • • • • | Write data                                                 | PT + PW = 1               |
| Pd    | ۰,۰۰۱ | 7             | Dependency*                                                |                           |
| Pwo   | ۰,۰۲  |               | Over flow in L                                             |                           |
| Pwo۲  | ۰,۰۰۱ |               | Over flow in L <sup>Y</sup>                                |                           |
| Prw   | [••)] |               | Run in the waiting time                                    | Practical value           |

Table (°). Definition of the probabilities set and their default values in the example.

\* Dependency probability means the run of the next instruction depends on the results of the present instruction or the data of the two instructions are inserted in the same block (bi).

The application of the values in table (°) over the memory cycle in Fig (°) will result to calculate the average required time to run a single instruction in P<sup> $\epsilon$ </sup> processor Tav. These calculations of the average time bellow will shows that the approximated time without parallelmemory approach required is about <sup>7</sup>.<sup>AV</sup> cycles while the required time using parallel-memory approach is about  $\cdot.^{7^{\circ}}$  cycles. In other words the parallel-memory approach will increase the speed of the processor in more than  $\epsilon$  times.

**a.** The calculations of the average required time in the example without parallel-memory approach.

$$Tav = Pp \times 0.25 \times (1 - \Pr w) + Pm \times \{[Ph \times 2 + Pmh \times (2 + Tb) + Pmmh \times (2 + Tb) + TB) + Pmmm \times (2 + Pmmm \times Tp)] + Pw \times [Pwo1 \times Tb + Pwo2 \times TB]\}$$

$$Tav = 0.8 \times 0.25 \times (1 - \Pr w) + 0.2 \times \{ [0.9 \times 2 + 0.09 \times 22 + 0.009 \times 1022 + 0.001 \times (2 + 0.001 \times 150,000) \} + 0.5 \times [0.02 \times 20 + 0.001 \times 1000] \}$$
  
= 0.2 \times (1 - \Pr w) + 2.626 + 0.14  
$$Tav = 2.866t$$
  
= 0.2 \times (1 - 0.5) + 2.626 + 0.14

**b.** The calculations of the average required time in the example using parallel-memory approach.

 $Tav = Pp \times 0.25 \times (1 - \Pr w) + Pm \times \{Ph \times 2 + Pmh \times (2 + Pmh \times Tb) + Pmmh \times (2 + Pmh \times Tb + Pmmh \times TB) + Pmmm \times (2 + Pmmm \times Tp)\}$ 

$$Tav = 0.8 \times 0.25 \times (1 - \Pr w) + 0.2 \times \{0.9 \times 2 + 0.09 \times (2 + 0.09 \times 20) + 0.009 \times (2 + 0.09 \times 20 + 0.009 \times 1000) + 0.001 \times (2 + 0.001 \times 150,000)\}$$
  
= 0.2 \times (1 - \Pr w) + 0.12184 = 0.2 \times (1 - 0.1) + 0.449764  
$$Tav = 0.629764t$$

### ۷, ۳. The Results of the Simulation

This sub-section will summarize the results of a simple simulation for the memory cycle of the Intel processors. This simulation will use the default values in the previous sub-section and the information of processors in table ( $\mathcal{V}$ ). The simulation will run over a proposal case (multichannels parallel memory) to show the effects of the parallel-memory approach. Hence, the number of channels will be indicated in two arcs {} in table ( $\mathcal{V}$ ).

The times in table (7) measured for a universal clock for all processors that write in CPO (clock per operation) while the inverse term OPC (operation per clock) will be use to measure the speeds, the results in the last column is the percentage ratio for the ideal case.

The first row in table ( $^$ ) shows the properties of the present P<sup> $\epsilon$ </sup> processor that describe as  ${}^{\epsilon}{^{1}}/{^{1}}$  that's mean it has  ${}^{\epsilon}$  ALUs connected to L<sup>1</sup> which is connected to L<sup>7</sup> and with L<sup>7</sup> connected to main memory. Hence, all these memories (L<sup>1</sup>, L<sup>7</sup> and main) have a single internally channel.

The row  $(row \, \xi)$  satisfies the ideal case for the processor that has an ideal time  $\cdot, \uparrow \circ$   $(\cdot, \cdot \xi \uparrow)$  for the P $\xi$  (Gulftown) processor but the ALUs are not symmetrical and not all of them are general but each one is specialized for a type of instructions as in Fig ( $\xi$ ) and these instructions cannot be balance for all applications, therefore some ALUs in processor will be idle for some time. The idle time in the case of no miss means this case is not ideal case.

### A. Design of Super Future Processor

The design of a super future processor requires a deep study of the results in table (7), and then tries some designs under simulation to select the most practical one.

row  $\wedge$  in the table shows the form of the imminent future processor that has  $\forall \xi$  ALUs and speed  $\forall \forall$  time over P $\xi$  (for the same clock) ideally but the simulation shows its practical speed is less than  $\forall$  times over the practical speed of P $\xi$ , with very low efficiency which is about  $\forall \%$ . The enhancement of the imminent future processor using parallel-memory approach (row  $\vartheta$ ) will jump up to  $\forall \xi \%$  efficiency and the speed to  $\flat \circ$  OPC.

The try and error for simulation shows the maximum acceptable number for ALUs in processor is  $10^{1}$ . The row  $1^{1}$  shows a processor  $0^{1}$  times faster than P<sup> $\xi$ </sup> while the row  $1^{1}$  shows a processor  $1^{1}$  times faster than P<sup> $\xi$ </sup>. The two proposed super processors have  $10^{1}$  ALUs but the second design has lesser memory layers with more complexity control circuits.

| No.  | Type of µP                                                        | Parallel form                     | Time range   | Tav     | Speed | Efficiency  |
|------|-------------------------------------------------------------------|-----------------------------------|--------------|---------|-------|-------------|
| 1.00 | 1 ypc or µr                                                       |                                   | CPO          | CPO     | OPC   | Linciency   |
| ١    | P <sup>ε</sup> without miss.                                      | £{`}/`{`}/`{`}                    | •, ٣٢*       | ۰,۳۲    | 8,170 | ٧٩٪         |
| ۲    | P <sup>2</sup> with single channel                                | £{`}/`{`}/`{`}                    | ۰,۷٦ _ ۱,۸   | ١,١١    | ۰,۹   | ۲۳ <u>٪</u> |
| ٣    | P <sup>2</sup> with two<br>channels                               | £{Y}/\{Y}/\{Y}                    | ۰,٣٤ - ۰,٧٩  | ۰,۳۹    | ۲,٦   | ٦٥٪         |
| ٤    | Gulftown without miss.                                            | ٤{١}/١{١}/٦{١}/١{١}               | •,•٩١*       | ۰,۰۹۱   | 11    | ٤٦٪         |
| ٥    | Gulftown with single channel                                      | ٤{`}/`{`}/٦{`}/`{`}               | •,19_•,71    | ۰,۳۷    | ۲,۷   | 11%         |
| ٦    | Gulftown with two channels                                        | ٤ { ۲ } / ۱ { ۲ } / ۲ { ۲ } / ۲ } | •,17 - •,77  | ۰,۱۹    | ۰.۳   | YY <u>X</u> |
| v    | Gulftown with<br>two channels and<br>t channels in L <sup>r</sup> | £{Y}/\{Y}/\{ <b>\$</b> }/\{Y}     | •,•97_•,٢٣   | •,170   | ٦     | ٢٥٪         |
| ٨    | Future processor                                                  | $^{1}/^{1}/^{1}/^{1}/^{1}$        | •,10_•,77    | ۰,۲۱    | ٤,٨   | ٧%.         |
| ٩    | Proposed<br>processor \                                           | ^{ £ }/\{Y }/^{T }/\{Y }          | •,•٣٢ - •,١٢ | ٠,٠٦٦   | 10    | ٢ ٤ ٪       |
| ١.   | Proposed<br>processor <sup>۲</sup>                                | )]{)]/2{^}/2{^}/2{^}/1{Y<br>}     | •,•\ź_•,•0   | • , • ٢ | ٥.    | 19%         |
| ))   | Proposed<br>processor ۳                                           | ٣٢ {٦٤}/^{^}/١ {٢}                | •,•))_•,•٦   | •,•1٨   | ०٦    | 11%         |

Table (<sup>1</sup>): The results of a simple simulation for the memory cycle of Intel processors.

The large number of processor (<sup>Yol</sup> ALUs) in PPPM approach cannot be successful without the good additional units. This process requires at least to:

- 1. An enhanced prediction unit that predicates more than one instruction as example set prediction unit.
- <sup>7</sup>. Extra flexible allocate unit.
- ۳. Enhanced Net Burst circuit.
- <sup>£</sup>. Suitable classification and selection for ALUs with suitable distribution.

# **9.** Conclusions

Having done this work we can conclude some points that help the designer to indicate the optimal way to improve the microprocessor design in the future.

- <sup>1</sup>. Increase the number of parallel processing will be useful for limited number of processors; maximum number is <sup>A</sup> processors for single level shared memory systems parallel processing.
- Increase the number of levels also be useful for a limited number of levels; maximum number is <sup>ε</sup>.
- $\tilde{}$  The crowded ALUs over L<sup>1</sup> in the  $\mu$ P is the main problem in the design of present processors.
- ε- The PPPM approach is a very useful technique that will increase the speed of processor to about ε times.
- o- The design of the super processors in the future will not be satisfied by using more ALUs and more cores but it requires radical new ideas.
- <sup>7-</sup> The success of the PPPM approach required to enhance and expand the other units in the  $\mu$ P such as prediction unit, allocate unit and Net Burst circuit.

## **v. Future Work**

The key implementation of the super fast processor that covers the future requirements and applications can be summarized in three main layers:

- 1. Use the PPPM proposed approach in this paper with enhancement and expand the other units in the processors such as the prediction unit, the allocation unit and the Net Burst unit.
- Y. Use the assistant special processors; this idea is like the use of the math co. processor with the A.YAT in some PCs. The proposed assistant processors are coding/decoding processor, "D image processor and complex function processor."
- <sup>r</sup>. Discuss and deduce new ideas for the data processing, which are suitable with burst data processing. These ideas must be left the classic idea of the ALU that depends on the using a single point (ALU) to execute a complete single instruction and use algorithms able to execute a large number of instructions partially at the same time. Hint, the burst data processing approach may burn a complex and slow processors but it is able to grow faster than the classical ideas.

## References

- [1] List of Intel microprocessors, Wikipedia, the free encyclopedia, "http://en.wikipedia.org/wiki/List\_of\_Intel\_microprocessors".
- [<sup>Y</sup>] Intel Consumer Desktop Pc Microprocessor History Timeline, http://www.intel.com/intel/intelis/museum/exhibit/.
- [<sup>۳</sup>] Intel® ٦٤ and IA-<sup>۳</sup><sup>۲</sup> Architectures Optimization Reference Manual, Order Number: ۲٤/۹٦٦-۲۰, November ۲۰۰۹.
- [٤] Blaise Barney," Introduction to Parallel Computing", Lawrence Livermore National Laboratoy, "http://www/Introduction to Parallel Computing.htm".
- [°] Trupti Patil, "Evaluation Of Multi-Core Architectures For Image Processing Algorithms", Thesis Presented to the Graduate School of Clemson University August ۲۰۰۹.

- [<sup>7</sup>] B. L. Buzbee, "The Efficiency of Parallel Processing", Frontiers of Supercomputing, Los Alamos Science Fall 1947, pp <sup>V</sup>1.
- [<sup>V</sup>] Quentin Smith, "Efficiency of Parallel Processing with JCilk", under the direction of Professor Charles E. Leiserson Massachusetts Institute of Technology Research Science Institute, August <sup>v</sup>, <sup>v</sup>··<sup>o</sup>.
- [^] Glenn Hinton, Dave Sager, Mike Upton, Darrell Boggs, Doug Carmean, Alan Kyker, Patrice Roussel, "The Microarchitecture of the Pentium ' Processor ", Intel Technology Journal Q', '...', pp '.
- [<sup>4</sup>] Mostafa Abd-El-Barr, Hesham El-Rewini, "Fundamentals Of Computer organization And Architecture", Copyright # <sup>Y</sup>···<sup>o</sup> by John Wiley & Sons, Inc. All rights reserved.
- [1] Samuel Naffziger, Blaine Stackhouse, Tom Grutkowski, Doug Josephson, Jayen Desai, Elad Alon, and Mark Horowitz," The Implementation of a <sup>7</sup>-Core, Multi-Threaded Itanium Family Processor "IEEE journal of solid-state circuits, vol. <sup>٤</sup>), no. <sup>1</sup>, January <sup>7</sup>··<sup>7</sup>, pp <sup>1</sup><sup>9</sup>.
- [1] Trent Rolf, "Cache Organization and Memory Management of the Intel Nehalem Computer Architecture", University of Utah Computer Engineering, CS 141. Final Project, December 7...9.