With the development of the semiconductor industry,
Because CPUs and GPUs were originally designed for quite different applications, their architectures differ in many ways. For instance, a CPU hides memory latency with caches, while a GPU hides it by switching between threads, so the CPU's cache mechanism is much more complicated than the GPU's. CPU thread scheduling is implemented by the operating system, while GPU thread scheduling relies largely on hardware. A CPU runs far fewer concurrent threads than a GPU. CPU cache coherence is maintained in hardware, while cache consistency among a GPU's shader clusters is handled by software. A GPU also contains a substantial number of units dedicated to graphics processing, such as the read-only texture cache. Even though Intel has borrowed a great deal from GPU design, its Teraflops implementation, Larrabee, still has several typical CPU characteristics. Similarly, the primary mission of Nvidia's GPGPU product, the GeForce GTX 280, is still to run 3D games better, so its general-purpose computing remains rooted in the nature of a GPU.
Unlike the single-core era, in which x86 was clearly dominant, the multicore era has only just begun, and it is still too early to predict whether many-core CPUs or GPGPUs will become the mainstream of future multi-threaded computing. Compared with hardware engineers, who only need to improve high-performance processors within the framework established by their employer, software engineers face a harder task, because they must meet the challenge of cross-platform development. If every multi-threaded application had to be completely rewritten when ported from an Nvidia platform to an Intel platform, software engineers would never see tomorrow's sun. GPU development faced a similar problem earlier in its history, and the debut of DirectX (DX) solved it: programmers only need to call fixed APIs to perform the required functions, without directly addressing a wide range of GPU hardware. Multi-threaded computing is going through the same process today, with DX11's Compute Shader and OpenCL. Given such a middle layer, programmers only need to be familiar with the APIs and do not have to deal directly with processors of different architectures. To adapt to this change, compilation is also divided into two stages. First, the compiler translates the source code into a unified intermediate language. Then the intermediate code is either compiled dynamically on each platform by the main CPU into binary code that suits the multi-threaded processor, or compiled into suitable binary code when the application is installed. At execution time, the binary code is downloaded directly to the multi-threaded processor's local memory (such as GPU memory) and run there.
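The two-stage scheme described above can be sketched in a few lines of Python. This is only a toy model under stated assumptions: the intermediate language (`IRInstr`), the target descriptions (`Target`), and the functions `front_end` and `back_end` are hypothetical names invented for illustration, and they do not correspond to DirectX, OpenCL, or any vendor's toolchain. The point is that the first stage is written once, while the second stage is specialized to each processor.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class IRInstr:
    """One instruction of a made-up, architecture-neutral intermediate language."""
    op: str
    dst: str
    lhs: str
    rhs: str


@dataclass(frozen=True)
class Target:
    """Hypothetical description of a multi-threaded processor."""
    name: str
    simd_width: int  # number of elements handled by one instruction


def front_end():
    """Stage 1: lower out[i] = a[i] * b[i] + c[i] to IR, knowing nothing about the target."""
    return [
        IRInstr("mul", "t0", "a", "b"),
        IRInstr("add", "out", "t0", "c"),
    ]


def back_end(ir, target, n):
    """Stage 2: expand the IR into pseudo-instructions that cover n elements on one target."""
    if n < 0 or target.simd_width < 1:
        raise ValueError("n must be >= 0 and simd_width must be >= 1")
    code = []
    for base in range(0, n, target.simd_width):
        lanes = min(target.simd_width, n - base)  # the last group may be partial
        for ins in ir:
            code.append(
                f"{ins.op}.x{lanes} {ins.dst}[{base}], {ins.lhs}[{base}], {ins.rhs}[{base}]"
            )
    return code


ir = front_end()                       # produced once, shipped with the application
wide = Target("wide_simd", 16)         # assumed 16-wide SIMD core
narrow = Target("scalar_threads", 1)   # assumed scalar lightweight threads
wide_code = back_end(ir, wide, 64)
narrow_code = back_end(ir, narrow, 64)
print(len(wide_code), len(narrow_code))  # prints: 8 128
```

The same IR yields 8 pseudo-instructions on the assumed 16-wide target and 128 on the scalar one, which is why the second stage has to know the hardware while the first does not.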
Thus, in the multi-threading era, compilers fall into two categories. The first compiles source code into the intermediate language. Its optimizations are independent of the architecture and rely on general automatic multi-threading and single-thread optimization techniques. The second compiles the intermediate language into binary code; it must take the characteristics of the hardware architecture into account and apply targeted optimizations. The former is relatively easy to implement with existing multi-threading tools, while work on the latter is relatively scarce in both theory and practice, and it must keep evolving as the hardware develops. This is the research direction I am interested in.
In my opinion, when the intermediate language is compiled to binary code, the following aspects of computer architecture will become important factors affecting performance:
- Instruction-level parallelism (ILP). For example, Intel's Larrabee architecture uses SIMD instructions that operate on 16 operands. To execute efficiently on such hardware, existing ILP extraction methods must be developed further and new ones introduced.
- Core-to-core communication and communication between a core and the levels of its memory hierarchy. Whatever the architecture, Nvidia or Intel, the core-to-core bandwidth of a multi-threaded processor is far below the throughput of each core, and the bandwidth that local memory can provide to a core is limited by its bit width and frequency. To improve the efficiency of a multi-threaded processor, we need to make good use of the cache attached to each core, reduce inter-core communication, and reduce local memory accesses.
- Rearranging thread instruction streams and data structures. The number of light-weight threads (LWTs) that can be executed differs greatly between architectures; for example, the Intel architecture can handle far fewer light-weight threads than the Nvidia architecture. The number of threads must therefore be set appropriately, the tasks and corresponding data sets of those threads must be restructured, and the way threads are dispatched must be optimized.
- Optimization across multiple multi-threaded processors. Systems with multiple CPUs or multiple GPUs are already very common, and systems with multiple multi-threaded processors will also become a popular configuration. In that case, load balancing and communication optimization among the multi-threaded processors are also topics worth studying.
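As a small illustration of the load-balancing concern in the last item, the sketch below statically divides a number of work items among several processors in proportion to their relative throughput. The function name `split_work` and the throughput figures are assumptions made for this example, not measurements of real hardware, and static proportional partitioning is only one of several possible strategies; it ignores communication cost entirely.

```python
def split_work(total_items, throughputs):
    """Split total_items among processors in proportion to their relative throughput.

    throughputs holds one positive number per processor (hypothetical units).
    Returns a list of item counts that always sums to total_items.
    """
    if total_items < 0 or not throughputs or any(t <= 0 for t in throughputs):
        raise ValueError("need total_items >= 0 and positive throughputs")
    total_rate = sum(throughputs)
    exact = [total_items * t / total_rate for t in throughputs]
    shares = [int(x) for x in exact]  # round every share down first
    leftover = total_items - sum(shares)
    # Give the remaining items to the processors with the largest fractional parts.
    order = sorted(
        range(len(throughputs)),
        key=lambda i: exact[i] - shares[i],
        reverse=True,
    )
    for i in order[:leftover]:
        shares[i] += 1
    return shares


# Assumed example: the second processor is three times as fast as the first.
print(split_work(1000, [1, 3]))  # prints: [250, 750]
```

A real back end would also have to weigh the cost of moving each processor's data set, which is exactly the communication problem named in the list above.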
