Intel flirts with exascale leap in supercomputing

Jun 19, 2012 by Nancy Owano
Xeon Phi

(Phys.org) -- If the exascale range is the next destination in high-performance computing, then Intel has a safe ticket to ride. Intel says its new Xeon Phi line of chips is an early stepping stone toward exascale. On Monday, Intel announced Xeon Phi as the name of its high-performance chip family at the International Supercomputing Conference (ISC) in Hamburg, Germany. Xeon Phi is now the brand name for Intel’s future Many Integrated Core (MIC) architecture-based products. The venue suited the announcement: the ISC is a key gathering for high-performance computing, networking and storage experts.

Intel chips are used in the majority of the 500 fastest supercomputers: Intel processors are found in 70 percent of the top 500.

The Xeon Phi is intended as a complement to Xeon, optimized for highly parallel supercomputing. It works as a coprocessor alongside a server CPU to accelerate workloads. The coprocessor will be compatible with x86 programming models and will run as an HPC-optimized, highly parallel, separate compute node with its own Linux-based operating system, independent of the host's operating system. The Phi technology is meant to deliver both high performance and energy efficiency when processing highly parallel applications.

The first Xeon Phi chip, codenamed Knights Corner, will be out at the end of this year. Knights Corner will have more than 50 cores and will deliver four or five gigaflops per watt. It will use 22 nanometer silicon and Intel’s 3-D Tri-Gate transistors. The architecture is ready for cluster-based experimental deployments.

Reaching exascale requires 40 to 50 gigaflops per watt. According to Intel’s John Hengeveld, director of marketing for high-performance computing, the first Phi chip would deliver about four to five gigaflops per watt, roughly a tenth of what exascale needs.
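To see why the per-watt figure matters, here is a rough back-of-envelope sketch in Python. It uses only the efficiency figures quoted above and makes one simplifying assumption that is not from Intel: that the chip-level gigaflops-per-watt number applies to the whole machine, ignoring memory, interconnect and cooling overheads.

```python
# Back-of-envelope power budget for a 1-exaflop machine.
# The efficiency figures come from the article; treating them as
# whole-system numbers is a simplifying assumption.

EXAFLOP = 1e18    # floating-point operations per second
GIGAFLOP = 1e9


def megawatts_for_exaflop(gigaflops_per_watt):
    """Power in megawatts needed to sustain 1 exaflop at the given efficiency."""
    flops_per_watt = gigaflops_per_watt * GIGAFLOP
    watts = EXAFLOP / flops_per_watt
    return watts / 1e6


for efficiency in (4, 5, 40, 50):
    print(f"{efficiency:>2} GF/W -> {megawatts_for_exaflop(efficiency):.0f} MW")
```

At four to five gigaflops per watt, an exaflop would draw 200 to 250 megawatts; at 40 to 50 gigaflops per watt, the same performance fits in 20 to 25 megawatts.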

Intel is working with other technology companies to support the new family. According to reports, Intel has commitments from a number of computer partners to use the Xeon Phi in their roadmaps. In its press announcement, Intel noted that its acquisition of InfiniBand and interconnect assets from QLogic and Cray presents further areas for Intel to innovate in delivering future scalable exascale-class platforms.

Supercomputers can harness the parallel-processing capabilities of graphics processors and chips like Phi to carry out complex calculations in scientific and math research. The task of building specialized chips that can execute more calculations per second while keeping power consumption in check, though, is not trivial. Intel is targeting no sooner than 2018 as the year it reaches “exascale performance.”

More information:
www.intel.com/content/www/us/e… ver-reliability.html
www.intel.com/newsroom/kits/is… 012-Presentation.pdf

User comments : 24

Sonhouse
not rated yet Jun 19, 2012
Odd way to publish performance. Green and all that but it doesn't tell you exactly what the level of performance is, how many watts does it consume? 10 watts? 100? Makes a bit of a difference. Can you actually modulate the power usage in this device, say if it runs 10 watts minimum and 100 watts maximum, can it actually do that? So 10 watts= 50 gigaflops, 100 watts = 500 gigaflops?

So at that level, it sounds like this iteration will max out at about 0.1 exaflop, 100 petaflops. That is still a lot faster than anything else out there.
TkClick
not rated yet Jun 19, 2012
I don't quite understand the role of these processors under the situation, when common graphical chips have hundreds of cores and they're used for scientific calculations already. Why fully featured cores are needed for apparently paralleled applications?
Burnerjack
4.5 / 5 (2) Jun 19, 2012
While it IS innovation, I can't help thinking this is "Stealth Advertising" for Intel.
eachus
not rated yet Jun 19, 2012
"I don't quite understand the role of these processors under the situation, when common graphical chips have hundreds of cores and they're used for scientific calculations already. Why fully featured cores are needed for apparently paralleled applications?" - TkClick

They are not. But it is worth understanding what we are looking at here. Knights Corner replaces the GPU with 50 x86 cores. (I'd have to check, but I think it doesn't support x87 instructions, so you need to use SSE instructions for floating-point.) The other thing it doesn't do is manage a lot of memory.

To program Knights Corner, you need to use the (standard) CPU to keep the KC chip fed with data, and save final results. For many problems solved on supercomputers, this type of architecture works just fine, and most CPU/GPU combos in the supercomputer world use it. KC can access 8 Gigs of GDDR5, but moving data into or out of that memory is done by the host CPU.
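
The host-driven pattern described here can be sketched as a toy model in Python. Everything in it is hypothetical and for illustration only: `ToyCoprocessor`, its methods and the `capacity` parameter are made-up names, not an Intel API, and a plain list stands in for the card's GDDR5 memory.

```python
# Toy model of a host CPU feeding a coprocessor card.
# ToyCoprocessor and its methods are hypothetical, not a real API.

class ToyCoprocessor:
    def __init__(self, capacity):
        self.capacity = capacity   # max elements the card's memory can hold
        self.memory = []

    def upload(self, chunk):
        if len(chunk) > self.capacity:
            raise ValueError("chunk does not fit in coprocessor memory")
        self.memory = list(chunk)  # host copies data onto the card

    def run(self, kernel):
        # Stand-in for the many cores working on the card's local memory.
        self.memory = [kernel(x) for x in self.memory]

    def download(self):
        return list(self.memory)   # host copies results back


def offload(data, kernel, card):
    """Host loop: feed the card one chunk at a time and collect the results."""
    results = []
    for start in range(0, len(data), card.capacity):
        card.upload(data[start:start + card.capacity])
        card.run(kernel)
        results.extend(card.download())
    return results


card = ToyCoprocessor(capacity=4)
print(offload([1.0, 2.0, 3.0, 4.0, 5.0, 6.0], lambda x: x * x, card))
# prints [1.0, 4.0, 9.0, 16.0, 25.0, 36.0]
```

The point of the sketch is that every chunk makes two trips across the host link, so the pattern pays off when the work done per chunk is large compared with the cost of moving it.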
eachus
not rated yet Jun 19, 2012
AMD is pushing its Trinity APU which combines CPU cores and a GPU on a single chip, for this type of system. The advantage is that the GPU can directly address main memory. The disadvantage is that you have to be careful of memory contention. The CPU and GPU share L2 caches, with no L3. Reduces memory latency a lot, but the L2 caches are much smaller (and faster) than GDDR5 memory.

Can you add GDDR5 to an AMD APU? Not currently. You can add a discrete GPU card (or cards) but the GDDR5 memory on the card requires extra data moves.
vlaaing peerd
not rated yet Jun 19, 2012
the performance is worse than a Dutch team on the euro finals. They reached 118Tflop with 9800 Xeon E5-2670 cores and these cards. Strangely enough one should already be able to reach 120Tflop with 8800 E5-2670 cores and NO GPU cards ...must be missing the clue somewhere here.

One good thing is that the cards support x86 code, something you could still have trouble with on the Nvidia cards. Which btw still pack more punch than this GPU card.

So we might have exascale machines in 2018, but definitely this card isn't going to be part of it.
Sonhouse
not rated yet Jun 19, 2012
"the performance is worse than a Dutch team on the euro finals. They reached 118Tflop with 9800 Xeon E5-2670 cores and these cards. Strangely enough one should already be able to reach 120Tflop with 8800 E5-2670 cores and NO GPU cards ...must be missing the clue somewhere here." - vlaaing peerd

But that is still quite a few orders of magnitude out from exaflop performance, 0.2 Petaflops, times 2000=2 exaflops. Little bit of work to go there :)

"One good thing is that the cards support x86 code, something you could still have trouble with on the Nvidia cards. Which btw still pack more punch than this GPU card.

So we might have exascale machines in 2018, but definitely this card isn't going to be part of it." - vlaaing peerd

Deathclock
1 / 5 (1) Jun 19, 2012
"0.2 Petaflops, times 2000=2 exaflops"

ur gud wit teh maths!

0.2 Petaflops * 10,000 = 2 Exaflops
0.2 Petaflops * 5,000 = 1 Exaflop.

That's okay, you were only off by a factor of 5
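
The unit conversion is easy to check with a few lines of Python. This is only a sketch of the arithmetic; the helper name `scaled_exaflops` is made up for the example.

```python
# 1 petaflop = 1e15 flops, 1 exaflop = 1e18 flops.
PETA = 1e15
EXA = 1e18


def scaled_exaflops(petaflops, factor):
    """Exaflops reached when a machine of the given size is scaled by factor."""
    return petaflops * PETA * factor / EXA


for factor in (2_000, 5_000, 10_000):
    print(factor, scaled_exaflops(0.2, factor))
```

Multiplying 0.2 petaflops by 2,000 gives about 0.4 exaflops, which is where the factor of 5 comes from.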
hyongx
not rated yet Jun 19, 2012
"While it IS innovation, I can't help thinking this is "Stealth Advertising" for Intel." - Burnerjack

hahaha--- does intel NEED to advertise? I don't think so.
Vendicar_Decarian
5 / 5 (1) Jun 19, 2012
Because most problems are not readily vectorizable.

"Why fully featured cores are needed for apparently paralleled applications?" - TkClick

Typically heterogeneous computers run at only a fraction of their "peak" performance.

Until recently the graphics core derived vector processors have been for example, incapable of conditional code execution, and code branching had to be done by a conventional CPU.

Further, some code simply can not be vectorized - it can be parallelized - but not vectorized. If there are a lot of conditional execution paths through a block of code - and with my code there is typically one branch for every 5 or 10 opcodes, then it can not even be run on GPU cores.
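
One common way vector hardware copes with a data-dependent branch is to evaluate both paths for every lane and pick the result with a mask. The Python sketch below illustrates that idea only; plain lists stand in for vector lanes, the function names are made up, and it does not describe how any particular GPU or MIC chip implements it.

```python
# Scalar form: one data-dependent branch per element.
def scalar_branchy(values):
    out = []
    for v in values:
        if v < 0:
            out.append(-v * 2)
        else:
            out.append(v + 1)
    return out


# Vector-style form: compute both paths for every lane, then select by mask.
def masked_select(values):
    mask = [v < 0 for v in values]        # one predicate per lane
    taken = [-v * 2 for v in values]      # path A evaluated for all lanes
    not_taken = [v + 1 for v in values]   # path B evaluated for all lanes
    return [a if m else b for m, a, b in zip(mask, taken, not_taken)]


data = [3, -1, 0, -7]
assert scalar_branchy(data) == masked_select(data)
print(masked_select(data))  # prints [4, 2, 1, 14]
```

Every lane pays for both paths, so with a branch every 5 or 10 instructions, and branches nested inside branches, most of the vector work ends up thrown away.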

Nvidia's latest GPU cores now have some branching capacity although I don't know how this works at the moment.

Intel is betting that MMX (and other) vectorization per CPU core and a homogeneous instruction set will serve the needs of massively networked computers better than the existing methods.

cont.
Vendicar_Decarian
not rated yet Jun 19, 2012
I agree, but the x86 core is not the best CPU to use. In fact it is among the worst.
Vendicar_Decarian
not rated yet Jun 19, 2012
http://marketings...ok-like/

"does intel NEED to advertise? I don't think so." - HvonQx

Intel spent $50 million in advertising in 2010.

But you know better... Right?

Vendicar_Decarian
4 / 5 (1) Jun 19, 2012
Doing what? Anything useful or were they just tweaking compiler output for synthetic benchmarks?

"the performance is worse than a Dutch team on the euro finals. They reached 118Tflop with 9800 Xeon E5-2670 cores and these cards." - Vlaainq....

"They reached" = They spent their time playing around.
Vendicar_Decarian
not rated yet Jun 19, 2012
What do you think PR-Newswire is about?

"While it IS innovation, I can't help thinking this is "Stealth Advertising" for Intel." - Burnerjack
Noumenon
2 / 5 (4) Jun 19, 2012
"and with my code there is typically one branch for every 5 or 10 opcodes, then it can not even be run on GPU cores." - Vendicar_Decarian

Spaghetti code no doubt. :)
QQBoss
5 / 5 (1) Jun 20, 2012
"I agree, but the x86 core is not the best CPU to use. In fact it is among the worst." - Vendicar_Decarian

While the x86 instruction set and the effort it takes to convert decades of cruft into high performance computing does suck, MMX is not the x86 instruction set and it doesn't suck (I was a PowerPC architect, I do have a clue). Knights Corner doesn't exist to make it easy to run 50 versions of Microsoft Word simultaneously, it exists to make it trivial for programmers to take vectorized code written for MMX and distribute it out across 50 cores if the problem has adequate inherent parallelism. Many do.

If you know anything about the Cell processor from IBM (and used in the PS3), it is a similar approach on a bigger scale and partitioned differently due to the constraints of the implementation. Exactly why and how would take far more than 1000 characters to explain.

Roland
5 / 5 (1) Jun 23, 2012
Intel's big advantage here is not hardware performance, it's open-source drivers:
http://www.bright...res.aspx
Vendicar_Decarian
not rated yet Jun 23, 2012
"Spaghetti code no doubt." - NumenTard

Code is generally that. A dense set of conditionals.

Here is a very old example.

INCLUDE C:\MASM\MYDEFS.INC

COMTOP CSEG

QSORT: PUSH DS ; ES = DS
POP ES
MOVFW AX,ARRAYSIZE ; AX = ARRAY SIZE
MOVFW SI,PMEM ; SI = DX = PTR TO ARRAY
MOV DX,SI
ADD DX,AX ; DX = PTR TO END OF ARRAY + 1
ADD DX,AX
SUB DX,2 ; DX = PTR TO END OF ARRAY

CALL RECURSE
RET

ARRAYSIZE DW 00
PMEM DW 00

Cont...
Vendicar_Decarian
not rated yet Jun 23, 2012
;----------------------------------------------------------------------
; QUICK SORT - BREAK LIST INTO TWO PARTS, SUCH THAT
; ALL ELEMENTS IN PART 1 <= ALL ELEMENTS IN PART 2.
; VALUE USED TO DETERMINE SPLIT POINT = FIRST ELEMENT
; IN LIST.
; RECURSIVELY PROCESS BOTH RESULTING TABLES.
;
; CALL WITH SI = PTR TO FIRST ELEMENT IN TABLE
; DX = PTR TO LAST ELEMENT IN TABLE
;
; RETURN WITH SI NEW SPLIT POINT
; DI = DX
; TABLE 0-SI <= VALUE ?
; TABLE SI+1-DX > VALUE ?
;

Cont...

Vendicar_Decarian
not rated yet Jun 23, 2012
RECURSE: CMP SI,DX ; QUIT IF 1 ELEMENT OR LESS IN LIST
JAE ENDQ
LEA DI,[SI+2] ; DI = CHECK ELEMENT = INIT TO ELEMENT 1

MOV AX,[SI] ; AX = FIRST VAL IN ARRAY (SPLIT VALUE)
RQLOOP1: CMP [DI],AX ; GO IF CHECK ELEMENT >= SPLIT VALUE
JAE RQL1N1
; CHECK ELEMENT < SPLIT VALUE SO...
ADD SI,2 ; BUMP SPLIT POINT

MOV CX,[SI] ; XCHG [DI],[SI]
XCHG CX,[DI]
MOV [SI],CX

RQL1N1: ADD DI,2 ; BUMP LAST ELEMENT PTR
CMP DI,DX ; LOOP IF NOT AT END OF LIST
JBE RQLOOP1

MOV AX,[SI] ; XCHG FIRST AND LAST ELEMENTS IN LIST 0
XCHG AX,[TOP]
MOV [SI],AX

Cont...
Vendicar_Decarian
not rated yet Jun 23, 2012
MPUSH ; PRESERVE PTRS
PUSH DX ; PRESERVE OLD END PTR
MOV DX,BX ; DX = NEW END PTR
CALL RECURSE ; GO SPLIT AGAIN
POP SI ; RESTORE OLD END PTR
ADD SI,2 ; MOVE TO FIRST ELEMENT
CALL RECURSE ; GO SPLIT AGAIN
MPOP ; RESTORE PTR AND END
ENDQ: RET

COMBOT CSEG
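
For readers who do not follow 16-bit assembly, here is a Python sketch of what the routine above appears to do: partition around the first element, then recurse on both halves. It is one reading of the listing, not a translation. The macros (MPUSH, MPOP, MOVFW) and the TOP and BX references are not defined in the posted code, so their roles are assumed; the names `partition`, `quicksort`, `split`, `lo` and `hi` are made up for the example, and Python's ordinary comparison replaces the unsigned word compare.

```python
def partition(a, lo, hi):
    """Split a[lo..hi] around a[lo]; return the pivot's final index."""
    pivot = a[lo]
    split = lo                           # plays the role of SI
    for i in range(lo + 1, hi + 1):      # i plays the role of DI
        if a[i] < pivot:
            split += 1                   # bump split point
            a[split], a[i] = a[i], a[split]
    a[lo], a[split] = a[split], a[lo]    # move the pivot to the split point
    return split


def quicksort(a, lo=0, hi=None):
    """Sort list a in place, recursing on both sides of the split point."""
    if hi is None:
        hi = len(a) - 1
    if lo >= hi:                         # quit if 1 element or less
        return
    p = partition(a, lo, hi)
    quicksort(a, lo, p - 1)
    quicksort(a, p + 1, hi)


nums = [5, 2, 9, 1, 5, 6]
quicksort(nums)
print(nums)  # prints [1, 2, 5, 5, 6, 9]
```

Even in this short form there is a compare-and-branch inside the inner loop whose outcome depends on the data, which is the kind of control flow the listing was posted to illustrate.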
Vendicar_Decarian
not rated yet Jun 23, 2012
A small snippet of code from a terminal emulator.

talk2: call pc_stat ; keyboard character waiting?
jz talk4 ; nothing waiting, jump
call pc_in ; read keyboard character
cmp al,0 ; is it a function key?
jne talk3 ; not function key, jump
call pc_in ; function key, get 2nd part
cmp al,AltX ; was it Alt-X?
je talk5 ; yes, terminate program
push ax ; no, send to comm port
xor al,al ; send lead byte (0)
call com_out
pop ax ; send function key code
call com_out
jmp talk2 ; get another key
Vendicar_Decarian
not rated yet Jun 23, 2012
MMX sucks big time.

MMX is to parallelism what integers are to floating point.

"MMX is not the x86 instruction set and it doesn't suck" = QQBoss

MMX was a way for Intel to put its poorly designed x87 floating point unit to use as an integer extension of the main CPU. And in addition providing a method of solving the x87 design flaw of using stack addressed registers rather than absolute addressing.

Proper vectorization requires vectors of arbitrary length and type rather than instructions that operate on 2 or 4 chunks of a vector at a time, which is certainly an improvement, and certainly meshes with the cache line width of the x86, but is a poor design for an instruction set.
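
The practical consequence of fixed-width instructions can be sketched in Python: an array of arbitrary length has to be cut into full-width chunks by the programmer or compiler, with a scalar loop for the leftovers. This is an illustration only. `WIDTH = 4` is an assumed lane count, `add_fixed_width` stands in for a single packed-add instruction, and neither name corresponds to a real instruction set.

```python
WIDTH = 4  # assumed lane count of one fixed-width "instruction"


def add_fixed_width(a, b):
    """Stand-in for one packed add: exactly WIDTH pairs of elements."""
    assert len(a) == len(b) == WIDTH
    return [x + y for x, y in zip(a, b)]


def vector_add(a, b):
    """Add two equal-length lists of any length using fixed-width steps."""
    if len(a) != len(b):
        raise ValueError("length mismatch")
    n = len(a)
    full = n - n % WIDTH                   # elements covered by full chunks
    out = []
    for i in range(0, full, WIDTH):        # main loop: full-width chunks
        out.extend(add_fixed_width(a[i:i + WIDTH], b[i:i + WIDTH]))
    for i in range(full, n):               # scalar tail for the leftovers
        out.append(a[i] + b[i])
    return out


print(vector_add(list(range(10)), [10] * 10))
# prints [10, 11, 12, 13, 14, 15, 16, 17, 18, 19]
```

With a true variable-length vector instruction the chunking and the tail loop would be the hardware's job rather than the code's.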

Vendicar_Decarian
not rated yet Jun 23, 2012
We all know that it was a marketplace failure.

x86 and the X-Box put an end to that chip.

"If you know anything about the Cell processor from IBM..." = QQBoss

And Apple's abandonment of the PowerPC architecture for Intel effectively put an end to the PowerPC as well.

By the way, why didn't you include a quad precision float load/store instruction?
