New software design technique allows programs to run faster

April 5, 2010

( -- Researchers at North Carolina State University have developed a new approach to software development that will allow common computer programs to run up to 20 percent faster and possibly incorporate new security measures.

The researchers have found a way to run different parts of some programs - including, for the first time, such widely used programs as word processors and Web browsers - at the same time, which makes the programs operate more efficiently.

In order to understand how they did it, you have to know a little bit about computers. The brain of a computer chip is its , or "core." has advanced to the point where it is now common to have between four and eight cores on each chip. But for a program to utilize these cores, it has to be broken down into separate "threads" - so that each core can execute a different part of the program simultaneously. The process of breaking down a program into threads is called parallelization, and allows computers to run programs very quickly.

However, some programs are difficult to parallelize, including word processors and Web browsers. These programs operate much like a flow chart - with certain program elements dependent on the outcome of others. These programs can only utilize one core at a time, minimizing the benefit of multi-core chips.

But NC State researchers have developed a technique that allows hard-to-parallelize applications to run in parallel, by using nontraditional approaches to break programs into threads.

Every computer program consists of multiple steps. The program will perform a computation, then perform a memory-management function - which prepares memory storage to contain data or frees up which is currently in use. It repeats these steps over and over again, in a cycle. And, for difficult-to-parallelize programs, both of these steps have traditionally been performed in a single core.

"We've removed the memory-management step from the process, running it as a separate thread," says Dr. Yan Solihin, an associate professor of electrical and computer engineering at NC State, director of this research project, and co-author of a paper describing the research. Under this approach, the computation thread and memory-management thread are executing simultaneously, allowing the to operate more efficiently.

"By running the memory-management functions on a separate thread, these hard-to-parallelize programs can operate approximately 20 percent faster," Solihin says. "This also opens the door to development of new memory-management functions that could identify anomalies in program behavior, or perform additional security checks. Previously, these functions would have been unduly time-consuming, slowing down the speed of the overall program."

Using the new technique, when a memory-management function needs to be performed, "the computational thread notifies the memory-management thread - effectively telling it to allocate data storage and to notify the computational thread of where the storage space is located," says Devesh Tiwari, a Ph.D. student at NC State and lead author of the paper. "By the same token, when the computational thread no longer needs certain data, it informs the memory-management thread that the relevant storage space can be freed."

Explore further: Intel Unveils Developer Kits, Labs Program To Accelerate Future Memory

More information: The paper, "MMT: Exploiting Fine-Grained Parallelism in Dynamic Memory Management," will be presented April 21 at the IEEE International Parallel and Distributed Processing Symposium in Atlanta.

Related Stories

From shared to distributed memory systems for applications

September 23, 2005

Shared-memory computing applications have never taken particularly well to operating on distributed-memory systems, at least until now. A possible solution has emerged, of interest to NASA and IBM, and is being tested on ...

Recommended for you

The ethics of robot love

November 25, 2015

There was to have been a conference in Malaysia last week called Love and Sex with Robots but it was cancelled. Malaysian police branded it "illegal" and "ridiculous". "There is nothing scientific about sex with robots," ...


Adjust slider to filter visible comments by rank

Display comments: newest first

not rated yet Apr 05, 2010
I wonder how they get past the fact that the computational portion of the program has to have access to the memory, in general, so this is now being copied back and for between the two process and vise versa of course -- it seems like this makes the memory footprint a lot bigger and introduces the need to locking memory --

A link to the paper would answer all of my questions
not rated yet Apr 05, 2010
This can be implemented it at least two ways:
1. The compiler identifies the in line threads related to memory managemant and assigns the threads as "called" processes assignable to parallel processing. The run-time identifies the CPU as single or multi core and makes the appropriate CPU core assignments. Appropriate semaphores may be implemented.
2. The pipeline for the hardware CPU(s) identifies system calls to the O/S (calls into H/W protected memory) and diverts these to a second core (when present) as a separate thread. The H/W sets a semaphore instruction in the application thread to await return from the system call, but sends the O/S call immediately to the separate core.
In both cases, the application process can give up the CPU until the O/S thread(s) completes (assuming no other task pre-empts this task). If the task is not able to safely give up control, semaphores are used.
not rated yet Apr 05, 2010
@el nose: memory is never moved or shuffled between cores except when performing math computations which may use CPU registers for speed. Memory is referenced by each core.

@winthrom: you are still suggesting single-threading the request, which makes it no more efficient than a single core EXCEPT that the other core may not be in use, thereby getting some speed due to no wait state (thus only 20% not 2X [100%] faster). Garbage Collection (memory management) using seperate threads is not new.

Determining how to maximize CPU cores is.
not rated yet Apr 05, 2010
i have absolutely no confidence in this idea, other than its ability to create almost impossible to fix memory access violations and unexplainable data corruption.
not rated yet Apr 05, 2010
@DeadMike: 20% is all the authors expect. 2X[100%] is not in the article. Garbage collection is an O/S function not an application request to the O/S. This is more like an IOWAIT and IOGO situation where the additional CPU core(s) are assumed to be available.
1 / 5 (1) Apr 05, 2010
As written, the article here describes nothing more than doing garbage collecting and data allocating in a separate thread. Which is old and boring, and definitely NOT non-traditional today. Some languages (e.g. the D programming language) do this implicitly. Hence, "a FIRST" is a gross error here.

The part "Exploiting Fine-Grained Parallelism" of the paper heading may be the clue to what should have been written here.

Not having read the original paper, on would guess a modified compiler or (much cooler) an object code analyser that inserts hints for the MM thread well in advance when a new variable needs space or another goes out of scope.
These hints would then simply be asynchronous messages.

No doubt, someone will post what it's really about?
not rated yet Apr 05, 2010
I agree. What the article describes is nothing new at all. Memory management is done in a similar manor on the XBox 360. Microsoft .NET technology is similar. But, introducing a new thread to request memory only adds overhead. In both cases, the calling thread needs to wait for the request to return. Doesn't matter if the allocation is done on the same core or another. It takes the same amount of time. Forcing it on another CPU just adds extra thread management. I see no gains on allocation, unless it can be predicted and spun off before it's needed. Garbage collection is already done in a separate thread in many existing technologies. I'd like to see the paper. My guess is the author of this article probably isn't a programmer and the technique just got lost in translation.
not rated yet Apr 05, 2010
This is a cool idea for legacy systems where a 20% improvement is a useful gain. However, it's ultimately a work-around for a single threaded architecture. Competent multi-core engineering practices can easily push that 20% to 200% and more.
3 / 5 (1) Apr 05, 2010
Link to the source paper:
4.5 / 5 (2) Apr 05, 2010
Thanks to MysticFear for the link.

I read the paper, and had the feeling that I had been through most of this 15-20 years ago when we added protected objects to Ada. A thousand characters is a very short space to describe the difference in philosophy, but I'll try. ;-)

In Ada, unlike in C and C++, most objects are allocated on the stack, not the heap. So for those cases the overhead of malloc and free is irrelevant. (Allocating an object on the stack requires computing the object size, then incrementing the stack pointer. The memory all goes away on function return.)

But what if you need to create a list or tree of (usually small) objects? You can use C style allocation, or a take a tree data type from a library and instantiate it. You can even roll your own. The trick is to make the tree a protected object, not the leaves, and use a bulk allocation strategy as here. (Ada finalization means that instead of freeing leaves, you can just throw the whole tree away.)
not rated yet Apr 05, 2010
Completely overlooked is the total CPU and power consumption of this approach. It seems to indicate that a relatively idle core can be (over?) utilized to gain some speed in the main thread core by using the second core to constantly map memory.

I'm fairly certain that proprietary operating systems on multi cores are already smarter than this approach.

Nice read and good work to the researchers, but I find this difficult to believe as being ground-breaking or new as others have noted.
not rated yet Apr 06, 2010
I read the paper. It turns out they propose two main memory management techniques : 1) Speculation on Memory Management 2) Bulk Memory Management. These two memory management techniques have not been proposed before.
Predicting memory allocation requests correctly and performing them speculatively looks a very neat idea, esp w/o any support from programmer, compiler, hardware or underlying allocator. Moreover, they go in detail how to perform such speculation efficiently. To reduce the communication overhead from other head, they introduced bulk delayed deallocation requests, which again looks neat-- as it reduces frequent communication. They also show to design various offloading mechanisms. This paper also claims to have improved the cache locality as a by-product. It is true that there is no power evaluation and authors assume availability of one extra core. I think it is okay with many-cores coming in the market.
not rated yet Apr 06, 2010
This is pretty neat. I remember back in school studying about branch prediction and lookahead buffering, this takes it a few steps further. Real gains to be had here. Having an extra "usually idle" core to do the prediction can speed up the other cores that are actually doing the work.
not rated yet Apr 06, 2010
It seems to me that their memory managment approach is based mostly on C++, where garbage collection is not an automatic process when elements lose their scope, unlike .NET, which already has garbage collection running in its own thread. Now, the actual allocation process could stand to favor from being recoded to run in a specific thread, but then again, when the program's main thread needs to make something, it still takes its time to make the request of the allocation thread, and still has to wait for the object to become available before it can continue.
The only way I can think to make use of that is for the module to request everything at the start and go on from there, leaving it to the allocation thread to create everything.
Now, are they suggesting that the allocation thread would also initialize the objects with the initial data?
not rated yet Apr 06, 2010
when the program's main thread needs to make something, it still takes its time to make the request of the allocation thread, and still has to wait for the object to become available before it can continue.
The only way I can think to make use of that is for the module to request everything at the start and go on from there, leaving it to the allocation thread to create everything.

I think the key is that the allocation thread (MMT) predicts the future allocation requests, and performs it an advance before the main computation thread even asks for it.

Please sign in to add a comment. Registration is free, and takes less than a minute. Read more

Click here to reset your password.
Sign in to get notified via email when new comments are made.