The Intel 80x86 Family
7.0 seconds user time
Compile and time the run of the above program two different ways, first as it is, and then with the macro call changed to DUMBCOPY. We measured this on a SPARCstation 2, and there was a consistent large performance degradation with the dumb copy.
The slowdown happens because the source and destination are an exact multiple of the cache size apart. Cache lines on the SS2 aren't filled sequentially—the particular algorithm used happens to fill the same line for main memory addresses that are exact multiples of the cache size apart. This arises from optimized storage of tags—only the high-order bits of each address are put in the tag in this design.
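The mapping from address to cache line can be sketched in a few lines of C. This is only an illustration under assumed numbers: the cache geometry (a 64 KB direct-mapped cache with 32-byte lines) and the helper names are hypothetical, chosen to show why two addresses an exact multiple of the cache size apart compete for one line.

```c
#include <stddef.h>
#include <stdint.h>

/* Assumed geometry, for illustration only: a 64 KB direct-mapped cache
   with 32-byte lines. Real hardware may differ. */
#define CACHE_SIZE 65536u
#define LINE_SIZE  32u

/* Index of the cache line that a given address maps to. */
static unsigned cache_line_index(uintptr_t addr)
{
    return (unsigned)((addr % CACHE_SIZE) / LINE_SIZE);
}

/* The tag is what remains: the high-order bits of the address. */
static uintptr_t cache_tag(uintptr_t addr)
{
    return addr / CACHE_SIZE;
}

/* Nonzero if two addresses fight over the same line: same index,
   different tag, so loading one evicts the other. */
static int same_line_different_tag(uintptr_t a, uintptr_t b)
{
    return cache_line_index(a) == cache_line_index(b)
        && cache_tag(a) != cache_tag(b);
}
```

Under these assumptions, a source at 0x10040 and a destination at 0x30040 share a line index, so a copy that alternates between them one word at a time would evict its own data on every reference.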
All machines that use a cache (including supercomputers, modern PCs, and everything in between) are subject to performance hits from pathological cases like this one. Your mileage will vary on different machines and different cache implementations.
In this particular case both the source and destination use the same cache line, causing every memory reference to miss the cache and stall the processor while it waits for regular memory to deliver. The library memcpy() routine is especially tuned for high performance. It unrolls the loop to read a full cache line and then write it, which avoids the problem.
Using the smart copy, we were able to get a huge performance improvement. This also shows the folly of drawing conclusions from simple-minded benchmark programs.
The Data Segment and Heap
We have covered the background on system-related memory issues, so it's time to revisit the layout of memory inside an individual process. Now that you know the system issues, the process issues will start making a lot more sense. Specifically, we'll begin by taking a closer look at the data segment within a process.
Just as the stack segment grows dynamically on demand, so the data segment contains an object that can do this, namely, the heap, shown in Figure 7-5. The heap area is for dynamically allocated storage, that is, storage obtained through malloc (memory allocate) and accessed through a pointer.
Everything in the heap is anonymous: you cannot access it directly by name, only indirectly through a pointer. The malloc library call (and friends: calloc, realloc, etc.) is the only way to obtain storage from the heap. The function calloc is like malloc, but clears the memory to zero before giving you the pointer. Don't think that the "c" in calloc() has anything to do with C programming; it means "allocate zeroized memory". The function realloc() changes the size of a block of memory pointed to, either growing or shrinking it, often by copying the contents somewhere else and giving you back a pointer to the new location. This is useful when growing the size of tables dynamically; more about this in Chapter 10.
Figure 7-5. Where the Heap Lives
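To make calloc and realloc concrete, here is a minimal sketch of a table that grows on demand. The struct table type and the table_ functions are hypothetical names invented for this example; only calloc, realloc, and free come from the C library.

```c
#include <stdlib.h>

/* Hypothetical growable table of ints, for illustration. */
struct table {
    int   *items;
    size_t used;      /* elements in use */
    size_t capacity;  /* elements allocated */
};

/* Start with zero-filled storage from calloc. Returns 0 on success. */
static int table_init(struct table *t, size_t capacity)
{
    t->items = calloc(capacity, sizeof *t->items);
    if (t->items == NULL)
        return -1;
    t->used = 0;
    t->capacity = capacity;
    return 0;
}

/* Append a value, doubling the storage with realloc when full. */
static int table_append(struct table *t, int value)
{
    if (t->used == t->capacity) {
        size_t new_cap = t->capacity ? t->capacity * 2 : 4;
        /* Keep the old pointer until realloc succeeds: on failure it
           returns NULL and leaves the old block intact. */
        int *p = realloc(t->items, new_cap * sizeof *p);
        if (p == NULL)
            return -1;
        t->items = p;          /* the block may have moved */
        t->capacity = new_cap;
    }
    t->items[t->used++] = value;
    return 0;
}

/* Return the storage to the heap and reset the bookkeeping. */
static void table_free(struct table *t)
{
    free(t->items);
    t->items = NULL;
    t->used = t->capacity = 0;
}
```

Assigning the result of realloc to a temporary first is a deliberate choice: writing it straight back into t->items would lose the only pointer to the old block if the call failed.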
Heap memory does not have to be returned in the same order in which it was acquired (it doesn't have to be returned at all), so unordered malloc/free's eventually cause heap fragmentation. The heap must keep track of different regions, and whether they are in use or available to malloc. One scheme is to have a linked list of available blocks (the "free store"), and each block handed to malloc is preceded by a size count that goes with it. Some people use the term arena to describe the set of blocks managed by a memory allocator (in SunOS, the area between the end of the data segment and the current position of the break).
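The linked-list scheme just described can be sketched as a toy allocator. This is not how any particular malloc is implemented; the names (struct block, toy_init, toy_alloc, toy_free) are hypothetical, and the sketch omits coalescing of adjacent free blocks and overflow checks to stay short.

```c
#include <stddef.h>

/* Every block starts with a header holding its size. Free blocks are
   also chained together through the next pointer. */
struct block {
    size_t        size;   /* usable bytes that follow the header */
    struct block *next;   /* next free block; meaningful only when free */
};

static struct block *free_list;   /* head of the "free store" */

/* Hand a suitably aligned region to the allocator as one free block. */
static void toy_init(void *region, size_t bytes)
{
    free_list = region;
    free_list->size = bytes - sizeof(struct block);
    free_list->next = NULL;
}

/* First fit: take the first free block that is big enough, splitting
   off the tail as a new free block when there is room to spare. */
static void *toy_alloc(size_t bytes)
{
    /* Round up so that any block carved out stays header-aligned. */
    bytes = (bytes + sizeof(struct block) - 1)
            / sizeof(struct block) * sizeof(struct block);

    struct block **link = &free_list;
    for (struct block *b = free_list; b != NULL;
         link = &b->next, b = b->next) {
        if (b->size < bytes)
            continue;
        if (b->size >= bytes + 2 * sizeof(struct block)) {
            struct block *rest = (struct block *)((char *)(b + 1) + bytes);
            rest->size = b->size - bytes - sizeof(struct block);
            rest->next = b->next;
            b->size = bytes;
            *link = rest;          /* the tail replaces b on the list */
        } else {
            *link = b->next;       /* hand out the whole block */
        }
        return b + 1;              /* user data starts after the header */
    }
    return NULL;
}

/* Step back to the header to find the block, then put it on the list. */
static void toy_free(void *p)
{
    struct block *b = (struct block *)p - 1;
    b->next = free_list;
    free_list = b;
}
```

Because freed blocks are never merged with their neighbors here, a long run of mixed-size calls leaves the list full of small pieces, which is the fragmentation described above in miniature.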
Malloced memory is always aligned appropriately for the largest size of atomic access on a machine, and a malloc request may be rounded up in size to some convenient power of two. Freed memory goes back into the heap for reuse, but there is no (convenient) way to remove it from your process and give it back to the operating system.
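Both properties are easy to probe. The sketch below assumes a C11 compiler (for max_align_t and alignof); round_up_pow2 is a hypothetical helper showing what rounding to a power of two looks like, not the policy of any particular malloc.

```c
#include <stdalign.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Nonzero if p is aligned strictly enough for any standard scalar type. */
static int is_max_aligned(const void *p)
{
    return (uintptr_t)p % alignof(max_align_t) == 0;
}

/* Hypothetical helper: smallest power of two >= n, for n >= 1.
   Returns 0 if the result would not fit in a size_t. */
static size_t round_up_pow2(size_t n)
{
    size_t p = 1;
    while (p < n) {
        if (p > SIZE_MAX / 2)
            return 0;
        p <<= 1;
    }
    return p;
}
```

Passing a pointer fresh from malloc to is_max_aligned is a quick way to confirm the alignment claim on your own system, and round_up_pow2(100) gives 128, a hint at how much slack a rounded-up request can carry.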
The end of the heap is marked by a pointer known as the "break". [2] When the heap manager needs more memory, it can push the break further away using the system calls brk and sbrk. You typically don't call brk yourself explicitly, but if you malloc enough memory, brk will eventually be called for you. The calls that manage memory are:
[2] Your programs will "break" if they reference past the break...
malloc and free: get memory from heap and give it back to heap
brk and sbrk: adjust the size of the data segment to an absolute value/by an increment
One caution: your program may not call both malloc() and brk(). If you use malloc, malloc expects to have sole control over when brk and sbrk are called. Since sbrk provides the only way for a process to return data segment memory to the kernel, if you use malloc you are effectively prevented from ever shrinking the program data segment in size. To obtain memory that can later be returned to the kernel, use the mmap system call to map the /dev/zero file. To return this memory, use munmap.
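The mmap route might look like the following sketch. The function names zero_map and zero_unmap are made up for this example, and the code assumes a system where /dev/zero exists and can be mapped privately.

```c
#include <fcntl.h>
#include <stddef.h>
#include <sys/mman.h>
#include <unistd.h>

/* Map `bytes` of zero-filled private memory backed by /dev/zero.
   Returns NULL on failure. */
static void *zero_map(size_t bytes)
{
    int fd = open("/dev/zero", O_RDWR);
    if (fd == -1)
        return NULL;
    void *p = mmap(NULL, bytes, PROT_READ | PROT_WRITE, MAP_PRIVATE, fd, 0);
    close(fd);                  /* the mapping survives the close */
    return p == MAP_FAILED ? NULL : p;
}

/* Give the memory back to the kernel. Returns 0 on success. */
static int zero_unmap(void *p, size_t bytes)
{
    return munmap(p, bytes);
}
```

Unlike a block handed to free(), the region released by zero_unmap leaves the process address space, so touching it afterward is an error rather than a silent reuse.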
Memory Leaks
Some programs don't need to manage their dynamic memory use; they simply allocate what they need, and never worry about freeing it. This class includes compilers and other programs that run for a fixed or bounded period of time and then terminate. When such a program finishes, it automatically relinquishes all its memory, and there is little need to spend time giving up each byte as soon as it will no longer be used.
Other programs are more long-lived. Certain utilities such as calendar manager, mailtool, and the operating system itself have to run for days or weeks at a time, and manage the allocation and freeing of dynamic memory. Since C does not usually have garbage collection (automatic identification and deallocation of memory blocks no longer in use) these C programs have to be very careful in their use of malloc() and free(). There are two common types of heap problems:
• freeing or overwriting something that is still in use (this is a "memory corruption")
• not freeing something that is no longer in use (this is a "memory leak")
These are among the hardest problems to debug. If the programmer does not free each malloced block when it is no longer needed, the process will acquire more and more memory without releasing the portions no longer in use.
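A leak is easy to demonstrate with a pair of counting wrappers. Everything here except malloc and free is hypothetical: counted_malloc, counted_free, and the deliberately buggy leaky function exist only to show the block count climbing.

```c
#include <stdlib.h>

static long outstanding;   /* blocks allocated but not yet freed */

/* malloc wrapper that counts each successful allocation. */
static void *counted_malloc(size_t bytes)
{
    void *p = malloc(bytes);
    if (p != NULL)
        outstanding++;
    return p;
}

/* free wrapper that counts each block handed back. */
static void counted_free(void *p)
{
    if (p != NULL)
        outstanding--;
    free(p);
}

/* Deliberately buggy: allocates two blocks but frees only one. */
static void leaky(void)
{
    char *a = counted_malloc(64);
    char *b = counted_malloc(64);
    counted_free(a);
    (void)b;               /* b is never freed: this is the leak */
}
```

Each call to leaky leaves outstanding one higher than before, which is what the growth of a long-running leaky process looks like from the inside.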
Handy Heuristic
Avoiding Memory Leaks
Whenever you write malloc, write a corresponding free statement.
If you don't know where to put the "free" that corresponds to your "malloc", then you've probably created a memory leak!
One simple way to avoid this is to use alloca() for your dynamic needs when possible.
The alloca() routine allocates memory on the stack; when you leave the function in which you called it, the memory is automatically freed.
Clearly, this can't be used for structures that need a longer lifetime than the function invocation in which they are created; but for stuff that can live within this constraint, dynamic memory allocation on the stack is a low-overhead choice. Some people deprecate the use of alloca because it is not a portable construct. alloca() is hard to implement efficiently on processors that do not support stacks in hardware.
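As a small illustration of the heuristic, the sketch below uses alloca() for a scratch buffer that is needed only inside one function. It assumes a system that declares alloca in <alloca.h>, which is common but not guaranteed; is_palindrome is a made-up example function.

```c
#include <alloca.h>   /* non-standard; the header's location is an assumption */
#include <stddef.h>
#include <string.h>

/* Nonzero if s reads the same forward and backward. */
static int is_palindrome(const char *s)
{
    size_t n = strlen(s);
    char *rev = alloca(n + 1);       /* scratch buffer on the stack */

    for (size_t i = 0; i < n; i++)
        rev[i] = s[n - 1 - i];
    rev[n] = '\0';

    /* No free() needed: rev vanishes when the function returns. */
    return strcmp(s, rev) == 0;
}
```

The convenience has a price: alloca() has no way to report failure, so a very long string could overflow the stack. Keep the sizes small and bounded.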
We use the term "memory leak" because a scarce resource is draining away in a process. The main user-visible symptom of a memory leak is that the guilty process slows down. This happens because larger processes are more likely to have to be swapped out to give other processes a chance to run.
Larger processes also take a longer time to swap in and out. Even though (by definition) the leaked memory itself isn't referenced, it's likely to be on a page with something that is, thus enlarging the working set and slowing performance. An additional point to note is that a leak will usually be larger than the size of the forgotten data structure, because malloc() usually rounds up a storage request to the next larger power of two. In the limiting case, a process with a memory leak can slow the whole machine down, not just the user running the offending program.
A process has a theoretical size limit that varies from OS to OS. On current releases of SunOS, a process address space can be up to 4 Gbytes; in practice, swap space would be exhausted long before a process leaked enough memory to grow that big. If you're reading this book five years after it was written, say around the turn of the millennium, you'll probably get a good laugh over this by then long-obsolete restriction.
How to Check for a Memory Leak
Looking for a memory leak is a two-step process. First you use the swap command to see how much swap space is available:
/usr/sbin/swap -s
total: 17228k bytes allocated + 5396k reserved = 22624k used, 29548k available
Type the command three or four times over the space of a minute or two, to see if available swap space keeps getting smaller. You can also use other /usr/bin/*stat tools, such as netstat and vmstat. If you see an increasing amount of memory being used and never released, one possible explanation is that a process has a memory leak.
Handy Heuristic
Listening to the Network's Heartbeat: Click to Tune
Of all the network investigative tools, the absolute tops is snoop.
The SVr4 replacement for etherfind, snoop captures packets from the network and displays them on your workstation. You can tell snoop just to concentrate on one or two machines, say your own workstation and your server. This can be useful for troubleshooting connectivity problems—snoop can tell you if the bytes are even leaving your machine.
The absolute best feature of snoop, though, is the -a option. This causes snoop to output a click on the workstation loudspeaker for each packet. You can listen to your network ether traffic. Different packet lengths have different modulation. If you use snoop -a a lot, you get good at recognizing the characteristic sounds, and can troubleshoot and literally tune a net "by ear"!
The second step is to identify the suspected process, and see if it is guilty of a memory leak. You may already know which process is causing the problem. If not, the command ps -lu username shows the size of all your processes, as in the example below: