January 2008
Christopher Tacke
OpenNETCF Consulting, LLC
Download the Source
Download the PDF
Introduction
A few months ago a colleague of mine and I were invited to
visit a potential customer to give them our thoughts and a expertise in the
design of a new control system. The client makes machines that folds and fills
cartons – think of cereal boxes and the like – at amazingly high speeds.
They had been using a proprietary system that they has
designed with no OS and they were reaching the limits of just how fast they
could run and how many sensors and the like that they could put on the Sercos
bus so they figured they could leverage Windows CE (readily available for the
controllers they already had) to bring out a new generation of machines with
better throughput, faster reset times and all of the sorts of things their
customers would love.
After a full day of walking through wonderful UML diagrams
that detailed every last bit of how the process for their new system would work
a few things were very clear.
1. They
definitely knew how to build boxes. Seriously – it’s far more complex than I
imagined, and if I ever need a machine that folds cartons, these guys would
have my business.
2. They
spent a large amount of time thinking through a whole lot of the details of how
this new system architecture would work. And when I say a large amount of
time, I mean probably more time than I’ve spent developing entire products.
They had done serious due diligence here.
3. They
had no idea how managed code worked and a large majority of what they had done
was going to be unusable.
True story. They had designed a large, very flexible,
pluggable, extensible, deeply-abstracted architecture for a machine that should
be able to fold several hundred or more cartons a minute and it had all been
designed around a managed code foundation. These machines have motors, servos,
actuators, conveyors and all sorts of things that have to run in a very finely
tuned and orchestrated dance at a very high speed, and if any one item is in
the wrong place at the wrong time then the best case is that something gets
jammed or broken and worst case is a machine operator gets seriously injured.
The problem with their design was rooted in two fundamental
issues; the first being that managed code is not deterministic. That means
that when any action happens (like a sensor sees a moving box) then the time
for a reaction to occur (an arm moves to make a fold) is unbounded. If you’re
trying to make boxes that are all the same size and shape at any reasonable
rate of speed, you can imagine that having a fold in the same spot every time
might be fairly important.
Often you might hear the term “real-time” but the problem
with that term is the definition depends on the system you’re looking at. A
system is real-time if the maximum latency (or jitter) of a reaction is less
than what you need for your system to operate. If you’re calculating
trajectories for an air-to-air missile your definition of real-time is a whole
lot different than if you’re trying to measure the movement of a tectonic plate
in “real-time.”
At any rate, we know that any environment with a garbage
collector, whether it’s C#, VB.NET running on Windows or Java running on Linux,
is not deterministic by its very nature, so there’s no point in beating on this
point too heavily. If you want to know about the deterministic behaviors of
either Windows CE or managed code, there are good resources on the web to get
you well acquainted (and you’ll be seeing more from us in the coming months as
well).
The other problem with their design using managed code,
however, is not so obvious (unless of course you read the title of this
paper). If C# were somehow made deterministic, would it then be a good choice
for them? We know that a large amount of what they need to do is going to
require Platform Invokes (or P/Invokes) and I instinctively felt that there was
going to be a problem with performance because of that. Of course I had no
data to back that up and I knew of no specific papers or articles that
specifically looked at what the performance impacts of P/Invoke calls were, but
I knew in my gut that they were going to be a problem, so we went back to the
hotel, had dinner and a couple drinks, and I proceeded to put together some
test code to prove out what I knew deep-down just had to be true.
Test Theory
My original test code was far more complex that what we’ll
be looking at for this article. In it I tried to do more realistic simulations
of what they would be doing in practice, including a lot of thread interactions
and driver calls. For the most part a lot of those tests would just add
complexity and confusion to these tests so for this article I distilled the
problem down to just a few test points:
- How
long does it take for a call originating in managed code to arrive in the
called native library?
- How
long does it take for the call to return back to the managed caller?
- Does
a callback perform any better than a returned value or parameter?
To measure the first I decided that I would get the current
value of the performance counter right before the P/Invoke call and then in the
native library I would grab the value again as soon as the method was entered.
To make the second and third tests easier, I would simply return the
performance counter value collected in the native method back to the managed
caller either directly as a return value or through a callback. The managed
caller could then do some basic math and arrive at answers for all three
questions.
Of course the raw answers themselves are pretty useless. If
you want better performance, all you need is a faster processor. To really
make any use of these numbers, we need to do the same test in pure native code
doing the exact same steps and then do a comparison of the native and managed
results. What is important is not the hard numbers, but the ratios or
percentages of a managed test to an analogous native test.
The Native Callee
Step one was to create the native library that would get
called by both the native and the managed clients. Using the same library for
both tests eliminates it as a source of variability, and in this case is was
really simple. I simply exported two functions – one that would return the
time it received a call through an output parameter and the other through a
callback. The code is really simple:
typedef void
(*NATIVE_CALLBACK)(LARGE_INTEGER timeReceived);
static LARGE_INTEGER g_li;
extern "C"
__declspec(dllexport)
void DirectCall(LARGE_INTEGER *timeReceived)
{
// get the time
QueryPerformanceCounter(timeReceived);
}
extern "C"
__declspec(dllexport)
void CallBack(NATIVE_CALLBACK callback)
{
// get the time
QueryPerformanceCounter(&g_li);
callback(g_li);
}
Collecting Control Data
Step two was to gather some data to use as the “control” for
the experiment by writing a native client EXE that calls the native DLL and
receives back a value through both a return value and a callback. TO make sure
I had a decent amount of data I ran 10,000 iterations of the test and output
the results to a CSV file so I could plot and analyze the data in Excel.
Again, the code is very simple. I opted to load the library dynamically to
replicate as close as possible what managed code would be doing:
static LARGE_INTEGER li1;
static LARGE_INTEGER li2;
static HANDLE hFile;
static char
line[500];
typedef void
(*NATIVE_CALLBACK)(LARGE_INTEGER timeReceived);
typedef void
(*DIRECTCALL)(LARGE_INTEGER *timeReceived);
typedef void
(*CALLBACKFCN)(NATIVE_CALLBACK callback);
void CallbackProc(LARGE_INTEGER
timeReceived);
int _tmain(int argc, _TCHAR* argv[])
{
HINSTANCE hLib =
LoadLibrary(_T("NativeDLL.dll"));
DIRECTCALL DirectCall
= (DIRECTCALL)GetProcAddress(hLib,
_T("DirectCall"));
CALLBACKFCN CallBack =
(CALLBACKFCN)GetProcAddress(hLib,
_T("CallBack"));
OpenLog();
LARGE_INTEGER
timeReceived;
WriteLog("point#, outDirect, backDirect, outCallback,
backCallback\r\n");
for(int i = 0 ; i
< 10000; i++)
{
QueryPerformanceCounter(&li1);
DirectCall(&timeReceived);
QueryPerformanceCounter(&li2);
sprintf(line, "%d, %d, %d,", i,
timeReceived.LowPart
- li1.LowPart,
li2.LowPart
- timeReceived.LowPart);
WriteLog(line);
QueryPerformanceCounter(&li1);
CallBack(&CallbackProc);
}
CloseLog();
FreeLibrary(hLib);
return 0;
}
void CallbackProc(LARGE_INTEGER
timeReceived)
{
QueryPerformanceCounter(&li2);
sprintf(line, "%d, %d\r\n", timeReceived.LowPart -
li1.LowPart,
li2.LowPart -
timeReceived.LowPart);
WriteLog(line);
}
Collecting Compact Framework Data
The code for collecting the data from the managed
application is obviously the same basic idea as the native application. Each
of the native APIs are called and the managed application calculates the times
and logs them to a file. The app runs this set 10,000 times to get a good
sampling. The meat of the logic is below – the full source is in the download
for this article.
public delegate
void CallbackDelegate(ulong
timeReceived);
public class
Test
{
ulong
start;
ulong
recv;
ulong
stop;
CallbackDelegate
callbackDelegate;
IntPtr
callbackPtr;
public
void Run()
{
int outDirect;
int outCallback;
int backDirect;
int backCallback;
callbackDelegate = new CallbackDelegate(CallbackProc);
callbackPtr = Marshal.GetFunctionPointerForDelegate(callbackDelegate);
OpenLog("\\managedNoGC.csv");
WriteLog("point#, outDirect, backDirect, outCallback,
backCallback\r\n");
for (int i = 0; i
< 10000; i++)
{
RunOne(out outDirect, out
backDirect, out outCallback,
out backCallback);
WriteLog(string.Format("{0},{1},{2},{3},{4}\r\n",
i, outDirect,
backDirect,
outCallback, backCallback));
}
CloseLog();
}
private
void CallbackProc(ulong
timeReceived)
{
QueryPerformanceCounter(out stop);
recv = timeReceived;
}
private
void RunOne(out
int timeOutDirect, out
int timeBackDirect,
out int
timeOutCallback, out int
timeBackCallback)
{
QueryPerformanceCounter(out start);
DirectCall(out recv);
QueryPerformanceCounter(out stop);
timeOutDirect = (int)(recv - start);
timeBackDirect = (int)(stop - recv);
QueryPerformanceCounter(out start);
CallBack(callbackPtr);
timeOutCallback = (int)(recv - start);
timeBackCallback = (int)(stop - recv);
}
}
Since managed code also has the huge variable of the GC, I
also decided to rerun the test with a background thread continually allocating
and throwing away data.
void ThreadProc()
{
while
(running)
{
garbage = new byte[63000];
}
}
Now if you’re paying close attention you’ll notice that our
timing code here is not identical to that in the managed code.
QueryPerformanceCounter itself is a P/Invoke, so we’re inherently adding the
time for the return from that P/Invoke to our results. Eliminating this effect
would take a bit of work that would make this test setup a lot more convoluted,
and we’re trying to keep this simple, but keep it in mind when looking at the
results later.
Results
Once I’d run all the tests, I imported the data files into
an Excel spreadsheet for analysis and plotting. The following plots provide
comparative views of the times for each “segment” of the test (i.e. each of our
three initial questions. Each plot has three data sets – the control (native)
data and then managed data run with and then without a background thread
generating garbage (and therefore garbage collection). The plots show
performance counter ticks (elapsed time) on the Y axis and the sample number of
the X axis.
NOTE
Unless otherwise noted, the results reported in the charts in this
section are in performance counter ticks. The test device used was an iCOP
eBox-2300 with a 200MHz Vortex86 processor that reports a performance counter
frequency of 1,193,180 ticks per second.
Outgoing Direct Call
In Figure 1 we see the times required to make a direct call
into the DirectCall method. We can see that a vast majority of the times for
all three data sets appear to be about the same and very small. We’ll look at
the actual mean values later, but this plot shows some important points:
- The
managed calls occasionally have gigantic outliers. We have a couple calls that
took over 15,000 ticks (over 12 milliseconds on the test device).
- Even
the native code has some jitter (we see a maximum of 1024 ticks, or about 860
microseconds) but the jitter, or outlier variation from the mean, is much less
pronounced than the managed code.
- We
see the same outlier frequency whether or not we have GC occurring.
So from this plot we see that managed code appears to
inherently increase the overall jitter of the system by over an order of
magnitude. When you’re designing a system that must perform a task within a
certain time boundary, you must look at the worst-case scenarios for your
maximum speed, so just on this small sample of raw data we would say that we
could safely run native code over 15 times faster than managed (15955 / 1024).
Of course this is assuming that the P/Invoke call is part of the time-critical
logic.
Figure 1


Figure 2 is shows the same data but with the Y-axis rescaled
to show just the low-end mean data. The interesting point that we see here is
that the average speed to make the call from managed code is about half the
speed of the native call.
Figure 2


Direct Call Return
In Figure 3 we look at the time required to return from a
method. Again, it looks like the mean for the native and the managed tests are
about the same and very small, but we still see some critical points:
- Both
the managed and native code have occasional outliers
- The
managed outliers are an order of magnitude larger than those from the native
code.
Figure 3


Figure 4 again is the same data as Figure 3, but with the
Y-Axis rescaled. You can see that the managed code appears to be slightly
faster on average (it looks like the “blue” area extends about 1 tick below the
green) but all three data sets seem to be about the same.
Figure 4


Returning Through a Callback
Figures 5 and 6 show the results for the callback times and
are the most interesting of the group. Figure 5 is simply the same data as
Figure 3 but with the Y-Axis scaled to show just data with times below 600
ticks.
Right away we see the same trend as the earlier plots – that
the managed code produces outliers much greater in magnitude than the native
code. What’s different here, though, is the frequency of those outliers.
Instread of just a couple over the 10,000 sample data set, we see a dozen very
large outliers (with the GC data set) and still a lot more low-end variation
(all the spikes above the average but below around 3000 ticks).
Figure 5


If we zoom in on the low-end data, Figure 4 show some
interesting points as well.
- The
average for the managed callbacks is substantially higher (turns out to be over
10 times slower as we’ll see later) than native callbacks.
- Managed
callbacks are never as fast as a native callback
- GC
appears to have a larger impact on callbacks than anything else we’ve looked at
Figure 6


Summary Data
Below are tables that summarize the collected data. The
upper pair show statistics for the direct call and return. The lower pair show
the callback (call and return statistics). You can see that the number
solidify the perceptions we got from looking at the plots. Making a P/Invoke
call does indeed take almost twice as long as the comparable native call. And
the return is even 20-30% slower.
|
outDirect
|
backDirect
|
|
Native
|
Managed- No GC
|
Managed - with GC
|
Native
|
Managed- No GC
|
Managed - with GC
|
|
Min
|
16
|
20
|
22
|
13
|
11
|
11
|
|
Max
|
16
|
15761
|
15955
|
12
|
544
|
16117
|
|
Mean
|
17
|
33
|
32
|
11
|
14
|
16
|
|
Median
|
17
|
28
|
28
|
11
|
13
|
13
|
|
Mode
|
17
|
28
|
28
|
13
|
13
|
13
|
|
% of Native
|
100.0%
|
51.5%
|
53.2%
|
100.0%
|
79.9%
|
69.8%
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
outCallback
|
backCallback
|
|
Native
|
Managed- No GC
|
Managed - with GC
|
Native
|
Managed- No GC
|
Managed - with GC
|
|
Min
|
15
|
13
|
13
|
12
|
135
|
137
|
|
Max
|
907
|
15207
|
870
|
506
|
16295
|
16779
|
|
Mean
|
18
|
19
|
16
|
14
|
190
|
271
|
|
Median
|
18
|
15
|
15
|
14
|
165
|
245
|
|
Mode
|
18
|
16
|
16
|
14
|
164
|
243
|
|
% of Native
|
100.0%
|
97.4%
|
116.3%
|
100.0%
|
7.5%
|
5.2%
|
Analyzing the Data
So now we have a whole lot of data. Data is fine, but what
we really need is information. Looking at these results raises more than a
couple questions that we can’t readily answer without access to the CLR
itself. I don’t have access to that kind of information, but I do have
access to people who do, so I sent off some questions to the developer who owns
the marshalling code for the Compact Framework to see if we could get some
answers.
To his credit he did a lot of footwork on this, injecting
stubs into the CLR and looking at outputs from the JIT compiler. He sent me
several excellent explanations for several question complete with assembly
code. As a managed developer (and even as a native developer) looking at
assembly is like looking at a gorgon and we really only need to know the
general answers, so I’ll save you the petrification and summarize in somewhat
less technical terms.
Why is the P/Invoke call half the speed of the native call?
The managed call doesn't directly call the native method.
Instead it calls into a JITted stub method that must perform some overhead
routines such as calls to determine GC Preemption status (to determine if a GC
is pending and we need to wait). It is also possible that some marshalling
code will get JITted into the stub as well. This all takes time. Also keep in
mind (as pointed out back in the “Collecting Compact Framework Data” section
earlier) that the managed test itself is actually timing the return from the
QueryPerformanceCounter (QPC) P/Invoke as well.
Why is the direct return 20-30% slower in managed?
Just like for the outbound call, the return in managed code
also includes a JITted stub that does some GC housekeeping to see if a GC is
pending plus we still have our QPC return time which slows the managed test.
Why are callbacks so crazy expensive in managed code?
The deep answer to this is a complex answer complete with a
lot of assembly code that no reasonable developer really wants to read. What
it boils down to, though, is this:
The IntPtr returned by GetFunctionPointerForDelegate is not
really the pointer for the JITted callback. It's actually a pointer to a stub
that does some additional marshaling. That stub in turn calls several
different functions down in mscoree, so the route is not straight and your call
stack for the callback itself is fairly deep.
These calls check to see if we're marshaling managed types,
packages and loads any arguments into buffers, makes the call and then packages
and marshals any return value. Throughout this process the CLR is also doing
bookkeeping on the thread states, checking and adjusting GC preemption status,
looking for events and exceptions and potentially sending out debugger and/or
profiler information. During all of this locks are getting acquired and
released and memory is getting allocated, so a GC can potentially be triggered
during the process as well.
So a whole lot more is going on that simply jumping
execution to the callback function address as happens in native code.
Why is the directcall in 2x slower than native, but the call in for the
callback is about the same as native?
This one is baffling and without some really deep
investigation with a native profiler it difficult to say exactly what is
happing. We would have expected the results to be the same. There may be some
issues with code alignment, memory controller caching or something else that
isn’t obvious that is affecting this result. For now it remains a mystery but
it’s likely to be an artifact of the environment, not the code itself.
What accounts for the diff between the two managed callbacks?
We see that individually they have a pretty solid mean, yet
they differ by 40%. Why? And what’s with the “steps” in the data? The start
the same – is the “higher” level due to GC activity. Does that make sense that
the “with GC” line would hop up after a short period and stay, since we’re
creating garbage as fast as we can? If that’s so, then why does the “no GC”
hop up to about the same level and then stay there for a period before dropping
back down?
These are all reasonable questions and unfortunately without
a ton of work with both native and managed profilers the only thing we can do
is make educated guesses. My guess is that the difference is that the
additional thread doing lots of allocation is contending for the same memory
allocation locks that the callback "goo" is using, and it's the
contention for these locks that's causing the callback to take longer. We
might have some thread context switching that is being done in one case and not
the other. Still your guess is probably as good as mine, and it’s not likely
that we’ll have a definitive answer soon (if ever), but I don’t think we really
need one. The lesson we’re really interested is that the managed callbacks are
an order of magnitude slower than native callbacks. A small jitter on exactly
how much isn’t terribly important in the overall scheme of things.
Conclusion
So what have we learned here? I’d say we can sum it up in a
few points:
1. Managed
callbacks, while extremely useful in some cases, are an order of magnitude
slower than native callbacks.
It’s probably advisable to not rely heavily on them or purposely architect a
solution to use callbacks if it can be avoided.
2. Managed
P/Invoke calls are slower than native method calls.
While the exact amount can vary, and the difference is small, the fact remains
that a P/Invoke is slower. If you’re doing work where time is critical it’s
important to keep this in mind.
3. Managed
code has a lot of jitter that must be considered in any solution.
The nature of managed code means that we have a garbage collector doing stuff
for us in the background, and when it does it’s work, the performance of other
threads is impacted and impacted severely. This isn’t just true for C# or
VB.NET. Any runtime with a collection mechanism (which includes Java) is going
to have this problem. This means that if you have deterministic constraints in
your solution, they either need to have a lot of room for sway or you need to
be using native code.
I specifically not using the term “real-time” here because
almost all code needs some level of determinism. When a user clicks a button
in your app, if it takes 30 seconds to respond that’s still unacceptable, so
you do have some deterministic requirement. If you’re measuring the real-time
speed of continental drift, a few seconds probably isn’t going to matter much.
However if you’re trying to run a precision machine tool in a factory, a
hundred milliseconds can easily mean a ruined part or broken tool.
The point is that you need to understand your solutions
requirements and understand the limitations of the code you’re writing. Don’t
dismiss managed code out-of-hand because of some general (often incorrect)
perception that it’s slow and likewise don’t architect it into a solution with
tight time requirements just because it’s the environment you know. Both good
and bad code can be written in any language and the difference is usually
knowledge and experience. The hope is that now you have a little more of that
and we can all try not to add to the large pool of bad code out there.