Compact Framework

Performance Implications of Crossing the P/Invoke Boundary


January 2008

Christopher Tacke

OpenNETCF Consulting, LLC

 

Download the Source
  Download the PDF

Introduction

A few months ago a colleague of mine and I were invited to visit a potential customer to give them our thoughts and a expertise in the design of a new control system. The client makes machines that folds and fills cartons – think of cereal boxes and the like – at amazingly high speeds.

They had been using a proprietary system that they has designed with no OS and they were reaching the limits of just how fast they could run and how many sensors and the like that they could put on the Sercos bus so they figured they could leverage Windows CE (readily available for the controllers they already had) to bring out a new generation of machines with better throughput, faster reset times and all of the sorts of things their customers would love.

After a full day of walking through wonderful UML diagrams that detailed every last bit of how the process for their new system would work a few things were very clear.

1.       They definitely knew how to build boxes. Seriously – it’s far more complex than I imagined, and if I ever need a machine that folds cartons, these guys would have my business.

2.       They spent a large amount of time thinking through a whole lot of the details of how this new system architecture would work. And when I say a large amount of time, I mean probably more time than I’ve spent developing entire products. They had done serious due diligence here.

3.       They had no idea how managed code worked and a large majority of what they had done was going to be unusable.

True story. They had designed a large, very flexible, pluggable, extensible, deeply-abstracted architecture for a machine that should be able to fold several hundred or more cartons a minute and it had all been designed around a managed code foundation. These machines have motors, servos, actuators, conveyors and all sorts of things that have to run in a very finely tuned and orchestrated dance at a very high speed, and if any one item is in the wrong place at the wrong time then the best case is that something gets jammed or broken and worst case is a machine operator gets seriously injured.

The problem with their design was rooted in two fundamental issues; the first being that managed code is not deterministic. That means that when any action happens (like a sensor sees a moving box) then the time for a reaction to occur (an arm moves to make a fold) is unbounded. If you’re trying to make boxes that are all the same size and shape at any reasonable rate of speed, you can imagine that having a fold in the same spot every time might be fairly important.

Often you might hear the term “real-time” but the problem with that term is the definition depends on the system you’re looking at. A system is real-time if the maximum latency (or jitter) of a reaction is less than what you need for your system to operate. If you’re calculating trajectories for an air-to-air missile your definition of real-time is a whole lot different than if you’re trying to measure the movement of a tectonic plate in “real-time.”

At any rate, we know that any environment with a garbage collector, whether it’s C#, VB.NET running on Windows or Java running on Linux, is not deterministic by its very nature, so there’s no point in beating on this point too heavily. If you want to know about the deterministic behaviors of either Windows CE or managed code, there are good resources on the web to get you well acquainted (and you’ll be seeing more from us in the coming months as well).

The other problem with their design using managed code, however, is not so obvious (unless of course you read the title of this paper). If C# were somehow made deterministic, would it then be a good choice for them? We know that a large amount of what they need to do is going to require Platform Invokes (or P/Invokes) and I instinctively felt that there was going to be a problem with performance because of that. Of course I had no data to back that up and I knew of no specific papers or articles that specifically looked at what the performance impacts of P/Invoke calls were, but I knew in my gut that they were going to be a problem, so we went back to the hotel, had dinner and a couple drinks, and I proceeded to put together some test code to prove out what I knew deep-down just had to be true.

Test Theory

My original test code was far more complex that what we’ll be looking at for this article. In it I tried to do more realistic simulations of what they would be doing in practice, including a lot of thread interactions and driver calls. For the most part a lot of those tests would just add complexity and confusion to these tests so for this article I distilled the problem down to just a few test points:

  1. How long does it take for a call originating in managed code to arrive in the called native library?
  2. How long does it take for the call to return back to the managed caller?
  3. Does a callback perform any better than a returned value or parameter?

 

To measure the first I decided that I would get the current value of the performance counter right before the P/Invoke call and then in the native library I would grab the value again as soon as the method was entered. To make the second and third tests easier, I would simply return the performance counter value collected in the native method back to the managed caller either directly as a return value or through a callback. The managed caller could then do some basic math and arrive at answers for all three questions.

Of course the raw answers themselves are pretty useless. If you want better performance, all you need is a faster processor. To really make any use of these numbers, we need to do the same test in pure native code doing the exact same steps and then do a comparison of the native and managed results. What is important is not the hard numbers, but the ratios or percentages of a managed test to an analogous native test.

The Native Callee

Step one was to create the native library that would get called by both the native and the managed clients. Using the same library for both tests eliminates it as a source of variability, and in this case is was really simple. I simply exported two functions – one that would return the time it received a call through an output parameter and the other through a callback. The code is really simple:

 

 

typedef void (*NATIVE_CALLBACK)(LARGE_INTEGER timeReceived);

 

static LARGE_INTEGER g_li;

 

extern "C"

__declspec(dllexport) void DirectCall(LARGE_INTEGER *timeReceived)

{

// get the time

QueryPerformanceCounter(timeReceived);

}

 

extern "C"

__declspec(dllexport) void CallBack(NATIVE_CALLBACK callback)

{

// get the time

QueryPerformanceCounter(&g_li);

callback(g_li);

}

 


 

Collecting Control Data

Step two was to gather some data to use as the “control” for the experiment by writing a native client EXE that calls the native DLL and receives back a value through both a return value and a callback. TO make sure I had a decent amount of data I ran 10,000 iterations of the test and output the results to a CSV file so I could plot and analyze the data in Excel. Again, the code is very simple. I opted to load the library dynamically to replicate as close as possible what managed code would be doing:

 

 

static LARGE_INTEGER li1;

static LARGE_INTEGER li2;

static HANDLE hFile;

static char line[500];

 

typedef void (*NATIVE_CALLBACK)(LARGE_INTEGER timeReceived);

typedef void (*DIRECTCALL)(LARGE_INTEGER *timeReceived);

typedef void (*CALLBACKFCN)(NATIVE_CALLBACK callback);

 

void CallbackProc(LARGE_INTEGER timeReceived);

 

int _tmain(int argc, _TCHAR* argv[])

{

HINSTANCE hLib = LoadLibrary(_T("NativeDLL.dll"));

DIRECTCALL DirectCall = (DIRECTCALL)GetProcAddress(hLib,

_T("DirectCall"));

CALLBACKFCN CallBack = (CALLBACKFCN)GetProcAddress(hLib,

_T("CallBack"));

OpenLog();

 

LARGE_INTEGER timeReceived;

 

WriteLog("point#, outDirect, backDirect, outCallback, backCallback\r\n");

 

for(int i = 0 ; i < 10000; i++)

{

QueryPerformanceCounter(&li1);

DirectCall(&timeReceived);

QueryPerformanceCounter(&li2);

 

sprintf(line, "%d, %d, %d,", i,

timeReceived.LowPart - li1.LowPart,

li2.LowPart - timeReceived.LowPart);

WriteLog(line);

 

QueryPerformanceCounter(&li1);

CallBack(&CallbackProc);

}

 

CloseLog();

FreeLibrary(hLib);

 

return 0;

}

 

void CallbackProc(LARGE_INTEGER timeReceived)

{

QueryPerformanceCounter(&li2);

 

sprintf(line, "%d, %d\r\n", timeReceived.LowPart - li1.LowPart,

li2.LowPart - timeReceived.LowPart);

WriteLog(line);

}

 

Collecting Compact Framework Data

The code for collecting the data from the managed application is obviously the same basic idea as the native application. Each of the native APIs are called and the managed application calculates the times and logs them to a file. The app runs this set 10,000 times to get a good sampling. The meat of the logic is below – the full source is in the download for this article.

 

 

public delegate void CallbackDelegate(ulong timeReceived);

 

public class Test

{

ulong start;

ulong recv;

ulong stop;

 

CallbackDelegate callbackDelegate;

IntPtr callbackPtr;

 

public void Run()

{

int outDirect;

int outCallback;

int backDirect;

int backCallback;

 

callbackDelegate = new CallbackDelegate(CallbackProc);

callbackPtr = Marshal.GetFunctionPointerForDelegate(callbackDelegate);

 

OpenLog("\\managedNoGC.csv");

WriteLog("point#, outDirect, backDirect, outCallback, backCallback\r\n");

 

for (int i = 0; i < 10000; i++)

{

RunOne(out outDirect, out backDirect, out outCallback,

out backCallback);

WriteLog(string.Format("{0},{1},{2},{3},{4}\r\n", i, outDirect,

backDirect, outCallback, backCallback));

}

CloseLog();

}

 

private void CallbackProc(ulong timeReceived)

{

QueryPerformanceCounter(out stop);

recv = timeReceived;

}

 

private void RunOne(out int timeOutDirect, out int timeBackDirect,

out int timeOutCallback, out int timeBackCallback)

{

QueryPerformanceCounter(out start);

DirectCall(out recv);

QueryPerformanceCounter(out stop);

 

timeOutDirect = (int)(recv - start);

timeBackDirect = (int)(stop - recv);

 

QueryPerformanceCounter(out start);

CallBack(callbackPtr);

 

timeOutCallback = (int)(recv - start);

timeBackCallback = (int)(stop - recv);

}

}

 

 

Since managed code also has the huge variable of the GC, I also decided to rerun the test with a background thread continually allocating and throwing away data.

 

 

void ThreadProc()

{

while (running)

{

garbage = new byte[63000];

}

}

 

 

Now if you’re paying close attention you’ll notice that our timing code here is not identical to that in the managed code. QueryPerformanceCounter itself is a P/Invoke, so we’re inherently adding the time for the return from that P/Invoke to our results. Eliminating this effect would take a bit of work that would make this test setup a lot more convoluted, and we’re trying to keep this simple, but keep it in mind when looking at the results later.

Results

Once I’d run all the tests, I imported the data files into an Excel spreadsheet for analysis and plotting. The following plots provide comparative views of the times for each “segment” of the test (i.e. each of our three initial questions. Each plot has three data sets – the control (native) data and then managed data run with and then without a background thread generating garbage (and therefore garbage collection). The plots show performance counter ticks (elapsed time) on the Y axis and the sample number of the X axis.

 

NOTE

Unless otherwise noted, the results reported in the charts in this section are in performance counter ticks. The test device used was an iCOP eBox-2300 with a 200MHz Vortex86 processor that reports a performance counter frequency of 1,193,180 ticks per second.

Outgoing Direct Call

In Figure 1 we see the times required to make a direct call into the DirectCall method. We can see that a vast majority of the times for all three data sets appear to be about the same and very small. We’ll look at the actual mean values later, but this plot shows some important points:

  1. The managed calls occasionally have gigantic outliers. We have a couple calls that took over 15,000 ticks (over 12 milliseconds on the test device).
  2. Even the native code has some jitter (we see a maximum of 1024 ticks, or about 860 microseconds) but the jitter, or outlier variation from the mean, is much less pronounced than the managed code.
  3. We see the same outlier frequency whether or not we have GC occurring.

 

So from this plot we see that managed code appears to inherently increase the overall jitter of the system by over an order of magnitude. When you’re designing a system that must perform a task within a certain time boundary, you must look at the worst-case scenarios for your maximum speed, so just on this small sample of raw data we would say that we could safely run native code over 15 times faster than managed (15955 / 1024). Of course this is assuming that the P/Invoke call is part of the time-critical logic.

Figure 1


 

Figure 2 is shows the same data but with the Y-axis rescaled to show just the low-end mean data. The interesting point that we see here is that the average speed to make the call from managed code is about half the speed of the native call.

Figure 2


 

Direct Call Return

In Figure 3 we look at the time required to return from a method. Again, it looks like the mean for the native and the managed tests are about the same and very small, but we still see some critical points:

  1. Both the managed and native code have occasional outliers
  2. The managed outliers are an order of magnitude larger than those from the native code.

 

Figure 3


 

Figure 4 again is the same data as Figure 3, but with the Y-Axis rescaled. You can see that the managed code appears to be slightly faster on average (it looks like the “blue” area extends about 1 tick below the green) but all three data sets seem to be about the same.

Figure 4

 


 

Returning Through a Callback

Figures 5 and 6 show the results for the callback times and are the most interesting of the group. Figure 5 is simply the same data as Figure 3 but with the Y-Axis scaled to show just data with times below 600 ticks.

Right away we see the same trend as the earlier plots – that the managed code produces outliers much greater in magnitude than the native code. What’s different here, though, is the frequency of those outliers. Instread of just a couple over the 10,000 sample data set, we see a dozen very large outliers (with the GC data set) and still a lot more low-end variation (all the spikes above the average but below around 3000 ticks).

Figure 5


 

If we zoom in on the low-end data, Figure 4 show some interesting points as well.

  1. The average for the managed callbacks is substantially higher (turns out to be over 10 times slower as we’ll see later) than native callbacks.
  2. Managed callbacks are never as fast as a native callback
  3. GC appears to have a larger impact on callbacks than anything else we’ve looked at

 

Figure 6


 

Summary Data

Below are tables that summarize the collected data. The upper pair show statistics for the direct call and return. The lower pair show the callback (call and return statistics). You can see that the number solidify the perceptions we got from looking at the plots. Making a P/Invoke call does indeed take almost twice as long as the comparable native call. And the return is even 20-30% slower.


outDirect

backDirect


Native

Managed- No GC

Managed - with GC

Native

Managed- No GC

Managed - with GC

Min

16

20

22

13

11

11

Max

16

15761

15955

12

544

16117

Mean

17

33

32

11

14

16

Median

17

28

28

11

13

13

Mode

17

28

28

13

13

13

% of Native

100.0%

51.5%

53.2%

100.0%

79.9%

69.8%
















outCallback

backCallback


Native

Managed- No GC

Managed - with GC

Native

Managed- No GC

Managed - with GC

Min

15

13

13

12

135

137

Max

907

15207

870

506

16295

16779

Mean

18

19

16

14

190

271

Median

18

15

15

14

165

245

Mode

18

16

16

14

164

243

% of Native

100.0%

97.4%

116.3%

100.0%

7.5%

5.2%

Analyzing the Data

So now we have a whole lot of data. Data is fine, but what we really need is information. Looking at these results raises more than a couple questions that we can’t readily answer without access to the CLR itself. I don’t have access to that kind of information, but I do have access to people who do, so I sent off some questions to the developer who owns the marshalling code for the Compact Framework to see if we could get some answers.

To his credit he did a lot of footwork on this, injecting stubs into the CLR and looking at outputs from the JIT compiler. He sent me several excellent explanations for several question complete with assembly code. As a managed developer (and even as a native developer) looking at assembly is like looking at a gorgon and we really only need to know the general answers, so I’ll save you the petrification and summarize in somewhat less technical terms.

Why is the P/Invoke call half the speed of the native call?

The managed call doesn't directly call the native method. Instead it calls into a JITted stub method that must perform some overhead routines such as calls to determine GC Preemption status (to determine if a GC is pending and we need to wait). It is also possible that some marshalling code will get JITted into the stub as well. This all takes time. Also keep in mind (as pointed out back in the “Collecting Compact Framework Data” section earlier) that the managed test itself is actually timing the return from the QueryPerformanceCounter (QPC) P/Invoke as well.

Why is the direct return 20-30% slower in managed?

Just like for the outbound call, the return in managed code also includes a JITted stub that does some GC housekeeping to see if a GC is pending plus we still have our QPC return time which slows the managed test.

Why are callbacks so crazy expensive in managed code?

The deep answer to this is a complex answer complete with a lot of assembly code that no reasonable developer really wants to read. What it boils down to, though, is this:

The IntPtr returned by GetFunctionPointerForDelegate is not really the pointer for the JITted callback. It's actually a pointer to a stub that does some additional marshaling. That stub in turn calls several different functions down in mscoree, so the route is not straight and your call stack for the callback itself is fairly deep.

These calls check to see if we're marshaling managed types, packages and loads any arguments into buffers, makes the call and then packages and marshals any return value. Throughout this process the CLR is also doing bookkeeping on the thread states, checking and adjusting GC preemption status, looking for events and exceptions and potentially sending out debugger and/or profiler information. During all of this locks are getting acquired and released and memory is getting allocated, so a GC can potentially be triggered during the process as well.

So a whole lot more is going on that simply jumping execution to the callback function address as happens in native code.

Why is the directcall in 2x slower than native, but the call in for the callback is about the same as native?

This one is baffling and without some really deep investigation with a native profiler it difficult to say exactly what is happing. We would have expected the results to be the same. There may be some issues with code alignment, memory controller caching or something else that isn’t obvious that is affecting this result. For now it remains a mystery but it’s likely to be an artifact of the environment, not the code itself.

What accounts for the diff between the two managed callbacks?

We see that individually they have a pretty solid mean, yet they differ by 40%. Why? And what’s with the “steps” in the data? The start the same – is the “higher” level due to GC activity. Does that make sense that the “with GC” line would hop up after a short period and stay, since we’re creating garbage as fast as we can? If that’s so, then why does the “no GC” hop up to about the same level and then stay there for a period before dropping back down?

These are all reasonable questions and unfortunately without a ton of work with both native and managed profilers the only thing we can do is make educated guesses. My guess is that the difference is that the additional thread doing lots of allocation is contending for the same memory allocation locks that the callback "goo" is using, and it's the contention for these locks that's causing the callback to take longer. We might have some thread context switching that is being done in one case and not the other. Still your guess is probably as good as mine, and it’s not likely that we’ll have a definitive answer soon (if ever), but I don’t think we really need one. The lesson we’re really interested is that the managed callbacks are an order of magnitude slower than native callbacks. A small jitter on exactly how much isn’t terribly important in the overall scheme of things.

Conclusion

So what have we learned here? I’d say we can sum it up in a few points:

1.       Managed callbacks, while extremely useful in some cases, are an order of magnitude slower than native callbacks.
It’s probably advisable to not rely heavily on them or purposely architect a solution to use callbacks if it can be avoided.

2.       Managed P/Invoke calls are slower than native method calls.
While the exact amount can vary, and the difference is small, the fact remains that a P/Invoke is slower. If you’re doing work where time is critical it’s important to keep this in mind.

3.       Managed code has a lot of jitter that must be considered in any solution.
The nature of managed code means that we have a garbage collector doing stuff for us in the background, and when it does it’s work, the performance of other threads is impacted and impacted severely. This isn’t just true for C# or VB.NET. Any runtime with a collection mechanism (which includes Java) is going to have this problem. This means that if you have deterministic constraints in your solution, they either need to have a lot of room for sway or you need to be using native code.

I specifically not using the term “real-time” here because almost all code needs some level of determinism. When a user clicks a button in your app, if it takes 30 seconds to respond that’s still unacceptable, so you do have some deterministic requirement. If you’re measuring the real-time speed of continental drift, a few seconds probably isn’t going to matter much. However if you’re trying to run a precision machine tool in a factory, a hundred milliseconds can easily mean a ruined part or broken tool.

The point is that you need to understand your solutions requirements and understand the limitations of the code you’re writing. Don’t dismiss managed code out-of-hand because of some general (often incorrect) perception that it’s slow and likewise don’t architect it into a solution with tight time requirements just because it’s the environment you know. Both good and bad code can be written in any language and the difference is usually knowledge and experience. The hope is that now you have a little more of that and we can all try not to add to the large pool of bad code out there.

 

Comments

 

Compact Framework said:

May 2008 Chris Tacke OpenNETCF Consulting, LLC Download the Source Download the PDF Introduction I recently

May 19, 2008 10:53 AM
 

Can .Net CF do real time I/O and comm? | keyongtech said:

Pingback from  Can .Net CF do real time I/O and comm? | keyongtech

January 18, 2009 11:41 AM
 

donyxxx said:

It is not correct to say that Java is not deterministic just like .NET CF. The low-pause CMS concurrent collector in SUN's Java 1.4.2 onwards and G1 in Java 1.7 (to be released) are more deterministic for soft real time systems. Similarly, Oracle-BEA's JVM JRockit has special deterministic mode, that guarantees the upper limit on GC pauses.

March 20, 2009 12:16 AM