Tuesday, January 26, 2010

Rad Studio IDE Changes System Wide Timer Resolution

This one had me scratching my head for a long time. Apparently, the CodeGear / Embarcadero RAD Studio IDEs (I've tested 2007 and 2010) exhibit this behavior: they call timeBeginPeriod to change the system-wide timer resolution to 1ms at startup, and timeEndPeriod when they quit.

What's the problem then?

For starters, this means that simple code such as:

while (!Terminated)
{
    if (Poll())
        DoSomething();

    Sleep(1);
}

will behave very differently with and without the IDE running. You may end up polling a lot slower than you expected once you deploy your application, causing DoSomething() to not get called in time. The evil is in the fact that while you are developing the application, DoSomething() always gets called as you would have expected. But when you deploy your application (or hand it over to the testers), you'd soon realize something is amiss. Everyone knows Windows is not a real-time OS, so no one would expect Sleep(1) to actually sleep for exactly 1ms. But while developing the application, you had found that it came quite close.

Well, surprise, surprise! Without the IDE running, Sleep(1) would actually wait for 15.625ms by default - that's more than 15 times slower than what you were expecting.

The Sleep function documentation on MSDN really doesn't do a good job of explaining this. In DOS, I would've expected the system tick to default to 15.7ms, but I had expected Windows, starting from Windows 95, to have a default system tick of 1ms. I was wrong (well, not really - see my comment #1).

Regardless, this is a serious problem with all of RAD Studio's IDEs. I am sure hardly anyone knows about it, and at one point or another you would've been bitten by it, even if you didn't realize it - except that your application failed on the day of the demo at your client's site. Just your luck again.

Microsoft's Visual Studio IDEs (tested 6, 2005, 2008) don't do this.

Also, it's never a good idea to change the system tick resolution - from MSDN, "(The timeBeginPeriod) function affects a global Windows setting. Windows uses the lowest value (that is, highest resolution) requested by any process. Setting a higher resolution can improve the accuracy of time-out intervals in wait functions. However, it can also reduce overall system performance, because the thread scheduler switches tasks more often. High resolutions can also prevent the CPU power management system from entering power-saving modes. Setting a higher resolution does not improve the accuracy of the high-resolution performance counter."

Perhaps the OS should simply have fixed it at 1ms. Allowing any process to change the system timer resolution - a global system setting - makes little sense in a multitasking environment.

Wednesday, January 20, 2010

Component / Control with TPropertyEditor in DesignEditors

If you include <designeditor.hpp> and try to use TPropertyEditor in C++ Builder, you'll run into BCC32 errors complaining about multiple declaration for 'IPropertyDescription' and ambiguity between 'IPropertyDescription' and 'Designintf::IPropertyDescription'. This is true for every version post-BCB6, including the latest CB2010.

The namespace ambiguity problem is inherent to C++ Builder, because every HPP file generated from Delphi includes a using-namespace directive in the header file. We all know that's *BAD* now, but it's a decision that dates back to the first version, where even the std namespace was implicitly included. Now that we've found ourselves this deep in the rabbit hole, there's really no easy way out as far as backward compatibility is concerned.

But, for this particular problem, there's a solution.

Before you include DesignEditors.hpp, you should first include PropSys.hpp, like so:

#include <propsys.hpp>
#include <designeditors.hpp>

class PACKAGE TMyComponentEditor: public TPropertyEditor
{
    // ...
};


Perhaps the better way would be for DesignEditors.hpp to include PropSys.hpp at the very top of the file, so anyone who uses DesignEditors.hpp doesn't need to remember including PropSys.hpp explicitly. That one's for Embarcadero to decide.

Upgrading VCL Apps to C++ Builder 2010

If you run into the following error, here's what you need to do.

[ILINK32 Error] Error: Unresolved external 'wWinMain' referenced from C:\PROGRAM FILES\EMBARCADERO\RAD STUDIO\7.0\LIB\C0W32W.OBJ

Open your main cpp file and look for this line,

WINAPI WinMain(HINSTANCE, HINSTANCE, LPSTR, int)

Change it to,

WINAPI wWinMain(HINSTANCE, HINSTANCE, LPWSTR, int)

This has to do with the Unicode support in the new IDE (starting from CB2009).

IDE Regex Replace: char to wchar_t string literals

While upgrading your apps to use wchar_t* instead of char* string literals, you'll find that you need to change a string such as "This is a string" to _T("This is a string"), as well as character literals such as 'c' to _T('c').

Well the good news is there's a quick way of doing this.

The C++ Builder IDE has always had a Regex (regular expression) based search-and-replace function. All you have to do is enable it in the Replace Text dialog, under Options | Regular expressions.

These are the corresponding Regex you'll need.

For string literals,

Text to find: "{(\\"|[^"])*}" (include the double quotes)
Replace with: _T("\0")

For char literals,

Text to find: \'{\\[^']|[^']}\'
Replace with: _T('\0')

* Note: Do not blindly replace all. You may end up replacing text inside a string, such as "I can see 'u' from here", or strings that aren't meant to be converted, such as #include "myfile.h". If anyone has any suggestions on how to handle these cases, I'd appreciate it (note that the IDE regex replacer does not support backreferences).

The reason you'd want to use the _T(x) macro is that it's faster when you assign to UnicodeString (which String is typedef'd to). The _T(x) macro maps to L##x - i.e. _T("text") == L"text". The String and _T(x) pair also stays compatible when moving between a compiler that supports Unicode and one that doesn't: String maps to UnicodeString in the former and AnsiString in the latter, while _T(x) correspondingly maps to L"string" and plain "string".

String fromAnsi = "text";

This calls _UStrFromPChar, which ends up calling MultiByteToWideChar - the Windows API that converts Ansi strings to Unicode. As fast as that may be, it's bound to be slower than a straight memory copy.

String fromUnicode = L"text";

All else being equal (allocating memory and finding the string length), this is much faster, as it's basically just a straight memory copy.

Friday, January 15, 2010

FastMM - Slow in multithreaded apps on multicore CPUs

There's something wrong with FastMM4 (the default memory manager of Delphi / C++ Builder since BDS2006) on multicore systems, especially when running multithreaded apps in a GC/managed environment. The result is that when multiple cores are enabled, performance suffers by up to 5 fold. So not only does FastMM fail to scale - your multithreaded apps will run tremendously slower on a multicore system, up to 5 times slower on a dual-core machine than on a single-core one at the same clock speed and architecture.

That's a five-fold slowdown going from single-core to dual-core! And comparing dual-core performance, TBBMM is 9 times faster than FastMM4!

This test is meant to show just that. Download Test (updated 27/01/2010) (see readme.txt for instructions) *** WARNING: Incompatible with x64 OS due to an OS bug.

It runs through a variety of algorithms in multiple threads (in a threadpool of the framework, similar to .NET's ThreadPool) consisting of a mix of GC list, GC dictionary, and GC string unit-tests.

Keep in mind that this is an app written using a GC framework, which means allocations usually happen in multiple threads concurrently while de-allocations are done in specialized garbage-collector threads. This may be why FastMM breaks down (though a general-purpose memory manager shouldn't break down under any usage pattern).

Notice that when you run the FastMM test with CPU affinity set to just one CPU, you'll end up with nearly the same performance as TBBMM. Once you enable multiple cores, though, you immediately lose that performance again, running slower than with just one core.

Note: You'll find that the FastMM BorlndMM.dll is different from the default RAD Studio 2010 one. This is due to the changes added to support the GC framework, but at its heart it's simply making calls to GetMemory, ReallocMemory and FreeMemory (as opposed to the WinMM version's HeapAlloc, HeapRealloc and HeapFree respectively, all else being equal). The WinMM version is initialized with the LFH (low-fragmentation heap) flag.

Here are some results from my own tests:


Test results in ops/second (10sec average), listed in the following order:
1) TBBMM
(what is TBBMM?)
2) WinMM
3) FastMM


Core2Duo E6550 2.33GHz (Conroe) - XP SP3
Both cores enabled
1) 1785
2) 1230
3) 250

Single core (via CPU affinity mask)
1) 930
2) 650
3) 950


Core2Duo E6550 throttled to 1.33GHz - XP SP3
Both cores enabled
1) 730
2) 520
3) 180

Single core (via CPU affinity mask)
1) 410
2) 275
3) 395


Pentium M 1.2GHz (Banias) - XP SP3
CPU is Single core
1) 395
2) 340
3) 395

Core2Duo E7200 3.6GHz (Wolfdale) - Vista
Both cores enabled
1) 2595
2) 2080
3) 290

Single core (via CPU affinity mask)
1) 1450
2) 1180
3) 1405

As you can see, the results are quite consistent. On a dual core machine, the performance of FastMM is terrible. From 2.33GHz to 3.6GHz, there's virtually no increase at all in speed! In fact, when the test was running, the CPU wasn't even fully utilized (with more than 50% of CPU spent in kernel time), whereas the other memory managers had the CPU pegged at 100% and nearly no kernel time.

If you wish to try it out on your system, download this GC speed tester (updated 27/01/2010) and unzip it to a folder of your choice. Then, run "Run All Tests.bat" and follow the on-screen instructions. Note that the GC Speed Test app will run indefinitely, so once you take note of the speed (ops/sec), you can quit the app to move on to the next test.

I'd appreciate it if you could post your results here in the comments in the same format as the ones above - i.e. CPU make (I'd love to see how AMD CPUs fare) and model number as well as the frequency, OS / service pack, and the results.

My advice? For an all-round memory manager, use the Windows default one. It may be a little slower than FastMM on a single core, but it certainly scales very well on multicore systems. Alternatively, the Intel TBB allocator has near-perfect scaling and is one of the fastest memory managers around; the only catch is that it consumes more RAM.

Regardless, I'd stay away from FastMM4 (and thus the default memory manager of Delphi / C++ Builder).

Thursday, January 14, 2010

C++ Builder 2010 Optimizing C++ Compiler

I'm pleasantly surprised after giving C++ Builder 2010 a quick spin. It's much better at optimizing code than its predecessor CB2007 (I skipped CB2009 altogether as it was and still is completely broken).

CB2010 vs CB2007:

AnsiString test:
6938ms vs 6765ms

GcString test:
420ms vs 1734ms
(yes, that's 420ms, it's not a typo)

In the AnsiString test, things got just a bit slower (about 2% - nothing to worry about). But the big surprise here is my GcString test, which is now over 4 times faster!

Code for the test above (executed on Core2Duo 2.33GHz with TBBMM):

void __fastcall RunTest()
{
    const int TEST_COUNT = 10;
    const int TEST_SIZE  = 10000;
    const int LOOP_COUNT = 1000;

    {
        // RefCounted String Test
        AnsiString strings[TEST_SIZE];
        for (int i = 0; i < TEST_SIZE; i++)
            strings[i] = "test";

        DWORD start = GetTickCount();
        for (int x = 0; x < LOOP_COUNT; x++)
        {
            AnsiString temp;
            for (int j = 0; j < TEST_COUNT; j++)
                for (int i = 0; i < TEST_SIZE/2; i++)
                {
                    temp = strings[i];
                    strings[i] = strings[TEST_SIZE-1-i];
                    strings[TEST_SIZE-1-i] = temp;
                }
        }
        ShowMessage(IntToStr((int)GetTickCount() - (int)start));
    }

    {
        // GcString Test
        GcString strings[TEST_SIZE];
        for (int i = 0; i < TEST_SIZE; i++)
            strings[i] = "test";

        DWORD start = GetTickCount();
        for (int x = 0; x < LOOP_COUNT; x++)
        {
            GcString temp;
            for (int j = 0; j < TEST_COUNT; j++)
                for (int i = 0; i < TEST_SIZE/2; i++)
                {
                    temp = strings[i];
                    strings[i] = strings[TEST_SIZE-1-i];
                    strings[TEST_SIZE-1-i] = temp;
                }
        }
        ShowMessage(IntToStr((int)GetTickCount() - (int)start));
    }
}


As you may have noticed, it is simply an array reversal test. And yes, the GcString version was 4 times faster even in CB2007. In CB2010, GcString is now a staggering 16.5 times faster than AnsiString!

Internal Compiler Error (ICE) in BCC32 of C++ Builder 2010

An excellent write-up, 'What is an Internal Compiler Error?' by David Dean (an Embarcadero C++ QA engineer), is a must-read if you do not know what an ICE is, apart from it giving you error messages such as this: "[BCC32 Fatal Error] FileA.cpp(56): F1004 Internal compiler error at 0x59650a1 with base 0x5900000".

CB2010 seems to be more prone to encountering ICEs, for reasons beyond my understanding. However, after a lot of struggle and time spent getting my projects to compile, I've found a few settings that are vital for avoiding them.

The first thing I'd do is disable smart cached precompiled headers (command line: -Hs-). I've found that this option, combined with Debugging | Expand inline functions and/or Optimizations | Expand common intrinsic functions (implicit via Generate fastest possible code), is the root of all evil. Disabling the former allows the latter two to remain enabled, thus taking advantage of the new optimizations featured in BCC32 v6.21 of CB2010. In fact, I've made this configuration the default for all my projects. If you still get an ICE, start disabling the other two as well. Even if you get your code to compile after disabling either or both of them, you should still submit a QC entry (a bug report). To do this, follow the instructions on David Dean's page about ICEs, linked above.

Thursday, January 7, 2010

ATI DXVA with Arcsoft - Still Behind nVidia with Anything

Happy New Year 2010 to all my readers!

First post of the year. And it's bad news for ATI, yet again.

*note: this post is a follow-up to my post here.

With the recent additions to the popular open-source H.264 encoder, x264, streams encoded at ref 16 with b-pyramid normal and weighted-p 2 (the default) will exhibit bad artifacts when played back on ATI cards with the Arcsoft decoder - much like using MPC-HC's internal DXVA decoder on ref 16 encodes. Meanwhile, it's all well and dandy over at the green camp: nVidia owners will find that their cards decode these streams without the slightest artifact, with just about any DXVA decoder you can get your hands on - even Win7's Microsoft DTV-DVD decoder. If you intend to encode with b-pyramid normal + weighted-p 2, you should reduce ref to 12 (I haven't tried anything in between) to ensure artifact-free DXVA playback with ATI+Arcsoft.

So there you go: even with the best combo, ATI still loses out to nVidia. So again, my advice is to sell your ATI cards and stick with nVidia.

Let's see if this is the year ATI catches up (My bet is on NO. Perhaps never. Frankly, ATI doesn't care about the HTPC scene).