I think one of the biggest problems in C++ today is that there is no perfect library for converting between strings across the world's locales. Displaying or converting certain characters is a problem in many languages, and there are further conversion problems between UTF-8, UTF-16, and UTF-32 strings. There are solutions, but I find them neither as modern nor as simple as today's programming world expects. The UnicodeString is one of the most powerful string formats in use today. A few weeks ago, I saw that the char* arrays of a struct had some conversion problems when I read them from a file and displayed them in a TMemo box. In this post, I want to give you an example of how you can convert this kind of char* array to a UnicodeString correctly.
What is a string? a basic_string? and UnicodeString?
The basic_string (std::basic_string and std::pmr::basic_string) is a class template that stores and manipulates sequences of character-like objects (char, wchar_t, …). It is the template behind the string, wstring, u8string, u16string, and u32string data types.
The UnicodeString string type is the default String type of RAD Studio, C++ Builder, and Delphi. It is in UTF-16 format, which means a character may take 2 or 4 bytes. In C++ Builder and Delphi, the Char and PChar types are now WideChar and PWideChar, respectively. There is a good article about Unicode in RadStudio. And here is a good post about Basic String and Unicode String.
How to convert char array string to UnicodeString correctly in C++?
Assume that we have a struct with some char arrays, such as an example header as below,
    struct Twavheader
    {
        char format[4];
        char fmt[3];
        char c;
    } wav = { 'W', 'A', 'V', 'E', 'f', 'm', 't', '\0' };
When we read this from a file, we should obtain the char array members, such as chunk_ID, format, etc., correctly. These fields have no nul terminator of their own, so when we obtain and display them we may get wrong output. To avoid this, there are 3 different solutions.
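The difference between reading a fixed-width field with and without an explicit length can be sketched in standard C++, without any C++ Builder types. This is only an illustration: the 8-byte buffer `raw` and the helper names `read_bounded` and `read_until_nul` are hypothetical stand-ins for the `format`, `fmt`, and `c` members above, laid out in a single array so that reading past the first four bytes stays well defined.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical stand-in for the header bytes: "WAVE", then "fmt", then a nul.
// Only the last byte is a terminator; the 4-byte format field has none.
const char raw[8] = { 'W', 'A', 'V', 'E', 'f', 'm', 't', '\0' };

// Reads exactly `len` bytes starting at `field`; no terminator is needed.
std::string read_bounded(const char* field, std::size_t len) {
    assert(field != nullptr);
    return std::string(field, len);
}

// Reads until the first nul byte, wherever that happens to be.
std::string read_until_nul(const char* field) {
    assert(field != nullptr);
    return std::string(field);
}
```

With this layout, `read_bounded(raw, 4)` gives `WAVE`, while `read_until_nul(raw)` runs on into the neighbouring bytes and gives `WAVEfmt`.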
1. How to convert char string array to UnicodeString in a single line in C++?
In C++ Builder, the UnicodeString type is really awesome in every way when you want to work with strings. We can convert a char array to a UnicodeString in a single line in 3 different ways.
First, we can use this constructor to convert a char array in a struct, or a char* array. (Thank you Remy Lebeau, Embarcadero MVP)
    UnicodeString(const char* src, int len);
We can use this as below,
    UnicodeString ustr = UnicodeString( wav.format, 4 );
or we can define it as below,
    UnicodeString ustr( wav.format, 4 );
then we can display it in our Memo component as below,
    Memo1->Lines->Add("File Format:" + ustr);
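Outside C++ Builder, the same idea (take exactly N narrow bytes and produce a UTF-16 string) can be approximated in standard C++. The sketch below is not how `UnicodeString` is implemented. It assumes the field holds only 7-bit ASCII, and `widen_ascii` is a hypothetical helper name.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical helper: widens exactly `len` ASCII bytes into UTF-16 code units.
// Assumes every byte is 7-bit ASCII; other encodings need a real conversion.
std::u16string widen_ascii(const char* src, std::size_t len) {
    assert(src != nullptr);
    std::u16string out;
    out.reserve(len);
    for (std::size_t i = 0; i < len; ++i) {
        const unsigned char byte = static_cast<unsigned char>(src[i]);
        assert(byte < 0x80);  // outside ASCII the one-to-one mapping no longer holds
        out.push_back(static_cast<char16_t>(byte));
    }
    return out;
}
```

Because the length is passed explicitly, the helper never looks for a terminator, which is the same property that makes the two-argument constructor safe for fixed-width fields.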
2. How can we use printf method of UnicodeString to convert char string array in a single line in C++?
Second, we can use the printf() method of UnicodeString. Here we should use the “%.4hs” format specifier as below,
    UnicodeString ustr;
    ustr.printf(L"%.4hs", wav.format);
Here, the .printf() method of System::UnicodeString takes a wide format string, and we are passing wav.format, which is a narrow string. When we use a wide printf with narrow inputs, we should use the “%hs” format specifier. The h tells printf that we are passing narrow data in a context that expects wide. Likewise, we would use %ls when sending wide data to a version of printf that expects %s to mean narrow. The .4 precision limits the read to the 4 bytes of the field. (Thank you Bruneau Babet, Embarcadero Developer)
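The precision is what keeps `printf` from reading past the field, and that part behaves the same way in standard narrow `printf`. Here is a minimal sketch in portable C++, so it uses `%.*s` with `std::snprintf` rather than the `%hs` form, and `format_field` is a hypothetical helper name:

```cpp
#include <cassert>
#include <cstdio>
#include <string>

// Hypothetical helper: formats at most `width` bytes of a field that may lack
// a terminator. "%.*s" takes the precision from an int argument, so printf
// stops after `width` bytes even when no nul byte follows.
std::string format_field(const char* field, int width) {
    assert(field != nullptr && width >= 0 && width < 64);
    char buf[64];
    std::snprintf(buf, sizeof buf, "%.*s", width, field);
    return std::string(buf);
}
```

Passing a width larger than the actual field would read out of bounds, so the caller is responsible for supplying the real field size.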
3. How to convert char string array to UnicodeString using std::string in C++?
Third, if you want to do this using std::string, you can write it as below,
    UnicodeString ustr = String( std::string(wav.format, 4).c_str() );
You can do the same thing step by step. First, convert the format member to a std::string as below, noting that it is 4 bytes long,
    std::string str = std::string(wav.format, 4);
Now you can get a nul-terminated const char* from it as below,
    const char* c_str = str.c_str();
Finally, you can safely convert this char* to a UnicodeString as below,
    UnicodeString ustr = String( c_str );
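Some fixed-width fields are padded with nul bytes rather than filled completely. A hedged variation on the `std::string` approach is to stop at the first nul, if there is one, but never read past the field width. `field_to_string` is a hypothetical helper, not part of any library:

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical helper: copies a fixed-width field, stopping at the first nul
// if there is one, and never reading beyond `width` bytes.
std::string field_to_string(const char* field, std::size_t width) {
    assert(field != nullptr);
    const char* end = std::find(field, field + width, '\0');
    return std::string(field, end);
}
```

The returned `std::string` owns a nul-terminated copy, so its `c_str()` can then be handed to any API that expects a terminated string.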
Why are the other methods not correct?
Assume that we read wave file info into a struct, and we try to display some of the members of this struct in a Memo component. Let's do this in 7 different ways. The compiler will compile all of the lines below without errors, but the outputs will be different.
    UnicodeString str;
    str.printf(L"a-File Format: %.4s", wav.format );    Memo1->Lines->Add(str);
    str.printf(L"b-File Format: %.4ls", wav.format );   Memo1->Lines->Add(str);
    str.printf(L"c-File Format: %.4hs", wav.format );   Memo1->Lines->Add(str);
    str= "d-File Format: "+ String( wav.format );   Memo1->Lines->Add(str);
    str= "e-File Format: "+ String( std::string(wav.format).c_str() );   Memo1->Lines->Add(str);
    str= "f-File Format: "+ String( std::string(wav.format, 4).c_str() );   Memo1->Lines->Add(str);
    str= "g-File Format: "+ UnicodeString(wav.format, 4);   Memo1->Lines->Add(str);
Normally, the output of wav.format should be “WAVE”. Here is what the outputs look like,
    a-File Format: 䅗䕖浦⁴
    b-File Format: 䅗䕖浦⁴
    c-File Format: WAVE
    d-File Format: WAVEfmt
    e-File Format: WAVEfmt
    f-File Format: WAVE
    g-File Format: WAVE
From the outputs above,
- As you can see, only lines (c), (f), and (g) obtain and display the data correctly.
- (a) and (b) fail because the printf member of System::UnicodeString takes a wide format string and expects %s to be wide as well, while we are passing the narrow wav.format.
- (d) and (e) fail because the input in format is not nul-terminated, so the logic runs on into the next member, fmt, which happens to follow WAVE. Luckily, there is a '\0' nul terminator after the f, m, t characters, and it stops there.
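The garbled output in (a) and (b) is consistent with pairs of narrow bytes being read as single UTF-16 code units. The sketch below assumes a little-endian layout (as on Windows) and uses a hypothetical helper, `bytes_as_utf16le`. It only illustrates the arithmetic, not what `printf` does internally.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper: combines consecutive byte pairs into 16-bit code units,
// low byte first, the way a little-endian machine reads narrow data as wide.
std::vector<char16_t> bytes_as_utf16le(const char* bytes, std::size_t len) {
    assert(bytes != nullptr && len % 2 == 0);
    std::vector<char16_t> units;
    units.reserve(len / 2);
    for (std::size_t i = 0; i < len; i += 2) {
        const unsigned lo = static_cast<unsigned char>(bytes[i]);
        const unsigned hi = static_cast<unsigned char>(bytes[i + 1]);
        units.push_back(static_cast<char16_t>(lo | (hi << 8)));
    }
    return units;
}
```

Here `'W'` (0x57) and `'A'` (0x41) combine to 0x4157, `'V'` and `'E'` to 0x4556, and `'f'` and `'m'` to 0x6D66. All three fall in CJK ranges rather than among the Latin letters, which is why ideographs appear in place of the expected text.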
Is there an example to convert char string array to UnicodeString correctly in C++?
Here is a full example of converting a char array to a UnicodeString in C++ Builder,
    //---------------------------------------------------------------------------
    #include <fmx.h>
    #pragma hdrstop
    #include "Unit1.h"
    //---------------------------------------------------------------------------
    #pragma package(smart_init)
    #pragma resource "*.fmx"
    TForm1 *Form1;
    #include <iostream>
    #include <string>
    struct Twavheader
    {
        char format[4];
        char fmt[3];
        char c;
    } wav = { 'W', 'A', 'V', 'E', 'f', 'm', 't', '\0' };
    //---------------------------------------------------------------------------
    __fastcall TForm1::TForm1(TComponent* Owner) : TForm(Owner)
    {
    }
    //---------------------------------------------------------------------------
    void __fastcall TForm1::Button1Click(TObject *Sender)
    {
        UnicodeString ustr(wav.format, 4);                   // CORRECT
        Memo1->Lines->Add( "File Format: "+ ustr);
        ustr.printf(L"%.4hs", wav.format);                   // CORRECT
        Memo1->Lines->Add( "File Format: "+ ustr);
        ustr= String( std::string(wav.format, 4).c_str() );  // OKAY
        Memo1->Lines->Add( "File Format: "+ ustr);
    }
    //---------------------------------------------------------------------------