C++ character types : char , wchar_t , char8_t , char16_t and char32_t a tutorial !

mohamad wael
Jan 24, 2021


A character can be a letter , a number , a sign , an emoji , or anything that can be written in any form .

Characters Sets in C++

In C++ there is the source character set , and the execution character set .

The source character set , is the set of characters available in the source environment , and which are used in writing , the string literals "Hey" , the character literals 'H' , and which are also used to write variable names , and the program as a whole .

When a C++ program is compiled , it is converted into machine code for the execution environment . The execution environment , is where the machine code is executed , hence what is needed , is to translate the character literals such as 'H' , and the string literals such as "Hey", written in the source character set , into the execution character set , if they are different .

This being said , the source character set and the execution character set must both contain some basic characters , which are called the basic character set .

The source file itself , can be saved into a third character set , but when it is being compiled , the source file which is saved in any desired character set , must first be brought to the source character set , and later on compiled , and has its character and string literals , translated into the execution character set.

The basic character set is formed from the letters A to Z , a to z , from the digits 0 to 9 , from the new line character , the horizontal tab character , the vertical tab character , the form feed character , the space character , and the characters _ { } [ ] # ( ) < > % : ; . ? * + - / ^ & | ~ ! = , \ " ' .

Additionally the basic execution character set , must contain the null character which has a value of 0 , and which is used to terminate C style strings , the backspace character , the carriage return character , and an alert character , which emits a beeping sound , in the execution environment .

Each character of the basic character set can be stored in the char type . A narrow character set , is any character set , whose characters can each be represented using a single byte . Such a character set , is also called an ordinary character set . An example of a narrow character set , is the ASCII character set , which contains the basic character set , along with some additional characters .

What about other character sets , beside the basic character set , are they allowed by the C++ standard ? Yes , they are allowed to exist , and if their encoding does not fit using a single byte , they are called wide character sets .

These wide character sets , must contain the characters of the basic character set . They are also known as locale character sets , or extended character sets . An example of a wide character set , is the Unicode character set .

The wide character sets are represented by multiple types , all called wide character types , and which are wchar_t , char16_t , char32_t .

When translating string and character literals , from the source character set , into the execution character set , if no corresponding character exists in the execution character set , it is up to the implementation to define which character is used instead .

For example , the source character set can be ISO-8859-1 , and the execution character set can be ISO-8859-8 . ISO-8859-1 contains the character À which is not to be found in ISO-8859-8 , so it is up to the implementation , to define which character in the execution character set , À maps to .

Ordinary and narrow characters

char , unsigned char , and signed char , are called ordinary character types .

char , unsigned char , signed char , and char8_t are called narrow character types . The narrow character types , all have a size of 1 byte .

char , unsigned char , and signed char

The char type can hold any member of the basic character set . The char type can be signed , as in , it can contain both positive and negative values , or it can be unsigned , as in , it can contain only non-negative values , this is implementation defined .

The char type has a size of 1 byte , the number of bits in a byte , is not defined by the C++ standard , but typically a byte , is formed of 8 bits .

This being said , the char type can represent more characters , than those found in the basic character set . In all cases , the characters found , in the basic character set , must be encoded using a non-negative numeric value , and the digits from 0 to 9 , must be encoded one after another , in ascending order .

The char type , the unsigned char type , and the signed char type , must all have the same size and alignment . An alignment , is where in memory , an object can be placed .

All three character types are distinct : an unsigned char can only hold non-negative values , a signed char can hold both positive and negative values , whereas a char can either be signed or unsigned , this is implementation defined .

The C++ standard defines the minimum range , that the char , unsigned char , and signed char types can have .

An implementation can provide larger ranges . The actual ranges of the ordinary character types , are available as macros , defined in the climits header .

#include<climits>
#include<iostream>
int main(void ){
std::cout << "SCHAR_MIN is : " << SCHAR_MIN << "\n";
std::cout << "SCHAR_MAX is : " << SCHAR_MAX << "\n";
std::cout << "UCHAR_MAX is : " << UCHAR_MAX << "\n";
std::cout << "CHAR_MIN is : " << CHAR_MIN << "\n";
std::cout << "CHAR_MAX is : " << CHAR_MAX << "\n"; }
/*Output :
SCHAR_MIN is : -128
SCHAR_MAX is : 127
UCHAR_MAX is : 255
CHAR_MIN is : -128
CHAR_MAX is : 127 */

Before C++20 , the C++ standard did not specify which signed representation is to be used in a host environment , it could be sign and magnitude , one’s complement , or two’s complement . Since C++20 , signed integer types must use two’s complement .

Ordinary character and string literals

An ordinary character literal is the character literal enclosed by a single quote ' , for example 'a' is an ordinary character literal .

An ordinary character literal is of type char . It can contain members of the basic character set , such as r , and if it contains members of an extended character set , such as À , then these members are converted to their universal character name , in the start of the compilation process .

A universal character name is an escape sequence such as '\u00C0' , which represents the character 'À', hence a character literal can also include escape sequences .

An escape sequence is used , among other things , to represent a character , in the execution character set . For example if a horizontal tab is to be represented in the execution character set , it is represented in the source character set using the escape sequence \t , and it is physically shown in the execution environment , for example on a terminal or on a screen .

An ordinary character literal , when being converted to the execution character set , is converted to the encoding of this character , in the execution character set . So a horizontal tab \t , in the source character set , when being translated to the execution character set , and if the execution character set is ascii , it will have an encoding of 00001001 .

For ordinary character literals , the encoding storage destination , is the char type . The char type can only hold 1 byte , typically a byte is 8 bits .

Ordinary character sets , those that can be represented using a single byte , have an encoding of a character , which is the same as the character code point . A code point is just a way of numbering the characters in the character set , whereas the encoding is the representation of the code point , in the computer as a series of bits .

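The available escape sequences in C++ are :

\' : single quote
\" : double quote
\? : question mark
\\ : backslash
\a : alert
\b : backspace
\f : form feed
\n : new line
\r : carriage return
\t : horizontal tab
\v : vertical tab
\ooo : the character whose encoding is the octal number ooo , formed of one to three octal digits
\xhh : the character whose encoding is the hexadecimal number hh
\uXXXX and \UXXXXXXXX : a universal character name , formed of four or of eight hexadecimal digits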

Outside of character and string literals , a universal character name must not be used to represent a character in the basic character set , or to represent a control character .

#include<iostream>
int main(void ){
/*An example of escape sequences
usage .*/
std::cout << '\a' ;
/*Emit an alert or a beeping
sound . */
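std::cout << "hello5678You" << "\r\tMe";
/*\r moves the cursor back to the
start of the line , and \t moves
it to the next tab stop , so Me
overwrites the characters 8Y .
Output , on a terminal with tab
stops every 8 columns :
hello567Meou */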
std::cout << '\n' << '??)' << '\n';
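/*??) is a trigraph , and is
replaced in the start of ,
the compilation process with
] . Trigraphs were removed from
the language in C++17 , so with
g++ the program must be compiled ,
with the switch -trigraphs ,
g++ prg.cpp -trigraphs otherwise
trigraphs are not replaced , and
'??)' is treated as a multicharacter
literal , which is printed as
a number .
Output , with -trigraphs :
] */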
std::cout << 'A' << '\x41' << '\101';
/*\x41 is the encoding of the character
A in hexadecimal , and \101 is
the encoding of the character A
in octal .
Output :
AAA */
/*
char var_c = '\u00C0';
\u00C0 is the universal character name
of the character À . If the execution
character set is unicode , and its
encoding is UTF-8 , this will cause
a compiler error of :
character too large for enclosing character literal type
because the encoding of À will be the two bytes
C3 80 , which do not fit in a single byte .*/ }

An ordinary string literal is a literal delimited by double quotes " , as in "Hello world" . It is an array of constant ordinary characters , each member of the array stores the encoding of an ordinary character , and the array is terminated with the null character . Adjacent string literals , are concatenated into one string literal .

#include<iostream>
int main(void ){
const char *ptr_c = "Hello" "World";
/*The string literal "Hello"
and "World" are concatenated
into one .
ptr_c is a pointer to a null
terminated array , of constant
characters . Since C++11 , a
string literal cannot be
assigned to a plain char * ,
and trying to change the value
of one of the characters ,
as in :
*ptr_c = 'd';
, will cause a compile time
error .*/
char arr_c[ ] = "Hello World";
/*Create an array containing
the characters of the string
literal Hello World , and
which is null terminated .
It is perfectly legal , to
change the values of the
elements in the created array .*/
std::cout << arr_c << '\n';
/*Output :
Hello World .*/
arr_c[0 ] = 'd';
/*Set the value of the first
element of the array to
d .*/
std::cout << arr_c << '\n';
/*Output :
dello World .*/ }

char8_t and char8_t character and string literals

char8_t has an underlying type of unsigned char , but is a distinct type .

A char8_t character literal can be specified using the prefix u8 , followed by single quotes enclosing a character , as in u8'a' .

Characters preceded by u8 are encoded , using utf-8 . utf-8 is a multibyte encoding , which encodes unicode characters using : 1 byte , 2 bytes , 3 bytes , or 4 bytes .

When encoding using one byte , utf-8 always sets the first bit , of the byte to 0 , hence the available encodings are between 0 and 0x7F in hexadecimal , or between 0 and 0111 1111 in binary. So , the utf-8 encoding of one byte , can only represent 128 characters , this works well for char8_t , which has a storage size of 1 byte .

The 128 characters represented using utf-8 1 byte encoding , are the ascii characters . The ascii character set , is a subset of unicode , and it is a superset of the basic character set .

A string literal that begins with u8 , followed by double quotes , optionally enclosing some characters , as in u8"À" , is a char8_t string literal , also known as a utf-8 string literal .

A utf-8 string literal , contains the encoding of characters , of the unicode character set , using utf-8 . This results in an array of constant char8_t characters , containing the encoding of each character , in the char8_t string literal , terminated with a null character .

The encoding of each character , can be 1 , 2 , 3 or 4 bytes , depending on the character .

#include<iostream>
int main(void ){
const char8_t *ptr_cc8 = u8"À";
/*The encoding of À in utf-8 ,
is C380 .
ptr_cc8 , is a pointer to
a constant character , which
is the first element of an array
of constant char8_t characters ,
containing the encoding of À , and
terminated with the null character ,
C3 80 00 .*/
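std::cout << std::hex << static_cast<unsigned>(ptr_cc8[0 ] ) << static_cast<unsigned>(ptr_cc8[1 ] ) << static_cast<unsigned>(ptr_cc8[2 ] ) ;
/*Since C++20 , a char8_t cannot be
inserted directly into std::cout ,
so each element is converted
to unsigned first .
Output c3800 .*/ }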

The escape sequences , described in the Ordinary character and string literals section , can be used in char8_t character and string literals.

Wide characters

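The wide character types are wchar_t , char16_t , and char32_t . Each of these types has a fixed size , and they are used to store the encoding of characters in the extended character sets . Some encodings , such as utf-16 , can still need more than one element , to encode a single character .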

wchar_t and wchar_t character and string literals

wchar_t , is a wide character type . It has an integer type , decided by the implementation , this integer type is called its underlying type .

The underlying type can be unsigned or it can be signed , so it can contain only positive or both positive and negative values , but there is no signed wchar_t or unsigned wchar_t .

A wide character literal , starts with an uppercase L , followed by a character , enclosed in single quotes , as in L'À' . The encoding of this character , is stored in the wchar_t type .

Under windows , wchar_t typically has a size of 16 bits , and under linux it typically has a size of 32 bits , so under windows it can store 16 bit encodings of characters , whereas under linux it can store 32 bit encodings of characters .

A wide character string literal starts with an uppercase L , followed by optional characters , enclosed in double quotes " , as in L"Hello world" . A wide character string literal , is stored as an array of constant wchar_t .

In C++ , wchar_t is a built-in type , so no header is needed to use it . The header cwchar contains the definition of WCHAR_MAX and WCHAR_MIN , which are the maximum and minimum values storable in wchar_t .

The cwchar header , also contains utility functions , such as functions to get the length of a wide character string , or to convert a multibyte narrow character string , into a wide character string , or vice versa .

#include<stdio.h>
#include<cwchar>
int main(void ){
printf("WCHAR_MIN is : %d \n", WCHAR_MIN );
/*Output Under Linux :
WCHAR_MIN is : -2147483648
Output Under Windows :
WCHAR_MIN is : 0 */
printf("WCHAR_MAX is : %d \n", WCHAR_MAX );
/*Output Under Linux :
WCHAR_MAX is : 2147483647
Output Under Windows :
WCHAR_MAX is : 65535 */
wchar_t var_wc_BE = L'\U00010301';
/* \U00010301 is the universal
character name of the
old italic letter BE . It is
formed of \U followed by BE
code point in hexadecimal
in Unicode .
The escape sequence is preceded ,
with L , as such it is a wide
character escape sequence .
var_wc_BE is a wide character ,
and it contains the encoding
of the wide character literal .*/
printf("Encoding of BE is : %#x\n", var_wc_BE );
/*Print the stored encoding
of old italic letter 𐌁 in
hexadecimal .
Output under linux :
Encoding of BE is : 0x10301
Output under windows :
Encoding of BE is : 0xdf01
Under linux , what is stored is the
utf-32 encoding of the letter 𐌁 ,
and which is 00010301 , whereas
under windows what is stored is
the utf-16 encoding of the letter
𐌁 , and which is D800DF01 , since
under windows wchar_t is 16 bits ,
only the last 4 hex digits DF01 : are
stored .*/
const char *ptr_c_BE = "\U00010301";
/* \U00010301 is the universal character
name of the old italic letter BE .
"\U00010301" is an ordinary string
literal . A character from the
extended character set is chosen ,
and it was not specified that the
string literal is a wide string
literal .
The compiler chooses to encode
this string literal , using utf-8 ,
which is a multibyte encoding , Hence
what is stored in the encoding is :
F0 90 8C 81 */
printf("Encoding of BE is : 0x%hhx%hhx%hhx%hhx \n", ptr_c_BE[0 ] , ptr_c_BE[1 ] , ptr_c_BE[2 ] , ptr_c_BE[3 ] ) ;
/*Print the encoding stored in ptr_c_BE ,
in hexadecimal .
Output under linux :
Encoding of BE is : 0xf0908c81
Output under windows :
Encoding of BE is : 0xf0908c81 */
const wchar_t *ptr_wc_BE = L"\U00010301";
/*L"\U00010301" is a wide string literal ,
its encoding is stored as utf-32 under
linux , and as utf-16 under windows ,
in both cases , it needs 4 bytes to be stored .*/
printf("Encoding of BE is : 0x%x 0x%x \n", ptr_wc_BE[0 ] , ptr_wc_BE[1 ] ) ;
/*Output for linux , the utf-32
encoding of BE which is the
same as its code point
U00010301 , followed by the
null character :
Encoding of BE is : 0x10301 0x0
Output for windows , the utf-16
encoding of BE :
Encoding of BE is : 0xd800 0xdf01 .*/
const char *ptr_c = "ab" ;
/*"ab" is an ordinary string
literal , ptr_c is a pointer
to the first character ,
in this constant array
of characters . */
printf("%s\n" , ptr_c );
/*Print the string pointed by ,
ptr_c .
Output :
ab */
wchar_t ptr_wc[3];
/*Define an array of wide characters ,
formed of three elements .*/
mbsrtowcs(ptr_wc , &ptr_c , 3 , 0 );
/*Convert the ordinary string "ab" ,
to the wide character string L"ab" .*/
printf("%ls\n" , ptr_wc );
/*Print the wide character string
pointed by ptr_wc .
Output :
ab */ }

The escape sequences described in the Ordinary character and string literals section , can be used with wide string and character literals .

char16_t and char16_t character and string literals

char16_t has an underlying type of uint_least16_t , it is used to store , the utf-16 encoding of characters .

A char16_t character literal , starts with the lowercase letter u , followed by single quotes , enclosing a character , as in u'l' .

A char16_t string literal , starts with the lowercase letter u , followed by double quotes , enclosing some optional characters , as in u"Hey" .

The escape sequences , described earlier on , can be used in char16_t , string and character literals .

#include<cstdint>
/*Contain the min , and max values
for integer types which
have a least width ,
specific width , fastest
type with a least width
... */
#include<iostream>
int main(void ){
using namespace std ;
cout << "UINT_LEAST16_MAX is : " << UINT_LEAST16_MAX << "\n" ;
/*The max value that can be stored
in char16_t , is the same for
windows and linux , and
is UINT_LEAST16_MAX .
The min value that can
be stored is 0 .
Output :
UINT_LEAST16_MAX is : 65535 */
/* char16_t var_c16 = u'\U00010301';
This definition of var_c16 , will cause
a compiler error , of character being ,
too large for enclosing type .
The character being used is the
old italic letter BE , and is
specified using its universal character
name , this character has a utf-16
encoding of D800 DF01 , so
it needs 32 bits , and cannot fit
in 16 bits .*/
const char16_t *ptr_c16 = u"\U00010301";
/*u"\U00010301" is a utf-16 string
literal , so the encoding of
the character old italic letter
BE , specified using its universal
character name , is stored in
an array of constant char16_t
which its first element is pointed
by ptr_c16 .*/
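/*Since C++20 , a char16_t cannot be
inserted directly into cout , so
each element is converted to
unsigned first .*/
cout<< hex << "UTF-16 encoding of old italic BE is : " << static_cast<unsigned>(*ptr_c16 ) << static_cast<unsigned>(*(ptr_c16+1 ) ) <<endl ;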
/*Print the hexadecimal values stored ,
in the first and second char16_t ,
, both elements of the array pointed
by ptr_c16 .
Output , the same thing for windows ,
and linux :
UTF-16 encoding of old italic BE is : d800df01 */ }

char32_t and char32_t character and string literals

char32_t has an underlying type of uint_least32_t , and it is used to store , the utf-32 encoding , of characters .

A utf-32 character literal , starts with the capital letter U , followed by a single quote , enclosing a character , as in U'a' .

A utf-32 string literal , starts with the capital letter U , followed by double quotes , enclosing some optional characters , as in U"a" .

The escape sequences , described earlier , can be used with utf-32 , string and character literals .

#include<cstdint>
/*The cstdint header contains
the min and max values for
integer types , of specific
width , least width , fast
of least width ...*/
#include<iostream>
int main(void ){
using namespace std ;
cout << "UINT_LEAST32_MAX is : " << UINT_LEAST32_MAX << "\n" ;
/*The max value that can be stored ,
in char32_t , is UINT_LEAST32_MAX .
This is the same , for windows ,
and linux .
Output :
UINT_LEAST32_MAX is : 4294967295 */
char32_t var_c32 = U'\U00010301';
/*U'\U00010301' is a char32_t
character literal , it
contains an escape sequence ,
which represents the old italic
letter BE . This character
is encoded in utf-32 , and the
encoding is the same as the code
point , as such it is :
00010301 in hex .*/
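/*Since C++20 , a char32_t cannot be
inserted directly into cout , so
it is converted to an unsigned
integer first .*/
cout<< hex << "UTF-32 encoding of old italic BE is : " << static_cast<uint_least32_t>(var_c32 ) <<endl ;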
/*Output , the same thing on windows ,
and linux :
UTF-32 encoding of old italic BE is : 10301 */ }

Raw string literals

A raw string literal , is a string literal which is stored as it is written , escape sequences are not interpreted , new lines and white spaces are preserved .

This literal has the following format :

R"opt-delimiter(characters-of-the-literal)opt-delimiter"

The optional delimiter can be formed of at most sixteen characters , which must be members of the basic character set , with the exception of : ( ) \ , the space character , the horizontal tab , the vertical tab , the form feed , and the new line .

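A raw string literal , can be any kind of string literal , for example an ordinary string literal as in R"(a)" , or a wide character string literal as in LR"(a)" . The prefixes u8 , u , and U can also be placed before the R .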

#include<iostream>
int main(void ){
const char *ptr_c = "C:\\Windows\\System32\\drivers\\etc";
/*This is a Windows path , the backslash
character is escaped in the string
literal , as not to be
interpreted as the start of
an escape sequence .*/
std::cout << ptr_c << "\n";
/*Print the string literal ,
pointed by ptr_c .
Output :
C:\Windows\System32\drivers\etc */
ptr_c = R"(C:\Windows\System32\drivers\etc)";
/*In a raw string literal ,
nothing is interpreted , it
is stored as is , so there is
no need to escape the backslash
character in the path .*/
std::cout << ptr_c << "\n";
/*Print the string literal ,
pointed by ptr_c .
Output :
C:\Windows\System32\drivers\etc */
ptr_c =R"DLM(Use a delimiter to be able to use )" in a raw string literal)DLM";
/*An example of why to use a
delimiter . This is done ,
in order to be able to use )"
in a raw string literal .*/
std::cout << ptr_c << "\n";
/*Print the string literal pointed
by ptr_c .
Output :
Use a delimiter to be able to use )" in a raw string literal */ }

Originally published at https://twiserandom.com on January 24, 2021.
