From: Keith Marshall Date: Tue, 7 Apr 2020 18:55:12 +0000 (+0100) Subject: Document MinGW MBCS/wide character conversion functions. X-Git-Tag: wsl-5.3.1-release~6 X-Git-Url: http://git.osdn.net/view?p=mingw%2Fmingw-org-wsl.git;a=commitdiff_plain;h=fc451f9b494dd0b6bafb17ba9e35efd9a9d815e7;hp=5ffd683d2c767408f674e9ac9484210e85d581ed Document MinGW MBCS/wide character conversion functions. --- diff --git a/mingwrt/ChangeLog b/mingwrt/ChangeLog index 724c4c2..bcd305d 100644 --- a/mingwrt/ChangeLog +++ b/mingwrt/ChangeLog @@ -1,3 +1,11 @@ +2020-04-07 Keith Marshall + + Document MinGW MBCS/wide character conversion functions. + + * man/btowc.3.man man/mbrlen.3.man man/mbrtowc.3.man + * man/mbsinit.3.man man/mbsrtowcs.3.man man/wcrtomb.3.man + * man/wcsrtombs.3.man man/wctob.3.man: New files. + 2020-04-02 Keith Marshall Handle wcsrtombs() initial surrogate completion. diff --git a/mingwrt/man/btowc.3.man b/mingwrt/man/btowc.3.man new file mode 100644 index 0000000..5f200c4 --- /dev/null +++ b/mingwrt/man/btowc.3.man @@ -0,0 +1,169 @@ +.\" vim: ft=nroff +.TH %PAGEREF% MinGW "MinGW Programmer's Reference Manual" +. +.SH NAME +.B \%btowc +\- convert a single byte to a wide character +. +. +.SH SYNOPSIS +.B #include +.RB < stdio.h > +.br +.B #include +.RB < wchar.h > +.PP +.B wint_t btowc( int +.I c +.B ); +. +.IP \& -4n +Feature Test Macro Requirements for libmingwex: +.PP +.BR \%__MSVCRT_VERSION__ : +since \%mingwrt\(hy5.3, +if this feature test macro is +.IR defined , +with a value of +.I at least +.IR \%0x0800 , +(corresponding to the symbolic constant, +.BR \%__MSVCR80_DLL , +and thus declaring intent to link with \%MSVCR80.DLL, +or any later version of \%Microsoft\(aqs \%non\(hyfree runtime library, +instead of with \%MSVCRT.DLL), +calls to +.BR \%btowc () +will be directed to the implementation thereof, +within \%Microsoft\(aqs runtime DLL. +. 
+.PP +.BR \%_ISOC99_SOURCE , +.BR \%_ISOC11_SOURCE : +since \%mingwrt\(hy5.3.1, +when linking with \%MSVCRT.DLL, +or when +.B \%__MSVCRT_VERSION__ +is either +.IR undefined , +or is +.I defined +with any value which is +.I less than +.IR \%0x0800 , +(thus denying intent to link with \%MSVCR80.DLL, +or any later \%non\(hyfree version of Microsoft\(aqs runtime library), +.I explicitly +defining either of these feature test macros +will cause any call to +.BR \%btowc () +to be directed to the +.I \%libmingwex +implementation; +if neither macro is defined, +calls to +.BR \%btowc () +will be directed to Microsoft\(aqs runtime implementation, +if it is available, +otherwise falling back to the +.I \%libmingwex +implementation. +. +.PP +Prior to \%mingwrt\(hy5.3, +none of the above feature test macros have any effect on +.BR \%btowc (); +all calls will be directed to the +.I \%libmingwex +implementation. +. +. +.SH DESCRIPTION +If +.I c +is not +.BR EOF , +the +.BR \%btowc () +function attempts to interpret +.I c +as a multibyte character sequence of length +.IR one ; +if the single byte evaluated represents a complete multibyte character, +in the codeset which is associated with the +.B \%LC_CTYPE +category of the active process locale, +.BR \%btowc () +converts it to, +and returns, +its equivalent wide character value. +. +. +.SH RETURN VALUE +If +.I c +is +.BR EOF , +or if it does not represent a complete multibyte +character sequence of length +.IR one , +.BR \%btowc () +returns +.BR WEOF ; +otherwise the conversion of the single byte character, +to its equivalent wide character value, +is returned. +. +. +.SH ERROR CONDITIONS +No error conditions are defined. +. +. 
+.SH STANDARDS CONFORMANCE +Except to the extent that it may be affected by limitations +of the underlying \%MS\(hyWindows API, +the +.I \%libmingwex +implementation of +.BR \%btowc () +conforms generally to +.BR \%ISO\(hyC99 , +.BR \%POSIX.1\(hy2001 , +and +.BR \%POSIX.1\(hy2008 ; +(prior to \%mingwrt\-5.3, +and in those cases where calls may be delegated +to a Microsoft runtime DLL implementation, +this level of conformity may not be achieved). +. +. +.\"SH EXAMPLE +. +. +.SH CAVEATS AND BUGS +Use of the +.BR \%btowc () +function is +.IR discouraged ; +it serves no purpose which may not be better served by the +.BR \%mbrtowc (3) +function, +which should be considered as a preferred alternative. +. +. +.SH SEE ALSO +.BR mbrtowc (3) +. +. +.SH AUTHOR +This manpage was written by \%Keith\ Marshall, +\%, +to document the +.BR \%btowc () +function as it has been implemented for the MinGW.org Project. +It may be copied, modified and redistributed, +without restriction of copyright, +provided this acknowledgement of contribution by +the original author remains in place. +. +.\" EOF diff --git a/mingwrt/man/mbrlen.3.man b/mingwrt/man/mbrlen.3.man new file mode 100644 index 0000000..e86728a --- /dev/null +++ b/mingwrt/man/mbrlen.3.man @@ -0,0 +1,377 @@ +.\" vim: ft=nroff +.TH %PAGEREF% MinGW "MinGW Programmer's Reference Manual" +. +.SH NAME +.B mbrlen +\- determine the number of bytes in a multibyte character +. +. +.SH SYNOPSIS +.B #include +.RB < wchar.h > +.PP +.B size_t mbrlen( const char +.BI * s , +.B size_t +.IB n , +.B mbstate_t +.BI * ps +.B ); +. +. 
+.IP \& -4n +Feature Test Macro Requirements for libmingwex: +.PP +.BR \%__MSVCRT_VERSION__ : +since \%mingwrt\(hy5.3, +if this feature test macro is +.IR defined , +with a value of +.I at least +.IR \%0x0800 , +(corresponding to the symbolic constant, +.BR \%__MSVCR80_DLL , +and thus declaring intent to link with \%MSVCR80.DLL, +or any later version of \%Microsoft\(aqs \%non\(hyfree runtime library, +instead of with \%MSVCRT.DLL), +calls to +.BR \%mbrlen () +will be directed to the implementation thereof, +within \%Microsoft\(aqs runtime DLL. +. +.PP +.BR \%_ISOC99_SOURCE , +.BR \%_ISOC11_SOURCE : +since \%mingwrt\(hy5.3.1, +when linking with \%MSVCRT.DLL, +or when +.B \%__MSVCRT_VERSION__ +is either +.IR undefined , +or is +.I defined +with any value which is +.I less than +.IR \%0x0800 , +(thus denying intent to link with \%MSVCR80.DLL, +or any later \%non\(hyfree version of Microsoft\(aqs runtime library), +.I explicitly +defining either of these feature test macros +will cause any call to +.BR \%mbrlen () +to be directed to the +.I \%libmingwex +implementation; +if neither macro is defined, +calls to +.BR \%mbrlen () +will be directed to Microsoft\(aqs runtime implementation, +if it is available, +otherwise falling back to the +.I \%libmingwex +implementation. +. +.PP +Prior to \%mingwrt\(hy5.3, +none of the above feature test macros have any effect on +.BR \%mbrlen (); +all calls will be directed to the +.I \%libmingwex +implementation. +. +. +.SH DESCRIPTION +The +.BR \%mbrlen () +function inspects the sequence of bytes, +starting at +.IR s , +up to a maximum of +.I n +bytes, +to determine the number of bytes required to complete +the next multibyte code point, +commencing from the conversion state specified in +.IR *ps , +(which is then updated). +. +.PP +The sequence of bytes, +pointed to by +.IR s , +is interpreted as a multibyte character sequence +in the codeset which is associated with the +.B \%LC_CTYPE +category of the active process locale. +. 
+.PP +If +.I ps +is specified as a NULL pointer, +.BR \%mbrlen () +will track conversion state using an internal +.B \%mbstate_t +object reference, +which is private within the +.BR \%mbrlen () +process address space; +at process \%start\(hyup, +this internal +.B \%mbstate_t +object is initialized to represent +the initial conversion state. +. +. +.SH RETURN VALUE +If the multibyte sequence, +completed by +.I n +or fewer bytes, +does not represent the NUL code point, +then +.BR \%mbrlen () +returns the number of bytes which are actually required +to complete the sequence, +(a number between 1 and +.IR n , +inclusive), +and the conversion state, +as specified in +.IR *ps , +is reset to the initial state. +. +.PP +On the other hand, +if the completed multibyte sequence +.I does +represent the NUL code point, +then +.BR \%mbrlen () +returns zero, +and the conversion state, +as specified in +.IR *ps , +is reset to the initial state. +. +.PP +If +.I n +is less than the effective +.B \%MB_CUR_MAX +for the active process locale, +and +.I n +bytes is insufficient to complete a multibyte character, +then +.I *ps +is updated to represent a new partially completed encoding state, +and +.BR \%mbrlen () +returns +.IR \%(size_t)(\-2) . +Conversely, +if +.I n +is equal to, +or greater than +.BR \%MB_CUR_MAX , +this return condition can arise, +only if the multibyte encoding sequence includes +redundant shift states; +since shift states are not used, +this cannot occur in any \%MS\(hyWindows +multibyte character set. +. +. +.SH ERROR CONDITIONS +If the sequence of +.I n +or fewer bytes, +pointed to by +.IR s , +extends any pending encoding state recorded within +.IR *ps , +to at least +.B \%MB_CUR_MAX +bytes, +and the resulting sequence does not represent +a valid multibyte character, +then +.I \%errno +is set to +.BR \%EILSEQ , +and +.BR \%mbrlen () +returns +.IR \%(size_t)(\-1) . +. 
+.PP +If, +on entry to +.BR \%mbrlen (), +the conversion state represented by +.I *ps +is deemed to be +.IR invalid , +.I \%errno +is set to +.BR \%EINVAL , +and +.BR \%mbrlen () +returns +.IR \%(size_t)(\-1) ; +the conversion state may be deemed to be invalid if +it contains any sequence of bytes which does not match +a valid initial sequence from a multibyte character +representation within the currently active codeset, +if it can be interpreted as a complete multibyte character, +.I without +the addition of any further bytes from +.IR s , +or if it represents a +.I surrogate\ pair +conversion, +resulting from a preceding call to the +.BR \%mbrtowc (3) +function, +and from which the +.I low\ surrogate +has yet to be retrieved. +. +. +.SH STANDARDS CONFORMANCE +Except to the extent that it may be affected by limitations +of the underlying \%MS\(hyWindows API, +the +.I \%libmingwex +implementation of +.BR \%mbrlen () +conforms generally to +.BR \%ISO\(hyC99 , +.BR \%POSIX.1\(hy2001 , +and +.BR \%POSIX.1\(hy2008 ; +(prior to \%mingwrt\-5.3 , +and in those cases where calls may be delegated +to a Microsoft runtime DLL implementation, +this level of conformity may not be achieved). +. +.PP +The feature whereby +.I \%errno +is set to +.BR EINVAL , +when +.I *ps +is found to be invalid, +is a +.B POSIX.1 +conforming extension to +.BR \%ISO\(hyC99 . +. +. +.\"SH EXAMPLE +. +. +.SH CAVEATS AND BUGS +If +.BR \%mbrlen () +is called with a NULL pointer for +.IR s , +the behaviour is undefined. +. +.PP +Due to a documented limitation of Microsoft\(aqs +.BR \%setlocale () +function implementation, +it is not possible to directly select an active locale, +in which the codeset is represented by any multibyte +character sequence with an effective +.B \%MB_CUR_MAX +of more than two bytes. +Prior to \%mingwrt\(hy5.3, +this limitation precludes the use of +.BR \%mbrlen () +to interpret any codeset with +.B \%MB_CUR_MAX +greater than two bytes, +(such as +.BR \%UTF\(hy8 ). 
+From \%mingwrt\(hy5.3 onward, +the MinGW.org implementation of +.BR \%mbrlen () +mitigates this limitation by assignment of the codeset +from the +.B \%LC_CTYPE +environment variable, +provided the system default has been previously activated +for the +.B \%LC_CTYPE +locale category; +e.g.\ execution of: +.PP +.RS 4 +.EX +#define _ISOC99_SOURCE + +#include <limits.h> +#include <locale.h> +#include <stdio.h> +#include <stdlib.h> +#include <wchar.h> + +int main() +{ + setlocale( LC_CTYPE, "" ); + putenv( "LC_CTYPE=en_GB.65001" ); + printf( "%u bytes\en", + mbrlen( "\eU0001d10b", MB_LEN_MAX, NULL ) + ); + return 0; +} +.EE +.RE +.PP +will interpret the string \fC\%"\eU0001d10b"\fP as a \%four\(hybyte +.B \%UTF\(hy8 +encoding sequence, +(which represents a single code point), +and print the result as \fC4\fP\ \fC\%bytes\fP. +. +.PP +Please be aware that the underlying \%MS\(hyWindows API, +which is used to interpret the multibyte sequence, +offers no readily accessible mechanism to discriminate +between incomplete and invalid sequences; +thus, +if +.I n +is less than the effective +.B \%MB_CUR_MAX +for the active codeset, +this +.BR \%mbrlen () +implementation may return +.IR \%(size_t)(\-2) , +indicating an incomplete sequence, +even in cases where there are no additional bytes +which could be appended, +to complete a valid encoding sequence. +. +. +.SH SEE ALSO +.BR mbrtowc (3) +. +. +.SH AUTHOR +This manpage was written by \%Keith\ Marshall, +\%, +to document the +.BR \%mbrlen () +function as it has been implemented for the MinGW.org Project. +It may be copied, modified and redistributed, +without restriction of copyright, +provided this acknowledgement of contribution by +the original author remains in place. +. +.\" EOF diff --git a/mingwrt/man/mbrtowc.3.man b/mingwrt/man/mbrtowc.3.man new file mode 100644 index 0000000..d6615af --- /dev/null +++ b/mingwrt/man/mbrtowc.3.man @@ -0,0 +1,681 @@ +.\" vim: ft=nroff +.TH %PAGEREF% MinGW "MinGW Programmer's Reference Manual" +. 
+.SH NAME +.B mbrtowc +\- convert from multibyte to wide character encoding +. +. +.SH SYNOPSIS +.B #include +.RB < wchar.h > +.PP +.B size_t mbrtowc( wchar_t +.BI * pwc , +.B const char +.BI * s , +.B size_t +.IB n , +.B mbstate_t +.BI * ps +.B ); +. +.IP \& -4n +Feature Test Macro Requirements for libmingwex: +.PP +.BR \%__MSVCRT_VERSION__ : +since \%mingwrt\(hy5.3, +if this feature test macro is +.IR defined , +with a value of +.I at least +.IR 0x0800 , +(corresponding to the symbolic constant, +.BR \%__MSVCR80_DLL , +and thus declaring intent to link with \%MSVCR80.DLL, +or any later version of \%Microsoft\(aqs \%non\(hyfree runtime library, +instead of with \%MSVCRT.DLL), +calls to +.BR mbrtowc () +will be directed to the implementation thereof, +within \%Microsoft\(aqs runtime DLL. +. +.PP +.BR \%_ISOC99_SOURCE , +.BR \%_ISOC11_SOURCE : +since \%mingwrt\(hy5.3.1, +when linking with \%MSVCRT.DLL, +or when +.B \%__MSVCRT_VERSION__ +is either +.IR undefined , +or is +.I defined +with any value which is +.I less than +.IR 0x0800 , +(thus denying intent to link with \%MSVCR80.DLL, +or any later \%non\(hyfree version of Microsoft\(aqs runtime library), +.I explicitly +defining either of these feature test macros +will cause any call to +.BR \%mbrtowc () +to be directed to the +.I \%libmingwex +implementation; +if neither macro is defined, +calls to +.BR \%mbrtowc () +will be directed to Microsoft\(aqs runtime implementation, +if it is available, +otherwise falling back to the +.I \%libmingwex +implementation. +. +.PP +Prior to \%mingwrt\(hy5.3, +none of the above feature test macros have any effect on +.BR \%mbrtowc (); +all calls will be directed to the +.I \%libmingwex +implementation. +. +. +.SH DESCRIPTION +If +.I s +is a NULL pointer, +the +.IR pwc , +and the +.I n +arguments are ignored, +and the call to the +.BR \%mbrtowc () +function is interpreted as if invoked as +.PP +.RS 4n +.EX +mbrtowc( NULL, "", 1, ps ); +.EE +.RE +. 
+.PP +Otherwise, +if +.I s +is not a NULL pointer, +the +.BR \%mbrtowc () +function inspects the sequence of bytes, +starting at +.IR s , +up to a maximum of +.I n +bytes, +to determine the number of bytes required to complete +the next multibyte code point, +commencing from the conversion state specified in +.IR *ps , +(which is then updated). +Then, +if +.I pwc +is not a NULL pointer, +and +.I n +or fewer bytes is sufficient to complete a single +multibyte character, +the single +.B \%wchar_t +wide character conversion of that multibyte character +is stored at +.IR *pwc . +. +.PP +The sequence of bytes, +pointed to by +.IR s , +is interpreted as a multibyte character sequence +in the codeset which is associated with the +.B \%LC_CTYPE +category of the active process locale. +. +.PP +If +.I ps +is specified as a NULL pointer, +.BR \%mbrtowc () +will track conversion state using an internal +.B \%mbstate_t +object reference, +which is private within the +.BR \%mbrtowc () +process address space; +at process \%start\(hyup, +this internal +.B \%mbstate_t +object is initialized to represent +the initial conversion state. +. +.PP +In the special case, +where the conversion of a completed multibyte character +must be represented as a +.B \%UTF\(hy16LE +.IR surrogate\ pair , +and +.I pwc +is not a NULL pointer, +only the +.I high\ surrogate +will be stored at +.IR *pwc ; +please refer to the section +.B CAVEATS AND +.BR BUGS , +below, +for advice on retrieval of the +.IR low\ surrogate . +. +. 
+.SH RETURN VALUE +If the multibyte sequence, +completed by +.I n +or fewer bytes, +does not represent the NUL code point, +then +.BR \%mbrtowc () +returns the number of bytes which are actually required +to complete the sequence, +(a number between 1 and +.IR n , +inclusive), +and the conversion state, +as specified in +.IR *ps , +is reset to the initial state; +if +.I pwc +is not a NULL pointer, +the wide character conversion of the completed +multibyte character is stored at +.IR *pwc . +. +.PP +On the other hand, +if the completed multibyte sequence +.I does +represent the NUL code point, +then +.BR \%mbrtowc () +returns zero, +and the conversion state, +as specified in +.IR *ps , +is reset to the initial state; +if +.I pwc +is not a NULL pointer, +the NUL wide character is stored at +.IR *pwc . +. +.PP +If +.I n +is less than the effective +.B \%MB_CUR_MAX +for the active process locale, +and +.I n +bytes is insufficient to complete a multibyte character, +then +.I *ps +is updated to represent a new partially completed encoding state, +(no wide character conversion is stored), +and +.BR \%mbrtowc () +returns +.IR \%(size_t)(\-2) . +(If +.I n +is equal to, +or greater than +.BR \%MB_CUR_MAX , +this return condition can arise, +only if the multibyte encoding sequence includes +redundant shift states; +since shift states are not used, +this cannot occur in any \%MS\(hyWindows +multibyte character set). +. +. +.SH ERROR CONDITIONS +If the sequence of +.I n +or fewer bytes, +pointed to by +.IR s , +extends any pending encoding state recorded within +.IR *ps , +to at least +.B \%MB_CUR_MAX +bytes, +and the resulting sequence does not represent +a valid multibyte character, +then +.I \%errno +is set to +.BR \%EILSEQ , +no wide character conversion is stored, +and +.BR \%mbrtowc () +returns +.IR \%(size_t)(\-1) . +. 
+.PP +If, +on entry to +.BR \%mbrtowc (), +the conversion state represented by +.I *ps +is deemed to be +.IR invalid , +.I \%errno +is set to +.BR \%EINVAL , +and +.BR \%mbrtowc () +returns +.IR \%(size_t)(\-1) ; +the conversion state may be deemed to be invalid if +it contains any sequence of bytes which does not match +a valid initial sequence from a multibyte character +representation within the currently active codeset, +if it can be interpreted as a complete multibyte character, +.I without +the addition of any further bytes from +.IR s , +or if it represents a +.I surrogate\ pair +conversion, +resulting from a preceding call to +.BR \%mbrtowc (), +from which the +.I low\ surrogate +has yet to be retrieved, +(and this is not the special case in which +.I n +is specified as +.IR \%zero , +indicating that this call is intended +to retrieve that pending +.IR low\ surrogate ). +. +. +.SH STANDARDS CONFORMANCE +Except in respect of its extended provision for handling of +.IR surrogate\ pairs , +and to the extent that it may be affected by limitations +of the underlying \%MS\(hyWindows API, +the +.I \%libmingwex +implementation of +.BR mbrtowc () +conforms generally to +.BR \%ISO\(hyC99 , +.BR \%POSIX.1\(hy2001 , +and +.BR \%POSIX.1\(hy2008 ; +(prior to \%mingwrt\-5.3, +and in those cases where calls may be delegated +to a Microsoft runtime DLL implementation, +this level of conformity may not be achieved). +. +.PP +The feature whereby +.I \%errno +is set to +.BR EINVAL , +when +.I *ps +is found to be invalid, +is a +.B POSIX.1 +conforming extension to +.BR \%ISO\(hyC99 . +. +. +.\"SH EXAMPLE +. +. +.SH CAVEATS AND BUGS +Due to a documented limitation of Microsoft\(aqs +.BR \%setlocale () +function implementation, +it is not possible to directly select an active locale, +in which the codeset is represented by any multibyte +character sequence with an effective +.B \%MB_CUR_MAX +of more than two bytes. 
+Prior to \%mingwrt\(hy5.3, +this limitation precludes the use of +.BR \%mbrtowc () +to interpret any codeset with +.B \%MB_CUR_MAX +greater than two bytes, +(such as +.BR \%UTF\(hy8 ). +From \%mingwrt\(hy5.3 onward, +the MinGW.org implementation of +.BR \%mbrtowc () +mitigates this limitation by assignment of the codeset +from the +.B \%LC_CTYPE +environment variable, +provided the system default has been previously activated +for the +.B \%LC_CTYPE +locale category; +e.g.\ execution of: +.PP +.RS 4n +.EX +#include <limits.h> +#include <locale.h> +#include <stdio.h> +#include <stdlib.h> +#include <wchar.h> + +void print_conv( const char * ); + +int main() +{ + setlocale( LC_CTYPE, "" ); + putenv( "LC_CTYPE=en_GB.65001" ); + print_conv( "\eU0001d10b" ); + print_conv( "\eu6c34" ); + return 0; +} + +void print_conv( const char *mbs ) +{ + wchar_t wch; + size_t n = mbrtowc( &wch, mbs, MB_LEN_MAX, NULL ); + if( (int)(n) > 0 ) printf( "%u bytes \-> 0x%04X\en", n, wch ); + else if( n == (size_t)(\-1) ) perror( "mbrtowc" ); +} +.EE +.RE +.PP +will interpret the string \fC"\eU0001d10b"\fP as a \%four\(hybyte +.B \%UTF\(hy8 +encoding sequence, +(which represents a single Unicode code point), +but will fail to interpret the following \fC"\eu6c34"\fP sequence, +(which also represents a valid Unicode code point), +and, +(if +.B stderr +is redirected to +.BR stdout ), +will print the result as: +.PP +.RS 4n +.EX +4 bytes \-> 0xD834 +mbrtowc: Invalid argument +.EE +.RE +.PP +This example illustrates a potentially irreconcilable +deviation of any +.BR \%mbrtowc () +implementation, +on \%MS\(hyWindows, +from the +.B \%ISO\(hyC99 +standard: +due to \%Microsoft\(aqs choice of +.B \%UTF\(hy16LE +as the underlying representation of the +.B \%wchar_t +data type, +it is not possible to satisfy the requirement, +implicit in the +.B \%ISO\(hyC99 +specification for +.BR \%mbrtowc (), +that it should be possible to return the complete representation +of any single representable Unicode code point as a single +.B \%wchar_t +value. 
+In the case of this example, +whereas the \%4\(hybyte +.B \%UTF\(hy8 +representation of the \fC\%"\eU0001d10b"\fP Unicode code point +.I is +complete, +the \fC\%0xD834\fP +.B \%wchar_t +representation, +as returned by +.BR \%mbrtowc (), +is +.I not +complete; +it represents a +.B \%UTF\(hy16 +.IR high\ surrogate , +which +.I must +be paired with a corresponding +.I low\ surrogate +to complete it, +and, +since +.B \%ISO\(hyC99 +requires that the +.B \%*pwc +argument to +.BR \%mbrtowc () +refers to sufficient storage space to accommodate only +.I one +.B \%wchar_t +value, +it is not possible for +.BR \%mbrtowc () +to +.I safely +return +.I both +the +.IR high\ surrogate , +and its complementary +.IR low\ surrogate , +in a single call. +To mitigate this non\(hyconformance, +from \%mingwrt\(hy5.3 onward, +the \%MinGW implementation of +.BR \%mbrtowc () +supports the following non\(hystandard strategy +for completion of any conversion which requires return of a +.IR surrogate\ pair : +. +.RS 2n +.ll -2n +.IP \(bu 2n +Any translation unit, +in which +.BR \%mbrtowc () +is called, +should: +.RS 2n +.ll -2n +.IP a) 3n +explicitly define either the +.BR \%_ISOC99_SOURCE , +or the +.B \%_ISOC11_SOURCE +feature test macro, +(with any arbitrary value, +or even no value), +.B before +including +.I any +header file, +and +.IP b) 3n +include the +.B \%<winnls.h> +header file, +in addition to the required +.B \%<wchar.h> +header. +.ll +2n +.RE +. +.IP \(bu 2n +Following each call of +.BR \%mbrtowc (), +which returns a +.B \%wchar_t +value with a converted byte count greater than zero, +test the returned +.B \%wchar_t +value, +using the +.BR \%IS_HIGH_SURROGATE () +macro. +. 
+.IP \(bu 2 +When the +.BR \%IS_HIGH_SURROGATE () +macro call indicates that the returned +.B \%wchar_t +value does represent a +.IR high\ surrogate , +immediately call +.BR mbrtowc () +again, +passing the +.B \%*ps +state as returned by the original call, +together with the original multibyte sequence reference, +but with an explicit scan length limit, +.BR \%n , +of zero, +and an alternative +.B \%wchar_t +buffer reference pointer, +for storage of the +.IR low\ surrogate ; +on successful retrieval of this +.IR low\ surrogate , +the additional converted byte count will be returned as zero, +and the pending +.B \%*ps +conversion state will have been cleared, +(i.e.\& reset to the initial state). +.ll +2n +.RE +. +.PP +Thus, +considering the preceding example, +to support interpretation of +.I surrogate pairs +the example code should be modified by insertion of: +.PP +.RS 4n +.EX +#define _ISOC99_SOURCE +#include <winnls.h> +.EE +.RE +.PP +at the top of the source file, +and reimplementation of the +.BR print_conv () +function, +to incorporate the +.BR IS_HIGH_SURROGATE () +test, +and response: +.PP +.RS 4n +.EX +void print_conv( const char *mbs ) +{ + wchar_t wch; + size_t n = mbrtowc( &wch, mbs, MB_LEN_MAX, NULL ); + if( (int)(n) > 0 ) + { + if( IS_HIGH_SURROGATE( wch ) ) + { + wchar_t wcl; + mbrtowc( &wcl, mbs, 0, NULL ); + printf( "%u bytes \-> 0x%04X:0x%04X\en", n, wch, wcl ); + } + else printf( "%u bytes \-> 0x%04X\en", n, wch ); + } + else if( n == (size_t)(\-1) ) perror( "mbrtowc" ); +} +.EE +.RE +. +.PP +With these changes in place, +the output from the program becomes: +.PP +.RS 4n +.EX +4 bytes \-> 0xD834:0xDD0B +3 bytes \-> 0x6C34 +.EE +.RE +.PP +thus now correctly reporting the conversion of the +.IR surrogate\ pair , +and then correctly interpreting the following \%3-byte +.B \%UTF\(hy8 +sequence. +. 
+.PP +Please be aware that the underlying \%MS\(hyWindows API, +which is used to interpret the multibyte sequence, +offers no readily accessible mechanism to discriminate +between incomplete and invalid sequences; +thus, +if +.I n +is less than the effective +.B \%MB_CUR_MAX +for the active codeset, +this +.BR \%mbrtowc () +implementation may return +.IR \%(size_t)(\-2) , +indicating an incomplete sequence, +even in cases where there are no additional bytes +which could be appended, +to complete a valid encoding sequence. +. +. +.SH SEE ALSO +.BR mbsrtowcs (3) +. +. +.SH AUTHOR +This manpage was written by \%Keith\ Marshall, +\%, +to document the +.BR \%mbrtowc () +function as it has been implemented for the MinGW.org Project. +It may be copied, modified and redistributed, +without restriction of copyright, +provided this acknowledgement of contribution by +the original author remains in place. +. +.\" EOF diff --git a/mingwrt/man/mbsinit.3.man b/mingwrt/man/mbsinit.3.man new file mode 100644 index 0000000..dfff9e3 --- /dev/null +++ b/mingwrt/man/mbsinit.3.man @@ -0,0 +1,262 @@ +.\" vim: ft=nroff +.TH %PAGEREF% MinGW "MinGW Programmer's Reference Manual" +. +.SH NAME +.B \%mbsinit +\- check state of multibyte to wide character conversion +. +. +.SH SYNOPSIS +.B #include +.RB < wchar.h > +.PP +.B int mbsinit( mbstate_t +.BI * ps +.B ); +. +. +.SH DESCRIPTION +If +.I ps +is not a NULL pointer, +the +.BR \%mbsinit () +function determines whether the +.B \%mbstate_t +object, +to which it points, +represents a multibyte to wide character conversion in the +.IR initial , +or in an +.I intermediate +state. +. +.PP +The +.I initial +conversion state is represented by a +.I zero\(hyvalued +.B \%mbstate_t +object. +(POSIX.1 stipulates that this representation must be supported, +although additional alternative representations are permitted; +MinGW uses only the zero\(hyvalued representation). +. 
+.PP +In MinGW, +an initial conversion state may be established by initialization: +.PP +.RS 4n +.EX +mbstate_t st = (mbstate_t)(0), *ps = &st; +.EE +.RE +.PP +or by assignment: +.PP +.RS 4n +.EX +*ps = (mbstate_t)(0); +.EE +.RE +.PP +However, +for portability: +.PP +.RS 4n +.EX +memset( ps, 0, sizeof( mbstate_t )); +.EE +.RE +.PP +may be preferred. +. +.PP +Nominally, +.B \%mbstate_t +objects represent +.I shift states +of the active codeset. +However, +since \%MS\(hyWindows codesets do not use shift states, +as such, +MinGW uses +.B \%mbstate_t +objects to represent an alternative class of +.I intermediate conversion +.IR states , +viz.: +.RS 2n +.ll -2n +.IP \(bu 2n +Parsing of a multibyte sequence has been interrupted, +before interpretation of +.B \%MB_CUR_MAX +bytes, +without identification of a complete code point; +this conversion state may arise following a call of +.BR mbrlen (3), +or +.BR mbrtowc (3), +which has returned a parsed sequence length of +.IR \%(size_t)(\-2) . +. +.IP \(bu 2n +Processing of a wide character sequence has encountered a +.IR high\ surrogate , +but the complementary +.I low surrogate +has yet to be evaluated; +this state may arise after a call of +.BR mbrtowc (3), +has returned the +.IR high\ surrogate , +(with a returned sequence length between +.I one +and +.BR \%MB_CUR_MAX ), +and a further call is needed, +to retrieve the +.IR low\ surrogate ; +alternatively, +a complementary conversion state may arise when +.BR wcrtomb (3) +has been called to interpret a +.IR high\ surrogate , +and a further call, +to complete the conversion to a multibyte sequence, +by evaluation of the complementary +.IR low\ surrogate , +is still required. +.ll +2n +.RE +. +. 
+.SH RETURN VALUE +If +.I ps +is a NULL pointer, +or if the conversion state, +represented by the +.B \%mbstate_t +object to which it points, +is the +.I initial +state, +.BR \%mbsinit () +returns a +.I \%non\(hyzero +value; +otherwise, +.I \%zero +is returned, +indicating an +.I intermediate +conversion state. +. +. +.SH ERROR CONDITIONS +No error conditions are defined. +. +. +.SH STANDARDS CONFORMANCE +There is no Microsoft implementation of the +.BR mbsinit () +function, +which is readily accessible for use in MinGW applications; +the +.I \%libmingwex +implementation conforms generally to +.BR \%ISO\(hyC99 , +.BR \%POSIX.1\(hy2001 , +and +.BR \%POSIX.1\(hy2008 . +. +. +.\"SH EXAMPLE +. +. +.SH CAVEATS AND BUGS +Prior to \%mingwrt\(hy5.3, +the +.I \%libmingwex +implementation of +.BR mbsinit () +would always return +.IR \%non\(hyzero , +apparently indicating an +.I initial +conversion state, +regardless of the actual state indicated by any +.B \%mbstate_t +object referred to by +.IR *ps ; +this defect is corrected, +in \%mingwrt\(hy5.3. +. +.PP +Any +.I intermediate conversion +.IR state , +arising from a call to +.BR mbrlen (3), +.BR mbrtowc (3), +or +.BR wcrtomb (3), +is specific to the particular conversion which produces it. +Any intermediate state produced by +.BR mbrlen (3), +or by +.BR mbrtowc (3) +may be resolved by a further call to either of these two functions, +or to +.BR mbsrtowcs (3), +provided the initial part of the multibyte sequence, +passed in the subsequent call, +completes the sequence which led to the intermediate state; +if this intermediate state is used in any other context, +the consequent behaviour is undefined. +. 
+.PP +Similarly, +an intermediate state resulting from a call to +.BR wcrtomb (3) +may be resolved by a further call to +.BR wcrtomb (3), +or to +.BR wcsrtombs (3), +provided the first, +(or the only), +wide character to be interpreted, +in the subsequent call, +represents the +.I low surrogate +which completes the pending +.I surrogate pair +from which the intermediate state was created. +Once again, +if this intermediate state is used in any other context, +the consequent behaviour is undefined. +. +. +.SH SEE ALSO +.BR \%mbrlen (3), +.BR \%mbrtowc (3), +.BR \%mbsrtowcs (3), +.BR \%wcrtomb (3), +and +.BR \%wcsrtombs (3). +. +. +.SH AUTHOR +This manpage was written by \%Keith\ Marshall, +\%, +to document the +.BR \%mbsinit () +function as it has been implemented for the MinGW.org Project. +It may be copied, modified and redistributed, +without restriction of copyright, +provided this acknowledgement of contribution by +the original author remains in place. +. +.\" EOF diff --git a/mingwrt/man/mbsrtowcs.3.man b/mingwrt/man/mbsrtowcs.3.man new file mode 100644 index 0000000..1f754b5 --- /dev/null +++ b/mingwrt/man/mbsrtowcs.3.man @@ -0,0 +1,521 @@ +.\" vim: ft=nroff +.TH %PAGEREF% MinGW "MinGW Programmer's Reference Manual" +. +.SH NAME +.B mbsrtowcs +\- convert from multibyte to wide character string +. +. +.SH SYNOPSIS +.B #include +.RB < wchar.h > +.PP +.B size_t mbsrtowcs( wchar_t +.BI * dst , +.B const char +.BI ** src , +.B size_t +.IB len , +.B mbstate_t +.BI * ps +.B ); +. 
+.IP \& -4n +Feature Test Macro Requirements for libmingwex: +.PP +.BR \%__MSVCRT_VERSION__ : +since \%mingwrt\(hy5.3, +if this feature test macro is +.IR defined , +with a value of +.I at least +.IR 0x0800 , +(corresponding to the symbolic constant, +.BR \%__MSVCR80_DLL , +and thus declaring intent to link with \%MSVCR80.DLL, +or any later version of \%Microsoft\(aqs \%non\(hyfree runtime library, +instead of with \%MSVCRT.DLL), +calls to +.BR mbsrtowcs () +will be directed to the implementation thereof, +within \%Microsoft\(aqs runtime DLL. +. +.PP +.BR \%_ISOC99_SOURCE , +.BR \%_ISOC11_SOURCE : +since \%mingwrt\(hy5.3.1, +when linking with \%MSVCRT.DLL, +or when +.B \%__MSVCRT_VERSION__ +is either +.IR undefined , +or is +.I defined +with any value which is +.I less than +.IR 0x0800 , +(thus denying intent to link with \%MSVCR80.DLL, +or any later \%non\(hyfree version of Microsoft\(aqs runtime library), +.I explicitly +defining either of these feature test macros +will cause any call to +.BR \%mbsrtowcs () +to be directed to the +.I \%libmingwex +implementation; +if neither macro is defined, +calls to +.BR \%mbsrtowcs () +will be directed to Microsoft\(aqs runtime implementation, +if it is available, +otherwise falling back to the +.I \%libmingwex +implementation. +. +.PP +Prior to \%mingwrt\(hy5.3, +none of the above feature test macros have any effect on +.BR \%mbsrtowcs (); +all calls will be directed to the +.I \%libmingwex +implementation. +. +. +.SH DESCRIPTION +.PP +Commencing from the conversion state specified in +.IR *ps , +the +.BR \%mbsrtowcs () +function converts the multibyte character sequence, +starting at +.IR *src , +to a sequence of wide characters; +each conversion is performed as if by calling the +.BR mbrtowc (3) +function. +. 
+.PP +If +.I dst +is not a NULL pointer, +the resulting sequence of wide characters, +up to a maximum of +.I len +in number, +will be stored as a wide character string, +starting at +.IR dst ; +conversion may be curtailed, +before +.I len +wide characters have been stored, +under any of the following conditions: +.RS 2n +.ll -2n +.IP \(bu 2n +The result of any one conversion represents the NUL wide character, +(in which case the NUL wide character is stored, +but is not included in the count of characters converted). +. +.IP \(bu 2n +The result of any single multibyte character conversion is a +.IR surrogate\ pair , +but the available space, +remaining in the conversion buffer, +is insufficient to accommodate more than one +.B \%wchar_t +value. +. +.IP \(bu 2n +An invalid multibyte character sequence is encountered, +(in which case the conversion state becomes undefined). +.ll +2n +.RE +. +.PP +Conversely, +if +.I dst +is a NULL pointer, +the +.I len +argument is ignored, +and conversions are performed until either +the multibyte equivalent of the NUL character, +or an invalid multibyte sequence is encountered, +but no wide characters are stored. +. +.PP +The sequence of bytes, +pointed to by +.IR *src , +is interpreted as a multibyte character sequence +in the codeset which is associated with the +.B \%LC_CTYPE +category of the active process locale. +. +.PP +If +.I ps +is specified as a NULL pointer, +.BR \%mbsrtowcs () +will track conversion state using an internal +.B \%mbstate_t +object reference, +which is private within the +.BR \%mbsrtowcs () +process address space; +at process \%start\(hyup, +this internal +.B \%mbstate_t +object is initialized to represent +the initial conversion state. +. +. 
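The two modes of operation described above, (counting with a NULL `dst`, then storing with a real buffer), combine naturally into a two-pass conversion. The following is a minimal sketch, not part of libmingwex; the helper name `mbs_to_wcs_dup` is hypothetical, and it assumes that the input starts in the initial conversion state.

```c
#include <stdlib.h>
#include <string.h>
#include <wchar.h>

/* Hypothetical helper: convert the NUL terminated multibyte string
 * at mbs into a newly allocated wide character string.  On success,
 * returns the string, (to be released by free()), and stores the
 * count of wchar_t values, excluding the terminator, in *count;
 * returns NULL on an invalid sequence, or on allocation failure.
 */
wchar_t *mbs_to_wcs_dup( const char *mbs, size_t *count )
{
  mbstate_t ps;
  const char *src = mbs;
  wchar_t *wcs;
  size_t n;

  /* Pass 1: dst is NULL, so len is ignored and nothing is stored. */
  memset( &ps, 0, sizeof ps );
  n = mbsrtowcs( NULL, &src, 0, &ps );
  if( n == (size_t)(-1) ) return NULL;

  if( (wcs = malloc( (n + 1) * sizeof *wcs )) == NULL ) return NULL;

  /* Pass 2: restart from the initial state, leaving room for NUL. */
  memset( &ps, 0, sizeof ps );
  src = mbs;
  n = mbsrtowcs( wcs, &src, n + 1, &ps );
  if( n == (size_t)(-1) ) { free( wcs ); return NULL; }

  if( count != NULL ) *count = n;
  return wcs;
}
```

Using a caller-owned `mbstate_t`, rather than passing NULL for `ps`, keeps the helper independent of the internal state object described above.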
+.SH RETURN VALUE +On successful conversion of the multibyte character +sequence indirectly pointed to by +.IR *src , +up to the wide character string length limit specified by +.IR len , +.BR \%mbsrtowcs () +updates +.IR *src , +by either: +.RS 2n +.ll -2n +.IP \(bu 2n +Replacing it with a NULL pointer, +if conversion is terminated by a NUL character, +before +.I len +wide characters have been evaluated. +. +.IP \(bu 2n +Incrementing it, +such that it points to the first multibyte character in the +.I *src +sequence, +which, +when converted, +would produce wide characters beyond the string length +limit specified by +.IR len . +.ll +2n +.RE +.PP +In either case, +.BR mbsrtowcs () +returns the actual number of +.B \%wchar_t +values which have been stored at +.IR dst , +(if +.I dst +is not a NULL pointer, +or which would have been stored, +otherwise). +. +. +.SH ERROR CONDITIONS +If, +at any stage of conversion of the multibyte sequence at +.IR \%*src , +and, +if +.I dst +is not a NULL pointer, +before +.I len +.B \%wchar_t +values have been evaluated, +any sequence within +.IR \%*src , +which does not represent a valid multibyte character, +is encountered, +then +.I \%errno +is set to +.BR \%EILSEQ , +and +.BR \%mbsrtowcs () +returns +.IR \%(size_t)(\-1) ; +the conversion state, +including the state of any +.B \%wchar_t +values already stored at +.IR \%*dst , +is undefined. +. +. +.SH STANDARDS CONFORMANCE +Except in respect of its provisions for handling of +.IR surrogate\ pairs , +and to the extent that it may be affected by limitations +of the underlying \%MS\(hyWindows API, +the +.I \%libmingwex +implementation of +.BR mbsrtowcs () +conforms generally to +.BR \%ISO\(hyC99 , +.BR \%POSIX.1\(hy2001 , +and +.BR \%POSIX.1\(hy2008 ; +(prior to \%mingwrt\-5.3, +and in those cases where calls may be delegated +to a Microsoft runtime DLL implementation, +this level of conformity may not be achieved). +. +. +.\"SH EXAMPLE +. +. 
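The way in which `*src` is updated makes it possible to convert a long string through a small fixed buffer. The sketch below is one possible arrangement, under stated assumptions: the helper name `count_in_chunks` and the `CHUNK` size are hypothetical, and `CHUNK` is kept at two or more so that a surrogate pair can always be stored.

```c
#include <string.h>
#include <wchar.h>

#define CHUNK 4   /* arbitrary; at least 2, to admit a surrogate pair */

/* Hypothetical helper: convert mbs piecewise, CHUNK values at a time,
 * and return the total count of wchar_t values produced, excluding
 * the terminating NUL, or (size_t)(-1) on an invalid sequence.
 */
size_t count_in_chunks( const char *mbs )
{
  wchar_t buf[CHUNK];
  mbstate_t ps;
  const char *src = mbs;
  size_t total = 0;

  memset( &ps, 0, sizeof ps );
  while( src != NULL )
  {
    size_t n = mbsrtowcs( buf, &src, CHUNK, &ps );
    if( n == (size_t)(-1) ) return n;     /* errno is EILSEQ */
    total += n;

    /* A NULL src means that the NUL terminator was converted; a zero
     * count with src still set would mean no progress, so stop then.
     */
    if( (n == 0) && (src != NULL) ) break;
  }
  return total;
}
```

In real use, each filled `buf` would be consumed before the next call; here only the count is accumulated.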
+.SH CAVEATS AND BUGS +Due to a documented limitation of Microsoft\(aqs +.BR \%setlocale () +function implementation, +it is not possible to directly select an active locale, +in which the codeset is represented by any multibyte +character sequence with an effective +.B \%MB_CUR_MAX +of more than two bytes. +Prior to +.IR \%mingwrt\(hy5.3 , +this limitation precludes the use of +.BR \%mbsrtowcs () +to interpret any codeset with +.B \%MB_CUR_MAX +greater than two bytes, +(such as +.BR \%UTF\(hy8 ). +From +.I \%mingwrt\(hy5.3 +onward, +the MinGW.org implementation of +.BR \%mbsrtowcs () +mitigates this limitation by assignment of the codeset +from the +.B \%LC_CTYPE +environment variable, +provided the system default has been previously activated +for the +.B \%LC_CTYPE +locale category; +e.g.\ execution of: +.PP +.RS 4n +.EX +#include <stdio.h> +#include <stdlib.h> +#include <locale.h> +#include <wchar.h> + +void print_conv( const char * ); + +int main() +{ + setlocale( LC_CTYPE, "" ); + putenv( "LC_CTYPE=en_GB.65001" ); + print_conv( "\exe6\exb0\exb4\exf0\ex9d\ex84\ex8b" ); + return 0; +} + +void print_conv( const char *mbs ) +{ + size_t len; + if( (len = 1 + mbsrtowcs( NULL, &mbs, 0, NULL )) > 0 ) + { + wchar_t wcs[len], *wc = wcs; + len = mbsrtowcs( wcs, &mbs, len, NULL ); + printf( "%u wide char%s: ", len, (len == 1) ? "" : "s" ); + while( len > 0 ) + { printf( "0x%04X%c", *wc++, (--len > 0) ? ':' : '\en' ); + } + } + else perror( "mbsrtowcs" ); +} +.EE +.RE +.PP +will convert the +.B \%UTF\(hy8 +encoded multibyte sequence, +\fC\%"\exe6\exb0\exb4\exf0\ex9d\ex84\ex8b"\fP, +(which represents the two Unicode code points, +\fC\%"\eu6c34"\fP and \fC\%"\eU0001d10b"\fP), +to its equivalent +.B \%wchar_t +sequence, +resulting in the three\(hyvalue output sequence: +.PP +.RS 4n +.EX +3 wide chars: 0x6C34:0xD834:0xDD0B +.EE +.RE +. 
+.PP +Note that, +in the preceding example, +although the input +.B \%UTF\(hy8 +sequence represents only +.I two +Unicode code points, +the output shows +.I \%three +distinct +.B \%wchar_t +values, +with the second code point being represented by the +.IR surrogate\ pair , +\fC\%"0xD834:0xDD0B"\fP. +This raises a potential issue, +which is consequent on Microsoft\(aqs choice of +.B \%UTF-16LE +as the underlying representation of the +.B \%wchar_t +data type: +normally, +when +.I dst +is not a NULL pointer, +the MinGW +.BR mbsrtowcs () +function will simply store a +.I surrogate\ pair +when necessary, +but in the particular case where doing so would cause the +.I low\ surrogate +to overrun the buffer length specified by the +.I len +argument, +then no part of the +.I surrogate\ pair +will be stored, +and +.BR mbsrtowcs () +will stop as if the buffer length limit has been reached, +at a count of one less than +.IR len . +This case may be distinguished from a short count due to +conversion of a NUL character, +(in which case +.I *src +will have been respecified as a NULL pointer), +by inspection of +.IR *src , +which will have been updated to point, +in this case, +to the start of that part of the multibyte sequence +which represents the +.IR surrogate\ pair . +. +.PP +A further issue, +also related to +.IR surrogate\ pairs , +may arise if the +.B \%mbstate_t +object passed via the +.I *ps +argument originates from a preceding +.BR mbrtowc (3) +call which has returned a +.IR high\ surrogate , +but the +.I low\ surrogate +has not been retrieved. +In this case, +the +.I low\ surrogate +is returned, +(and potentially orphaned), +as the first +.B \%wchar_t +value to be considered for storage at +.IR dst . 
+This may not be what you want, +but it is supported as an alternative to the method, +formally documented using +.BR mbrtowc (3), +for completion of a +.IR surrogate\ pair ; +for example: +.PP +.RS 4n +.EX +#define _ISOC99_SOURCE + +#include <stdio.h> +#include <stdlib.h> +#include <locale.h> +#include <limits.h> +#include <wchar.h> +#include <winnls.h> + +void print_conv( const char * ); + +int main() +{ + setlocale( LC_CTYPE, "" ); + putenv( "LC_CTYPE=en_GB.65001" ); + print_conv( "\eU0001d10b" ); + print_conv( "\eu6c34" ); + return 0; +} + +void print_conv( const char *mbs ) +{ + wchar_t wch; + mbstate_t ps = (mbstate_t)(0); + size_t n = mbrtowc( &wch, mbs, MB_LEN_MAX, &ps ); + if( (int)(n) > 0 ) + { + if( IS_HIGH_SURROGATE( wch ) ) + { + wchar_t wcl; + mbsrtowcs( &wcl, &mbs, 1, &ps ); + printf( "%u bytes -> 0x%04X:0x%04X\en", n, wch, wcl ); + } + else printf( "%u bytes -> 0x%04X\en", n, wch ); + } + else if( n == (size_t)(-1) ) perror( "mbrtowc" ); +} +.EE +.RE +.PP +is equivalent to the example given for +.I surrogate\ pair +completion using +.BR mbrtowc (3). +Regardless of the method used to complete +.IR surrogate\ pairs , +it is the caller\(aqs responsibility to ensure that the +.I high\ surrogate +and its complementary +.I low\ surrogate +remain correctly associated. +. +. +.SH SEE ALSO +.BR mbsinit (3), +and +.BR mbrtowc (3) +. +. +.SH AUTHOR +This manpage was written by \%Keith\ Marshall, +\%, +to document the +.BR \%mbsrtowcs () +function as it has been implemented for the MinGW.org Project. +It may be copied, modified and redistributed, +without restriction of copyright, +provided this acknowledgement of contribution by +the original author remains in place. +. +.\" EOF diff --git a/mingwrt/man/wcrtomb.3.man b/mingwrt/man/wcrtomb.3.man new file mode 100644 index 0000000..27bc620 --- /dev/null +++ b/mingwrt/man/wcrtomb.3.man @@ -0,0 +1,493 @@ +.\" vim: ft=nroff +.TH %PAGEREF% MinGW "MinGW Programmer's Reference Manual" +. 
+.SH NAME +.B \%wcrtomb +\- convert a wide character to a multibyte sequence +. +. 
+.SH SYNOPSIS +.B #include +.RB < wchar.h > +.PP +.B size_t wcrtomb( char +.BI * s , +.B wchar_t +.IB wc , +.B mbstate_t +.BI * ps +.B ); +. +.IP \& -4n +Feature Test Macro Requirements for libmingwex: +.PP +.BR \%__MSVCRT_VERSION__ : +since \%mingwrt\(hy5.3, +if this feature test macro is +.IR defined , +with a value of +.I at least +.IR \%0x0800 , +(corresponding to the symbolic constant, +.BR \%__MSVCR80_DLL , +and thus declaring intent to link with \%MSVCR80.DLL, +or any later version of \%Microsoft\(aqs \%non\(hyfree runtime library, +instead of with \%MSVCRT.DLL), +calls to +.BR \%wcrtomb () +will be directed to the implementation thereof, +within \%Microsoft\(aqs runtime DLL. +. +.PP +.BR \%_ISOC99_SOURCE , +.BR \%_ISOC11_SOURCE : +since \%mingwrt\(hy5.3.1, +when linking with \%MSVCRT.DLL, +or when +.B \%__MSVCRT_VERSION__ +is either +.IR undefined , +or is +.I defined +with any value which is +.I less than +.IR \%0x0800 , +(thus denying intent to link with \%MSVCR80.DLL, +or any later \%non\(hyfree version of Microsoft\(aqs runtime library), +.I explicitly +defining either of these feature test macros +will cause any call to +.BR \%wcrtomb () +to be directed to the +.I \%libmingwex +implementation; +if neither macro is defined, +calls to +.BR \%wcrtomb () +will be directed to Microsoft\(aqs runtime implementation, +if it is available, +otherwise falling back to the +.I \%libmingwex +implementation. +. +.PP +Prior to \%mingwrt\(hy5.3, +none of the above feature test macros have any effect on +.BR \%wcrtomb (); +all calls will be directed to the +.I \%libmingwex +implementation. +. +. 
+.SH DESCRIPTION +The +.BR \%wcrtomb () +function determines the number of bytes which are required, +starting from the conversion state represented by the +.B \%mbstate_t +object at +.IR *ps , +to accommodate the multibyte character sequence, +in the codeset associated with the +.B \%LC_CTYPE +category of the active process locale, +which represents the completed conversion of +the wide character specified by +.IR wc . +. +.PP +In the special case, +when +.I s +is a NULL pointer, +the +.I wc +argument is ignored, +and the call is evaluated as if it had been invoked as +.PP +.RS 4n +.EX +wcrtomb( buf, L'\e0', ps ) +.EE +.RE +.PP +returning the effect of conversion of the NUL wide character, +as a completion of any intermediate conversion state specified in +.IR *ps , +but without storing the converted multibyte sequence; +(in this special case, +the +.B \%ISO\(hyC99 +standard specifies that +.I buf +should be an internal buffer, +but since such a buffer becomes effectively inaccessible, +storage of any converted multibyte sequence is unnecessary). +. +.PP +Conversely, +in the normal case, +when +.I s +is not a NULL pointer, +the +.BR \%wcrtomb () +function converts the wide character, +represented by +.IR wc , +to the corresponding multibyte character sequence, +which is stored in the byte array starting at +.IR *s , +and the function return value is set to +the number of bytes stored. +. +. +.SH RETURN VALUE +When conversion is successful, +regardless of whether the resultant multibyte sequence is stored, +or not, +the +.BR wcrtomb () +function returns the number of bytes which are, +or which would be, +stored at +.IR *s . +. +.PP +If the result of conversion represents a completed multibyte sequence, +the conversion state, +represented by +.IR *ps , +is updated to represent the +.I initial +.IR state . 
+Conversely, +if the result of conversion is equivalent to the conversion of a +.I high +.IR surrogate , +nothing is stored, +the return value is set to +.IR zero , +and the conversion state is updated to represent a pending +.I surrogate pair +completion. +. +. +.SH ERROR CONDITIONS +If the wide character, +passed as +.IR wc , +either cannot be converted to a valid multibyte sequence, +or does not complete a pending +.I surrogate pair +which can be represented as a valid multibyte sequence, +in the codeset of the active +.B \%LC_CTYPE +locale category, +.I \%errno +is set to +.BR \%EILSEQ , +the +.BR wcrtomb () +function returns +.IR (size_t)(\-1) , +and the conversion state is unspecified. +. +. +.SH STANDARDS CONFORMANCE +Except in respect of its extended provision for handling of +.IR surrogate\ pairs , +and to the extent that it may be affected by limitations +of the underlying \%MS\(hyWindows API, +the +.I \%libmingwex +implementation of +.BR \%wcrtomb () +conforms generally to +.BR \%ISO\(hyC99 , +.BR \%POSIX.1\(hy2001 , +and +.BR \%POSIX.1\(hy2008 ; +(prior to \%mingwrt\-5.3, +and in those cases where calls may be delegated +to a Microsoft runtime DLL implementation, +this level of conformity may not be achieved). +. +. +.\"SH EXAMPLE +. +. +.SH CAVEATS AND BUGS +Due to a documented limitation of Microsoft\(aqs +.BR \%setlocale () +function implementation, +it is not possible to directly select an active locale, +in which the codeset is represented by any multibyte +character sequence with an effective +.B \%MB_CUR_MAX +of more than two bytes. +Prior to \%mingwrt\(hy5.3, +this limitation precludes the use of +.BR \%wcrtomb () +to interpret any codeset with +.B \%MB_CUR_MAX +greater than two bytes, +(such as +.BR \%UTF\(hy8 ). 
+From \%mingwrt\(hy5.3 onward, +the MinGW.org implementation of +.BR \%wcrtomb () +mitigates this limitation by assignment of the codeset +from the +.B \%LC_CTYPE +environment variable, +provided the system default has been previously activated +for the +.B \%LC_CTYPE +locale category; +e.g.\ execution of: +.PP +.RS 4n +.EX +#define _ISOC99_SOURCE + +#include <stdio.h> +#include <stdlib.h> +#include <locale.h> +#include <limits.h> +#include <wchar.h> + +void print_conv( const wchar_t * ); + +int main() +{ + setlocale( LC_CTYPE, "" ); + putenv( "LC_CTYPE=en_GB.65001" ); + print_conv( L"\eu6c34\eU0001d10b" ); + return 0; +} + +void print_conv( const wchar_t *wcs ) +{ + wchar_t wch; + while( (wch = *wcs++) != L'\e0' ) + { + char mbs[MB_LEN_MAX]; + mbstate_t ps = (mbstate_t)(0); + size_t n = wcrtomb( mbs, wch, &ps ); + + if( (int)(n) > 0 ) + { + unsigned char *p = (unsigned char *)(mbs); + printf( "Single wide character: 0x%04X \-\-> %u byte%s", + wch, n, (n == 1) ? ": " : "s: " + ); + while( n > 0 ) + printf( "0x%02X%c", *p++, (\-\-n == 0) ? '\en' : ':' ); + } + else if( n == (size_t)(\-1) ) perror( "wcrtomb" ); + } +} +.EE +.RE +.PP +will successfully convert the \fCL"\eu6c34"\fP wide character to its +.B \%UTF\(hy8 +equivalent, +resulting in the output: +.PP +.RS 4n +.EX +Single wide character: 0x6C34 \-\-> 3 bytes: 0xE6:0xB0:0xB4 +.EE +.RE +.PP +However, +when it then progresses to the \fCL"\eU0001d10b"\fP wide character, +(which +.I should +be represented by a valid +.B \%UTF\(hy16LE +.I surrogate +.IR pair ), +it fails with the diagnostic: +.PP +.RS 4n +.EX +wcrtomb: Invalid or incomplete multibyte or wide character +.EE +.RE +. +.PP +This (possibly unexpected) failure is an unfortunate consequence +of Microsoft\(aqs choice of +.B \%UTF\(hy16LE +as the underlying representation of the +.B \%wchar_t +data type; +this choice makes it impossible for +.I any +\%MS\(hyWindows implementation of +.BR \%wcrtomb () +to be fully +.B \%ISO\(hyC99 +compliant. 
+To mitigate this non\(hycompliance, +the MinGW implementation of +.BR \%wcrtomb () +incorporates the following non\(hystandard capabilities: +.RS 2n +.ll -2n +.IP \(bu 2n +When the +.B \%mbstate_t +argument refers to the +.I initial conversion +.IR state , +and the +.B \%wchar_t +argument represents a +.I high +.IR surrogate , +then nothing is stored in the conversion buffer, +the +.B \%mbstate_t +reference is updated to indicate pending completion of the +.IR surrogate , +and the function returns an effective conversion count of +.I zero +bytes. +. +.IP \(bu 2n +When the +.B \%mbstate_t +argument refers to a pending completion of a +.I surrogate +.IR pair , +and the +.B \%wchar_t +argument represents a +.I low +.IR surrogate , +then the deferred +.I high surrogate +is combined with the +.I low surrogate +argument, +and the two are converted as a pair; +the resultant conversion is stored in the conversion buffer, +the +.B \%mbstate_t +reference is reset to the +.I initial conversion +.IR state , +and the function returns the number of bytes +which were stored in the conversion buffer. +.ll +2n +.RE +. +.PP +These capabilities of MinGW\(aqs +.BR \%wcrtomb () +are certainly non\(hystandard; +nonetheless, +they are required to circumvent non\(hyconformity, +which is imposed by an unfortunate Microsoft design choice, +and it is incumbent upon the caller of +.BR \%wcrtomb (), +on the \%MS\(hyWindows platform, +to make use of them. 
+The preceding example clearly illustrates how strictly +.B \%ISO\(hyC99 +conforming usage will yield incorrect behaviour; +the following illustrates how that example may be adapted, +by incorporation of the above non\(hystandard features, +to achieve correct behaviour: +.PP +.RS 4n +.EX +#define _ISOC99_SOURCE + +#include <stdio.h> +#include <stdlib.h> +#include <locale.h> +#include <limits.h> +#include <wchar.h> +#include <winnls.h> + +void print_conv( const wchar_t * ); + +int main() +{ + setlocale( LC_CTYPE, "" ); + putenv( "LC_CTYPE=en_GB.65001" ); + print_conv( L"\eu6c34\eU0001d10b" ); + return 0; +} + +#define DESC(FMT) FMT "0x%1$04X --> %2$u byte%3$s" + +void print_conv( const wchar_t *wcs ) +{ + while( *wcs != L'\e0' ) + { + wchar_t wch = *wcs; + char mbs[MB_LEN_MAX]; + mbstate_t ps = (mbstate_t)(0); + const char *fmt = DESC( "Single wide character: " ); + size_t n = wcrtomb( mbs, wch, &ps ); + + if( (n == (size_t)(0)) && IS_HIGH_SURROGATE( wch ) ) + { + if( (int)(n = wcrtomb( mbs, wcs[1], &ps )) > 0 ) + { + fmt = DESC( "Surrogate pair: 0x%4$04X:" ); + wcs++; + } + } + if( (int)(n) > 0 ) + { + unsigned char *p = (unsigned char *)(mbs); + printf( fmt, *wcs, n, (n == 1) ? ": " : "s: ", wch ); + while( n > 0 ) + printf( "0x%02X%c", *p++, (\-\-n == 0) ? '\en' : ':' ); + } + else if( n == (size_t)(\-1) ) perror( "wcrtomb" ); + if( *wcs != L'\e0' ) ++wcs; + } +} +.EE +.RE +.PP +It may be observed that, +on execution of this modified version of the example, +both the \fCL"\eu6c34"\fP, +and the \fCL"\eU0001d10b"\fP code points are now correctly evaluated, +producing the expected output: +.PP +.RS 2n +.EX +Single wide character: 0x6C34 --> 3 bytes: 0xE6:0xB0:0xB4 +Surrogate pair: 0xD834:0xDD0B --> 4 bytes: 0xF0:0x9D:0x84:0x8B +.EE +.RE +. +. +.SH SEE ALSO +.BR mbsinit (3), +and +.BR wcsrtombs (3) +. +. +.SH AUTHOR +This manpage was written by \%Keith\ Marshall, +\%, +to document the +.BR \%wcrtomb () +function as it has been implemented for the MinGW.org Project. 
+It may be copied, modified and redistributed, +without restriction of copyright, +provided this acknowledgement of contribution by +the original author remains in place. +. +.\" EOF diff --git a/mingwrt/man/wcsrtombs.3.man b/mingwrt/man/wcsrtombs.3.man new file mode 100644 index 0000000..74334b0 --- /dev/null +++ b/mingwrt/man/wcsrtombs.3.man @@ -0,0 +1,361 @@ +.\" vim: ft=nroff +.TH %PAGEREF% MinGW "MinGW Programmer's Reference Manual" +. +.SH NAME +.B \%wcsrtombs +\- convert a wide character string to a multibyte string +. +. +.SH SYNOPSIS +.B #include +.RB < wchar.h > +.PP +.B size_t wcsrtombs( char +.BI * dst , +.B const wchar_t +.BI ** src , +.B size_t +.IB len , +.B mbstate_t +.BI * ps +.B ); +. +.IP \& -4n +Feature Test Macro Requirements for libmingwex: +.PP +.BR \%__MSVCRT_VERSION__ : +since \%mingwrt\(hy5.3, +if this feature test macro is +.IR defined , +with a value of +.I at least +.IR \%0x0800 , +(corresponding to the symbolic constant, +.BR \%__MSVCR80_DLL , +and thus declaring intent to link with \%MSVCR80.DLL, +or any later version of \%Microsoft\(aqs \%non\(hyfree runtime library, +instead of with \%MSVCRT.DLL), +calls to +.BR \%wcsrtombs () +will be directed to the implementation thereof, +within \%Microsoft\(aqs runtime DLL. +. 
+.PP +.BR \%_ISOC99_SOURCE , +.BR \%_ISOC11_SOURCE : +since \%mingwrt\(hy5.3.1, +when linking with \%MSVCRT.DLL, +or when +.B \%__MSVCRT_VERSION__ +is either +.IR undefined , +or is +.I defined +with any value which is +.I less than +.IR \%0x0800 , +(thus denying intent to link with \%MSVCR80.DLL, +or any later \%non\(hyfree version of Microsoft\(aqs runtime library), +.I explicitly +defining either of these feature test macros +will cause any call to +.BR \%wcsrtombs () +to be directed to the +.I \%libmingwex +implementation; +if neither macro is defined, +calls to +.BR \%wcsrtombs () +will be directed to Microsoft\(aqs runtime implementation, +if it is available, +otherwise falling back to the +.I \%libmingwex +implementation. +. 
+.PP +Prior to \%mingwrt\(hy5.3, +none of the above feature test macros have any effect on +.BR \%wcsrtombs (); +all calls will be directed to the +.I \%libmingwex +implementation. +. +. +.SH DESCRIPTION +The +.BR \%wcsrtombs () +function converts a sequence of wide characters from +the array which is indirectly pointed to by +.IR src , +to a corresponding multibyte character sequence in +the codeset which is associated with the +.B \%LC_CTYPE +category of the active process locale, +beginning in the conversion state which is represented by the +.B \%mbstate_t +object at +.IR *ps ; +each wide character is converted, +as if by calling the +.BR \%wcrtomb (3) +function. +. +.PP +Conversion continues until: +.RS 2n +.ll -2n +.IP \(bu 2n +A wide character which is invalid in its own context is encountered. +. +.IP \(bu 2n +A wide character which does not have a valid representation within +the target multibyte codeset is encountered. +. +.IP \(bu 2n +The NUL wide character is encountered, +while in the initial conversion state. +. +.IP \(bu 2n +The +.I dst +argument is not a NULL pointer, +and a wide character is encountered for which +the converted length would cause the aggregate length +of the converted multibyte character string to exceed +the limit specified by the +.I len +argument. +.ll +2n +.RE +. +.PP +If +.I dst +is +.I not +a NULL pointer, +the multibyte character string resulting from successful conversion, +up to a maximum of +.I len +bytes, +is stored in the multibyte array starting at +.IR dst . +If the conversion is NUL terminated, +the wide character string reference pointed to by +.I src +is replaced by a NULL pointer; +otherwise it is updated to point to the address immediately +following that of the last wide character converted. +. 
+.PP +If +.I dst +is a NULL pointer, +the aggregate count of bytes required +to represent the conversion is accumulated, +until any one of the preceding termination conditions is encountered; +the +.I len +argument, +and the termination condition which is dependent upon it, +is ignored, +and the conversion is not stored. +. +.PP +If +.I ps +is a NULL pointer, +the +.BR \%wcsrtombs () +function uses a static internal +.B \%mbstate_t +object, +which is known only to, +and visible only within the scope of execution of, +the +.BR \%wcsrtombs () +function itself. +. +.PP +Following a successful conversion, +the +.B \%mbstate_t +object at +.IR *ps , +or the internal +.B \%mbstate_t +object if appropriate, +is reset to the initial conversion state. +. +. +.SH RETURN VALUE +When conversion is successful, +and +.I dst +is +.I not +a NULL pointer, +the +.BR \%wcsrtombs () +function returns the number of bytes stored at +.IR dst , +to represent the resulting multibyte character sequence, +.I excluding +the terminating NUL, +(if any). +. +.PP +Conversely, +when conversion is successful, +but +.I dst +is +a NULL pointer, +the +.BR \%wcsrtombs () +function returns the number of bytes which would be required +to store the entire multibyte character string resulting from +the successful conversion, +.I excluding +the terminating NUL. +. +. +.SH ERROR CONDITIONS +If conversion is unsuccessful, +.I \%errno +is set to +.BR \%EILSEQ , +the +.BR wcsrtombs () +function returns +.IR (size_t)(\-1) , +and the conversion state is unspecified. +. +. 
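Since the NULL `dst` form reports the byte count which a full conversion would need, a two-pass pattern may be used to size the output buffer exactly. This is a sketch only; the helper name `wcs_to_mbs_dup` is hypothetical, and a caller-owned `mbstate_t`, starting in the initial state, is assumed.

```c
#include <stdlib.h>
#include <string.h>
#include <wchar.h>

/* Hypothetical helper: convert the NUL terminated wide string at wcs
 * into a newly allocated multibyte string.  On success, returns the
 * string, (to be released by free()), and stores its length in bytes,
 * excluding the terminating NUL, in *nbytes; returns NULL if any wide
 * character cannot be converted, or if allocation fails.
 */
char *wcs_to_mbs_dup( const wchar_t *wcs, size_t *nbytes )
{
  mbstate_t ps;
  const wchar_t *src = wcs;
  char *mbs;
  size_t n;

  /* Pass 1: dst is NULL, so only the required length is computed. */
  memset( &ps, 0, sizeof ps );
  n = wcsrtombs( NULL, &src, 0, &ps );
  if( n == (size_t)(-1) ) return NULL;

  if( (mbs = malloc( n + 1 )) == NULL ) return NULL;

  /* Pass 2: convert for real, with space for the terminating NUL. */
  memset( &ps, 0, sizeof ps );
  src = wcs;
  n = wcsrtombs( mbs, &src, n + 1, &ps );
  if( n == (size_t)(-1) ) { free( mbs ); return NULL; }

  if( nbytes != NULL ) *nbytes = n;
  return mbs;
}
```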
+.SH STANDARDS CONFORMANCE +Except in respect of its extended provision for handling of +.IR surrogate\ pairs , +and to the extent that it may be affected by limitations +of the underlying \%MS\(hyWindows API, +the +.I \%libmingwex +implementation of +.BR \%wcsrtombs () +conforms generally to +.BR \%ISO\(hyC99 , +.BR \%POSIX.1\(hy2001 , +and +.BR \%POSIX.1\(hy2008 ; +(prior to \%mingwrt\-5.3, +and in those cases where calls may be delegated +to a Microsoft runtime DLL implementation, +this level of conformity may not be achieved). +. +. +.\"SH EXAMPLE +. +. +.SH CAVEATS AND BUGS +Due to a documented limitation of Microsoft\(aqs +.BR \%setlocale () +function implementation, +it is not possible to directly select an active locale, +in which the codeset is represented by any multibyte +character sequence with an effective +.B \%MB_CUR_MAX +of more than two bytes. +Prior to \%mingwrt\(hy5.3, +this limitation precludes the use of +.BR \%wcsrtombs () +to convert to any codeset with +.B \%MB_CUR_MAX +greater than two bytes, +(such as +.BR \%UTF\(hy8 ). +From \%mingwrt\(hy5.3 onward, +the MinGW.org implementation of +.BR \%wcsrtombs () +mitigates this limitation by assignment of the codeset +from the +.B \%LC_CTYPE +environment variable, +provided the system default has been previously activated +for the +.B \%LC_CTYPE +locale category; +e.g.\ execution of: +.PP +.RS 4n +.EX +#define _ISOC99_SOURCE + +#include <stdio.h> +#include <stdlib.h> +#include <locale.h> +#include <wchar.h> + +void print_conv( const wchar_t * ); + +int main() +{ + setlocale( LC_CTYPE, "" ); + putenv( "LC_CTYPE=en_GB.65001" ); + print_conv( L"\eu6c34\eU0001d10b" ); + return 0; +} + +void print_conv( const wchar_t *wcs ) +{ + size_t len; + if( (len = 1 + wcsrtombs( NULL, &wcs, 0, NULL )) > 0 ) + { + const wchar_t *wc = wcs; + size_t n = 1 + wcslen( wcs ); + unsigned char mbs[len], *mb = mbs; + printf( "UTF-16: %u value%s: ", n, (n == 1) ? "" : "s" ); + do { printf( "0x%04X%c", *wc, (*wc == L'\e0') ? 
'\en' : ':' ); + } while( *wc++ != L'\e0' ); + printf( "UTF-8: %u byte%s: ", + 1 + wcsrtombs( (char *)(mbs), &wcs, len, NULL ), + (len == 1) ? "" : "s" + ); + do { printf( "0x%02X%c", *mb, (*mb == '\e0') ? '\en' : ':' ); + } while( *mb++ != '\e0' ); + } + else perror( "wcsrtombs" ); +} +.EE +.RE +.PP +will select +.B \%UTF\(hy8 +as the target codeset, +then convert the \fC\%L"\eu6c34\eU0001d10b"\fP +wide character string, +resulting in the output: +.PP +.RS 4n +.EX +UTF-16: 4 values: 0x6C34:0xD834:0xDD0B:0x0000 +UTF-8: 8 bytes: 0xE6:0xB0:0xB4:0xF0:0x9D:0x84:0x8B:0x00 +.EE +.RE +. +. +.SH SEE ALSO +.BR mbsinit (3), +and +.BR wcrtomb (3) +. +. +.SH AUTHOR +This manpage was written by \%Keith\ Marshall, +\%, +to document the +.BR \%wcsrtombs () +function as it has been implemented for the MinGW.org Project. +It may be copied, modified and redistributed, +without restriction of copyright, +provided this acknowledgement of contribution by +the original author remains in place. +. +.\" EOF diff --git a/mingwrt/man/wctob.3.man b/mingwrt/man/wctob.3.man new file mode 100644 index 0000000..ef396b6 --- /dev/null +++ b/mingwrt/man/wctob.3.man @@ -0,0 +1,174 @@ +.\" vim: ft=nroff +.TH %PAGEREF% MinGW "MinGW Programmer's Reference Manual" +. +.SH NAME +.B \%wctob +\- convert a wide character to a single byte +. +. +.SH SYNOPSIS +.B #include +.RB < stdio.h > +.br +.B #include +.RB < wchar.h > +.PP +.B int wctob( wint_t +.I c +.B ); +. 
+.IP \& -4n +Feature Test Macro Requirements for libmingwex: +.PP +.BR \%__MSVCRT_VERSION__ : +since \%mingwrt\(hy5.3, +if this feature test macro is +.IR defined , +with a value of +.I at least +.IR \%0x0800 , +(corresponding to the symbolic constant, +.BR \%__MSVCR80_DLL , +and thus declaring intent to link with \%MSVCR80.DLL, +or any later version of \%Microsoft\(aqs \%non\(hyfree runtime library, +instead of with \%MSVCRT.DLL), +calls to +.BR \%wctob () +will be directed to the implementation thereof, +within \%Microsoft\(aqs runtime DLL. +. 
+.PP +.BR \%_ISOC99_SOURCE , +.BR \%_ISOC11_SOURCE : +since \%mingwrt\(hy5.3.1, +when linking with \%MSVCRT.DLL, +or when +.B \%__MSVCRT_VERSION__ +is either +.IR undefined , +or is +.I defined +with any value which is +.I less than +.IR \%0x0800 , +(thus denying intent to link with \%MSVCR80.DLL, +or any later \%non\(hyfree version of Microsoft\(aqs runtime library), +.I explicitly +defining either of these feature test macros +will cause any call to +.BR \%wctob () +to be directed to the +.I \%libmingwex +implementation; +if neither macro is defined, +calls to +.BR \%wctob () +will be directed to Microsoft\(aqs runtime implementation, +if it is available, +otherwise falling back to the +.I \%libmingwex +implementation. +. +.PP +Prior to \%mingwrt\(hy5.3, +none of the above feature test macros have any effect on +.BR \%wctob (); +all calls will be directed to the +.I \%libmingwex +implementation. +. +. +.SH DESCRIPTION +The +.BR \%wctob () +function converts the wide character, +represented by +.IR c , +to a multibyte character sequence +in the codeset which is associated with the +.B \%LC_CTYPE +category of the active process locale. +Provided the entire conversion can be accommodated +within a single byte, +the value of that byte, +interpreted as an +.IR unsigned\ char , +and cast to an +.IR int , +is returned; +otherwise, +.B EOF +is returned. +. +. +.SH RETURN VALUE +If the conversion of +.IR c , +to a multibyte character sequence, +in its entirety, +occupies exactly +.I one +byte, +the value of that byte, +interpreted as an +.IR unsigned\ char , +and cast to an +.IR int , +is returned; +otherwise, +.B EOF +is returned. +. +. +.SH ERROR CONDITIONS +No error conditions are defined. +. +. 
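The single byte rule described above may be compared with the equivalent test expressed through `wcrtomb()`, which this page recommends in preference to `wctob()`. The sketch below is illustrative; both helper names are hypothetical, and the `wcrtomb()` variant assumes a freshly zeroed `mbstate_t`, representing the initial conversion state.

```c
#include <limits.h>
#include <stdio.h>
#include <string.h>
#include <wchar.h>

/* Hypothetical helper: the byte value of wc, if its multibyte
 * representation occupies exactly one byte; otherwise EOF.
 */
int single_byte_via_wctob( wchar_t wc )
{
  return wctob( (wint_t)(wc) );
}

/* Hypothetical helper: the same question, asked of wcrtomb(); a
 * conversion which yields exactly one byte gives that byte, as an
 * unsigned char cast to int; any other outcome gives EOF.
 */
int single_byte_via_wcrtomb( wchar_t wc )
{
  char buf[MB_LEN_MAX];
  mbstate_t ps;

  memset( &ps, 0, sizeof ps );
  return (wcrtomb( buf, wc, &ps ) == 1)
    ? (int)((unsigned char)(buf[0])) : EOF;
}
```

For characters such as `L'A'`, both helpers should agree; the `wcrtomb()` form has the advantage of also exposing the converted bytes when more than one is produced.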
+.SH STANDARDS CONFORMANCE +Except to the extent that it may be affected by limitations +of the underlying \%MS\(hyWindows API, +the +.I \%libmingwex +implementation of +.BR \%wctob () +conforms generally to +.BR \%ISO\(hyC99 , +.BR \%POSIX.1\(hy2001 , +and +.BR \%POSIX.1\(hy2008 ; +(prior to \%mingwrt\(hy5.3, +and in those cases where calls may be delegated +to a Microsoft runtime DLL implementation, +this level of conformity may not be achieved). +. +. +.\"SH EXAMPLE +. +. +.SH CAVEATS AND BUGS +Use of the +.BR \%wctob () +function is +.IR discouraged ; +it serves no purpose which may not be better served by the +.BR \%wcrtomb (3) +function, +which should be considered as a preferred alternative. +. +. +.SH SEE ALSO +.BR wcrtomb (3) +. +. +.SH AUTHOR +This manpage was written by \%Keith\ Marshall, +\%, +to document the +.BR \%wctob () +function as it has been implemented for the MinGW.org Project. +It may be copied, modified and redistributed, +without restriction of copyright, +provided this acknowledgement of contribution by +the original author remains in place. +. +.\" EOF