
UTF-8 to Shift_JIS in C

Name: Anonymous 2010-09-01 19:33

Hello /prog/,

Since most of you are EXPERT C-PROGRAMMERS I thought some of you might have an idea as how to best implement this.
In short, I need to convert (wchar_t) UTF-8 input to Shift_JIS, but I am a very incompetent C programmer.

My current implementation converts the entire map located here: http://unicode.org/Public/MAPPINGS/OBSOLETE/EASTASIA/JIS/SHIFTJIS.TXT to a huge string of the form ("u%s%s", unicode, sjiscode) and then searches it for the hex form of the unicode input.
This is obviously very slow and I ask of you if you could be so kind as to tell me how one is supposed to go about doing this sort of thing in C, as I currently am in a learning process.
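One common alternative to searching a huge string is a table sorted by code point plus a binary search. The following is only a minimal sketch under stated assumptions: the struct and function names are hypothetical, the table holds just five rows for illustration, and a real table would be generated from SHIFTJIS.TXT.

```c
#include <stdint.h>
#include <stdlib.h>

/* One row of the mapping: Unicode code point -> Shift_JIS code.
   (Hypothetical type name, for illustration only.) */
struct sjis_map {
    uint16_t unicode;
    uint16_t sjis;
};

/* Tiny excerpt, sorted by the unicode field so bsearch() can be used.
   A real table would be generated from SHIFTJIS.TXT. */
static const struct sjis_map table[] = {
    { 0x3000, 0x8140 }, /* ideographic space */
    { 0x3042, 0x82A0 }, /* hiragana A */
    { 0x30A2, 0x8341 }, /* katakana A */
    { 0x65E5, 0x93FA }, /* kanji "day/sun" */
    { 0x672C, 0x967B }  /* kanji "book/origin" */
};

/* bsearch() comparator: key is a code point, elem is a table row. */
static int cmp_entry(const void *key, const void *elem)
{
    uint16_t k = *(const uint16_t *)key;
    uint16_t u = ((const struct sjis_map *)elem)->unicode;
    return (k > u) - (k < u);
}

/* Returns the Shift_JIS code for cp, or 0 if cp is not in the table. */
uint16_t unicode_to_sjis(uint16_t cp)
{
    const struct sjis_map *hit = bsearch(&cp, table,
                                         sizeof table / sizeof table[0],
                                         sizeof table[0], cmp_entry);
    return hit ? hit->sjis : 0;
}
```

With the full mapping this costs about a dozen comparisons per character instead of a scan over the whole string.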

Thank you for your attention.

Name: Anonymous 2010-09-01 19:44

i can do that in javap

Name: Anonymous 2010-09-01 19:49

>>1
On Windows/MSVC I do it like this:
1) Using libc (with MS extensions, but I think some might work with gcc, check appropriate documentation):
 a) Set the locale you'll be using for multibyte strings:
    setlocale(LC_ALL, "lang_JPN.932"); // from SJIS
    For more details see: http://msdn.microsoft.com/en-us/library/hzz3tw78.aspx
 b) Use the appropriate functions, which operate on either wide-character strings or multibyte strings (use the right prefixes). In this case, conversion is done using mbstowcs (multibyte->widechar) and wcstombs (widechar->multibyte). When dealing with more varied encodings, it's not uncommon to do MB->UTF16->MB. UTF16 or "widechar" is common because of its simple in-memory representation: a ushort (WORD/16-bit value) for each character in the Basic Multilingual Plane (anything beyond that takes a surrogate pair). Multiple setlocale calls may be needed.
2) Using WinAPI. Less portable than 1 (which may be supported outside of MSVC), but fully supported on Windows. I find this nicer than the libc variant in that it doesn't depend on global state, and you can just specify the code pages as parameters:
MultiByteToWideChar and WideCharToMultiByte. It's the same *->UTF16->* conversion, except simpler. This is my conversion method of choice for non-portable code. I've done UTF8->SJIS before using this and have some code lying around for it. I've only used the libc method for sjis->utf16->sjis, so I don't know for sure if utf8 is properly supported, but I believe it is; I just never tried using it before.
3) Portably using iconv. It's the more "bloated" method, but if you don't want to tie your tool/application to an OS, it's a common choice. There are also other libraries around for doing it.
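Option 3 could look roughly like the sketch below. It is not taken from anyone's code in this thread: the function name utf8_to_sjis is made up, and the encoding name "SHIFT_JIS" is an assumption, since the names iconv_open accepts vary between implementations (run iconv -l to check yours).

```c
#include <iconv.h>
#include <stddef.h>
#include <string.h>

/* Converts the NUL-terminated UTF-8 string `in` to Shift_JIS in `out`.
   Returns the number of bytes written (not counting the NUL), or
   (size_t)-1 on failure. Hypothetical helper, for illustration only. */
size_t utf8_to_sjis(const char *in, char *out, size_t outsize)
{
    iconv_t cd;
    char *src = (char *)in;   /* iconv() takes char **, but only reads the input */
    char *dst = out;
    size_t inleft = strlen(in);
    size_t outleft;
    size_t rc;

    if (outsize == 0)
        return (size_t)-1;
    outleft = outsize - 1;    /* keep room for the terminating NUL */

    /* Argument order is (tocode, fromcode). The name "SHIFT_JIS" is an
       assumption; some systems spell it "SJIS" or "CP932". */
    cd = iconv_open("SHIFT_JIS", "UTF-8");
    if (cd == (iconv_t)-1)
        return (size_t)-1;

    rc = iconv(cd, &src, &inleft, &dst, &outleft);
    iconv_close(cd);
    if (rc == (size_t)-1)     /* invalid input or output buffer too small */
        return (size_t)-1;

    *dst = '\0';
    return (size_t)(dst - out);
}
```

On glibc this needs no extra library; on some other systems you may have to link with -liconv.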

Name: Anonymous 2010-09-01 19:54

>>3

Thank you for your apt and prompt reply, I shall attempt this and report back.

Name: Anonymous 2010-09-01 20:38

man iconv.h, come on.
If you're implementing tripcodes in C, Xarn already did that: http://github.com/Cairnarvon/triptools/blob/master/tripcode.c

Name: Anonymous 2010-09-01 21:05

>>5

Well thank you, that is exactly what I was attempting to implement. I'm currently trying to learn C and I reckoned this would be a good project to do, the goal would be to make a dictionary tripper.
Equipped with this excellent specimen of XARNcode I shall continue my learning process.

Thank you /prog/olytes!

Name: Anonymous 2010-09-02 5:17

>>6
Why are you talking like a retard?

Name: Anonymous 2010-09-02 6:07

>>1
Protip: wchar_t is never 'encoded' in UTF-8. In fact, you'd have to be fucking retarded to waste up to three bytes for each character. I believe you have no idea what Unicode is about.

Name: Anonymous 2010-09-02 6:53

>>7
Words hurt :-(.

>>8
If your assumption is that I had no idea what I was doing you'd be correct.

Name: Anonymous 2010-09-02 14:36

>>9
wchar_t is meant to be a unicode code point, but it is not portable; it is 16-bit on Windows and 32-bit on Unix. Don't use it.

The expression "(wchar_t) UTF-8" makes no sense. UTF-8 and Shift-JIS are both variable-width byte encodings. They use chars, not wchar_t, and a character may be represented by several chars.

Just read this: http://www.joelonsoftware.com/articles/Unicode.html

Name: Anonymous 2010-09-02 14:48

>>8
you mean four bytes

Name: Anonymous 2010-09-02 14:54

>>8
but utf8 is all about wasting UP TO 4 bytes for each character

Name: Anonymous 2010-09-02 15:38

>>12
6 actually. A character can take up to 6 bytes in UTF-8. This is because for extended characters, only 6 bits out of each continuation byte are available, so the longest form carries 1 bit in the first byte plus 5*6=30 bits in the other five, 31 bits in total. The top two bits of the first byte are 11, and for the rest of the bytes in the character they are 10.

The reason for this is so that you can resynchronize a broken UTF-8 stream. If you lose some data, you just wait until a byte has the top bit 0 or the top two bits 11, and you've found the start of a character.
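The resynchronization rule described above can be sketched in a few lines. The function names are hypothetical; the only thing it relies on is that continuation bytes look like 10xxxxxx.

```c
#include <stddef.h>

/* True if byte b can start a UTF-8 character: top bit 0 (ASCII) or
   top two bits 11 (lead byte). Continuation bytes are 10xxxxxx. */
static int is_utf8_start(unsigned char b)
{
    return (b & 0xC0) != 0x80;
}

/* Returns the index of the first character start at or after pos,
   or len if there is none left in the buffer. */
size_t utf8_resync(const unsigned char *buf, size_t len, size_t pos)
{
    while (pos < len && !is_utf8_start(buf[pos]))
        pos++;
    return pos;
}
```

Note that this only finds a plausible start byte; it does not check that the sequence that follows is valid.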

Name: Anonymous 2010-09-02 15:53

>>13
Actually I'm wrong, it was originally up to 6 bytes but RFC 3629 restricted it down to 4 bytes because Unicode does not define characters above U+10FFFF.
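For reference, an encoder that follows the 4-byte limit might look like this. It is a minimal sketch with a made-up function name, and it also rejects the surrogate range U+D800..U+DFFF, which RFC 3629 excludes as well.

```c
#include <stddef.h>
#include <stdint.h>

/* Encodes code point cp as UTF-8 into out (room for 4 bytes).
   Returns the number of bytes written, or 0 if cp cannot be encoded
   (a surrogate, or above U+10FFFF). Hypothetical helper name. */
size_t utf8_encode(uint32_t cp, unsigned char out[4])
{
    if (cp < 0x80) {                       /* 0xxxxxxx */
        out[0] = (unsigned char)cp;
        return 1;
    }
    if (cp < 0x800) {                      /* 110xxxxx 10xxxxxx */
        out[0] = (unsigned char)(0xC0 | (cp >> 6));
        out[1] = (unsigned char)(0x80 | (cp & 0x3F));
        return 2;
    }
    if (cp < 0x10000) {                    /* 1110xxxx 10xxxxxx 10xxxxxx */
        if (cp >= 0xD800 && cp <= 0xDFFF)
            return 0;                      /* surrogates are not characters */
        out[0] = (unsigned char)(0xE0 | (cp >> 12));
        out[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[2] = (unsigned char)(0x80 | (cp & 0x3F));
        return 3;
    }
    if (cp <= 0x10FFFF) {                  /* 11110xxx + three 10xxxxxx */
        out[0] = (unsigned char)(0xF0 | (cp >> 18));
        out[1] = (unsigned char)(0x80 | ((cp >> 12) & 0x3F));
        out[2] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[3] = (unsigned char)(0x80 | (cp & 0x3F));
        return 4;
    }
    return 0;                              /* beyond U+10FFFF */
}
```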

Name: Anonymous 2010-09-02 21:40

What I meant to say in the OP was that I read the unicode input into a wchar_t array, because just putting it in a char array gave me weird values when I tried printf("%X\n", *char_ptr); whilst printf("%X\n", *wchar_ptr); gave me the correct values.
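One likely cause of the weird values (an assumption, since the original code isn't shown) is that plain char is signed on many platforms, so a UTF-8 byte such as 0xE6 is sign-extended when it is promoted for printf. The sketch below uses hypothetical helper names to show the difference a cast to unsigned char makes.

```c
#include <stddef.h>
#include <stdio.h>

/* Formats the byte the way it effectively comes out where char is
   signed: the value is sign-extended to int width before printing. */
void hex_sign_extended(signed char c, char *out, size_t outsize)
{
    snprintf(out, outsize, "%X", (unsigned int)c);
}

/* Casting to unsigned char first keeps only the byte's own 8 bits. */
void hex_byte(signed char c, char *out, size_t outsize)
{
    snprintf(out, outsize, "%X", (unsigned int)(unsigned char)c);
}
```

With 32-bit int, the byte 0xE6 comes out as FFFFFFE6 from the first function and E6 from the second. Either way a char array holds the individual UTF-8 bytes, not code points, so the values will not match the Unicode column of the mapping table.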

Name: Anonymous 2010-09-02 23:37

>>8
> you'd have to be fucking retarded to waste up to three bytes for each character
Obviously you have never used Mork technology

