/prog/ - MySQL collation

Name: Anonymous 2008-04-25 5:41

Sup,

I'm currently struggling to have my rows sorted in a MySQL database. I'm trying to get 'é' handled like 'e' when it comes to sorting. Instead, I get it either at the start or the end of the list (tried a few collations) or between 'a' and 'b' (with utf8_general_ci). I can understand why it could be taken as first or last element but why it sits in the middle of nowhere is completely beyond me but if anyone knows a collation that could help me, please tell me.

Test case:
create database test charset utf8 collate utf8_general_ci;
create table test.test (
id int unsigned auto_increment primary key,
name varchar(255) charset utf8 collate utf8_general_ci
) engine=InnoDB charset utf8 collate utf8_general_ci;
insert into test.test value (1,'aa'),(2,'ab'),(3,'bb'),(4,'ea'),(5, 'éb'),(6, 'ec');
select * from test.test order by name;
select * from test.test order by id;
drop database test;

Expected result: both selects should have the same result set.

Name: Anonymous 2008-04-25 6:19

But é is not the same character as e nor is it anywhere near a-z. I can only presume the characters are compared by their numeric code, not by a special comparison procedure. I'm not sure why you might want this? Anyway, perhaps you can provide it with a equivalence function to handle this specific case.

Name: Anonymous 2008-04-25 6:22

I'm not sure why you might want this?
Oh wow. Amerifag, I presume?

Name: Anonymous 2008-04-25 6:27

>>3
No, British. I have a bad suggestion... you could normalise all characters with accents and whatnot, so 'e', 'E', 'é' etc. are normalised to 'e' for the comparison operation only. Might be a solution if all else fails.

Name: Anonymous 2008-04-25 6:27

>>4
(That is, normalise them on insertion.)

Name: Anonymous 2008-04-25 12:47

>>2
I'm French-speaking

>>4
>>5
Even if I don't consider that this is, like you said, a bad suggestion in regard to coding style, I'd rather have them unsorted than with a spelling mistake.
Acents here have about the same importance as doubled leters in English. Sure, it's readable if you omit them, but it's stil a speling mistake and, as such, is tedious to read.

What puzzles me is that the MySQL doc states that the collation I selected in the above example does what I want. Can anyone that example on his server ? I tested it under 5.0.32 and 5.0.51

Name: Anonymous 2008-04-25 12:51

>>6
I'm French-speaking

Now you have two problems.

Name: Anonymous 2008-04-25 13:01

If that acute accent is actually a combining diacritic, you might want to normalise the string into form C first. I'm not sure if MySQL's collation actually handles combining marks...

Name: Anonymous 2008-04-25 13:09

>>7
Could be worse.
>>8
I don't know how to check for sure. I tried both with the 'é' key of my keyboard and with the explicit diacritic and the behavior stays the same.

Name: Anonymous 2008-04-25 14:18

>>9
The é on your keyboard is most likely the precomposed version. I'm now confused too, because the MySQL documentation says utf8_general_ci should be usable for French and German.

Name: Anonymous 2008-04-25 14:31

>>6
Even if I don't consider that this is, like you said, a bad suggestion in regard to coding style, I'd rather have them unsorted than with a spelling mistake.
No, I meant store two versions. One for use of sorting and one for displaying. E.g.



ID | Name    | Normalised_Name | Stuff

--------------------------------------------

00 | Citroën | citroen         | Some stuff.

Name: Anonymous 2008-04-25 19:01

>>11
BTW, that stuff reminds me of an awesome function that does so in wordpress. Well, useful only if >>1 is using crappy php, of course. Other decent languages include such functions in the ``standard library''

Name: Anonymous 2008-04-25 19:40



public static string RemoveAccents(string str) {

    if (str == null)

        return null;



    StringBuilder sb = new StringBuilder();

    bool changed = false;



    foreach (char c in str.Normalize(NormalizationForm.FormD))

        if (char.GetUnicodeCategory(c) != System.Globalization.UnicodeCategory.NonSpacingMark)

            sb.Append(c);

        else

            changed = true;



    if (changed)

        return sb.ToString().Normalize();



    return str;

}

Name: Bucket !!PhiVV3U2X7TT1Xm 2008-04-25 21:10

>>13
Damn str8, and if your black, im not a furfag. Is bucket an artist? Art is not a string, it is the topic.

Name: Anonymous 2008-04-25 23:19

>>13
WTF? That won't work unless it's all built with combining marks, which I've seen far fewer of than strings with singular codepoints to represent accented letters.

Name: Anonymous 2008-04-25 23:26

>>15
It normalizes it into normalization form D first, so precomposed code points are decomposed into the base code point followed by the accents. Then it's renormalized into C again, after the accents are removed.

It seems to work well enough for european text, but I can't see it working with complex scripts and all that.

Name: Anonymous 2008-04-26 1:54

>>14
What about my black?

Name: Anonymous 2008-04-28 3:20

>>10
I'll probably submit a bug report. Either the behavior or the documentation has to be fixed.

>>11
Didn't think of it. I'll maybe use this as a last resort, thanks.

Saging my own thread because I have nothing worth bumping to first page to say.

Name: Anonymous 2008-04-28 4:06

Okay, I found the solution.

It turns out my records were actually inserted and retrieved using a latin1 connection (even though my locate is UTF8, the default connection character set for MySQL is latin1). Somehow, they were successfully converted for my terminal/web page to display it properly in UTF8 so I didn't think about it.

I exported the whole thing in UTF8, dropped the records (the table definitions were OK), set my clients to use utf8 as the client encoding and re-inserted the records. Everything went "just as planned".

Name: Sgt.Kabukiman䖉 2012-05-22 23:23

All work and no play makes Jack a dull boy
All work and no play makes Jack a dull boy
All work and no play makes Jack a dull boy
All work and no play makes Jack a dull boy
All work and no play makes Jack a dull boy
All work and no play makes Jack a dull boy
All work and no play makes Jack a dull boy
All work and no play makes Jack a dull boy
All work and no play makes Jack a dull boy
All work and no play makes Jack a dull boy
All work and no play makes Jack a dull boy
All work and no play makes Jack a dull boy
All work and no play makes Jack a dull boy
All work and no play makes Jack a dull boy
All work and no play makes Jack a dull boy
All work and no play makes Jack a dull boy
All work and no play makes Jack a dull boy
All work and no play makes Jack a dull boy

MySQL collation

1 Name: Anonymous 2008-04-25 5:41

2 Name: Anonymous 2008-04-25 6:19

3 Name: Anonymous 2008-04-25 6:22

4 Name: Anonymous 2008-04-25 6:27

5 Name: Anonymous 2008-04-25 6:27

6 Name: Anonymous 2008-04-25 12:47

7 Name: Anonymous 2008-04-25 12:51

8 Name: Anonymous 2008-04-25 13:01

9 Name: Anonymous 2008-04-25 13:09

10 Name: Anonymous 2008-04-25 14:18

11 Name: Anonymous 2008-04-25 14:31

12 Name: Anonymous 2008-04-25 19:01

13 Name: Anonymous 2008-04-25 19:40

14 Name: Bucket !!PhiVV3U2X7TT1Xm 2008-04-25 21:10

15 Name: Anonymous 2008-04-25 23:19

16 Name: Anonymous 2008-04-25 23:26

17 Name: Anonymous 2008-04-26 1:54

18 Name: Anonymous 2008-04-28 3:20

19 Name: Anonymous 2008-04-28 4:06

21 Name: Sgt.Kabukiman䖉 2012-05-22 23:23

Name: Anonymous 2008-04-25 5:41

Name: Anonymous 2008-04-25 6:19

Name: Anonymous 2008-04-25 6:22

Name: Anonymous 2008-04-25 6:27

Name: Anonymous 2008-04-25 6:27

Name: Anonymous 2008-04-25 12:47

Name: Anonymous 2008-04-25 12:51

Name: Anonymous 2008-04-25 13:01

Name: Anonymous 2008-04-25 13:09

Name: Anonymous 2008-04-25 14:18

Name: Anonymous 2008-04-25 14:31

Name: Anonymous 2008-04-25 19:01

Name: Anonymous 2008-04-25 19:40

Name: Bucket !!PhiVV3U2X7TT1Xm 2008-04-25 21:10

Name: Anonymous 2008-04-25 23:19

Name: Anonymous 2008-04-25 23:26

Name: Anonymous 2008-04-26 1:54

Name: Anonymous 2008-04-28 3:20

Name: Anonymous 2008-04-28 4:06

Name: Sgt.Kabukiman䖉 2012-05-22 23:23