I'm currently struggling to have my rows sorted in a MySQL database. I'm trying to get 'é' handled like 'e' when it comes to sorting. Instead, I get it either at the start or the end of the list (tried a few collations) or between 'a' and 'b' (with utf8_general_ci). I can understand why it could be taken as first or last element but why it sits in the middle of nowhere is completely beyond me but if anyone knows a collation that could help me, please tell me.
Test case:
create database test charset utf8 collate utf8_general_ci;
create table test.test (
id int unsigned auto_increment primary key,
name varchar(255) charset utf8 collate utf8_general_ci
) engine=InnoDB charset utf8 collate utf8_general_ci;
insert into test.test value (1,'aa'),(2,'ab'),(3,'bb'),(4,'ea'),(5, 'éb'),(6, 'ec');
select * from test.test order by name;
select * from test.test order by id;
drop database test;
Expected result: both selects should have the same result set.
Name:
Anonymous2008-04-25 6:19
But é is not the same character as e nor is it anywhere near a-z. I can only presume the characters are compared by their numeric code, not by a special comparison procedure. I'm not sure why you might want this? Anyway, perhaps you can provide it with a equivalence function to handle this specific case.
Name:
Anonymous2008-04-25 6:22
I'm not sure why you might want this?
Oh wow. Amerifag, I presume?
Name:
Anonymous2008-04-25 6:27
>>3
No, British. I have a bad suggestion... you could normalise all characters with accents and whatnot, so 'e', 'E', 'é' etc. are normalised to 'e' for the comparison operation only. Might be a solution if all else fails.
>>4 >>5
Even if I don't consider that this is, like you said, a bad suggestion in regard to coding style, I'd rather have them unsorted than with a spelling mistake.
Acents here have about the same importance as doubled leters in English. Sure, it's readable if you omit them, but it's stil a speling mistake and, as such, is tedious to read.
What puzzles me is that the MySQL doc states that the collation I selected in the above example does what I want. Can anyone that example on his server ? I tested it under 5.0.32 and 5.0.51
If that acute accent is actually a combining diacritic, you might want to normalise the string into form C first. I'm not sure if MySQL's collation actually handles combining marks...
Name:
Anonymous2008-04-25 13:09
>>7
Could be worse. >>8
I don't know how to check for sure. I tried both with the 'é' key of my keyboard and with the explicit diacritic and the behavior stays the same.
Name:
Anonymous2008-04-25 14:18
>>9
The é on your keyboard is most likely the precomposed version. I'm now confused too, because the MySQL documentation says utf8_general_ci should be usable for French and German.
Name:
Anonymous2008-04-25 14:31
>>6 Even if I don't consider that this is, like you said, a bad suggestion in regard to coding style, I'd rather have them unsorted than with a spelling mistake.
No, I meant store two versions. One for use of sorting and one for displaying. E.g.
ID | Name | Normalised_Name | Stuff
--------------------------------------------
00 | Citroën | citroen | Some stuff.
Name:
Anonymous2008-04-25 19:01
>>11
BTW, that stuff reminds me of an awesome function that does so in wordpress. Well, useful only if >>1 is using crappy php, of course. Other decent languages include such functions in the ``standard library''
Name:
Anonymous2008-04-25 19:40
public static string RemoveAccents(string str) {
if (str == null)
return null;
StringBuilder sb = new StringBuilder();
bool changed = false;
foreach (char c in str.Normalize(NormalizationForm.FormD))
if (char.GetUnicodeCategory(c) != System.Globalization.UnicodeCategory.NonSpacingMark)
sb.Append(c);
else
changed = true;
if (changed)
return sb.ToString().Normalize();
return str;
}
Name:
Bucket!!PhiVV3U2X7TT1Xm2008-04-25 21:10
>>13
Damn str8, and if your black, im not a furfag. Is bucket an artist? Art is not a string, it is the topic.
Name:
Anonymous2008-04-25 23:19
>>13
WTF? That won't work unless it's all built with combining marks, which I've seen far fewer of than strings with singular codepoints to represent accented letters.
Name:
Anonymous2008-04-25 23:26
>>15
It normalizes it into normalization form D first, so precomposed code points are decomposed into the base code point followed by the accents. Then it's renormalized into C again, after the accents are removed.
It seems to work well enough for european text, but I can't see it working with complex scripts and all that.
>>10
I'll probably submit a bug report. Either the behavior or the documentation has to be fixed.
>>11
Didn't think of it. I'll maybe use this as a last resort, thanks.
Saging my own thread because I have nothing worth bumping to first page to say.
Name:
Anonymous2008-04-28 4:06
Okay, I found the solution.
It turns out my records were actually inserted and retrieved using a latin1 connection (even though my locate is UTF8, the default connection character set for MySQL is latin1). Somehow, they were successfully converted for my terminal/web page to display it properly in UTF8 so I didn't think about it.
I exported the whole thing in UTF8, dropped the records (the table definitions were OK), set my clients to use utf8 as the client encoding and re-inserted the records. Everything went "just as planned".
All work and no play makes Jack a dull boy
All work and no play makes Jack a dull boy
All work and no play makes Jack a dull boy
All work and no play makes Jack a dull boy
All work and no play makes Jack a dull boy
All work and no play makes Jack a dull boy
All work and no play makes Jack a dull boy
All work and no play makes Jack a dull boy
All work and no play makes Jack a dull boy
All work and no play makes Jack a dull boy
All work and no play makes Jack a dull boy
All work and no play makes Jack a dull boy
All work and no play makes Jack a dull boy
All work and no play makes Jack a dull boy
All work and no play makes Jack a dull boy
All work and no play makes Jack a dull boy
All work and no play makes Jack a dull boy
All work and no play makes Jack a dull boy