sorting - Unix sort -n -t"," gives unexpected result


Unix numeric sort gives strange results, even when I specify the delimiter.

$ cat example.csv  # here's a small example
58,1.49270399401
59,0.000192136419373
59,0.00182092924724
59,1.49270399401
60,0.00182092924724
60,1.49270399401
12,13.080339685
12,14.1531049905
12,26.7613447051
12,50.4592437035

$ cat example.csv | sort -n --field-separator=,
58,1.49270399401
59,0.000192136419373
59,0.00182092924724
59,1.49270399401
60,0.00182092924724
60,1.49270399401
12,13.080339685
12,14.1531049905
12,26.7613447051
12,50.4592437035

For this example, sort gives the same result regardless of whether I specify the delimiter. I know that if I set LC_ALL=C, sort starts to give the expected behavior again. But I do not understand why the default environment settings, shown below, make this happen.

$ locale
LANG="en_US.UTF-8"
LC_COLLATE="en_US.UTF-8"
LC_CTYPE="en_US.UTF-8"
LC_MESSAGES="en_US.UTF-8"
LC_MONETARY="en_US.UTF-8"
LC_NUMERIC="en_US.UTF-8"
LC_TIME="en_US.UTF-8"
LC_ALL=
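For comparison, here is a minimal sketch of the LC_ALL=C workaround mentioned above. The sample rows are inlined in a shell variable so the snippet is self-contained; the names `data` and `result` are made up for illustration. Setting LC_ALL on the command line changes the locale for that one sort invocation only.

```shell
# Sample rows from example.csv, inlined so the snippet runs on its own.
data='58,1.49270399401
59,0.000192136419373
59,0.00182092924724
59,1.49270399401
60,0.00182092924724
60,1.49270399401
12,13.080339685
12,14.1531049905
12,26.7613447051
12,50.4592437035'

# Force the C locale for this command only; the rest of the shell
# session keeps its usual locale settings.
result=$(printf '%s\n' "$data" | LC_ALL=C sort -n -t,)
printf '%s\n' "$result"
```

In the C locale the comma is not treated as part of a number, so the rows starting with 12 should now come before the rows starting with 58.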

I've read many other questions (e.g. here, here, and here) on how to avoid this behavior in sort, but still, this behavior is incredibly weird and unpredictable, and has caused me a week of heartache. Can someone explain why sort with the default environment settings on Mac OS X (10.8.5) would behave this way? In other words: what is sort doing (with the locale variables set to en_US.UTF-8) to get that result?

I'm using

sort 5.93                        November 2005

$ type sort
sort is /usr/bin/sort

Update

I've discussed this on the gnu-coreutils list and now understand why sort, with the English Unicode default locale settings, gave the output it did. In the English Unicode locale, the comma character "," is considered numeric (so as to allow commas as thousands (or e.g. hundreds) separators), and sort defaults to "being greedy" when it interprets the line, so it read the example numbers approximately as

581.491...
590.000...
590.001...
591.492...
600.001...
601.492...
1213.08...
1214.15...
1226.76...
1250.45...

This is not what I had intended, and chepner is right: to get the result I actually want, I need to specify that sort should key on only the first field. By default, sort interprets more of the line as the key, rather than just the first field.
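The greedy reading can be imitated by hand. The sketch below deletes the commas with tr, which is only an approximation of what sort does internally in that locale, and then checks that the merged numbers are already in ascending order. The variable names `data` and `merged` are illustrative.

```shell
# Sample rows from example.csv, inlined so the snippet runs on its own.
data='58,1.49270399401
59,0.000192136419373
59,0.00182092924724
59,1.49270399401
60,0.00182092924724
60,1.49270399401
12,13.080339685
12,14.1531049905
12,26.7613447051
12,50.4592437035'

# Drop the commas to mimic a comma being read as a digit-group separator.
merged=$(printf '%s\n' "$data" | tr -d ',')
printf '%s\n' "$merged"

# sort -c only checks the order; it exits 0 if the input is already sorted.
if printf '%s\n' "$merged" | LC_ALL=C sort -n -c; then
    echo "merged values are already in ascending numeric order"
fi
```

Because the merged values are already ascending, a sort that reads the lines this way has nothing to reorder, which matches the unchanged output in the question.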

This behavior of sort has been discussed in the gnu-coreutils FAQ, and is further specified in the POSIX description of sort.

So, as Eric Blake on the gnu-coreutils list put it, if the field separator is also numeric (which a comma is), then "without -k to stop things, [the field-separator] serves as both separator and numeric character - you are sorting on numbers that span multiple fields."

I'm not sure this is entirely correct, but it's close.

sort -n -t, will try to sort numerically by the given key(s). In this case, the key is a tuple consisting of an integer and a float. Such tuples cannot be sorted numerically.

If you explicitly specify the single keys to sort on with

sort -k1,1n -k2,2n -t, 

it should work. You are now explicitly telling sort to first sort on the first field (numerically), then on the second field (also numerically).
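As a sketch of that command applied to the sample data from the question (inlined here in a hypothetical `data` variable so the snippet is self-contained):

```shell
# Sample rows from example.csv, inlined so the snippet runs on its own.
data='58,1.49270399401
59,0.000192136419373
59,0.00182092924724
59,1.49270399401
60,0.00182092924724
60,1.49270399401
12,13.080339685
12,14.1531049905
12,26.7613447051
12,50.4592437035'

# -k1,1n: the key starts and ends at field 1, compared numerically.
# -k2,2n: ties on field 1 are broken by field 2, also compared numerically.
result=$(printf '%s\n' "$data" | sort -t, -k1,1n -k2,2n)
printf '%s\n' "$result"
```

Each key is limited to a single field, so the comma can only act as a separator. The rows starting with 12 should come first, and rows that share a first field should be ordered by their second field.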

I suspect that -n is only useful as a global option if each line of input consists of a single numerical value. Otherwise, you need to use the -n option in conjunction with the -k option to specify exactly which fields are numbers.
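To illustrate the single-value case, here is a small sketch with made-up input (the values and variable names are not from the question). It compares a global -n with a plain byte-wise sort of the same lines.

```shell
# Hypothetical input: one integer per line.
values='10
9
100
2'

# Global -n: every line is one number, so there is no ambiguity about the key.
numeric=$(printf '%s\n' "$values" | sort -n)

# Without -n (C locale), lines are compared as strings, byte by byte.
lexical=$(printf '%s\n' "$values" | LC_ALL=C sort)

printf 'numeric:\n%s\n' "$numeric"
printf 'lexical:\n%s\n' "$lexical"
```

With one number per line the numeric sort should give 2, 9, 10, 100, while the string sort gives 10, 100, 2, 9.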

