Parallelzed Quicksort with GoLang

Introduction

The Free Lunch Is Over ( http://www.gotw.ca/publications/concurrency-ddj.htm ) . New machine will be more parallelized, but not multiplied CPU freq like before.
The program I made is parallelzed quicksort, by go language. Thanks to go language's channel primitive, coding of inter-thread (inter-goroutine ?) communication is easy. But there is something communication cost, we need some contrive.
In my PC (AMD64 quad core (Phenom II X4 905e, 1.9GHz)), by sorting of an array that have 100,000,000 double precision floating point numbers, parallelzed version with go is about double faster than single thread C version.
I'm a newcomer in go lang. There would be bad code.

Implementation

I'll describe important points. About others, please see source code ;-)
In qsort_para.go ( http://metanest.jp/qsort_sample/go/qsort_para/qsort_para.go )

  4         ch := make(chan int, 256)
  5         area_buf := make(chan []int, 256)

The size of channels are tuned value. Less size cause more context switch and makes program slow.

  6         for i := 0; i < 8; i++ {
  7                 go qsort_loop_buf(array, area_buf, ch)
  8         }
  9         area_buf <- []int{0, len(array) - 1}
 10         for count := 1; count > 0; {
 11                 count += <-ch
 12         }
 13         for i := 0; i < 8; i++ {
 14                 area_buf <- nil
 15         }

I'm using thread pool (goroutine pool ?). 8 threads used (hardcoded).
Sending nil is sign for break.

 79                         width := right - (j + 1)
 80                         if width > 500 {
 81                                 ch <- 1
 82                                 select {
 83                                 case area_buf <- []int{j + 1, right}:
 84                                         // send
 85                                 default:
 86                                         go func(lft, rgt int) {
 87                                                 area_buf <- []int{lft, rgt}
 88                                         }(j+1, right) // to prevent share
 89                                 }
 90                         } else {
 91                                 stack = stack_push(stack, j+1, right)
 92                         }
 93                         right = j - 1

If block size is narrower, parallelize is rather inefficient. 500 is tuned value. Less width makes slower result.
At the select statement, sorting range is send to channel. Because buffering size of go channel is fixed, using a throwaway goroutine to buffer. Care variable unsharering.

Results

↓single threaded C version ( http://metanest.jp/qsort_sample/c/ )

$ ./qsort
22.933873 sec.
22.946687 sec.
23.218775 sec.
23.218435 sec.
23.002754 sec.
23.014397 sec.
23.105153 sec.
22.994831 sec.

↓go version

$ GOMAXPROCS=8 ./qsort
10.419070 sec.
9.952412 sec.
9.873246 sec.
9.856545 sec.
9.759282 sec.
9.547970 sec.
10.413074 sec.
10.459733 sec.

( Note: With width < 30, quicksort recursion is stopped. Practically, after approximately sorted, using insertsort is faster. I have not implemented. )

Conclusion

Go language has lightweight "goroutine", it can use to implement parallel application. But the cost of communication over channel is not zero, we needs some contrive (at present).