A full parallel Quicksort algorithm for multicore processors
Keywords: Quicksort, full parallel algorithm, parallel sorting, multicore
Abstract: The problem addressed in this paper is how to sort an integer array a of length n in parallel on a multicore machine with p cores using Quicksort. Amdahl's law tells us that the inherently sequential part of any algorithm will in the end dominate and limit the speedup we get from parallelisation. This paper introduces ParaQuick, a full parallel Quicksort algorithm for use on an ordinary shared-memory multicore machine that has just a few simple statements in its sequential part. It can be seen as an improvement over the traditional parallelisation of Quicksort, where one follows the sequential algorithm and substitutes the recursive calls at the top of the recursion tree with the creation of parallel threads for these calls. The ParaQuick algorithm starts with k parallel threads, where k is a multiple of p (here k = 8*p), in a k-way partition of the original array, all using the same pivot value; hence we get 2k partitioned areas in the first pass. We then calculate the pivot index, that is, where the division between the small and the large elements would have been had this been an ordinary sequential Quicksort partition. In full parallel we then swap all small elements to the right of this pivot index with the large elements to the left of it; these two 'displaced' sets are by definition of equal size. We can then recursively sort the left part with half of the threads and the right part with the other half (more details and synchronization considerations are given in the paper). Finally, when there is only one thread left working on such an area, sequential Quicksort and Insertionsort are used, as in the traditional way of doing parallel Quicksort. In the last part of the paper, this new algorithm is empirically tested against two other algorithms and Arrays.sort from the Java library. Five different distributions of the numbers to be sorted and three different machines, with p = 2 (4 hyper-threaded), 4 (8) and 32 (64) cores, are tested.
Finally, conclusions are presented, together with an explanation of why ParaQuick is, for large values of n and some distributions, so much faster than a traditional parallelisation.
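
To make the first pass described in the abstract concrete, the following is a minimal sequential sketch in Java. It is not the paper's implementation: the class and method names (ParaQuickSketch, partitionSegment, firstPass) are hypothetical, the loop over segments stands in for the k threads, and the displaced elements are exchanged with a simple two-pointer scan where the real algorithm does the swaps in parallel. Thread creation, synchronization and the recursive halving of the threads are left out.

```java
public class ParaQuickSketch {

    // Partition a[lo..hi) so that elements <= pivot come first.
    // Returns the index of the first element > pivot (hi if there is none).
    public static int partitionSegment(int[] a, int lo, int hi, int pivot) {
        int i = lo;
        for (int j = lo; j < hi; j++) {
            if (a[j] <= pivot) {
                int t = a[i]; a[i] = a[j]; a[j] = t;
                i++;
            }
        }
        return i;
    }

    // One ParaQuick-style pass over the whole array with k segments.
    // Returns the pivot index m; on return a[0..m) <= pivot < a[m..n).
    public static int firstPass(int[] a, int k, int pivot) {
        int n = a.length;

        // Step 1: k-way partition, every segment uses the same pivot value.
        // In the parallel version each iteration is the work of one thread.
        int small = 0;
        for (int s = 0; s < k; s++) {
            int lo = (int) ((long) s * n / k);
            int hi = (int) ((long) (s + 1) * n / k);
            int split = partitionSegment(a, lo, hi, pivot);
            small += split - lo;          // small elements found in this segment
        }

        // Step 2: the pivot index is simply the total number of small elements,
        // i.e. where a sequential partition would have put the division.
        int m = small;

        // Step 3: swap large elements left of m with small elements right of m.
        // The two displaced sets have equal size, so both scans end together.
        int i = 0, j = n - 1;
        while (true) {
            while (i < m && a[i] <= pivot) i++;   // next displaced large element
            while (j >= m && a[j] > pivot) j--;   // next displaced small element
            if (i >= m || j < m) break;
            int t = a[i]; a[i] = a[j]; a[j] = t;
            i++;
            j--;
        }
        return m;
    }
}
```

After firstPass returns m, the two areas a[0..m) and a[m..n) are independent, which is what allows half of the threads to continue on each side.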