To get more gonzo, in CPython's list.sort(), the C code that actually merges two lists (which happen, in context, to be two contiguous array slices) is in listobject.c's `merge_at()` function. That brings in the world of "galloping" (exponential search) optimizations, much faster than one-pair-at-a-time compares in many real-world cases.
So that's a whole lot of additional complication, but it's the heart of what _really_ makes timsort shine in its "jaw dropping" cases.
Tim (of "timsort" - which was an inside joke name at the start, because I never expected anything outside of CPython would use it ;-) ).
Right, `merge_at()`. But that in turn calls small mountains of other C code. It is, alas, a complicated approach. Folding it in would be substantially more work than you've already done.
The point: for inputs like your:
a = list(range(-100, 1700))
b = list(range(1400, 1800))
it would first use a variant of binary search to find where b[0] belongs in a. "Almost all of a" is consumed by that via a relatively tiny number of compares. It would similarly do a binary-ish search to find where a[-1] belongs in b. The tail end of b, from 1700 through 1799, then also never needs to be looked at again.
In the lucky case that two input lists don't overlap at all, and a[-1] <= b[0], that's enough so that a "one-pair-at-a-time" compare _never_ needs to be done. In context (the two lists are contiguous slices of an array), no pointers at all need to be copied in that case either - it deduces that the combined array slices are already in sorted order in total time O(log(len(a)) + log(len(b))), and it's done.
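For flavor, here's a minimal sketch of the galloping idea in Python (the shape of it only; CPython's `gallop_left()`/`gallop_right()` in listobject.c additionally handle searching from either end, stability on ties, and adaptive thresholds):

from bisect import bisect_left

def gallop(x, a, start=0):
    # Probe a[start+1], a[start+2], a[start+4], ... until we overshoot x,
    # then finish with a binary search inside the bracketed range.
    offset = 1
    lo = start
    while start + offset < len(a) and a[start + offset] < x:
        lo = start + offset
        offset *= 2
    hi = min(start + offset, len(a))
    return bisect_left(a, x, lo, hi)

a = list(range(-100, 1700))
b = list(range(1400, 1800))
print(gallop(b[0], a))   # 1500: everything before that index is consumed in ~20 compares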
Edit: Ok, I get it now. Don't just pop, binary search into it, so you can fast forward through. I'm not sure I'll get around to implementing that, at least not in the immediate future, but that makes total sense.
You already got a win - quit while you're ahead ;-) Note that plain binary search here is probably not a good idea. See "listsort.txt", in the same directory as "listobject.c", for excruciating details.
Haha. AdamMerge works for me - I get enough abuse for naming a sort after myself ;-)
If you pursue this, you can probably throw out piles of the CPython code. That's trying to keep the result "in place", so has major code near-duplication to merge "into the left side" or "into the right side", to minimize the amount of temp memory needed (this depends on which input list is shorter).
But you're writing your output to a new list, so those cases are the same to you.
To keep the sort stable in all cases, though, you still need to distinguish between `gallop_left()` and `gallop_right()`.
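The distinction is the same one the standard library's bisect module makes for equal keys; a tiny illustration:

from bisect import bisect_left, bisect_right

run = [10, 20, 20, 20, 30]
print(bisect_left(run, 20))    # 1: insert *before* the run of equal keys
print(bisect_right(run, 20))   # 4: insert *after* the run of equal keys
# When merging, equal elements from one side must consistently land before
# (or after) equal elements from the other side, or stability is lost.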
There are lots of known people on HN. I started tagging HN usernames with a browser extension some time ago and sometimes I spot some really interesting exchanges given the context of what the posters worked on.
CPython's lists are implemented as contiguous C arrays of pointers (to Python objects). Any way of grafting a tree-ish structure on top of that would be unacceptably wasteful of space and/or time. So sorting "gallops" over arrays because it's array-in, array-out.
As someone that rarely works with Python lists, as opposed to numpy arrays, I was pleasantly surprised to see numpy does what I would expect in providing a mergesort option. I'm surprised Python doesn't, other than via heapq and only implemented in Python (according to my reading of the post and a very quick Google search).
Oops, just for fun the numpy documentation currently states: "The datatype determines which of ‘mergesort’ or ‘timsort’ is actually used, even if ‘mergesort’ is specified. User selection at a finer scale is not currently available." Awesome ...
Also, apparently mergesort may be done using radix sort for integers: "‘mergesort’ and ‘stable’ are mapped to radix sort for integer data types."
This is not true. I have written many libraries to improve on the performance of numpy for image processing applications. Sort backs many operations, such as unique, that result in painfully slow code.
I too have written a lot of extension code to speed up numpy code. It's often not even especially difficult since any code at that level tends to be very algorithmic and looks essentially the same if written in numpy or C++. Having control of the memory layout and algorithms can be a huge win.
Of course I don't _usually_ do this, but that's just because most of the code I write doesn't need it. But it's not at all some crazy sort of thing to do.
I definitely used this to my advantage in ACM ICPC contests. Most of the students and professors wrote the problems and benchmarked against Java. However, on several problems, it boiled down to "write this specific algorithm for this class of problem, or use C." I once netted my team an extra problem after we timed out in Java and I just rewrote the teammate's solution in C++.
My team used Java in two cases: BigInteger or 'isProbablePrime.' For everything else the performance penalty was almost always unacceptable. C++ can also sometimes be fine-tuned enough to pass with unintended complexity (i.e. O(n lg n) instead of O(n)).
I once had a Hackerrank problem that ultimately boiled down to "merge these sorted lists." I had 15 minutes, I think, to code the solution. I spent 10 of them deciding whether to use list.sort(), then finally just did it and cited the paper linked in On the Worst-Case Complexity of TimSort to claim linear time.
It seems like the built in heapq.merge() should be as fast as the implementation here. I mean if it's really just merging sorted lists, why wouldn't it be as fast as this implementation?
edit: Okay one reason is that heapq.merge is written in python:
It might also be since the python version has more options (like arbitrarily many iterables), but I presume it's the fact that the extension is written in C.
TFA explains this, although it never actually links the StackOverflow answer it quotes from.
> CPython’s list.sort is implemented in C (avoiding interpreter overhead), while heapq.merge is mostly implemented in Python, and optimizes for the “many iterables” case in a way that slows the “two iterables” case.
> The selling point for heapq.merge is that it doesn't require either the inputs or the outputs to be list; it can consume iterators/generators and produces a generator, so huge inputs/outputs (not stored in RAM at once) can be combined without swap thrashing. It also handles merging an arbitrary number of input iterables with lower overhead than might be expected (it uses a heap to coordinate the merging, so the overhead scales with the log of the number of iterables, not linearly, but as noted, that doesn't matter for the "two iterable" case).
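A quick illustration of that (standard library only; key= and reverse= require Python 3.5+):

import heapq
from itertools import islice

evens = (x for x in range(0, 20, 2))   # generators, nothing materialized in RAM
odds  = (x for x in range(1, 20, 2))

merged = heapq.merge(evens, odds)      # lazy: also a generator
print(list(islice(merged, 10)))        # [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]

# any number of input iterables, plus key= and reverse=
words = heapq.merge(["apple", "pear"], ["Banana"], key=str.lower)
print(list(words))                     # ['apple', 'Banana', 'pear']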
Yeah, looking at all the features of the heapq.merge() version, I kind of doubt implementing it in C would speed it up much at all.
edit: Yeah I now see where the author mentioned the design philosophy of the python version. It was a bit buried, but it answers my question like you say. I personally find that to be the most interesting part.
Re-writing it in C wouldn't automatically make much of a difference. Of course a C implementation wouldn't be slower (at minimum you could flatten out the byte-code sequence and use the generated C api calls), but it wouldn't necessarily be noticeably faster. Given the features of the python version, it might very well be the case that writing the same code in C wouldn't speed things up if the same use-cases were supported. Though I would be curious to know if it actually would speed up significantly.
> Re-writing it in C wouldn't automatically make much of a difference.
It probably would even with a line-for-line translation: you’d save all the interpreter overhead, function calls if there are any (I did not look), and most refcount traffic.
Maybe. It's true you would save a certain amount of interpreter overhead and some refcount traffic. How much that would help is not clear to me. So really I just see any speedups as pure speculation, honestly. Besides, the interesting question isn't really whether it speeds up or not (it obviously will speed up at least a little bit with any sane implementation), but how _much_ it speeds up. If you gain 1%, honestly who cares. If you gain 30%, then that's more interesting. And how close can you get with this more general version in C to the version in the blog post, which uses a simplified implementation without all the extra features?
Anyway I am curious to know the answer, but I don't personally feel like writing out the implementation. If you do, I'd be interested in seeing the results.
I'm a bit amazed I've received downvotes for this post. I'm saying (1) you don't know how much this will speed up by going to C without trying it and (2) you don't know how much of the speed up in this case comes from the OP having written it in C and how much comes from the fact that it's a different algorithm. The fact that anyone would disagree with these (obviously true) statements is kind of incredible to me.
TimSort is cool. In its worst case it looks no better than merge sort, but give it some partially ordered data, or small amounts of data (where it uses insertion sort), and it really shines.
Interesting read. This post seems to be written by Adam Gordon Bell, who is also the host of the excellent CoRecursive podcast, recently featured on HN with The Untold Story of SQLite [0].
That's me. Thanks for reading it and listening to the podcast. The SQLite episode is now nearly the most listened to, thanks to hacker news.
On merging lists: I was caught off guard by the recommendation to use sort to merge sorted lists. Everyone I mentioned it to was surprised as well, so I thought it was worthy of some investigation. This is my first C extension in Python, and I got help, so someone else might be able to do even better than this.
It is interesting when our normal short hands for thinking about runtime complexity break down.
> It is interesting when our normal short hands for thinking about runtime complexity break down.
Actually I think more interesting would be to write a python version following the algorithm of your C extension (i.e. only allow two lists, don't allow generators, don't return generators, etc). You could probably do that quite quickly and generate the same graphs. I would expect that python version to lie somewhere between your C extension and the heapq.merge() version which is more general.
edit: If you were to do this, I wouldn't recommend using the python version in your blog since it pops the lists from the front. This seems to be O(n) in general:
I did a 10 minute glance at the python source code and couldn't totally verify this (it needs more than 10 minutes...), but it makes sense for popping from anywhere but the back to cause a list copy to occur. Maybe this isn't technically necessary when popping from the front, but I couldn't verify it. Either way it's easy to avoid by just not mutating the input lists anyway so I don't see a good reason not to go that way.
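For what it's worth, pop(0) doesn't copy the whole list object, but it does shift every remaining element down by one (a memmove of the pointer array), so it's still O(n) per call. A rough way to see the difference (numbers vary by machine, but the shape is clear):

import timeit

setup = "xs = list(range(100_000))"
print(timeit.timeit("xs.pop(0)", setup=setup, number=50_000))  # O(n) each: slow
print(timeit.timeit("xs.pop()",  setup=setup, number=50_000))  # O(1) each: fast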
So we could determine which gains I am getting are due to C and the specific compares, and which are due just to it being 2 lists? That would be an interesting test.
Maybe I'll do a follow-up. Thanks for reading the article. I suspect you know way more about C extensions in Python than I do. I saw you have a talk on this topic.
I think you are totally right that pop is not the way to go, and that using an offset to track the head, like the C example does, is better. I actually put that in a footnote, because I was considering testing my python version but the SO answer mentioned pop being expensive.
I think the simple version is still great as pseudo-code for communicating a solution though, and hopefully it makes the C code a bit easier to follow once you've seen the python version.
Well the version I'm suggesting is essentially the same as the pop version you have, so I don't think it's too complicated. You would just do something like this instead:
def merge_sorted_lists(l1, l2):
    sorted_list = []
    i = 0
    j = 0
    while i < len(l1) or j < len(l2):
        if i == len(l1):
            sorted_list.append(l2[j])
            j += 1
            continue
        if j == len(l2):
            sorted_list.append(l1[i])
            i += 1
            continue
        # take from l1 on ties to keep the merge stable
        if l1[i] <= l2[j]:
            sorted_list.append(l1[i])
            i += 1
        else:
            sorted_list.append(l2[j])
            j += 1
    return sorted_list
I haven't tested that (you probably should before using it), but it looks mostly right and is essentially the same algorithm as yours.
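A quick sanity check against sorted(), for anyone who wants to verify it before relying on it:

import random

l1 = sorted(random.sample(range(1000), 50))
l2 = sorted(random.sample(range(1000), 80))
assert merge_sorted_lists(l1, l2) == sorted(l1 + l2)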
This simple change is about 3x as fast. I think it is relevant because the same should be done in the original C code. 2x comes from the list swap, the rest comes from the use of N in the while conditional.
def merge2(l1, l2):
    if len(l1) == 0:
        return [x for x in l2]
    if len(l2) == 0:
        return [x for x in l1]
    # ensure l1 is exhausted first to minimize comparisons
    if l1[-1] > l2[-1]:
        l1, l2 = l2, l1
    sorted_list = []
    i = 0
    j = 0
    N = len(l1)
    while i < N:
        if l1[i] <= l2[j]:
            sorted_list.append(l1[i])
            i += 1
        else:
            sorted_list.append(l2[j])
            j += 1
    sorted_list.extend(l2[j:])
    return sorted_list
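If anyone wants to reproduce the comparison, something along these lines works, assuming both functions above (merge_sorted_lists and merge2) are in scope; exact ratios depend on the input shape and machine:

import random
import timeit

l1 = sorted(random.random() for _ in range(100_000))
l2 = sorted(random.random() for _ in range(100_000))

for fn in (merge_sorted_lists, merge2):
    print(fn.__name__, timeit.timeit(lambda: fn(l1, l2), number=20))
print("sorted(l1 + l2)", timeit.timeit(lambda: sorted(l1 + l2), number=20))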
The first Python implementation is bad - removing the first element in each iteration is O(n), the C implementation gets this right by maintaining one index into each list instead of modifying the lists.
I missed the footnote. My point was more that one could have compared this with the C implementation as heapq.merge is a different algorithm and so there is no actual comparison of the same algorithm in C and Python.
Not only that, but also list(a + b) is producing two different new lists since (a + b) produces a list and list() constructs a new list copy. The benchmark would be faster if the OP just did
def sort_test():
    m2 = a + b
    m2.sort()

instead of

def sort_test():
    m2 = list(a + b)
    m2.sort()
EDIT: It seems like OP fixed this issue in perf.py, but left it in test.py
You can check it, but I'm pretty sure the difference is negligible. But yeah, all benchmarks are using the code in perf and the pop code is just to demonstrate. It is not benched.
Edit: dropping the extra list() from the blog code examples.
I see about a 10% performance improvement on my local machine on the input in test.py when not constructing the unnecessary second list.
I don't really buy that it reads nicer in prose, since you can just do (a + b).sort() if you want. Plus, I feel like it's important for readability to not be unnecessarily redundant. Having list(a + b) in the code also risks creating misconceptions about how lists can be constructed and used in Python.
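The extra copy is also easy to measure on its own (sort left out so the copy dominates; exact numbers vary):

import timeit

a = list(range(500_000))
b = list(range(250_000, 750_000))
print(timeit.timeit("a + b",       globals=globals(), number=100))
print(timeit.timeit("list(a + b)", globals=globals(), number=100))  # roughly twice the copying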
I dunno about "multiway merge". But your standard 2-way merge is SIMD-optimized by having a binary-search across the "diagonals" of the so-called mergepath.
Leading to O(lg(p * sqrt(2))) time for the longest comparison diagonal, where "p" is the number of processors. Including thread divergence, that's O(N/p * lg(p * sqrt(2))) total work done, or, assuming constant p, O(N) of work per 2-way SIMD merge. A large "p" provides an arbitrarily large speedup, but in practice p is probably limited to 1024 on modern GPU architectures, since modern GPUs only really "gang up" into 1024-CUDA-thread blocks / workgroups efficiently.
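In plain sequential Python, the split along one diagonal looks roughly like this (a sketch of the partitioning idea only; the SIMD version runs one such binary search per thread, each on its own diagonal):

def merge_path_split(a, b, d):
    # Find (i, j) with i + j == d such that a[:i] and b[:j] are exactly the
    # first d elements of the stable merge of sorted lists a and b.
    lo, hi = max(0, d - len(b)), min(d, len(a))
    while lo < hi:
        i = (lo + hi) // 2
        j = d - i
        if b[j - 1] < a[i]:  # i may be too large (or just right): search the lower half
            hi = i
        else:                # a[i] sorts no later than b[j-1], so the prefix needs more of a
            lo = i + 1
    return lo, d - lo

a, b = [1, 3, 5, 7], [2, 4, 6, 8]
i, j = merge_path_split(a, b, 4)                   # (2, 2) here
assert sorted(a[:i] + b[:j]) == sorted(a + b)[:4]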
I was thinking about how to make sorting faster the other day and was toying around with the idea of a permutation sort. If you think about a list of items, (at least) one of the permutations of the list is guaranteed to be the correct sort order. Theoretically you should be able to do a binary search to find a correct permutation, since there should always be a comparison you can do to reduce the search space by half. You should be able to go faster if there is duplicated data, since there are multiple permutations of the list that would work.
That wouldn't work for general n, but for smaller n, optimized sorting libraries frequently use size-optimal sorting networks [1] which effectively do the same thing (partitioning permutations into two halves with the minimal size difference).
Assuming constant-time comparisons, the best you are going to be able to do is log(n!), which is O(n log n) by Stirling's approximation. But comparing a permutation is likely to be O(n), so you'll probably do worse than that.
You don't need to compare permutations; you reduce the set of possible permutations until you reach the one that must be the answer. Theoretical performance should be O(log n).
Sorry, nope. There aren't n permutations you're searching through, but the factorial of n to search through. That's the heart of the information-theoretic proof that no comparison-based sorting algorithm can do better, on average, than needing a number of comparisons equal to the base-2 logarithm of the factorial of n. Which, as already said, is O(n log n).
No, I understand that there are n! permutations. Again, you aren't actually searching through the permutations. You're running comparisons on the data to reduce the set of possible permutations to the one it has to be (pigeonhole principle). I believe my point still stands. You should be able to cut the search space in half with one comparison every single time, which should perform the same as a binary search.
Edit: You are correct I made a mistake in my big O notation, it should be O(n log n).
No, I understand your algorithm, I’m just saying that you’ll start with n! permutations and halve that with each step. This will require log(n!) steps.
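And log2(n!) does grow like n log2(n) (Stirling gives log2(n!) ≈ n log2(n) - 1.44 n), which is easy to check:

import math

for n in (10, 100, 1000):
    print(n, round(math.log2(math.factorial(n))), round(n * math.log2(n)))
# approximately: 10 -> 22 vs 33, 100 -> 525 vs 664, 1000 -> 8529 vs 9966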
That's what I suspected when reading the first part of the post, where it states that heapq.merge() is slower than sort(). Hmm, isn't it comparing apples to oranges, since heapq.merge() is implemented in Python and the system sort() would be in C?