Beating TimSort at Merging (earthly.dev)
282 points by andrewstetsenko on July 13, 2021 | 69 comments


To get more gonzo, in CPython's list.sort(), the C code that actually merges two lists (which happen, in context, to be two contiguous array slices) is in listobject.c's `merge_at()` function. That brings in the world of "galloping" (exponential search) optimizations, much faster than one-pair-at-a-time compares in many real-world cases.

So that's a whole lot of additional complication, but it's the heart of what _really_ makes timsort shine in its "jaw dropping" cases.

Tim (of "timsort" - which was an inside joke name at the start, because I never expected anything outside of CPython would use it ;-) ).


Wow. Thanks for writing it. I believe this is the part you are referring to: https://github.com/python/cpython/blob/main/Objects/listobje...


Right, `merge_at()`. But that in turn calls small mountains of other C code. It is, alas, a complicated approach. Folding it in would be substantially more work than you've already done.

The point: for inputs like yours:

    a = list(range(-100, 1700))
    b = list(range(1400, 1800))

it would first use a variant of binary search to find where b[0] belongs in a. "Almost all of a" is consumed by that via a relatively tiny number of compares. It would similarly do a binary-ish search to find where a[-1] belongs in b. The tail end of b, from 1700 through 1799, then also never needs to be looked at again.

In the lucky case that two input lists don't overlap at all, and a[-1] <= b[0], that's enough so that a "one-pair-at-a-time" compare _never_ needs to be done. In context (the two lists are contiguous slices of an array), no pointers at all need to be copied in that case either - it deduces that the combined array slices are already in sorted order in total time O(log(len(a)) + log(len(b))), and it's done.
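
The trick looks roughly like this in pure Python - just a sketch of the idea, nothing like the real gallop code in listobject.c, which is much more careful about tuning and stability:

    from bisect import bisect_right

    def gallop(x, a):
        # Probe a[0], a[1], a[3], a[7], ... at exponentially growing offsets
        # until we overshoot x, then binary-search only inside the last window.
        hi = 1
        while hi <= len(a) and a[hi - 1] <= x:
            hi *= 2
        lo = hi // 2
        return bisect_right(a, x, lo, min(hi, len(a)))

For your a = list(range(-100, 1700)) and x = b[0] = 1400, that's roughly 20 compares to find the split point instead of ~1500 one-pair-at-a-time compares.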


Very Cool!

Edit: Ok, I get it now. Don't just pop, binary search into it, so you can fast forward through. I'm not sure I'll get around to implementing that, at least not in the immediate future, but that makes total sense.


You already got a win - quit while you're ahead ;-) Note that plain binary search here is probably not a good idea. See "listsort.txt", in the same directory as "listobject.c", for excruciating details.


If I get it working, it can be TimMerge.


Haha. AdamMerge works for me - I get enough abuse for naming a sort after myself ;-)

If you pursue this, you can probably throw out piles of the CPython code. That's trying to keep the result "in place", so has major code near-duplication to merge "into the left side" or "into the right side", to minimize the amount of temp memory needed (this depends on which input list is shorter).

But you're writing your output to a new list, so those cases are the same to you.

To keep the sort stable in all cases, though, you still need to distinguish between `gallop_left()` and `gallop_right()`.
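
The distinction is essentially the same one as bisect_left vs bisect_right in the stdlib (the gallop functions are just faster ways of getting the same answers): on equal keys they return different indices, and a stable merge has to pick one or the other depending on which run the key came from. A toy illustration:

    from bisect import bisect_left, bisect_right

    run = [1, 3, 3, 3, 5]
    print(bisect_left(run, 3))   # 1 - insert before the equal elements
    print(bisect_right(run, 3))  # 4 - insert after the equal elements

Pick the wrong one and equal elements from the two runs can swap order, which breaks stability.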


Gosh, comments like these from the actual author are a reminder of why I love HN. :)


I'm also amazed by the fact that John Nagle from the popular Nagle's algorithm is still active on HN!


There are lots of known people on HN. I started tagging HN usernames with a browser extension some time ago and sometimes I spot some really interesting exchanges given the context of what the posters worked on.


I'd like to know more about this. Can I use this?


Search for HN username tagging extension. There are quite a few options available.


Is this collaborative? That is, can one benefit from other people's tagging?


Is there a reason to use galloping over an array vs a seek in a search tree?

Better real world factors?

Galloping, Demaine set intersection, worst case optimal joins, they all seem to be different aspects of the same underlying principle.

So I'm very curious about the peculiarities in the sorting case.


See CPython's https://github.com/python/cpython/blob/main/Objects/listsort... for details about the sort. In fact, its "galloping" was inspired by a paper of which Demaine was a co-author (reference in the file already linked to).

CPython's lists are implemented as contiguous C arrays of pointers (to Python objects). Any way of grafting a tree-ish structure on top of that would be unacceptably wasteful of space and/or time. So sorting "gallops" over arrays because it's array-in, array-out.


Thank you for the explanation!


I had the same idea: if you re-use a lot of the non-sort parts of the TimSort code in this merge, the result would be just "Tim".


As someone that rarely works with Python lists, as opposed to numpy arrays, I was pleasantly surprised to see numpy does what I would expect in providing a mergesort option. I'm surprised Python doesn't, other than via heapq and only implemented in Python (according to my reading of the post and a very quick Google search).

Oops, just for fun the numpy documentation currently states: "The datatype determines which of ‘mergesort’ or ‘timsort’ is actually used, even if ‘mergesort’ is specified. User selection at a finer scale is not currently available." Awesome ...

Also, apparently mergesort may be done using radix sort for integers: "‘mergesort’ and ‘stable’ are mapped to radix sort for integer data types."


Why would Python provide a mergesort option when timsort was the replacement for a standard mergesort?

`heapq.merge` is not a mergesort, it's a merge of sorted sequences (aka just the merge part of a mergesort).
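
For example, with plain stdlib behaviour:

    import heapq

    # The inputs must already be sorted; merge just interleaves them.
    print(list(heapq.merge([1, 4, 7], [2, 5, 8], [3, 6, 9])))
    # [1, 2, 3, 4, 5, 6, 7, 8, 9]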


Turns out that at least in terms of performance, everyone using numpy is fine with this. It just needs to run fast enough, not as fast as possible ;)


This is not true. I have written many libraries to improve on the performance of numpy for image processing applications. Sort backs many operations, such as unique, that result in painfully slow code.


I too have written a lot of extension code to speed up numpy code. It's often not even especially difficult since any code at that level tends to be very algorithmic and looks essentially the same if written in numpy or C++. Having control of the memory layout and algorithms can be a huge win.

Of course I don't _usually_ do this, but that's just because most of the code I write doesn't need it. But it's not at all some crazy sort of thing to do.


Tangential but np.bincount is typically the fast version of np.unique. Not entirely the same thing, but it’s worth knowing about it.
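
Rough illustration of the relationship (np.bincount only applies to small non-negative integers, so it's not a drop-in replacement):

    import numpy as np

    x = np.array([0, 1, 1, 3, 3, 3])
    print(np.bincount(x))                  # [1 2 0 3] - counts for values 0..x.max()
    vals, counts = np.unique(x, return_counts=True)
    print(vals, counts)                    # [0 1 3] [1 2 3]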


mergesort mandates that it's a stable sort, so I guess they get away with replacing "like for like" that way. For varying definitions of like.


I definitely used this to my advantage in ACM ICPC contests. Most of the students and professors wrote the problems and benchmarked against Java. However, on several problems, it boiled down to "write this specific algorithm for this class of problem, or use C." I once netted my team an extra problem after we timed out in Java and I just rewrote the teammate's solution in C++.


My team used Java in two cases: BigInteger or 'isProbableTime.' For everything else the performance penalty was almost always unacceptable. C++ can also sometimes be fine-tuned enough to pass with unintended complexity (i.e. O(n lg n) instead of O(n)).


isProbableTime -> isProbablePrime.


Outside of some very specific parsing problems where the Java standard library was very helpful, writing solutions in C was the way to go.

Though while it could allow certain parts of the code to be less efficient, if you had completely the wrong algorithm it wasn’t going to save you.


Past TimSort threads, for anyone interested:

Timsort, the Python sorting algorithm - https://news.ycombinator.com/item?id=21196555 - Oct 2019 (131 comments)

On the Worst-Case Complexity of TimSort - https://news.ycombinator.com/item?id=17883461 - Aug 2018 (74 comments)

Timsort is a sorting algorithm that is efficient for real-world data - https://news.ycombinator.com/item?id=17436591 - July 2018 (77 comments)

Functional verification with mechanical proofs of TimSort [pdf] - https://news.ycombinator.com/item?id=9778243 - June 2015 (1 comment)

Timsort - https://news.ycombinator.com/item?id=3214527 - Nov 2011 (27 comments)

Java has switched from Mergesort to TimSort - https://news.ycombinator.com/item?id=752677 - Aug 2009 (13 comments)


I once had a Hackerrank problem that ultimately boiled down to "merge these sorted lists." I had 15 minutes, I think, to code the solution. I spent 10 of them deciding whether to use list.sort(), then finally just did it and cited the paper linked in On the Worst-Case Complexity of TimSort to claim linear time.

I passed, and ultimately got the job.


Recently learned that Swift also uses TimSort, really cool to see other language implementations:

https://news.ycombinator.com/item?id=20884208


It seems like the built-in heapq.merge() should be as fast as the implementation here. I mean if it's really just merging sorted lists, why wouldn't it be as fast as this implementation?

edit: Okay one reason is that heapq.merge is written in python:

https://github.com/python/cpython/blob/6252670732c68420c2a8b...

It might also be because the Python version has more options (like arbitrarily many iterables), but I presume it's the fact that the extension is written in C.


TFA explains this, although it never actually links the StackOverflow answer that it quotes from.

> CPython’s list.sort is implemented in C (avoiding interpreter overhead), while heapq.merge is mostly implemented in Python, and optimizes for the “many iterables” case in a way that slows the “two iterables” case.

Going to the original StackOverflow answer, we can see the same user (ShadowRanger) expanding on this in a comment: https://stackoverflow.com/questions/464342/combining-two-sor...

> The selling point for heapq.merge is that it doesn't require either the inputs or the outputs to be list; it can consume iterators/generators and produces a generator, so huge inputs/outputs (not stored in RAM at once) can be combined without swap thrashing. It also handles merging an arbitrary number of input iterables with lower overhead than might be expected (it uses a heap to coordinate the merging, so the overhead scales with the log of the number of iterables, not linearly, but as noted, that doesn't matter for the "two iterable" case).
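
That laziness is easy to see: both the inputs and the output can be generators, so nothing has to be materialized in memory at once:

    import heapq

    evens = (i for i in range(0, 1_000_000, 2))
    odds = (i for i in range(1, 1_000_000, 2))
    merged = heapq.merge(evens, odds)                 # lazy; nothing consumed yet
    print(next(merged), next(merged), next(merged))   # 0 1 2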


Yeah, looking at all the features of the heapq.merge() version, I kind of doubt implementing it in C would speed it up much at all.

edit: Yeah, I now see where the author mentioned the design philosophy of the Python version. It was a bit buried, but it answers my question like you say. I personally find that to be the most interesting part.


Because it's written in Python.


Re-writing it in C wouldn't automatically make much of a difference. Of course a C implementation wouldn't be slower (at minimum you could flatten out the byte-code sequence and use the generated C api calls), but it wouldn't necessarily be noticeably faster. Given the features of the python version, it might very well be the case that writing the same code in C wouldn't speed things up if the same use-cases were supported. Though I would be curious to know if it actually would speed up significantly.


> Re-writing it in C wouldn't automatically make much of a difference.

It probably would even with a line-for-line translation: you’d save all the interpreter overhead, function calls if there are any (I did not look), and most refcount traffic.


Maybe. It's true you would save a certain amount of interpreter overhead and some refcount traffic. How much that would help is not clear to me. So really I see any claimed speedups as pure speculation, honestly. Besides, the interesting question isn't really whether it speeds up or not (it obviously will speed up at least a little bit with any sane implementation), but how _much_ it speeds up. If you gain 1%, honestly, who cares. If you gain 30%, then that's more interesting. And how close can you get with this more general version in C to the version in the blog post, which uses a simplified implementation without all the extra features?

Anyway I am curious to know the answer, but I don't personally feel like writing out the implementation. If you do, I'd be interested in seeing the results.


I'm a bit amazed I've received downvotes for this post. I'm saying (1) you don't know how much this will speed up by going to C without trying it and (2) you don't know how much of the speed up in this case comes from the OP having written it in C and how much comes from the fact that it's a different algorithm. The fact that anyone would disagree with these (obviously true) statements is kind of incredible to me.


TimSort is cool. In its worst case it looks no better than merge sort, but give it some partially ordered data, or small amounts of data, where it uses insertion sort, and it really shines.


Interesting read. This post seems to be written by Adam Gordon Bell, who is also the host of the excellent CoRecursive podcast. Recently featured on HN in The Untold Story of SQLite [0].

[0]: https://news.ycombinator.com/item?id=27718701


That's me. Thanks for reading it and listening to the podcast. The SQLite episode is now nearly the most listened to, thanks to hacker news.

On merging lists: I was caught off guard by the recommendation to use sort to merge sorted lists. Everyone I mentioned it to was surprised as well, so I thought it was worthy of some investigation. This is my first C extension in Python, and I got help, so someone else might be able to do even better than this.

It is interesting when our normal shorthands for thinking about runtime complexity break down.


> It is interesting when our normal shorthands for thinking about runtime complexity break down.

Actually, I think it would be more interesting to write a Python version following the algorithm of your C extension (i.e. only allow two lists, don't allow generators, don't return generators, etc.). You could probably do that quite quickly and generate the same graphs. I would expect that Python version to lie somewhere between your C extension and the heapq.merge() version, which is more general.

edit: If you were to do this, I wouldn't recommend using the python version in your blog since it pops the lists from the front. This seems to be O(n) in general:

https://wiki.python.org/moin/TimeComplexity

I did a 10 minute glance at the python source code and couldn't totally verify this (it needs more than 10 minutes...), but it makes sense for popping from anywhere but the back to cause a list copy to occur. Maybe this isn't technically necessary when popping from the front, but I couldn't verify it. Either way it's easy to avoid by just not mutating the input lists anyway so I don't see a good reason not to go that way.
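
A quick, unscientific way to see the asymmetry (exact numbers vary by machine, but the gap is large):

    import timeit

    setup = "xs = list(range(100_000))"
    print(timeit.timeit("xs.pop()", setup, number=50_000))   # pop from the back: cheap
    print(timeit.timeit("xs.pop(0)", setup, number=50_000))  # pop from the front: shifts every remaining element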


So we could determine which gains I am getting are due to C and the specific compares, and which are due to it just being 2 lists? That would be an interesting test.

Maybe I'll do a follow-up. Thanks for reading the article. I suspect you know way more about C extensions in Python than I do. I saw you have a talk on this topic.

I think you are totally right that pop is not the way to go; better to use an offset to track the head, like the C example does. I actually put that in a footnote, because I was considering testing my Python version but the SO answer mentioned pop being expensive.

I think the simple version is still great as pseudo-code for communicating a solution though, and hopefully it makes the C code a bit easier to follow once you've seen the Python version.


Well, the version I'm describing is essentially the same as the pop version you have, so I don't think it's too complicated. You would just do something like this instead:

    def merge_sorted_lists(l1, l2):
        sorted_list = []
        i = 0
        j = 0
        while i < len(l1) or j < len(l2):
            if i == len(l1):
                sorted_list.append(l2[j])
                j += 1
                continue
            if j == len(l2):
                sorted_list.append(l1[i])
                i += 1
                continue
            if l1[i] < l2[j]:
                sorted_list.append(l1[i])
                i += 1
            else:
                sorted_list.append(l2[j])
                j += 1
        return sorted_list
I haven't tested that (you probably should before using it), but it looks mostly right and is essentially the same algorithm as yours.


This simple change is about 3x as fast. I think it is relevant because the same should be done in the original C code. 2x comes from the list swap; the rest comes from the use of N in the while conditional.

    def merge2(l1, l2):
        if len(l1) == 0:
            return [x for x in l2]
    
        if len(l2) == 0:
            return [x for x in l1]

        # ensure l1 is exhausted first to minimize
        # comparisons
        if l1[-1] > l2[-1]:
            l1, l2 = l2, l1
    
        sorted_list = []
        i = 0
        j = 0
        N = len(l1)
        while i < N:
            if l1[i] <= l2[j]:
                sorted_list.append(l1[i])
                i += 1
            else:
                sorted_list.append(l2[j])
                j += 1
        sorted_list.extend(l2[j:])
        return sorted_list


The first Python implementation is bad - removing the first element in each iteration is O(n), the C implementation gets this right by maintaining one index into each list instead of modifying the lists.


Yes, it says so in footnote 1 in the article. It's also not the benchmarked code so it doesn't really matter.


I missed the footnote. My point was more that one could have compared this with the C implementation, since heapq.merge is a different algorithm, so there is no actual comparison of the same algorithm in C and Python.


Not only that, but also list(a + b) is producing two different new lists since (a + b) produces a list and list() constructs a new list copy. The benchmark would be faster if the OP just did

    def sort_test(): m2 = a + b; m2.sort()

instead of

    def sort_test(): m2 = list(a + b); m2.sort()

EDIT: It seems like OP fixed this issue in perf.py, but left it in test.py


You can check it, but I'm pretty sure the difference is negligible. But yeah, all benchmarks are using the code in perf and the pop code is just to demonstrate. It is not benched.

Edit: dropping the extra list() from the blog code examples.


I see about a 10% performance improvement on my local machine on the input in test.py when not constructing the unnecessary second list.

I don't really buy that it reads nicer in prose, since you can just do (a + b).sort() if you want. Plus, I feel like it's important for readability to not be unnecessarily redundant. Having list(a + b) in the code also risks creating misconceptions about how lists can be constructed and used in Python.


Yeah, good point. I think you are right about it being incorrect. Updating...


Are there similar relatively simple methods or simd tricks for multi-way merges?


I dunno about "multiway merge". But your standard 2-way merge is SIMD-optimized by having a binary-search across the "diagonals" of the so-called mergepath.

Leading to O(lg(p * sqrt(2))) time for the longest comparison diagonal, where "p" is the number of processors. Including thread divergence, that's O(N/p * lg(p * sqrt(2))) total work done, or, assuming constant p, O(N) work per 2-way SIMD merge (with a large "p" providing an arbitrarily large speedup, though in practice it's probably limited to 1024 for modern GPU architectures, since modern GPUs only really "gang up" into 1024-thread CUDA blocks / workgroups efficiently).

https://web.cs.ucdavis.edu/~amenta/f15/GPUmp.pdf

This provides a natural way to allow ~1024 CUDA threads in a single block to sort a list in O(n * log(n)) efficiently.


I was thinking about how to make sorting faster the other day and was toying around with the idea of a permutation sort. If you think about a list of items, generally (at least) one of the permutations of the list is guaranteed to be the correct sort order. Theoretically you should be able to do a binary search to find a correct permutation, since there should always be a comparison you can do to reduce the search space by half. You should be able to go faster if there is duplicated data, since there are multiple permutations of the list that would work.


That wouldn't work for general n, but for smaller n, optimized sorting libraries frequently use size-optimal sorting networks [1] which effectively do the same thing (partitioning permutations into two halves with the minimal size difference).

[1] https://en.wikipedia.org/wiki/Sorting_network#Optimal_sortin...
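
A tiny example of the idea for n = 3 (three compare-exchanges, which is optimal for three inputs):

    def cswap(a, i, j):
        # compare-exchange: ensure a[i] <= a[j]
        if a[i] > a[j]:
            a[i], a[j] = a[j], a[i]

    def sort3(a):
        # fixed sequence of comparators, independent of the data
        cswap(a, 0, 1)
        cswap(a, 1, 2)
        cswap(a, 0, 1)
        return a

    print(sort3([3, 1, 2]))  # [1, 2, 3]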


Assuming constant-time comparison, the best you are going to be able to do is log(n!), which is O(n log n) by Stirling's approximation. But comparing a permutation is likely to be O(n), so you'll probably do worse than that.


You don't need to compare permutations; you reduce the set of possible permutations until you reach the one that must be the answer. Theoretical performance should be O(log n).


Sorry, nope. There aren't n permutations you're searching through, but the factorial of n to search through. That's the heart of the information-theoretic proof that no comparison-based sorting algorithm can do better, on average, than needing a number of comparisons equal to the base-2 logarithm of the factorial of n. Which, as already said, is O(n log n).
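
You can sanity-check the bound numerically; log2(n!) and n*log2(n) track each other closely:

    import math

    for n in (10, 1_000, 100_000):
        log2_factorial = math.lgamma(n + 1) / math.log(2)  # log2(n!)
        print(n, round(log2_factorial), round(n * math.log2(n)))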


No, I understand that there are n! permutations. Again, you aren't actually searching through the permutations. You're running comparisons on the data to reduce the set of possible permutations to the one it has to be (pigeonhole principle). I believe my point still stands. You should be able to cut the search space in half with one comparison every single time, which should perform the same as a binary search.

Edit: You are correct I made a mistake in my big O notation, it should be O(n log n).


No, I understand your algorithm, I’m just saying that you’ll start with n! permutations and halve that with each step. This will require log(n!) steps.


Could Cython be used for writing the fast module here instead?


Excellent read.


This is awesome. Thanks!


Doesn't explain why Earthly.dev is under the BSL.


I don't know why you would expect a post with a title like this to touch on that, even though being wary of the BSL licence is fair.


Great post. I almost did a spit take when I read:

> Timsort overcomes this disadvantage by being written in C rather than Python.

Yeah, Python is SLOW. Thankfully the author dug into C. That was nice to see.

Edit: lol I have no idea why this is being downvoted. Y’all weird.


That's what I suspected when reading the first part of the post, where it states that heapq.merge() is slower than sort(). Hmm, isn't it comparing apples to oranges, since heapq.merge() is implemented in Python and the system sort() would be in C?



