# Software Algorithms for k-combinations

Interpreting the mathematics of k-combinations for computer programmers.

Header Image: Crystal Pyramid Wallpaper by Michael Dziedzic.

Preface

It’s been a little over a year since my last post which may lead one to wonder if I had dropped off the face of the earth or merely given up. Fortunately neither is even close to being true.

Unless you’re a Sleeping Beauty or Rip Van Winkle, you will have noticed 2020 was the year of COVID-19. We’ve been extremely fortunate here in Australia; we locked down early enough to eradicate most of the community spread. We’ve suffered only a little over 900 deaths so far which is in stark contrast to most other nations and the total world death count of over 2,887,000 people (figures correct at time of writing).

While Australians were virtually unaffected by the virus itself we were heavily affected in other ways. Firstly, lockdowns played havoc to our supply chains. Western Australia is the most isolated capital city in the world, and our supermarkets moved quickly to curb panic-buying by enforcing restrictions. It worked eventually, but not before our supermarkets were stripped bare by fear-driven mobs before the restrictions could be put in place. The lack of supply played havoc for my family, especially due to our soy- and rice-milk drinking ways (which were classified as “long-life milk” and restricted). I would spend hours travelling across multiple neighborhoods each day, going shop to shop to gather enough food and supplies for our daily meals. Fortunately we had no problems with toilet paper, which seemed to be a problem for many people. 🧻

My partner and I removed our kids from school early over health concerns, and a week later the government sent children home en masse. Online learning was organised, but overworked teachers don’t have time to keep up with technology let alone move their entire syllabus to it, so the quality was unusable. We couldn’t keep up with our kids questions about the incomprehensible instructions they were given while holding down our full time remote jobs. The constant attention required to make it work was too demanding of our time and we were forced to abandon our children’s education altogether during this remainder of lockdown.

As most of Australia came out of lockdown and life found the new normal for most folks, my partner began suffering a debilitating medical issue and couldn’t get out of bed or move at all on her own. I took up the additional responsibility of full-time carer for our children and for her. Our public hospital system failed her utterly, refusing to provide a bed, adequate medical relief, or empathy from the agonising pain that wracked her body around the clock. After weeks of broken sleep, mental anguish, and helplessness at being unable to relieve her constant cries of pain, we threw a haymaker and went to a private hospital. Thankfully she immediately started getting the right treatment at last, putting her on track for multiple back surgeries to repair what turned out to be the worst bulging disc in anyone’s peacetime experience. She is still recovering but is finally doing okay.

During this period of caring for her, being a sole-parent, working full-time, and hour-long round-trips to the hospital far far away, my mental health started to break. Now anyone who’s played cricket with or against me knows I am one mentally tough hombre, but after months of my unrelenting pressure my grey-matter said “no”. As luck would have it I had already organised a therapist for myself weeks earlier and right at my breaking point I had the help I needed on hand.

As a bonus we determined I suffer from CPTSD caused by major childhood trauma. This was actually a huge relief to me; it explained many life-long issues and forced me to direct some care on myself for a change. Sure it meant living with the COVID-shitstorm of the present and unearthing unholy childhood trauma simultaneously, but y’all know I relish a challenge. At the end of this I’m going to be one goddamn unstoppable force of nature.

My partner appears to be recovering at long last, and I look forward to the day she’s fully recovered. It’s been a long hard road for both of us. Our kids have been resilient throughout this whole reign of fire and I believe its a testament to our amazing parenting skills.

Now my life is getting back on track I’ve been reclaiming some spare time and dipping my toes into side projects once again. I thought I was starting gently, but as you’ll see the adventures just keep coming for me. 😁

I look forward to my partner’s full recovery, the COVID vaccine to roll out and be a success 🤞 , and to my recovery from a lifetime of undiagnosed mental trauma.

# Introduction

Computer programmers, software engineers, software developers; no matter what we call ourselves, there are times where all of us find ourselves faced with some logic we know is a solved problem and reach out on the internet for the solution. Most of the time that solution can be found by spending a few minutes on our favourite search engine. We quickly gain knowledge of how the solution works, most often with code to help us understand the explanation. We take the code offered, tinker with it to make it our own, and move on having learned something.

This was not the case for me recently when it came time to work with algorithms I needed for what mathematicians call k-combinations. While I quickly found information about the mathematics of k-combinations, I found precious little about converting that mathematics in to code.

Having spent weeks of spare time working on a problem that I expected would only take a few minutes, and with encouragement from fellow engineers, I realised there was a hole in our online knowledge base about this important topic. This post attempts to fill that void.

# How it Started

25 years ago I wrote an application in Microsoft Excel to help me pick numbers for the national lottery. In the mid 90’s when I wrote this app, the biggest (and only) lottery available in my hometown city of Perth (Western Australia) was the Saturday Lottery Draw. In this lottery, every game required you to pick 6 numbers from 45 numbers, and you could play as many games as you like every week for an increasing amount of cost.

Like any gambler worth their salt, I have my own superstitions when it comes to the selection of my numbers. Here are mine:

1. Make sure you play every number in the field at least once, across a minimal number of games
2. If you don’t win any prize in the weekly draw, replay the same numbers the following week
3. If you do win any prize in the weekly draw, regenerate a new set of numbers to play

You don’t have to be a math genius to realise that to satisfy #1 you need to play a minimum of 45 ÷ 6 = 7.5 games. I chose to play 12 games every week, so I could cover the field once, take the three numbers left over from game 8 with the left over from game 8 in a second field, and fill 4 more games with randomly selected games from that second field.

This was fairly simple to achieve with Excel VBA. I wrote two columns of numbers from 1 to 45 (the two fields), shuffled the columns using VBA’s random number generator, then coloured each group of 6 (each game) with different colours. I manually transfer my picks onto paper coupons and play them each week.

25 years later I’m still waiting for that big win. The gambler’s conspiracy voice in my head has been at me for decades to change my approach. Being a vastly more seasoned programmer today than I was when I wrote that app, one day I decided to indulge my inner gambler and rewrite the app using my current technology stack (C# with a UWP front end). You know, purely for scientific purposes 🤨. It wouldn’t take long, right? 🤦‍♀️

# New Superstitions

Software engineers are very good at identifying patterns, and I am quite exceptional at it. After decades of using my Excel App I’ve noticed what appears to be the same groups of numbers appearing over and over. The scientist in me theorises the 25+ year old VBA randomizer probably doesn’t have the best random distribution on the planet, so I wondered if the .NET randomizer would be any better.

It wasn’t hard to whip up some code in LinqPad to show the distribution, and sure enough it was terrible. I toyed with the .NET Cryptogrphic Number Generator and was equally dissatisfied. I searched NuGet for a better alternative, and settled on RandN which has a specific aim to improve the quality of the randomness. Testing the distribution of the various generators RandN provides also left me dissatisfied. Here’s a sample distribution over 100 million generations of numbers between 1 and 45 using it’s StandardRng strategy:

note

While not lighting my world on fire with it’s random distribution, I do find RandN to be a solid random number generator. I continue using it to provide random numbers for this project and recommend it over .NET’s built in randomizers.

After some experimentation with generating lotto games randomly, and noticing certain combinations repeating too often for my liking, I resigned myself to the fact that simply using a random number generator was not going to be enough to satisfy my inner conspiracist. I would need to do more work to guarantee I don’t play the same numbers more than once.

It should be obvious there will be a finite number of ways to choose 6 numbers from 45 numbers. I figured if I precalculate every possible combination of 6, select my games randomly from this list, and store the combinations I have played, then I can guarantee I won’t be repeating games until I have depleted the entire list. I need to do additional work to ensure I don’t choose combinations that violate superstition #1 above, but that’s the kind of algorithmic challenge that I love about writing code; I have faith in myself to resolve that.

I just need to generate that list right?

# Another Numbers Game

To track the games I’ve played I would be storing a large list of sequences of 6 numbers on disk, so I need to consider how much space that takes. The worst case with minimal decoration will store all combinations of 45 choose 6, so how many combinations are there?

Mathematics has a field of study called Combinatorics that is devoted to counting systems of finite and infinite sets of numbers. It doesn’t take much searching to discover that in the language of mathematics I’m asking the question “how many combinations are there of n choose k?”. This is a solved problem in mathematics; the number of k-combinations is the count of the finite set of unique ways to choose k combinations from n things. From there it’s easy to find the formula for the Binomial Coefficient which is used to calculate the count of n choose k.

How mathematics write n choose k

I won’t go into the math just yet, but using this formula I found the number of combinations I would need to store is 8,145,060. That’s a big number! If I store all 6 numbers of each combination as a 12 digit string with no separator, this results in a file size of over 85MB! That’s an excessive amount of data for someone who grew up in the 8 bit era, and I simply couldn’t find peace until I found a better storage mechanism.

# Can You Smell What The Math Is Cooking?

One tool computer scientists use to identify entries in a consistently repeatable sequence of data (such as my list of all possible sequences) is an index. I can just number each entry in my list of all possible games with an integer from 1 to 8,145,060 as long as I can generate the same sequence every time I need. If I store each index on it’s own line in the file, that gives me a maximum file size of just under 69MB. While that is an improvement over 85MB (almost 20% less data), there are more efficient ways to store index data than as ASCII. What’s important to realise though is that indexing is probably my best way to reference previously played combinations.

When it comes to Combinatorial Number Systems, mathematics has us covered there too. In combinatorics, the index of an entry in the finite set of is called the rank, and finding the index/rank of a k-combination from the set is called ranking.

Now mathematicians love a good bijection as much as any computer scientist (maybe even more…purist bastards! 😁), so with typical swagger mathematics defines the inverse operation as well. In combinatorics finding the k-combination at a specific index/rank from the set is called unranking. At least the naming convention makes implicit sense!

aside

In combinatorial number systems, the order in which k-combinations of the set are listed is called a Lexicographical Order. This order must be idempotent so that mathematical functions like rank and unrank can be bijections of each other. We don’t go into detail about these terms are used in this post, but you will need to understand them if you choose to read wider into the mathematical theories of k-combinations.

Both ranking and unranking operations will be required in my new lotto game generating app. I’ll need to generate the set of all combinations of using a for loop over all the possible indexes 1 to 8,145,060 (unranking) , and I’ll need to regenerate the index from the combination when I’m caching the played games to disk (ranking).

note

The functions rank and unrank are usually accompanied by an iterator function to calculate the next k-combination in a sequence of . This function is used to enumerate the entire set, which is more efficient and performant than unranking each combination from scratch using an incrementing index. For the values of and that I’m using the performance gain isn’t relevant so I chose to keep the code simple by not implementing it.

# Where’s Wally?

Knowing what the mathematical terms are is one thing, but finding examples of algorithms implementing them is something else. For programmers, the most helpful Stack Overflow contributions on the topic often refer to an old MSDN article that has been retired, and currently available only by downloading a massive PDF archive of MSDN that includes it. For mathematicians, community contributions gloss over much of the reasoning that programmers need to create algorithms. I suppose in that respect at least the programming and mathematical communities have that in common. 😜

After a lot of searching I concluded that finding algorithms for rank and unrank weren’t available to read, understand, and use like most solved problems generally are. In terms of a complete reference of how to think about those mathematical functions, the best example I could find was definitely that retired MSDN article by Dr James McCaffrey, which I’ve dug out and am now hosting for posterity and your reading pleasure.

Armed with Dr McCaffrey’s article in particular, and supported by a host of other mathematical and Wikipedia web pages, I began to decode, reproduce, and create the functions I needed in C#.

# Algorithm Archaeology

The article describes the algorithm for unranking a k-combination from it’s index, as well as providing some C# code to do it. While the code is welcome and helps to verify the intention of the lengthy description, the description itself is the gold I needed to find in order to resurrect the knowledge. As for the code, I most often I prefer to re-write code myself to embed the knowledge in my own mind and incorporate language features that didn’t exist when reference code was written (simplifying it for maintainability and unit testing). Further, the reference code uses a zero-based system whereas the solution I’m after requires a one-based system.

After a lot of sweat and experimenting, I managed to distill the algorithm for a one-based unrank function down to the pseudocode below. The steps required in the algorithm are on the left, while some example working out using the set with an index of is on the right. Ignore the unfamiliar terms as I’ll define what they are a little later on in this post.

The pseudocode for the Unrank function

Let’s step through the algorithm.

1. Take n and k
This simply defines the values of n and k, and the value of (n choose k).
2. Take the index we want
This defines the index/rank of the k-combination we want to retrieve from the set
3. Move to base 0
Converts the system to a zero-based system. Ignore this step if you already use base zero.
4. Calculate the dual
Calculates the dual of the index/rank.
5. Calculate combinadic of the dual
Calculates the combinadic of the dual.
6. Map to zero-based combination
Maps the combinadic to the k-combination in a zero-based system.
7. Add 1 (for base 1)
Because I want one-based combinations, I add 1 to the result.

Fairly simple right? But what are those undefined terms? What do they mean?

## Calculating n choose k

As mentioned previously, finding the value of is a solved problem. In mathematics we use the Binomial Coefficient to calculate this value, and forms of the algorithm are explicitly described in the Combinations Wikipedia page.
Here’s my version, which is based on the second alternate computation method from that page:

note

I use int because my values for and are small enough to stay within the limits of that C# type. You’ll likely find the maximum value of int will become too limiting very quickly and you’ll likely need types with larger maximums.

## The Dual

The Dual (as Dr. McCaffrey calls it) is defined as the value of minus the rank/index of a k-combination. For example, if we list out all the possible combinations of , the index/rank runs from and the duals run from .

While my lottery generator needs to work with a one-based numbers, we’ll list the zero-based equivalent of as well. This will become important for reference later in this article.

In the decimal number system we can represent a number by breaking it down into it’s unit columns. For example, the decimal number can be broken down as follows:

These unit columns can be represented as a powers of ten with the exponent changing by one for each column; the ones column is , the tens column is , the hundreds column is , and so on.

This method for representing numbers can also be used by other numerical systems. The binary number system can be represented with each unit column as a power of two. The decimal number is in binary, which is broken down as:

Now lets consider a combinatorial number system. In a combinatorial number system we can represent any number by breaking it down into unit columns just like the decimal and binary systems. However instead of representing each unit column as a power of some base integer, we’ll use binomial coefficients.

For a finite set , the left-most unit column starts with a base of , the next column has base , and so on until the final column with . We can express the representation of any number of as:

When the values of are chosen such that each is greater than the following it (or zero if it follows a zero), then the set of all is what Dr McCaffrey calls the Combinadic.

As an explicit example, let’s look at the representation of the number in the set of numbers .

We’ll discuss how to calculate the combinadic of numbers a little later.

## Combining the Combinadic and the Dual

It turns out the combinadic has a very useful property for working with k-combinations in zero-based combinatorial number systems : when each of the combinadic of any dual is subtracted from the number , it reveals the zero-based k-combination of the rank/index of that dual.

That’s quite a mind-bender to take in, so here’s what this property looks like for the complete set of duals in the set :

This means that if we have any rank/index of , we can use the combinadic of the dual of that number to give us the k-combination at that rank/index. This is only true for zero based combinatorial number systems, but it isn’t too hard to use a bit of addition and subtraction at key stages to get to a one-based system like I need for my lottery app.

How exactly does the relationship between the combinadic, dual, and produce the k-combinations of ? Dr. McCaffrey doesn’t explain in his article, and I haven’t found any explanation in any of my searches across the web. I’m sure there is mathematics somewhere to prove this relationship, but I’ve just not found any. 🤷‍♀️

Fortunately Dr. McCaffrey does explain how to calculate the combinadic of a number, at least for zero-based systems. Remember, the definition of the combinadic of a number satisfies the equation:

We need to calculate the values of such that each is greater than the following it (or zero if it follows a zero).

Let’s work through the example of from the set .

### Finding

First we look for the largest value of that doesn’t exceed . We know the maximum number possible will be , so we work our way down from until we find a value .

Now , but , so we’ve discovered that .

### Finding

Next we look for the largest value of that doesn’t exceed . We know the maximum number possible will be , so we work our way down from until we find a value .

Now , , but , so we’ve discovered that .

### Finding

There can be only one value for any remaining once we get to , and that’s . So we’ve discovered that .

### The Result

Now we have all the values of the combinadic is revealed:

# Algorithm for Unrank

Now that we’ve discussed the theory and defined the terms, lets return to the pseudocode of the unrank function:

Things should make complete sense now. The only thing to note is how I convert my index from one-based to zero-based so I can use Dr. McCaffrey’s Unrank theory (which is zero-based), and then convert the result from zero-based back to one-based at the end for my one-based lottery generator.

It was relatively simple to code this up in C#.

To run this code you’ll also need my Binomials class defining the binomial coefficient calculation and some extension functions:

# Algorithm for Rank

It took me weeks of research, analysis, and prototyping to get to this point, so I was excited when I finally had working code to calculate the Unrank of . Next I needed to find an algorithm for the rank function of . Surely the information I needed was right alongside the theory for unrank, right?

My excitement quickly faded when I discovered Dr McCaffrey’s article stopped short of discussing the rank function. Panic started to set in when I couldn’t find any code or discussion of how the Rank function works anywhere on the web. It’s as though the problem was so well understood that nobody bothered to document it. I went from having the buzz that comes with success after weeks of hard work to the dread of having to do it all again to create the second, missing algorithm.

Fortunately by this point my senses for Mathematics and Science were completely re-awakened and I remembered that rank and unrank are bijections. That means I should be able to reverse the algorithm for unrank to find the algorithm for rank.

One more time, here’s the pseudocode for unrank:

Pseudocode for the Rank algorithm, one final time

Reversing order we get the pseudocode for rank:

Pseudocode for the Unrank algorithm

Coding this is even easier; there’s no need to calculate the combinadic using binomial coefficients because we deconstruct it from the combination provided and the value of . Everything plays out using addition and subtraction:

# Return of the King

Finally I had what I needed; algorithms for rank and unrank and I could get on with building my lottery generator…except I didn’t get on with it, at least not yet. When I coded the pseudocode I had aligned my code with each step of the pseudocode to ensure I was doing it right (note how the code comments line up with the pseudocode). Now that I had an executable version of the algorithm, I could see a few places where refactoring was clearly in order; preventing the Unrank function from looping through twice for example.

If I was going to refactor for more performance I should write some performance tests to measure the change in execution speed for the original and refactored functions. It was time to bust out my old friend BenchmarkDotNet.

If you’ve been following my blog you may remember this old friend from my first blog post back in 2017. In the years since BenchmarkDotNet has gone from strength to strength and is now the most widely used performance measuring tool in history. Microsoft use it extensively, for example, to prevent performance regression and drive performance improvements out of the unified .NET platform. The benefits of that effort have resulted in some amazing performance gains, to the point where C# is faster than C++ in many aspects. That’s an incredible outcome.

Last time I used BenchmarkDotNet I stayed within the confounds of LINQPad. This time I already had my lottery generator app already up and running somewhat in Visual Studio so it made sense to continue using it there. I refactored my Rank and Unrank algorithms into a CombinatorialNumberSystem class for consumption, which had the added benefit of simplifying how I handle values of n and k, and allowing the up-front calculation of write-once variables such as the dualOfZero.

Next I created a new FastCombinatorialNumberSystem class that removed that second loop and reduced the mathematics of rebasing between zero to one, effectively merging a few pseudocode steps together. This simplified the code a lot and it now reads like real code.

Pushing both implementations of Rank through some performance tests shows a significant performance boost:

As you can see I ran , , and through both implementations to give the algorithms varying degrees of workout. For Rank, the vanilla CombinatorialNumberSystem averaged less than 54 nanoseconds to compete each calculation, but the FastCombinatorialNumberSystem was 43% faster on average and took less than half the memory. What a great payoff for just a little bit of refactoring!

The results for Unrank, on the other hand, were even more surprising:

While the FastCombinatorialNumberSystem does perform better as expected, it’s less than 2% faster and uses the same amount of memory. I’m calling that statistically insignificant; a surprising result given this is where I removed that double loop over and expected a big boost. Perhaps the magicians of compilation-time code optimisation did the work for me and came up with essentially the same code as my refactored version. This demonstrates the value of measuring your code for actual performance gains because here I was perceiving and expecting a big boost when there wasn’t any.

# Conclusion

Hopefully this post covers everything you need as a software engineer to understand the mathematics of k-combinations and provide usable algorithms for it. I’ve stuck the code for both my regular and fast combinatorial number systems, as well as the benchmarking program that runs them, up on a Github Gist.

Don’t forget I’m also hosting Dr McCaffrey’s original MSDN article (due to it disappearing off the web with MSDN being shut down). The link for that is also below: