You are here:

## Blog Archive:

* Can be used in order to search for older blogs Entries

## Search in blogs

 KeywordsPhrase

## Blog Categories:

* Can be used in order to search for blogs Entries by Categories

## Blog Tags:

* Can be used in order to search for blogs by keywords

## TNWikiSummit

 Read this before you use the blog!

## Grouping continuous Ranges together: Part 1, Gaps and Islands approach

מרץ25

Written by: ronen ariely
25/03/2019 13:19

# Introduction

Note! This post is draft, which I post fast as answer to question which asked at stackoverflow. Since I don't have time this week to prepare a well formatted post with images, full explanation, and well formatted text, but at the same time I want to provide a solution for the sake of these who asked for it, therefore I publish this as a first draft and provide the code for one of two solutions I will discuss.

Note! The solution which presented here is not the one which I recommend to use in this case. This solution address the question using a well known approach named "Gaps and Islands" and might fit for most cases. In my next post I will present a totally different approach using my personal trick, which can improve performance dramatically in some scenario.

So what do we have here? The post include code and some comments to solve several scenario of Grouping continuous Ranges together.

I will start with the simplest case which is ranges of integers, for example 2-4, 6-8, 8-10, 13-14 which should be grouped into 2-4, 6-10, 13-14.

Next I will move to explain issue related to the resolution of space between the ranges, and I will go to present a solution for Ranges of Decimal numbers.

Finally, using the solution which I presented in detail for INTEGERS I will present a solution for "Grouping continuous time-slots together", which was the original question in the forum.

Let's go dive into the code.

# Grouping together Continuous Integer-slots which does not overlap each other

Note! Please go step by step and execute each query in order to understand the topic. As I explained at this time I first publish the code as a fast solution and I do not add images like what the query returns. I will assume that you executed the code in my comments!

## Case Study 1.1

In our first scenario we assume that the slots does not include overlap points. By overlap points I mean that a point exists in more than one slot, for example 2-5, 4-7, where the range 4-5 exists in two slots
`DROP` `TABLE` `IF EXISTS Ranges`
`GO`
`CREATE` `TABLE` `RANGES(FromNum ``INT``, ToNum ``INT``)`
`GO`
`INSERT` `RANGES(FromNum, ToNum) ``VALUES`
`    ``(2,5)  ,`
`    ``(5,7)  ,`
`    ``(10,20)`
`GO`
`SELECT` `* ``FROM` `RANGES`
`GO`

Notice that the the first range end exactly at the point that the second range start which mean we can group these two ranges (Integer-slots) together into one range 2-7, which is our goal today

In this specific case we can solve the issue directly without using "Gaps and Island" tricks, using the function LAG

`SELECT` `FromNum, ToNum, L = LAG(ToNum, 1, 0) OVER(``ORDER` `BY` `FromNum)`
`from` `RANGES`
`GO`

* Notice that for the default value I am using the minimum value that the column might include. In this sample I assume that all the values are positive, so I can use 0

Since we don't have overlap ranges, "L <= FromNum" will always true

Each time the column L has the same value as FromNum, it means that we are in the same range as previous range

We can use this in order to create a column which will be Zero each time we continue the same range and will be more than zero for new range.

`SELECT` `FromNum, ToNum, L = FromNum - (LAG(ToNum, 1, 0) OVER(``ORDER` `BY` `FromNum))`
`from` `RANGES`
`GO`

Using SUM we can get a column which present the groups of ranges

`;``With` `MyCTE01 ``as``(`
`    ``SELECT` `FromNum, ToNum, L = FromNum - (LAG(ToNum, 1, 0) OVER(``ORDER` `BY` `FromNum))`
`    ``from` `RANGES`
`)`
`SELECT` `FromNum, ToNum, GroupNUm = ``SUM``(L) OVER(``ORDER` `BY` `FromNum)`
`FROM` `MyCTE01`
`GO`

And all we need to do now, is to GROUP by the column "GroupNum" and select the MIN of the column FromNum as the start of the combined range and the MAX of the ToNum as the end of the range

`;``With` `MyCTE01 ``as``(`
`    ``SELECT` `FromNum, ToNum, L = FromNum - (LAG(ToNum, 1, 0) OVER(``ORDER` `BY` `FromNum))`
`    ``from` `RANGES`
`)`
`,MyCTE02 ``as` `(`
`    ``SELECT` `FromNum, ToNum, GroupNum = ``SUM``(L) OVER(``ORDER` `BY` `FromNum)`
`    ``FROM` `MyCTE01`
`)`
`SELECT` `MIN``(FromNum) FromNum, ``MAX``(ToNum) ToNum`
`FROM` `MyCTE02`
`GROUP` `BY` `GroupNum`
`GO`

First scenario solved :-)

## Case study 1.2

Let's add another condition to our previous scenario. In this case we have another column (ID INT) in the source table. Now, we want to find ranges for each ID separately.

`DROP` `TABLE` `IF EXISTS Ranges`
`GO`
`CREATE` `TABLE` `RANGES(ID ``INT``, FromNum ``INT``, ToNum ``INT``)`
`GO`
`INSERT` `RANGES(ID, FromNum, ToNum) ``VALUES`
`    ``(1,2,5),(1,5,7),(1,10,20),`
`    ``(2,3,4),(2,7,10),(2,11,21)`
`GO`
`-- The different is negltic, and wecan use the bellow solution`
`;``With` `MyCTE01 ``as``(`
`    ``SELECT` `ID, FromNum, ToNum, L = FromNum - (LAG(ToNum, 1, 0) OVER(PARTITION ``BY` `ID ``ORDER` `BY` `FromNum))`
`    ``from` `RANGES`
`)`
`,MyCTE02 ``as` `(`
`    ``SELECT` `ID, FromNum, ToNum, GroupNum = ``SUM``(L) OVER(PARTITION ``BY` `ID ``ORDER` `BY` `FromNum)`
`    ``FROM` `MyCTE01`
`)`
`SELECT` `ID, ``MIN``(FromNum) FromNum, ``MAX``(ToNum) ToNum`
`FROM` `MyCTE02`
`GROUP` `BY` `ID,GroupNum`
`ORDER` `BY` `ID,GroupNum`
`GO`

But what if we have ranges which overlap one another? The solution above cannot work since it based on comparing only the endpoints of the ranges.

# Grouping together Continuous Integer-slots

Everything I did will fail if the ranges can have overlapping data. For example let's add to the table "RANGES" the following row:

`INSERT` `RANGES(ID, FromNum, ToNum)`
`Values` `(1, 6, 12)`
`GO`
`SELECT``* ``FROM` `RANGES`
`GO`

In this case the new range 6-12 overlap the range 2-7 which we got in the previous solution (without this new row). In addition it overlap the range 10-20.Therefore, the result we want to get is that all these ranges will be marge into a single range 2-20

`;``With` `MyCTE01 ``as``(`
`    ``SELECT` `ID, FromNum, ToNum, L = FromNum - (LAG(ToNum, 1, 0) OVER(PARTITION ``BY` `ID ``ORDER` `BY` `FromNum))`
`    ``from` `RANGES`
`)`
`,MyCTE02 ``as` `(`
`    ``SELECT` `ID, FromNum, ToNum, GroupNum = ``SUM``(L) OVER(PARTITION ``BY` `ID ``ORDER` `BY` `FromNum)`
`    ``FROM` `MyCTE01`
`)`
`SELECT` `ID, ``MIN``(FromNum) FromNum, ``MAX``(ToNum) ToNum`
`FROM` `MyCTE02`
`GROUP` `BY` `ID,GroupNum`
`ORDER` `BY` `ID,GroupNum`
`GO`

Unfortunately, the solution does not fit this case

## Case study

we can solve this issue by (1) reading row by row -> (2) check the previous total range for each new row we read -> and (3) fix the total ranges according the one we add now.

This approach might fit well if you use programming languages like C# for example. It can be implemented using cursor for example in SQL Server, but the performance will awful. SQL Server is A tabular server which mean it works well (In theory) on SET of data and not row by row. In order to work with the data as SET we can choose to use the classic "Gaps & Islands" solutions.

We can split each range to find all the points inside (for example 2-4 includes the points 2,3,4). Next we will find the combined ranges using the points we have.

For the sake of the our solutions we need to use a numbers table. If you don't have one in your server, then there is a good chance that you have queries that can be improved. A number table is very useful for a lot of case! In this case, Let's create numbers table

`DROP` `TABLE` `IF EXISTS Numbers`
`CREATE` `TABLE` `Numbers (N ``BIGINT` `CONSTRAINT` `Numbers_PK ``PRIMARY` `KEY` `CLUSTERED(N)); `
`WITH`
`    ``L0   ``AS``(``SELECT` `1 ``AS` `C ``UNION` `ALL` `SELECT` `1 ``AS` `O), ``-- 2 rows`
`    ``L1   ``AS``(``SELECT` `1 ``AS` `C ``FROM` `L0 ``AS` `A ``CROSS` `JOIN` `L0 ``AS` `B), ``-- 4 rows`
`    ``L2   ``AS``(``SELECT` `1 ``AS` `C ``FROM` `L1 ``AS` `A ``CROSS` `JOIN` `L1 ``AS` `B), ``-- 16 rows`
`    ``L3   ``AS``(``SELECT` `1 ``AS` `C ``FROM` `L2 ``AS` `A ``CROSS` `JOIN` `L2 ``AS` `B), ``-- 256 rows`
`    ``L4   ``AS``(``SELECT` `1 ``AS` `C ``FROM` `L3 ``AS` `A ``CROSS` `JOIN` `L3 ``AS` `B), ``-- 65,536 rows`
`    ``L5   ``AS``(``SELECT` `1 ``AS` `C ``FROM` `L4 ``AS` `A ``CROSS` `JOIN` `L4 ``AS` `B),`
`    ``Nums ``AS``(``SELECT` `ROW_NUMBER() OVER(``ORDER` `BY` `(``SELECT` `NULL``)) ``AS` `N ``FROM` `L5)`
`INSERT` `INTO` `Numbers ``SELECT` `TOP` `1000000 N ``FROM` `Nums ``ORDER` `BY` `N;`
`GO`

Now, we can use the numbers table in order to split each range and present the points inside the range.

`SELECT` `ID, FromNum, ToNum, N`
`FROM` `RANGES r`
`INNER` `JOIN` `Numbers n  ``ON` `n.N ``between` `r.FromNum ``and` `r.ToNum`
`ORDER` `BY` `ID`
`GO`

Notice that each row was duplicated according to the number of points inside the range. We DO not need the original columns "FromNum" and "ToNum".

`SELECT` `DISTINCT` `ID, N`
`FROM` `RANGES r`
`INNER` `JOIN` `Numbers n  ``ON` `n.N ``between` `r.FromNum ``and` `r.ToNum`
`ORDER` `BY` `ID, N`
`GO`

And now, WE need to find the gaps in the points series.

This step is a classic "Gaps and Islands" problem!

First we will add a column that we can group by it. Notice that ROW_NUMBER with "PARTITION BY ID ORDER BY ID,N" will fit well for the solution:

`;``With` `MyCTE01 ``as` `(`
`    ``SELECT` `DISTINCT` `ID, N`
`    ``FROM` `RANGES r`
`    ``INNER` `JOIN` `Numbers n  ``ON` `n.N ``between` `r.FromNum ``and` `r.ToNum`
`)`
`SELECT` `DISTINCT` `ID, N, MyGroup = N - ROW_NUMBER()OVER(PARTITION ``BY` `ID ``ORDER` `BY` `ID,N)`
`from` `MyCTE01`
`ORDER` `BY` `ID, N`
`GO`

Notice the result! For ID = 1 all the rows have the same value for "MyGroup", but for the ID=2we have two different groups.

As before, All we need now is to combined ranges simply by using GROUP BY and selecting the MIN and MAX for each group

`;``With` `MyCTE01 ``as` `(`
`    ``SELECT` `DISTINCT` `ID, N`
`    ``FROM` `RANGES r`
`    ``INNER` `JOIN` `Numbers n  ``ON` `n.N ``between` `r.FromNum ``and` `r.ToNum`
`)`
`,MyCTE02 ``as``(`
`    ``SELECT` `DISTINCT` `ID, N, MyGroup = N - ROW_NUMBER()OVER(PARTITION ``BY` `ID ``ORDER` `BY` `ID,N)`
`    ``from` `MyCTE01`
`)`
`SELECT` `ID, ``MIN``(N) FromNum, ``MAX``(N) ToNum`
`FROM` `MyCTE02`
`GROUP` `BY` `ID, MyGroup`
`ORDER` `BY` `ID, FromNum`
`GO`

victory!

Or maybe not?!?

This solution based on combining the points which we found between the original ranges. But if the space between two ranges is less or equal the space between the point, then the ranges with the small space between them will be considered as one slot!

For example if we have ranges 1-3, 4-5 then we find integers 1,2,3 in the first range and 4,5 in the second range. The result will include the points 1,2,3,4,5. In this result it seems like we don't have any gap and the combined range will be 1-5

But in fact we have a gap between 3 and 4

The solution for this issue is simply to use points which have smaller distance then the the gaps between the ranges.

The solution above based on INTEGERS which mean it will work well for gaps that are larger than 1.

The solution for smaller gaps is to use decimal. And in our case we can use use resolution of 0.5

`DROP` `TABLE` `IF EXISTS T`
`GO`
`CREATE` `TABLE` `T (FromNum ``INT``, ToNum ``INT``)`
`INSERT` `T(FromNum, ToNum) ``VALUES` `(1,3), (4,5)`
`GO`

`-- Bad soluition, using points with distance 1 (integer)`
`;``With` `MyCTE01 ``as` `(`
`    ``SELECT` `DISTINCT` `N`
`    ``FROM` `T r`
`    ``INNER` `JOIN` `Numbers n  ``ON` `n.N ``between` `r.FromNum ``and` `r.ToNum`
`)`
`,MyCTE02 ``as``(`
`    ``SELECT` `DISTINCT` `N, MyGroup = N - ROW_NUMBER()OVER(``ORDER` `BY` `N)`
`    ``from` `MyCTE01`
`)`
`SELECT` `MIN``(N) FromNum, ``MAX``(N) ToNum`
`FROM` `MyCTE02`
`GROUP` `BY` `MyGroup`
`ORDER` `BY` `FromNum`
`GO``-- solution 1-5`

# Grouping together Continuous Deciaml-slots

Working solution! using numbers with 0.5 distance will solve the issue. Let's create DECIMAL numbers table which include numbers with 1/2 distance.

`SELECT` `TOP` `10000 N = ``CONVERT``(``DECIMAL``(11,2),N)/2`
`    ``INTO` `NumbersDecimal`
`FROM` `Numbers`
`ORDER` `BY` `N`
`GO`
`CREATE` `CLUSTERED ``INDEX` `IX_N ``ON` `NumbersDecimal (N)`
`GO`
`SELECT` `TOP` `1000 * ``FROM` `NumbersDecimal`
`GO`

and now we can use the exact same logic for solution

`;With MyCTE01 ``as` `(`
`    ``SELECT DISTINCT N`
`    ``FROM T r`
`    ``INNER JOIN NumbersDecimal n  ON n.N between r.FromNum ``and` `r.ToNum`
`)`
`,MyCTE02 ``as``(`
`    ``SELECT DISTINCT N, MyGroup = N - CONVERT(DECIMAL(11,2),ROW_NUMBER()OVER(ORDER BY N))/2`
`    ``from MyCTE01`
`)`
`SELECT MIN(N) FromNum, MAX(N) ToNum`
`FROM MyCTE02`
`GROUP BY MyGroup`
`ORDER BY FromNum`
`GO`

And this is the real VICTORY!

The Conclusion and the most important point in this post, is that in order to use this approach, the resolution of the points should be smaller than the slots in the ranges!

# Grouping together Continuous Time-slots

Note! for this demo I will use the the DDL+DML which was provided in the original question,which mean that in fact,here I will present an optional solution for the question using the "Gaps and Islands" approach.

Note! Once more I want to repeat myself! The solution which presented here is not the one which I recommend to use in this case. This solution address the question using a well known approach named "Gaps and Islands" and might fit for most cases. In my next post I will present a totally different approach using my personal trick, which can improve performance dramatically in some scenario.

`DROP` `TABLE` `IF EXISTS TEST`
`GO`
`CREATE` `TABLE` `TEST (ID ``int``, tFrom datetime, tUntil dateTime)`
`insert` `into` `TEST ``Values` `(1,``'2019-1-1 12:00'``, ``'2019-1-1 13:00'``)`
`insert` `into` `TEST ``Values` `(1,``'2019-1-1 13:00'``, ``'2019-1-1 14:00'``)`
`insert` `into` `TEST ``Values` `(1,``'2019-1-1 14:00'``, ``'2019-1-1 16:00'``)`
`insert` `into` `TEST ``Values` `(1,``'2019-1-1 18:00'``, ``'2019-1-1 19:00'``)`
`insert` `into` `TEST ``Values` `(1,``'2019-1-1 19:00'``, ``'2019-1-1 20:00'``)`
`insert` `into` `TEST ``Values` `(1,``'2019-1-1 20:00'``, ``'2019-1-1 21:00'``)`
`insert` `into` `TEST ``Values` `(1,``'2019-1-1 22:00'``, ``'2019-1-1 23:00'``)`
`insert` `into` `TEST ``Values` `(2,``'2019-1-1 12:00'``, ``'2019-1-1 13:00'``)`
`insert` `into` `TEST ``Values` `(2,``'2019-1-1 13:00'``, ``'2019-1-1 14:00'``)`
`insert` `into` `TEST ``Values` `(2,``'2019-1-1 14:00'``, ``'2019-1-1 16:00'``)`
`insert` `into` `TEST ``Values` `(2,``'2019-1-1 18:00'``, ``'2019-1-1 19:00'``)`
`insert` `into` `TEST ``Values` `(2,``'2019-1-1 19:00'``, ``'2019-1-1 20:00'``)`
`insert` `into` `TEST ``Values` `(2,``'2019-1-1 20:00'``, ``'2019-1-1 21:00'``)`
`insert` `into` `TEST ``Values` `(2,``'2019-1-1 22:00'``, ``'2019-1-1 23:00'``)`
`GO`
`SELECT` `ID, tFrom, tUntil`
`FROM` `TEST`

Note! We can convert the DATETIME values into INTEGERS andsimply use the solution for "Grouping together Continuous Deciaml-slots".

`SELECT` `ID, tFrom, tUntil, DATEDIFF_BIG(``MINUTE``, ``'2018-01-01'``, tFrom) s, DATEDIFF_BIG(``MINUTE``, ``'2018-01-01'``, tUntil) e`
`FROM` `TEST`

This will work well if our entire data is limited to a small range of dates, for example using the date "2018-01-01" as the smaller date. unfortunately, using MINUTE as resolution with older date will enforce us to use range of numbers that come to almost 100 million.

Moreover, think about the option that we need to use higher resolution like seconds or milliseconds...

To solve this issue, I will use a small change in the code, and I will use a Times table which include the points of times which we want to use according to the resolution which we need.

In the case presented in the original question, the slot can be one hour or more and I remind you that we need to use higher resolution. Therefore, I will use resolution of 10 minutes, and starting time at 2010-01-01.

`DROP` `TABLE` `IF EXISTS Times`
`GO`
`SELECT` `DT = DATEADD(``MINUTE``, N*10, ``'2010-01-01'``)`
`    ``INTO` `Times `
`FROM` `Numbers`
`GO`
`CREATE` `CLUSTERED ``INDEX` `IX_DT ``ON` `Times(DT)`
`GO`
`SELECT` `TOP` `1000 DT ``from` `Times`
`GO`

`SELECT` `ID, tFrom, tUntil, DT`
`FROM` `TEST t`
`INNER` `JOIN` `Times dt ``ON` `DT ``between` `tFrom ``and` `tUntil`
`GO`

`;``With` `MyCTE01 ``as` `(`
`    ``SELECT` `DISTINCT` `ID, DT`
`    ``FROM` `TEST t`
`    ``INNER` `JOIN` `Times dt ``ON` `DT ``between` `tFrom ``and` `tUntil`
`)`
`,MyCTE02 ``as``(`
`    ``SELECT` `ID, DT, `
`        ``MyGroup = `
`            ``DATEDIFF(``MINUTE``,`
`                ``DATEADD(``MINUTE``, 10 * ROW_NUMBER()OVER(PARTITION ``BY` `ID ``ORDER` `BY` `ID,DT),0), `
`                ``DT`
`            ``)`
`    ``from` `MyCTE01`
`    ``--order by ID,DT`
`)`
`SELECT` `ID, ``MIN``(DT) tFrom, ``MAX``(DT) tUntil`
`FROM` `MyCTE02`
`GROUP` `BY` `ID, MyGroup`
`ORDER` `BY` `ID, tFrom`
`GO`

Done!

You are welcome to test it

.

# Resources and More information

Categories: SQL

## Read More:

Top Microsoft TechNet articles by Ronen Ariely