en-UShe-IL
You are here:   Blog
Register   |  Login

Blog Archive:

Maximize
* Can be used in order to search for older blogs Entries

Search in blogs


Blog Categories:

Maximize
* Can be used in order to search for blogs Entries by Categories

Blog Tags:

Maximize
* Can be used in order to search for blogs by keywords

TNWikiSummit


Awared MVP 


 


Microsoft® Community Contributor 


Microsoft® Community Contributor


 Read first, before you you use the blog! Maximize
מרץ25

Written by: ronen ariely
25/03/2019 13:19 RssIcon

Introduction

Note! This post is draft, which I post fast as answer to question which asked at stackoverflow. Since I don't have time this week to prepare a well formatted post with images, full explanation, and well formatted text, but at the same time I want to provide a solution for the sake of these who asked for it, therefore I publish this as a first draft and provide the code for one of two solutions I will discuss.

Note! The solution which presented here is not the one which I recommend to use in this case. This solution address the question using a well known approach named "Gaps and Islands" and might fit for most cases. In my next post I will present a totally different approach using my personal trick, which can improve performance dramatically in some scenario.

So what do we have here? The post include code and some comments to solve several scenario of Grouping continuous Ranges together.

I will start with the simplest case which is ranges of integers, for example 2-4, 6-8, 8-10, 13-14 which should be grouped into 2-4, 6-10, 13-14.

Next I will move to explain issue related to the resolution of space between the ranges, and I will go to present a solution for Ranges of Decimal numbers.

Finally, using the solution which I presented in detail for INTEGERS I will present a solution for "Grouping continuous time-slots together", which was the original question in the forum.

Let's go dive into the code.

Grouping together Continuous Integer-slots which does not overlap each other

Note! Please go step by step and execute each query in order to understand the topic. As I explained at this time I first publish the code as a fast solution and I do not add images like what the query returns. I will assume that you executed the code in my comments!

Case Study 1.1

In our first scenario we assume that the slots does not include overlap points. By overlap points I mean that a point exists in more than one slot, for example 2-5, 4-7, where the range 4-5 exists in two slots  
DROP TABLE IF EXISTS Ranges
GO
CREATE TABLE RANGES(FromNum INT, ToNum INT)
GO
INSERT RANGES(FromNum, ToNum) VALUES
    (2,5)  ,
    (5,7)  ,
    (10,20)
GO
SELECT * FROM RANGES
GO

Notice that the the first range end exactly at the point that the second range start which mean we can group these two ranges (Integer-slots) together into one range 2-7, which is our goal today

In this specific case we can solve the issue directly without using "Gaps and Island" tricks, using the function LAG

SELECT FromNum, ToNum, L = LAG(ToNum, 1, 0) OVER(ORDER BY FromNum)
from RANGES
GO

* Notice that for the default value I am using the minimum value that the column might include. In this sample I assume that all the values are positive, so I can use 0

Since we don't have overlap ranges, "L <= FromNum" will always true 

Each time the column L has the same value as FromNum, it means that we are in the same range as previous range

We can use this in order to create a column which will be Zero each time we continue the same range and will be more than zero for new range.

SELECT FromNum, ToNum, L = FromNum - (LAG(ToNum, 1, 0) OVER(ORDER BY FromNum))
from RANGES
GO

Using SUM we can get a column which present the groups of ranges

;With MyCTE01 as(
    SELECT FromNum, ToNum, L = FromNum - (LAG(ToNum, 1, 0) OVER(ORDER BY FromNum))
    from RANGES
)
SELECT FromNum, ToNum, GroupNUm = SUM(L) OVER(ORDER BY FromNum)
FROM MyCTE01
GO

And all we need to do now, is to GROUP by the column "GroupNum" and select the MIN of the column FromNum as the start of the combined range and the MAX of the ToNum as the end of the range

;With MyCTE01 as(
    SELECT FromNum, ToNum, L = FromNum - (LAG(ToNum, 1, 0) OVER(ORDER BY FromNum))
    from RANGES
)
,MyCTE02 as (
    SELECT FromNum, ToNum, GroupNum = SUM(L) OVER(ORDER BY FromNum)
    FROM MyCTE01
)
SELECT MIN(FromNum) FromNum, MAX(ToNum) ToNum
FROM MyCTE02
GROUP BY GroupNum
GO

First scenario solved :-)

Case study 1.2

Let's add another condition to our previous scenario. In this case we have another column (ID INT) in the source table. Now, we want to find ranges for each ID separately.

DROP TABLE IF EXISTS Ranges
GO
CREATE TABLE RANGES(ID INT, FromNum INT, ToNum INT)
GO
INSERT RANGES(ID, FromNum, ToNum) VALUES
    (1,2,5),(1,5,7),(1,10,20),
    (2,3,4),(2,7,10),(2,11,21)
GO
-- The different is negltic, and wecan use the bellow solution
;With MyCTE01 as(
    SELECT ID, FromNum, ToNum, L = FromNum - (LAG(ToNum, 1, 0) OVER(PARTITION BY ID ORDER BY FromNum))
    from RANGES
)
,MyCTE02 as (
    SELECT ID, FromNum, ToNum, GroupNum = SUM(L) OVER(PARTITION BY ID ORDER BY FromNum)
    FROM MyCTE01
)
SELECT ID, MIN(FromNum) FromNum, MAX(ToNum) ToNum
FROM MyCTE02
GROUP BY ID,GroupNum
ORDER BY ID,GroupNum
GO

But what if we have ranges which overlap one another? The solution above cannot work since it based on comparing only the endpoints of the ranges.

Grouping together Continuous Integer-slots

Everything I did will fail if the ranges can have overlapping data. For example let's add to the table "RANGES" the following row:

INSERT RANGES(ID, FromNum, ToNum)
Values (1, 6, 12)
GO
SELECT* FROM RANGES
GO

In this case the new range 6-12 overlap the range 2-7 which we got in the previous solution (without this new row). In addition it overlap the range 10-20.Therefore, the result we want to get is that all these ranges will be marge into a single range 2-20

;With MyCTE01 as(
    SELECT ID, FromNum, ToNum, L = FromNum - (LAG(ToNum, 1, 0) OVER(PARTITION BY ID ORDER BY FromNum))
    from RANGES
)
,MyCTE02 as (
    SELECT ID, FromNum, ToNum, GroupNum = SUM(L) OVER(PARTITION BY ID ORDER BY FromNum)
    FROM MyCTE01
)
SELECT ID, MIN(FromNum) FromNum, MAX(ToNum) ToNum
FROM MyCTE02
GROUP BY ID,GroupNum
ORDER BY ID,GroupNum
GO

Unfortunately, the solution does not fit this case 

Case study

we can solve this issue by (1) reading row by row -> (2) check the previous total range for each new row we read -> and (3) fix the total ranges according the one we add now.

This approach might fit well if you use programming languages like C# for example. It can be implemented using cursor for example in SQL Server, but the performance will awful. SQL Server is A tabular server which mean it works well (In theory) on SET of data and not row by row. In order to work with the data as SET we can choose to use the classic "Gaps & Islands" solutions. 

We can split each range to find all the points inside (for example 2-4 includes the points 2,3,4). Next we will find the combined ranges using the points we have.

For the sake of the our solutions we need to use a numbers table. If you don't have one in your server, then there is a good chance that you have queries that can be improved. A number table is very useful for a lot of case! In this case, Let's create numbers table

DROP TABLE IF EXISTS Numbers
CREATE TABLE Numbers (N BIGINT CONSTRAINT Numbers_PK PRIMARY KEY CLUSTERED(N));
WITH
    L0   AS(SELECT 1 AS C UNION ALL SELECT 1 AS O), -- 2 rows
    L1   AS(SELECT 1 AS C FROM L0 AS A CROSS JOIN L0 AS B), -- 4 rows
    L2   AS(SELECT 1 AS C FROM L1 AS A CROSS JOIN L1 AS B), -- 16 rows
    L3   AS(SELECT 1 AS C FROM L2 AS A CROSS JOIN L2 AS B), -- 256 rows
    L4   AS(SELECT 1 AS C FROM L3 AS A CROSS JOIN L3 AS B), -- 65,536 rows
    L5   AS(SELECT 1 AS C FROM L4 AS A CROSS JOIN L4 AS B),
    Nums AS(SELECT ROW_NUMBER() OVER(ORDER BY (SELECT NULL)) AS N FROM L5)
INSERT INTO Numbers SELECT TOP 1000000 N FROM Nums ORDER BY N;
GO

Now, we can use the numbers table in order to split each range and present the points inside the range.

SELECT ID, FromNum, ToNum, N
FROM RANGES r
INNER JOIN Numbers n  ON n.N between r.FromNum and r.ToNum
ORDER BY ID
GO

Notice that each row was duplicated according to the number of points inside the range. We DO not need the original columns "FromNum" and "ToNum".

SELECT DISTINCT ID, N
FROM RANGES r
INNER JOIN Numbers n  ON n.N between r.FromNum and r.ToNum
ORDER BY ID, N
GO

And now, WE need to find the gaps in the points series.

This step is a classic "Gaps and Islands" problem!

First we will add a column that we can group by it. Notice that ROW_NUMBER with "PARTITION BY ID ORDER BY ID,N" will fit well for the solution:

;With MyCTE01 as (
    SELECT DISTINCT ID, N
    FROM RANGES r
    INNER JOIN Numbers n  ON n.N between r.FromNum and r.ToNum
)
SELECT DISTINCT ID, N, MyGroup = N - ROW_NUMBER()OVER(PARTITION BY ID ORDER BY ID,N)
from MyCTE01
ORDER BY ID, N
GO

Notice the result! For ID = 1 all the rows have the same value for "MyGroup", but for the ID=2we have two different groups.

As before, All we need now is to combined ranges simply by using GROUP BY and selecting the MIN and MAX for each group

;With MyCTE01 as (
    SELECT DISTINCT ID, N
    FROM RANGES r
    INNER JOIN Numbers n  ON n.N between r.FromNum and r.ToNum
)
,MyCTE02 as(
    SELECT DISTINCT ID, N, MyGroup = N - ROW_NUMBER()OVER(PARTITION BY ID ORDER BY ID,N)
    from MyCTE01
)
SELECT ID, MIN(N) FromNum, MAX(N) ToNum
FROM MyCTE02
GROUP BY ID, MyGroup
ORDER BY ID, FromNum
GO

victory!

Or maybe not?!?

This solution based on combining the points which we found between the original ranges. But if the space between two ranges is less or equal the space between the point, then the ranges with the small space between them will be considered as one slot! 

For example if we have ranges 1-3, 4-5 then we find integers 1,2,3 in the first range and 4,5 in the second range. The result will include the points 1,2,3,4,5. In this result it seems like we don't have any gap and the combined range will be 1-5 

But in fact we have a gap between 3 and 4

The solution for this issue is simply to use points which have smaller distance then the the gaps between the ranges.

The solution above based on INTEGERS which mean it will work well for gaps that are larger than 1.

The solution for smaller gaps is to use decimal. And in our case we can use use resolution of 0.5

DROP TABLE IF EXISTS T
GO
CREATE TABLE T (FromNum INT, ToNum INT)
INSERT T(FromNum, ToNum) VALUES (1,3), (4,5)
GO
 
-- Bad soluition, using points with distance 1 (integer)
;With MyCTE01 as (
    SELECT DISTINCT N
    FROM T r
    INNER JOIN Numbers n  ON n.N between r.FromNum and r.ToNum
)
,MyCTE02 as(
    SELECT DISTINCT N, MyGroup = N - ROW_NUMBER()OVER(ORDER BY N)
    from MyCTE01
)
SELECT MIN(N) FromNum, MAX(N) ToNum
FROM MyCTE02
GROUP BY MyGroup
ORDER BY FromNum
GO-- solution 1-5

 

Grouping together Continuous Deciaml-slots

Working solution! using numbers with 0.5 distance will solve the issue. Let's create DECIMAL numbers table which include numbers with 1/2 distance.

SELECT TOP 10000 N = CONVERT(DECIMAL(11,2),N)/2
    INTO NumbersDecimal
FROM Numbers
ORDER BY N
GO
CREATE CLUSTERED INDEX IX_N ON NumbersDecimal (N)
GO
SELECT TOP 1000 * FROM NumbersDecimal
GO

and now we can use the exact same logic for solution

;With MyCTE01 as (
    SELECT DISTINCT N
    FROM T r
    INNER JOIN NumbersDecimal n  ON n.N between r.FromNum and r.ToNum
)
,MyCTE02 as(
    SELECT DISTINCT N, MyGroup = N - CONVERT(DECIMAL(11,2),ROW_NUMBER()OVER(ORDER BY N))/2
    from MyCTE01
)
SELECT MIN(N) FromNum, MAX(N) ToNum
FROM MyCTE02
GROUP BY MyGroup
ORDER BY FromNum
GO

And this is the real VICTORY!

The Conclusion and the most important point in this post, is that in order to use this approach, the resolution of the points should be smaller than the slots in the ranges!

Grouping together Continuous Time-slots

Note! for this demo I will use the the DDL+DML which was provided in the original question,which mean that in fact,here I will present an optional solution for the question using the "Gaps and Islands" approach.

Note! Once more I want to repeat myself! The solution which presented here is not the one which I recommend to use in this case. This solution address the question using a well known approach named "Gaps and Islands" and might fit for most cases. In my next post I will present a totally different approach using my personal trick, which can improve performance dramatically in some scenario.

DROP TABLE IF EXISTS TEST
GO
CREATE TABLE TEST (ID int, tFrom datetime, tUntil dateTime)
insert into TEST Values (1,'2019-1-1 12:00', '2019-1-1 13:00')
insert into TEST Values (1,'2019-1-1 13:00', '2019-1-1 14:00')
insert into TEST Values (1,'2019-1-1 14:00', '2019-1-1 16:00')
insert into TEST Values (1,'2019-1-1 18:00', '2019-1-1 19:00')
insert into TEST Values (1,'2019-1-1 19:00', '2019-1-1 20:00')
insert into TEST Values (1,'2019-1-1 20:00', '2019-1-1 21:00')
insert into TEST Values (1,'2019-1-1 22:00', '2019-1-1 23:00')
insert into TEST Values (2,'2019-1-1 12:00', '2019-1-1 13:00')
insert into TEST Values (2,'2019-1-1 13:00', '2019-1-1 14:00')
insert into TEST Values (2,'2019-1-1 14:00', '2019-1-1 16:00')
insert into TEST Values (2,'2019-1-1 18:00', '2019-1-1 19:00')
insert into TEST Values (2,'2019-1-1 19:00', '2019-1-1 20:00')
insert into TEST Values (2,'2019-1-1 20:00', '2019-1-1 21:00')
insert into TEST Values (2,'2019-1-1 22:00', '2019-1-1 23:00')
GO
SELECT ID, tFrom, tUntil
FROM TEST

Note! We can convert the DATETIME values into INTEGERS andsimply use the solution for "Grouping together Continuous Deciaml-slots".

SELECT ID, tFrom, tUntil, DATEDIFF_BIG(MINUTE, '2018-01-01', tFrom) s, DATEDIFF_BIG(MINUTE, '2018-01-01', tUntil) e
FROM TEST

This will work well if our entire data is limited to a small range of dates, for example using the date "2018-01-01" as the smaller date. unfortunately, using MINUTE as resolution with older date will enforce us to use range of numbers that come to almost 100 million.

Moreover, think about the option that we need to use higher resolution like seconds or milliseconds...

To solve this issue, I will use a small change in the code, and I will use a Times table which include the points of times which we want to use according to the resolution which we need.

In the case presented in the original question, the slot can be one hour or more and I remind you that we need to use higher resolution. Therefore, I will use resolution of 10 minutes, and starting time at 2010-01-01.

DROP TABLE IF EXISTS Times
GO
SELECT DT = DATEADD(MINUTE, N*10, '2010-01-01')
    INTO Times
FROM Numbers
GO
CREATE CLUSTERED INDEX IX_DT ON Times(DT)
GO
SELECT TOP 1000 DT from Times
GO

 

SELECT ID, tFrom, tUntil, DT
FROM TEST t
INNER JOIN Times dt ON DT between tFrom and tUntil
GO

 

;With MyCTE01 as (
    SELECT DISTINCT ID, DT
    FROM TEST t
    INNER JOIN Times dt ON DT between tFrom and tUntil
)
,MyCTE02 as(
    SELECT ID, DT,
        MyGroup =
            DATEDIFF(MINUTE,
                DATEADD(MINUTE, 10 * ROW_NUMBER()OVER(PARTITION BY ID ORDER BY ID,DT),0),
                DT
            )
    from MyCTE01
    --order by ID,DT
)
SELECT ID, MIN(DT) tFrom, MAX(DT) tUntil
FROM MyCTE02
GROUP BY ID, MyGroup
ORDER BY ID, tFrom
GO

Done!

You are welcome to test it

Conclusions: Advantages, disadvantages, and limitations

.

Resources and More information