מרץ25
Written by:
ronen ariely
25/03/2019 13:19
Introduction
Note! This post is draft, which I post fast as answer to question which asked at stackoverflow. Since I don't have time this week to prepare a well formatted post with images, full explanation, and well formatted text, but at the same time I want to provide a solution for the sake of these who asked for it, therefore I publish this as a first draft and provide the code for one of two solutions I will discuss.
Note! The solution which presented here is not the one which I recommend to use in this case. This solution address the question using a well known approach named "Gaps and Islands" and might fit for most cases. In my next post I will present a totally different approach using my personal trick, which can improve performance dramatically in some scenario.
So what do we have here? The post include code and some comments to solve several scenario of Grouping continuous Ranges together.
I will start with the simplest case which is ranges of integers, for example 2-4, 6-8, 8-10, 13-14 which should be grouped into 2-4, 6-10, 13-14.
Next I will move to explain issue related to the resolution of space between the ranges, and I will go to present a solution for Ranges of Decimal numbers.
Finally, using the solution which I presented in detail for INTEGERS I will present a solution for "Grouping continuous time-slots together", which was the original question in the forum.
Let's go dive into the code.
Grouping together Continuous Integer-slots which does not overlap each other
Note! Please go step by step and execute each query in order to understand the topic. As I explained at this time I first publish the code as a fast solution and I do not add images like what the query returns. I will assume that you executed the code in my comments!
Case Study 1.1
In our first scenario we assume that the slots does not include overlap points. By overlap points I mean that a point exists in more than one slot, for example 2-5, 4-7, where the range 4-5 exists in two slots
DROP
TABLE
IF EXISTS Ranges
GO
CREATE
TABLE
RANGES(FromNum
INT
, ToNum
INT
)
GO
INSERT
RANGES(FromNum, ToNum)
VALUES
(2,5) ,
(5,7) ,
(10,20)
GO
SELECT
*
FROM
RANGES
GO
Notice that the the first range end exactly at the point that the second range start which mean we can group these two ranges (Integer-slots) together into one range 2-7, which is our goal today
In this specific case we can solve the issue directly without using "Gaps and Island" tricks, using the function LAG
SELECT
FromNum, ToNum, L = LAG(ToNum, 1, 0) OVER(
ORDER
BY
FromNum)
from
RANGES
GO
* Notice that for the default value I am using the minimum value that the column might include. In this sample I assume that all the values are positive, so I can use 0
Since we don't have overlap ranges, "L <= FromNum" will always true
Each time the column L has the same value as FromNum, it means that we are in the same range as previous range
We can use this in order to create a column which will be Zero each time we continue the same range and will be more than zero for new range.
SELECT
FromNum, ToNum, L = FromNum - (LAG(ToNum, 1, 0) OVER(
ORDER
BY
FromNum))
from
RANGES
GO
Using SUM we can get a column which present the groups of ranges
;
With
MyCTE01
as
(
SELECT
FromNum, ToNum, L = FromNum - (LAG(ToNum, 1, 0) OVER(
ORDER
BY
FromNum))
from
RANGES
)
SELECT
FromNum, ToNum, GroupNUm =
SUM
(L) OVER(
ORDER
BY
FromNum)
FROM
MyCTE01
GO
And all we need to do now, is to GROUP by the column "GroupNum" and select the MIN of the column FromNum as the start of the combined range and the MAX of the ToNum as the end of the range
;
With
MyCTE01
as
(
SELECT
FromNum, ToNum, L = FromNum - (LAG(ToNum, 1, 0) OVER(
ORDER
BY
FromNum))
from
RANGES
)
,MyCTE02
as
(
SELECT
FromNum, ToNum, GroupNum =
SUM
(L) OVER(
ORDER
BY
FromNum)
FROM
MyCTE01
)
SELECT
MIN
(FromNum) FromNum,
MAX
(ToNum) ToNum
FROM
MyCTE02
GROUP
BY
GroupNum
GO
First scenario solved :-)
Case study 1.2
Let's add another condition to our previous scenario. In this case we have another column (ID INT) in the source table. Now, we want to find ranges for each ID separately.
DROP
TABLE
IF EXISTS Ranges
GO
CREATE
TABLE
RANGES(ID
INT
, FromNum
INT
, ToNum
INT
)
GO
INSERT
RANGES(ID, FromNum, ToNum)
VALUES
(1,2,5),(1,5,7),(1,10,20),
(2,3,4),(2,7,10),(2,11,21)
GO
-- The different is negltic, and wecan use the bellow solution
;
With
MyCTE01
as
(
SELECT
ID, FromNum, ToNum, L = FromNum - (LAG(ToNum, 1, 0) OVER(PARTITION
BY
ID
ORDER
BY
FromNum))
from
RANGES
)
,MyCTE02
as
(
SELECT
ID, FromNum, ToNum, GroupNum =
SUM
(L) OVER(PARTITION
BY
ID
ORDER
BY
FromNum)
FROM
MyCTE01
)
SELECT
ID,
MIN
(FromNum) FromNum,
MAX
(ToNum) ToNum
FROM
MyCTE02
GROUP
BY
ID,GroupNum
ORDER
BY
ID,GroupNum
GO
But what if we have ranges which overlap one another? The solution above cannot work since it based on comparing only the endpoints of the ranges.
Grouping together Continuous Integer-slots
Everything I did will fail if the ranges can have overlapping data. For example let's add to the table "RANGES" the following row:
INSERT
RANGES(ID, FromNum, ToNum)
Values
(1, 6, 12)
GO
SELECT
*
FROM
RANGES
GO
In this case the new range 6-12 overlap the range 2-7 which we got in the previous solution (without this new row). In addition it overlap the range 10-20.Therefore, the result we want to get is that all these ranges will be marge into a single range 2-20
;
With
MyCTE01
as
(
SELECT
ID, FromNum, ToNum, L = FromNum - (LAG(ToNum, 1, 0) OVER(PARTITION
BY
ID
ORDER
BY
FromNum))
from
RANGES
)
,MyCTE02
as
(
SELECT
ID, FromNum, ToNum, GroupNum =
SUM
(L) OVER(PARTITION
BY
ID
ORDER
BY
FromNum)
FROM
MyCTE01
)
SELECT
ID,
MIN
(FromNum) FromNum,
MAX
(ToNum) ToNum
FROM
MyCTE02
GROUP
BY
ID,GroupNum
ORDER
BY
ID,GroupNum
GO
Unfortunately, the solution does not fit this case
Case study
we can solve this issue by (1) reading row by row -> (2) check the previous total range for each new row we read -> and (3) fix the total ranges according the one we add now.
This approach might fit well if you use programming languages like C# for example. It can be implemented using cursor for example in SQL Server, but the performance will awful. SQL Server is A tabular server which mean it works well (In theory) on SET of data and not row by row. In order to work with the data as SET we can choose to use the classic "Gaps & Islands" solutions.
We can split each range to find all the points inside (for example 2-4 includes the points 2,3,4). Next we will find the combined ranges using the points we have.
For the sake of the our solutions we need to use a numbers table. If you don't have one in your server, then there is a good chance that you have queries that can be improved. A number table is very useful for a lot of case! In this case, Let's create numbers table
DROP
TABLE
IF EXISTS Numbers
CREATE
TABLE
Numbers (N
BIGINT
CONSTRAINT
Numbers_PK
PRIMARY
KEY
CLUSTERED(N));
WITH
L0
AS
(
SELECT
1
AS
C
UNION
ALL
SELECT
1
AS
O),
-- 2 rows
L1
AS
(
SELECT
1
AS
C
FROM
L0
AS
A
CROSS
JOIN
L0
AS
B),
-- 4 rows
L2
AS
(
SELECT
1
AS
C
FROM
L1
AS
A
CROSS
JOIN
L1
AS
B),
-- 16 rows
L3
AS
(
SELECT
1
AS
C
FROM
L2
AS
A
CROSS
JOIN
L2
AS
B),
-- 256 rows
L4
AS
(
SELECT
1
AS
C
FROM
L3
AS
A
CROSS
JOIN
L3
AS
B),
-- 65,536 rows
L5
AS
(
SELECT
1
AS
C
FROM
L4
AS
A
CROSS
JOIN
L4
AS
B),
Nums
AS
(
SELECT
ROW_NUMBER() OVER(
ORDER
BY
(
SELECT
NULL
))
AS
N
FROM
L5)
INSERT
INTO
Numbers
SELECT
TOP
1000000 N
FROM
Nums
ORDER
BY
N;
GO
Now, we can use the numbers table in order to split each range and present the points inside the range.
SELECT
ID, FromNum, ToNum, N
FROM
RANGES r
INNER
JOIN
Numbers n
ON
n.N
between
r.FromNum
and
r.ToNum
ORDER
BY
ID
GO
Notice that each row was duplicated according to the number of points inside the range. We DO not need the original columns "FromNum" and "ToNum".
SELECT
DISTINCT
ID, N
FROM
RANGES r
INNER
JOIN
Numbers n
ON
n.N
between
r.FromNum
and
r.ToNum
ORDER
BY
ID, N
GO
And now, WE need to find the gaps in the points series.
This step is a classic "Gaps and Islands" problem!
First we will add a column that we can group by it. Notice that ROW_NUMBER with "PARTITION BY ID ORDER BY ID,N" will fit well for the solution:
;
With
MyCTE01
as
(
SELECT
DISTINCT
ID, N
FROM
RANGES r
INNER
JOIN
Numbers n
ON
n.N
between
r.FromNum
and
r.ToNum
)
SELECT
DISTINCT
ID, N, MyGroup = N - ROW_NUMBER()OVER(PARTITION
BY
ID
ORDER
BY
ID,N)
from
MyCTE01
ORDER
BY
ID, N
GO
Notice the result! For ID = 1 all the rows have the same value for "MyGroup", but for the ID=2we have two different groups.
As before, All we need now is to combined ranges simply by using GROUP BY and selecting the MIN and MAX for each group
;
With
MyCTE01
as
(
SELECT
DISTINCT
ID, N
FROM
RANGES r
INNER
JOIN
Numbers n
ON
n.N
between
r.FromNum
and
r.ToNum
)
,MyCTE02
as
(
SELECT
DISTINCT
ID, N, MyGroup = N - ROW_NUMBER()OVER(PARTITION
BY
ID
ORDER
BY
ID,N)
from
MyCTE01
)
SELECT
ID,
MIN
(N) FromNum,
MAX
(N) ToNum
FROM
MyCTE02
GROUP
BY
ID, MyGroup
ORDER
BY
ID, FromNum
GO
victory!
Or maybe not?!?
This solution based on combining the points which we found between the original ranges. But if the space between two ranges is less or equal the space between the point, then the ranges with the small space between them will be considered as one slot!
For example if we have ranges 1-3, 4-5 then we find integers 1,2,3 in the first range and 4,5 in the second range. The result will include the points 1,2,3,4,5. In this result it seems like we don't have any gap and the combined range will be 1-5
But in fact we have a gap between 3 and 4
The solution for this issue is simply to use points which have smaller distance then the the gaps between the ranges.
The solution above based on INTEGERS which mean it will work well for gaps that are larger than 1.
The solution for smaller gaps is to use decimal. And in our case we can use use resolution of 0.5
DROP
TABLE
IF EXISTS T
GO
CREATE
TABLE
T (FromNum
INT
, ToNum
INT
)
INSERT
T(FromNum, ToNum)
VALUES
(1,3), (4,5)
GO
-- Bad soluition, using points with distance 1 (integer)
;
With
MyCTE01
as
(
SELECT
DISTINCT
N
FROM
T r
INNER
JOIN
Numbers n
ON
n.N
between
r.FromNum
and
r.ToNum
)
,MyCTE02
as
(
SELECT
DISTINCT
N, MyGroup = N - ROW_NUMBER()OVER(
ORDER
BY
N)
from
MyCTE01
)
SELECT
MIN
(N) FromNum,
MAX
(N) ToNum
FROM
MyCTE02
GROUP
BY
MyGroup
ORDER
BY
FromNum
GO
-- solution 1-5
Grouping together Continuous Deciaml-slots
Working solution! using numbers with 0.5 distance will solve the issue. Let's create DECIMAL numbers table which include numbers with 1/2 distance.
SELECT
TOP
10000 N =
CONVERT
(
DECIMAL
(11,2),N)/2
INTO
NumbersDecimal
FROM
Numbers
ORDER
BY
N
GO
CREATE
CLUSTERED
INDEX
IX_N
ON
NumbersDecimal (N)
GO
SELECT
TOP
1000 *
FROM
NumbersDecimal
GO
and now we can use the exact same logic for solution
;With MyCTE01
as
(
SELECT DISTINCT N
FROM T r
INNER JOIN NumbersDecimal n ON n.N between r.FromNum
and
r.ToNum
)
,MyCTE02
as
(
SELECT DISTINCT N, MyGroup = N - CONVERT(DECIMAL(11,2),ROW_NUMBER()OVER(ORDER BY N))/2
from MyCTE01
)
SELECT MIN(N) FromNum, MAX(N) ToNum
FROM MyCTE02
GROUP BY MyGroup
ORDER BY FromNum
GO
And this is the real VICTORY!
The Conclusion and the most important point in this post, is that in order to use this approach, the resolution of the points should be smaller than the slots in the ranges!
Grouping together Continuous Time-slots
Note! for this demo I will use the the DDL+DML which was provided in the original question,which mean that in fact,here I will present an optional solution for the question using the "Gaps and Islands" approach.
Note! Once more I want to repeat myself! The solution which presented here is not the one which I recommend to use in this case. This solution address the question using a well known approach named "Gaps and Islands" and might fit for most cases. In my next post I will present a totally different approach using my personal trick, which can improve performance dramatically in some scenario.
DROP
TABLE
IF EXISTS TEST
GO
CREATE
TABLE
TEST (ID
int
, tFrom datetime, tUntil dateTime)
insert
into
TEST
Values
(1,
'2019-1-1 12:00'
,
'2019-1-1 13:00'
)
insert
into
TEST
Values
(1,
'2019-1-1 13:00'
,
'2019-1-1 14:00'
)
insert
into
TEST
Values
(1,
'2019-1-1 14:00'
,
'2019-1-1 16:00'
)
insert
into
TEST
Values
(1,
'2019-1-1 18:00'
,
'2019-1-1 19:00'
)
insert
into
TEST
Values
(1,
'2019-1-1 19:00'
,
'2019-1-1 20:00'
)
insert
into
TEST
Values
(1,
'2019-1-1 20:00'
,
'2019-1-1 21:00'
)
insert
into
TEST
Values
(1,
'2019-1-1 22:00'
,
'2019-1-1 23:00'
)
insert
into
TEST
Values
(2,
'2019-1-1 12:00'
,
'2019-1-1 13:00'
)
insert
into
TEST
Values
(2,
'2019-1-1 13:00'
,
'2019-1-1 14:00'
)
insert
into
TEST
Values
(2,
'2019-1-1 14:00'
,
'2019-1-1 16:00'
)
insert
into
TEST
Values
(2,
'2019-1-1 18:00'
,
'2019-1-1 19:00'
)
insert
into
TEST
Values
(2,
'2019-1-1 19:00'
,
'2019-1-1 20:00'
)
insert
into
TEST
Values
(2,
'2019-1-1 20:00'
,
'2019-1-1 21:00'
)
insert
into
TEST
Values
(2,
'2019-1-1 22:00'
,
'2019-1-1 23:00'
)
GO
SELECT
ID, tFrom, tUntil
FROM
TEST
Note! We can convert the DATETIME values into INTEGERS andsimply use the solution for "Grouping together Continuous Deciaml-slots".
SELECT
ID, tFrom, tUntil, DATEDIFF_BIG(
MINUTE
,
'2018-01-01'
, tFrom) s, DATEDIFF_BIG(
MINUTE
,
'2018-01-01'
, tUntil) e
FROM
TEST
This will work well if our entire data is limited to a small range of dates, for example using the date "2018-01-01" as the smaller date. unfortunately, using MINUTE as resolution with older date will enforce us to use range of numbers that come to almost 100 million.
Moreover, think about the option that we need to use higher resolution like seconds or milliseconds...
To solve this issue, I will use a small change in the code, and I will use a Times table which include the points of times which we want to use according to the resolution which we need.
In the case presented in the original question, the slot can be one hour or more and I remind you that we need to use higher resolution. Therefore, I will use resolution of 10 minutes, and starting time at 2010-01-01.
DROP
TABLE
IF EXISTS Times
GO
SELECT
DT = DATEADD(
MINUTE
, N*10,
'2010-01-01'
)
INTO
Times
FROM
Numbers
GO
CREATE
CLUSTERED
INDEX
IX_DT
ON
Times(DT)
GO
SELECT
TOP
1000 DT
from
Times
GO
SELECT
ID, tFrom, tUntil, DT
FROM
TEST t
INNER
JOIN
Times dt
ON
DT
between
tFrom
and
tUntil
GO
;
With
MyCTE01
as
(
SELECT
DISTINCT
ID, DT
FROM
TEST t
INNER
JOIN
Times dt
ON
DT
between
tFrom
and
tUntil
)
,MyCTE02
as
(
SELECT
ID, DT,
MyGroup =
DATEDIFF(
MINUTE
,
DATEADD(
MINUTE
, 10 * ROW_NUMBER()OVER(PARTITION
BY
ID
ORDER
BY
ID,DT),0),
DT
)
from
MyCTE01
--order by ID,DT
)
SELECT
ID,
MIN
(DT) tFrom,
MAX
(DT) tUntil
FROM
MyCTE02
GROUP
BY
ID, MyGroup
ORDER
BY
ID, tFrom
GO
Done!
You are welcome to test it
Conclusions: Advantages, disadvantages, and limitations
.
Resources and More information