Written by:
Ronen Ariely
29/03/2022 10:03 
Background
When I wrote this post I had a dilemma regarding the title, since I want to cover two different aspects. The first aspect is technical: presenting a solution with proven better performance than the use of internal queries. The second aspect is more of a psychological discussion about the choices users make while completely ignoring performance in small systems. This post discusses using the functions LAG and LEAD vs. using sub-queries.
This post is based on the discussion in this thread at the Microsoft QnA forum.
During the discussion in the forum I tried to lead the OP to the answer and, along the way, give him the best value in learning. As in most cases, I try not to provide answers but guidance on how to get answers on your own.
We progressed gradually: I mostly guided the OP with each new response, and the OP changed the original question, adding more and more information, until the OP was able to provide everything we needed in order to solve his problem. At this point one of the top supporters in the forum, Viorel, joined the discussion and provided an answer to the updated question.
However, the answer in this specific case was not optimal. Unfortunately, in most cases in the forums, people who come to ask a question do not care about learning but only about the solution, even though, in my opinion, the road is just as important as the end point. The road (the learning) is what will help the person solve the next issue and not just the current one - teach a man to fish and you feed him for a lifetime...
The OP marked the answer he got, and I assume that from his point of view the discussion ended, but I wanted to present a solution which might be tens of times better in some cases, which is what I will do in this post... so let's start.
The request
We have a table which summarizes customers' orders. The goal is to add another column to the table with a value that is calculated from other rows in the table.
The table we have can be built using the following DDL+DML:
DROP TABLE IF EXISTS [calisma3_2]
GO

CREATE TABLE [calisma3_2] (
    Cust_id       VARCHAR(20),
    Customer_Name VARCHAR(50),
    order_date    DATE,
    order_number  INT
)

INSERT INTO [calisma3_2] VALUES
    ('Cust_1',    'Muhammed Macintyre', '2010-10-13', 1),
    ('Cust_10',   'CLAUDIA MINER',      '2012-08-04', 1),
    ('Cust_100',  'CHARLETTO MELTON',   '2009-06-21', 1),
    ('Cust_100',  'CHARLETTO MELTON',   '2010-05-07', 3),
    ('Cust_1000', 'JANE WACO',          '2010-10-06', 1),
    ('Cust_1000', 'JANE WACO',          '2012-02-06', 3),
    ('Cust_1001', 'SANDRA FLANAGAN',    '2009-01-08', 1),
    ('Cust_1001', 'SANDRA FLANAGAN',    '2011-10-24', 3),
    ('Cust_1002', 'STEVE NGUYEN',       '2010-01-05', 1),
    ('Cust_1002', 'STEVE NGUYEN',       '2010-12-14', 3)
GO

ALTER TABLE [calisma3_2] ADD fark INT
GO

SELECT * FROM [calisma3_2]
GO

The rules for the value that we want to update in the new column named "fark" are:
Each Cust_id has exactly two rows: one with the value 1 in the column order_number and the other with the value 3. For the sake of the discussion, I will use the term "pair rows" for the two matching rows of the same Cust_id.
We want to fill the "fark" column for each Cust_id with the difference in days between the values of the column order_date in the two pair rows of that Cust_id. In other words, the difference between the order_date that appears in the row with order_number 1 and the order_date that appears in the row with order_number 3.
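To make the rule concrete, here is a minimal sketch based on the sample data above: the pair rows of Cust_100 hold the dates 2009-06-21 (order_number 1) and 2010-05-07 (order_number 3), so the expected "fark" value for this customer is simply the day difference between the two dates:

```sql
-- Expected "fark" value for the pair rows of Cust_100,
-- computed directly from the two order_date values in the sample data
SELECT DATEDIFF(DAY, '2009-06-21', '2010-05-07') AS fark_for_Cust_100
-- returns 320
```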
The first approach, which I assume most people would choose, and which is also the one used by Viorel, is to use sub-queries in order to get the value of the column order_date from the matching pair row of the Cust_id.
With that being said, since we only deal with two rows for each Cust_id, we can avoid using sub-queries and use the functions LAG and LEAD to get the value from the matching pair row.
The basic idea of using LAG and LEAD for this scenario is as the following SELECT query:
SELECT
    t0.Cust_id, t0.order_date, t0.order_number,
    MyDif = ABS(DATEDIFF(
        DAY,
        t0.order_date,
        COALESCE(
            LEAD(t0.order_date) OVER (PARTITION BY t0.Cust_id ORDER BY t0.order_date),
            LAG(t0.order_date)  OVER (PARTITION BY t0.Cust_id ORDER BY t0.order_date)
        )
    ))
FROM [calisma3_2] t0
As you can see, there is no reason to use sub-queries.
For the sake of performance, we will create a CLUSTERED INDEX which fits both queries:
DROP INDEX IF EXISTS Inx_Cust_id_order_number ON [calisma3_2]
GO

CREATE CLUSTERED INDEX Inx_Cust_id_order_number
    ON [calisma3_2](Cust_id, order_number);
GO
Let's compare the performance of the two solutions when we UPDATE the column as requested in the original question.
Using sub-queries
SET STATISTICS IO ON
GO

UPDATE t SET fark = DATEDIFF(
    DAY,
    (SELECT order_date FROM calisma3_2
     WHERE Cust_Id = t.Cust_Id AND order_number = 1),
    (SELECT order_date FROM calisma3_2
     WHERE Cust_Id = t.Cust_Id AND order_number = 3)
)
FROM calisma3_2 t
GO
Table 'calisma3_2'. Scan count 21, logical reads 62, physical reads 0, page server reads 0, read-ahead reads 0, page server read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob page server reads 0, lob read-ahead reads 0, lob page server read-ahead reads 0.
Table 'Worktable'. Scan count 1, logical reads 23, physical reads 0, page server reads 0, read-ahead reads 0, page server read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob page server reads 0, lob read-ahead reads 0, lob page server read-ahead reads 0.
Notice that we have Scan count 21: one scan came from the outer query which scans the table, and for each row in the table we executed two sub-queries. In this simple demo we have only 10 rows, which means we got 10*2 = 20 additional scans for the sub-queries. If this were a real case with millions of rows, we could end up with two times millions of scans.
Total IO: 62 + 23 = 85 logical reads
Using LAG and LEAD functions
SET STATISTICS IO ON
GO

;WITH MyCTE AS (
    SELECT
        t0.Cust_id, t0.order_date, t0.order_number,
        MyDif = ABS(DATEDIFF(
            DAY,
            t0.order_date,
            COALESCE(
                LEAD(t0.order_date) OVER (PARTITION BY t0.Cust_id ORDER BY t0.order_number),
                LAG(t0.order_date)  OVER (PARTITION BY t0.Cust_id ORDER BY t0.order_number)
            )
        ))
    FROM [calisma3_2] t0
)
UPDATE [calisma3_2] SET fark = MyDif
FROM [calisma3_2]
INNER JOIN MyCTE
    ON  [calisma3_2].Cust_id = MyCTE.Cust_id
    AND [calisma3_2].order_number = MyCTE.order_number
GO
Table 'calisma3_2'. Scan count 2, logical reads 43, physical reads 0, page server reads 0, read-ahead reads 0, page server read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob page server reads 0, lob read-ahead reads 0, lob page server read-ahead reads 0.
Table 'Worktable'. Scan count 0, logical reads 0, physical reads 0, page server reads 0, read-ahead reads 0, page server read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob page server reads 0, lob read-ahead reads 0, lob page server read-ahead reads 0.
In this case we need only Scan count 2, and the total IO was only 43 logical reads, which is a huge improvement over the sub-queries solution.
Unfortunately, the OP did not care about performance, with the explanation that "For this case, it has given the same", since he has only 50 rows. This is a huge mistake, and I failed to explain to the OP that performance can impact other queries and the general behavior of the server. If you have no real reason otherwise, then please don't be lazy, and choose the better solution for your case!
Conclusions
In most cases, using window functions like LAG and LEAD can perform much better than using sub-queries.
Any time your goal is to use in each row a value that is calculated from other rows, try to avoid sub-queries and find a way to solve the issue using window functions.
If no suitable built-in function exists, then it will probably be a good idea to add such a function using SQLCLR.
And always remember not to be lazy!