en-UShe-IL
You are here:   Blog
Register   |  Login

Blog Archive:

Maximize
* Can be used in order to search for older blogs Entries

Search in blogs


Blog Categories:

Maximize
* Can be used in order to search for blogs Entries by Categories

Blog Tags:

Maximize
* Can be used in order to search for blogs by keywords

TNWikiSummit


Awared MVP

 


Microsoft® Community Contributor 


Microsoft® Community Contributor


 Read this before you use the blog! Maximize

Recent Entries

Minimize
יונ8

Written by: ronen ariely
08/06/2020 05:02 RssIcon

Let's "play" with some data and what fit better in these days then to use the Coronavirus Covid-19 data?

Preparation: Getting the Source data

The source of the data whichn I will use in this post was taken from the CSSEGISandData GitHub project. The project includes the COVID-19 Data Repository created by the Center for Systems Science and Engineering (CSSE) at Johns Hopkins University. 

CSSEGISandData GitHub project can be found here:

https://github.com/CSSEGISandData/COVID-19

For the sake of this sample code, I wil use the data from a specific day:

https://github.com/CSSEGISandData/COVID-19/blob/master/csse_covid_19_data/csse_covid_19_daily_reports/03-28-2020.csv

The data is in a csv comma delimited file.

Step 1: Create format file in order to query the data

The source data includes 12 columns: FIPS, Admin2, Province_State, Country_Region, Last_Update, Lat, Long_, Confirmed, Deaths, Recovered, Active, Combined_Key. You can use the following basic format file.

12.0 
12 
1   SQLCHAR    0    100     ","     1      FIPS            "" 
2   SQLCHAR    0    100     ","     2      Admin2          SQL_Latin1_General_CP1_CI_AS 
3   SQLCHAR    0    100     ","     3      Province_State  SQL_Latin1_General_CP1_CI_AS 
4   SQLCHAR    0    100     ","     4      Country_Region  ""
5   SQLCHAR    0    100     ","     5      Last_Update     ""
6   SQLCHAR    0    100     ","     6      Lat             ""
7   SQLCHAR    0    100     ","     7      Long_           ""
8   SQLCHAR    0    100     ","     8      Confirmed       ""
9   SQLCHAR    0    100     ","     9      Deaths          ""
10  SQLCHAR    0    100     ","     10     Recovered       ""
11  SQLCHAR    0    100     ","     11     Active          ""
12  SQLCHAR    0    100     "\r\n"  12     Combined_Key    ""

Step 2: Query the data in the file using OPENROWSET

As first step I always like to watch the data, before I do any ETL manipulation and storing the data.

SELECT 
    FIPS, Admin2, Province_State, Country_Region, Last_Update, Lat,
    Long_, Confirmed, Deaths, Recovered, Active, Combined_Key
FROM OPENROWSET(
    BULK N'E:\Users\ronen\Desktop\x\03-28-2020.csv',
    FORMATFILE = N'E:\Users\ronen\Desktop\x\csv.fmt',
    FIRSTROW=2,
    FORMAT='CSV') AS CoronaData
GO

Step 3: Create new table to store the data

At this time we can create new table and import the data while executing some ETL task like converting the string data into the right data type which fit the columns

CREATE TABLE [dbo].[Covid-19](
    [FIPS]           BIGINT            ,
    [Admin2]         varchar(100) NULL  ,
    [Province_State] varchar(100) NULL  ,
    [Country_Region] varchar(100) NULL  ,
    [Last_Update]    DATETIME2 NULL     ,
    [Lat]            Decimal(16,13) NULL ,
    [Long_]          Decimal(16,13) NULL ,
    [Confirmed]      INT NULL           ,
    [Deaths]         INT NULL           ,
    [Recovered]      INT NULL           ,
    [Active]         INT NULL           ,
    [Combined_Key]   varchar(100) NULL
) ON [PRIMARY]
GO

Step 4: Import the data to the table

And now we can INSERT the data while converting the data types

;with MyCTE as (
    SELECT 
        FIPS, Admin2, Province_State, Country_Region, Last_Update, Lat,
        Long_, Confirmed, Deaths, Recovered, Active, Combined_Key
    FROM OPENROWSET(
        BULK N'E:\Users\ronen\Desktop\x\03-28-2020.csv',
        FORMATFILE = N'E:\Users\ronen\Desktop\x\csv.fmt',
        FIRSTROW=2,
        FORMAT='CSV') AS CoronaData
)
INSERT [dbo].[Covid-19](
    [FIPS],[Admin2],[Province_State],[Country_Region],[Last_Update],
    [Lat],[Long_],[Confirmed],[Deaths],[Recovered],[Active],
    [Combined_Key] 
)
SELECT
        CONVERT(BIGINT,FIPS),
        Admin2, Province_State, Country_Region,
        CONVERT(DATETIME2,Last_Update),
        CONVERT(DECIMAL(16,13),Lat),
        CONVERT(DECIMAL(16,13),Long_),
        CONVERT(INT,Confirmed),
        CONVERT(INT,Deaths),
        CONVERT(INT,Recovered),
        CONVERT(INT,Active),
        Combined_Key
FROM MyCTE
GO

Note! You can download the rest of the data of all dates from the project site and repeat the same steps in order to import it to the [Covid-19] table.

Next we can start to Analyzing the data.

Analyze the number of confirmed for each day in each country

Get a row for each country and region with the information of the number of confirmed for each day

;with MyCTE as(
    select * from (
        SELECT
            Country_Region,--Province_State,
            FirstDayOfMonth = FORMAT (
                DATEADD(DAY,1,EOMONTH(DATEADD(MONTH,-1,Last_Update))),
                'yyyy-MM-dd', 'en-US' )
            , Confirmed--, Deaths, Recovered, Active
        FROM dbo.[Covid-19]
    ) src
    pivot(
        MAX(Confirmed)
        for FirstDayOfMonth in (
            -- You can add all the dates which you imported to the table
            [2020-03-01], [2020-03-02], [2020-03-03], [2020-03-04], [2020-03-05]
        )
    ) piv
)
select * from MyCTE
Order by Country_Region--,Province_State
GO

for the sake of the post I only imported the data from a single day which mean that all the rest of the columns which not have data as can be seen bellow:

What more can we do next?

1. You can use exactly the same query to get the Deaths, Recovered, and Active effected people in each country. Simply replace in the query the word Confirmed with one of the others.

2. You can get more details of each Province in each country. For this you simply need to un-comment the column Province_State from the query above in the SELECT and in the WHERE sections.



So...


This was a small taste demonstrate what we are doing in the last few months in order to help the appropriate authorities analyzing the numbers and get the right decision 😀

Unfortunately, we can only provide the data and the numbers but the fact is that the actual decisions are probably made based on political personal gain😕, but we are speaking about the technical aspect here so let's leave the political discussion to a different channel 🙄