en-UShe-IL
You are here:   Blog
Register   |  Login

Blog Archive:

Maximize
* Can be used in order to search for older blogs Entries

Search in blogs


Blog Categories:

Maximize
* Can be used in order to search for blogs Entries by Categories

Blog Tags:

Maximize
* Can be used in order to search for blogs by keywords

TNWikiSummit


Awared MVP 


 


Microsoft® Community Contributor 


Microsoft® Community Contributor


Read first, before you you use the blog

Maximize
אפר10

Written by: ronen ariely
10/04/2018 22:00 RssIcon

a new awesome feature was added to the MSDN profile (EU GDPR)!


Background

The European Parliament (EP) is the directly elected parliamentary institution of the European Union (EU). As you probably heard the EU Parliament approved a new set of lows named General Data Protection Regulation (GDPR) which enforced as of 25 May 2018.

Seems like Microsoft papered their services to the new GDPR regulation, and as part of the preparation several weeks ago a new feature named “PRIVACY” was added to the MSDN and TechNet profile. This feature enables us to download the data that Microsoft stores about our activities!

Yes! it's an AWESOME feature, but it's absolutely useless without a way to view the data.

Unfortunately, the data that is given to us is not formatted in a readable way, but in one long line of text [*updated] (JSON FORMAT) which can be in the size of hundreds of megabytes, if you are active member in the system.

In this post, I show how to get our data, view the data in raw format (JSON) and in readable format, and I gave queries to parse all the data into tabular structure using the built-in JSON features in SQL Server.

Table of content

 
BackgroundWhat this post about 
Download the information about your activities Get started and get your data that Microsoft Stores
View the content of the activities data filesShort tutorial on how you can read the data directly.
analyze the content of the activities data filesThis chapter is long and technical ! It point for people that familiar with SQL Server and writing queries. In this section I upload the JSON data into SQL Server, and parse it into tables, so it will be simple to use and read. All the scripts are available to download
Step 1: Create new SQL Server Database
Step 2: Import JSON files to SQL Server
Step 2.1: Import the JSON source data to a new table
Step 3: parse the forums JSON data and store in new tables
Step 4: parse the blog JSON data and store in new tables
Step 5: parse the Profile JSON data
Step 6: parse the TechNet Wiki JSON data
to sum upIf you are really lazy, You can jump here to get a link to download the scripts
From Microsoft team - feedback and future plan
  

  

Download the information about your activities

It is obvious that Microsoft collects the information that we add to the system. For example, each time that we send a message in the forums, write an article or add comment in the TechNet Wiki, write new post in one of Microsoft blogs or add comment to a blog, add code and project to the Gallery, or simply edit our profile information.

Now, you can view all the data and download it!

You can find it by going to your profile page. Your profile URL should look like bellow link (replace my nickname “pituach” with yours):
https://social.msdn.microsoft.com/Profile/pituach

Under your profile page you can see four tabs: EXPERIENCE, ACTIVITY, PRIVACY, and FAQ

* The PRIVACY option only shown when you are logged to the system and in your personal profile, while the other three options can be seen by anyone.

In order to download the data, click on PRIVACY.

Quote from the Profile page: “This is where you can download a copy of your data stored in MSDN and TechNet communities (Forum, Wiki, Blog, Gallery, Profile). If you have personal content that you want to export from Microsoft products-like your email, calendar, and photos-you can export it from within those products.”

The challenge to view the content is not simple for most users. Unfortunately, despite of the great new option we got, this feature is absolutely useless for most users without a way to view the data. The data that we get (from MSDN Forum, TechNet Forum and MS Forum, TechNet Wiki, MSDN Blog, TechNet Blog, and MSDN Profile) is in JSON files. If you will open the file in text editor you will see an endless one long line of text, assuming that you can open the file... Moreover! There is so much data that Microsoft collected on our activities that if you are an active member in the system and you registered for long time and have a lot of activities, then the file(s) might be in a size of several megabytes or theoretically even a gigabyte. Simple applications like Notepad and even Notepad++ cannot even open files in that size, and even if you succeed to open the files in applications like Notepad++, the content is not readable since it is not formatted!

OK, let;'s first get the data. Please click “create” on each of the options in the page. This will take several minutes till the Microsoft system collects your data and creates files to download.

Once the files are ready, you will see a link “download”. If you click on the link in most common browsers, it will open the file instead of download it, since these are JSON files and the browsers know how to read/present it. The only option which is not a JSON file it the Gallery data which is a zip file that includes our projects and code we uploaded.

IMPORTANT! Do not save the browser page that that include the JOSN content! Right click on the download link and select to save the link. This will keep the original encoding of the files.

How can we view the content?!?
Moreover, how can we analyze the content?

View the content of the activities data files

As first step, the simplest option I recommend is to open the file with Visual Studio Code (or any other developing tool which able to parse JSON), if the file is not too big.

* One of the advantages of Visual Studio Code (VS) is that we can use it without installation, which make it very useful tool to have in disk-on-key or any external device.

Visual Studio Code recognizes the type of content since the file name has extension (suffixes) “. JSON”. you can use one of the bellow options in order to view the content in a readable format

  • (all OS) Right click anywhere in the document body, and from the menu select “Format Document”
  • (Windows) Click Shift + Alt + F
  • (Mac) Click Shift + Option + F
  • (Ubuntu) Click Ctrl + Shift + I

* Using Visual Studio (not code version) you can format the content using Ctrl + K + F
* Using Notepad++ you can use the JSON-Viewer extension to format the text

By the way. There are lots of online services, which you can search for, that provide the option to format a JSON text. I do not recommend to use these for your personal data files but this is another option, at least for the small size files.

This gives you a great way to have a fast look at the content, as well as understand the format of the content, which is something we can use in order to build our analyzing solution. Unfortunately, analyze text from several megabytes of data manually, is not something we probably can do. Moreover, the data in the file is not necessarily sorted by date.

analyze the content of the activities data files

For the analyze the data, we will need to store the data in our database, and most database’s servers include built-in support for JSON. For the sake of this tutorial we will use the built-in JSON support in SQL Server 2016 and above.

Fortunately, almost any common developing technology and language includes support for JSON format, and as part of that it’s probably includes a built-in function to format JSON that come in one line (not readable by human but much more effective for machine) into a pretty format including Dot.Net, JavaScript, python, and so on…

Step 1: Create new SQL Server Database

For the sake of the post I am using local SQL Server (you can use version 2016 and above), but you can work with Azure SQL Database as well.

* If you choose to use Azure SQL Database, then it is recommended to ipload the files to the Azure Blob for faster importing to the server. To create new Azure database, you can use the Azure portal or with queries using management studio like “SQL Server Operations Studio“ or “SQL Server Management Studio”.

Step 2: Import JSON files to SQL Server

Unfortunately, First time that I used this tool, I clicked on the download link instead of directly download the file. As a result, when I saved the page from the browser it was saved in ASCII encoding. If you are using only English then this can be used directly, but if you participate in multiple languages activities like me, then I needed to convert the files to Unicode. The original files are encoded in "UTF-8 with BOM".

If you use SQL Server 2016 and above and you want to use “BULK INSERT” or “bcp” in order to store the original JSON data in a table, which my first preferred, then you can use the encoding "UTF-8 with BOM", but if you use older versions of SQL Server or you want to use OPENROWSET in order to parse the data in the files (store only the parsed data), then you should convert the files into UTF-16LE.

Change the encoding of a file in Visual Studio Code:

  • Open the files in Visual Studio code.
  • In the bottom bar, you'll see a label with the file encoding (check the image below). Click it,
  • Select the option “Save with encoding” in the popup that opened.
  • Select the encoding according to the way you want to use the file (UTF-8 with BOM, or UTF-16LE).


Note! I HIGHLY recommend to check this post for more information about character set, character code point, and character encoding in SQL Server.

Not! SQL Server 2016 and above added support for UTF-8 in some tasks, SQL Server 2012 and above added support for UTF-16 codepoint (range 0 through 0x10FFFF), and earlier version used only UCS-2 codepoint (range 0 through 0xFFFF) as Unicode.

Note! Byte Order Mark (BOM) is a Unicode character at the start of a text stream. It configures the byte order, or endianness, which the text stream is stored in. Endianness refers to the sequential order in which bytes are arranged into larger numerical values.

Step 2.1: Import the JSON source data to a new table

DROP TABLE IF EXISTS MicrosoftPrivacyData;
create table MicrosoftPrivacyData (Activity NVARCHAR(MAX))
GO
-- make sure the files are stored in "UTF-8 with BOM"
-- I saved the files in names: Forums.json, Blog.json, Profile.json
BULK INSERT MicrosoftPrivacyData FROM N'E:\ator\Desktop\MSDN Profile\_Original Files\Forums.json'
    WITH(CODEPAGE = 65001, ROWTERMINATOR =' |\n', KEEPIDENTITY);
GO
BULK INSERT MicrosoftPrivacyData FROM N'E:\ator\Desktop\MSDN Profile\_Original Files\Blog.json'
    WITH(CODEPAGE = 65001, ROWTERMINATOR =' |\n', KEEPIDENTITY );
GO
BULK INSERT MicrosoftPrivacyData FROM N'E:\ator\Desktop\MSDN Profile\_Original Files\Profile.json'
    WITH(CODEPAGE = 65001, ROWTERMINATOR =' |\n', KEEPIDENTITY );
GO
BULK INSERT MicrosoftPrivacyData FROM N'E:\ator\Desktop\MSDN Profile\_Original Files\TechNet Wiki.json'
    WITH(CODEPAGE = 65001, ROWTERMINATOR =' |\n', KEEPIDENTITY );
GO
ALTER TABLE MicrosoftPrivacyData
    add ActivityID int identity (1,1)
GO
-- the order of the data is not guaranteed, but for most cases it will be in the order that we inserted it
-- for the sake of the post I iwll assume that this is the case, which means
-- the forums activities get value 1 for the ActivityID
-- the Blog activities get value 2 for the ActivityID
-- the Profile activities get value 3 for the ActivityID
-- let's confirm the data
select ActivityID, Activity from MicrosoftPrivacyData
GO

Step 3: parse the forums JSON data and store in new tables

Note! This section fit the current format of the files. If Microsoft will change the file’s format then we will need to make some adjustment in the queries.

If we want to parse the data directly from the file without first store the original JSON data, then we can use OPENROWSET. In this case you should convert the files to UTF-16LE encode.

DECLARE @ForumsDiscussions NVARCHAR(MAX)
SELECT @ForumsDiscussions = BulkColumn
FROM OPENROWSET (
    BULK N'E:\_Original Files\Forums.json'
    , SINGLE_NCLOB
) as j
select @ForumsDiscussions as OriginalText

Once the variable @ForumsDiscussions includes the content (value of the JSON), we can start to parse it and to organize the data in tabular format. For the sake of this post we will get the data from the new table we have just created (MicrosoftPrivacyData). The rest of the script is the same.

DECLARE @ForumsDiscussions NVARCHAR(MAX)
SELECT @ForumsDiscussions = Activity from MicrosoftPrivacyData where ActivityID = 1
  
IF (ISJSON(@ForumsDiscussions) = 1) BEGIN
    PRINT 'JSON is Valid... start to parse the data'
      
    -- this mapped the titles of new threads
    -- in order to find the content of the message we can search
    -- the DiscussionID in the messages list
    drop table if exists ForumsDiscussionTitles;
    SELECT * into ForumsDiscussionTitles
    FROM OPENJSON(@ForumsDiscussions, '$.discussion')
        WITH(
            DiscussionID VARCHAR(MAX) '$.DiscussionID',
            DiscussionGroup VARCHAR(MAX) '$.DiscussionGroup',
            BehaviorType VARCHAR(MAX) '$.BehaviorType',
            Archived VARCHAR(MAX) '$.Archived',
            Locked VARCHAR(MAX) '$.Locked',
            CreatedDate VARCHAR(MAX) '$.CreatedDate',
            LastUpdatedDate VARCHAR(MAX) '$.LastUpdatedDate',
            Name VARCHAR(MAX) '$.Name',
            FlagType VARCHAR(MAX) '$.FlagType',
            FlagReason VARCHAR(MAX) '$.FlagReason',
            FlaggedDate VARCHAR(MAX) '$.FlaggedDate'
        )
  
    -- this is the list of messages content
    drop table if exists ForumsDiscussionBody;
    SELECT * into ForumsDiscussionBody
    FROM OPENJSON(@ForumsDiscussions, '$.message')
        WITH(
            DiscussionID VARCHAR(MAX) '$.DiscussionID',
            CreatedDate VARCHAR(MAX) '$.CreatedDate',
            LastUpdatedDate VARCHAR(MAX) '$.LastUpdatedDate',
            Archived VARCHAR(MAX) '$.Archived',
            LCID VARCHAR(MAX) '$.LCID',
            BodyText VARCHAR(MAX) '$.BodyText'
        )
          
    -- this is the list of userPreferences
    drop table if exists ForumsUserPreferences;
    SELECT * into ForumsUserPreferences
    FROM OPENJSON(@ForumsDiscussions, '$.userPreferences')
        WITH(
            ThreadViewMode VARCHAR(MAX) '$.ThreadViewMode',
            AutoUpdatesEnabled VARCHAR(MAX) '$.AutoUpdatesEnabled',
            GroupAnswers VARCHAR(MAX) '$.GroupAnswers',
            ShowTagging VARCHAR(MAX) '$.ShowTagging',
            SubscribeToAlerts VARCHAR(MAX) '$.SubscribeToAlerts',
            Signature VARCHAR(MAX) '$.Signature',
            ShowRelatedThreads VARCHAR(MAX) '$.ShowRelatedThreads',
            ShowStatistics VARCHAR(MAX) '$.ShowStatistics',
            UseNNTPBridge VARCHAR(MAX) '$.UseNNTPBridge',
            ReceiveEmail VARCHAR(MAX) '$.ReceiveEmail',
            RequestedOn VARCHAR(MAX) '$.RequestedOn'
        )
  
          
    -- this is the list of messages content
    drop table if exists ForumsuserRoles;
    SELECT * into ForumsuserRoles
    FROM OPENJSON(@ForumsDiscussions, '$.userRole')
        WITH(
            Role VARCHAR(MAX) '$.Role'
        )
END
  
select * from ForumsDiscussionTitles
select * from ForumsDiscussionBody
select * from ForumsUserPreferences
select * from ForumsuserRoles
GO


Step 4: parse the blog JSON data and store in new tables

DECLARE @ForumsDiscussions NVARCHAR(MAX)
SELECT @ForumsDiscussions = Activity from MicrosoftPrivacyData where ActivityID = 2
 
IF (ISJSON(@ForumsDiscussions) = 1) BEGIN
    PRINT 'JSON is Valid... start to parse the data'
     
    -- Store the blog's comments
    drop table if exists blogComments;
    SELECT * into blogComments
    FROM OPENJSON(@ForumsDiscussions, '$')
        WITH(
            comment_date VARCHAR(MAX) '$.comment_date',
            comment_content VARCHAR(MAX) '$.comment_content'
        )
END
 
select * from blogComments
GO

Step 5: parse the Profile JSON data

DECLARE @ForumsDiscussions NVARCHAR(MAX)
SELECT @ForumsDiscussions = Activity from MicrosoftPrivacyData where ActivityID = 3
 
IF (ISJSON(@ForumsDiscussions) = 1) BEGIN
    PRINT 'JSON is Valid... start to parse the data'
     
    -- Store the Profile Contacts
    drop table if exists ProfileContacts;
    SELECT * into ProfileContacts
    FROM OPENJSON(@ForumsDiscussions, '$.attribute')
        WITH(
            Contact VARCHAR(MAX) '$.Value'
        )
         
    -- Store the Profile Info
    drop table if exists ProfileInfo;
    SELECT * into ProfileInfo
    FROM OPENJSON(@ForumsDiscussions, '$')
        WITH(
            [DisplayName] VARCHAR(MAX) '$.profile[0].DisplayName',
            [LastUpdatedDate] VARCHAR(MAX) '$.profile[0].LastUpdatedDate',
            [SmallAvatarBase64String] VARCHAR(MAX) '$.profile[0].SmallAvatarBase64String',
            [LargeAvatarBase64String] VARCHAR(MAX) '$.profile[0].LargeAvatarBase64String',
            [ExtraLargeAvatarBase64String] VARCHAR(MAX) '$.profile[0].ExtraLargeAvatarBase64String',
            [Signature] VARCHAR(MAX) '$.profile[0].Signature',
            [Url] VARCHAR(MAX) '$.profile[0].Url',
            [IsMVP] bit '$.profile[0].IsMVP',
            [IsMicrosoftEmployee] bit '$.profile[0].IsMicrosoftEmployee',
            [IsInappropriate] bit '$.profile[0].IsInappropriate',
            [AcceptedEula] VARCHAR(MAX) '$.profile[0].AcceptedEula',
            [OrganizationName] VARCHAR(MAX) '$.profile[0].OrganizationName',
            [ProfileCreationDate] datetime2 '$.profile[0].ProfileCreationDate',
            [AgreementsAcceptanceDate] datetime2 '$.profile[0].AgreementsAcceptanceDate',
            [IsPartner] bit '$.profile[0].IsPartner',
            [FirstName] VARCHAR(MAX) '$.profile[0].FirstName',
            [LastName] VARCHAR(MAX) '$.profile[0].LastName',
            [IsMicrosoftContingentStaff] bit '$.profile[0].IsMicrosoftContingentStaff',
            [IsMCC] bit '$.profile[0].IsMCC',
             
            [TotalPoints] int '$.userPoints[0].TotalPoints',
            [PointsLastUpdated] datetime2 '$.userPoints[0].LastUpdated'
        )
 
         
         
    -- Store the user Achievements
    drop table if exists ProfileUserAchievements;
    SELECT * into ProfileUserAchievements
    FROM OPENJSON(@ForumsDiscussions, '$.userAchievement')
        WITH(
            [AchievementKey] VARCHAR(MAX) '$.AchievementKey',
            [MedalId] int '$.MedalId',
            [EarnedDate] datetime2 '$.EarnedDate'
        )
         
    -- Store the user Activity Feed
    drop table if exists ProfileUserActivityFeed;
    SELECT * into ProfileUserActivityFeed
    FROM OPENJSON(@ForumsDiscussions, '$.userActivityFeed')
        WITH(
            [Application] VARCHAR(MAX) '$.Application',
            [Activity] VARCHAR(MAX) '$.Activity',
            [When] datetime2 '$.When'
        )
 
         
    drop table if exists ProfileUserAchievements;
    SELECT * into ProfileUserAchievements
    FROM OPENJSON(@ForumsDiscussions, '$.userAchievement')
        WITH(
            [AchievementKey] VARCHAR(MAX) '$.AchievementKey',
            [MedalId] int '$.MedalId',
            [EarnedDate] datetime2 '$.EarnedDate'
        )
 
END
 
select * from ProfileInfo
select * from ProfileUserAchievements
select * from ProfileUserActivityFeed
select * from ProfileContacts
GO

Step 6: parse the TechNet Wiki JSON data

Parsing the Wiki data was a bit problematic in my case since the file was so big that even Visual Studio Code and Notepad++ could not open it.

I needed to find another way to examine the format of the JSON text in order to create the script that parse the data into the tables. The solution was to execute the bellow query

DECLARE @ForumsDiscussions NVARCHAR(MAX)
SELECT @ForumsDiscussions = Activity from MicrosoftPrivacyData where ActivityID = 4
SELECT * FROM OPENJSON(@ForumsDiscussions, '$')

The above query returns all the main Keys in the JSON: wikiPage, pageComment, message, pageRevision, userProfile, userRole. For each key I will create new table.

In order to know which columns, I will create in the table I simply examine the JSON of the first member in the array.

DECLARE @ForumsDiscussions NVARCHAR(MAX)
SELECT @ForumsDiscussions = Activity from MicrosoftPrivacyData where ActivityID = 4
SELECT * FROM OPENJSON(@ForumsDiscussions, '$.wikiPage[0]')
SELECT * FROM OPENJSON(@ForumsDiscussions, '$.pageComment[0]')
SELECT * FROM OPENJSON(@ForumsDiscussions, '$.message[0]')
SELECT * FROM OPENJSON(@ForumsDiscussions, '$.pageRevision[0]')
SELECT * FROM OPENJSON(@ForumsDiscussions, '$.userProfile[0]')
SELECT * FROM OPENJSON(@ForumsDiscussions, '$.userRole[0]')

Once I got the information I can built my script to parse all the data and insert to the new tables

The script has the same format basically as previous queries. In current case the script is a bit too long to present in the post. You can download all the scripts (queries) from this link.

to sum up

Wow… this was long post

In this post I presented Microsoft new feature for the MSDN and TechNet PRIVACY, which fits the European Parliament General Data Protection Regulation (GDPR) rules. This feature enables us to download the data that Microsoft stores about our activities!

Unfortunately, the data that is given to us is not formatted in a readable way. In this post, I showed how to get our data, view the data in raw format (JSON), and I gave some queries to parse all the data into tabular structure using the built-in JSON features in SQL Server.

If you are lazy and only want to get the data and start use it, then you can simply download the scripts from here, but if you are interested in the procedure I used in order to examine the data and develop the scripts, then you are welcome to read the post fully.

I hop this post and the scripts are useful

If you have any comment or feedback you can find me on Facebook.

Have a great day,
Ronen Ariely

From Microsoft team - feedback and future plan:

Note! About this section
Most of the communication that I do with Microsoft teams are under Non-disclosure agreement (NDA). Therefore, I am limited to talk about everything. After I got email with some great feedback about my post, I asked permission to publish part of the email, which brings important information about future plan accordingly to my blog. To clarify, this is not a commitment to do so, but only an optional plan at this time.
This section is a copy "as it is" from the email that I got

Thank you Ronen for waiting.

The engineering team has studied your blog post carefully and is inspired by your solution.

What we plan to do:

·  Fix unreadability issue: We’re going to reformat the JSON string into multiple lines.

·  Fix too large file size issue: We’re going to compress JSON file into ZIP file to save the size. This way can also avoid browser auto presenting the file which takes long if the JSON file is large. When click “download” link then user get a zip file to save locally.

·  We also plan to add recommendation in our UI to advocate user open the exported file by VS Code, just like the proposal in your post. By this way, even big file can be opened smoothly.

·  No extra documentation is needed.

Sample of the reformatted JSON file when open by VS Code:

In addition, I want to thank you not only in this email.

Do you have post in facebook or twitter so our team could RT it and say “thank you”?

Regards,

Hui Xie | PM of MSDN profile & blog


Update - May 09, 2018

Update on this thread. Microsoft team has shipped two changes:

  • “Pretty Json”: i.e. reformat the JSON file for better reading. This means that we can view the file directly if it is not too big to be opened in local app.
  • Zipped file: to mitigate the too large file issue. This means that we can download the file faster and that we do not encounter the problem that the file is opened in the browser instead to be downloaded when we click the download link.



Tags: Microsoft , SQL , JSON
Categories: SQL , Other...
Location: Blogs Parent Separator Public blog