Advent of Code 2017 Day 4–#SQLNewBlogger

Another post for me that is simple and hopefully serves as an example for people trying to get started blogging as #SQLNewBloggers.

This is day 4 of the Advent of Code 2017. If you want to read about the puzzles, start with Day 1.

Day 4 deals with checking a series of passphrases for duplicates. The idea is that, for some input, each of the “words” in the string must be unique. This means that if I have a phrase of “dog cat horse”, it’s valid. If I have “dog cat bird cat”, it’s invalid.

This feels like an easy thing to do in SQL. Let’s begin.

First, I’ll build a table and load some data.

CREATE TABLE Day4
(
     passphrase VARCHAR(100)
)
GO
INSERT dbo.Day4
(
     passphrase
)
VALUES
    ('aa bb cc dd ee' )
  , ('aa bb cc dd aa')
  , ('aa bb cc dd aaa')

This is the test set. In this set, there are two values that are valid and one that isn’t. I need to figure out which is which.

This seems like a perfect place to use STRING_SPLIT(). Since even the full input is a small set (512 rows), performance isn’t a big concern. If it were, I’d move to Jeff Moden’s string splitter.

For my tests, I decided to try something quick and dirty. I built a small inline TVF to return STRING_SPLIT values. This is the code:

CREATE FUNCTION dbo.words
     (@input AS VARCHAR(100))
RETURNS TABLE
AS
RETURN ( SELECT value FROM STRING_SPLIT(@input, ' ') AS ss )
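
A quick sanity check shows the splitter returning one row per word:

SELECT w.value
FROM dbo.words('dog cat horse') AS w
-- returns three rows: dog, cat, horse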

This made things easier for me to read. I like seeing the code clean, and this way I could easily move to a custom version of Jeff’s code if I wanted.

I quickly tried a short CTE with a CROSS APPLY to get the sets of rows that didn’t have unique passphrases. I thought this was easier since I know those are incorrect. If I look for the correct ones, it’s a harder problem, as I’ll have counts of 1 in both correct and incorrect strings.

WITH ctePass
AS
(
SELECT d.passphrase 
        , ss.value, 
        cnt = COUNT(*)
FROM day4 AS d
     CROSS APPLY dbo.words(d.passphrase) AS ss
--WHERE d.passphrase = 'aa bb cc dd ee'
GROUP BY d.passphrase, ss.value 
HAVING COUNT(*) > 1
)

I also added a CTE that just gets me the total count.

, cteCount
AS
(SELECT total = COUNT(*) FROM day4)

Next, I added a CTE that gets the distinct list of invalid passphrases from the first CTE.

, cteUnique
AS
(
SELECT ctePass.passphrase
, cnt = COUNT(cnt) 
  FROM ctePass
  GROUP BY ctePass.passphrase
  )

Finally, I subtract the invalids from the total to get the valids. This gave me the answer that solved the puzzle.

SELECT total - COUNT(*) FROM cteUnique, cteCount
GROUP BY cteCount.total
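
For reference, here are the three CTEs and the final SELECT stitched together into one runnable statement. One caveat: if there were no invalid passphrases at all, cteUnique would be empty and nothing would be returned.

WITH ctePass
AS
(
SELECT d.passphrase
        , ss.value
        , cnt = COUNT(*)
FROM dbo.Day4 AS d
     CROSS APPLY dbo.words(d.passphrase) AS ss
GROUP BY d.passphrase, ss.value
HAVING COUNT(*) > 1
)
, cteCount
AS
(SELECT total = COUNT(*) FROM dbo.Day4)
, cteUnique
AS
(
SELECT ctePass.passphrase
        , cnt = COUNT(cnt)
FROM ctePass
GROUP BY ctePass.passphrase
)
SELECT total - COUNT(*)   -- total rows minus invalid rows = valid rows
FROM cteUnique, cteCount
GROUP BY cteCount.total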

Now let’s load the actual data. I cut and pasted data into a file. Let’s load that.

TRUNCATE TABLE day4
GO
BULK INSERT dbo.Day4 FROM 'e:\Documents\GitHub\AdventofCode\2017\Day4\Day4_Input.txt'
GO
SELECT 
  *
  FROM dbo.Day4 AS d

I can see I have 512 rows in my table, so my result must be at most 512. When I ran this, I got the correct result.

A slight admission here: I misread the puzzle at first, so I didn’t have the total and just got the 57. When that didn’t work, I re-read things and realized the issue. I tried to check for the valids, but realized it was simpler if I just subtracted.
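
For completeness, counting the valids directly does work; it’s just one more level of nesting than the subtraction. Something like this should return the same answer:

-- Count the valid rows directly: passphrases that never show up in the
-- "has a repeated word" set
SELECT COUNT(*) AS valid_passphrases
FROM dbo.Day4 AS d
WHERE d.passphrase NOT IN
(
SELECT d2.passphrase
FROM dbo.Day4 AS d2
     CROSS APPLY dbo.words(d2.passphrase) AS ss
GROUP BY d2.passphrase, ss.value
HAVING COUNT(*) > 1
)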

Part II

Part II is tricky, and I made a mistake the first time I tried to solve this. It took a bit, but I eventually figured out the issue.

In this part, we have to avoid anagrams. That means if the phrase has ‘car’ and ‘rac’, it’s invalid. Getting anagrams is tricky, and I looked around and found a post on SO that covers this. I adapted that code from Pawel Dyl to search for anagrams.

My first task was to put this into a function:

CREATE OR ALTER FUNCTION dbo.CheckAnagram
     (@w1 VARCHAR(50)
     , @w2 VARCHAR(50)
)
RETURNS int
AS
begin

declare @r INT;

WITH Src
AS
(
SELECT T.W1 ,
        T.W2
FROM
(
     VALUES
         (@w1, @w2)
) AS T (W1, W2)
) ,
      Numbered
AS
(
SELECT Src.W1 ,
        Src.W2 ,
        Num = ROW_NUMBER() OVER (ORDER BY (SELECT 1))
FROM Src
) ,
      Splitted
AS
(
SELECT Num ,
        Word1 = W1 ,
        Word2 = W2 ,
        L1 = LEFT(W1, 1) ,
        L2 = LEFT(W2, 1) ,
        W1 = SUBSTRING(W1, 2, LEN(W1)) ,
        W2 = SUBSTRING(W2, 2, LEN(W2))
FROM Numbered
UNION ALL
SELECT Num ,
        Word1 ,
        Word2 ,
        L1 = LEFT(W1, 1) ,
        L2 = LEFT(W2, 1) ,
        W1 = SUBSTRING(W1, 2, LEN(W1)) ,
        W2 = SUBSTRING(W2, 2, LEN(W2))
FROM Splitted
WHERE LEN(W1) > 0
       AND LEN(W2) > 0
) ,
      SplitOrdered
AS
(
SELECT Splitted.Num ,
        Splitted.Word1 ,
        Splitted.Word2 ,
        Splitted.L1 ,
        Splitted.L2 ,
        Splitted.W1 ,
        Splitted.W2 ,
        LNum1 = ROW_NUMBER() OVER (PARTITION BY Num ORDER BY L1) ,
        LNum2 = ROW_NUMBER() OVER (PARTITION BY Num ORDER BY L2)
FROM Splitted
)
SELECT @r =  CASE
                   WHEN COUNT(*) = LEN(S1.Word1)
                        AND COUNT(*) = LEN(S1.Word2) THEN
                       1
                   ELSE
                       0
               END
FROM SplitOrdered AS S1
     JOIN
     SplitOrdered AS S2
         ON S1.L1 = S2.L2
            AND S1.Num = S2.Num
            AND S1.LNum1 = S2.LNum2
GROUP BY S1.Num ,
          S1.Word1 ,
          S1.Word2

IF @r IS NULL 
  SET @r = 0
RETURN @r
end
GO
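
A couple of quick calls show the behavior:

SELECT dbo.CheckAnagram('car', 'rac')   -- returns 1, the words are anagrams
SELECT dbo.CheckAnagram('car', 'cat')   -- returns 0, they are not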

Next, I wanted to get a list of items to check. Again, STRING_SPLIT is what I used, but Jeff’s code works as well. I decided to use another function, which will split the string and then do a cross join to run both combinations of words against each other to check for anagrams.

CREATE OR ALTER FUNCTION dbo.SplitandCheck
(@string AS VARCHAR(100))
RETURNS int
AS
BEGIN
DECLARE @r INT = 0;

WITH mycte
AS
(
SELECT checks = CASE WHEN a.value = b.value then 0
ELSE dbo.CheckAnagram(a.value, b.value)
end
FROM STRING_SPLIT(@string, ' ') AS a CROSS JOIN STRING_SPLIT(@string, ' ') AS b
)
SELECT @r = SUM(a.checks)
FROM mycte a

RETURN @r
END
GO
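
A quick test with a couple of sample phrases: a phrase with no anagram pairs returns 0, while one containing an anagram pair returns a nonzero sum (each offending pair is counted in both directions by the cross join).

SELECT dbo.SplitandCheck('abcde fghij')       -- 0, valid
SELECT dbo.SplitandCheck('abcde xyz ecdab')   -- nonzero, ecdab is an anagram of abcde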

This seemed to work, but when I got the final result with this code, it was wrong. Too high.

WITH mycte
AS
(
SELECT d.passphrase, IsAnagram = dbo.SplitandCheck(d.passphrase)
FROM dbo.Day4 AS d
)
SELECT mycte.IsAnagram, COUNT(*)
FROM mycte
--  WHERE mycte.IsAnagram = 0
GROUP BY mycte.IsAnagram

I had checked for 0s, but also included other results to try and debug my code. As I went through here, I realized that some of the input data had a string like “oicgs rrol zvnbna rrol”. Clearly “rrol” is an anagram of “rrol”. Initially I was knocking those pairs out (the CASE returns 0 when the two values match), since the cross join compares each word with itself anyway, but that also hides genuine duplicate words like this one.

As a result, I added this code to my function.

IF @r = 0
AND EXISTS(
SELECT COUNT(*)
FROM STRING_SPLIT(@string, ' ') AS ss
GROUP BY ss.value
HAVING COUNT(*) > 1
)
SET @r = 1

This will check for duplicate values. If there are dups, then clearly we have an issue already.

This gives me a set of items with a duplicate as well as those with anagrams. The 0 result is the count of valid passphrases.
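
Put another way, the final answer is just the count of rows where dbo.SplitandCheck returns 0:

WITH mycte
AS
(
SELECT d.passphrase, IsAnagram = dbo.SplitandCheck(d.passphrase)
FROM dbo.Day4 AS d
)
SELECT COUNT(*) AS ValidPassphrases
FROM mycte
WHERE mycte.IsAnagram = 0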


The GDPR Sky is Falling

David Poole wrote a nice summary of the GDPR regulations that come into force this spring in Europe. He covers a number of the sections, trying to provide a simple explanation of the potential issues from the perspective of a data engineer. That’s likely the role most of us fill, and I think David does a good job of trying to note the items that he (and maybe you) need to be concerned about.

If you want to read the full text, and you should, it’s here. To me, this is a more sensible, easier to understand type of regulation. It’s way better than SOX and most other regulations I’ve had to deal with, with a better view of balancing the idea that companies won’t have all the answers, and might not choose the best technology but do need to make an effort. I don’t think this will excuse just continuing to do business as you have, but it does read as though courts and authorities will have flexibility in their interpretation.

One of the main things that should be pointed out is that the 10 million Euro fine is a maximum, not a minimum. The same thing goes for the potential 2% of global turnover (revenue for the US folks). These are the highest potential penalties, though if you have made some effort to protect data and comply, I doubt you’d see fines at this level unless you’re negligent.

The key parts of this regulation are that companies should be paying more than lip service to data privacy and protection. They should be designing and building software and infrastructure that protects data, and also considers the point of view of the individual or organization that is the subject of the data. That’s a good move, in my opinion, having us actually think about the data and the ramifications of its use, sale, transfer, and release, rather than just focusing on our own goals. Most data professionals I know keep this in mind, so GDPR is a step in the right direction to push management to care.

We’ve got information at the Redgate site, which will help guide you. We are building features into existing and new products, and we’d love to sell you software if you can use it, but we’re also learning and trying to share what we know. This goes along with the core values at Redgate of being a part of the community and giving back, through SQLServerCentral, Simple Talk, blogs, and more.

Ultimately no one knows what GDPR will bring, and its application can present a risk to any of us that gather data from EU residents. I know Brent Ozar has already decided to stop EU business for the time being to avoid taking on this risk, and I’m sure other small companies may do the same thing. In one way that’s a shame, though a reasonable decision for a company. In another way, this opens opportunities for other businesses. People in Europe still need goods and services, and there are plenty of ways to comply with GDPR that I don’t think will be too hard, especially for those businesses based in Europe that won’t have a choice. There will be other companies that can fill any void left by companies that cease working in the EU.

GDPR isn’t the end of the world. I think it’s a good move in the right direction to balancing data value and protections. I’d like to see a better framework in the US that also ensures individuals have rights to exercise some control over all the data being gathered about them, as well as something that forces companies to actually consider data protection in their systems. There may be some bad results from GDPR, but most of us will adapt and continue to do business as we have in the past, albeit with better data controls.

Steve Jones

The Voice of the DBA Podcast

Listen to the MP3 Audio ( 5.4MB) podcast or subscribe to the feed at iTunes and Libsyn.


SSMS Line Numbers

I typically don’t turn on line numbers in SSMS, but while working with someone on a bit of code, they were referencing line numbers in a large script. By default, SSMS shows you the line, column, and character (as well as insert/overwrite status) in the status bar at the bottom. See the image below, where my cursor is on line 597.

(Screenshot: the SSMS status bar showing the cursor on line 597 of Semicolons.sql)

However, it’s visually harder to see lines here if those are the way you’re getting oriented with a script. It’s actually easier to have the line numbers on the left side.

This is easy to turn on in SSMS. First, click the Tools menu and choose Options (as shown here).

(Screenshot: the Tools menu with Options selected)

Next, go down to the Text Editor section on the left, expand that and then expand the Transact-SQL area. Click General, and you’ll see a checkbox for Line numbers on the right. Click that.

(Screenshot: the Options dialog at Text Editor, Transact-SQL, General, with the Line numbers checkbox)

And line numbers appear.

(Screenshot: the query window with line numbers displayed down the left side)

I do find these distracting most of the time, but there are situations where the line numbers are handy, especially when collaborating with others.


Finding the Titles in R

PASS has released the videos from this past Summit to members. I saw TJay Belt today asking about relating a video name to a session. I have the USB drive, so I looked on there. Here are the videos:

(Screenshot: the video files on the USB drive, named only by number)

Not terribly helpful. If you run the HTML file from the stick, you see this:

(Screenshot: the PASS Summit 2017 session listing from the HTML file)

If I hover over a title, I see the link as a specific video file. For example, the first one is 65545.mp4. With that, I looked around and found a javascript file with information in it.

The structure was like this:

//SID
Col0[0] = "65073";
Col0[1] = "65091";

…

//Speaker Name
Col2[0] = "Steve Stedman";
Col2[1] = "Kellyn Pot'Vin-Gorman";

…

//Session Name
Col4[0] = "Your Backup and Recovery Strategy";
Col4[1] = "DevOps Tool Combinations for Winning Agility";

All the data is in one file, but the index in each array matches. So Col0[0] is the SID for video 65073, which has Col2[0] as the speaker and col4[0] as the title.

Now I want to get these in some sort of order. First, let me copy this data into separate files. That will make importing easier. I’ll copy the SID array into one file, the speaker array into a second file and the title array into a third.

This gives me data like the list above, but I need to clean that. This is easiest in Sublime, with a few replacements. I did

  • “COL[“ –> “”
  • “] = “ –> “,”
  • “;” –> “”

This gives me a clean file that looks like this:

(Screenshot: the cleaned titles.txt file open in Sublime Text)

Working in R

I almost started to move this into T-SQL and a table, but since I’ve been playing with R, I decided to see what I could do there. First, I know I need to load data, so I read the first file into a data frame.

session.index = read.csv("e:\\Documents\\R\\videosid.txt", sep=",")

The column names aren’t great, so we’ll fix those:

 colnames(session.index) <- c("Index", "SessionSID")

Now let’s get the other data.

session.speaker = read.csv("e:\\Documents\\R\\passspeaker.txt", sep=",")
session.title = read.csv("e:\\Documents\\R\\titles.txt", sep=",")
colnames(session.speaker) <- c("Index", "Speaker")
colnames(session.title) <- c("Index", "Title")

I have three data frames. I want to combine them. Let’s do that. I’ll use the merge() function to do this. Since I’ve got common column names, I’ll use those.

pass.videos <- merge(session.index, session.title, by="Index")

pass.videos <- merge(pass.videos, session.speaker, by="Index")

This gives me a data frame with the index, title, and speaker. Now that I’ve got the data merged, let’s produce a file.

 write.table(pass.videos, file="e:\\Documents\\R\\passvideos.txt",sep=",")

With that done, I can see I have a list of video numbers, titles, and speakers.

"Index","SessionSID","Session","Speaker"
 "1",1,65091,"DevOps Tool Combinations for Winning Agility","Kellyn Pot'Vin-Gorman"
 "2",2,65092,"Oracle vs. SQL Server - The War of the Indices","Kellyn Pot'Vin-Gorman"
 "3",3,65112,"Make Power BI Your Own with the Power BI APIs","Steve Wake"

I did something in R. :)


Customize Software or Process

Many of us work with software and often wish that something was designed differently. I’m sure many of us feel the same way about our database schemas, which are usually even harder to change. In any case, we often want to mold software to fit our thought process. We’re often asked to alter software to meet the expectations of our customers as well, trying to alter visuals or workflows to match some method of working that our customer desires.

We often try to do this with COTS (Commercial off the shelf) software, notably ERP systems. SAP and Oracle applications, and similar software packages, have made many consultants rich and required millions (maybe billions) of dollars of investment by companies that look to ensure their logos, their workflow, the specific process is implemented in software. Even many of us that purchase some infrastructure software, or even get open source applications, may spend lots of resources on getting the software to fit our organization.

The thought that most people have had is that it’s cheaper to make the software fit the system, or the people, than vice versa. However, is that the best way to proceed? Do we want to customize software to work in a way different from the way in which it was designed? Or should we learn to alter some of our processes to better fit how the tools work? Do we think that we really have some special secret in our process that makes us more efficient? Or are we resistant to change?

For some software, like SAP, the system is designed to morph and fit an organization. Certainly when there are hundreds, or thousands of users, it might be worth the cost of customizing the software to fit the users. However when we have more specific software, such as that used to monitor infrastructure or deploy software or track features, with dozens of users, do we want to spend a lot of time changing the way things work? I sometimes wonder if we should instead focus on getting our people to learn a new way of working that flows with the software. After all, upgrades, patches, and other minor changes are less disruptive.

I don’t think there’s just one answer here, and certainly there are valid reasons to spend time and money on changing a system. I just think it’s worth some thought to be sure that changing software is a better decision than adapting our process to work with the application.

Steve Jones

The Voice of the DBA Podcast

Listen to the MP3 Audio ( 3.9MB) podcast or subscribe to the feed at iTunes and Libsyn.


Reading GDPR

In case you’re interested, the GDPR law is actually not bad to read. You might be affected by this, so go through the regulations. I’m doing that this week.

You can also see a nice article from David Poole at SQLServerCentral.

I do think GDPR will affect many of us, but to what extent, I’m not sure. Comments on what you think are welcome.


tsqlt Tests for Advent of Code 2017 Day 2

This is day 2 of the Advent of Code 2017. If you want to read about the puzzles, start with Day 1. As I worked through the puzzles, I decided that I should be testing using their test sets and solving the issues that way. This lets me use the sample data, but also add in my own sets to cover strange situations.

Here are the tests that I used for each part of day 2.

Part I

This wasn’t a tough puzzle, and the test is fairly simple. I had a function that solves the puzzle for a given input row. My test just sets up the sample input in the table, tab delimited, and then calls the function to calculate the total.
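
The dbo.AdventChecksum function itself isn’t shown in this post. As a rough idea, an inline TVF along these lines would return the max-minus-min difference for a row; the name, the space delimiter, and the empty-token filter here are my assumptions, not the original code.

-- Hypothetical sketch only, not the original dbo.AdventChecksum.
-- Swap the delimiter for CHAR(9) if the data is tab delimited.
CREATE OR ALTER FUNCTION dbo.AdventChecksum_Sketch
     (@DataRow VARCHAR(200))
RETURNS TABLE
AS
RETURN
( SELECT diff = MAX(CAST(ss.value AS INT)) - MIN(CAST(ss.value AS INT))
  FROM STRING_SPLIT(@DataRow, ' ') AS ss
  WHERE ss.value <> ''   -- consecutive spaces produce empty tokens
)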

EXEC tsqlt.NewTestClass @ClassName = N'tDay2'
go
CREATE OR ALTER PROCEDURE tDay2.[test day2 sample input]
AS
BEGIN
     ---------------
     -- Assemble
     ---------------
     DECLARE
         @expected INT = 18,
         @actual   int;
     
     EXEC tsqlt.faketable @TableName = 'Day2', @SchemaName = 'dbo';
     INSERT dbo.Day2 (DataRow)
      VALUES ('5    1    9    5')
           , ('7    5    3')
           , ('2    4    6    8')

    ---------------
     -- Act
     ---------------
     SELECT  @actual = SUM(b.diff)
      FROM day2 a
      CROSS APPLY dbo.AdventChecksum (a.DataRow) b

    ---------------
     -- Assert    
     ---------------
     EXEC tSQLt.AssertEquals
         @Expected = @expected,
         @Actual = @actual,
         @Message = N'An incorrect calculation occurred.';
END
GO
EXEC tsqlt.run 'tDay2.[test day2 sample input]';

Part II

The test here just calls a different function and has different input.
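
As with Part I, dbo.AdventChecksum3 isn’t shown here. A sketch of the evenly-divisible-pair logic might look like this; again the name and delimiter handling are assumptions, and it leans on the puzzle’s guarantee of exactly one such pair per row.

-- Hypothetical sketch only, not the original dbo.AdventChecksum3.
CREATE OR ALTER FUNCTION dbo.AdventChecksum3_Sketch
     (@DataRow VARCHAR(200))
RETURNS TABLE
AS
RETURN
( SELECT divmatch = CAST(a.value AS INT) / CAST(b.value AS INT)
  FROM STRING_SPLIT(@DataRow, ' ') AS a
       CROSS JOIN STRING_SPLIT(@DataRow, ' ') AS b
  WHERE a.value <> ''
    AND b.value <> ''
    AND CAST(a.value AS INT) > CAST(b.value AS INT)       -- avoid self-pairs and double counting
    AND CAST(a.value AS INT) % CAST(b.value AS INT) = 0   -- keep only the evenly divisible pair
)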

CREATE OR ALTER PROCEDURE tDay2.[test day2 b sample input]
AS
BEGIN
     ---------------
     -- Assemble
     ---------------
     DECLARE
         @expected INT = 9,
         @actual   int;
     
     EXEC tsqlt.faketable @TableName = 'Day2', @SchemaName = 'dbo';
     INSERT dbo.Day2 (DataRow)
      VALUES ('5    9    2    8')
           , ('9    4    7    3')
           , ('3    8    6    5')

    ---------------
     -- Act
     ---------------
     SELECT  @actual = SUM(b.divmatch)
      FROM day2 a
      CROSS APPLY dbo.AdventChecksum3 (a.DataRow) b

    ---------------
     -- Assert    
     ---------------
     EXEC tSQLt.AssertEquals
         @Expected = @expected,
         @Actual = @actual,
         @Message = N'An incorrect calculation occurred.';
END
GO
EXEC tsqlt.run 'tDay2.[test day2 b sample input]';

GO

The Travel Review

I got a note from United this week that summarized my travel for 2017. I’ve been feeling a bit itchy as I haven’t had any trips scheduled for 2018 so far. Strange for me, but looking over the summary, I’m glad.

Last year I flew 44 times with United, for 81k miles. I had 25 domestic flights and 19 international ones. That’s kind of crazy. Of course, outside of Denver, London is my most visited location.

For hotels, I logged 26 stays with Hilton that totaled 62 nights. That was low because I got stuck in a few other hotels at events early in the year. Plus I think I had 8 or 9 Air BnB nights. A lot of time away from home.

I’m a numbers person, so it’s neat to see the summaries from the services I use.

The one big number for me that shows me I traveled too much? Total workouts for the year: 265. That’s way too low.


Disable or Drop

When I started working with SQL Server and Windows, it seemed that the administrative side of many actions was limited. We could add and drop many items, but that was it. Relatively few tools allowed setting limits or disabling things, which was a pain. After all, I’d gotten used to setting up accounts for vendors and contractors in Active Directory, often disabling them when they weren’t in use. I couldn’t do this in SQL Server for many objects, which made it an administrative burden to set things back up when troubleshooting issues. This was pre-PowerShell and .NET, when any SMO access was a project in and of itself.

These days, SQL Server has done a great job of adding in the ability to pause or disable many objects. We’ve had the ability to lock out an account for many versions, which is a great way to set up a vendor tech support account when it’s needed. This is especially important for security these days, as we may want to be sure that we prevent access by any suspect accounts. Disabling them allows us to prevent their use, but keep all their rights and permissions in the event they are valid accounts.

We can disable indexes, which can be useful as a precursor to deleting them at some point. We can disable triggers, which is incredibly useful when you are testing or debugging actions on a table. We can disable Extended Events, audits, and more. All of these are useful actions for a developer or administrator, if you use them.
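
As a rough illustration (every object name below is made up), the disable statements are all one-liners:

ALTER INDEX IX_Orders_CustomerID ON dbo.Orders DISABLE;    -- re-enable later with ALTER INDEX ... REBUILD
DISABLE TRIGGER trg_Orders_Audit ON dbo.Orders;            -- re-enable with ENABLE TRIGGER
ALTER LOGIN VendorSupport DISABLE;                         -- re-enable with ALTER LOGIN ... ENABLE
ALTER EVENT SESSION TrackLogins ON SERVER STATE = STOP;    -- restart with STATE = START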

When things go wrong, we’re often stressed and pressed for time. If there are issues with a system, many of us make snap decisions, which might fix the problem or make the situation worse. Even in those cases where we fix an issue, deleting or dropping objects might cause us extra work later. My question for you is: what’s your first reaction? When you need to make a quick change to remove something, security access, an index, etc., do you drop or disable?

I would hope that you disable, as this removes the effects but keeps the object in the system with its associated metadata. Rebuilding permissions or trying to recover the old trigger code is a pain (since few people use a VCS, please start doing this). It’s possible that you won’t even be able to get things set back up in the same manner. That might be fine, but it’s not ideal, as new code should be tested, and in a crisis, it likely won’t be looked at too closely. At least the previous version of the code was tested in production.

Build the habit to disable, not drop, and I think you’ll be glad you did.

Steve Jones

The Voice of the DBA Podcast

Listen to the MP3 Audio ( 3.8MB) podcast or subscribe to the feed at iTunes and Libsyn.


Job Interviews: What is Normalization?

This is part of a series that looks at interview questions that you might be asked.

One of the cornerstones of a relational database is normalization. This is a database design technique that is used to decide how we might distribute data among tables and columns. If someone were to ask you “what is normalization” in your next job interview, what would you say? How do you answer this question?

Depending on the position, and your experience, perhaps you’d talk about this in practical terms, with an example. Maybe you have a scenario you use to explain the concepts. I tend to lean towards an e-commerce type database, with orders, customers, and products. That’s an easy concept for people to grasp. I can then give examples of how normalization might change the way we store data.

Do you know there are various forms of normalization? We can talk about first, second, third, fourth, or other normal forms of the database. Perhaps you can explain these from memory? Maybe you know the academic definitions. Maybe you have your own description. I certainly think that if the position may involve designing tables, you should be able to talk about the differences between these forms and determine what form a particular database is in.

My answer would include quick explanations of how I would avoid repeating groups, columns not dependent on the PK, and how sometimes the third normal form starts to impact performance because of the number of joins required for simple queries. I would also be ready to expand on these topics, explaining what a PK is and how to choose one.
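
As a quick, hypothetical illustration of that e-commerce example, compare a flattened design full of repeating groups with a roughly third normal form version:

-- Denormalized: customer details and a repeating group of product columns on every order row
CREATE TABLE dbo.OrdersFlat
(
     OrderID       INT,
     CustomerName  VARCHAR(100),
     CustomerCity  VARCHAR(100),
     Product1Name  VARCHAR(100),
     Product1Price DECIMAL(10, 2),
     Product2Name  VARCHAR(100),
     Product2Price DECIMAL(10, 2)
)
GO
-- Normalized (roughly 3NF): each fact is stored once and referenced by key
CREATE TABLE dbo.Customers
(
     CustomerID   INT PRIMARY KEY,
     CustomerName VARCHAR(100),
     CustomerCity VARCHAR(100)
)
CREATE TABLE dbo.Products
(
     ProductID   INT PRIMARY KEY,
     ProductName VARCHAR(100),
     Price       DECIMAL(10, 2)
)
CREATE TABLE dbo.Orders
(
     OrderID    INT PRIMARY KEY,
     CustomerID INT REFERENCES dbo.Customers (CustomerID),
     OrderDate  DATE
)
CREATE TABLE dbo.OrderLines
(
     OrderID   INT REFERENCES dbo.Orders (OrderID),
     ProductID INT REFERENCES dbo.Products (ProductID),
     Quantity  INT,
     PRIMARY KEY (OrderID, ProductID)
)

Reading the flat version needs only one table, while the normalized version needs joins, which is the performance trade-off mentioned above.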

This is a complex topic, and I doubt an interviewer would expect anyone to provide a thirty minute lecture, which still wouldn’t completely cover it. Instead, I would think that if an interviewer is interested in this topic, they will ask probing questions to determine if you understand the concepts. This means you can’t just memorize the definition. You should ensure that if I gave you a table, you could break it down into the entities for 1st, 2nd, and 3rd normal form. Perhaps you should read a few basic articles to get some understanding. Maybe you need to delve a little deeper into design and the anomalies that normalization attempts to mitigate.

Prepping for an interview shouldn’t be a cram session to learn more, but really a review of concepts you understand. Make sure you have some examples to explain your thoughts and practice a few of these questions with a friend by giving them an answer and listening to their feedback.

Steve Jones

The Voice of the DBA Podcast

Listen to the MP3 Audio ( 3.6MB) podcast or subscribe to the feed at iTunes and Libsyn.
