Another post for me that is simple and hopefully serves as an example for people trying to get blogging as #SQLNewBloggers.
This is day 4 of the Advent of Code 2017. If you want to read about the puzzles, start with Day 1.
This is Day 4, which deals with checking a series of passphrases for duplicates. The idea is that for some input, each of the “words” in the string must be unique. This means that if I have a phrase of “dog cat horse”, it’s valid. If I have “dog cat bird cat”, it’s invalid.
This feels like an easy thing to do in SQL. Let’s begin.
First, I’ll build a table and load some data.
CREATE TABLE Day4 ( passphrase VARCHAR(100) ) GO INSERT dbo.Day4 ( passphrase ) VALUES ('aa bb cc dd ee' ) , ('aa bb cc dd aa') , ('aa bb cc dd aaa')
This is the test set. In this set, there are two values that are valid and one that isn’t. I need to figure out which is which.
This seems like a perfect place to use String_Split(). Since this is a small set (512 rows), performance isn’t a big concern. If it were, I’d be moving to use Jeff Moden’s string splitter.
For my tests, I decided to try something quick and dirty. I built a quick inline TVF to return string_split values. This is the code:
CREATE FUNCTION dbo.words (@input AS VARCHAR(100)) RETURNS TABLE AS RETURN ( SELECT value FROM STRING_SPLIT(@input, ' ') AS ss )
This made things easier for me to read. I like seeing the code clean, and this way I could easily most to a custom version of Jeff’s code if I wanted.
I quickly tried a short CTE with a CROSS APPLY to get the sets of rows that didn’t have unique passphrases. I thought this was easier since I know these are incorrect. If I look for corrects, it’s a harder issue as I’ll have counts of 1 in both correct and incorrect strings.
WITH ctePass AS ( SELECT d.passphrase , ss.value, cnt = COUNT(*) FROM day4 AS d CROSS APPLY dbo.words(d.passphrase) AS ss --WHERE d.passphrase = 'aa bb cc dd ee' GROUP BY d.passphrase, ss.value HAVING COUNT(*) > 1 )
I also added a CTE that just gets me the total count.
, cteCount AS (SELECT total = COUNT(*) FROM day4)
Next, I added a CTE that would get my unique passphrases from the first CTE.
, cteUnique AS ( SELECT ctePass.passphrase , cnt = COUNT(cnt) FROM ctePass GROUP BY ctePass.passphrase )
Finally, I subtract the invalids from the total to get the vlaids. This gave me the answer that solved the puzzle.
SELECT total - COUNT(*) FROM cteUnique, cteCount GROUP BY cteCount.total
Now let’s load the actual data. I cut and pasted data into a file. Let’s load that.
TRUNCATE TABLE day4 GO BULK INSERT dbo.Day4 FROM 'e:\Documents\GitHub\AdventofCode\2017\Day4\Day4_Input.txt' GO SELECT * FROM dbo.Day4 AS d
I can see I have 512 rows in my table, so my result must be between 1 and 512. When I ran this, I got the correct result.
A slight admission here. I misread at first, so I didn’t have the total and just got the 57. When that didn’t work, I re-read things and realized this. I tried to check for the valids, but realized that it was simpler if I just subtracted.
Part II is tricky, and I made a mistake the first time I tried to solve this. It took a bit, but I eventually figured out the issue.
In this part, we have to avoid anagrams. That means if the phrase as ‘car’ and ‘rac’, it’s invalid. Getting anagrams is tricky, and I looked around to find a post on SO that covers this. I adapted that code from Pawel Dyl to search for anagrams.
My first task was to put this into a function:
CREATE OR ALTER FUNCTION dbo.CheckAnagram (@w1 VARCHAR(50) , @w2 VARCHAR(50) ) RETURNS int AS begin declare @r INT; WITH Src AS ( SELECT T.W1 , T.W2 FROM ( VALUES (@w1, @w2) ) AS T (W1, W2) ) , Numbered AS ( SELECT Src.W1 , Src.W2 , Num = ROW_NUMBER() OVER (ORDER BY (SELECT 1)) FROM Src ) , Splitted AS ( SELECT Num , Word1 = W1 , Word2 = W2 , L1 = LEFT(W1, 1) , L2 = LEFT(W2, 1) , W1 = SUBSTRING(W1, 2, LEN(W1)) , W2 = SUBSTRING(W2, 2, LEN(W2)) FROM Numbered UNION ALL SELECT Num , Word1 , Word2 , L1 = LEFT(W1, 1) , L2 = LEFT(W2, 1) , W1 = SUBSTRING(W1, 2, LEN(W1)) , W2 = SUBSTRING(W2, 2, LEN(W2)) FROM Splitted WHERE LEN(W1) > 0 AND LEN(W2) > 0 ) , SplitOrdered AS ( SELECT Splitted.Num , Splitted.Word1 , Splitted.Word2 , Splitted.L1 , Splitted.L2 , Splitted.W1 , Splitted.W2 , LNum1 = ROW_NUMBER() OVER (PARTITION BY Num ORDER BY L1) , LNum2 = ROW_NUMBER() OVER (PARTITION BY Num ORDER BY L2) FROM Splitted ) SELECT @r = CASE WHEN COUNT(*) = LEN(S1.Word1) AND COUNT(*) = LEN(S1.Word2) THEN 1 ELSE 0 END FROM SplitOrdered AS S1 JOIN SplitOrdered AS S2 ON S1.L1 = S2.L2 AND S1.Num = S2.Num AND S1.LNum1 = S2.LNum2 GROUP BY S1.Num , S1.Word1 , S1.Word2 IF @r IS NULL SET @r = 0 RETURN @r end GO
Next, I wanted to get a list of items to check. Again, STRING_SPLIT is what I used, but a Jeff’s code works as well. I decided to use another function, which will split the string and then do a cross join to run both combinations of works against each other to check for anagrams.
CREATE OR ALTER FUNCTION dbo.SplitandCheck (@string AS VARCHAR(100)) RETURNS int AS BEGIN DECLARE @r INT = 0; WITH mycte AS ( SELECT checks = CASE WHEN a.value = b.value then 0 ELSE dbo.CheckAnagram(a.value, b.value) end FROM STRING_SPLIT(@string, ' ') AS a CROSS JOIN STRING_SPLIT(@string, ' ') AS b ) SELECT @r = SUM(a.checks) FROM mycte a
This seemed to work, but when I got the final result with this code, it was wrong. Too high.
WITH mycte AS ( SELECT d.passphrase, IsAnagram = dbo.SplitandCheck(d.passphrase) FROM dbo.Day4 AS d ) SELECT mycte.IsAnagram, COUNT(*) FROM mycte -- WHERE mycte.IsAnagram = 0 GROUP BY mycte.IsAnagram
I had checked for 0s, but also included other results to try and debug my code. As I went through here, I realized that some fo the input data had a string like “oicgs rrol zvnbna rrol”. Clearly “rrol is an anagram of “rrol”. Initially I was knocking those out as a cross join includes those anyway.
As a result, I added this code to my function.
IF @r = 0 AND EXISTS( SELECT COUNT(*) FROM STRING_SPLIT(@string, ' ') AS ss GROUP BY ss.value HAVING COUNT(*) > 1 ) SET @r = 1
This will check for duplicate values. If there are dups, then clearly we have an issue already.
This give me a set of items with 1 dup as well as those with anagrams. The 0 result is the count of valid passphrases.