Innovation Needs Information

We are great about sharing knowledge in the technical community. I think the SQL Server community is better than most, as we do a good job of helping without the rude RTFM responses I find in other communities. Overall, we share lots of information with each other. I do know that many other communities are catching up, and I really appreciate the answers I’ve gotten from others when trying to repair my tractor or auto.

As the world looks to bring innovative solutions to various problems, it seems that machine learning will play some role. There are some amazing advances taking place in all sorts of areas, and certainly no shortage of organizations that are trying to see if ML enhancements will improve their capabilities. We hear about ML all the time as data professionals, and some of you are starting to try and use these algorithms to build new applications.

I ran across a great post about innovation that notes that the fuel for ML and related technologies is data. It’s important for training and evaluating systems, and there are companies making a lot of money providing training data. However, there is also lots of data inside companies, especially some of the big social media and communication companies, which gives them an advantage. I think that’s OK; after all, these companies have innovated to build large-scale enterprises and devoted resources to collecting data.

However, I do think that more and more data is needed, and companies will work to collect it if they can. That means more and more of us will manage additional data, all of which needs to be cleansed, stored, managed, and protected. More work for us, but also lots of opportunity. Especially if you learn how to process data in new ways.

Steve Jones

The Voice of the DBA Podcast

Listen to the MP3 Audio (2.9MB) podcast or subscribe to the feed at iTunes and Libsyn.


SQL in the City is coming Feb 28

The next edition of the SQL in the City broadcast is coming Feb 28, and I’m planning travel now to head back to the UK for the event. We don’t have a schedule yet, but once we do, it will be on the Redgate Hub Livestream page.

This edition will likely include lots more GDPR/security/masking content as well as updates on the new products.

Mark the date on your calendar, let your boss know it’s a training day, and get some friends together to join us for the day.


tSQLt Tests for Advent of Code 2017 Day 4

This is day 4 of the Advent of Code 2017. If you want to read about the puzzles, start with Day 1. As I worked through the puzzles, I decided that I should be testing my solutions against their sample sets. This lets me use the sample data, but also add my own sets to cover strange situations.

Here are the tests that I used for each part of day 4.

Part I

This is a fairly simple test. I’m returning a result set since the solution is a single query, but this is really a scalar. In this case, I’ll create a one-row, one-column expected table, then get the results from my solution (inside a proc) and insert them into Actual.

The rest is standard tSQLt testing framework. Fake a table, enter data.

CREATE OR ALTER PROCEDURE tDay4.[test Day4a sample data]
AS
BEGIN
    -- Assemble
    EXEC tsqlt.FakeTable @TableName = N'Day4';

    INSERT dbo.Day4
    (
        passphrase
    )
    -- SQL Prompt formatting off
    VALUES
        ('aa bb cc dd ee')
      , ('aa bb cc dd aa')
      , ('aa bb cc dd aaa')
    -- SQL Prompt formatting on

    CREATE TABLE #Expected (valid INT);
    INSERT #Expected
    (valid)
    VALUES
    (1);

    SELECT *
      INTO #actual
      FROM #Expected AS e
     WHERE 1 = 0;

    -- Act
    INSERT #actual
    EXEC dbo.Day4_a;

    -- Assert
    EXEC tsqlt.AssertEqualsTable @Expected = N'#Expected',
                                 @Actual = N'#Actual',
                                 @Message = N'Incorrect number of valid passphrases';
END
GO

EXEC tsqlt.run 'tDay4.[test Day4a sample data]';

Part II

This is the same as part I, but I change the inputs and results.

CREATE OR ALTER PROCEDURE tDay4.[test Day4b sample data]
AS
BEGIN
    -- Assemble
    EXEC tsqlt.FakeTable @TableName = N'Day4';

    INSERT dbo.Day4
    (
        passphrase
    )
    -- SQL Prompt formatting off
    VALUES
        ('abcde fghij')
      , ('abcde xyz ecdab')
      , ('a ab abc abd abf abj')
      , ('iiii oiii ooii oooi oooo')
      , ('oiii ioii iioi iiio')
    -- SQL Prompt formatting on

    CREATE TABLE #Expected (valid INT);
    INSERT #Expected
    (valid)
    VALUES
    (3);

    SELECT *
      INTO #actual
      FROM #Expected AS e
     WHERE 1 = 0;

    -- Act
    INSERT #actual
    EXEC dbo.Day4_b;

    -- Assert
    EXEC tsqlt.AssertEqualsTable @Expected = N'#Expected',
                                 @Actual = N'#Actual',
                                 @Message = N'Incorrect number of valid passphrases';
END
GO

EXEC tsqlt.run 'tDay4.[test Day4b sample data]';

Finding the Default Path with dbatools

I really like the dbatools project. This is a set of PowerShell cmdlets built by the community that are incredibly useful for migrations between SQL Server instances, as well as for various administrative actions. I have a short series on these items.

It’s been a while since I worked on any dbatools learning, what with holidays and travel. I find these cmdlets to be really handy, and if I had to manage a large estate of instances, they would be invaluable.

I ran across Get-DbaDefaultPath on another blog, and thought this would be a handy little item to have. It is.

I know the project changes, so I ran an update-module to get the latest items first. Then I tried the cmdlet. I ran a simple query against a local instance, and I quickly got the details: my instance and the Data, Log, Backup, and ErrorLog locations.

[Screenshot: PowerShell (Admin) window showing the cmdlet results]

This is handy information, especially as I often have multiple instances (same or different machines) and I may want to make sure I don’t put a database on a small drive, or I need to find out where a backup is (or errorlog).

The normal way of getting this information for me has been to right-click the instance in SSMS (possibly connecting first), open the properties, and look through the pages of the dialog. It works, but it’s slow.

This is a much quicker way for me to find out paths, which just makes admin easier. With tab completion, this will be the new way I find paths.

There are many advantages to using this to gather paths, check sizes, and do some scripting to find files, copy them, and make decisions about where to create databases. I can see this would be a great way to build scripts that include some decision making that adapts a simple process to new environments.
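
For comparison, here’s roughly what you’d run in T-SQL to pull the same locations yourself (my own sketch; I haven’t looked at how dbatools gathers these internally, and the SERVERPROPERTY values need SQL Server 2012 or later):

-- Default data and log locations
SELECT SERVERPROPERTY('InstanceDefaultDataPath') AS DefaultData,
       SERVERPROPERTY('InstanceDefaultLogPath') AS DefaultLog;

-- The default backup folder isn't exposed by SERVERPROPERTY, so read the instance registry key
DECLARE @BackupPath NVARCHAR(4000);
EXEC master.dbo.xp_instance_regread
     N'HKEY_LOCAL_MACHINE',
     N'Software\Microsoft\MSSQLServer\MSSQLServer',
     N'BackupDirectory',
     @BackupPath OUTPUT;
SELECT @BackupPath AS DefaultBackup;

One cmdlet with tab completion is a lot less to remember than that.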

If you haven’t tried dbatools, do it today. It’s a fantastic administration tool for your toolbelt.


How to Grow In Your Career

This is a great thread on Twitter from Jim Burger, a long time software developer. The thread is a distillation of how he keeps a passion for learning across decades. I think it’s a good list of items, seven of them in total. I’ll give a quick summary, but he has a bit more depth in each tweet. The seven are:

  • Invest in your brain
  • Embrace exploration
  • Find the fun in fundamentals
  • Mentors come from all kinds of places
  • Use the community as an opportunity
  • You don’t have to finish learning
  • Ask people to ask you questions

This is a great set of rules to help you learn. I certainly believe in the first one, and that’s one reason I preach for you to keep learning. You can always get a little better and grow your skills. Apart from the fun, it helps train your brain to keep growing and stay flexible. I also think that finding the fun in fundamentals is great. I appreciate the simple things I accomplish, and I like writing about those. That’s part of the reason I keep doing #sqlnewblogger posts.

I also think the community is a huge help. Someone asks a question, and even if it gets answered before you can, use this as an opportunity to solve the problem. Answer a question in our forums. Lots of people learn by helping others solve problems. In fact, plenty of my learning has come from solving problems for others. I get blog fodder, test myself, and solidify my own understanding of how things work.

If you’ve found a way to continue to have fun learning, let us know today. I know that I tend to do a little bit of everything, but certainly I take pride in what works, and even in what doesn’t. The latter is always an opportunity for me to learn a bit more.

Steve Jones

The Voice of the DBA Podcast

Listen to the MP3 Audio (2.9MB) podcast or subscribe to the feed at iTunes and Libsyn.


Advent of Code 2017 Day 4–#SQLNewBlogger

Another post for me that is simple and hopefully serves as an example for people trying to get started blogging as #SQLNewBloggers.

This is day 4 of the Advent of Code 2017. If you want to read about the puzzles, start with Day 1.

This is Day 4, which deals with checking a series of passphrases for duplicates. The idea is that for some input, each of the “words” in the string must be unique. This means that if I have a phrase of “dog cat horse”, it’s valid. If I have “dog cat bird cat”, it’s invalid.

This feels like an easy thing to do in SQL. Let’s begin.

First, I’ll build a table and load some data.

CREATE TABLE Day4
(
     passphrase VARCHAR(100)
)
GO
INSERT dbo.Day4
(
     passphrase
)
VALUES
    ('aa bb cc dd ee' )
  , ('aa bb cc dd aa')
  , ('aa bb cc dd aaa')

This is the test set. In this set, there are two values that are valid and one that isn’t. I need to figure out which is which.

This seems like a perfect place to use String_Split(). Since this is a small set (512 rows), performance isn’t a big concern. If it were, I’d be moving to use Jeff Moden’s string splitter.

For my tests, I decided to try something quick and dirty. I built a quick inline TVF to return string_split values. This is the code:

CREATE FUNCTION dbo.words
     (@input AS VARCHAR(100))
RETURNS TABLE
AS
RETURN ( SELECT value FROM STRING_SPLIT(@input, ' ') AS ss )

This made things easier for me to read. I like seeing the code clean, and this way I could easily move to a custom version of Jeff’s code if I wanted.
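
Just to sanity check the function (my own quick test, not part of the original solution), passing in one of the sample phrases should return one row per word:

SELECT value FROM dbo.words('dog cat horse');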

I quickly tried a short CTE with a CROSS APPLY to get the set of rows that didn’t have unique passphrases. I thought this was easier since I know those are invalid. If I look for the valid ones, it’s a harder problem, as I’ll have counts of 1 in both valid and invalid strings.

WITH ctePass
AS
(
SELECT d.passphrase 
        , ss.value, 
        cnt = COUNT(*)
FROM day4 AS d
     CROSS APPLY dbo.words(d.passphrase) AS ss
--WHERE d.passphrase = 'aa bb cc dd ee'
GROUP BY d.passphrase, ss.value 
HAVING COUNT(*) > 1
)

I also added a CTE that just gets me the total count.

, cteCount
AS
(SELECT total = COUNT(*) FROM day4)

Next, I added a CTE that gets the distinct passphrases from the first CTE — that is, one row per invalid passphrase.

, cteUnique
AS
(
SELECT ctePass.passphrase
, cnt = COUNT(cnt) 
  FROM ctePass
  GROUP BY ctePass.passphrase
  )

Finally, I subtract the invalids from the total to get the valids. This gave me the answer that solved the puzzle.

SELECT total - COUNT(*) FROM cteUnique, cteCount
GROUP BY cteCount.total
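
Pieced together, the whole solution is just those fragments in a single statement:

WITH ctePass AS
(
    SELECT d.passphrase,
           ss.value,
           cnt = COUNT(*)
    FROM day4 AS d
        CROSS APPLY dbo.words(d.passphrase) AS ss
    GROUP BY d.passphrase, ss.value
    HAVING COUNT(*) > 1
),
cteCount AS
(
    SELECT total = COUNT(*) FROM day4
),
cteUnique AS
(
    SELECT ctePass.passphrase,
           cnt = COUNT(cnt)
    FROM ctePass
    GROUP BY ctePass.passphrase
)
SELECT total - COUNT(*)
FROM cteUnique, cteCount
GROUP BY cteCount.total;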

Now let’s load the actual data. I cut and pasted data into a file. Let’s load that.

TRUNCATE TABLE day4
GO
BULK INSERT dbo.Day4 FROM 'e:\Documents\GitHub\AdventofCode\2017\Day4\Day4_Input.txt'
GO
SELECT 
  *
  FROM dbo.Day4 AS d

I can see I have 512 rows in my table, so my result must be between 1 and 512.  When I ran this, I got the correct result.

A slight admission here: I misread the puzzle at first, so I didn’t have the total and just got the 57. When that answer didn’t work, I re-read things and realized my mistake. I tried checking for the valid ones instead, but realized it was simpler to just subtract.

Part II

Part II is tricky, and I made a mistake the first time I tried to solve this. It took a bit, but I eventually figured out the issue.

In this part, we have to avoid anagrams. That means if the phrase has ‘car’ and ‘rac’, it’s invalid. Getting anagrams is tricky, and I looked around to find a post on SO that covers this. I adapted that code from Pawel Dyl to search for anagrams.

My first task was to put this into a function:

CREATE OR ALTER FUNCTION dbo.CheckAnagram
     (@w1 VARCHAR(50)
     , @w2 VARCHAR(50)
)
RETURNS int
AS
begin

declare @r INT;

WITH Src
AS
(
SELECT T.W1 ,
        T.W2
FROM
(
     VALUES
         (@w1, @w2)
) AS T (W1, W2)
) ,
      Numbered
AS
(
SELECT Src.W1 ,
        Src.W2 ,
        Num = ROW_NUMBER() OVER (ORDER BY (SELECT 1))
FROM Src
) ,
      Splitted
AS
(
SELECT Num ,
        Word1 = W1 ,
        Word2 = W2 ,
        L1 = LEFT(W1, 1) ,
        L2 = LEFT(W2, 1) ,
        W1 = SUBSTRING(W1, 2, LEN(W1)) ,
        W2 = SUBSTRING(W2, 2, LEN(W2))
FROM Numbered
UNION ALL
SELECT Num ,
        Word1 ,
        Word2 ,
        L1 = LEFT(W1, 1) ,
        L2 = LEFT(W2, 1) ,
        W1 = SUBSTRING(W1, 2, LEN(W1)) ,
        W2 = SUBSTRING(W2, 2, LEN(W2))
FROM Splitted
WHERE LEN(W1) > 0
       AND LEN(W2) > 0
) ,
      SplitOrdered
AS
(
SELECT Splitted.Num ,
        Splitted.Word1 ,
        Splitted.Word2 ,
        Splitted.L1 ,
        Splitted.L2 ,
        Splitted.W1 ,
        Splitted.W2 ,
        LNum1 = ROW_NUMBER() OVER (PARTITION BY Num ORDER BY L1) ,
        LNum2 = ROW_NUMBER() OVER (PARTITION BY Num ORDER BY L2)
FROM Splitted
)
SELECT @r =  CASE
                   WHEN COUNT(*) = LEN(S1.Word1)
                        AND COUNT(*) = LEN(S1.Word2) THEN
                       1
                   ELSE
                       0
               END
FROM SplitOrdered AS S1
     JOIN
     SplitOrdered AS S2
         ON S1.L1 = S2.L2
            AND S1.Num = S2.Num
            AND S1.LNum1 = S2.LNum2
GROUP BY S1.Num ,
          S1.Word1 ,
          S1.Word2

IF @r IS NULL 
  SET @r = 0
RETURN @r
end
GO
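
A quick sanity check with word pairs from the puzzle’s sample data (my own test, not from the original post):

SELECT dbo.CheckAnagram('abcde', 'ecdab');   -- 1, these are anagrams
SELECT dbo.CheckAnagram('abcde', 'fghij');   -- 0, no anagram here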

Next, I wanted to get a list of items to check. Again, STRING_SPLIT is what I used, but Jeff’s code works as well. I decided to use another function, which will split the string and then do a cross join to run all combinations of words against each other to check for anagrams.

CREATE OR ALTER FUNCTION dbo.SplitandCheck
(@string AS VARCHAR(100))
RETURNS int
AS
BEGIN
DECLARE @r INT = 0;

WITH mycte
AS
(
SELECT checks = CASE WHEN a.value = b.value then 0
ELSE dbo.CheckAnagram(a.value, b.value)
end
FROM STRING_SPLIT(@string, ' ') AS a CROSS JOIN STRING_SPLIT(@string, ' ') AS b
)
SELECT @r = SUM(a.checks)
FROM mycte a

RETURN @r
END
GO

This seemed to work, but when I got the final result with this code, it was wrong. Too high.

WITH mycte
AS
(
SELECT d.passphrase, IsAnagram = dbo.SplitandCheck(d.passphrase)
FROM dbo.Day4 AS d
)
SELECT mycte.IsAnagram, COUNT(*)
FROM mycte
--  WHERE mycte.IsAnagram = 0
GROUP BY mycte.IsAnagram

I had checked for 0s, but also included the other results to try and debug my code. As I went through here, I realized that some of the input data had a string like “oicgs rrol zvnbna rrol”. Clearly “rrol” is an anagram of “rrol”. Initially I was knocking those pairs out (the CASE returns 0 when a.value = b.value), since a cross join matches each word against itself anyway.

As a result, I added this code to my function, just before the RETURN.

IF @r = 0
AND EXISTS(
SELECT COUNT(*)
FROM STRING_SPLIT(@string, ' ') AS ss
GROUP BY ss.value
HAVING COUNT(*) > 1
)
SET @r = 1

This will check for duplicate values. If there are dups, then clearly we have an issue already.

This gives me a set of items with at least one dup as well as those with anagrams. The 0 result is the count of valid passphrases.
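
If you just want the final number, filtering the same CTE for the 0 results (the commented-out WHERE above) returns it directly:

WITH mycte AS
(
    SELECT d.passphrase,
           IsAnagram = dbo.SplitandCheck(d.passphrase)
    FROM dbo.Day4 AS d
)
SELECT COUNT(*)
FROM mycte
WHERE mycte.IsAnagram = 0;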


The GDPR Sky is Falling

David Poole wrote a nice summary of the GDPR regulations that come into force this spring in Europe. He covers a number of the sections, trying to provide a simple explanation of the potential issues from the perspective of a data engineer. That’s likely the role most of us fill, and I think David does a good job of noting the items that he (and maybe you) need to be concerned about.

If you want to read the full text, and you should, it’s here. To me, this is a more sensible, easier-to-understand type of regulation. It’s way better than SOX and most other regulations I’ve had to deal with, with a better recognition that companies won’t have all the answers and might not choose the best technology, but do need to make an effort. I don’t think this will excuse just continuing to do business as you have, but it does read as though courts and authorities will have flexibility in their interpretation.

One of the main things that should be pointed out is that the 10 million Euro fine is a maximum, not a minimum. The same thing goes for the potential 2% of global turnover (revenue for the US folks). These are the highest potential penalties, though if you have made some effort to protect data and comply, I doubt you’d see fines at this level unless you’re negligent.

The key parts of this regulation are that companies should be paying more than lip service to data privacy and protection. They should be designing and building software and infrastructure that protects data, and also considers the point of view of the individual or organization that is the subject of the data. That’s a good move, in my opinion, having us actually think about the data and the ramifications of its use, sale, transfer, and release, rather than just focusing on our own goals. Most data professionals I know keep this in mind, so GDPR is a step in the right direction to push management to care.

We’ve got information at the Redgate site, which will help guide you. We are building features into existing and new products, and we’d love to sell you software if you can use it, but we’re also learning and trying to share what we know. This goes along with the core values at Redgate of being a part of the community and giving back, through SQLServerCentral, Simple Talk, blogs, and more.

Ultimately no one knows what GDPR will bring, and its application can present a risk to any of us that gather data from EU residents. I know Brent Ozar has already decided to stop EU business for the time being to avoid taking on this risk, and I’m sure other small companies may do the same thing. In one way that’s a shame, though it’s a reasonable decision for a company. In another way, this opens opportunities for other businesses. People in Europe still need goods and services, and there are plenty of ways to comply with GDPR that I don’t think will be too hard, especially for those businesses based in Europe that won’t have a choice. There will be other companies that can fill any void left by companies that cease working in the EU.

GDPR isn’t the end of the world. I think it’s a good step in the right direction toward balancing data value and protection. I’d like to see a better framework in the US that also ensures individuals have rights to exercise some control over all the data being gathered about them, as well as something that forces companies to actually consider data protection in their systems. There may be some bad results from GDPR, but most of us will adapt and continue to do business as we have in the past, albeit with better data controls.

Steve Jones

The Voice of the DBA Podcast

Listen to the MP3 Audio (5.4MB) podcast or subscribe to the feed at iTunes and Libsyn.


SSMS Line Numbers

I typically don’t turn on line numbers in SSMS, but while working with someone on a bit of code, they were referencing line numbers in a large script. By default, SSMS shows you the line, column, and character (as well as insert/overwrite status) in the status bar at the bottom. See the image below, where my cursor is on line 597.

[Screenshot: Semicolons.sql in SSMS, with the status bar showing line 597]

However, it’s visually harder to find a line this way if line numbers are how you’re getting oriented in a script. It’s easier to have the line numbers in the left margin.

This is easy to turn on in SSMS. First, click the Tools menu and choose Options (as shown here).

[Screenshot: the Tools > Options menu item]

Next, go down to the Text Editor section on the left, expand that and then expand the Transact-SQL area. Click General, and you’ll see a checkbox for Line numbers on the right. Click that.

[Screenshot: the Options dialog, Text Editor > Transact-SQL > General page]

And line numbers appear.

[Screenshot: Semicolons.sql in SSMS with line numbers displayed]

I do find these distracting most of the time, but there are situations where the line numbers are handy, especially when collaborating with others.


Finding the Titles in R

PASS has released the videos from this past Summit to members. I saw TJay Belt ask today about relating a video name to a session. I have the USB drive, so I looked on there. Here are the videos:

[Screenshot: the Video folder on the USB drive]

Not terribly helpful. If you run the HTML file from the stick, you see this:

[Screenshot: the PASS Summit 2017 page from the HTML file]

If I hover over a title, I see the link points to a specific video file. For example, the first one is 65545.mp4. With that, I looked around and found a JavaScript file with information in it.

The structure was like this:

//SID
Col0[0] = "65073";
Col0[1] = "65091";

…

//Speaker Name
Col2[0] = "Steve Stedman";
Col2[1] = "Kellyn Pot'Vin-Gorman";

…

//Session Name
Col4[0] = "Your Backup and Recovery Strategy";
Col4[1] = "DevOps Tool Combinations for Winning Agility";

All the data is in one file, but the index in each array matches. So Col0[0] is the SID for video 65073, which has Col2[0] as the speaker and col4[0] as the title.

Now I want to get these in some sort of order. First, let me copy this data into separate files. That will make importing easier. I’ll copy the SID array into one file, the speaker array into a second file and the title array into a third.

This gives me data like the list above, but I need to clean that. This is easiest in Sublime, with a few replacements. I did:

  • "COL[" -> ""
  • "] = " -> ","
  • ";" -> ""

This gives me a clean file that looks like this:

[Screenshot: the cleaned titles.txt file open in Sublime Text]
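
Based on the replacements above, the SID file ends up looking roughly like this (my reconstruction from the two rows shown earlier):

0,"65073"
1,"65091"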

Working in R

I almost started to move this into T-SQL and a table, but since I’ve been playing with R, I decided to see what I could do there. First, I know I need to load data, so I read the first file into a data frame.

session.index = read.csv("e:\\Documents\\R\\videosid.txt", sep=",")

The column names aren’t great, so we’ll fix those:

 colnames(session.index) <- c("Index", "SessionSID")

Now let’s get the other data.

session.speaker = read.csv("e:\\Documents\\R\\passspeaker.txt", sep=",")
session.title = read.csv("e:\\Documents\\R\\titles.txt", sep=",")
colnames(session.speaker) <- c("Index", "Speaker")
colnames(session.title) <- c("Index", "Title")

I have three data frames. I want to combine them. Let’s do that. I’ll use the merge() function to do this. Since I’ve got common column names, I’ll use those.

pass.videos <- merge(session.index, session.title, by="Index")

pass.videos <- merge(pass.videos, session.speaker, by="Index")

This gives me a data frame with the index, title, and speaker. Now that I’ve got the data merged, let’s produce a file.

 write.table(pass.videos, file="e:\\Documents\\R\\passvideos.txt",sep=",")

With that done, I can see I have a list of video numbers, titles, and speakers.

"Index","SessionSID","Session","Speaker"
 "1",1,65091,"DevOps Tool Combinations for Winning Agility","Kellyn Pot'Vin-Gorman"
 "2",2,65092,"Oracle vs. SQL Server - The War of the Indices","Kellyn Pot'Vin-Gorman"
 "3",3,65112,"Make Power BI Your Own with the Power BI APIs","Steve Wake"

I did something in R.


Customize Software or Process

Many of us work with software and often wish that something was designed differently. I’m sure many of us feel the same way about our database schemas, which are usually even harder to change. In any case, we often want to mold software to fit our thought process. We’re often asked to alter software to meet the expectations of our customers as well, trying to alter visuals or workflows to match some method of working that our customer desires.

We often try to do this with COTS (commercial off-the-shelf) software, notably ERP systems. SAP and Oracle applications, and similar software packages, have made many consultants rich and required millions (maybe billions) of dollars of investment by companies that look to ensure their logos, their workflows, and their specific processes are implemented in the software. Even many of us that purchase some infrastructure software, or get open source applications, may spend lots of resources on getting the software to fit our organization.

The thought that most people have had is that it’s cheaper to make the software fit the system, or the people, than vice versa. However, is that the best way to proceed? Do we want to customize software to work in a way different from the way in which it was designed? Or should we learn to alter some of our processes to better fit with how the tools work? Do we think that we really have some special secret in our process that makes us more efficient? Or are we resistant to change?

For some software, like SAP, the system is designed to morph and fit an organization. Certainly when there are hundreds or thousands of users, it might be worth the cost of customizing the software to fit them. However, when we have more specific software, such as that used to monitor infrastructure, deploy software, or track features, with dozens of users, do we want to spend a lot of time changing the way things work? I sometimes wonder if we should instead focus on getting our people to learn a new way of working that flows with the software. After all, upgrades, patches, and other minor changes are then less disruptive.

I don’t think there’s just one answer here, and certainly there are valid reasons to spend time and money on changing a system. I just think it’s worth some thought to be sure that changing the software is a better decision than adapting our process to work with the application.

Steve Jones

The Voice of the DBA Podcast

Listen to the MP3 Audio (3.9MB) podcast or subscribe to the feed at iTunes and Libsyn.
