Loading All CSV Files with PowerShell

I ran across Denis Gobo's post about working with names from Social Security cards and wanted to play with the dataset. However, rather than use xp_cmdshell, as I might have in the past, I decided to use PowerShell. I still need to repeat this with Python, but for now, this worked.

There is a file you can download that contains a lot of .txt files, each one a CSV. You can read about this in more detail in Denis' post. Essentially each file has three fields, with the year in the file name. I downloaded and unzipped the files into a folder.

Here was my process:

  1. Loop through the files
  2. Extract the year from the file name
  3. Use bcp to load the file into a staging table
  4. Call a proc that takes the year as a parameter and moves all the data

With that in mind, here’s how I set things up.

The SQL Server Side

I started by creating two tables.

CREATE TABLE Names
(   FirstName  VARCHAR(500),
     Gender     CHAR(1),
     NameCount  INT,
     YearInFile INT
);
CREATE TABLE Names2
(   FirstName VARCHAR(500),
     Gender    CHAR(1),
     NameCount INT
);

With these in place, I created a procedure:

CREATE PROCEDURE MoveNames @year INT
AS
INSERT Names SELECT FirstName, Gender, NameCount, @year FROM Names2;
TRUNCATE TABLE Names2;
GO

I tested these and they worked fine for my process. The idea is to load each file into Names2, then move the data into Names, adding the year from the file name as I go. There are probably better ways, but this worked fine for a relatively small load (a few million rows).

PowerShell

With PoSh, I started with a simple loop. I tend to set variables first, since that makes it easier to turn the script into a function later.

# folder with the unzipped .txt files
$sourcefolder = "E:\Downloads\names"

$sourcefiles = Get-ChildItem $sourcefolder -Filter "*.txt"

foreach ($file in $sourcefiles) {
    # the year is the four characters starting at position 3 of the file name
    $yearinfile = $file.Name.Substring(3, 4)
    Write-Host "Loading File:" $file.Name " year: " $yearinfile
}

Running this gets me a list of all the files along with their years. That's what I want, since I'm going to process each file and load it in with bcp. Thanks to Mike Fal for the outline of how I'll do this.

Next, I set more variables at the top:

$InstanceName = ".\SQL2016"
$DatabaseName = "Sandbox"
$StagingTableName = "Names2"
$StagingProc = "MoveNames"

I'll use these in the code below. Inside the loop, I call Invoke-Expression to run bcp and then Invoke-Sqlcmd to run my proc.

$cmd = "bcp '$DatabaseName.dbo.[$StagingTableName]' in '$file' -S'$InstanceName' -T -c -t','"

Invoke-Expression $cmd
Invoke-Sqlcmd -ServerInstance $InstanceName -Database $DatabaseName -Query "$StagingProc $yearinfile"
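
Assembled into one script, the whole thing looks roughly like this. One small tweak here: I use $file.FullName in the bcp command so the script doesn't depend on being run from the folder holding the files.

# set variables first, so this is easy to turn into a function later
$InstanceName     = ".\SQL2016"
$DatabaseName     = "Sandbox"
$StagingTableName = "Names2"
$StagingProc      = "MoveNames"
$sourcefolder     = "E:\Downloads\names"

$sourcefiles = Get-ChildItem $sourcefolder -Filter "*.txt"

foreach ($file in $sourcefiles) {
    # the year is the four characters starting at position 3 of the file name
    $yearinfile = $file.Name.Substring(3, 4)
    Write-Host "Loading File:" $file.Name " year: " $yearinfile

    # bcp the file into the staging table
    $cmd = "bcp '$DatabaseName.dbo.[$StagingTableName]' in '$($file.FullName)' -S'$InstanceName' -T -c -t','"
    Invoke-Expression $cmd

    # move the staged rows into Names, tagging them with the year
    Invoke-Sqlcmd -ServerInstance $InstanceName -Database $DatabaseName -Query "$StagingProc $yearinfile"
}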

From here, I put this in a file and ran it from my download location. The result:

(Screenshot: the script output in the Windows PowerShell ISE)

and from SQL Server:

(Screenshot: the loaded data queried in SSMS against the Sandbox database)

Now I can run some queries, and find out where I stand. #26, it turns out.

(Screenshot: query results in SSMS showing the name's ranking)


The Traveling Data Professional

It's summertime in the Northern Hemisphere, a time when much of the world takes holiday time away from work. Most kids are out of school for at least part of the June/July/August period. With many of us trying to spend time with our families, this is the part of the year when we schedule trips and events.

Taking time away to be with family is important, but just getting away from work is healthy for all of us. Whether it’s a planned trip to some new location, visiting relatives in another city, or even just taking time to enjoy a leisure activity, we all need a break. Some of us will try to do too much during our break, arriving back at work physically worn out. Even in these situations, it’s still nice to get your mind away from the environment you face most days. I think any break is good, and especially in the era of smartphones and nearly ubiquitous connectivity, time away is good for your health.

This week, I’m wondering if any of you have plans for an adventure that you’re excited about. Are you going somewhere new or engaging in some activity you look forward to? Perhaps you don’t have big plans this year, but something you’re working towards in the future? Make us jealous and let us know what holiday break might be coming your way.

I’ve been more of a winter person for most of my life, looking to take time off from work when there’s snow on the ground, but I’ve had my share of summer trips. Usually my family will camp and take a few trips. With my children getting older, I expect to take even more trips when it’s just my wife and me. Since I can work from anywhere, I can follow along on some of her horse adventures.

This year we had planned on a long trip to Glacier National Park, but with older kids, scheduling looks to be a bit of an issue, so we’ll likely stick to camping in Colorado. However, I am honored to have been chosen to speak at the Data Platform Summit in Bangalore, India, so we’ll all have a bit of a vacation in August as we travel to a new country.

Enjoy the weekend and let me know what your plans are this year.

Steve Jones

The Voice of the DBA Podcast

Listen to the MP3 Audio (3.4MB) podcast or subscribe to the feed at iTunes and Libsyn.


Move a Stored Procedure to a New Schema–SQLNewBlogger

Another post for me that is simple and hopefully serves as an example for people trying to get started blogging as #SQLNewBloggers.

One of the things I needed to do recently was move an object. I was testing the WideWorldImporters database and created an object in the dbo schema. That’s the default for me, which is fine. However, in this case I wanted it in a different schema.

The way to do this is with the ALTER SCHEMA command. It has a TRANSFER option that takes the object's current schema-qualified name and moves the object into the schema you specify.

In my case, I had the dbo.GetOpenPurchaseOrderCount procedure in my database.

(Screenshot: the procedure shown in the dbo schema in SSMS)

I used this command to move it.

ALTER SCHEMA Website
TRANSFER dbo.GetOpenPurchaseOrderCount

And then verified things moved.

(Screenshot: the procedure now shown in the Website schema in SSMS)
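
If you'd rather verify from a script than from the GUI, a quick query of sys.procedures works. Here's a minimal sketch using Invoke-Sqlcmd; the instance and database names are placeholders for whatever you're working against.

# check which schema the procedure lives in now
$checkQuery = @"
SELECT OBJECT_SCHEMA_NAME(object_id) AS SchemaName, name
FROM sys.procedures
WHERE name = 'GetOpenPurchaseOrderCount';
"@
Invoke-Sqlcmd -ServerInstance ".\SQL2016" -Database "WideWorldImporters" -Query $checkQuery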

 

SQLNewBlogger

This was one of those quick items where I checked the ALTER commands, thinking the capability was in there. I didn't see a changeobjectschema procedure, and since this was a new skill for me, it made for a five minute blog.


Backup Plans for Data Loss

The British Airways computer failure has been on my mind for a few weeks. It seems like such an epic failure in 2017 from a highly visible public company that many, many people depend on. I still struggle to believe the power explanation, and really, I'm not sure there's any explanation I would accept. At this point in our industry, there's no good reason for any large, global company not to be able to restart services inside of a couple of hours in a DR site. In fact, it really shouldn't even take that long.

However, many of us will have a failure or disaster at some point. It might not even be a hardware failure or system crash; it's much more likely that a user will cause an issue. As our systems grow larger, perhaps even more loaded with transactions, we might not always be able to easily separate good data from bad, and I expect most of us will experience a restore.

For many of us this will mean losing some data from the system. Even with frequent log backups, we might end up with a short period where we can't recover data. Most of us should have conversations with business stakeholders about what the acceptable level of data loss is, and plan to meet those requirements. We should also have plans for how to rebuild data. I wouldn't recommend a full test on a system, but it might be worth a few conversations with those who deal with transactional data about how the latest data might be recreated.

No one wants to lose data, and in many cases, there are ways to rebuild or recover the data with manual efforts. Perhaps your company has paper records, or maybe there’s an audit trail that could be used to reconstruct actions. Maybe you rely on memory or even customers to provide information again. Today I’m wondering if you’ve thought about how you might recover data in a non-technical way and what methods you’d use.

Steve Jones

The Voice of the DBA Podcast

Listen to the MP3 Audio (3.6MB) podcast or subscribe to the feed at iTunes and Libsyn.


Using OPENROWSET in SQL Server on Linux

I wanted to import the Million Song Dataset into SQL Server on Linux. There's a GitHub repo with the SQL that lets you use this with the graph database features. However, it's built for Windows.

Linux is a slightly different beast. Once I started down this path, I had memories of working on SunOS in college, messing with permissions and moving files.

I run Ubuntu in VMware, so I first downloaded the files to my Documents folder. That's pretty easy. However, once there, the mssql user can't read them. Rather than mess with the permissions on my home directory, I decided to move the files to a location where the mssql user could read them.

First, I need to use mv to move the files. However, the default location for SQL Server (/var/opt/mssql) doesn't let my regular user drop files in there, so I need to sudo the mv.

sudo mv unique_tracks.txt /var/opt/mssql/unique_tracks.txt

I repeated this for each file.

However, I still had permissions errors. Files have their own permissions in Linux, so I needed to alter those. I decided to use chown, since these are temp files that SQL Server will use, and once the data is imported, I'll delete them.

sudo chown mssql unique_tracks.txt

From here, I could easily run the OPENROWSET commands and get the data loaded. Now to play around with a graph.
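
For reference, here's a minimal sketch of what such a load might look like when run from PowerShell with Invoke-Sqlcmd. The format file, server address, and login are my placeholders, and the actual script in the GitHub repo defines its own tables and format, so treat this as an outline only.

# prompt for the SQL login password (assumes a SQL login such as sa on the Linux instance)
$saPassword = Read-Host "sa password"

# a sketch only: assumes a format file that describes the columns in unique_tracks.txt
$loadQuery = @"
SELECT *
FROM OPENROWSET(
    BULK N'/var/opt/mssql/unique_tracks.txt',
    FORMATFILE = N'/var/opt/mssql/unique_tracks.fmt'
) AS tracks;
"@
Invoke-Sqlcmd -ServerInstance "localhost" -Database "master" -Username "sa" -Password $saPassword -Query $loadQuery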


Speaking at the Data Platform Summit 2017

I am honored to be speaking at the Data Platform Summit 2017. The event takes place August 17-19, 2017 in Bangalore, India. This will be my first trip to India, and I'm excited to be traveling halfway around the world (literally a 12:30 time difference) to speak at this event.

I've got a pre-con on Aug 16, and then a few sessions during the main conference. If you have the chance, register and come. I've heard good things about the event and I look forward to seeing some of you there.

See if you can pick me out of the Teaser Video:


Migrating Schema Between Platforms

PostgreSQL is a popular relational database, one that was designed to be an object relational system, based on standards, and allowing for extensibility. I've always thought this was a great system, exceeding SQL Server in some ways, though falling short in others. It's popular, and certainly used enough that both Azure and Amazon Aurora offer PostgreSQL compatibility products.

I ran across an interesting post, where AWS has a Schema Conversion Tool that will help you migrate from Oracle to Aurora running PostgreSQL. I think this makes some sense as there are some capabilities in PostgreSQL that make moving Oracle code easy. AWS also has a Migration Service to move SQL Server to MySQL. I’m not sure how smoothly that goes, but SQL isn’t a terribly complex language and perhaps the conversion isn’t too bad, though I’m sure this isn’t painless.

I wonder how many people will choose to move their database systems to a new platform. Whether you’re on a commercial database, such as SQL Server or Oracle, or on a free one (MySQL, PostgreSQL), changing the underlying platform is disruptive for many applications. I’m sure there are some software packages that could change a driver and connection string and their code might work, but I bet that’s relatively few systems.

For most of us, the cost of change is high. New skills need to be learned and new code written. Staff may leave, requiring an organization to hire new employees, which is always a bit of a gamble. The entire process is also a disruption that prevents other work from being completed in a timely manner. Often a large ROI is needed to justify the change, though sometimes management decides to change for other reasons, regardless of cost and ignoring any ROI. I have even seen some developers push to make the platform change, working many extra hours out of a desire to port to a new platform for dubious reasons.

I do see plenty of companies looking to move away from Oracle systems, with a very large cost for support and maintenance. While SQL Server isn’t always cheap, it is much less expensive and Microsoft has programs to help port Oracle to SQL Server. Companies may also choose to move to MySQL or PostgreSQL from Oracle, accepting the disruption and costs with a long view towards reducing their yearly outlay for database services.

I have migrated platforms before, and it's been an ordeal each time. I haven't ported code to a new platform, and I'm not sure I would. Perhaps if we were abandoning a codebase and rewriting the application I'd consider it, but in most cases, the cost of staff and time to rework code makes this an expensive proposition. If you have different thoughts or experiences, let us know today.

Steve Jones

The Voice of the DBA Podcast

Listen to the MP3 Audio (4.2MB) podcast or subscribe to the feed at iTunes and Libsyn.


DevOps – Downloading a File from the Internet with PoSh

One of the things we need to do as data professionals is move data files around. Often we’ll get a local path, but in looking for public data sets, I wanted to get a file from the Internet. In this case, a rather large file.

I could have put the URL in a browser, but the file was slow to load. Waiting ten minutes or so for the download to complete and doing a "Save As" is simple enough, but it isn't easily repeatable. Plus, since I needed to get a few files, and might need to do it again, I thought a PoSh download would be better.

A quick search turned up a few options. I could use System.Net.WebClient, or I could use Invoke-WebRequest. I decided on the latter because it can use credentials or even parse the response before saving it.

I just had a simple item, so I used:

Invoke-WebRequest -Uri "http://labrosa.ee.columbia.edu/millionsong/sites/default/files/AdditionalFiles/unique_tracks.txt" -OutFile "e:\Downloads\unique_tracks.txt"

I could easily have wrapped this in a function to simulate a copy command, and I may do that at some point. For now, I can drop this in a file, copy/paste, and change a few filenames.
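
If I did wrap it, a rough sketch might look like this (the function name and parameters are just my placeholders):

# a simple copy-style wrapper around Invoke-WebRequest
function Copy-WebFile {
    param (
        [Parameter(Mandatory)] [string] $Uri,
        [Parameter(Mandatory)] [string] $Destination
    )
    Invoke-WebRequest -Uri $Uri -OutFile $Destination
}

# usage
Copy-WebFile -Uri "http://labrosa.ee.columbia.edu/millionsong/sites/default/files/AdditionalFiles/unique_tracks.txt" -Destination "e:\Downloads\unique_tracks.txt"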

If this were part of a regular process, such as getting files from a remote web server, I could easily automate this in a task on some server. For now, this is a quick, easy way to get a file from the Internet without a browser.


DevOps Can Help

Amazon had a load balancer failure in 2012. The analysis of the event shows that missing data in the devices caused the issues. Restoring data to these devices is complex, far more complex and with less mature tools than most database platforms, and the result was a nearly ten hour period during which some customers experienced issues.

In 2016, Gliffy had three days of downtime from a database error. In this case, an admin was updating a replicated system but failed to sever a link with the primary node. Forgetting this step caused data to be removed on that node, and the removal replicated to the secondary nodes. They discovered the restore and replay of logs would take many days due to the size. They hadn't practiced a DR situation in some time, and were not prepared for the delays.

Digital Ocean received alerts earlier in 2017 that some services were not functioning. They traced this down to the primary database being deleted. The issue was a process that used the wrong credentials for automated testing, and I'm guessing that part of the testing involved removing and rebuilding a database. It took five hours across the middle of the night to restore the main database, and a couple more hours to get the replicas caught up.

In the first two cases, there were issues with the deployment of changes to systems, as well as inadequate backup and restore processes. In both of these cases, I would argue that a good DevOps process would have automated the way the code was deployed, ensuring that steps weren't forgotten and that pre-deployment backups captured the state of the configuration. DevOps includes the "Ops" changes and should ensure that all state information is captured and stored in a VCS. If this had been done, it's possible that these companies wouldn't have had these issues.

In the last case, whoever sets up a system is certainly responsible for using the correct credentials. It's easy to say that a developer or tester shouldn't know the production credentials, but it's entirely possible that the person who configured the process would have them. I don't know what to do here, as the first test of the process might cause the issue. Maybe a second set of eyes is important for security changes in automated systems? That certainly could be part of your DevOps process. What I'd like here is two-factor authentication for all security setup, including for SQL Server.

DevOps isn't a prescriptive set of things that someone does. Whenever I talk with people about DevOps and they give reasons why a particular step I've demonstrated won't work for them, I tell them to stop doing that step. After all, the way you implement DevOps doesn't have to match what I did. We each need to do what works for our environment, and ensure we have some consistency and repeatability in our process, hopefully preventing downtime from simple mistakes.

Steve Jones

The Voice of the DBA Podcast

Listen to the MP3 Audio (4.2MB) podcast or subscribe to the feed at iTunes and Libsyn.


The Need for DRE

“Today’s database professionals must be engineers, not administrators”

That’s a quote from an interview at O’Reilly on Database Reliability Engineering. I don’t want to quibble about whether those of us working in technology are engineers in name or not, but in function, I do think the world is changing to one where we need to do more than be caretakers of systems. We need to be actively involved in ensuring we have very reliable, well-designed systems that use the best practices and patterns of highly available and reliable database systems.

There is advice in the interview to learn from and work more closely with developers. I like that, and I do think the idea of learning to script and automate tasks is important, perhaps more important all the time, for data professionals. Some of the advice to look towards the quick-moving, limited lifetime of many software components today doesn't make sense to me, though. A database, or at least some parts of a database, doesn't fit within that model. There are valid reasons why a datastore must provide some stability and persistence for an application, and it won't fit within a model of many short-lived systems, or at least not practically.

However, there is good advice in the piece to learn more about different types of database services and patterns, and to consider other ways of implementing datastores beyond an RDBMS. I wouldn't necessarily abandon an RDBMS for some NoSQL store just because developers think it's easier, but I would consider whether I actually need very tight coupling between different parts of the database and complete consistency. I think there are a fair number of domains where a CQRS pattern or some distributed store would work well. The move to microservices might be an enabler for your business if you consider the advantages for OLTP-type transactions.

The flip side of all the power these stores provide is that many of them also make reporting and aggregating information more difficult. I would venture that in many cases a data warehouse (perhaps even an RDBMS-based or columnar store) is necessary, along with the ETL process needed to keep it up to date. These aren't simple processes, and they take resources to build. For most of us, because we don't work at extreme scales, I'm not sure it's worth leaving a relational platform, but I do think we can learn to evolve and enhance our relational databases faster with DevOps ideas and techniques.

Steve Jones

 
