The Diverse Data Platform–Dativerse.io

The other day I got a note from Ben Weissman that he was running a new event, http://dativerse.io/. There's not much information yet, but I liked this list in the call for speakers.

  1. Your session must be related to the Microsoft Data Platform
  2. You contribute to the diversity of the Microsoft Data Platform

I often talk about the first one, but the second? That’s interesting. I assume this means some sort of new talk that expands on the way we use the platform. New technology, new ways of putting things together, new something.

Or maybe it means a diverse speaker, someone who isn’t one of the regulars at most events. It’s an interesting idea.

I don’t know if I’ll submit. While I can speak on the data platform, I don’t know that I add much diversity in some ways. I do sometimes pick niche topics, but if I do something I’ve already done, is that adding to diversity?

I’m not sure, but I’m thinking about it. The CFS closes on May 16, 2021, so I, and you, have time.

Even if you don’t want to submit, perhaps you want to encourage someone else to submit, or even co-submit with you and help them deliver a session. We need to encourage and broaden our horizons, which is something that we all can work on.


Finding Legal Data

Many of us have likely been asked about data science or machine learning in the last few years by someone in our organization. This has become a hot field, with many companies looking to try and find ways to use the technology to improve their work in some way. While I don’t know how successful these projects have been in organizations, I know that some areas seem to be finding practical applications. Image recognition, translation, and even some application scoring systems have benefited from machine learning algorithms.

To build a successful model that works well, we need training data. Often we need lots of training data, along with metadata that describes how the training data applies to a particular question. For example, if we have lots of pictures of dogs, we might need to tag the different breeds in order for a system to differentiate among them. If we want loan applications scored, we should have a corpus of documents that are already scored. This metadata allows the system to learn.

There is a company, Clearview AI, that markets itself as a facial recognition system. To build their model, they scraped images from YouTube and other Internet sources, without consent from Google or the subjects of the videos. This is interesting, as the data itself is publicly available, but gathering it into a database and using it for other purposes might run afoul of data privacy laws, like the GDPR.

I don’t quite know how to feel about this use of public data. While I don’t mind people viewing my pictures, I’m not sure that I like the idea of them being copied into a database for some other purpose. That might seem silly, or even strange, but I do think there is power in data and more power in more data. Allowing others to put my images in a database and use them, perhaps to train a model to recognize me, feels like overreach.

If you need data for your company, or your idea, what can you do? Many people just scrape Google, Facebook, etc., and get data. That might cause you some legal issues in some places, and you ought to be aware of the implications if you choose to do this, or you are asked to do this. I don’t think this is how we want data to be gathered. I know there are some guidelines for responsible AI at Microsoft, but not necessarily rules in place for many companies. Hopefully that will change over time.

Steve Jones

Listen to the podcast at Libsyn, Stitcher, Spotify, or iTunes.


Daily Coping 19 Apr 2021

I started to add a daily coping tip to the SQLServerCentral newsletter and to the Community Circle, which is helping me deal with the issues in the world. I’m adding my responses for each day here. All my coping tips are under this tag. 

Today’s tip is to make sleep a priority and go to bed on time, turning off devices well before you turn in.

I don’t stay up late. I tend to try and relax at night with a book or a little TV with my wife. As I get tired, I’ll usually take off my glasses and then try to go to sleep rather than try to stay awake.

I had my first COVID shot recently, and I had a rough day afterwards. Between that, the shingles vaccine and some busy days, I’m trying to stay a bit more relaxed and take care of myself. While I am still trying to enjoy a busy life, I am also trying to not push too hard.

This week, I’ve been going to bed a little earlier, and trying to sleep well. I may watch a little TV with my wife, but I’m trying to keep it light and mindless, nothing that’s intense or excites my mind. Usually some silly comedy I can laugh at, and if I miss a bit of dialog, I don’t care because the story doesn’t really matter.


2020 Advent of Code – Day 4

This series looks at the Advent of Code challenges.

As one of my goals, I’m working through challenges. This post looks at day 4.

Part 1

This is an interesting data set. It’s ugly, inconsistent, and spans across lines. In fact, to determine what any particular “row” is, you need to read line by line and process the data until you find a blank line. An SSIS or ADF exercise indeed.

However, I decided to try this with Python first. I thought the loading and splitting of items on consecutive lines would be easier. I opened the file and then scanned it for values like this:

for row in passports:
    if row not in ['\n','\r\n']:
        currpassport += row.replace('\n',' ')

This let me look for a blank row. If I didn’t find one, I added the row to my current passport value. If I did find a blank row, I split the current passport into a dictionary:

currdict = dict(x.split(":") for x in currpassport.split(" ") if x)

From here, I could check that the dictionary had either 8 entries, or 7 entries if cid was the only one missing. I counted all these up to get to the answer.
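Putting those pieces together, the whole of Part 1 might look something like the sketch below. This is not my exact script. The function names, the REQUIRED_FIELDS set, and the sample rows are made up for illustration, and the check uses a set comparison instead of counting entries, which should give the same result.

```python
# Hypothetical names throughout; the sample rows are invented, not puzzle input.
REQUIRED_FIELDS = {"byr", "iyr", "eyr", "hgt", "hcl", "ecl", "pid"}  # cid is optional


def parse_passports(lines):
    """Group lines separated by blank lines and return one dict per passport."""
    passports = []
    current = ""
    for row in lines:
        if row.strip():
            # Non-blank row: add its fields to the passport being built.
            current += row.strip() + " "
        elif current:
            # Blank row: the current passport is complete.
            passports.append(dict(x.split(":") for x in current.split()))
            current = ""
    if current:
        # The last passport may not be followed by a blank row.
        passports.append(dict(x.split(":") for x in current.split()))
    return passports


def has_required_fields(passport):
    """True when every required field is present (cid may be missing)."""
    return REQUIRED_FIELDS.issubset(passport)


sample = [
    "ecl:gry pid:123456789 eyr:2025 hcl:#a1b2c3\n",
    "byr:1980 iyr:2015 cid:100 hgt:180cm\n",
    "\n",
    "iyr:2013 ecl:amb cid:350 eyr:2023 pid:987654321\n",
    "hcl:#c0ffee byr:1990\n",
]
passports = parse_passports(sample)
complete = sum(has_required_fields(p) for p in passports)
```

In the sample, the second passport has no hgt entry, so only the first one is counted.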

Part 2

This was annoying, since each entry has to be validated against a range of years or a set of allowed values. I knew I could build a number of validation functions, which is the better way, but I ended up just using a series of if statements to check values and set a validation variable. A sample of them is here:

if currdict["hgt"][-2:] == "cm" and (int(currdict["hgt"][:-2]) < 150 or int(currdict["hgt"][:-2]) > 193):
    valid = 0
if currdict["hgt"][-2:] == "in" and (int(currdict["hgt"][:-2]) < 59 or int(currdict["hgt"][:-2]) > 76):
    valid = 0
if currdict["ecl"] not in ["amb", "blu", "brn", "gry", "grn", "hzl", "oth"]:
    valid = 0
if currdict["hcl"][0] != "#" or len(currdict["hcl"]) != 7:
    valid = 0

This let me tally up the valid passports.
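For comparison, here is a rough sketch of the validation function approach I mentioned, covering only the three fields from the sample above. The function names and the VALIDATORS dictionary are hypothetical, not something from my script. Two behaviors differ from the if statements: a height with no unit is rejected, and the hair color check uses a regular expression for "#" plus six hex digits, which is my reading of the puzzle rule rather than just a length check.

```python
import re


def valid_height(value):
    """Height is a number followed by cm (150-193) or in (59-76)."""
    number, unit = value[:-2], value[-2:]
    if not number.isdigit():
        return False
    if unit == "cm":
        return 150 <= int(number) <= 193
    if unit == "in":
        return 59 <= int(number) <= 76
    # Assumption: anything without a cm or in suffix is invalid.
    return False


def valid_eye_color(value):
    """Eye color must be one of the allowed codes."""
    return value in {"amb", "blu", "brn", "gry", "grn", "hzl", "oth"}


def valid_hair_color(value):
    """Hair color is '#' followed by exactly six characters 0-9 or a-f."""
    return re.fullmatch(r"#[0-9a-f]{6}", value) is not None


# Hypothetical lookup table: field name -> function that checks its value.
VALIDATORS = {
    "hgt": valid_height,
    "ecl": valid_eye_color,
    "hcl": valid_hair_color,
}


def is_valid(passport):
    """True when every field with a validator is present and passes its check."""
    return all(
        field in passport and check(passport[field])
        for field, check in VALIDATORS.items()
    )
```

The remaining fields would each get their own function and an entry in the dictionary, so the tally becomes a single sum over is_valid for each passport.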

I still need to work on these in PoSh and SQL, but life has gotten in the way of things outside of work.
