Data Democracy: Tricks for KnowSQL

"All my life I want money and power" - Kendrick Lamar

NewSQL, NoSQL, MySQL, your SQL -- terms of art synonymous with data, but they miss half of the picture. We know how to fit terabytes into our databases, but how do we fit data into our culture? If your answer is just "KnowSQL" -- that is, train every possible stakeholder in the company on SQL -- you have some explaining to do.

Within any organization, time and knowledge are finite resources, and you can often trade one for the other. Actually, wait a sec: what would trading knowledge for time look like?

Trading time for knowledge -- this is the KnowSQL camp's main focus. Spend large amounts of time teaching SQL to everyone who needs knowledge of the data, so they can get it themselves. Presumably the folks meant to learn SQL were actually hired for being good at something... something they should probably be spending that time on. KnowSQL is as absurd as claiming that everyone should learn a specialized skill like management.

So how does one arrive at such an absurd idea?

Knowledge of data and results is really what we want; data-driven decisions trump wild guesses. The decision makers, however, don't often interact with the data directly. The process of getting access to it ends up painfully siloed in the data departments as work backs up around them. Even trivial requests end up taking days, and one begins to understand the hardship of trying to have a dialogue by putting your mail on a boat and sending it across the sea, as in days of yore. In the scurvy-induced fever dream that this situation engenders, of course you'll entertain a few absurd ideas, as long as they promise to solve the problem.

Democratizing data sounds promising. We can get rid of silos and get data-driven if we all know SQL, right? It's not that simple: even if you do teach everyone SQL, you can bet that most will not fully understand the data schema, and different departments will want different views on the data.

In a very real sense, your data schema is an API. Anyone who consumes the data depends on that structure not changing for their code to keep working. Enough of these dependencies puts massive pressure on the schema to never change; every API becomes versioned. The more people touch the underlying data, the more dependencies they will code up against that format. This often stagnates entire organizations, as the cost of recording new attributes, or of reshaping the data the way you need it to look, is deemed imprudent due to the cascading effects it will have on others.
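To make the schema-as-API point concrete, here's a minimal sketch in Python with SQLite (table and column names are hypothetical; the column rename requires SQLite 3.25 or newer). A stakeholder's report hard-codes the current column name, so a routine rename becomes a breaking change for them:

```python
# Sketch of the schema-as-API problem. Hypothetical names throughout.
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE users (id INTEGER, signup_dt TEXT)")
db.execute("INSERT INTO users VALUES (1, '2020-01-01')")

# A stakeholder's report hard-codes the current column name...
report_query = "SELECT id, signup_dt FROM users"
print(db.execute(report_query).fetchall())  # works today

# ...so a routine rename silently breaks their "dependency".
db.execute("ALTER TABLE users RENAME COLUMN signup_dt TO signed_up_at")
try:
    db.execute(report_query)
except sqlite3.OperationalError as e:
    print("report broke:", e)
```

Multiply this by every spreadsheet, dashboard, and ad-hoc script in the company, and the pressure to freeze the schema becomes obvious.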

We also must consider documentation. Even with fixed schemata, the knowledge of what the fields mean is often lost in translation or difficult to glean.

What we really need is a way to relieve the pressure on the data team because, let's face it: they really are the most qualified to answer very complex questions. Many of the smaller questions that different stakeholders need to ask of the data team are not that complicated and don't use the full statistical and database prowess that these data scientists are trained to wield. For the same reason we don't have managers do janitorial work most of the day, we shouldn't treat data scientists as data janitors or SQL monkeys for simple questions.

Here's how we hack this culture into a formidable and healthy one:

  1. Incorporate a graphical tool that allows the stakeholders to explore the data in their own virtual sandbox and ask their own questions without data scientist handholding or destructive edits to other people's work.

  2. Have the data documented in place, at the schema level, as people interact with it. Every documented field likely saves at least one conversation or email.

  3. Build upon the shoulders of giants. Don't allow that one-off filter or cohort to go to waste. Find a way to make reusable abstractions and share with your team.

  4. Don't stagnate. You can leverage techniques like immutability, Virtual ETL, and schema inference to allow your data formats to grow and evolve without impacting legacy code.

  5. Let people focus on the knowledge they should have, and find ways to save each other's time.
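Point 4 above can be sketched with a view acting as a compatibility layer (again with hypothetical names, in Python with SQLite 3.25+): the underlying table evolves, but legacy code keeps querying a stable view, so nothing downstream breaks.

```python
# Sketch of "Virtual ETL": a view shields legacy code from schema change.
# Hypothetical names; RENAME COLUMN needs SQLite >= 3.25.
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE users (id INTEGER, signup_dt TEXT)")
db.execute("INSERT INTO users VALUES (1, '2020-01-01')")

# Publish a stable "API" as a view; consumers query the view only.
db.execute("CREATE VIEW users_v1 AS SELECT id, signup_dt FROM users")
legacy_query = "SELECT id, signup_dt FROM users_v1"
print(db.execute(legacy_query).fetchall())

# The schema evolves: a rename plus a brand-new attribute.
db.execute("DROP VIEW users_v1")
db.execute("ALTER TABLE users RENAME COLUMN signup_dt TO signed_up_at")
db.execute("ALTER TABLE users ADD COLUMN plan TEXT DEFAULT 'free'")

# Re-point the view at the new layout; legacy queries are untouched.
db.execute(
    "CREATE VIEW users_v1 AS SELECT id, signed_up_at AS signup_dt FROM users"
)
print(db.execute(legacy_query).fetchall())  # legacy code still works
```

The view is the versioned API; the table underneath is free to grow and evolve.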

TL;DR -- Let's respect each other's time and find ways to enable data-driven decisions without absurd appeals to KnowSQL or massive delays from a data silo. We can have a better data culture than that.

Time is money; Knowledge is power.

Ryan Braley

Founder and Chief Architect. Designed the world's first distributed genetic algorithm in the cloud using Hadoop; built a neural interface; behavioral analytics pioneer.
