Healthy Data – O’Reilly

Platform analysis report

This summer, we started asking about “technical health.” We don’t see a lot of people asking what it means to use technology in healthy ways, at least not in so many words. That’s understandable because “technical health” is so broad that it’s difficult to think about.  It’s easy to ask a question like “Are you using agile methodologies?” and assume that means “technical health.”  Agile is good, right?  But agile is not the whole picture. Neither is being “data driven.” Or Lean. Or using the latest, coolest programming languages and frameworks. Nor are any of these trends, present or past, irrelevant.

To investigate what’s meant by “technical health,” we have begun a series of short surveys to help us understand what technical health means, and to help our readers think about the technical health of their organizations. The first survey looked at the use of data. It ran from August 30, 2022 to September 30, 2022. We received 693 responses, of which 337 were complete (i.e., the respondent answered all the questions). We didn’t include the incomplete responses in our results, a practice that’s consistent with our other, lengthier surveys.



No single question and answer stood out; we can’t say “everybody does X” or “nobody does Y.” Whether or not that’s healthy in and of itself, it suggests that there isn’t yet any consensus about the role data plays. For example, the first question was “What percentage of enterprise-wide decisions are driven primarily by data?” 19% of the respondents answered “25% or less”; 31% said “76% or more.” We were surprised to see that the percentage of respondents who said that most decisions aren’t data driven was so similar to the percentage who thought they are. The difference between 19% and 31% looks much larger on paper than it is in practice. Yes, it’s a ratio of roughly 1.6:1, but it still shows that a lot of respondents work for companies that aren’t using data in their decision making. Even more significant, fully half of the respondents put their companies in the “sort of data driven” middle ground (26-50% and 51-75% received 25% and 26% of responses, respectively). Does this mean that most companies are somewhere along the path towards being data-driven, with the “25% or less” cohort representing companies that are “catching up”? It’s hard to say.

We saw similar answers when we asked what percentage of business processes are informed by real-time data: 33% of respondents said 25% or less, while 21% said 76% or more. (26-50% and 51-75% received 22% and 24% of responses, respectively.) Incorporating real-time data into business processes is a heavier lift than running a few reports before a management meeting, so it isn’t surprising that fewer people are making widespread use of real-time data. These responses also suggest that the industry is in the process of transformation, of deciding how to use real-time data. There are many possibilities: managing inventory, supply chains, and manufacturing processes; automating customer service; and reducing time spent on routine paperwork, to name a few. But we don’t yet see a clear direction.
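The two distributions are easier to compare side by side. The following is a minimal Python sketch that uses only the percentages quoted in the text; the names `decisions`, `real_time`, and `summarize` are illustrative labels, not part of the survey. It adds up the two middle buckets and computes the ratio of the top bucket to the bottom one for each question.

```python
# Response shares in percent, as quoted in the text.
# Keys are the survey's answer ranges; variable names are illustrative.
decisions = {"25% or less": 19, "26-50%": 25, "51-75%": 26, "76% or more": 31}
real_time = {"25% or less": 33, "26-50%": 22, "51-75%": 24, "76% or more": 21}


def summarize(shares):
    """Collapse a four-bucket distribution into low / middle / high shares."""
    low = shares["25% or less"]
    high = shares["76% or more"]
    middle = shares["26-50%"] + shares["51-75%"]
    return {
        "low": low,
        "middle": middle,
        "high": high,
        "high_to_low": round(high / low, 2),  # ratio of top bucket to bottom bucket
        "total": sum(shares.values()),        # may differ from 100 because of rounding
    }


for name, shares in (("data-driven decisions", decisions),
                     ("real-time processes", real_time)):
    print(name, summarize(shares))
```

For decisions, the middle ground comes to 51% and the top bucket is about 1.63 times the bottom one; for real-time data, the middle ground is 46% and the ratio flips to about 0.64. The decision-making shares add up to 101, presumably because the reported figures are rounded.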

The bane of data science has been the HIPPO: the “highest paid person’s opinion.” When the HIPPO is in the room, data is used primarily to justify decisions that have already been made. The questions we asked don’t tell us much about the presence of the HIPPO, but we have to wonder: Is that why 19% of the respondents say that data doesn’t have a big influence in corporate decision-making? Are the 31% who said that over 75% of enterprise-wide decisions are based on data being ironic or naive? We don’t know, and need to keep that ambiguity in mind. Data can’t be the final word in any decision; we shouldn’t underestimate the importance of instinct and a gut understanding of business dynamics. Data is always historical and, as such, is often better at maintaining a status quo than at helping to build a future, though when used well, data can shine light on the status quo and help you question it. Data that’s used solely to justify the HIPPO isn’t healthy. Because our survey doesn’t measure the HIPPO’s influence, that’s something you’ll need to ponder when considering your company’s technical health.

We’ve been tracking the democratization of data: the ability of staff members who aren’t data scientists, analysts, or something else with “data” in their title to access and use data in their jobs. Staff members need to be able to access and use data on their own, without going through intermediaries like database administrators (DBAs) and other custodians to generate reports and give them the data they need to work effectively. Self-service data is at the heart of the democratization process, and being data-driven isn’t meaningful if only a select priesthood has access to the data. Companies are slowly waking up to this reality. 26% of the respondents to our survey said that less than 20% of their company’s information workers had access to self-service query and analytics. That’s arguably a high percentage (and it was the most popular single answer), but we choose to see the glass as half (or three quarters) full: 74% said that more than 20% had access. (23% of the respondents said that 41% to 60% of their company’s information workers had self-service; 15% chose 61% to 80%; and 16% chose 81% to 100%.) No answer jumps out, but remember that, not so long ago, data was the property of actuaries, analysts, and DBAs. The walls between staff members and the data they need to do their jobs started to break down with the “data science” movement, but data scientists were still specialists and professionals. We’re still making the transition, but our survey shows that data is becoming more accessible, to more people, and we believe that’s healthy.
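The self-service figures can be checked with a little arithmetic. This is a minimal Python sketch, assuming the quoted percentages are rounded shares of the same set of complete responses; the dictionary keys paraphrase the answer ranges, and the `unlisted` remainder is derived here rather than stated in the survey results.

```python
# Shares in percent, as quoted in the text. Keys paraphrase the answer ranges.
reported = {
    "less than 20%": 26,
    "41% to 60%": 23,
    "61% to 80%": 15,
    "81% to 100%": 16,
}

# Everyone who did not choose the lowest range: the "glass three quarters full".
more_than_20 = 100 - reported["less than 20%"]

# Sum of the higher ranges that the text itemizes.
listed_above_20 = sum(
    share for label, share in reported.items() if label != "less than 20%"
)

# Whatever is left over is not itemized in the text (an inference, not a reported figure).
unlisted = more_than_20 - listed_above_20

print(more_than_20, listed_above_20, unlisted)
```

The itemized higher ranges account for 54 of the 74 points. The remaining 20 points presumably belong to the answer range between 20% and 40%, but that is an inference from the totals, not a number the survey reports directly.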

Roughly one third (35%) of the respondents said that their organization used a data catalog. That seems low, but it isn’t surprising. While we like to tell each other how quickly technology changes, the fact is that real adoption is almost always slow. Data catalogs are a relatively new technology; their age is measured in years, not decades. They’re gradually being accepted.

We got a similar result when we asked about data governance tools. 58% of the respondents said they weren’t using anything (“None of the above,” even though “the above” included an option for a write-in). SAP, IBM, SAS, and Informatica were the leading choices (21%, 14%, 12%, and 11% respectively; respondents could select multiple answers). Again, we expect adoption of data governance tools to be slow. Data has been the “wild west” of the technology world for years, with few restrictions on what any organization could do with the data it collected. That party is coming to an end, but nobody’s pretending that the hangover is pleasant. Like data catalogs (to which they’re closely related), governance tools are relatively new and being adopted gradually.

Looking at the bigger picture, we see that companies are grappling with the demands of self-service data. They are also facing increasing regulation governing the use of data. Both of these trends require tooling to support them. Catalogs help users find and maintain metadata that shows what data exists and how it should be used; governance tools track data provenance and ensure that data is used in accordance with company policies and regulations. Fifteen years ago, we frequently heard “save everything, and wring every bit of value you can out of your data.” In the 2020s, it’s hard to see that as a good, healthy attitude. An important part of technical health is a commitment to use data ethically and legally. We believe we see movement in that direction.

Over the coming months, we’ll investigate technical health in other areas (next up is Security). For data health, we can close with some observations:

  • Data can’t be the only factor in decision making; human judgment plays an important role. But using data simply to justify a human decision that’s already been made is also a mistake. Technical health means knowing when and how to use data effectively; it’s a continuum, not a choice. We believe that companies are on the path to understanding that.
  • Empowering staff to make their own data queries and perform their own analyses can help them become more productive and engaged. But this doesn’t happen on its own. People need to know what data is available to them, and what that data means. That’s the purpose of a data catalog. And the use of data has to comply with regulations and company policies; that’s the purpose of governance. Data catalogs and governance tools are making inroads, but they’ve only started. Technical health means empowering users with the tools they need to make effective, ethical, and legal use of data.

Healthy data improves processes, questions preconceived opinions, and shines a light on practices that are unfair or discriminatory. We don’t expect anyone to look at their company and say “our data practices deserve a gold star”; that misses the point. Maintaining a healthy relationship to data is an ongoing practice, and that practice is still developing. We are learning to make better decisions with data; we are learning to implement governance to use data ethically (to say nothing of legally). Data health means that you and your company are on the path, not that you’ve arrived. We’re all making the same journey.



