MAKING SUSTAINABLE & RELIABLE DATAFLOWS

Considerations for making good Power BI dataflows with Power Query Online

Note: While this article is a ‘Dataflows’ checklist, many of the tips can be cross-applied to any product using Power Query, from Excel to Power BI Datasets and others.

Check also the Power Query and Dataflows best practices from the Microsoft documentation.
It’s not the purpose of the article to repeat existing docs; I will only elaborate on undocumented points.

Slow or messy dataflows can become a bottleneck instead of enabling easier re-use of data & transformation logic.

DATAFLOWS IN POWER BI

A powerful and popular feature, Power BI dataflows enable users to centralize transformation logic with Power Query, storing shaped data in a managed, Power BI data lake. Using the intuitive Power Query user-interface, data can be filtered, aggregated, combined and transformed. While easy to create and use, Dataflows and Power Query in general contain a lot of underlying complexity that can make or break a data solution. Neglecting certain elements when making a dataflow can result in slow refreshes and large, unmanageable transformation logic. Given how quickly and easily a Dataflow can be created, shared and used, this can rapidly get out of control, causing bottlenecks and governance threats.

This can be easily avoided if a few key considerations are kept in mind when designing, building and governing Dataflows in Power BI. In this article, I list out key elements to note when making or managing a Dataflow, elaborating on points not already documented, elsewhere.

Power BI Dataflows Checklist - select items. Scroll down further for the full, up-to-date checklist.

ARE DOWNSTREAM USE-CASES & DEPENDENCIES MAPPED?

One of the most obvious use-cases for Dataflows are centralizing transformation logic, to ensure a “central truth” and avoid redundant, parallel logic in multiple solutions which risk becoming divergent. However, it’s important to define up-front what the final shape of the data will be. Missing columns or over-aggregated data will result in downstream consumers not being able to use the dataflow because it doesn’t fit their use-case. Thus, when designing a dataflow solution, it’s important to reflect on who the downstream consumers are and what their needs will be - “who will use this, and what will they use this for?” Try to define a common schema & detail level that fills all reasonable downstream uses of the dataflow, to best enable consumers to use it, even if they still apply slight aggregations or changes ontop. Examples are given below of situations to be avoided:

Dataflow misses columns needed in a downstream dataset or analysis
Example: Order reason field is removed from order data when it is needed for a DAX expression to calculate orders returned due to wrong size.
Dataflow over-aggregates data past a detail level needed in a downstream dataset or analysis
Example: Customer detail is lost when sales data is aggregated to the region level, when there exists a reporting need for customer detail data.
Dataflow applies a transformation that renders downstream requirements impossible
Example 1: Local currency transformed to EUR, when there exists a report requirement for seeing volumes in local currency or having dynamic currency conversion.
Example 2: Exchange rates rounded to two decimal places (Fixed Decimal) when 4-8 decimal places are needed for volatile currencies / rate types.

It’s thus important to be keenly aware of the reporting needs and scope when designing and constructing a dataflow, to ensure it delivers maximum value to the business.

IS THERE A BALANCE BETWEEN SIMPLICITY & OPTIMIZATION?

Generally, a clear sign of poorly-built Power Query is a long list of applied steps running the length of the right-hand margin. This often reflects an ad hoc transformation process by the creator, with redundant steps taking incremental (or even circuitous) routes to the final, desired data shape. This usually results in Power Query that is difficult to understand or maintain, and which struggles to perform with higher data volume or complexity.

To avoid or correct this situation, it’s worthwhile to plan out the Dataflow design ahead of time:

While not always necessary for smaller transformations, it’s helpful to define the desired endpoint for your data. Defining what you need helps you map out how to get there. Doing this mitigates that ad hoc transformation process, resulting in excessive or sub-optimal applied steps. It also helps you identify ahead of time specific solutions or code snippets which can be helpful to produce your desired result.

A survey of the available data is always necessary to define your starting point. Bridging between what you have and what you need is where the dataflow design comes in. While looking at the data, it can be difficult to see what you need in its structure, particularly if pivot and combine operations are necessary. Feel free to perform ad hoc transformations in a temporary query at this stage, but ensure to limit the data to not waste time waiting for data previews.

What can be helpful is using the Table.Profile function in Power Query to give an overview of the source.

Simply drawing out on paper or a whiteboard the steps needed to get the data in the desired shape may be helpful. It defines the core objectives of each group of steps. For example, you might be trying to get text data from a web source that has multiple pages. You may need to (1) Connect to the pages separately, (2) Identify the text elements, (3) Combine each page and (4) clean the result. In practice, there will be more steps than this in Power Query, but mapping out the data objectives helps keep that ‘endpoint’ clear.

Once you know the steps you want to take, you have to translate this to steps in Power Query, itself. Defining these steps makes it easier to search for or try existing solutions, favoring those which perform best on your data and are easiest to understand / implement.

Finally, you can implement these steps in the Dataflow, assessing the performance and complexity of each step. If you feel a particular set of steps take too long, or are too many, you can iteratively research and try alternative approaches.

While the number of applied steps are not directly indicative of poor performance as such, it’s important to think critically about better ways to approach a problem than a 12th “Replace Values” step.

On the other hand, however, overly optimized, complex custom functions or authored Power Query M code can also be problematic. Most users of Power Query likely use only the user interface, oblivious to the underlying code changes. If a solution is handed over that contains a lot of custom code copied into the Advanced Editor, it might be difficult for a future person to maintain unless it is well-documented or they have extensive knowledge of the mash-up language. Too often, I have seen users completely delete an optimized but complex code and re-do it all in the user interface when they needed to make a change. It’s thus important to be aware of the user context when a Dataflows solution is delivered or passed down, and generally to balance simplicity with the “optimal” solution.

Using dataflows to sail from
Bar Chart Bay to Insight Island…

TO CONCLUDE

Dataflows are a powerful tool that can sometimes feel like Pandora’s Box. It’s important to consider why the dataflow is being made and who is using it, and to follow best practices both for dataflows in general and the Power Query therein. By following a simple checklist of key considerations, we can ensure our dataflows become an important instrument to help us navigate our sea of data toward further insights, and away from governance risks & performance issues.

DATA GOBLINS DATAFLOW CHECKLIST

Version 1.0 - Last update: Oct 17, 2022

Use the below interactive checklist to follow-up on your Power BI Dataflow.
Many of these may also apply to any Power Query tool:

Click the question mark (?) for a link to a reference document. This will be learn.microsoft.com documentation; if none are available, this may be a Data Goblins article, or another blog post reference.

Dataflow Design

Check if you need a dataflow - does it fit the use-cases of single truth, re-used transformation logic? ?

Check if you can or should use the 'Common Data Model' ?

Check the 'Considerations & Limitations' in the Microsoft guidance documentation ?

Browse the starting data & define your desired 'end-point'; what transformation steps are needed? ?

Map dependencies; define the needs for each downstream point-of-consumption to avoid over-transformation ?

Plan where the Dataflow should live (which workspace - i.e. workspace specific for dataflows?) ?

Building a Dataflow

Apply a parameterized, early filter to limit data while building ?

Parameterize repeated strings, filters and data source information ?

Push transformations to the upstream data source where feasible ?

Remove unnecessary data; delete unneeded columns & aggregate unneded detail ?

Ensure query folding is maintained as long as possible ?

Avoid merges & appends; save 'expensive' steps that read full data for later ?

Follow Power Query best practices from Microsoft ?

Follow Dataflows best practices from Microsoft ?

Check data types & decimal precision ?

Disable load on intermediate data sources & functions ?

Design & add error handling on 'sensitive' steps where the query could fail ?

Optimizing a Dataflow

Reduce steps & complexity as much as feasible; tweak steps to ensure folding ?

Configure incremental refresh if you are on a premium capacity ?

Try advanced code-authoring approaches, editing the Advanced Editor query script ?

If connecting to Excel files, convert to .csv instead of .xlsx if feasible ?

Avoid query references to optimize if not needed, as it hits the source multiple times ?

Carefully test Table.Buffer and List.Buffer approaches if the query is not folding ?

If connecting to parquet files, understand PQ will not do predicate pushdown, etc. ?

Experiment with different 'hot'/'cold' partition strategies if no incremental refresh

Try different approaches for multiple find & replace, i.e. add conditional column

Configure enhanced compute engine (Premium) ?

Readying a Dataflow for Deployment & Handover

Define clearly who owns & updates the dataflow

Rename steps to concise, clear & logical names ?

Add helpful, clarifying comments to each step ?

Format the code in the advanced editor using VS Code's "Power Query" extension ?

Add a description to the dataflow and dataflow entity (table)

Configure scheduled refresh & endorsement ?

DATAFLOW USER TRAINING CHECKLIST

Training users takes time and effort, and the approach differs for every person being trained. Listed below are some elements of Power BI Dataflows that one should consider when training them when, why & how to use a Dataflow, effectively.

Train Dataflow Users

Advantages of dataflows: centralizing data/logic, scaling performance, etc. ?

Connecting to dataflows from Power BI or Excel ?

Which dataflow to use (i.e. certified, promoted) for what data ?

What are the Power Query best practices ?

What are the Dataflows best practices ?

Limitations & considerations of dataflows they can use or create ?

DOCUMENTATION & GOVERNANCE CHECKLIST

Documentation varies from project-to-project and team-to-team, but below are some things you can consider documenting for your Power BI Dataflow. The main thing you want to document is the lineage; what upstream sources are consumed & combined, and what downstream dependencies are consuming the dataflow.

Documentation for Power BI Dataflows

Basic Info: Name, SPOC, Location, Workspaces...

Purpose: Business questions/problems the model addresses

User segments: Intended audience and tools used

Complexity: atypical properties/features

Data sources: Type, when updated...

Security and Access: Who has build permissions...

Manual interventions & justifications (if any)

Typical refresh time; how long does it take to refresh?

Refresh schedule; what frequency does the Dataflow update? When do upstream sources update?

Schema of each entity / table; what columns are in the dataflow?

Volume of each entity / table; how many rows do they have? ?

Custom functions; what advanced M code has been authored? Can it be re-used?

Shared expressions; which expressions have disabled load and why?

If the Dataflow uses other subscriptions - Premium or AI features, for example

'Common Data Model' mapping ?

Sensitivity label / DLP policies; if it contains PII or restricted information ?

Lineage & Dependencies; what content are consuming the dataflow? ?

Oct 18 Power BI Dataflow Checklist

MAKING SUSTAINABLE & RELIABLE DATAFLOWS

DATAFLOWS IN POWER BI

ARE DOWNSTREAM USE-CASES & DEPENDENCIES MAPPED?

IS THERE A BALANCE BETWEEN SIMPLICITY & OPTIMIZATION?

TO CONCLUDE

DATA GOBLINS DATAFLOW CHECKLIST

Version 1.0 - Last update: Oct 17, 2022

Dataflow Design

Building a Dataflow

Optimizing a Dataflow

Readying a Dataflow for Deployment & Handover

DATAFLOW USER TRAINING CHECKLIST

Train Dataflow Users

DOCUMENTATION & GOVERNANCE CHECKLIST

Documentation for Power BI Dataflows

Latest Articles

Oct 18 Power BI Dataflow Checklist

MAKING SUSTAINABLE & RELIABLE DATAFLOWS

DATAFLOWS IN POWER BI

ARE DOWNSTREAM USE-CASES & DEPENDENCIES MAPPED?

IS THERE A BALANCE BETWEEN SIMPLICITY & OPTIMIZATION?

TO CONCLUDE

DATA GOBLINS DATAFLOW CHECKLIST

Version 1.0 - Last update: Oct 17, 2022

Dataflow Design

Building a Dataflow

Optimizing a Dataflow

Readying a Dataflow for Deployment & Handover

DATAFLOW USER TRAINING CHECKLIST

Train Dataflow Users

DOCUMENTATION & GOVERNANCE CHECKLIST

Documentation for Power BI Dataflows

Oct 18 Power BI Dashboard Checklist

Oct 4 Power BI Data Visualization - Ideas & Wishlist