Variable Values, Bound and Raw
- Published on
Slight's variables generally use safe bound values. But this doesn't always give enough flexibility. Slight allows you to span the tradeoffs of safety, convenience, and power, as your apps require them.
Slight apps are typically parameterized SQL queries. That means we can let users dynamically change the query when they run it by selecting what values should be used for the app's variables.Sidenote: We typically use "variable" in our product. Some people use "parameter" or "placeholder" for this. We felt that both already have more established meanings in the context of databases, and that "variable" is a better fit for how they're used in our apps. Let's check out an example query.
SELECT customer_id,
order_id
order_amount
FROM orders
WHERE order_type >= {{ min_date }}
SELECT customer_id,
order_id
order_amount
FROM orders
WHERE order_type >= {{ min_date }}
All parts of the query are fixed except the variable min_date
: our users should pick a date when they run the resulting app. For more details on this, see our docs on writing queries, and creating variables.
The app runners should only be able to input a date, as anything else would produce an error. In cases like this, we pass the user's desired min_date
value to the database alongside the query: we don't inject the user's date ourselves. This gives us:
- Safety: the database will reject anything that isn't a valid date.
- Ease of use: even if you're not worried about safety, you don't want your users unsure of how to provide the values for the app's variables. Slight makes it clear your users must provide a valid date in this example.
Not All Variables Can Be Bound
Variables that fit the above template — requiring simple values that the database itself can confirm are of the correct type — can cover a lot of cases. We say that these variables only take bound values.
There are many situations where this isn't flexible enough. Some examples:
- Choosing which column to use in a query (or even an expression involving multiple columns).
- Dynamically selecting which table to pull data from.
- Letting users select which function to apply, and/or the inputs to those functions.
- Similarly, letting users select which operators to use. That could be
≥
,≤
, or=
with numbers; it could beLIKE
vsILIKE
with strings.Sidenote: See e.g. Snowflake's documentation onILIKE
. - Choosing the ordering of the data, either the columns and/or
ASC
vsDESC
. - Giving users the option to specify a value using a function (e.g.
CURRENT_DATE()
).
SELECT incident_id,
incident_type,
incident_timestamp
FROM {{ geog_traffic_table }}
SELECT incident_id,
incident_type,
incident_timestamp
FROM {{ geog_traffic_table }}
Slight allows you to use variables that take raw values, but we also give you ways to avoid danger.
Example App: Choosing the Table and Column
NOAA maintains a public dataset they call the Global Surface Summary of the Day. But their publicly available data in BigQuery is split into one table per year.
Ideally if we have an app that uses this data, we'll let our users select which table to use: that is, which year to run. This app does exactly that: feel free to try running it, and read the body to see how the year
variable is used. It lets users choose the year from a set of options (like a dropdown), and this sets the name of the table to run.
In business situations, there are many reasons data can be split into multiple tables. One reason is geographical splits, which are sometimes required by law. Others include table size and legacy requirements, but a very simple one is that the tables genuinely are quite different in general, but happen to both fit your current query.Sidenote: That is, they both have all the columns needed for the query we're writing. For example, if there's a table of data collected on soccer players and a table for basketball players, the data and sensors might be different between the two. However, they could have common columns, such as maximum speed, that allow for certain analyses where either table could be subbed in. Or for census data, questions may change over the years, but many analyses will not use the added or removed questions.
Similarly, the app above lets you choose which column in the source data to compute statistics of. This is common, as you may need to use different columns depending on the use-case, for example revenue before or after some types of expenses are taken into account (or pitch deck vs IRS numbers), or say some combination of the responses in this mask-use dataset.
You could just return all analyses on all columns, but this can add up quickly, and can result in a lot of wasted data getting transferred out from the database.
Finally, it's always nice to include an image, this is for 2021:
Example App: Choosing Which Function to Apply
This simple app counting library dependencies lets us choose which function to apply to the vector of dependency counts. This example isn't amazing, but the concept of choosing average case vs best or worst case is reasonable. The app makes a slightly odd choice to also name the resulting column: this is generally not recommended, but when needed, it can be quite useful.
The above app also lets us select the order we want to see the results. This can be useful if we're comparing different runs with different functions used.
We won't add the picture here, but follow this link for a nice graph.
When and How To Use Raw Values
If we're not careful, allowing raw input can allow malicious actors access to private data.Sidenote: Through what's called raw injection, or SQL injection. See Microsoft's docs for more. If the connection used also has write permissions, even worse things could happen. Both the example apps linked above use our category variable, which avoids raw injection completely: we only allow runners to use values that the app authors added as selectable options. That's great, but sometimes we do need the full flexibility of raw input. We might wonder when we should feel safe doing so.
Speaking generally, we have three levels of safety required. Let's go through each.
App Runners Have Full Permissions on the Database
If all our users are also able to run any query they want in the database, we can feel safe allowing raw values.Sidenote: Some business intelligence tools only have raw variables. We're not sure we agree with that choice, but since there haven't been too many BI-based raw injection disasters, it's not the end of the world. Similarly if the query writes to the database, we can feel OK with raw values if users have full write permission. We see this when:
- Everyone working with data at a company has run permissions, or similarly the app is only for a small team within the company (or personal use).
- You bring your own connectionSidenote: Coming soon in Slight. to run each app: here by default you can't "raw inject yourself".
- The app is designed to be built upon by other apps, and those apps don't need the full flexibility.
You may still want to use say a category variable where possible for ease of use.
App Runners Are All Internal (Trusted)
Our trusted usersSidenote: Here, we mean as trusted as is typical with users of any Business Intelligence tool, for example. may not have full run permission, but we don't fear malicious intent: they won't work hard to abuse the ability to use raw input.
However, we don't want to give them footguns: the app should still be easy to use! Using category variables, like the example apps linked above, gives a great experience when it can cover all needed bases. Or with the table example, we could have alternatively selected a raw number input, with a good default value and a clear description, so that users could quickly understand the intention is to input a year.
You can also set variables to allow either bound or raw input, but only grant the permission to run with raw values to certain users. This gives you maximum safety and flexibility within a single app.
App Runners Aren't Trusted
Typically this would be an externally facing app.
In Business Intelligence tools, externally facing reports aren't common. But with Slight, it's easy to generate a secure token for an external user and let them run the app as they need it. In this scenario, we must ensure that the external users can only run what we intended.
Typical bound values are fully safe, but indeed so are raw options: it's no more dangerous (but much more convenient) than creating a separate app per option.
Summary: Which Bind To Use
In order:
- If bound values solve the needs of the app, use those. They're safe, and Slight can provide an interface that makes it very clear what is expected from the user.
- If raw values are needed and there's a limited set of them, use a category variable, and add the needed values as raw options. Note that you can also use bound options at the same time if desired.
- If raw values are needed but they don't form a limited set, think about who your user is. If they're untrusted, you will probably have to rethink the app setup, or potentially make multiple apps. If they're trusted, make sure to provide good defaults/examples and a description for the variable.Sidenote: It's also important to avoid using connections to run the app with permissions that aren't needed. For example when the query only returns data (it doesn't add or update any rows), try to use read-only connections. Stay tuned though, we'll be providing even more safe-guards on raw input in the future.
If balancing safety, convenience, and power is something your stored queries could benefit from, we'd be delighted to hear from you. Email us at team@slight.co to get in touch.
- Related Articles
What differentiates us, what tradeoffs does our product make? We think our tradeoffs strike the right balance for many data workflows, but this depends on your needs and current painpoints.
Our free public version of Slight at www.slight.run is now open — give it a go today! It's not identical to our commercial product, but close enough to get a feel for it.
Slight's API is up and running. Now anyone who knows SQL can create APIs for their parametrised queries with no extra effort.