Exploring Time Travel in Azure Databricks
Welcome to a new article on the best Data & Cloud blog. In today’s post we tell you how to use the version history and the “Time Travel” functionality in a basic way, through a practical example. Our scenario focuses on a user table in Delta Lake, where we will perform merge operations such as upserts and explore how to roll back the table to a previous version.
Delta Lake in Databricks stands out for its flexibility and data control, offering us features such as version control and transaction history, essential in business environments where data integrity and traceability are a priority. The “Time Travel” capability in Delta tables allows you to access and restore previous versions of data, and is crucial in data recovery, auditing and historical analysis.
Next, the steps and the necessary code, let’s go there.
Step 1: Set up Azure Databricks
The first step, assuming you already have an Azure Databricks account and a cluster configured, is to open a notebook in Databricks. This will be the starting point for all our work. If you do not have any of these requirements, the following links can help you:
Step 2: Create an example table
Once our notebook is created, we will use Spark SQL to create an example delta table in Databricks. Open a cell in your notebook and run the following code:
CREATE TABLE IF NOT EXISTS usuarios (
This instruction creates a table called users with three columns: id, email and name. We are using the Delta format, which is the standard in Databricks and is what will facilitate advanced operations such as upserts.
Step 3: Insert sample data
To have something to work with, let’s insert some data into our table:
INSERT INTO usuarios VALUES (1, 'email@example.com', 'Juan');
INSERT INTO usuarios VALUES (2, 'firstname.lastname@example.org', 'Ana');
We show the insert:
SELECT * FROM usuarios
Step 4: We perform an upsert
Delta tables in Databricks support “merge” operations that can be used to perform upserts. If you are not familiar with this type of operation, an upsert is basically a record existence check operation, if a record already exists, then it is updated (update) and if not, it is inserted (insert). In this case, we are going to update the username with id=2 and add a new user with id=3.
First, we create a temporary table with the new data:
CREATE OR REPLACE TEMPORARY VIEW nuevos_datos AS
SELECT 2 as id, 'email@example.com' as email, 'Ana María' as nombre
SELECT 3 as id, 'firstname.lastname@example.org' as email, 'Luis' as nombre;
Let’s now perform the upsert (merge) operation on the users table using the new_data view in which we have Ana’s update and Luis’ insertion:
MERGE INTO usuarios
ON usuarios.id = nuevos_datos.id
WHEN MATCHED THEN
UPDATE SET usuarios.nombre = nuevos_datos.nombre
WHEN NOT MATCHED THEN
INSERT (id, email, nombre) VALUES (nuevos_datos.id, nuevos_datos.email, nuevos_datos.nombre);
We will ensure the insertion and update done in the users table with:
SELECT * FROM usuarios;
This shows us the users table with the user id = 2 updated and the new user id = 3 added.
Step 6. Version control and transaction history
Now, let’s see how to manage version control and transaction history in delta table:
Step 6.1. Reviewing transaction history
Para ver el historial de transacciones de nuestra tabla
usuarios, podemos usar el comando
DESCRIBE HISTORY usuarios;
This command will show a history of all operations performed on the users table, including changes made during upsert operations, as we can see in the following image:
To learn more about the history scheme and the operation metric keys, you can use this link.
6.2. Access to previous versions of data (Time Travel)
Delta Lake’s “Time Travel” supports querying previous table versions based on timestamp or table version (as recorded in the transaction log). This functionality is extremely useful for use cases such as recovery of data, auditing and analyzing historical data, correcting errors in data, or providing snapshot isolation to a set of queries for rapidly changing tables.
Conveniently, it allows you to choose to view a specific version of the table or snapshot at a given point in time.
To query a specific version of the table, we can use the following code:
SELECT * FROM usuarios VERSION AS OF 2;
In this example, 2 is the version number of the users table you want to query.
We can also query the table at a specific point in time:
SELECT * FROM usuarios TIMESTAMP AS OF '2024-01-27 11:37:00';
This example queries the state of the users table as it was on January 27, 2024 at 11:37:00.
6.3.Recovering previous versions of data
Delta’s “Time Travel” feature is not only useful for retrospective queries, but also for data restoration. In our example we will restore the users table to its initial state, querying the desired version and using the recovered data to overwrite or update the current table.
The process involves two main steps: first, identify the specific version of the table you want to return to, and second, perform the restore itself.
6.4. Identify the version to roll back to
Before reverting the table, you need to know which version you want to revert to. You can do this by querying the table’s history to find the desired version:
DESCRIBE HISTORY usuarios;
This command, as we saw in point 6.1, shows a history of all versions of the table, including the operations performed in each version. Each history entry in history has an associated version number, we will try to take it to version 2, a point immediately prior to performing the MERGE.
6.5. Performing the rollback
Once you have identified the desired version, we will use “Time Travel” to revert the table to that specific version. To return to version 2, we will create a temporary view or a new table with the desired state of the users table:
CREATE OR REPLACE TEMPORARY VIEW usuarios_revertidos AS
SELECT * FROM usuarios VERSION AS OF 2;
This command creates a temporary view reverted_users that reflects the state of the users table at the specified version (in this case, version 1). This step does not affect the original users table or its version history.
Then, we can use that view to replace the current data in the users table:
DELETE FROM usuarios;
INSERT INTO usuarios
SELECT * FROM usuarios_revertidos;
The restored version looks like the initial one, when inserting Ana and Juan’s records:
We have deleted all the current data in the users table and then we have inserted the data from the reverted_users view, achieving the “restoration” of the state of the users table to what it had in version 2. All these steps are recorded as new transactions in the history of the table, as we see below:
This ability to travel through time and access detailed transaction history positions Delta on Databricks as a powerful tool for data management in the business world, where accuracy, completeness, and the ability to track and reverse changes in data Data is essential for operational success and informed decision making.
In our exploration of Databricks, we have seen that this tool provides crucial functionalities for traceability and auditing. The version control and ‘Time Travel’ capabilities stand out, not only for facilitating data recovery, but for offering complete protection and tracking, essential for data security.
However, taking full advantage of these tools in Databricks, and maintaining a balance between performance and change management, implies important challenges such as query optimization, system resource adequacy are key to preserving efficiency and, at the same time, effective management. Rigorous version control ensures data integrity and consistency. Therefore, constant monitoring and the right touch of strategic adjustments will allow us to use these powerful functionalities without sacrificing efficiency and thus achieving an optimal balance between advanced security and operational performance.
n our next posts we will continue exploring the capabilities of Delta Lake in Databricks, if you found it interesting, we recommend the following articles:
And if you want to take your data analysis to the next level, contact us.