Skip to content
Category: Data Analytics
2024-01-31

Exploring Time Travel in Azure Databricks

Welcome to a new article on the best Data & Cloud blog. In today’s post we tell you how to use the version history and the “Time Travel” functionality in a basic way, through a practical example. Our scenario focuses on a user table in Delta Lake, where we will perform merge operations such as upserts and explore how to roll back the table to a previous version.

Delta Lake in Databricks stands out for its flexibility and data control, offering us features such as version control and transaction history, essential in business environments where data integrity and traceability are a priority. The “Time Travel” capability in Delta tables allows you to access and restore previous versions of data, and is crucial in data recovery, auditing and historical analysis.

Next, the steps and the necessary code, let’s go there.

Step 1: Set up Azure Databricks

The first step, assuming you already have an Azure Databricks account and a cluster configured, is to open a notebook in Databricks. This will be the starting point for all our work. If you do not have any of these requirements, the following links can help you:

Tutorial: Query data with notebooks

Accelerate data-driven innovation with your Azure free account and Azure Databricks Units

Step 2: Create an example table

Once our notebook is created, we will use Spark SQL to create an example delta table in Databricks. Open a cell in your notebook and run the following code:

%sql
CREATE TABLE IF NOT EXISTS usuarios (
    id INT,
    email STRING,
    nombre STRING
)
USING DELTA;

This instruction creates a table called users with three columns: id, email and name. We are using the Delta format, which is the standard in Databricks and is what will facilitate advanced operations such as upserts.

Step 3: Insert sample data

To have something to work with, let’s insert some data into our table:

%sql
INSERT INTO usuarios VALUES (1, 'usuario1@example.com', 'Juan');
INSERT INTO usuarios VALUES (2, 'usuario2@example.com', 'Ana');

We show the insert:

%sql
SELECT * FROM usuarios

Step 4: We perform an upsert

Delta tables in Databricks support “merge” operations that can be used to perform upserts. If you are not familiar with this type of operation, an upsert is basically a record existence check operation, if a record already exists, then it is updated (update) and if not, it is inserted (insert). In this case, we are going to update the username with id=2 and add a new user with id=3.

First, we create a temporary table with the new data:

%sql
CREATE OR REPLACE TEMPORARY VIEW nuevos_datos AS
SELECT 2 as id, 'usuario2@example.com' as email, 'Ana María' as nombre
UNION ALL
SELECT 3 as id, 'usuario3@example.com' as email, 'Luis' as nombre;

Let’s now perform the upsert (merge) operation on the users table using the new_data view in which we have Ana’s update and Luis’ insertion:

%sql
MERGE INTO usuarios
USING nuevos_datos
ON usuarios.id = nuevos_datos.id
WHEN MATCHED THEN 
    UPDATE SET usuarios.nombre = nuevos_datos.nombre
WHEN NOT MATCHED THEN 
    INSERT (id, email, nombre) VALUES (nuevos_datos.id, nuevos_datos.email, nuevos_datos.nombre);

We will ensure the insertion and update done in the users table with:

%sql
SELECT * FROM usuarios;

This shows us the users table with the user id = 2 updated and the new user id = 3 added.

Step 6. Version control and transaction history

Now, let’s see how to manage version control and transaction history in delta table:

Step 6.1. Reviewing transaction history

Para ver el historial de transacciones de nuestra tabla usuarios, podemos usar el comando DESCRIBE HISTORY:

%sql
DESCRIBE HISTORY usuarios;

This command will show a history of all operations performed on the users table, including changes made during upsert operations, as we can see in the following image:

To learn more about the history scheme and the operation metric keys, you can use this link.

6.2. Access to previous versions of data (Time Travel)

Delta Lake’s “Time Travel” supports querying previous table versions based on timestamp or table version (as recorded in the transaction log). This functionality is extremely useful for use cases such as recovery of data, auditing and analyzing historical data, correcting errors in data, or providing snapshot isolation to a set of queries for rapidly changing tables.

Conveniently, it allows you to choose to view a specific version of the table or snapshot at a given point in time.

To query a specific version of the table, we can use the following code:

%sql
SELECT * FROM usuarios VERSION AS OF 2;

In this example, 2 is the version number of the users table you want to query.

We can also query the table at a specific point in time:

%sql
SELECT * FROM usuarios TIMESTAMP AS OF '2024-01-27 11:37:00';

This example queries the state of the users table as it was on January 27, 2024 at 11:37:00.

6.3.Recovering previous versions of data

Delta’s “Time Travel” feature is not only useful for retrospective queries, but also for data restoration. In our example we will restore the users table to its initial state, querying the desired version and using the recovered data to overwrite or update the current table.

The process involves two main steps: first, identify the specific version of the table you want to return to, and second, perform the restore itself.

6.4. Identify the version to roll back to

Before reverting the table, you need to know which version you want to revert to. You can do this by querying the table’s history to find the desired version:

%sql
DESCRIBE HISTORY usuarios;

This command, as we saw in point 6.1, shows a history of all versions of the table, including the operations performed in each version. Each history entry in history has an associated version number, we will try to take it to version 2, a point immediately prior to performing the MERGE.

6.5. Performing the rollback

Once you have identified the desired version, we will use “Time Travel” to revert the table to that specific version. To return to version 2, we will create a temporary view or a new table with the desired state of the users table:

%sql
CREATE OR REPLACE TEMPORARY VIEW usuarios_revertidos AS
SELECT * FROM usuarios VERSION AS OF 2;

This command creates a temporary view reverted_users that reflects the state of the users table at the specified version (in this case, version 1). This step does not affect the original users table or its version history.

Then, we can use that view to replace the current data in the users table:

%sql
DELETE FROM usuarios;
​
INSERT INTO usuarios
SELECT * FROM usuarios_revertidos;

The restored version looks like the initial one, when inserting Ana and Juan’s records:

We have deleted all the current data in the users table and then we have inserted the data from the reverted_users view, achieving the “restoration” of the state of the users table to what it had in version 2. All these steps are recorded as new transactions in the history of the table, as we see below:

This ability to travel through time and access detailed transaction history positions Delta on Databricks as a powerful tool for data management in the business world, where accuracy, completeness, and the ability to track and reverse changes in data Data is essential for operational success and informed decision making.

Conclusion

In our exploration of Databricks, we have seen that this tool provides crucial functionalities for traceability and auditing. The version control and ‘Time Travel’ capabilities stand out, not only for facilitating data recovery, but for offering complete protection and tracking, essential for data security.

However, taking full advantage of these tools in Databricks, and maintaining a balance between performance and change management, implies important challenges such as query optimization, system resource adequacy are key to preserving efficiency and, at the same time, effective management. Rigorous version control ensures data integrity and consistency. Therefore, constant monitoring and the right touch of strategic adjustments will allow us to use these powerful functionalities without sacrificing efficiency and thus achieving an optimal balance between advanced security and operational performance.

n our next posts we will continue exploring the capabilities of Delta Lake in Databricks, if you found it interesting, we recommend the following articles:

Power BI Copilot: Innovating Data Analysis with Artificial Intelligence

SQL Server 2022 Parameter Sensitive Plan

Discovering Microsoft Fabric

And if you want to take your data analysis to the next level, contact us.

Complete este formulario para recibir la guía de Windows Server en Azure
*Obligatorio
Complete este formulario para recibir la guía de Windows Server en Azure
Gracias por rellenar el formulario [name]. ¡Aquí tienes tu eBook Gratis!
Complete este formulario para recibir 4 best practices to implement a comprehensive Zero Trust security approach
*Obligatorio
Complete este formulario para recibir 4 best practices to implement a comprehensive Zero Trust security approach
Gracias por rellenar el formulario [name]. ¡Aquí tienes tu eBook Gratis!
Complete este formulario para recibir Cloud Migration Essentials
*Obligatorio
Complete este formulario para recibir Cloud Migration Essentials
Gracias por rellenar el formulario [name]. ¡Aquí tienes tu eBook Gratis!
Complete este formulario para recibir Cloud security Advice for Nonprofit Leaders
*Obligatorio
Complete este formulario para recibir Cloud security Advice for Nonprofit Leaders
Gracias por rellenar el formulario [name]. ¡Aquí tienes tu eBook Gratis!
Complete este formulario para recibir Prevent data leaks with Microsoft 365 Business Premium
*Obligatorio
Complete este formulario para recibir Prevent data leaks with Microsoft 365 Business Premium
Gracias por rellenar el formulario [name]. ¡Aquí tienes tu eBook Gratis!
Complete this form to recieve the guide of Windows Server on Azure
*Required
Complete this form to recieve the guide of Windows Server on Azure
Thank you for filling out the form [name]. Here's your Free eBook!
Complete this form to recieve 4 best practices to implement a comprehensive Zero Trust security approach
*Required
Complete this form to recieve 4 best practices to implement a comprehensive Zero Trust security approach
Thank you for filling out the form [name]. Here's your Free eBook!
Complete this form to recieve Cloud Migration Essentials
*Required
Complete this form to recieve Cloud Migration Essentials
Thank you for filling out the form [name]. Here's your Free eBook!
Complete this form to recieve Cloud security Advice for Nonprofit Leaders
*Required
Complete este formulario para recibir Cloud security Advice for Nonprofit Leaders
Thank you for filling out the form [name]. Here's your Free eBook!
Complete this form to recieve Prevent data leaks with Microsoft 365 Business Premium
*Obligatorio
Complete this form to recieve Prevent data leaks with Microsoft 365 Business Premium
Thank you for filling out the form [name]. Here's your Free eBook!
Complete this form to recieve Cloud Migration Simplified Ebook.
*Obligatorio
Complete this form to recieve Prevent data leaks with Microsoft 365 Business Premium
Thank you for filling out the form [name]. Here's your Free eBook!