Storing Hierarchical Data in a Relational Database

The organization of hierarchical data is a unique challenge in the area of database management DBMS. Hierarchical structures are common in many fields, from organizations in charts to storage systems and categories of products.

Careful consideration of the database schema and the chosen storage model is required to effectively store and query hierarchical data in RDBMS relational databases.

In this article, we will go through the options available for storing hierarchical data in a relational database, exploring their advantages, disadvantages, and use cases. The options mainly are:

  1. Adjacency List Model
  2. Path Enumeration
  3. Nested Set Model
  4. Materialized Path Model

Before diving in, let’s acknowledge these concepts:

What is a Relational Database?

A relational database is a type of database that organizes data into rows and columns which collectively form a table where the data points are related to each other RDBMS.

SQL queries aggregate data, aiding firms in business performance analysis, process optimization, and insight generation. They organize data by linking tables through primary and foreign keys, revealing interconnections.

Storing Hierarchical Data in a Database

Managing hierarchical data in relational databases presents a challenge due to the mismatch between the hierarchical structures and the tabular nature of relational databases. The Strategies are explained below:

1. Adjacency List Model

The Adjacency List Model is a simple and intuitive way to represent hierarchical data in relational databases. In this model, each record contains a reference to its parent record, forming a tree-like structure. For instance, an employee table might have a field referencing the manager’s ID for each employee.

Example:

Here’s a basic schema for an employee table using the Adjacency List Model:

CREATE TABLE Employee (
employee_id INT PRIMARY KEY,
name VARCHAR(100),
manager_id INT,
FOREIGN KEY (manager_id) REFERENCES Employee(employee_id)
);
  • employee_id: Unique identifier for each employee.
  • name: Name of the employee.
  • manager_id: Reference to the manager’s employee_id.

Let’s populate the table with some sample data:

INSERT INTO Employee (employee_id, name, manager_id) VALUES
(1, 'John Doe', NULL), -- John Doe is the CEO, so he has no manager.
(2, 'Jane Smith', 1), -- Jane Smith reports to John Doe.
(3, 'Alice Johnson', 1), -- Alice Johnson also reports to John Doe.
(4, 'Bob Williams', 2), -- Bob Williams reports to Jane Smith.
(5, 'Emily Brown', 2); -- Emily Brown also reports to Jane Smith.

Now, let’s query the data to see the organizational hierarchy in SQL:

SELECT 
e.employee_id,
e.name,
COALESCE(m.name, 'CEO') AS manager
FROM
Employee e
LEFT JOIN
Employee m ON e.manager_id = m.employee_id;

This query retrieves each employee’s ID, name, and the name of their manager. We use a LEFT JOIN to ensure that even employees without a manager (such as the CEO) are included in the results. The result of the query might look like this:

employee_id

name

manager

1

John Doe

CEO

2

Jane Smith

John Doe

3

Alice Johnson

John Doe

4

Bob Williams

Jane Smith

5

Emily Brown

Jane Smith

Pros:

  • Simple to grasp and put into practice.
  • Flexibility is the representation of asymmetrical hierarchies.

Cons:

  • Inefficient querying and traversing of hierarchies might occur, particularly for deeply nested structures.
  • Recursive queries, which can be intricate and resource-intensive, are frequently required.

2. Path Enumeration

In a relational database, path enumeration is a technique for storing hierarchical data in which the complete path of each node from the root node is recorded. With a specified path, this method makes it easier to get parent or child nodes; but, it may result in slower queries, particularly for large datasets.

Example:

Let us consider the scenario where we wish to depict a filesystem hierarchy in which every file or folder has a name, a unique identifier, and its whole path.

CREATE TABLE FileSystem (
node_id INT PRIMARY KEY,
name VARCHAR(100),
full_path VARCHAR(255)
);
  • node_id: Unique identifier for each node.
  • name: Name of the file or folder.
  • full_path: Full path of the node in the filesystem.

Let’s populate the table with some sample data:

INSERT INTO FileSystem (node_id, name, full_path) VALUES
(1, 'Root', '/'),
(2, 'Documents', '/Documents'),
(3, 'Images', '/Documents/Images'),
(4, 'File1.txt', '/Documents/File1.txt'),
(5, 'File2.txt', '/Documents/File2.txt'),
(6, 'File3.jpg', '/Documents/Images/File3.jpg');

Now, let’s query the data to see the filesystem hierarchy:

SELECT * FROM FileSystem;

Output:

node_id

name

full_path

1

Root

/

2

Documents

/Documents

3

Images

/Documents/Images

4

File1.txt

/Documents/File1.txt

5

File2.txt

/Documents/File2.txt

6

File3.jpg

/Documents/Images/File3.jpg

As you can see, each node in the filesystem hierarchy has a unique identifier, a name, and its full path.

Querying for a specific node or its children becomes easier using the full_path column. For example, to retrieve all children of the “/Documents” folder, you can use:

SELECT * FROM FileSystem WHERE full_path LIKE '/Documents/%';

In essence, all children of the ‘/Documents‘ folder will be retrieved by this query, which will return any nodes whose complete path begins with ‘/Documents/‘.

Path Enumeration makes retrieving hierarchical data easier, but it can make searches take longer, especially for structures with several levels of nesting. Furthermore, it could be necessary to update several rows to update the hierarchy, which could affect performance. Because of this, it’s critical to carefully weigh the trade-offs when selecting a relational database storage format for hierarchical data.

Pros:

  • Straightforward to implement.
  • Simple to retrieve parent or child nodes given a specific path.

Cons:

  • Retrieval and traversal of hierarchies can be slow and resource-intensive, especially for large datasets.
  • Limited support for operations like subtree queries or reordering nodes.

3. Nested Set Model

With the Nested Set Model, two numbers—a left value and a right value—represent each node in the tree for storing hierarchical data in a relational database. The way these values are allocated makes it possible to query the hierarchical structure effectively, get subtrees, and perform operations like counting descendants.

Example:

Let us say we wish to depict a hierarchical organizational structure in which every employee has a name, a position within the hierarchy, and a unique identity.

Here’s how we can create a table to represent this structure in SQLite:

CREATE TABLE Employee (
employee_id INTEGER PRIMARY KEY,
name TEXT,
left_value INTEGER,
right_value INTEGER
);
  • employee_id: Unique identifier for each employee.
  • name: Name of the employee.
  • left_value: Left boundary value for the node in the nested set.
  • right_value: Right boundary value for the node in the nested set.

Let’s populate the table with some sample data:

INSERT INTO Employee (employee_id, name, left_value, right_value) VALUES
(1, 'John Doe', 1, 10),
(2, 'Jane Smith', 2, 5),
(3, 'Alice Johnson', 6, 9),
(4, 'Bob Williams', 3, 4),
(5, 'Emily Brown', 7, 8);

Now, let’s query the data to see the organizational hierarchy:

SELECT * FROM Employee;

Output:

employee_id

name

left_value

right_value

1

John Doe

1

10

2

Jane Smith

2

5

3

Alice Johnson

6

9

4

Bob Williams

3

4

5

Emily Brown

7

8

Pros:

  • Efficient for subtree retrieval and operations like counting descendants.
  • Well-suited for hierarchies with frequent read operations.

Cons:

  • Complex to maintain, especially when nodes are inserted or deleted.
  • Queries involving updates to the hierarchy can be challenging and computationally expensive.

Materialized Path Model

Similar to path enumeration, the materialized path model stores the full path of each node, along with additional optimizations such as storing the depth of each node.

Pros:

  • Simplifies querying and traversal of hierarchies.
  • Supports operations like subtree retrieval and path-based queries efficiently.

Cons:

  • May require additional storage space.
  • Updates to the hierarchy can be complex, especially when nodes are moved or reorganized.

Conclusion

In conclusion, choosing the appropriate storage model for hierarchical data in a relational database depends on various factors, including the size and complexity of the hierarchy, the frequency of updates, and the types of queries that will be performed. While each storage model has its advantages and disadvantages, understanding the nuances of each approach is crucial for designing efficient and scalable database schemas that effectively manage hierarchical data.