How to use Semi-Join in Distributed Query Processing In SQL

The semi-join operation is used in distributed query processing to reduce the number of tuples in a table before it is transmitted to another site. Fewer tuples mean fewer and smaller transmissions, which lowers the total cost of data transfer. Suppose we have two tables, R1 at site S1 and R2 at site S2. We first forward the joining column of one table, say R1, to the site where the other table, R2, is located. There the column is joined with R2, so that only the R2 tuples with a matching value in R1 remain; this reduced R2 is then sent back to S1 and joined with R1. Whether to reduce R1 or R2 can only be decided after comparing the savings from reducing R1 with those from reducing R2. Semi-join is thus an effective way to reduce data transfer in distributed query processing.
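The steps described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the rows, the column names (`did`, `dname`, and so on) and the helper functions are hypothetical, and plain in-memory lists stand in for tables stored at two different sites.

```python
# Hypothetical data: R1 lives at site S1, R2 at site S2.
# Both tables share the join column "did" (an assumed name).
R1 = [
    {"eid": 1, "name": "Asha", "did": 10},
    {"eid": 2, "name": "Ben", "did": 20},
    {"eid": 3, "name": "Chen", "did": 10},
]
R2 = [
    {"did": 10, "dname": "Sales"},
    {"did": 30, "dname": "Legal"},
    {"did": 40, "dname": "Research"},
]


def project_join_column(table, column):
    """At the sending site: keep only the distinct join-column values."""
    return {row[column] for row in table}


def semi_join(table, column, keys):
    """At the receiving site: keep only rows whose join value is in keys."""
    return [row for row in table if row[column] in keys]


def join(left, right, column):
    """Plain equi-join on one column, used for the final local join."""
    index = {}
    for row in right:
        index.setdefault(row[column], []).append(row)
    return [{**l, **r} for l in left for r in index.get(l[column], [])]


keys_from_s1 = project_join_column(R1, "did")    # shipped S1 -> S2
reduced_r2 = semi_join(R2, "did", keys_from_s1)  # computed at S2, shipped S2 -> S1
result = join(R1, reduced_r2, "did")             # final join at S1
```

In this toy data only one of the three R2 rows has a match in R1, so only that row travels back to S1 instead of the whole table. Whether that saving outweighs the cost of shipping the join column first depends on the actual table sizes.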

Example: Find the amount of data transferred to execute the same query given in the above example using a semi-join operation.

Answer: The following strategy can be used to execute the query.

  • Project the required attributes of the EMPLOYEE table at site 1 and transfer them to site 3. Here we transfer NAME and DID (of EMPLOYEE), and the size is 30 * 1000 = 30000 bytes.
  • Transfer the DEPARTMENT table to site 3 and join it there with the projected attributes of EMPLOYEE. The size of the DEPARTMENT table is 30 * 50 = 1500 bytes.

Applying the above scheme, the amount of data transferred to execute the query will be 30000 + 1500 = 31500 bytes.
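The arithmetic can be checked with a short Python sketch. It assumes that in each product the factor 30 is the size of one transferred tuple in bytes and the other factor is the number of tuples; the function and variable names are made up for illustration.

```python
def transfer_cost(tuple_size_bytes, tuple_count):
    """Bytes needed to ship tuple_count tuples of tuple_size_bytes each."""
    return tuple_size_bytes * tuple_count


# Assumed reading of the figures in the example:
# 1000 projected EMPLOYEE tuples (NAME, DID) of 30 bytes each,
# 50 DEPARTMENT tuples of 30 bytes each.
employee_projection = transfer_cost(30, 1000)
department = transfer_cost(30, 50)
total = employee_projection + department

print(employee_projection, department, total)  # 30000 1500 31500
```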

Query Processing in Distributed DBMS

Query processing in a distributed database management system requires the transmission of data between the computers in a network. A distribution strategy for a query is the ordering of data transmissions and local data processing in the database system. Generally, a query in a distributed DBMS requires data from multiple sites, and moving that data between sites incurs communication costs. Query processing in a distributed DBMS therefore differs from query processing in a centralized DBMS because of the cost of transferring data over the network. The transmission cost is low when sites are connected through high-speed networks and can be quite significant in other networks.

The process used to retrieve data from a database is called query processing, and it involves several steps. The topics covered are:

  • Costs (Transfer of Data) of Distributed Query Processing
  • Using Semi-Join in Distributed Query Processing

Costs (Transfer of Data) of Distributed Query Processing

In distributed query processing, the data transfer cost is the cost of transferring intermediate files to other sites for processing plus the cost of transferring the final result files to the site where the result is required. Suppose a user sends a query to site S1 that requires data from S1 itself and from another site S2. There are three strategies for processing this query, given below:...

Conclusion

In conclusion, query processing in a distributed database management system (DBMS) is a complex procedure that must deal with transaction management, data dissemination, optimization, and fault tolerance. The performance, scalability, and dependability of distributed database systems depend on effective concurrency management, optimization, and query decomposition techniques....