Simulation of RDB queries using Perl (Small and Large Stones of Programming)

Relational databases are widely used. However, it is not very clear how queries written by SQL are executed by the DBMS. Queries are usually interpreted by DBMS but not compiled into native code. The reason why not compiled into machine code is that it is very important to optimize queries by looking at the properties of tables at run time for the query to be applied. However, in this article, to look at the execution mechanism of queries, I try to simulate them using Perl programs (i.e., “compiled code”).

Introduction
Data to Be Used
Simple Query
Query with a DISTINCT Operator
Query with a WHERE Clause
Query with a WHERE Clause with AND and OR Operators
Query with a WHERE Clause with an IN Operator
Query with a WHERE Clause with a LIKE Operator
Query with an ORDER-BY Clause
Query with an Aggregation Using SUM Function
Query with an Aggregation Using COUNT and AVG Functions
Query with a Combination of COUNT Function and DISTINCT Operator
Query with a GROUP-BY Clause
Query with GROUP-BY and HAVING Clauses
Query with a JOIN Operator
Query with Nested SELECTs
Query with a UNION Operator
Query with a UNION ALL Operator
Query with an INTERSECT Operator
Query with a MINUS Operator

Introduction

In a Web site called “SQL Join” (which is described in Japanese), there are tens of examples that contain SQL statements and execution results. I modify these examples and translate them into Perl code. Although the whole program is included in this article, concatenated and immediately executable souce code is available.

Data to Be Used

I use a slightly modified version of the data used in “SQL Join”. Databases are usually stored in disks, but I assume they are in main memory here. It is unlikely to put databases on volatile memory. However, it is possible to put them on main memory if it is non-volatile.

my @Store_Information =
    ({store_name => 'Los Angeles', Sales => 1500, Date => 'Jan-05-1999'},
     {store_name => 'San Diego', Sales => 250, Date => 'Jan-07-1999'},
     {store_name => 'San Francisco', Sales => 300, Date => 'Jan-08-1999'},
     {store_name => 'Los Angeles', Sales => 300, Date => 'Jan-08-1999'},
     {store_name => 'Boston', Sales => 700, Date => 'Jan-08-1999'});

my @Geography =
    ({region_name => 'East', store_name => 'Boston'},
     {region_name => 'East', store_name => 'New York'},
     {region_name => 'West', store_name => 'Los Angeles'},
     {region_name => 'West', store_name => 'San Diego'});

my @Internet_Sales =
    ({Date => 'Jan-07-1999', Sales => 250},
     {Date => 'Jan-10-1999', Sales => 535},
     {Date => 'Jan-11-1999', Sales => 320},
     {Date => 'Jan-12-1999', Sales => 750});

Here, Store_Information table is the record of shop-wise sales. Geography table indicates the relationship between the stores (shops) and the areas that the stores exist. Internet_Sales table is the record of amount of one-day sales on the Internet. Tables are represented by arrays, and each array element, i.e., the database record (tuple) is represented by a hash in Perl.

Simple Query

First, I show a simplest SELECT statement for example. The original SQL query is as follows.

SELECT store_name FROM Store_Information

This query extracts shop names (store_name) from Store_Information table (which is an array of hashes). This query can be translated into the following Perl code.

sub select_store_name() {
    print_result(@Store_Information, ['store_name']);
}

Subroutine select_store_name only gets elements that has 'store_name' as its key from each record in Store_Information table, and prints them. print_result is a general-purpose subroutine that is used in the following examples too, and it extracts and prints the elements that has the specified (one or more) keys.

sub print_result(\@$) {
    my ($result, $keys) = @_;
    foreach my $key (@$keys) {
	print "\t$key";
    }
    print "\n\t----------------\n";

    foreach my $record (@$result) {
	foreach my $key (@$keys) {
	    print "\t$record->{$key}";
	}
	print "\n";
    }
}

By calling subroutine select_store_name(), the following result is obtained.

	store_name
	----------------
	Los Angeles
	San Diego
	San Francisco
	Los Angeles
	Boston

In Store_Information table, there are two (different) stores named “Los Angeles”. So it was displayed twice.

To compare this subroutine with the examples below (for example, “A Query with WHERE clause”), the following form of select_store_name (whose function is the same as above) may be better.

sub select_store_name() {
    my @result = ();
    foreach my $record (@Store_Information) { # FROM Store_Information
	push(@result, $record);
    }

    print_result(@result, ['store_name']); # SELECT store_name
}

However, the newly added loop in this program does not perform any net work (i.e., it only copies the array). So, if the loop is removed, the program becomes the one shown first.

Query with a DISTINCT Operator

In the next example, as same as the above example, the store names are extracted from Store_Information talbe, but duplicated store names are deleted.

SELECT DISTINCT store_name FROM Store_Information

This query can be translated into the following Perl code.

sub select_distinct_store_name() {
    my %Distinct_Store_Information = ();
    foreach my $record (@Store_Information) { # FROM Store_Information
	$Distinct_Store_Information{$record->{store_name}} = $record; # DISTINCT
    }
    my @result = ();
    foreach my $store_name (keys %Distinct_Store_Information) { # DISTINCT
	my $record = $Distinct_Store_Information{$store_name};  # DISTINCT
	push(@result, $record);
    }

    print_result(@result, ['store_name']); # SELECT store_name
}

To remove duplicated store names, a hash table called %Distinct_Store_Information is used. It is easy to implement a hash table in Perl because Perl has so-called hashes. In @result, there are only one “Los Angeles” which was duplicated in array @Store_Information.

By calling subroutine select_store_name(), the following result is obtained.

	store_name
	----------------
	San Francisco
	San Diego
	Los Angeles
	Boston

“Los Angeles” is not duplicated here.

Query with a WHERE Clause

In the next example, store names are extracted from Store_Information table too. However, only the stores whose sales is over 1000 (dollars) are selected.

SELECT store_name
FROM Store_Information
WHERE Sales > 1000

This statement can be translated into the following Perl code.

sub select_store_name_where() {
    my @result = ();
    foreach my $record (@Store_Information) { # FROM Store_Information
	if ($record->{Sales} > 1000) { # WHERE Sales > 1000
	    push(@result, $record);
	}
    }

    print_result(@result, ['store_name']); # SELECT store_name
}

By comparing this subroutine with the select_store_name subroutine with an added loop, we can see the only difference is a conditional statement that corresponds to the WHERE clause.

By calling subroutine select_store_name_where(), the following result is obtained.

	store_name
	----------------
	Los Angeles

Query with a WHERE Clause with AND and OR Operators

An example with a WHERE clause with AND and OR operators is shown here.

SELECT store_name
FROM Store_Information
WHERE Sales > 1000 OR (Sales < 500 AND Sales > 275)

This statement can be translated into the following Perl code.

sub select_store_name_and_or() {
    my @result = ();
    foreach my $record (@Store_Information) { # FROM Store_Information
	if ($record->{Sales} > 1000 ||
	    $record->{Sales} < 500 && $record->{Sales} > 275) {
		# WHERE Sales > 1000 OR (Sales < 500 AND Sales > 275)
	    push(@result, $record);
	}
    }

    print_result(@result, ['store_name']); # SELECT store_name
}

By calling subroutine select_store_name_and_or(), the following result is obtained.

	store_name
	----------------
	Los Angeles
	San Francisco
	Los Angeles

Query with a WHERE Clause with an IN Operator

An example with a WHERE clause with an IN operator is shown here.

SELECT *
FROM Store_Information
WHERE store_name IN ('Los Angeles', 'San Diego')

This statement can be translated into the following Perl code.

sub select_store_name_in() {
    my @result = ();
    foreach my $record (@Store_Information) { # FROM Store_Information
	if (grep ($_ eq $record->{store_name},
		  'Los Angeles', 'San Diego')) {
		# WHERE store_name IN ('Los Angeles', 'San Diego')
	    push(@result, $record);
	}
    }

    print_result(@result, ['store_name', 'Sales', 'Date']); # SELECT *
}

Except that grep function is used for the list search and that program is very similar to the previous one, i.e., select_store_name_and_or().

By calling subroutine select_store_name_in(), the following result is obtained.

	store_name	Sales	Date
	----------------
	Los Angeles	1500	Jan-05-1999
	San Diego	250	Jan-07-1999
	Los Angeles	300	Jan-08-1999

Query with a WHERE Clause with a LIKE Operator

An example with a WHERE clause with a LIKE operator is shown here.

SELECT *
FROM Store_Information
WHERE store_name LIKE '%AN%'

LIKE operator matches a character string to a regular expression. Because pattern matching is a strong part of Perl, it can easily be expressed by Perl.

sub select_store_name_like() {
    my @result = ();
    foreach my $record (@Store_Information) { # FROM Store_Information
	if ($record->{store_name} =~ /.*AN.*/i) { # WHERE store_name LIKE '%AN%'
	    push(@result, $record);
	}
    }

    print_result(@result, ['store_name', 'Sales', 'Date']); # SELECT *
}

By calling subroutine select_store_name_like(), the following result is obtained.

	store_name	Sales	Date
	----------------
	Los Angeles	1500	Jan-05-1999
	San Diego	250	Jan-07-1999
	San Francisco	300	Jan-08-1999
	Los Angeles	300	Jan-08-1999

Query with an ORDER-BY Clause

To sort the result by the decreasing order of sales, an ORDER-BY clause can be used as follows.

SELECT store_name, Sales, Date
FROM Store_Information
ORDER BY Sales

This statement can be translated into the following Perl code.

sub select_store_name_order_by() {
    my @result = ();
    @result = sort {$b->{Sales} <=> $a->{Sales}} @Store_Information; # ORDER BY Sales

    print_result(@result, ['store_name', 'Sales', 'Date']);
    # SELECT store_name, Sales, Date
}

That means, sort function can be used before displaying the result.

By calling subroutine select_store_name_order_by(), the following result is obtained.

	store_name	Sales	Date
	----------------
	Los Angeles	1500	Jan-05-1999
	Boston	700	Jan-08-1999
	San Francisco	300	Jan-08-1999
	Los Angeles	300	Jan-08-1999
	San Diego	250	Jan-07-1999

Query with an Aggregation Using SUM Function

The examples above do not contain computation using multiple records. The next example use SUM function to compute the sum of sales.

SELECT SUM(Sales) FROM Store_Information

This statement can be translated into the following Perl code.

sub select_sales_sum() {

    my $SUM_Sales = 0;
    foreach my $record (@Store_Information) { # FROM Store_Information
	$SUM_Sales += $record->{Sales};
    }

    print "\tSUM(Sales)\n";
    print "\t$SUM_Sales\n";
}

By calling subroutine select_sales_sum(), the following result is obtained.

	SUM(Sales)
	3050

Query with an Aggregation Using COUNT and AVG Functions

The following example is similar to the previous example using SUM function. This statement counts the number of records using COUNT function, and calculates the average (using AVG function).

SELECT COUNT(store_name), AVG(Sales)
FROM Store_Information
WHERE store_name is not NULL

This statement can be translated into the following Perl code.

sub select_store_name_count() {
    my $COUNT = 0;
    my $SUM_Sales = 0;
    foreach my $record (@Store_Information) { # FROM Store_Information
	if ($record->{store_name}) { # WHERE store_name is not NULL
	    $COUNT++; # COUNT(store_name)
	    $SUM_Sales += $record->{Sales};
	}
    }

    print "\tCOUNT(store_name)\tAVG(Sales)\n";
    print "\t$COUNT\t", $SUM_Sales / $COUNT, "\n";
}

This subroutine calculates the same intermediate value as SUM function, and it finally gets the average by dividing the intermediate value by the number of records.

By calling subroutine select_store_name_count(), the following result is obtained.

	COUNT(store_name)	AVG(Sales)
	5	610

Query with a Combination of COUNT Function and DISTINCT Operator

The next example combines use of a COUNT function and a DISTINCT operator.

SELECT COUNT(DISTINCT store_name)
FROM Store_Information

This statement can be translated into the following Perl code.

sub select_store_name_count_distinct() {
    my %Distinct_Store_Information = (); # DISTINCT
    foreach my $record (@Store_Information) { # FROM Store_Information
	$Distinct_Store_Information{$record->{store_name}} = $record; # DISTINCT
    }
    my $COUNT_store_name = 0;
    foreach my $store_name (keys %Distinct_Store_Information) { # DISTINCT
	$COUNT_store_name++; # COUNT(DISTINCT store_name)
    }

    print "COUNT(DISTINCT store_name)\n";
    print $COUNT_store_name, "\n";
}

By calling subroutine select_store_name_count_distinct(), the following result is obtained.

COUNT(DISTINCT store_name)
4

Because there are two stores named “Los Angeles”, this subroutine counts them only once.

Query with a GROUP-BY Clause

The following statement groups stores by their names, and shows the sales of each group.

SELECT store_name, SUM(Sales)
FROM Store_Information
GROUP BY store_name

This statement can be translated into the following Perl code.

sub select_store_name_group_by() {
    my %Grouped_Store_Information = ();
    foreach my $record (@Store_Information) { # FROM Store_Information
	if (!$Grouped_Store_Information{$record->{store_name}}) { # GROUP BY store_name
	    $Grouped_Store_Information{$record->{store_name}} =
	    {store_name => $record->{store_name},
	     SUM_Sales => $record->{Sales}};
	} else {
	    $Grouped_Store_Information{$record->{store_name}}->{SUM_Sales} +=
		$record->{Sales};
	}
    }
    my @result = ();
    foreach my $store_name (keys %Grouped_Store_Information) {
	push(@result, $Grouped_Store_Information{$store_name});
    }

    print "store_name SUM(Sales)\n";
    print_result(@result, ['store_name', 'SUM_Sales']); # SELECT store_name, SUM(Sales)
}

The first loop contains a condition statement. A new record is created in the then-part of the condition statement. This is the record for holding the intermediate and final values for the calculation of SUM. In the previous examples, there was no need to hold such an intermediate value. So only pointers to preexisting records are used and there was no need to generate new records (hashes). The else-part of the conditional statement in the present example only updates the intermediate value of SUM.

By calling subroutine select_store_name_group_by(), the following result is obtained.

	store_name	SUM_Sales
	----------------
	San Francisco	300
	San Diego	250
	Los Angeles	1800
	Boston	700

The sales of two stores named “Los Angeles” are added, but the other values are the same as those in Store_Information table.

Query with GROUP-BY and HAVING Clauses

To express a condition on the result of “GROUP BY”, “HAVING” is used in this example.

SELECT store_name, SUM(sales)
FROM Store_Information
GROUP BY store_name
HAVING SUM(sales) > 1500

This statement shows the groups whose sum of sales is greater than 1500.

This statement can be translated into the following Perl code.

sub select_store_name_having() {
    my %Grouped_Store_Information = ();
    foreach my $record (@Store_Information) { # FROM Store_Information
	if (!$Grouped_Store_Information{$record->{store_name}}) { # GROUP BY store_name
	    $Grouped_Store_Information{$record->{store_name}} =
	    {store_name => $record->{store_name},
	     SUM_Sales => $record->{Sales}};
	} else {
	    $Grouped_Store_Information{$record->{store_name}}->{SUM_Sales} +=
		$record->{Sales};
	}
    }
    my @result = ();
    foreach my $store_name (keys %Grouped_Store_Information) {
	my $record = $Grouped_Store_Information{$store_name};
	if ($record->{SUM_Sales} > 1500) { # HAVING SUM(sales) > 1500
	    push(@result, $record);
	}
    }

    print "store_name SUM(Sales)\n";
    print_result(@result, ['store_name', 'SUM_Sales']); # SELECT store_name, SUM(sales)
}

The condition in the WHERE clause in handled in the first loop, but the condition in the HAVING clause is handled in the second loop.

By calling subroutine select_store_name_having(), the following result is obtained.

	store_name	SUM_Sales
	----------------
	Los Angeles	1800

Query with a JOIN Operator

JOIN operation is one of the most important operations for relational databases.

SELECT Geography.region_name, SUM(Store_Information.Sales)
FROM Geography, Store_Information
WHERE Geography.store_name = Store_Information.store_name
GROUP BY Geography.region_name

In this example, both tables contain store names. They are used to join these tables. This means, the sales for each region are calculated in this example.

There are many methods to execute JOIN. Methods of optimization were the most important themes of database researches around 1980s. However, only one method called "hash join" is used here. This is a method of optimizing join operations using hashing.

sub select_store_geo_join() {
    my %Distinct_Geography = ();
    foreach my $record (@Geography) { # FROM Geography
	push(@{$Distinct_Geography{$record->{store_name}}}, $record); # hash chain
    }
    my %Grouped = ();
    foreach my $record_S (@Store_Information) { # FROM Store_Information
	foreach my $record_G (@{$Distinct_Geography{$record_S->{store_name}}}) {
	    if (!$Grouped{$record_G->{region_name}}) { # GROUP BY region_name
		$Grouped{$record_G->{region_name}} =
		{region_name => $record_G->{region_name},
		 Sales => $record_S->{Sales}};
	    } else {
		$Grouped{$record_G->{region_name}}->{Sales} +=
		    $record_S->{Sales};
	    }
	}
    }
    my @result = ();
    foreach my $region_name (keys %Grouped) {
	push(@result, $Grouped{$region_name});
    }

    print_result(@result, ['region_name', 'Sales']); # SELECT store_name
}

A hash table is created in the first loop. Hash tables are also used in the examples of “DISTINCT” and “GROUP BY”. However, in these examples, each hash-table entry contains only one result value (such as the value of SUM). In contrast, in this example, each entry may have multiple records (as an array), so the program is more complicated. However, the algorithm of "hash join" can be expressed by such a short program using Perl because of its power.

The second loop is nested, but the number of repetition of the inner loop is usually small. The inner loop contains a conditional statement. The then-part creates a new record that is a joined record of two tables. The else-part of the conditional statement only updates the records.

By calling subroutine select_store_geo_join(), the following result is obtained.

	region_name	Sales
	----------------
	West	2050
	East	700

Query with Nested SELECTs

The following query contains a subquery, i.e., SELECTs are nested in this example.

SELECT SUM(Sales) FROM Store_Information
WHERE Store_name IN
(SELECT store_name FROM Geography
 WHERE region_name = 'West')

This statement sums up the stores in the west region. This statement can be translated into the following Perl code.

sub select_store_subquery() {
    my @work = ();
    foreach my $record (@Geography) { # FROM Geography
	if ($record->{region_name} eq 'West') { # WHERE region_name = 'West')
	    push(@work, $record->{store_name}); # SELECT store_name
	}
    }
    my $SUM_Sales = 0;
    foreach my $record (@Store_Information) { # FROM Store_Information
	if (grep ($_ eq $record->{store_name}, @work)) { # WHERE store_name IN ...
	    $SUM_Sales += $record->{Sales};
	}
    }

    print "SUM_Sales\n";
    print $SUM_Sales, "\n";
}

The first loop corresponds to the inner SELECT, and the second loop corresponds to the outer SELECT.

By calling subroutine select_store_subquery(), the following result is obtained.

SUM_Sales
2050

Query with a UNION Operator

This query computes the union of two queries.

SELECT Date FROM Store_Information
UNION
SELECT Date FROM Internet_Sales

This statement can be translated into the following Perl code.

sub select_date_union() {
    my %work = ();
    foreach my $record (@Store_Information) { # FROM Store_Information
	$work{$record->{Date}} = 1; # SELECT DISTINCT Date
    }
    foreach my $record (@Internet_Sales) {    # FROM Internet_Sales
	$work{$record->{Date}} = 1; # UNION SELECT DISTINCT Date
    }
    my @result = ();
    foreach my $Date (keys %work) {
	push(@result, {Date => $Date});
    }

    print_result(@result, ['Date']);
}

A UNION operator must remove duplicated values, so a hash table is used for this purpose. The hash table contains null ('' or 0) or 1. Only the elements that contains 1 belongs to the result of UNION.

By calling subroutine select_date_union(), the following result is obtained.

	Date
	----------------
	Jan-07-1999
	Jan-05-1999
	Jan-11-1999
	Jan-12-1999
	Jan-10-1999
	Jan-08-1999

The dates are not duplicated, but the order of them is nondeterministic because ordering is not specified.

Query with a UNION ALL Operator

This example is close to the previous one. It uses “UNION ALL”, instead of “UNION”. This means, this query does not remove duplicates.

SELECT Date FROM Store_Information
UNION ALL
SELECT Date FROM Internet_Sales

This statement can be translated into the following Perl code.

sub select_date_union_all() {
    my @work = ();
    foreach my $record (@Store_Information) { # FROM Store_Information
	push(@work, {Date => $record->{Date}}); # SELECT Date
    }
    foreach my $record (@Internet_Sales) {    # FROM Internet_Sales
	push(@work, {Date => $record->{Date}}); # UNION ALL SELECT Date
    }
    my @result = ();
    foreach my $record (@work) {
	push(@result, $record);
    }

    print_result(@result, ['Date']);
}

Because there is no need to remove duplicated values for UNION ALL operator, it is easier to implement than for UNION operator.

By calling subroutine select_date_union_all(), the following result is obtained.

	Date
	----------------
	Jan-05-1999
	Jan-07-1999
	Jan-08-1999
	Jan-08-1999
	Jan-08-1999
	Jan-07-1999
	Jan-10-1999
	Jan-11-1999
	Jan-12-1999

Query with an INTERSECT Operator

This example contains an INTERSECT (intersection) operator instead of a UNION operator.

SELECT Date FROM Store_Information
INTERSECT
SELECT Date FROM Internet_Sales

This statement can be translated into the following Perl code.

sub select_date_intersect() {
    my %work = ();
    foreach my $record (@Store_Information) { # FROM Store_Information
	$work{$record->{Date}} = 1;
    }
    foreach my $record (@Internet_Sales) {    # FROM Internet_Sales
	if ($work{$record->{Date}}) {
	    $work{$record->{Date}} = 2;	# INTERSECT
	}
    }
    my @result = ();
    foreach my $Date (keys %work) {
	if ($work{$Date} == 2) {
	    push(@result, {Date => $Date});
	}
    }

    print_result(@result, ['Date']); # SELECT Date
}

INTERSECT operator is slightly more difficult to implement than UNION operator. Only the elements that contain "2" are included in the result of the INTERSECT operator. However, this method cannot be applied to programs that contain more complicated expressions.

By calling subroutine select_date_intersect(), the following result is obtained.

	Date
	----------------
	Jan-07-1999

Query with a MINUS Operator

This query computes the difference of two subquery results.

SELECT Date FROM Store_Information
MINUS
SELECT Date FROM Internet_Sales

This statement can be translated into the following Perl code.

sub select_date_minus() {
    my %work = ();
    foreach my $record (@Store_Information) { # FROM Store_Information
	$work{$record->{Date}} = {Date => $record->{Date}};
    }
    foreach my $record (@Internet_Sales) {    # FROM Internet_Sales
	if ($work{$record->{Date}}) {
	    $work{$record->{Date}} = ''; # MINUS
	    # or, delete $work{$record->{Date}};
	}
    }
    my @result = ();
    foreach my $Date (keys %work) {
	if ($work{$Date}) {
	    push(@result, {Date => $Date});
	}
    }

    print_result(@result, ['Date']); # SELECT Date
}

By calling subroutine select_date_minus(), the following result is obtained.

	Date
	----------------
	Jan-05-1999
	Jan-08-1999

Keywords: Relational Database, SQL, Join

Created on December 29, 2008 6:45 PM | Updated on December 31, 2008 7:19 PM | Individual link

Small and Large Stones of Programming