It is a strange thing really: querying the database to find missing data. Queries usually report on what is there. Not the queries in this article (and its prequel): they report on what is not there. We explore a number of approaches to uncovering missing information – because the fact that something is not found in itself is a fact we may want to know about. And it gives me a nice motive to demonstrate a number of fun SQL features.
We are still very much engaged in the EMP and DEPT tables in the SCOTT schema that back in the early ’80s was pivotal in demonstrating the capabilities of the first Oracle RDBMS releases. It still does an excellent job for most of my demos today. Must be me…
This article delves into urgent matters such as ‘which letters do not occur in the names of the employees’, ‘in which months of the year was no one hired’, ‘who in each department does not have a colleague with less experience (or a lower salary)’, ‘which employees do not have a colleague in the same job (either in their own department or in the entire company)’, and more.
Letters not used in any Employee Name
Our first search for information that is not there is a quest for the letters (from the alphabet) that are not used in any of the Employee Names in table EMP. We have 14 employees, each with a 4 to 8 letter ENAME. We wonder (yes, we do) which of our 26 letters in the western european alphabet is not used in any of those 14 names.
Several approaches to this puzzling challenge are available to us, and I am sure there are many more than I will demonstrate here. All of these make use of the alphabet query that was introduced in the article about static and somewhat dynamic domains: http://technology.amis.nl/blog/9577/retrieving-the-values-for-domains-in-sql-queries-in-the-oracle-rdbms-months-days-numbers-and-custom-value-sets.
The first approach creates one long string out of all the employee enames, then joins that string with every letter in the alphabet. Only rows where the letter does not occur in the string are returned, thereby giving us the list of letters not used in any ename:
[code language="sql"]
with alphabet as
( select chr(64+rownum) letter
from dual
connect
by rownum < 27
)
, employeenames as
( select listagg(ename,',') within group (order by sal) enames
from emp
)
select letter
from alphabet
cross join
employeenames
where instr( enames, letter ) = 0
[/code]
The second approach, though similar, uses two special options: first a join condition that does not just look for equality in two columns but instead uses instr( ename, letter ) >0 . The second is the outer join with the specific filter for un-joined rows. Every letter that is not joined to any ENAME is returned by this query.
[code language="sql"]
with alphabet as
( select chr(64+rownum) letter
from dual
connect
by rownum < 27
)
select letter
from alphabet
left outer join
emp
on instr( ename, letter ) > 0
where emp.rowid is null
[/code]
The third approach uses the MINUS set operator to substract from the ALPHABET all letters that occur in any of the ENAMES – and note how the CONNECT BY is used to turn a string of characters into a rowset with a single row for each character in the string (it is almost pivoting, right there!).
[code language="sql"]
with alphabet as
( select chr(64+rownum) letter
from dual
connect
by rownum < 27
)
, employeenames as
( select listagg(ename,',') within group (order by sal) enames
from emp
)
, ename_letters as
( select substr(enames, level, 1) letter
from employeenames
connect
by level <= length(enames)
)
select letter
from alphabet
minus
select letter
from ename_letters
[/code]
The jobs that a department does not contain
Another pressing matter is to find out which JOBs do not occur in each department. The EMP table has a JOB column that specifies for each employee which Job he or she is in. A simple query
select distinct job from emp
reveals the various jobs that the organization that is described by table EMP contains. Suppose we want to find out which of these various jobs do not occur in a specific department. We need to craft a query that returns for each department a comma separated values list of the jobs that are not represented in the department.
Again, the outer query comes in handy, to join each department with each job, even if the job is not represented in the department. In order to make this work, we need to use the partition by clause. This clause instructs the SQL engine to perform the outer join for each department again. Every JOB therefore is present in the end result, not just once, but once for every department. Next, we will filter on all joins of JOB for a DEPARTMENT where an outer join was required – or where the DEPARTMENT does not have an EMP row to do a real join.
[code language="sql"]
select distinct
deptno
, position
from emp
partition by (deptno)
right outer join
( select distinct job position
from emp
) functions
on (emp.job = functions.position)
where emp.rowid is null
order
by deptno
, position
[/code]
The next step is to turn this query result into the comma separated aggregate that was requested.
[code language="sql"]
select deptno
, listagg(position, ',') within group (order by position) unfilled_jobs
from ( select deptno
, position
from emp
partition by (deptno)
right outer join
( select distinct job position
from emp
) functions
on (emp.job = functions.position)
where emp.rowid is null
)
group
by deptno
order
by deptno
[/code]
The rarest jobs
A simple query to find the most rare job (in terms of the largest number of departments not having that job), which is the same thing as the smallest number of departments having that job:
[code language="sql"]
select job
, count(distinct deptno) num_of_depts
from emp
group
by job
order
by num_of_depts
[/code]
In which months was no one hired
Every EMLOYEE has a hiredate. From the HIREDATE column, it is easy to find out in which year and month someone was hired. However, in search of more information not there, we would like to find out in which months there was no one hired at all.
The pattern is emerging: outer join and filter on the virtual join cases:
[code language="sql"]
with months as
( select 0 + rownum month_num
, to_char( to_date('01-'||rownum||'-2010', 'DD-MM-YYYY'), 'MONTH') month_name
from dual
connect
by level < 13
)
select months.month_name
from emp e
right outer join
months
on extract(month from e.hiredate)= months.month_num
where e.rowid is null
order
by months.month_num
[/code]
We make use of the MONTHS domain, also discussed in the article about static and somewhat dynamic domains: http://technology.amis.nl/blog/9577/retrieving-the-values-for-domains-in-sql-queries-in-the-oracle-rdbms-months-days-numbers-and-custom-value-sets.
A somewhat alternative approach uses simple aggregates and a filter on the employee-hire count:
[code language="sql"]
with months as
( select 0 + rownum month_num
, to_char( to_date('01-'||rownum||'-2010', 'DD-MM-YYYY'), 'MONTH') month_name
from dual
connect
by level < 13
)
select months.month_name
, count(empno) "number of hires"
from emp e
right outer join
months
on extract(month from e.hiredate)= months.month_num
having count(empno) = 0
group
by months.month_name
, months.month_num
order
by months.month_num
[/code]
This next elaborate query shows the number of hired employees in each month per department. Again, using the PARTITION BY clause, the outer join has been applied to every department, instead of just once for every row in the MONTHS set. The result of this query would help to get an answer two the question: for each department – give me the months in which no employee was hired.
[code language="sql"]
with months as
( select 0 + rownum month_num
, to_char( to_date('01-'||rownum||'-2010', 'DD-MM-YYYY'), 'MONTH') month_name
from dual
connect
by level < 13
)
select months.month_name
, deptno department
, count(empno) "number of hires"
from emp e
partition by (deptno)
right outer join
months
on extract(month from e.hiredate)= months.month_num
group
by months.month_name
, months.month_num
, deptno
order
by months.month_num
, deptno
[/code]
Who in each department does not have a colleague with less experience
One way of reading this section title is: finding a record (EMP) for which there is not a next record (EMP) [where hiredate is more recent]. When a query can be expressed in terms of a next record, you should immediately think Analytical Functions, and more specifically: LEAD (and LAG).
Using lead we can look for the next less experienced colleague for each employee:
[code language="sql"]
select ename
, lead(ename) over (partition by deptno order by hiredate asc) next_less_experienced_in_dept
, deptno
, hiredate
from emp
order
by deptno
, hiredate
[/code]
Then we filter on those rows for which next_less_experienced_in_dept is null to find the most junior employees in each department.
[code language="sql"]
select ename
, deptno
from ( select ename
, lead(ename) over (partition by deptno order by hiredate asc) next_less_experienced_in_dept
, deptno
from emp
)
where next_less_experienced_in_dept is null
[/code]
Now it is easy to check if these least experienced employees are also the least-earners:
[code language="sql"]
select ename
, deptno
from ( select ename
, lead(ename) over (partition by deptno order by sal desc) next_less_paid_in_dept
, deptno
from emp
)
where next_less_paid_in_dept is null
[/code]
It turns out that MILLER and JAMES make an appearance in both result sets, ADAMS is least experiences in 20 yet SMITH makes less.
An alternative approach to the ‘find least experienced employees’, and even more compact, is this query:
[code language="sql"]
select deptno
, max(ename) keep (dense_rank last order by hiredate asc)
from emp
group
by deptno
[/code]
This query selects for each department the ename for the record that is ranked last when ordered by hiredate. I have never quite gotten used to this syntax, but it can be quite useful.
The equivalent for salary should be obvious:
[code language="sql"]
select deptno
, max(ename) keep (dense_rank last order by sal desc)
from emp
group
by deptno
[/code]
No Colleagues
Our final question in this article: which employees do not have a colleague in the same job (either in their own department or in the entire company)
[code language="sql"]
select ename
, deptno
, job
from ( select ename
, count(*) over (partition by job, deptno) -1 num_of_colleagues
, deptno
, job
from emp
)
where num_of_colleagues = 0
[/code]
It turns out that it is lonely both at the top and at the bottom (PRESIDENT, MANAGER and CLERK).