Join the lines of two files based on a common column
Table of contents
Consider the following files:
# file 1:
A 1
B 3
C 9
D 4
# file 2:
E twelve
B three
C nine
A one
- Column 1 contains the key in both files, which is a capital letter.
- Column 2 in file 1 contains a number
- Column 2 in file 2 contains the same number in text format this time.
If we want a new file with:
- the key in column 1,
- the number in column 2 (from file 1),
- and the number in text in column 3 (from file 2),
we can use the following join command:
$ join <(echo 'A 1
B 3
C 9
D 4' | sort) <(echo 'E twelve
B three
C nine
A one' | sort)
A 1 one
B 3 three
C 9 nine
Note that the inputs of join should be sorted, you’ll get an error by default. Also, the <( COMMAND )
syntax is called process substitution, and you can see it as a temporary file that’s not saved anywhere.
I would like to join on specific columns, though. When I typed join --help
I noticed the following:
The default join field (i.e column) is the first, delimited by blanks.
You can specify on which column you want to join each of the two files with the -1
and -2
options. With -1
you specify the column number that contains the key from file 1, and, with -2
, unsurprisingly, you specify the column number that contains the key from file 2.
For example, if the 2nd file had it’s columns in a different order, and the key was in the 2nd column:
# file 2:
twelve E
three B
nine C
one A
We could still join the lines of the 2 files based on the letters (key), but based on the second column for file 2 this time, with the following command:
$ join -2 2 <(echo 'A 1
B 3
C 9
D 4' | sort) <(echo 'twelve E
three B
nine C
one A' | sort -k 2)
A 1 one
B 3 three
C 9 nine
-2 2
means that the 2nd column from file 2 contains our key.
Note that sort -k 2
means “sort by the second column”.
SQL joins
You may wonder if you can join files with the join
command, and get a similar behavior to SQL joins. In SQL, some common types of joins are the inner join, the left join, the right join, and the full join. It seems that you can perform all those joins with the join
command!
For a TL;DR see the code blocks at the end of full join.
Also, if you need to refresh your SQL joins (I did), see the YouTube video SQL Joins Explained.
Inner join
By default, and in the previous examples, we got an inner join. Let’s review the 1st example again:
# file 1:
A 1
B 3
C 9
D 4
# file 2:
E twelve
B three
C nine
A one
# command
$ join <(echo 'A 1
B 3
C 9
D 4' | sort) <(echo 'E twelve
B three
C nine
A one' | sort)
# result
A 1 one
B 3 three
C 9 nine
The result is an inner join by default, because the row with key D
from file 1 doesn’t have a matching row in file 2, and it’s not in the final result. Additionally, the row with key E
from file 2 does not have a matching row in file 1, and it’s also omitted from the final result.
Left join
If we want to perform a left join, meaning that we want all the rows from file 1, and for those rows from file 1 that don’t have a matching row from file 2 (the one with key D
), we will show null
, we can use the -a FILE_NUMBER
option.
The join --help
says about the -a
option:
-a FILENUM
also print unpairable lines from file FILENUM, where
FILENUM is 1 or 2, corresponding to FILE1 or FILE2
So, we can use -a 1
to perform a left join (and include the unpairable lines from file 1 in the output):
$ join -a 1 <(echo 'A 1
B 3
C 9
D 4' | sort) <(echo 'E twelve
B three
C nine
A one' | sort)
# result
A 1 one
B 3 three
C 9 nine
D 4
The result can be improved though. I would like an output like the following:
A 1 one
B 3 three
C 9 nine
D 4 null
We can achieve the above with 2 extra join
options (output from man join
follows):
-e <EMPTY>
, replace missing input fields with EMPTY.-o <FORMAT>
, obey FORMAT while constructing output line. FORMAT is one or more comma or blank separated specifications, each beingFILENUM.FIELD
FILENUM
can be 1 or 2, and FIELD
is the column number (again 1 or 2, we have 2 columns in each file).
So, to improve the output of the left join, we can use the following command:
$ join -a 1 -e null -o '1.1,1.2,2.2' <(echo 'A 1
B 3
C 9
D 4' | sort) <(echo 'E twelve
B three
C nine
A one' | sort)
A 1 one
B 3 three
C 9 nine
D 4 null
That’s better!
Right join
And if we want to perform a right join, where we won’t have the row with key D
, because it’s not present in file 2 at all, and the row with key E
won’t have a number (from file 1) because it’s not present in file 1, we can use the following command:
$ join -a 2 -e null -o '2.1,1.2,2.2' <(echo 'A 1
B 3
C 9
D 4' | sort) <(echo 'E twelve
B three
C nine
A one' | sort)
A 1 one
B 3 three
C 9 nine
E null twelve
Note that the above is a right join this time (on file 2), so we want the 2.1
instead of the 1.1
as the key in the -o
option.
Full join
In a full (outer) join, you want all the keys from both files, and wherever you don’t have a value, you print null.
This is the closest we can get with a custom output format:
$ join -a 1 -a 2 -e null -o '1.2,2.1,1.2,2.2' <(echo 'A 1
B 3
C 9
D 4' | sort) <(echo 'E twelve
B three
C nine
A one' | sort)
1 A 1 one
3 B 3 three
9 C 9 nine
4 null 4 null
null E null twelve
Not so pretty… But then I noticed the following:
If FORMAT is the keyword ‘auto’, then the first line of each file determines the number of fields output for each line.
I don’t fully get it right now, but if you use -o auto
it does what we want (in this simple example…):
$ join -a 1 -a 2 -e null -o 'auto' <(echo 'A 1
B 3
C 9
D 4' | sort) <(echo 'E twelve
B three
C nine
A one' | sort)
A 1 one
B 3 three
C 9 nine
D 4 null
E null twelve
In fact, you can use -o auto
for left join:
$ join -a 1 -e null -o auto <(echo 'A 1
B 3
C 9
D 4' | sort) <(echo 'E twelve
B three
C nine
A one' | sort)
A 1 one
B 3 three
C 9 nine
D 4 null
and for right join:
$ join -a 2 -e null -o auto <(echo 'A 1
B 3
C 9
D 4' | sort) <(echo 'E twelve
B three
C nine
A one' | sort)
A 1 one
B 3 three
C 9 nine
E null twelve
And, in order to have all the joins in one place, this is again the command for the inner join:
$ join <(echo 'A 1
B 3
C 9
D 4' | sort) <(echo 'E twelve
B three
C nine
A one' | sort)
A 1 one
B 3 three
C 9 nine
Links
- https://www.gnu.org/software/coreutils/join
- Find common elements in a given column from two files and output the column values from each file
- SQL Joins Explained - Socratica - Youtube Video
- Explain shell web application.
- Bash Guide for Beginners from The Linux Documentation Project
- Advanced Bash-Scripting Guide from The Linux Documentation Project
Other things to read
Popular
- Reveal animations on scroll with react-spring
- Gatsby background image example
- Extremely fast loading with Gatsby and self-hosted fonts