Join the lines of two files based on a common column

04 July, 2023

Time to read: 3 mins

Last update: 04 July, 2023

Table of contents

SQL joins
Links

Consider the following files:

# file 1:
A    1
B    3
C    9
D    4

# file 2:
E    twelve
B    three
C    nine
A    one

Column 1 contains the key in both files, which is a capital letter.
Column 2 in file 1 contains a number
Column 2 in file 2 contains the same number in text format this time.

If we want a new file with:

the key in column 1,
the number in column 2 (from file 1),
and the number in text in column 3 (from file 2),

we can use the following join command:

$ join <(echo 'A    1
B    3
C    9
D    4'  | sort) <(echo 'E    twelve
B    three
C    nine
A    one'  | sort)

A 1 one
B 3 three
C 9 nine

Note that the inputs of join should be sorted, you’ll get an error by default. Also, the <( COMMAND ) syntax is called process substitution, and you can see it as a temporary file that’s not saved anywhere.

I would like to join on specific columns, though. When I typed join --help I noticed the following:

The default join field (i.e column) is the first, delimited by blanks.

You can specify on which column you want to join each of the two files with the -1 and -2 options. With -1 you specify the column number that contains the key from file 1, and, with -2, unsurprisingly, you specify the column number that contains the key from file 2.

For example, if the 2nd file had it’s columns in a different order, and the key was in the 2nd column:

# file 2:
twelve E
three B
nine C
one A

We could still join the lines of the 2 files based on the letters (key), but based on the second column for file 2 this time, with the following command:

$ join -2 2 <(echo 'A 1
B    3
C    9
D    4'  | sort) <(echo 'twelve E
three B
nine C
one A'  | sort -k 2)

A 1 one
B 3 three
C 9 nine

-2 2 means that the 2nd column from file 2 contains our key.

Note that sort -k 2 means “sort by the second column”.

SQL joins

You may wonder if you can join files with the join command, and get a similar behavior to SQL joins. In SQL, some common types of joins are the inner join, the left join, the right join, and the full join. It seems that you can perform all those joins with the join command!

For a TL;DR see the code blocks at the end of full join.

Also, if you need to refresh your SQL joins (I did), see the YouTube video SQL Joins Explained.

Inner join

By default, and in the previous examples, we got an inner join. Let’s review the 1st example again:

# file 1:
A    1
B    3
C    9
D    4

# file 2:
E    twelve
B    three
C    nine
A    one

# command
$ join <(echo 'A    1
B    3
C    9
D    4'  | sort) <(echo 'E    twelve
B    three
C    nine
A    one'  | sort)

# result
A 1 one
B 3 three
C 9 nine

The result is an inner join by default, because the row with key D from file 1 doesn’t have a matching row in file 2, and it’s not in the final result. Additionally, the row with key E from file 2 does not have a matching row in file 1, and it’s also omitted from the final result.

Left join

If we want to perform a left join, meaning that we want all the rows from file 1, and for those rows from file 1 that don’t have a matching row from file 2 (the one with key D), we will show null, we can use the -a FILE_NUMBER option.

The join --help says about the -a option:

  -a FILENUM
    also print unpairable lines from file FILENUM, where
    FILENUM is 1 or 2, corresponding to FILE1 or FILE2

So, we can use -a 1 to perform a left join (and include the unpairable lines from file 1 in the output):

$ join -a 1 <(echo 'A    1
B    3
C    9
D    4'  | sort) <(echo 'E    twelve
B    three
C    nine
A    one'  | sort)

# result
A 1 one
B 3 three
C 9 nine
D 4

The result can be improved though. I would like an output like the following:

A 1 one
B 3 three
C 9 nine
D 4 null

We can achieve the above with 2 extra join options (output from man join follows):

-e <EMPTY>, replace missing input fields with EMPTY.
-o <FORMAT>, obey FORMAT while constructing output line. FORMAT is one or more comma or blank separated specifications, each being FILENUM.FIELD

FILENUM can be 1 or 2, and FIELD is the column number (again 1 or 2, we have 2 columns in each file).

So, to improve the output of the left join, we can use the following command:

$ join -a 1 -e null -o '1.1,1.2,2.2' <(echo 'A    1
B    3
C    9
D    4'  | sort) <(echo 'E    twelve
B    three
C    nine
A    one'  | sort)

A 1 one
B 3 three
C 9 nine
D 4 null

That’s better!

Right join

And if we want to perform a right join, where we won’t have the row with key D, because it’s not present in file 2 at all, and the row with key E won’t have a number (from file 1) because it’s not present in file 1, we can use the following command:

$ join -a 2 -e null -o '2.1,1.2,2.2' <(echo 'A    1
B    3
C    9
D    4'  | sort) <(echo 'E    twelve
B    three
C    nine
A    one'  | sort)

A 1 one
B 3 three
C 9 nine
E null twelve

Note that the above is a right join this time (on file 2), so we want the 2.1 instead of the 1.1 as the key in the -o option.

Full join

In a full (outer) join, you want all the keys from both files, and wherever you don’t have a value, you print null.

This is the closest we can get with a custom output format:

$ join -a 1 -a 2 -e null -o '1.2,2.1,1.2,2.2' <(echo 'A    1
B    3
C    9
D    4'  | sort) <(echo 'E    twelve
B    three
C    nine
A    one'  | sort)

1 A 1 one
3 B 3 three
9 C 9 nine
4 null 4 null
null E null twelve

Not so pretty… But then I noticed the following:

If FORMAT is the keyword ‘auto’, then the first line of each file determines the number of fields output for each line.

I don’t fully get it right now, but if you use -o auto it does what we want (in this simple example…):

$ join -a 1 -a 2 -e null -o 'auto' <(echo 'A    1
B    3
C    9
D    4'  | sort) <(echo 'E    twelve
B    three
C    nine
A    one'  | sort)

A 1 one
B 3 three
C 9 nine
D 4 null
E null twelve

In fact, you can use -o auto for left join:

$ join -a 1 -e null -o auto <(echo 'A    1
B    3
C    9
D    4'  | sort) <(echo 'E    twelve
B    three
C    nine
A    one'  | sort)

A 1 one
B 3 three
C 9 nine
D 4 null

and for right join:

$ join -a 2 -e null -o auto <(echo 'A    1
B    3
C    9
D    4'  | sort) <(echo 'E    twelve
B    three
C    nine
A    one'  | sort)

A 1 one
B 3 three
C 9 nine
E null twelve

And, in order to have all the joins in one place, this is again the command for the inner join:

$ join <(echo 'A    1
B    3
C    9
D    4'  | sort) <(echo 'E    twelve
B    three
C    nine
A    one'  | sort)

A 1 one
B 3 three
C 9 nine

Join the lines of two files based on a common column

SQL joins

Inner join

Left join

Right join

Full join

Links

Other things to read

Popular

Previous/Next