Inspect text that contains control characters in bash
My use case: see what the IFS shell variable contains.
Some context: IFS holds the field separators of the shell (bash calls it the internal field separator). The shell uses it to split the results of unquoted expansions into words, and the read built-in uses it to split a line into fields (see help read).
For example, if a line contains the text 1 2 3 and IFS contains the space character, read a b c gives you 3 fields: 1, 2, and 3. If IFS is set to only the tab character, though, read puts the whole line 1 2 3 into the first variable.
Note that awk does not consult IFS; it has its own FS variable for splitting a line into fields (columns).
It’s also worth noting that IFS can contain a list of characters, not only one character.
See Bourne shell reserved variables if you want to learn more about these variables.
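To make the splitting behavior concrete, here is a minimal bash sketch. The helper name split_with and the sample strings are only illustrative, and the here-string (<<<) and $'\t' syntax assume bash rather than a plain POSIX sh.

```shell
#!/usr/bin/env bash

# split_with is a hypothetical helper: split a line into up to three fields
# using the given IFS value, then print each field in brackets.
split_with() {
  local IFS="$1"   # limit the IFS change to this function
  local a b c
  read -r a b c <<< "$2"
  printf '[%s] [%s] [%s]\n' "$a" "$b" "$c"
}

split_with ' '   '1 2 3'   # [1] [2] [3]
split_with $'\t' '1 2 3'   # [1 2 3] [] []
split_with ':,'  '1:2,3'   # [1] [2] [3]
```

Declaring IFS as local keeps the change from leaking into the rest of the script, which is usually what you want when experimenting with it.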
So, let’s see what the IFS variable contains in my terminal:
$ echo "${IFS}"
# no visible output
I get no visible output, but if I pipe the result to the xxd command:
$ echo "${IFS}" | xxd
00000000: 2009 0a0a ...
# xxd output explanation:
# line address: | hex bytes in pairs of 2 by default | textual representation
I get 4 bytes. The extra 0a at the end is a newline that echo appends to its output.
We can use the -n option of echo to suppress that newline (type help echo for more options).
$ echo -n "${IFS}" | xxd
00000000: 2009 0a ..
It seems that my IFS variable consists of 3 characters:
- 0x20: the space
- 0x09: the (horizontal) tab
- 0x0a: the line feed, aka new line, \n, or LF
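If xxd is not at hand, or the hex codes are hard to read, two other views name the control characters directly. This is a small sketch that assumes bash (for the %q format of the printf built-in) and the POSIX od utility. It uses printf '%s' instead of echo -n because printf never appends a newline on its own.

```shell
# Print IFS with the shell's own quoting: control characters appear as escapes.
# With the default IFS, bash prints $' \t\n'.
printf '%q\n' "$IFS"

# Dump IFS one character at a time; od -c shows tab and newline as \t and \n.
printf '%s' "$IFS" | od -An -c
```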
The Wikipedia page List of Unicode characters will probably serve you well if you want to translate hex codes to Unicode characters.
Text to UTF-8 bytes
You can also use xxd to see the UTF-8 encoding of some Unicode text.
In the following example, I print my name (Mark) in English:
$ echo -n "Mark" | xxd
00000000: 4d61 726b Mark
Not too exciting: I gave it 4 characters and got back 4 bytes.
But if I type my name in Greek:
$ echo -n "Μάρκος" | xxd
00000000: ce9c ceac cf81 ceba cebf cf82 ............
I get the UTF-8 encoding, that is, how the text is stored as bytes. I gave it 6 characters and got back 12 bytes, because each of these Greek letters takes 2 bytes in UTF-8.
A “quick” way to verify this is with JavaScript’s encodeURIComponent function:
$ node
Welcome to Node.js vx.x.x.
Type ".help" for more information.
> encodeURIComponent("Μάρκος").split("%").join(" ").toLowerCase();
' ce 9c ce ac cf 81 ce ba ce bf cf 82'
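The same count can be checked without leaving the shell. This is a minimal sketch, assuming the script or terminal is UTF-8 encoded: wc -c counts bytes, while wc -m counts characters according to the current locale, so it only reports 6 when a UTF-8 locale is active.

```shell
name='Μάρκος'

# Bytes in the UTF-8 encoding of the name.
printf '%s' "$name" | wc -c    # 12

# Characters, as interpreted by the current locale (6 in a UTF-8 locale).
printf '%s' "$name" | wc -m
```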
Bash default encoding
If you want to see the default encoding in your shell, search for environment variables that start with LC (which stands for locale) or LANG:
$ printenv | grep -iE 'LC|LANG'
LANG=en_US.UTF-8
The printenv command above prints all environment variables with their values. The | (pipe) symbol sends the printenv output to the grep command.
The grep command searches for patterns in text. The options used are:
- -i: performs a case-insensitive search.
- -E: enables extended regular expressions for pattern matching.
The pattern 'LC|LANG' specifies the search criteria. It matches lines that contain either “LC” or “LANG”.
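Because that pattern is unanchored and case-insensitive, it can also match unrelated variables (a value such as welcome contains lc, for example). A slightly stricter sketch anchors the match at the start of the variable name. The helper name locale_vars is hypothetical, and the locale utility is assumed to be installed, as it is on most Linux and macOS systems.

```shell
# locale_vars is a hypothetical helper: keep only lines whose variable name
# starts with LC_ or LANG (this also covers LANGUAGE).
locale_vars() {
  grep -E '^(LC_|LANG)'
}

printenv | locale_vars || echo "no locale variables are set"

# Ask the locale utility for the active character encoding, if it is available.
if command -v locale >/dev/null 2>&1; then
  locale charmap
fi
```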