
What is Lexical Analysis?

Nov. 30, 2023

COMPILERS

First, let's start with the meaning behind the word. "Lexical" comes from the term "lexicon", which is simply the vocabulary of a language. For a programming language, that vocabulary includes things like keywords, identifiers, parentheses, and semicolons.

In the field of compilers, lexical analysis refers to the process by which source code is translated into a stream of lexical units, typically referred to as tokens.

The process of lexical analysis with source code as input and tokens as output.

Motivation

Working with tokenized data instead of raw text data offers several advantages, the main one being abstraction. By abstracting the source code into tokens, we disregard "junk" like whitespace that does not carry syntactic meaning (more on this later). This process is known as tokenization:

A diagram of tokenization where raw character data is transformed into a set of tokens.

During lexical analysis, we can verify that individual elements of the source code are well-formed. Elements such as identifiers, numeric literals, and string/character literals, just to name a few, can all be verified via lexical analysis. A bonus feature of lexical analysis is verifying that all bracket pairs ((, {, [) are balanced in the given program. This is done using a simple stack algorithm, sketched below.
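
Here's a minimal sketch of that stack-based check in Rust. The standalone function and its name are my own illustration, not from any particular compiler, and a real implementation would also ignore brackets inside string literals and comments:

// A minimal sketch: push every opener, and require each closer to
// match the most recent opener on the stack.
fn brackets_balanced(src: &str) -> bool {
    let mut stack = Vec::new();
    for c in src.chars() {
        match c {
            '(' | '{' | '[' => stack.push(c),
            ')' => if stack.pop() != Some('(') { return false; },
            '}' => if stack.pop() != Some('{') { return false; },
            ']' => if stack.pop() != Some('[') { return false; },
            _ => {} // every other character is irrelevant to balance
        }
    }
    stack.is_empty() // any leftover opener means an unclosed bracket
}

fn main() {
    assert!(brackets_balanced("let x: i32 = (a[0] + {1});"));
    assert!(!brackets_balanced("fn broken( {"));
}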

It is important to note that the goal of lexical analysis is not to understand the code; instead, it is just concerned with producing a stream of tokens representing the source code.

How Does it Work?

Enter the lexer. The lexer is the component of the compiler responsible for carrying out lexical analysis. It receives raw text data as input and outputs a token stream. This is done by scanning over the input character by character, spitting out tokens along the way.

Here's a quick example to illustrate the process. Our source code:

let x: i32 = 13;  

And its resulting token stream:

let > Keyword::Let  
x   > Identifier(x)  
:   > Delimiter::Colon  
i32 > Type::Int32  
=   > BinOp::Equals  
13  > Number(13)  
;   > Delimiter::Semicolon  

Nice - already, we have a much simpler, more abstract representation of the source code.
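
To make the process concrete, here's a sketch of a tiny lexer in Rust that produces the stream above. The token names mirror the example (flattened into a single enum for brevity); the overall structure is illustrative, not how any particular compiler does it:

// Token names mirror the example stream, flattened into one enum.
#[derive(Debug)]
enum Token {
    KeywordLet,          // Keyword::Let
    Identifier(String),  // Identifier(x)
    Colon,               // Delimiter::Colon
    TypeInt32,           // Type::Int32
    Equals,              // BinOp::Equals
    Number(i64),         // Number(13)
    Semicolon,           // Delimiter::Semicolon
}

fn lex(src: &str) -> Vec<Token> {
    let mut tokens = Vec::new();
    let mut chars = src.chars().peekable();
    while let Some(&c) = chars.peek() {
        match c {
            // Whitespace carries no meaning here, so we just skip it.
            c if c.is_whitespace() => { chars.next(); }
            ':' => { chars.next(); tokens.push(Token::Colon); }
            '=' => { chars.next(); tokens.push(Token::Equals); }
            ';' => { chars.next(); tokens.push(Token::Semicolon); }
            // Scan a run of digits into a numeric literal.
            c if c.is_ascii_digit() => {
                let mut num = String::new();
                while let Some(&d) = chars.peek() {
                    if d.is_ascii_digit() { num.push(d); chars.next(); } else { break; }
                }
                tokens.push(Token::Number(num.parse().unwrap()));
            }
            // Scan a word, then decide: keyword, type, or identifier?
            c if c.is_alphabetic() || c == '_' => {
                let mut word = String::new();
                while let Some(&d) = chars.peek() {
                    if d.is_alphanumeric() || d == '_' { word.push(d); chars.next(); } else { break; }
                }
                tokens.push(match word.as_str() {
                    "let" => Token::KeywordLet,
                    "i32" => Token::TypeInt32,
                    _ => Token::Identifier(word),
                });
            }
            other => panic!("unexpected character: {other}"),
        }
    }
    tokens
}

fn main() {
    println!("{:?}", lex("let x: i32 = 13;"));
}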

What About Whitespace?

In Python, for example, whitespace carries syntactic meaning. The Python interpreter uses whitespace such as spaces and tabs to structure code, indicating blocks of code using indentation. In this case, specific tokens for whitespace are necessary. Languages like Rust, on the other hand, rely on curly braces to delineate blocks of code, making whitespace tokens unnecessary.

So, which route should you take? Well, it depends on your language design. Do you want spaces to have meaning? If so, then add a token for whitespace. If not, then don't use whitespace tokens. The beauty of writing a compiler for your own language design is that it can be whatever you want it to be.
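
As a toy illustration of that choice, here's the same input lexed both ways. All names here are hypothetical, and real indentation-sensitive lexers track indentation levels rather than raw newlines:

#[derive(Debug)]
enum WsToken {
    Word(String),
    Newline, // only emitted when whitespace is syntactically meaningful
}

// `keep_newlines` toggles between the two designs discussed above.
fn lex_words(src: &str, keep_newlines: bool) -> Vec<WsToken> {
    let mut tokens = Vec::new();
    for line in src.lines() {
        for word in line.split_whitespace() {
            tokens.push(WsToken::Word(word.to_string()));
        }
        if keep_newlines {
            tokens.push(WsToken::Newline);
        }
    }
    tokens
}

fn main() {
    let src = "if x\n    y";
    println!("{:?}", lex_words(src, true));  // whitespace-sensitive design
    println!("{:?}", lex_words(src, false)); // brace-delimited design
}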

Challenges

In all of the lexer implementations I've written, I've always forgotten to account for one corner case: what should happen with an expression like 10--3? While it looks a bit funky, it's a valid expression, as we're subtracting a negative number from a positive number.

My solution to this is to never output tokens with negative numbers. Instead, the resulting token stream should be:

10 > Number(10)  
-  > BinOp::Sub  
-  > BinOp::Sub  
3  > Number(3)  

I leave it to the parser to figure out the meaning of this set of tokens. The parser will know the first - is binary subtraction because it just saw an expression term, 10. When it sees the second -, it knows it should be a unary negation because it has already seen a binary operator. At this point, it can just negate the number in the token (Number(3) -> Number(-3)), and disregard the second -.
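
Here's one way to express that logic as a small pass over the token stream, using a simplified token type. In a real compiler this check usually lives inside the expression parser; everything here is illustrative:

#[derive(Debug)]
enum Tok {
    Number(i64),
    Sub, // stands in for BinOp::Sub
}

// A `-` that does not follow an expression term (start of input, or
// right after another operator) must be unary negation: fold it into
// the number that follows and drop the `-` itself.
fn fold_unary_minus(tokens: &[Tok]) -> Vec<Tok> {
    let mut out = Vec::new();
    let mut prev_was_term = false; // did we just emit an expression term?
    let mut i = 0;
    while i < tokens.len() {
        match &tokens[i] {
            Tok::Sub if !prev_was_term => match tokens.get(i + 1) {
                Some(Tok::Number(n)) => {
                    out.push(Tok::Number(-n)); // Number(3) -> Number(-3)
                    prev_was_term = true;
                    i += 2;
                }
                _ => panic!("expected a number after unary `-`"),
            },
            Tok::Sub => {
                out.push(Tok::Sub);
                prev_was_term = false; // an operator, not a term
                i += 1;
            }
            Tok::Number(n) => {
                out.push(Tok::Number(*n));
                prev_was_term = true;
                i += 1;
            }
        }
    }
    out
}

fn main() {
    // 10--3 lexes to [Number(10), Sub, Sub, Number(3)]...
    let tokens = [Tok::Number(10), Tok::Sub, Tok::Sub, Tok::Number(3)];
    // ...and folds to [Number(10), Sub, Number(-3)].
    println!("{:?}", fold_unary_minus(&tokens));
}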

Remember, the goal of lexical analysis is not to verify the meaning of the code. The lexer has no idea what 10--3 means, and it doesn't need to! This "understanding" is done in further stages of the compiler such as syntactic and semantic analysis.

Other Applications for Tokenization

Tokenization is not limited to only compilers. It is crucial in Natural Language Processing (NLP), speech processing, data preprocessing, URL parsing, and more.