How Your Code Gets Read: A Beginner's Guide to Scanning
Before your program does anything — before it adds numbers, calls functions, or prints to the screen — something has to read it. Not understand it. Just read it. That job belongs to the scanner, and it's simpler than you might think.
I've been working through the book Crafting Interpreters by Robert Nystrom, and Chapter 4 on scanning was the first moment the magic started to feel like engineering. Let me walk you through it — and the full C# implementation I put together with the help of Claude along the way.
📂 LoxScanner.cs — full source on GitHub
The Problem: Computers Don't See Words
When you write this:
var x = 42;
Your computer doesn't see a variable declaration. It sees a stream of characters:
v, a, r, ' ', x, ' ', =, ' ', 4, 2, ;
That's it. Just characters in a row. The machine has no idea that var is a keyword, that x is a name you made up, or that 42 is a number. Before any of that meaning can be extracted, something needs to group those characters into chunks and label each chunk.
That's scanning.
Before we get into the code, here's an interactive visualiser that I had Claude build for me to go alongside this post. Hit Play and watch the cursor move through the source, the _start and _current indexes update in real time, and tokens pop out one by one. You can slow it right down or step through it manually. I'd recommend doing that for the != operator and // comment examples once you've read a bit further; it makes those sections click immediately.
Lox Scanner — Interactive Visualiser
Watch the scanner consume source code character by character and emit tokens. Use Step → to go one step at a time, or ▶ Play to run automatically.
Tokens: The Vocabulary of Your Language
The output of scanning is a list of tokens. A token is just a labelled chunk of source code. Think of it like a human reading a sentence and mentally tagging each word:
var → [keyword]
x → [name/identifier]
= → [equals sign]
42 → [number]
; → [semicolon]
In C#, a token is a simple class with four fields:
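public class Token
{
    public readonly TokenType Type;   // What kind of thing is this?
    public readonly string Lexeme;    // The raw text: "var", "42", "!="
    public readonly object? Literal;  // Actual value for strings/numbers, null otherwise
    public readonly int Line;         // Line number (for error messages)

    // Readonly fields can only be assigned here; the argument order matches
    // the new Token(...) call used later in ScanTokens()
    public Token(TokenType type, string lexeme, object? literal, int line)
    {
        Type = type; Lexeme = lexeme; Literal = literal; Line = line;
    }
}
Let's break down what each field actually means with a concrete example. Given this source code:
var score = 99;
The scanner produces these tokens: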
| Token | Type | Lexeme | Literal |
|---|---|---|---|
| var | VAR | "var" | null |
| score | IDENTIFIER | "score" | null |
| = | EQUAL | "=" | null |
| 99 | NUMBER | "99" | 99.0 |
| ; | SEMICOLON | ";" | null |
Notice that 99 has two representations: the lexeme is the raw string "99" (just text), while the literal is the actual number 99.0 (a double). That matters because later, when the interpreter does maths, it needs the real number — not a string it has to parse again.
Most tokens don't have a literal at all. A semicolon is just a semicolon. null is fine.
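To make the lexeme/literal split concrete, here's a tiny sketch. The variable names are mine, and passing CultureInfo.InvariantCulture is my own assumption rather than something taken from the linked source; it just keeps the parse independent of the machine's locale.

```csharp
using System;
using System.Globalization;

string lexeme = "99";  // raw text sliced out of the source
// Convert once, at scan time, so later stages never re-parse the text.
double literal = double.Parse(lexeme, CultureInfo.InvariantCulture);

Console.WriteLine(lexeme + 1);   // string concatenation: 991
Console.WriteLine(literal + 1);  // numeric addition: 100
```

Same two characters in the source, but only the literal behaves like a number.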
Labelling Every Possible Token: The TokenType Enum
Before the scanner can label anything, it needs a complete list of all valid labels. That's the TokenType enum:
public enum TokenType
{
    // Single-character tokens: one char is enough
    LEFT_PAREN, RIGHT_PAREN, // ( )
    LEFT_BRACE, RIGHT_BRACE, // { }
    COMMA, DOT, MINUS, PLUS,
    SEMICOLON, SLASH, STAR,
    // One or two character tokens: need to peek at the next char
    BANG, BANG_EQUAL,        // ! !=
    EQUAL, EQUAL_EQUAL,      // = ==
    GREATER, GREATER_EQUAL,  // > >=
    LESS, LESS_EQUAL,        // < <=
    // Literals: carry an actual value
    IDENTIFIER,              // user-defined names: x, score, myFunction
    STRING,                  // "hello world"
    NUMBER,                  // 42 or 3.14
    // Keywords: words the language has reserved
    AND, CLASS, ELSE, FALSE, FUN, FOR, IF, NIL, OR,
    PRINT, RETURN, SUPER, THIS, TRUE, VAR, WHILE,
    EOF                      // signals the end of the file
}
Think of this enum as the dictionary of everything your language knows how to recognise. If a character can't start any of these tokens, it's an error. A word that isn't a reserved keyword is not an error, though: it's simply an IDENTIFIER.
The Scanner Itself: A Cursor Walking Through Your Source
The scanner holds the source code as a string and uses two integer indexes to track its position:
public class LoxScanner
{
    private readonly string _source;  // The full source code
    private int _start = 0;           // Where the current token began
    private int _current = 0;         // Where the cursor is right now
    private int _line = 1;            // Which line we're on
}
Here's what those indexes mean visually. Say we're midway through scanning var x = 42; and we've just started on the number:
 v  a  r     x     =     4  2  ;
 0  1  2  3  4  5  6  7  8  9  10
                         ^  ^
                    _start  _current
_start marks where the current token begins. _current is the next character we haven't consumed yet. The gap between them is the lexeme we're building. When we call AddToken(), we slice _source[_start.._current] to get the raw text.
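Here's that slice as a standalone sketch, using plain local variables in place of the scanner's fields (the names start and current are just for illustration). It takes the moment after the cursor has moved past the whole number, and shows that the range syntax and the Substring call used later in this post pick out the same text.

```csharp
using System;

string source = "var x = 42;";
int start = 8;     // index of '4', where the token began
int current = 10;  // one past '2', where the cursor has stopped

string viaRange = source[start..current];                        // C# 8 range syntax
string viaSubstring = source.Substring(start, current - start);  // start index + length

Console.WriteLine(viaRange);      // 42
Console.WriteLine(viaSubstring);  // 42
```

The end of the range is exclusive, which is why _current can sit one past the last character of the lexeme.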
The Main Loop: ScanTokens()
The public entry point loops through the entire source, resetting _start at the beginning of each token, then calling ScanToken() to figure out what it is:
public List<Token> ScanTokens()
{
    while (!IsAtEnd())
    {
        _start = _current;  // ← snap _start to wherever the cursor is now
        ScanToken();        // ← read one token from here
    }
    _tokens.Add(new Token(TokenType.EOF, "", null, _line));
    return _tokens;
}
Each iteration of the loop produces at most one token (whitespace, comments, and unrecognised characters produce none). After ScanToken() returns, _current has moved past whatever characters it consumed. Then _start is reset to _current and the next token begins.
The final EOF token is added manually — it gives the parser a clean signal that the source has run out, rather than having to check for null.
ScanToken(): The Heart of the Scanner
ScanToken() reads one character with Advance() and then uses a switch to decide what to do:
private void ScanToken()
{
    char c = Advance(); // consume one character
    switch (c)
    {
        case '(': AddToken(TokenType.LEFT_PAREN); break;
        case ')': AddToken(TokenType.RIGHT_PAREN); break;
        case '+': AddToken(TokenType.PLUS); break;
        case ';': AddToken(TokenType.SEMICOLON); break;
        // ... more cases
    }
}
For simple single-character tokens this is trivial: one character consumed, one token emitted, done.
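One thing hiding behind "// ... more cases" is whitespace. The post doesn't show those cases, so treat the following as a sketch of the approach the book takes rather than a copy of the linked source: spaces, tabs, and carriage returns are consumed and ignored, and a newline bumps the line counter. ScanPunctuation is a hypothetical standalone helper, and it returns (lexeme, line) pairs instead of real Token objects to stay short.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper: scans a handful of single-character tokens, skips whitespace.
static List<(string Lexeme, int Line)> ScanPunctuation(string source)
{
    var found = new List<(string Lexeme, int Line)>();
    int lineNo = 1;
    foreach (char c in source)
    {
        switch (c)
        {
            case ' ':
            case '\r':
            case '\t':
                break;       // ignore: no token, no error
            case '\n':
                lineNo++;    // still no token, but keep the line count right
                break;
            case '(': case ')': case '{': case '}': case ';':
                found.Add((c.ToString(), lineNo));
                break;
        }
    }
    return found;
}

foreach (var (text, where) in ScanPunctuation("( )\n{\n\t;\n}"))
    Console.WriteLine($"line {where}: {text}");
```

This prints five tokens, with the closing brace reported on line 4, even though the input contains nine characters of whitespace.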
Advance(), Peek(), and Match(): The Three Core Helpers
These three methods are the engine behind everything the scanner does. It's worth understanding them cold.
Advance() — consume a character
private char Advance() => _source[_current++];
Returns the character at _current, then increments _current. The ++ is postfix: the expression yields the old value of _current, and the increment happens afterwards. Every call to Advance() moves the cursor forward by one. You've "consumed" that character; it's now part of the current token.
Peek() — look without consuming
private char Peek() => IsAtEnd() ? '\0' : _source[_current];
Returns the character at _current but does not move the cursor. This is called lookahead: you're scouting the next character before committing to it. If we're at the end of the file, it returns '\0' (a null character) as a safe sentinel.
Match() — conditional consume
private bool Match(char expected)
{
    if (IsAtEnd()) return false;
    if (_source[_current] != expected) return false;
    _current++; // only move forward if it matched
    return true;
}
This is Peek() with commitment. If the next character is what we expect, consume it and return true. If not, leave it alone and return false. It handles the two-character operators, and it also spots the second / of a // comment.
Handling Two-Character Operators
This is where lookahead earns its keep. When the scanner sees !, it can't know yet if this is ! (not) or != (not-equal) until it looks at the next character.
case '!': AddToken(Match('=') ? TokenType.BANG_EQUAL : TokenType.BANG); break;
case '=': AddToken(Match('=') ? TokenType.EQUAL_EQUAL : TokenType.EQUAL); break;
case '<': AddToken(Match('=') ? TokenType.LESS_EQUAL : TokenType.LESS); break;
case '>': AddToken(Match('=') ? TokenType.GREATER_EQUAL : TokenType.GREATER); break;
Let's trace != character by character:
Step 1: Advance() consumes '!'
        → we're in case '!'
        → call Match('=')
Step 2: Match('=') peeks at the next char, and it's '='
        → consume it, _current moves forward
        → returns true
Step 3: true ? BANG_EQUAL : BANG → BANG_EQUAL
        → AddToken(BANG_EQUAL)
Now trace just ! followed by a space:
Step 1: Advance() consumes '!'
        → we're in case '!'
        → call Match('=')
Step 2: Match('=') peeks at the next char, and it's ' '
        → does NOT match, cursor stays put
        → returns false
Step 3: false ? BANG_EQUAL : BANG → BANG
        → AddToken(BANG)
        → the space is left for the next iteration
This principle, always matching the longest possible token, is called maximal munch. != wins over ! whenever possible.
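Maximal munch gets more interesting when operators run together. Here's a small sketch that only understands ! and =, so you can see how !== and === split. ScanOperators is a hypothetical helper of mine, and token types are plain strings instead of the TokenType enum to keep it short.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper: scans only '!' and '=' and returns token type names.
static List<string> ScanOperators(string source)
{
    var types = new List<string>();
    int current = 0;

    // Same idea as Match(): consume the next char only if it is the expected one.
    bool Match(char expected)
    {
        if (current >= source.Length || source[current] != expected) return false;
        current++;
        return true;
    }

    while (current < source.Length)
    {
        char c = source[current++];
        if (c == '!') types.Add(Match('=') ? "BANG_EQUAL" : "BANG");
        else if (c == '=') types.Add(Match('=') ? "EQUAL_EQUAL" : "EQUAL");
        // anything else (such as a space) is skipped in this sketch
    }
    return types;
}

Console.WriteLine(string.Join(" ", ScanOperators("!==")));  // BANG_EQUAL EQUAL
Console.WriteLine(string.Join(" ", ScanOperators("===")));  // EQUAL_EQUAL EQUAL
Console.WriteLine(string.Join(" ", ScanOperators("! =")));  // BANG EQUAL
```

The scanner never backtracks: it grabs the longest token it can at the current position and lets whatever is left over start the next one.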
Comments and the Slash Problem
The / character is ambiguous too. It might be division, or it might be the start of a // comment:
case '/':
    if (Match('/'))
    {
        // It's a comment: consume everything to the end of the line
        while (Peek() != '\n' && !IsAtEnd())
            Advance();
        // No AddToken() call: comments produce no token
    }
    else
    {
        AddToken(TokenType.SLASH);
    }
    break;
If / is followed by another /, we eat the rest of the line using Peek() in a loop, but we never emit a token. Comments are silently discarded. The parser never sees them.
If / is followed by anything else, it's a division operator and we emit SLASH as normal.
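Here's the slash logic pulled out into a standalone sketch, so you can check that a comment on the last line (with no trailing newline) doesn't run off the end of the source. ScanSlashes is a hypothetical helper that ignores every character except /.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper: emits "SLASH" for division and drops // comments.
static List<string> ScanSlashes(string source)
{
    var types = new List<string>();
    int current = 0;
    while (current < source.Length)
    {
        char c = source[current++];
        if (c != '/') continue;  // this sketch only cares about slashes
        if (current < source.Length && source[current] == '/')
        {
            // Comment: skip to the newline, or to the end of the source
            // when the last line has no trailing newline.
            while (current < source.Length && source[current] != '\n') current++;
        }
        else
        {
            types.Add("SLASH");
        }
    }
    return types;
}

Console.WriteLine(ScanSlashes("a / b // not / a / token").Count);  // 1
Console.WriteLine(ScanSlashes("// first\nx / y / z").Count);       // 2
```

The slashes inside the comment never become tokens, and scanning resumes normally on the line after it.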
Scanning Strings
When the scanner hits a ", it knows a string has started. It keeps consuming characters until it finds the matching closing ":
private void ScanString()
{
    while (Peek() != '"' && !IsAtEnd())
    {
        if (Peek() == '\n') _line++; // strings can span multiple lines
        Advance();
    }
    if (IsAtEnd())
    {
        _errors.Add($"[Line {_line}] Unterminated string.");
        return;
    }
    Advance(); // consume the closing "
    // Trim the surrounding quotes to get the value
    string value = _source.Substring(_start + 1, (_current - 1) - (_start + 1));
    AddToken(TokenType.STRING, value);
}
Let's trace "hello" step by step:
Source:  " h e l l o "
Index:   0 1 2 3 4 5 6
ScanToken() calls Advance() → consumes '"' at index 0
    _start = 0, _current = 1
ScanString() begins:
    Peek() = 'h' → not '"', not end → Advance() (_current = 2)
    Peek() = 'e' → not '"', not end → Advance() (_current = 3)
    Peek() = 'l' → not '"', not end → Advance() (_current = 4)
    Peek() = 'l' → not '"', not end → Advance() (_current = 5)
    Peek() = 'o' → not '"', not end → Advance() (_current = 6)
    Peek() = '"' → stop the loop
    Advance() → consumes the closing '"' (_current = 7)
    Substring(_start + 1, ...) = Substring(1, 5) = "hello"
    AddToken(STRING, "hello")
The _start + 1 and _current - 1 in the Substring call are just stripping the opening and closing quote marks. The literal value stored is hello, not "hello".
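The two paths the trace doesn't cover are a string that spans lines and a string that never closes. Here's a sketch of both, written as a standalone Try-style helper instead of a method on the scanner. TryReadString and its out parameters are names I've made up for illustration, and it assumes the input begins with the opening quote.

```csharp
using System;

// Hypothetical helper: source must start with '"'. Returns false when the
// closing quote is missing; newlines counts line breaks inside the string.
static bool TryReadString(string source, out string value, out int newlines)
{
    value = "";
    newlines = 0;
    int start = 0, current = 1;  // the opening quote is already consumed
    while (current < source.Length && source[current] != '"')
    {
        if (source[current] == '\n') newlines++;
        current++;
    }
    if (current >= source.Length) return false;  // ran off the end: unterminated
    current++;                                   // consume the closing quote
    value = source.Substring(start + 1, (current - 1) - (start + 1));
    return true;
}

bool ok = TryReadString("\"hello\" + x", out string text, out int breaks);
Console.WriteLine($"{ok} {text} {breaks}");  // True hello 0

bool closed = TryReadString("\"oops", out _, out _);
Console.WriteLine(closed);                   // False
```

In the real scanner, the false branch is where the "Unterminated string." error gets recorded.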
Scanning Numbers
Numbers need two levels of lookahead — one to check for a decimal point, and a second to make sure the character after the dot is actually a digit:
private void ScanNumber()
{
    // Consume all the integer digits
    while (char.IsDigit(Peek())) Advance();
    // Is there a decimal part?
    if (Peek() == '.' && char.IsDigit(PeekNext()))
    {
        Advance(); // consume the '.'
        while (char.IsDigit(Peek())) Advance();
    }
    string numText = _source.Substring(_start, _current - _start);
    // Parse with the invariant culture (needs using System.Globalization;) so that
    // "3.14" means the same thing on machines whose locale uses a decimal comma
    AddToken(TokenType.NUMBER, double.Parse(numText, CultureInfo.InvariantCulture));
}
Why do we need PeekNext() for the dot check? Consider this code:
42.toString()
When the scanner finishes reading 42 and sees ., it peeks at the character after the dot. That's t, not a digit. So the dot is not consumed as part of the number. It'll be picked up on the next iteration as a DOT token. Without that second peek, the scanner would swallow 42. as the number and leave toString without its dot.
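PeekNext() isn't shown anywhere in this post, so here's a sketch of what it presumably looks like: the same as Peek(), but one character further along, with the same '\0' sentinel when that would run past the end. ReadNumberLexeme is a hypothetical standalone helper. I've also swapped in an ASCII-only digit check, because char.IsDigit accepts digits from other scripts as well.

```csharp
using System;

// Hypothetical helper: returns the number lexeme at the start of source.
static string ReadNumberLexeme(string source)
{
    int current = 0;

    char Peek() => current < source.Length ? source[current] : '\0';
    // Assumed shape of PeekNext(): like Peek(), but one character further on.
    char PeekNext() => current + 1 < source.Length ? source[current + 1] : '\0';
    // ASCII-only digit test; char.IsDigit also returns true for other Unicode digits.
    bool IsDigit(char c) => c >= '0' && c <= '9';

    while (IsDigit(Peek())) current++;
    if (Peek() == '.' && IsDigit(PeekNext()))
    {
        current++;  // consume the '.'
        while (IsDigit(Peek())) current++;
    }
    return source.Substring(0, current);
}

Console.WriteLine(ReadNumberLexeme("3.14"));           // 3.14
Console.WriteLine(ReadNumberLexeme("42.toString()"));  // 42
Console.WriteLine(ReadNumberLexeme("7."));             // 7
```

The last case is the one that needs the bounds check: there is no character after the dot, so PeekNext() returns the sentinel and the dot is left alone.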
Tracing 3.14:
ScanNumber() called after '3' was consumed by Advance()
    _start = 0, _current = 1
while IsDigit(Peek()): Peek() = '.' → not a digit → stop
Peek() == '.' → true
PeekNext() == '1' → digit → true, enter the if block
    Advance() consumes '.' (_current = 2)
    while IsDigit(Peek()):
        Peek() = '1' → Advance() (_current = 3)
        Peek() = '4' → Advance() (_current = 4)
        Peek() = '\0' (end of source) → stop
Substring(0, 4) = "3.14"
double.Parse("3.14") = 3.14
AddToken(NUMBER, 3.14)
Scanning Identifiers and Keywords
Words — whether user-defined names or reserved keywords — are scanned the same way. Read everything that looks like a word, then check a dictionary:
private void ScanIdentifier()
{
    while (char.IsLetterOrDigit(Peek()) || Peek() == '_')
        Advance();
    string word = _source.Substring(_start, _current - _start);
    // Look it up: is it a keyword or a user-defined name?
    TokenType type = Keywords.GetValueOrDefault(word, TokenType.IDENTIFIER);
    AddToken(type);
}
The Keywords dictionary maps strings to token types:
private static readonly Dictionary<string, TokenType> Keywords = new()
{
    { "var", TokenType.VAR },
    { "if", TokenType.IF },
    { "while", TokenType.WHILE },
    { "true", TokenType.TRUE },
    // ... etc
};
Let's trace while vs whileLoop:
// "while" → keyword
word = "while"
Keywords["while"] = TokenType.WHILE
AddToken(WHILE) ← keyword
// "whileLoop" → identifier
word = "whileLoop"
Keywords["whileLoop"] = not found → default = IDENTIFIER
AddToken(IDENTIFIER) ← user-defined name
This is clean and simple. One scanning rule covers all words. The distinction between keywords and names happens at the lookup step, not during reading.
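One consequence of doing the lookup on the whole word is that it's an exact, case-sensitive match. Here's a quick sketch to convince yourself, with token types as plain strings instead of the TokenType enum and a hypothetical Classify helper standing in for the lookup line in ScanIdentifier().

```csharp
using System;
using System.Collections.Generic;

// Token types are plain strings here to keep the sketch short.
var keywords = new Dictionary<string, string>
{
    { "var", "VAR" },
    { "while", "WHILE" },
};

// Hypothetical helper: the same GetValueOrDefault lookup, on its own.
static string Classify(Dictionary<string, string> table, string word) =>
    table.GetValueOrDefault(word, "IDENTIFIER");

Console.WriteLine(Classify(keywords, "while"));      // WHILE
Console.WriteLine(Classify(keywords, "whileLoop"));  // IDENTIFIER
Console.WriteLine(Classify(keywords, "While"));      // IDENTIFIER (the lookup is case-sensitive)
```

Because the scanner reads the whole word before looking anything up, a keyword that is only a prefix of a longer name never matches by accident. That's maximal munch again.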
Error Handling: Keep Calm and Carry On
When the scanner hits a character it doesn't recognise, it records the error and continues:
default:
    if (char.IsDigit(c)) ScanNumber();
    else if (char.IsLetter(c) || c == '_') ScanIdentifier(); // names may start with '_', which ScanIdentifier() already accepts
    else
        _errors.Add($"[Line {_line}] Unexpected character: '{c}'");
    break;
Given var x = @42#;, the scanner:
- Scans var → VAR
- Scans x → IDENTIFIER
- Scans = → EQUAL
- Hits @ → records error, keeps going
- Scans 42 → NUMBER
- Hits # → records error, keeps going
- Scans ; → SEMICOLON
Output:
VAR 'var' ->
IDENTIFIER 'x' ->
EQUAL '=' ->
NUMBER '42' -> 42
SEMICOLON ';' ->
EOF '' ->
--- Scan Errors ---
[Line 1] Unexpected character: '@'
[Line 1] Unexpected character: '#'
You get all the errors at once, not just the first one. That's a small but important quality-of-life decision.
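The record-and-continue idea fits in a few lines on its own. This sketch doesn't build tokens at all; FindUnexpected is a hypothetical helper that only collects the complaints, using the same message format as above. The set of accepted punctuation is my own reading of the TokenType enum, plus the double quote.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper: reports every character this sketch does not
// recognise, instead of stopping at the first one.
static List<string> FindUnexpected(string source)
{
    var errors = new List<string>();
    int lineNo = 1;
    foreach (char c in source)
    {
        if (c == '\n') { lineNo++; continue; }
        if (char.IsWhiteSpace(c) || char.IsLetterOrDigit(c) || c == '_') continue;
        if ("(){},.-+;/*!=<>\"".IndexOf(c) >= 0) continue;  // punctuation Lox uses
        errors.Add($"[Line {lineNo}] Unexpected character: '{c}'");  // record, keep going
    }
    return errors;
}

var scanErrors = FindUnexpected("var x = @42#;\nprint x $ 1;");
foreach (string message in scanErrors) Console.WriteLine(message);
```

This reports @ and # on line 1 and $ on line 2 in a single pass. A caller could then check whether the list is empty before handing the tokens to the parser.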
Putting It All Together: A Full Trace
Let's trace the scanner through a complete real example:
fun greet(name) {
    print name;
}
The scanner walks through this and produces:
FUN 'fun' ->
IDENTIFIER 'greet' ->
LEFT_PAREN '(' ->
IDENTIFIER 'name' ->
RIGHT_PAREN ')' ->
LEFT_BRACE '{' ->
PRINT 'print' ->
IDENTIFIER 'name' ->
SEMICOLON ';' ->
RIGHT_BRACE '}' ->
EOF '' ->
Notice a few things:
- fun is recognised as a keyword (FUN), not an identifier
- greet and name are identifiers: user-defined names
- print is also a keyword (PRINT), even though it looks like a function call
- Whitespace and newlines produced zero tokens; they're completely invisible
- There are 11 tokens total from what felt like a few lines of code
The Full Code
The complete implementation is linked at the top of this post. I used Claude to help translate the original Java code from the book into C#, which was a great way to understand each method properly — having to review and understand AI-generated code forces you to actually read it!
Why This Matters
Scanning might seem like a boring first step, but it's the foundation everything else rests on. A well-designed scanner:
- Gives the rest of the interpreter clean, structured input instead of raw text
- Catches character-level mistakes early with useful line numbers
- Handles the messy edge cases (two-character operators, decimal numbers, multi-line strings) so the parser never has to think about them
It's also a surprisingly satisfying thing to build yourself. There's a real pleasure in watching a wall of text turn into a neat list of labelled tokens, and in realising that this is the first thing your language does every single time someone runs their code.
This post is based on Chapter 4 of Crafting Interpreters by Robert Nystrom, one of the best programming books I've read. Highly recommended if you've ever wanted to understand how languages actually work.