By MC — Jun 11, 2026

How Your Code Gets Read: A Beginner's Guide to Scanning

You've written code your whole life. But have you ever wondered what happens in the first millisecond after you hit run?

Before your program does anything — before it adds numbers, calls functions, or prints to the screen — something has to read it. Not understand it. Just read it. That job belongs to the scanner, and it's simpler than you might think.

I've been working through the book Crafting Interpreters by Robert Nystrom, and Chapter 4 on scanning was the first moment the magic started to feel like engineering. Let me walk you through it — and the full C# implementation I put together with the help of Claude along the way.

📂 LoxScanner.cs — full source on GitHub

The Problem: Computers Don't See Words

When you write this:

var x = 42;

Your computer doesn't see a variable declaration. It sees a stream of characters:

v, a, r, ' ', x, ' ', =, ' ', 4, 2, ;

That's it. Just characters in a row. The machine has no idea that var is a keyword, that x is a name you made up, or that 42 is a number. Before any of that meaning can be extracted, something needs to group those characters into chunks — and label each chunk.

That's scanning.

Before we get into the code, here's an interactive visualiser that I had Claude build for me to go alongside this post. Hit Play and watch the cursor move through the source, the _start and _current indexes update in real time, and tokens pop out one by one. You can slow it right down or step through it manually — I'd recommend doing that for the != operator and // comment examples once you've read a bit further, it makes those sections click immediately.

[Interactive visualiser: Lox Scanner. It shows the source code with the cursor (_current) and the current token highlighted, the live _start, _current and _line values, the scanner's current "thought", and the token stream, colour-coded as keyword, identifier, number, string, operator/punctuation or error.]

Tokens: The Vocabulary of Your Language

The output of scanning is a list of tokens. A token is just a labelled chunk of source code. Think of it like a human reading a sentence and mentally tagging each word:

var   →  [keyword]
x     →  [name/identifier]
=     →  [equals sign]
42    →  [number]
;     →  [semicolon]

In C#, a token is a simple class with four fields:

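public class Token
{
    public readonly TokenType Type;     // What kind of thing is this?
    public readonly string Lexeme;      // The raw text: "var", "42", "!="
    public readonly object? Literal;    // Actual value for strings/numbers, null otherwise
    public readonly int Line;           // Line number (for error messages)

    // Four-argument constructor, matching the new Token(...) call in ScanTokens()
    public Token(TokenType type, string lexeme, object? literal, int line)
        => (Type, Lexeme, Literal, Line) = (type, lexeme, literal, line);
}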

Let's break down what each field actually means with a concrete example. Given this source code:

var score = 99;

The scanner produces these tokens:

| Token | Type | Lexeme | Literal |
| --- | --- | --- | --- |
| `var` | `VAR` | `"var"` | `null` |
| `score` | `IDENTIFIER` | `"score"` | `null` |
| `=` | `EQUAL` | `"="` | `null` |
| `99` | `NUMBER` | `"99"` | `99.0` |
| `;` | `SEMICOLON` | `";"` | `null` |

Notice that 99 has two representations: the lexeme is the raw string "99" (just text), while the literal is the actual number 99.0 (a double). That matters because later, when the interpreter does maths, it needs the real number — not a string it has to parse again.

Most tokens don't have a literal at all. A semicolon is just a semicolon. null is fine.
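To make the lexeme/literal split concrete, here's a small self-contained sketch that builds the five tokens from the table by hand. The TokenDemo class, the cut-down TokenType enum and the ToString() format are assumptions made for illustration, not necessarily what LoxScanner.cs does.

```csharp
#nullable enable
using System.Collections.Generic;

// Cut-down enum: only the types this example needs.
public enum TokenType { VAR, IDENTIFIER, EQUAL, NUMBER, SEMICOLON }

public sealed class Token
{
    public readonly TokenType Type;
    public readonly string Lexeme;
    public readonly object? Literal;
    public readonly int Line;

    public Token(TokenType type, string lexeme, object? literal, int line)
        => (Type, Lexeme, Literal, Line) = (type, lexeme, literal, line);

    // Assumed display format: type, quoted lexeme, then the literal (empty when null).
    public override string ToString() => $"{Type,-15} '{Lexeme}' -> {Literal}";
}

public static class TokenDemo
{
    // The tokens for: var score = 99;
    public static List<Token> ScoreTokens() => new()
    {
        new Token(TokenType.VAR,        "var",   null, 1),
        new Token(TokenType.IDENTIFIER, "score", null, 1),
        new Token(TokenType.EQUAL,      "=",     null, 1),
        new Token(TokenType.NUMBER,     "99",    99.0, 1),  // lexeme is text, literal is a double
        new Token(TokenType.SEMICOLON,  ";",     null, 1),
    };
}
```

Looping over TokenDemo.ScoreTokens() and printing each token shows the split: the NUMBER token's Lexeme is the string "99", while its Literal is a double you can do arithmetic on straight away.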

Labelling Every Possible Token: The TokenType Enum

Before the scanner can label anything, it needs a complete list of all valid labels. That's the TokenType enum:

public enum TokenType
{
    // Single-character tokens — unambiguous, one char is enough
    LEFT_PAREN, RIGHT_PAREN,   // ( )
    LEFT_BRACE, RIGHT_BRACE,   // { }
    COMMA, DOT, MINUS, PLUS,
    SEMICOLON, SLASH, STAR,

    // One or two character tokens — need to peek at the next char
    BANG, BANG_EQUAL,          // !  !=
    EQUAL, EQUAL_EQUAL,        // =  ==
    GREATER, GREATER_EQUAL,    // >  >=
    LESS, LESS_EQUAL,          // <  <=

    // Literals — carry an actual value
    IDENTIFIER,   // user-defined names: x, score, myFunction
    STRING,       // "hello world"
    NUMBER,       // 42 or 3.14

    // Keywords — words the language has reserved
    AND, CLASS, ELSE, FALSE, FUN, FOR, IF, NIL, OR,
    PRINT, RETURN, SUPER, THIS, TRUE, VAR, WHILE,

    EOF           // signals the end of the file
}

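Think of this enum as the dictionary of everything your language knows how to recognise. Every token the scanner emits carries one of these labels. A character that can't start any of them, and isn't whitespace or part of a comment, is reported as an error.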

The Scanner Itself: A Cursor Walking Through Your Source

The scanner holds the source code as a string and uses two integer indexes to track its position:

public class LoxScanner
{
    private readonly string _source;   // The full source code
    private int _start   = 0;          // Where the current token began
    private int _current = 0;          // Where the cursor is right now
    private int _line    = 1;          // Which line we're on
}

Here's what those indexes mean visually. Say we're midway through scanning var x = 42; and we've just started on the number:

v  a  r     x     =     4  2  ;
0  1  2  3  4  5  6  7  8  9  10

                        ^  ^
                   _start  _current

_start marks where the current token begins. _current is where we're reading right now. The gap between them is the lexeme we're building. When we call AddToken(), we slice _source[_start.._current] to get the raw text.
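Here's a minimal sketch of that slice on its own. LexemeDemo and its two methods are hypothetical names for this example; the point is only that the range is half-open, so it gives the same text as the Substring(_start, _current - _start) calls used in the scanning methods later on.

```csharp
public static class LexemeDemo
{
    // Hypothetical helper: returns the text between the two indexes.
    // The range is half-open: start is included, current is not.
    public static string Lexeme(string source, int start, int current)
        => source[start..current];

    // The same slice written with Substring(startIndex, length).
    public static string LexemeViaSubstring(string source, int start, int current)
        => source.Substring(start, current - start);
}
```

For the source var x = 42; with _start at 8 and _current at 10 (both digits consumed), LexemeDemo.Lexeme("var x = 42;", 8, 10) gives "42". The character at index 10, the semicolon, is not included.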

The Main Loop: ScanTokens()

The public entry point loops through the entire source, resetting _start at the beginning of each token, then calling ScanToken() to figure out what it is:

public List<Token> ScanTokens()
{
    while (!IsAtEnd())
    {
        _start = _current;  // ← snap _start to wherever the cursor is now
        ScanToken();        // ← read one token from here
    }

    _tokens.Add(new Token(TokenType.EOF, "", null, _line));
    return _tokens;
}

Each iteration of the loop produces at most one token: whitespace, comments and unrecognised characters produce none. After ScanToken() returns, _current has moved past whatever characters it consumed. Then _start is reset to _current and the next token begins.

The final EOF token is added manually — it gives the parser a clean signal that the source has run out, rather than having to check for null.

ScanToken(): The Heart of the Scanner

ScanToken() reads one character with Advance() and then uses a switch to decide what to do:

private void ScanToken()
{
    char c = Advance();  // consume one character

    switch (c)
    {
        case '(': AddToken(TokenType.LEFT_PAREN);  break;
        case ')': AddToken(TokenType.RIGHT_PAREN); break;
        case '+': AddToken(TokenType.PLUS);        break;
        case ';': AddToken(TokenType.SEMICOLON);   break;
        // ... more cases
    }
}

For simple single-character tokens this is trivial — one character consumed, one token emitted, done.

Advance(), Peek(), and Match(): The Three Core Helpers

These three methods are the engine behind everything the scanner does. It's worth understanding them cold.

`Advance()` — consume a character

private char Advance() => _source[_current++];

Returns the character at _current, then increments _current. The ++ is postfix — it returns the value before incrementing. Every call to Advance() moves the cursor forward by one. You've "consumed" that character; it's now part of the current token.

`Peek()` — look without consuming

private char Peek() => IsAtEnd() ? '\0' : _source[_current];

Returns the character at _current but does not move the cursor. This is called lookahead — you're scouting the next character before committing to it. If we're at the end of the file, it returns '\0' (a null character) as a safe sentinel.

`Match()` — conditional consume

private bool Match(char expected)
{
    if (IsAtEnd()) return false;
    if (_source[_current] != expected) return false;
    _current++;   // only move forward if it matched
    return true;
}

This is Peek() with commitment. If the next character is what we expect, consume it and return true. If not, leave it alone and return false. The scanner uses it for two-character operators and for spotting the second / that starts a comment.
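To poke at the three helpers in isolation, here's a stripped-down sketch with no tokens at all. The Cursor class and its Position property are hypothetical, and the IsAtEnd() body is an assumption (a plain bounds check), since the post uses IsAtEnd() without showing it.

```csharp
public sealed class Cursor
{
    private readonly string _source;
    private int _current = 0;

    public Cursor(string source) => _source = source;

    // Exposed only so the example can observe the cursor.
    public int Position => _current;

    // Assumed definition: true once the cursor has passed the last character.
    public bool IsAtEnd() => _current >= _source.Length;

    // Consume: return the current character, then step forward.
    public char Advance() => _source[_current++];

    // Lookahead: return the current character without moving.
    public char Peek() => IsAtEnd() ? '\0' : _source[_current];

    // Conditional consume: step forward only if the character matches.
    public bool Match(char expected)
    {
        if (IsAtEnd() || _source[_current] != expected) return false;
        _current++;
        return true;
    }
}
```

With new Cursor("!=x"), Advance() returns '!' and moves to position 1, Peek() returns '=' and stays at 1, Match('=') returns true and moves to 2, and a second Match('=') returns false and leaves the cursor on 'x'.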

Handling Two-Character Operators

This is where lookahead earns its keep. When the scanner sees !, it can't tell whether it has ! (not) or != (not-equal) until it looks at the next character.

case '!': AddToken(Match('=') ? TokenType.BANG_EQUAL : TokenType.BANG); break;
case '=': AddToken(Match('=') ? TokenType.EQUAL_EQUAL : TokenType.EQUAL); break;
case '<': AddToken(Match('=') ? TokenType.LESS_EQUAL : TokenType.LESS); break;
case '>': AddToken(Match('=') ? TokenType.GREATER_EQUAL : TokenType.GREATER); break;

Let's trace != character by character:

Step 1: Advance() consumes '!'
        → we're in case '!'
        → call Match('=')

Step 2: Match('=') peeks at the next char — it's '='
        → consume it, _current moves forward
        → returns true

Step 3: true ? BANG_EQUAL : BANG  →  BANG_EQUAL
        → AddToken(BANG_EQUAL)

Now trace just ! followed by a space:

Step 1: Advance() consumes '!'
        → we're in case '!'
        → call Match('=')

Step 2: Match('=') peeks at the next char — it's ' '
        → does NOT match, cursor stays put
        → returns false

Step 3: false ? BANG_EQUAL : BANG  →  BANG
        → AddToken(BANG)
        → the space is left for the next iteration

This principle — always match the longest possible token — is called maximal munch. != wins over ! whenever possible.

Comments and the Slash Problem

The / character is ambiguous too. It might be division, or it might be the start of a // comment:

case '/':
    if (Match('/'))
    {
        // It's a comment — consume everything to the end of the line
        while (Peek() != '\n' && !IsAtEnd())
            Advance();
        // No AddToken() call — comments produce no token
    }
    else
    {
        AddToken(TokenType.SLASH);
    }
    break;

If / is followed by another /, we eat the rest of the line using Peek() in a loop — but we never emit a token. Comments are silently discarded. The parser never sees them.

If / is followed by anything else, it's a division operator and we emit SLASH as normal.
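Whitespace gets the same silent treatment as comments, though those cases aren't shown in the snippets here. The sketch below follows the approach from the book (ignore spaces, carriage returns and tabs; count newlines), rewritten as a standalone function. TriviaDemo and SkipTrivia are hypothetical names, and returning the surviving characters as a string is just a way to see what would reach the token-producing cases.

```csharp
using System.Text;

public static class TriviaDemo
{
    // Returns the characters that are NOT whitespace or comment text,
    // plus the line number reached at the end of the source.
    public static (string Kept, int Line) SkipTrivia(string source)
    {
        var kept = new StringBuilder();
        int current = 0, line = 1;

        while (current < source.Length)
        {
            char c = source[current++];
            switch (c)
            {
                case ' ':
                case '\r':
                case '\t':
                    break;          // plain whitespace: no token
                case '\n':
                    line++;         // newline: no token, but keep the line count right
                    break;
                case '/' when current < source.Length && source[current] == '/':
                    // comment: skip up to, but not including, the newline
                    while (current < source.Length && source[current] != '\n') current++;
                    break;
                default:
                    kept.Append(c); // anything else would go on to become a token
                    break;
            }
        }
        return (kept.ToString(), line);
    }
}
```

Note that the comment loop stops before the newline rather than consuming it, so the '\n' case still runs and the line counter stays accurate. A lone / falls through to the default branch, which is where the real scanner would emit SLASH.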

Scanning Strings

When the scanner hits a ", it knows a string has started. It keeps consuming characters until it finds the matching closing ":

private void ScanString()
{
    while (Peek() != '"' && !IsAtEnd())
    {
        if (Peek() == '\n') _line++;  // strings can span multiple lines
        Advance();
    }

    if (IsAtEnd())
    {
        _errors.Add($"[Line {_line}] Unterminated string.");
        return;
    }

    Advance(); // consume the closing "

    // Trim the surrounding quotes to get the value
    string value = _source.Substring(_start + 1, (_current - 1) - (_start + 1));
    AddToken(TokenType.STRING, value);
}

Let's trace "hello" step by step:

Source:   "  h  e  l  l  o  "
Index:    0  1  2  3  4  5  6

ScanToken() calls Advance() → consumes '"' at index 0
_start = 0, _current = 1

ScanString() begins:
  Peek() = 'h' → not '"', not end → Advance()  (_current = 2)
  Peek() = 'e' → not '"', not end → Advance()  (_current = 3)
  Peek() = 'l' → not '"', not end → Advance()  (_current = 4)
  Peek() = 'l' → not '"', not end → Advance()  (_current = 5)
  Peek() = 'o' → not '"', not end → Advance()  (_current = 6)
  Peek() = '"' → stop the loop

Advance() → consumes the closing '"'  (_current = 7)

Substring(_start + 1, ...) = Substring(1, 5) = "hello"
AddToken(STRING, "hello")

The _start + 1 and _current - 1 in the Substring call are just stripping the opening and closing quote marks. The literal value stored is hello, not "hello".

Scanning Numbers

Numbers need two levels of lookahead — one to check for a decimal point, and a second to make sure the character after the dot is actually a digit:

private void ScanNumber()
{
    // Consume all the integer digits
    while (char.IsDigit(Peek())) Advance();

    // Is there a decimal part?
    if (Peek() == '.' && char.IsDigit(PeekNext()))
    {
        Advance(); // consume the '.'
        while (char.IsDigit(Peek())) Advance();
    }

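    string numText = _source.Substring(_start, _current - _start);
    // Parse with the invariant culture so "3.14" reads the same on every machine
    AddToken(TokenType.NUMBER, double.Parse(numText, System.Globalization.CultureInfo.InvariantCulture));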
}

Why do we need PeekNext() for the dot check? Consider this code:

42.toString()

When the scanner finishes reading 42 and sees ., it peeks at the character after the dot. That's t — not a digit. So the dot is not consumed as part of the number. It'll be picked up on the next iteration as a DOT token. Without that second peek, 42. would be swallowed incorrectly.
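PeekNext() itself isn't shown above, so here's a minimal sketch of how it might look, assuming it mirrors Peek() one position further along with the same '\0' sentinel. NumberDemo and its index-passing helpers are hypothetical, written as static functions so the example stands alone.

```csharp
using System.Globalization;

public static class NumberDemo
{
    private static char Peek(string s, int current)
        => current >= s.Length ? '\0' : s[current];

    // Assumed shape of PeekNext(): one character beyond Peek(), bounds-checked.
    private static char PeekNext(string s, int current)
        => current + 1 >= s.Length ? '\0' : s[current + 1];

    // Scans a number starting at 'start' (which must point at a digit).
    public static (string Lexeme, double Value) ScanNumber(string source, int start)
    {
        int current = start;
        while (char.IsDigit(Peek(source, current))) current++;

        // Only take the dot if a digit follows it.
        if (Peek(source, current) == '.' && char.IsDigit(PeekNext(source, current)))
        {
            current++; // consume the '.'
            while (char.IsDigit(Peek(source, current))) current++;
        }

        string lexeme = source[start..current];
        return (lexeme, double.Parse(lexeme, CultureInfo.InvariantCulture));
    }
}
```

NumberDemo.ScanNumber("42.toString()", 0) stops at "42" and leaves the dot alone, while NumberDemo.ScanNumber("3.14", 0) takes the whole "3.14". A trailing dot with nothing after it, as in "7.", is also left behind, because PeekNext() returns the sentinel there.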

Tracing 3.14:

ScanNumber() called after '3' was consumed by Advance()
_start = 0, _current = 1

while IsDigit(Peek()):  Peek()='.' → not a digit → stop

Peek() == '.' → true
PeekNext() == '1' → digit → true, enter the if block

Advance() consumes '.'   (_current = 2)

while IsDigit(Peek()):
  Peek() = '1' → Advance()  (_current = 3)
  Peek() = '4' → Advance()  (_current = 4)
  Peek() = '\0' (end of source) → not a digit → stop

Substring(0, 4) = "3.14"
double.Parse("3.14") = 3.14
AddToken(NUMBER, 3.14)

Scanning Identifiers and Keywords

Words — whether user-defined names or reserved keywords — are scanned the same way. Read everything that looks like a word, then check a dictionary:

private void ScanIdentifier()
{
    while (char.IsLetterOrDigit(Peek()) || Peek() == '_')
        Advance();

    string word = _source.Substring(_start, _current - _start);

    // Look it up — is it a keyword or a user-defined name?
    TokenType type = Keywords.GetValueOrDefault(word, TokenType.IDENTIFIER);
    AddToken(type);
}

The Keywords dictionary maps strings to token types:

private static readonly Dictionary<string, TokenType> Keywords = new()
{
    { "var",   TokenType.VAR   },
    { "if",    TokenType.IF    },
    { "while", TokenType.WHILE },
    { "true",  TokenType.TRUE  },
    // ... etc
};

Let's trace while vs whileLoop:

// "while" → keyword
word = "while"
Keywords.GetValueOrDefault("while", IDENTIFIER) → found → TokenType.WHILE
AddToken(WHILE)       ← keyword

// "whileLoop" → identifier
word = "whileLoop"
Keywords.GetValueOrDefault("whileLoop", IDENTIFIER) → not found → IDENTIFIER
AddToken(IDENTIFIER)  ← user-defined name

This is clean and simple. One scanning rule covers all words. The distinction between keywords and names happens at the lookup step, not during reading.
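The lookup step is easy to try on its own. In this sketch, KeywordDemo, Classify and the cut-down WordType enum are hypothetical stand-ins for the scanner's Keywords table and TokenType; only four keywords are included.

```csharp
using System.Collections.Generic;

// Cut-down stand-in for TokenType: just enough for the example.
public enum WordType { IDENTIFIER, VAR, IF, WHILE, TRUE }

public static class KeywordDemo
{
    private static readonly Dictionary<string, WordType> Keywords = new()
    {
        { "var",   WordType.VAR   },
        { "if",    WordType.IF    },
        { "while", WordType.WHILE },
        { "true",  WordType.TRUE  },
    };

    // The whole word is looked up at once; anything not in the table is a name.
    public static WordType Classify(string word)
        => Keywords.GetValueOrDefault(word, WordType.IDENTIFIER);
}
```

Because the whole word is read before the lookup, KeywordDemo.Classify("whileLoop") comes back as IDENTIFIER even though it starts with a keyword. The dictionary's default string comparison is case-sensitive, so "While" is an identifier too.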

Error Handling: Keep Calm and Carry On

When the scanner hits a character it doesn't recognise, it records the error and continues:

default:
    if (char.IsDigit(c))                   ScanNumber();
    // Allow a leading underscore too, since ScanIdentifier() accepts '_' inside a name
    else if (char.IsLetter(c) || c == '_') ScanIdentifier();
    else
        _errors.Add($"[Line {_line}] Unexpected character: '{c}'");
    break;

Given var x = @42#;, the scanner:

Scans var → VAR
Scans x → IDENTIFIER
Scans = → EQUAL
Hits @ → records error, keeps going
Scans 42 → NUMBER
Hits # → records error, keeps going
Scans ; → SEMICOLON

Output:

VAR             'var'    ->
IDENTIFIER      'x'      ->
EQUAL           '='      ->
NUMBER          '42'     -> 42
SEMICOLON       ';'      ->
EOF             ''       ->

--- Scan Errors ---
[Line 1] Unexpected character: '@'
[Line 1] Unexpected character: '#'

You get all the errors at once, not just the first one. That's a small but important quality-of-life decision.

Putting It All Together: A Full Trace

Let's trace the scanner through a complete real example:

fun greet(name) {
    print name;
}

The scanner walks through this and produces:

FUN             'fun'    ->
IDENTIFIER      'greet'  ->
LEFT_PAREN      '('      ->
IDENTIFIER      'name'   ->
RIGHT_PAREN     ')'      ->
LEFT_BRACE      '{'      ->
PRINT           'print'  ->
IDENTIFIER      'name'   ->
SEMICOLON       ';'      ->
RIGHT_BRACE     '}'      ->
EOF             ''       ->

Notice a few things:

fun is recognised as a keyword (FUN), not an identifier
greet and name are identifiers — user-defined names
print is also a keyword (PRINT), because in Lox printing is a built-in statement rather than a library function
Whitespace and newlines produced zero tokens — completely invisible
There are 11 tokens total from what felt like a few lines of code

The Full Code

The complete implementation is linked at the top of this post. I used Claude to help translate the original Java code from the book into C#, which was a great way to understand each method properly — having to review and understand AI-generated code forces you to actually read it!

Why This Matters

Scanning might seem like a boring first step, but it's the foundation everything else rests on. A well-designed scanner:

Gives the rest of the interpreter clean, structured input instead of raw text
Catches character-level mistakes early with useful line numbers
Handles the messy edge cases (two-character operators, decimal numbers, multi-line strings) so the parser never has to think about them

It's also a surprisingly satisfying thing to build yourself: you watch a wall of text turn into a neat list of labelled tokens, and realise that this is the first thing your language does every single time someone runs their code.

This post is based on Chapter 4 of Crafting Interpreters by Robert Nystrom, one of the best programming books I've read. Highly recommended if you've ever wanted to understand how languages actually work.
