Chapters: 

dual-plane alignment

Let's step back and consider how this fits into the bigger picture. dual-plane alignment. Fancy term. I bet it just gets fancier. How many planes can we add? How is this technically coming together? What are the moving pieces?

I want to have a solid understanding of this stage, please.

  • 1. Structural plane

  • 2. Lexical plane         <------ "tokens" and phrase building

  • 3. Semantic plane

  • 4. Financial plane

Let's look at the Lexical Plane first, because the Financial Plane by its very nature insists that I am paid to pursue this path $$ 

🧱 alignment attempt

Our Statement Import pipeline is structured as JSON:

        {
          "date": "12/10/2022",
          "name": "AIR CANADA",
          "description": "OTTAWA TO VANCOUVER",
          "withdrawal": "",
          "deposit": "",
          "amount": 0.0,
          "type": "income",
          "category": "TRANSPORT",
          "is_travel": true
        },


| Stage | Language-learning term |
| --- | --- |
| Force into structure | Contrastive Analysis (alignment attempt) |
| Assign confidence | Degree of match |
| needs_review | Mismatch / divergence |

We have some issues at this stage.

Most obvious from the banking side of things: there is no monetary value assigned to this record. This is a bank-statement translation with tagging, and the most glaring omission is of course "withdrawal". 

The second issue is "type": "income". Unless I'm expensing this cost through some (unknown to me) reverse-income accounting procedure, this is not income; it is an expense. (Yes, I see accounting as inverse mathematics.) 

And Air + Canada isn't aware that it is a compound word. How can we train it? How do "tokens" work? 



🧱 Alignment Attempt — failure modes in the bank statement record

Input:

{
  "date": "12/10/2022",
  "name": "AIR CANADA",
  "description": "OTTAWA TO VANCOUVER",
  "withdrawal": "",
  "deposit": "",
  "amount": 0.0,
  "type": "income",
  "category": "TRANSPORT",
  "is_travel": true
}

🔴 Issue 1 — Missing monetary signal

  • withdrawal = ""
  • deposit = ""
  • amount = 0.0

This is not a subtle miss. This is a structural violation.

A financial transaction without a value is not a transaction.

Classification:

  • Alignment status: FAILED (critical field missing)
  • Not “low confidence” — this should be blocked or quarantined

🔴 Issue 2 — Type misclassification

  • "type": "income" conflicts with:
    • "AIR CANADA"
    • "TRANSPORT"
    • "is_travel": true

Your system correctly detected context (travel), but inverted the financial polarity.

This is a classic sign: semantic alignment succeeded, financial alignment failed.


⚙️ Financial Alignment 

Right now, your alignment attempt is single-plane (semantic tagging).

You need dual-plane alignment:


1. Semantic alignment (you already have)

  • AIR CANADA → TRANSPORT ✔
  • travel flag ✔

2. Financial alignment (missing enforcement)

You need a rule set like:

if not withdrawal and not deposit:
    status = "INVALID_MISSING_AMOUNT"

if amount == 0.0:
    status = "INVALID_ZERO_AMOUNT"

And more importantly:

if merchant in KNOWN_EXPENSE_MERCHANTS:
    expected_type = "expense"
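Taken together, the financial-plane rules above could be collected into a single validation pass. This is a minimal sketch: the function name, status strings, and the merchant list are my own assumptions, not an existing API.

```python
# Hypothetical financial-plane validator for records shaped like the JSON above.
KNOWN_EXPENSE_MERCHANTS = {"AIR CANADA", "UBER"}  # illustrative seed list

def validate_financial_plane(record):
    issues = []
    # Rule 1: a transaction must carry a monetary signal
    if not record.get("withdrawal") and not record.get("deposit"):
        issues.append("INVALID_MISSING_AMOUNT")
    # Rule 2: zero amounts are equally invalid
    if record.get("amount", 0.0) == 0.0:
        issues.append("INVALID_ZERO_AMOUNT")
    # Rule 3: known expense merchants must not be typed as income
    if record.get("name") in KNOWN_EXPENSE_MERCHANTS and record.get("type") != "expense":
        issues.append("TYPE_POLARITY_MISMATCH")
    return issues

record = {
    "name": "AIR CANADA", "withdrawal": "", "deposit": "",
    "amount": 0.0, "type": "income",
}
print(validate_financial_plane(record))
# all three rules fire on the sample record
```

On the sample record this returns three failures, which matches the "blocked or quarantined" classification above: an empty issue list would mean the financial plane aligned.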

How many “planes” in the alignment engine?

We can add as many as we want, but in practice they cluster into four useful ones:

1. Structural plane

  • Do fields exist?
  • Is the shape valid JSON / schema?

2. Lexical plane

  • What are the tokens?
  • "AIR" + "CANADA" vs "AIR CANADA"

This is where your airline problem lives.


3. Semantic plane

  • What does it mean?
  • airline → transport → travel

4. Financial plane

  • Direction (income vs expense)
  • magnitude (amount)
  • sign (debit/credit behavior)

Lexical Plane

🧱 What is a “token” here?

Not crypto. Not auth. Not LLM subword magic.

In your pipeline:

A token = a normalized chunk of text used for matching

Examples from:

"AIR CANADA OTTAWA TO VANCOUVER"

Tokens:

["air", "canada", "ottawa", "to", "vancouver"]

⚙️ How tokens are created

This is deterministic. A small pipeline:

1. Normalize

  • lowercase
  • strip punctuation
  • unify whitespace
text = "AIR CANADA - OTTAWA TO VANCOUVER"
→ "air canada ottawa to vancouver"

2. Split (tokenize)

Most basic:

tokens = text.split()

More robust:

  • split on spaces + punctuation
  • keep numbers intact (123.45, 2022-10-12)

3. Clean

Remove noise tokens:

STOPWORDS = {"to", "from", "the", "and"}
tokens = [t for t in tokens if t not in STOPWORDS]

Result:

["air", "canada", "ottawa", "vancouver"]
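The three steps above (normalize, split, clean) chain into one small function. A sketch, with the stopword list taken from the fragment above and the dot kept inside tokens so numbers like 123.45 stay intact:

```python
import re

STOPWORDS = {"to", "from", "the", "and"}  # minimal noise list from above

def normalize_and_tokenize(text):
    # 1. normalize: lowercase
    # 2. split: keep runs of letters, digits, and dots (preserves 123.45)
    tokens = re.findall(r"[a-z0-9.]+", text.lower())
    # 3. clean: drop stopwords
    return [t for t in tokens if t not in STOPWORDS]

print(normalize_and_tokenize("AIR CANADA - OTTAWA TO VANCOUVER"))
# → ['air', 'canada', 'ottawa', 'vancouver']
```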

🧠 What tokens are used for

Tokens are not the goal. They are fuel for:

1. Matching

if "uber" in tokens:
    category = "transport"

2. Phrase building (your key concern)

["air", "canada"] → "air canada"

3. Frequency tracking

token_counts["air"] += 1
pair_counts[("air","canada")] += 1

This is how the system learns what “sticks together”.
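A minimal sketch of that counting, using collections.Counter; the variable names match the fragments above, but the `observe` helper is my own:

```python
from collections import Counter

token_counts = Counter()
pair_counts = Counter()

def observe(tokens):
    # count each token, plus each adjacent pair
    token_counts.update(tokens)
    pair_counts.update(zip(tokens, tokens[1:]))

observe(["air", "canada", "ottawa", "vancouver"])
observe(["air", "canada", "toronto"])
print(pair_counts[("air", "canada")])  # → 2
```

After enough statements, the ratio of pair count to single-token count is exactly the "sticks together" signal described above.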


🔄 Are tokens “spent”?

Not really. They’re:

  • created → from text
  • transformed → filtered, grouped
  • reused → across matching, scoring, learning

Think of them as atoms in a reaction, not currency.


📦 How tokens are sorted / organized

Three useful structures:


1. Set (fast lookup)

token_set = set(tokens)
if "uber" in token_set:

2. List (ordered)

["air", "canada", "ottawa"]

Used for phrase detection (adjacency matters)


3. Frequency maps

{
  "air": 120,
  "canada": 95
}

and pairs:

{
  ("air","canada"): 80
}

✈️ The AIR CANADA moment (lexical plane in action)

You start with:

["air", "canada"]

The system tracks:

  • how often "air" appears alone
  • how often "canada" appears alone
  • how often "air"+"canada" appear together and adjacent

When this becomes dominant:

pair_counts[("air","canada")] / token_counts["air"] → high

You promote:

"air canada" → single lexical unit

Now your tokens become:

["air_canada", "ottawa", "vancouver"]

(underscore or space, doesn’t matter internally)


🧭 Important distinction

Lexical plane does not care what it means.

It only answers:

“What chunks of text behave like stable units?”

Meaning comes later.


⚙️ Minimal working version

import re

def tokenize(text):
    text = text.lower()
    tokens = re.findall(r"[a-z0-9]+", text)
    return tokens

Phrase detection (simple):

def detect_phrases(tokens, known_phrases):
    # pad with spaces so "air canada" cannot match inside e.g. "fair canada"
    joined = f" {' '.join(tokens)} "
    for phrase in known_phrases:
        if f" {phrase} " in joined:
            return phrase
    return None

🧩 Clean mental model

Characters → tokens → phrases → entities

You are currently here:

tokens → phrases


⚡ Compression

Lexical plane =
  break text into tokens
  track how tokens appear
  recombine tokens that consistently travel together

If you want the next edge:

  • we can define promotion rules (when exactly “air canada” graduates from coincidence → entity)

That’s where lexical stops being mechanical and starts becoming strategic.