Chapters: 

dual-plane alignment

Let's step back and consider how this fits into the bigger picture. dual-plane alignment. Fancy term. I bet it just gets fancier. How many planes can we add? How is this technically coming together? What are the moving pieces?

I want to have a solid understanding of this stage, please.

  • 1. Structural plane

  • 2. Lexical plane         <------ "tokens" and phrase building

  • 3. Semantic plane

  • 4. Financial plane

Let's look at the Lexical Plane first, because the Financial Plane by its very nature insists that I am paid to pursue this path $$ 

🧱 alignment attempt

Our Statement Import pipeline is structured as JSON:

        {
          "date": "12/10/2022",
          "name": "AIR CANADA",
          "description": "OTTAWA TO VANCOUVER",
          "withdrawal": "",
          "deposit": "",
          "amount": 0.0,
          "type": "income",
          "category": "TRANSPORT",
          "is_travel": true
        },


| Stage | Language-learning term |
| --- | --- |
| Force into structure | Contrastive Analysis (alignment attempt) |
| Assign confidence | Degree of match |
| needs_review | Mismatch / divergence |

We have some issues at this stage.

Most obvious from the banking side of things: there is no monetary value assigned to this record. This is a bank-statement translation with tagging, and the most glaring omission is of course "withdrawal". 

The second issue is "type": "income". Unless I'm expensing this cost through some (unknown to me) reverse-income accounting procedure, this is not income; it is an expense. (Yes, I see accounting as inverse mathematics.) 

And Air + Canada isn't aware that it is a compound word. How can we train it? How do "tokens" work? 



🧱 Alignment Attempt — failure modes in the bank statement record

Input:

{
  "date": "12/10/2022",
  "name": "AIR CANADA",
  "description": "OTTAWA TO VANCOUVER",
  "withdrawal": "",
  "deposit": "",
  "amount": 0.0,
  "type": "income",
  "category": "TRANSPORT",
  "is_travel": true
}

🔴 Issue 1 — Missing monetary signal

  • withdrawal = ""
  • deposit = ""
  • amount = 0.0

This is not a subtle miss. This is a structural violation.

A financial transaction without a value is not a transaction.

Classification:

  • Alignment status: FAILED (critical field missing)
  • Not “low confidence” — this should be blocked or quarantined

🔴 Issue 2 — Type misclassification

  • "type": "income" conflicts with:
    • "AIR CANADA"
    • "TRANSPORT"
    • "is_travel": true

Your system correctly detected context (travel), but inverted the financial polarity.

This is a classic sign: semantic alignment succeeded, financial alignment failed.


⚙️ Financial Alignment 

Right now, your alignment attempt is single-plane (semantic tagging).

You need dual-plane alignment:


1. Semantic alignment (you already have)

  • AIR CANADA → TRANSPORT ✔
  • travel flag ✔

2. Financial alignment (missing enforcement)

You need a rule set like:

if not withdrawal and not deposit:
    status = "INVALID_MISSING_AMOUNT"

if amount == 0.0:
    status = "INVALID_ZERO_AMOUNT"

And more importantly:

if merchant in KNOWN_EXPENSE_MERCHANTS:
    expected_type = "expense"
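Taken together, the financial-plane rules above could be collected into a single validation pass. This is a minimal sketch: the function name, status strings, and the merchant list are my own assumptions, not an existing API.

```python
# Hypothetical financial-plane validator for records shaped like the JSON above.
KNOWN_EXPENSE_MERCHANTS = {"AIR CANADA", "UBER"}  # illustrative seed list

def validate_financial_plane(record):
    issues = []
    # Rule 1: a transaction must carry a monetary signal
    if not record.get("withdrawal") and not record.get("deposit"):
        issues.append("INVALID_MISSING_AMOUNT")
    # Rule 2: zero amounts are equally invalid
    if record.get("amount", 0.0) == 0.0:
        issues.append("INVALID_ZERO_AMOUNT")
    # Rule 3: known expense merchants must not be typed as income
    if record.get("name") in KNOWN_EXPENSE_MERCHANTS and record.get("type") != "expense":
        issues.append("TYPE_POLARITY_MISMATCH")
    return issues

record = {
    "name": "AIR CANADA", "withdrawal": "", "deposit": "",
    "amount": 0.0, "type": "income",
}
print(validate_financial_plane(record))
# all three rules fire on the sample record
```

On the sample record this returns three failures, which matches the "blocked or quarantined" classification above: an empty issue list would mean the financial plane aligned.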

How many “planes” in the alignment engine?

We can add as many as we want, but in practice they cluster into four useful ones:

1. Structural plane

  • Do fields exist?
  • Is the shape valid JSON / schema?

2. Lexical plane

  • What are the tokens?
  • "AIR" + "CANADA" vs "AIR CANADA"

This is where your airline problem lives.


3. Semantic plane

  • What does it mean?
  • airline → transport → travel

4. Financial plane

  • Direction (income vs expense)
  • magnitude (amount)
  • sign (debit/credit behavior)

Lexical Plane

🧱 What is a “token” here?

Not crypto. Not auth. Not LLM subword magic.

In your pipeline:

A token = a normalized chunk of text used for matching

Examples from:

"AIR CANADA OTTAWA TO VANCOUVER"

Tokens:

["air", "canada", "ottawa", "to", "vancouver"]

⚙️ How tokens are created

This is deterministic. A small pipeline:

1. Normalize

  • lowercase
  • strip punctuation
  • unify whitespace
text = "AIR CANADA - OTTAWA TO VANCOUVER"
→ "air canada ottawa to vancouver"

2. Split (tokenize)

Most basic:

tokens = text.split()

More robust:

  • split on spaces + punctuation
  • keep numbers intact (123.45, 2022-10-12)

3. Clean

Remove noise tokens:

STOPWORDS = {"to", "from", "the", "and"}
tokens = [t for t in tokens if t not in STOPWORDS]

Result:

["air", "canada", "ottawa", "vancouver"]
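The three steps above (normalize, split, clean) chain into one small function. A sketch, with the stopword list taken from the fragment above and the dot kept inside tokens so numbers like 123.45 stay intact:

```python
import re

STOPWORDS = {"to", "from", "the", "and"}  # minimal noise list from above

def normalize_and_tokenize(text):
    # 1. normalize: lowercase
    # 2. split: keep runs of letters, digits, and dots (preserves 123.45)
    tokens = re.findall(r"[a-z0-9.]+", text.lower())
    # 3. clean: drop stopwords
    return [t for t in tokens if t not in STOPWORDS]

print(normalize_and_tokenize("AIR CANADA - OTTAWA TO VANCOUVER"))
# → ['air', 'canada', 'ottawa', 'vancouver']
```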

🧠 What tokens are used for

Tokens are not the goal. They are fuel for:

1. Matching

if "uber" in tokens:
    category = "transport"

2. Phrase building (your key concern)

["air", "canada"] → "air canada"

3. Frequency tracking

token_counts["air"] += 1
pair_counts[("air","canada")] += 1

This is how the system learns what “sticks together”.
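A minimal sketch of that counting, using collections.Counter; the variable names match the fragments above, but the `observe` helper is my own:

```python
from collections import Counter

token_counts = Counter()
pair_counts = Counter()

def observe(tokens):
    # count each token, plus each adjacent pair
    token_counts.update(tokens)
    pair_counts.update(zip(tokens, tokens[1:]))

observe(["air", "canada", "ottawa", "vancouver"])
observe(["air", "canada", "toronto"])
print(pair_counts[("air", "canada")])  # → 2
```

After enough statements, the ratio of pair count to single-token count is exactly the "sticks together" signal described above.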


🔄 Are tokens “spent”?

Not really. They’re:

  • created → from text
  • transformed → filtered, grouped
  • reused → across matching, scoring, learning

Think of them as atoms in a reaction, not currency.


📦 How tokens are sorted / organized

Three useful structures:


1. Set (fast lookup)

token_set = set(tokens)
if "uber" in token_set:

2. List (ordered)

["air", "canada", "ottawa"]

Used for phrase detection (adjacency matters)


3. Frequency maps

{
  "air": 120,
  "canada": 95
}

and pairs:

{
  ("air","canada"): 80
}

✈️ The AIR CANADA moment (lexical plane in action)

You start with:

["air", "canada"]

The system tracks:

  • how often "air" appears alone
  • how often "canada" appears alone
  • how often "air"+"canada" appear together and adjacent

When this becomes dominant:

pair_counts[("air","canada")] / token_counts["air"] → high

You promote:

"air canada" → single lexical unit

Now your tokens become:

["air_canada", "ottawa", "vancouver"]

(underscore or space, doesn’t matter internally)


🧭 Important distinction

Lexical plane does not care what it means.

It only answers:

“What chunks of text behave like stable units?”

Meaning comes later.


⚙️ Minimal working version

import re

def tokenize(text):
    text = text.lower()
    tokens = re.findall(r"[a-z0-9]+", text)
    return tokens

Phrase detection (simple):

def detect_phrases(tokens, known_phrases):
    # pad with spaces so "air canada" cannot match inside e.g. "fair canada"
    joined = f" {' '.join(tokens)} "
    for phrase in known_phrases:
        if f" {phrase} " in joined:
            return phrase
    return None

🧩 Clean mental model

Characters → tokens → phrases → entities

You are currently here:

tokens → phrases


⚡ Compression

Lexical plane =
  break text into tokens
  track how tokens appear
  recombine tokens that consistently travel together

If you want the next edge:

  • we can define promotion rules (when exactly “air canada” graduates from coincidence → entity)

That’s where lexical stops being mechanical and starts becoming strategic.