dual-plane alignment
Let's step back and consider how this fits into the bigger picture. Dual-plane alignment. Fancy term. I bet it just gets fancier. How many planes can we add? How is this technically coming together? What are the moving pieces?
I want a solid understanding of this stage, please.
1. Structural plane
2. Lexical plane  <------ "tokens" and phrase building
3. Semantic plane
4. Financial plane
Let's look at the Lexical Plane first, because the Financial Plane, by its very nature, insists that I am paid to pursue this path $$
🧱 alignment attempt
Our Statement Import pipeline is structured as JSON:
```json
{
  "date": "12/10/2022",
  "name": "AIR CANADA",
  "description": "OTTAWA TO VANCOUVER",
  "withdrawal": "",
  "deposit": "",
  "amount": 0.0,
  "type": "income",
  "category": "TRANSPORT",
  "is_travel": true
}
```
| Stage | Language-learning term |
|---|---|
| Force into structure | Contrastive Analysis (alignment attempt) |
| Assign confidence | Degree of match |
| | Mismatch / divergence |
We have some issues at this stage.
Most obvious on the banking side of things: there is no monetary value assigned to this snippet. This is a Bank Statement translation with tagging, and the most glaring omission is, of course, "withdrawal".
The second issue is the type: income. Unless I claim I'm expensing this cost with an (unknown to me) reverse income accounting procedure, this is not income, it is an expense. (Yes, I see accounting as inverse mathematics.)
And Air + Canada isn't aware that it is a compound word. How can we train it? How do "tokens" work?
🧱 Alignment Attempt — failure modes in the bank statement record
Input:
```json
{
  "date": "12/10/2022",
  "name": "AIR CANADA",
  "description": "OTTAWA TO VANCOUVER",
  "withdrawal": "",
  "deposit": "",
  "amount": 0.0,
  "type": "income",
  "category": "TRANSPORT",
  "is_travel": true
}
```
🔴 Issue 1 — Missing monetary signal
- `withdrawal = ""`
- `deposit = ""`
- `amount = 0.0`
This is not a subtle miss. This is a structural violation.
A financial transaction without a value is not a transaction.
Classification:
- Alignment status: FAILED (critical field missing)
- Not “low confidence” — this should be blocked or quarantined
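As an illustration, here is a minimal sketch of what "blocked or quarantined" could look like. The field names follow the record above; the status string and the `quarantine` list are illustrative, not the pipeline's actual API:

```python
# Sketch: hard-fail records with no monetary signal before any tagging happens.
# Field names mirror the statement JSON above; the status string is illustrative.
def check_monetary_signal(record: dict) -> str:
    has_value = (
        record.get("withdrawal") not in ("", None)
        or record.get("deposit") not in ("", None)
        or record.get("amount", 0.0) != 0.0
    )
    return "OK" if has_value else "INVALID_MISSING_AMOUNT"

quarantine = []
record = {"withdrawal": "", "deposit": "", "amount": 0.0}
if check_monetary_signal(record) != "OK":
    quarantine.append(record)   # held for manual review instead of being scored
```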
🔴 Issue 2 — Type misclassification
`"type": "income"` conflicts with:
- `"AIR CANADA"`
- `"TRANSPORT"`
- `"is_travel": true`
Your system correctly detected context (travel), but inverted the financial polarity.
This is a classic sign: semantic alignment succeeded, financial alignment failed.
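One way to surface that inversion, assuming each category carries an expected polarity. The `EXPENSE_CATEGORIES` set below is a hypothetical lookup, not something already in the pipeline:

```python
# Sketch: flag records whose declared type contradicts their semantic tags.
# EXPENSE_CATEGORIES is a hypothetical lookup table, not an existing structure.
EXPENSE_CATEGORIES = {"TRANSPORT", "GROCERIES", "RESTAURANTS"}

def check_polarity(record: dict) -> str:
    if record.get("type") == "income" and record.get("category") in EXPENSE_CATEGORIES:
        return "POLARITY_CONFLICT"   # semantic plane says expense, type says income
    return "OK"
```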
⚙️ Financial Alignment
Right now, your alignment attempt is single-plane (semantic tagging).
You need dual-plane alignment:
1. Semantic alignment (which you already have)
- AIR CANADA → TRANSPORT ✔
- travel flag ✔
2. Financial alignment (missing enforcement)
You need a rule set like:
```python
if not withdrawal and not deposit:
    status = "INVALID_MISSING_AMOUNT"

if amount == 0.0:
    status = "INVALID_ZERO_AMOUNT"
```
And more importantly:
```python
if merchant in KNOWN_EXPENSE_MERCHANTS:
    expected_type = "expense"
```
How many “planes” in the alignment engine?
We can add as many as we want, but in practice they cluster into four useful ones (sketched in code after this list):
1. Structural plane
- Do fields exist?
- Is the shape valid JSON / schema?
2. Lexical plane
- What are the tokens?
- `"AIR" + "CANADA"` vs `"AIR CANADA"`
This is where your airline problem lives.
3. Semantic plane
- What does it mean?
- airline → transport → travel
4. Financial plane
- Direction (income vs expense)
- Magnitude (amount)
- Sign (debit/credit behavior)
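To make the four planes concrete, here is a minimal sketch that runs one record through each plane and reports a verdict per plane. The function name, status strings, and `KNOWN_EXPENSE_MERCHANTS` set are illustrative assumptions, not the pipeline's real API:

```python
# Sketch: one pass per plane over a single statement record.
# KNOWN_EXPENSE_MERCHANTS and all status strings are illustrative assumptions.
import json

KNOWN_EXPENSE_MERCHANTS = {"air canada", "uber"}

def align(record: dict) -> dict:
    result = {}

    # 1. Structural plane: do the required fields exist?
    required = {"date", "name", "amount", "type", "category"}
    result["structural"] = "OK" if required <= record.keys() else "MISSING_FIELDS"

    # 2. Lexical plane: what chunks of text do we have to match on?
    tokens = record.get("name", "").lower().split()
    result["lexical"] = tokens

    # 3. Semantic plane: is any meaning attached at all?
    result["semantic"] = "OK" if record.get("category") else "UNTAGGED"

    # 4. Financial plane: direction, magnitude, sign.
    merchant = " ".join(tokens)
    if not record.get("withdrawal") and not record.get("deposit"):
        result["financial"] = "INVALID_MISSING_AMOUNT"
    elif merchant in KNOWN_EXPENSE_MERCHANTS and record.get("type") == "income":
        result["financial"] = "POLARITY_CONFLICT"
    else:
        result["financial"] = "OK"
    return result

print(json.dumps(align({
    "date": "12/10/2022", "name": "AIR CANADA", "withdrawal": "", "deposit": "",
    "amount": 0.0, "type": "income", "category": "TRANSPORT",
}), indent=2))
```

In this sketch the AIR CANADA record passes the structural and semantic planes but fails on the financial plane, which matches the diagnosis above.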
Lexical Plane
🧱 What is a “token” here?
Not crypto. Not auth. Not LLM subword magic.
In your pipeline:
A token = a normalized chunk of text used for matching
Examples from:
"AIR CANADA OTTAWA TO VANCOUVER"Tokens:
["air", "canada", "ottawa", "to", "vancouver"]⚙️ How tokens are created
⚙️ How tokens are created
This is deterministic. A small pipeline:
1. Normalize
- lowercase
- strip punctuation
- unify whitespace
text = "AIR CANADA - OTTAWA TO VANCOUVER"
→ "air canada ottawa to vancouver"2. Split (tokenize)
Most basic:
```python
tokens = text.split()
```
More robust:
- split on spaces + punctuation
- keep numbers intact (`123.45`, `2022-10-12`)
3. Clean
Remove noise tokens:
STOPWORDS = {"to", "from", "the", "and"}
tokens = [t for t in tokens if t not in STOPWORDS]Result:
["air", "canada", "ottawa", "vancouver"]🧠 What tokens are used for
🧠 What tokens are used for
Tokens are not the goal. They are fuel for:
1. Matching
if "uber" in tokens:
category = "transport"2. Phrase building (your key concern)
["air", "canada"] → "air canada"3. Frequency tracking
token_counts["air"] += 1
pair_counts[("air","canada")] += 1This is how the system learns what “sticks together”.
🔄 Are tokens “spent”?
Not really. They’re:
- created → from text
- transformed → filtered, grouped
- reused → across matching, scoring, learning
Think of them as atoms in a reaction, not currency.
📦 How tokens are sorted / organized
Three useful structures:
1. Set (fast lookup)
```python
token_set = set(tokens)
if "uber" in token_set:
    ...
```
2. List (ordered)
```python
["air", "canada", "ottawa"]
```
Used for phrase detection (adjacency matters).
3. Frequency maps
```python
{
    "air": 120,
    "canada": 95
}
```
and pairs:
```python
{
    ("air", "canada"): 80
}
```
✈️ The AIR CANADA moment (lexical plane in action)
You start with:
["air", "canada"]The system tracks:
-
how often
"air"appears alone -
how often
"canada"appears alone -
how often
"air"+"canada"appear together and adjacent
When this becomes dominant:
pair_counts[("air","canada")] / token_counts["air"] → highYou promote:
"air canada" → single lexical unitNow your tokens become:
["air_canada", "ottawa", "vancouver"](underscore or space, doesn’t matter internally)
🧭 Important distinction
Lexical plane does not care what it means.
It only answers:
“What chunks of text behave like stable units?”
Meaning comes later.
⚙️ Minimal working version
```python
import re

def tokenize(text):
    # Lowercase, then keep runs of letters/digits; punctuation falls away.
    text = text.lower()
    tokens = re.findall(r"[a-z0-9]+", text)
    return tokens
```
Phrase detection (simple):
```python
def detect_phrases(tokens, known_phrases):
    # Substring match over the re-joined token stream.
    joined = " ".join(tokens)
    for phrase in known_phrases:
        if phrase in joined:
            return phrase
    return None
```
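A quick usage pass over the running example (the `known_phrases` list is illustrative):

```python
tokens = tokenize("AIR CANADA - OTTAWA TO VANCOUVER")
# ['air', 'canada', 'ottawa', 'to', 'vancouver']

detect_phrases(tokens, known_phrases=["air canada", "via rail"])
# → 'air canada'
```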
🧩 Clean mental model
```
Characters → tokens → phrases → entities
```
You are currently here:
```
tokens → phrases
```
⚡ Compression
Lexical plane =
- break text into tokens
- track how tokens appear
- recombine tokens that consistently travel together

If you want the next edge:
- we can define promotion rules (when exactly “air canada” graduates from coincidence → entity)
That’s where lexical stops being mechanical and starts becoming strategic.