Excel conventions in the wild¶
A reference for the patterns customers actually send. Each section names a convention, shows a minimal example, and notes how crease's declarative YAML handles it (or where we explicitly defer).
The examples use plain HTML <table> blocks so the file renders on GitHub,
in MkDocs, and in any other markdown viewer. No build step required.
1. Title block above the header row¶
A few "decorative" rows precede the actual header row.
| A | B | C | |
|---|---|---|---|
| 1 | Acme Corporation — Order Export | ||
| 2 | Generated: 2025-04-15 | ||
| 3 | |||
| 4 | order_id | order_date | customer_email |
| 5 | ORD-1001 | 2025-01-15 | a@acme.com |
How crease handles it.
locate.header_row: 3(zero-indexed) andlocate.data_starts_row: 4. The LLM is expected to recognize that rows 0–2 are non-tabular and pick the correct header row from the samples.
Fixture: flat_with_title_rows
2. Footer / totals row below the data¶
A summary row at the bottom that looks like data but shouldn't be extracted.
| A | B | C | |
|---|---|---|---|
| 1 | order_id | quantity | total |
| 2 | ORD-1001 | 10 | 250.00 |
| 3 | ORD-1002 | 5 | 500.00 |
| 4 | ORD-1003 | 3 | 75.00 |
| 5 | TOTAL | 825.00 |
How crease handles it.
locate.data_ends_at: {type: value_match, column: 0, value: "TOTAL"}— stop reading at the first row whose column 0 contains "TOTAL".
Fixture: flat_with_totals_row
3. Multi-level (merged) headers¶
Excel merge cells produce visually-grouped headers across two rows.
| A | B | C | D | |
|---|---|---|---|---|
| 1 | SKU | Q1 2025 | ||
| 2 | Jan | Feb | Mar | |
| 3 | SKU-001 | 120 | 140 | 155 |
| 4 | SKU-002 | 30 | 45 | 50 |
How crease handles it. Deferred to v2. This requires either a crosstab-aware extractor or a pre-flatten step. v1 emits
column_count_mismatchand routes to a structural-error reject queue.
4. Scattered metadata on a cover sheet¶
Properties sprinkled across a tab with gaps, notes, and arbitrary ordering.
| A | B | |
|---|---|---|
| 1 | Acme Quarterly Sales Report | |
| 2 | ||
| 3 | ||
| 4 | Reporting Period: | Q1 2025 |
| 5 | ||
| 6 | Submitted by: | Jane Smith |
| 7 | Contact: | jane@acme.com |
| 8 | ||
| 9 | Notes: | |
| 10 | First-time submission — please confirm | |
| 11 | ||
| 12 | Date sent: | 2025-04-15 |
How crease handles it.
orientation: anchoredwith per-fieldanchor.label_match. Each field finds its label independently, walks one cell right, reads the value.
Fixture: anchored_scattered
5. Multiple tables on one tab¶
Two unrelated tables stacked vertically, separated by blank rows.
| A | B | C | |
|---|---|---|---|
| 1 | Customer Info | ||
| 2 | Name | Acme | |
| 3 | Tier | Enterprise | |
| 4 | |||
| 5 | Orders | ||
| 6 | order_id | customer | total |
| 7 | ORD-1001 | Acme | 500 |
How crease handles it. Deferred to v1.5. v1 supports one entity per tab. To handle this reliably we'll use
locate.cell_rangeso each entity can be bound to a rectangular sub-region of the tab.
6. Inline subtotals interspersed with data¶
A summary row appears after each grouping, mixed with regular data.
| A | B | C | |
|---|---|---|---|
| 1 | region | customer | total |
| 2 | North | Globex | 500 |
| 3 | North | Initech | 750 |
| 4 | North subtotal | 1250 | |
| 5 | South | Hooli | 900 |
| 6 | South subtotal | 900 | |
How crease handles it. Deferred to v1.5. Would require a row-skip predicate (e.g.
skip_row_if: cell_at(column: 0) matches "/subtotal/i"). For v1 we expect customers to remove subtotals, or for the validator to flag them aswrong_typesince the subtotal cells often don't match the declared field types.
7. Date-format chaos¶
The same column may carry several date encodings.
| A (date) | |
|---|---|
| 1 | 2025-01-15 |
| 2 | 01/15/2025 |
| 3 | 15-Jan-2025 |
| 4 | 45672 |
| 5 | Jan 15, 2025 |
Row 4 is an Excel date serial number (days since 1900-01-01). Row 1 is the "correct" ISO form. The other three are common operator-typed variations.
How crease handles it. The extractor coerces the dominant format; outliers emit
wrong_typeper row. For columns with strongly mixed encodings, prefertype: stringand parse downstream — or addpattern:to enforce one shape and route bad rows to human review.
8. Numbers stored as text (leading apostrophe)¶
Excel's classic gotcha — a leading ' forces the cell to text, making
'12345 display as 12345 but be a string under the hood.
| A (sku) | B (quantity) | |
|---|---|---|
| 1 | '00123 | 10 |
| 2 | '00124 | 20 |
How crease handles it. openpyxl reads these as strings. If the field is
type: string, no problem. Iftype: integer, coercion may succeed (it strips the leading apostrophe) or emitwrong_typeif there are non-numeric chars.
9. Null sentinels¶
Operators encode "missing" with many tokens. None of these are the same as a truly blank cell.
| A (email) | |
|---|---|
| 1 | a@example.com |
| 2 | N/A |
| 3 | TBD |
| 4 | - |
| 5 | — |
| 6 | (blank) |
| 7 |
How crease handles it. Library-default
null_tokenscollapsesN/A,TBD,-,—,(blank),NaN, and a handful of other common sentinels tonullduring extraction. Override at template or field level —null_tokens: []disables,null_tokens: ["UNKNOWN"]extends.
10. Negative numbers as parentheses¶
A finance convention: (1,234.56) means -1234.56.
| A (amount) | |
|---|---|
| 1 | 1,234.56 |
| 2 | (842.10) |
| 3 | 500.00 |
How crease handles it. Excel's number formatting hides the sign, but the underlying value is numeric. openpyxl reads the actual stored value, not the display. Only an issue when the cell is text-stored as
"(842.10)"— thenwrong_typefires.
11. Currency-formatted numbers¶
$1,234.56 may be a number with currency formatting (fine) or a string (not fine).
| A (price) | |
|---|---|
| 1 | $1,234.56 |
| 2 | €90.00 |
| 3 | £75.50 |
How crease handles it. Same as #10 — depends on whether Excel stored a number or a string. Customers who mix currencies in one column are flagging a real semantic problem; consider a
currencyfield alongsideamount.
12. Boolean variations¶
Same concept, multiple encodings.
| A (active) | B (active) | C (active) | |
|---|---|---|---|
| 1 | Yes | True | Y |
| 2 | No | False | N |
| 3 | yes | true | 1 |
| 4 | NO | FALSE | 0 |
How crease handles it. Library-default
true_values/false_valuescovers the common cases (Yes/No,True/False,Y/N,1/0, with casing variants). Override per field for unusual encodings.
13. Headers with whitespace / mixed case / smart-quote drift¶
Same column, three operators, three different headers.
| operator A | operator B | operator C | |
|---|---|---|---|
| header text | Order ID | order_id | Order ID |
Operator A: title case. Operator B: snake_case. Operator C: title case with a trailing space and a non-breaking space between words.
How crease handles it. The extractor normalises headers (NBSP→space, collapse internal whitespace runs, trim, lower-case) before matching
source_column. The template'ssource_column: "Order ID"matches all three — and also matches headers that wrap mid-label ("Order \nID", common when an operator widens a column after Excel auto-wrapped the text) or contain a stray double-space typo ("Order ID").
14. Cells with embedded line breaks¶
A single cell carrying multiple values separated by \n.
| A (address) | |
|---|---|
| 1 | 123 Main St Suite 400 Boston, MA 02101 |
How crease handles it. Stays as a single string with
\nseparators. If the consumer wants it split, that's a downstream concern.
15. Hidden columns / rows¶
Excel allows hiding rows/columns; visually invisible but still in the file.
How crease handles it. By default, crease reads hidden cells as if they were visible. If a customer intentionally hides "draft" rows, those will appear in extraction — set
locate.skip_hidden_rows: trueto exclude them.
16. "Notes" / "Instructions" tabs¶
Operator-facing tabs that aren't data.
How crease handles it. List them in
ignore_tabs:. The LLM is expected to spot common names (Notes,Instructions,Cover Page, emptySheet1) during inference and add them automatically.
Fixture: multi_tab_acme includes a Notes tab in ignore_tabs.
17. Cover sheet that lists the rest of the workbook¶
A meta-tab that says "see Region: North, Region: South for data."
| A | B | |
|---|---|---|
| 1 | Tabs in this report | |
| 2 | Region - North | 30 rows |
| 3 | Region - South | 22 rows |
| 4 | Region - West | 15 rows |
How crease handles it. Deferred to v2. This is a meta-template that crease could in principle parse first to learn which tabs to look at. For v1, the operator declares the pattern in
tab_patterndirectly.
18. Repeated header rows¶
For "print-friendly" reports, headers repeat every N rows.
| A | B | |
|---|---|---|
| 1 | order_id | total |
| 2 | ORD-1001 | 500 |
| 3 | ORD-1002 | 750 |
| 4 | order_id | total |
| 5 | ORD-1003 | 900 |
How crease handles it. Deferred to v1.5. Adds
skip_row_if: matches_headerto drop rows whose values equal the header text.
19. Cells with formulas instead of values¶
=SUM(B2:B10) instead of 825.
How crease handles it.
openpyxl.load_workbook(path, data_only=True)reads the cached value Excel computed, not the formula text. Crease usesdata_only=Truethroughout. If a customer sends a file that was never opened in Excel (cache is stale), the cell may come back asNoneor the formula string — emitextraction_failedwith a message suggesting they open + save the file in Excel.Backend divergence.
python-calamine(the default backend) does not expose a cached-value mode and returns an empty string for formula cells that haven't been pre-evaluated; crease normalizes those toNone. If a file relies on Excel-computed formula values, passengine="openpyxl"so crease can read the cached results, or open + save the file in Excel before extracting.
19a. Excel error cells (#REF!, #N/A, #DIV/0!)¶
A formula that didn't resolve produces a visible error code in the cell.
How crease handles it. Both backends surface error cells as
Nonerather than the literal error text.python-calaminedoes not expose the cell error type at all in its Python API; openpyxl withdata_only=Truealso returnsNonefor these. The downstream coercion layer therefore reports the field as missing rather than mistyped. If you need to distinguish "formula broke" from "operator left blank", treat any unexpectedNonein a numeric column as a signal to inspect the source workbook.
20. Tabs named with version suffixes¶
Orders_FINAL_v3_revised, Orders_FINAL_v3_revised_REAL_FINAL, etc.
How crease handles it. Tab patterns can be loose:
tab_pattern: ^Ordersmatches any of those. Better: a v1.5 fuzzy-match mode that the LLM picks based on examples.
What this catalog is for¶
- Operators — recognize whether their files fit one of these patterns before submitting; if not, expect re-onboarding.
- Engineers — understand the scope of weirdness the system commits to.
- The LLM (via the inference prompt) — examples in this doc inform the
system prompt for
infer_template. - The roadmap — each "Deferred" tag is a tracked item; we ship them as customers actually hit them, not speculatively.