Flow Mapping Audit
When a database (e.g. BAFU in EcoSpold1) is characterized by a method (e.g. EF3.1 adapted, exported from SimaPro), the engine has to link each characterization factor to the matching elementary flow. This linking relies on a cascade — UUID → CAS → normalized name → hand-curated synonym pair — and is never perfect. This guide describes how to detect, diagnose, and close those gaps.
For the cascade itself, see Flow Mapping.
How linking works
Method.Mapping.lookupCFForFlow runs the cascade per flow. A flow that
matches by UUID needs no further work; otherwise the engine tries CAS,
then a normalized-name lookup, then the synonym pair table from
data/flows.csv. When none of these produces a hit, the flow goes
uncharacterized: it does not contribute to the score.
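The cascade order can be sketched as follows. This is an illustrative Python sketch, not the engine's code: the real entry point is Method.Mapping.lookupCFForFlow, while the `Flow` and `CF` records, the `normalize` rule, and the `synonym_pairs` dictionary are assumptions made for this example. Compartment handling is left out for brevity.

```python
import re
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Flow:  # hypothetical record for an elementary flow
    name: str
    uuid: Optional[str] = None
    cas: Optional[str] = None


@dataclass(frozen=True)
class CF:  # hypothetical record for a characterization factor
    name: str
    value: float
    uuid: Optional[str] = None
    cas: Optional[str] = None


def normalize(name):
    # Assumed normalization: lowercase, punctuation collapsed to single spaces.
    return " ".join(re.findall(r"[a-z0-9]+", name.lower()))


def lookup_cf(flow, cfs, synonym_pairs):
    """Return (cf, reason) for the first cascade step that hits, else None."""
    by_uuid = {cf.uuid: cf for cf in cfs if cf.uuid}
    by_cas = {cf.cas: cf for cf in cfs if cf.cas}
    by_name = {normalize(cf.name): cf for cf in cfs}

    if flow.uuid and flow.uuid in by_uuid:
        return by_uuid[flow.uuid], "uuid"
    if flow.cas and flow.cas in by_cas:
        return by_cas[flow.cas], "cas"
    key = normalize(flow.name)
    if key in by_name:
        return by_name[key], "name"
    # synonym_pairs maps a normalized flow name to a normalized CF name,
    # standing in for the rows of data/flows.csv.
    target = synonym_pairs.get(key)
    if target in by_name:
        return by_name[target], "synonym"
    return None  # uncharacterized: contributes nothing to the score
```

Returning the reason alongside the match makes it easy to see which step of the cascade resolved a given flow when auditing.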
Two failure modes to keep separate in your head:
- Genuine method gap. The CF method has no characterization for the substance/compartment in question. The flow being uncharacterized is the correct outcome.
- Mapping bug. A homologous CF exists in the method, but the names don’t match closely enough for the cascade to bridge them. The flow should contribute. This is what we want to catch and fix.
The post-scoring suggester distinguishes the two by attaching a list of
similar CFs (similar_cfs) to each uncharacterized flow. An empty
list = genuine gap. A non-empty list = candidate mapping bug, ranked by
similarity.
The three similarity signals
Method.Mapping.findSimilarCFs stacks three signals; each candidate
carries the reason that won, so you know what to verify:
- jaccard: token overlap on the normalized names. Catches word-order and punctuation variants (e.g. "Methane, biogenic" ↔ "Methane biogenic").
- synonym_expansion: token overlap after expanding both sides via the vendored PubChem snapshot. This is what bridges "CO2" ↔ "Carbon dioxide"; pure tokenization can never see that they relate.
- cas_bridge: when the flow’s CAS matches a CF’s CAS, the candidate is surfaced at score 0.95 regardless of name overlap. Highest-confidence reason; catches cases where one side has CAS and the other doesn’t.
Score range is [0, 1]; the headline value is the max of the three
signals.
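A minimal sketch of how the three signals could combine, in Python for illustration. The tokenizer, the shape of the `synonyms` dictionary (a lowercase name mapped to its aliases, standing in for data/chem_synonyms.csv), and the function names are assumptions; only the three reason labels, the 0.95 CAS score, and the max-of-signals rule come from the description above.

```python
import re


def tokens(name):
    # Assumed tokenizer: lowercase alphanumeric runs.
    return set(re.findall(r"[a-z0-9]+", name.lower()))


def jaccard(a, b):
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def expand(name, synonyms):
    # Add the tokens of every known alias to the name's own tokens.
    out = tokens(name)
    for alias in synonyms.get(name.lower(), ()):
        out |= tokens(alias)
    return out


def similarity(flow_name, flow_cas, cf_name, cf_cas, synonyms):
    """Return (score, reason) for the strongest of the three signals."""
    signals = [
        (jaccard(tokens(flow_name), tokens(cf_name)), "jaccard"),
        (jaccard(expand(flow_name, synonyms), expand(cf_name, synonyms)),
         "synonym_expansion"),
        (0.95 if flow_cas and flow_cas == cf_cas else 0.0, "cas_bridge"),
    ]
    # max() keeps the first entry on ties, so plain jaccard wins over
    # synonym_expansion when both reach the same score.
    return max(signals, key=lambda s: s[0])
```

With an empty `synonyms` dictionary the second signal collapses to the first, which mirrors the documented fallback to plain Jaccard when no snapshot is configured.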
The chemical synonyms snapshot
data/chem_synonyms.csv is a vendored snapshot of PubChem synonyms,
filtered down to the CAS numbers actually present in the loaded
databases. It feeds the synonym_expansion signal and is the most
effective lever to widen the suggester’s recall without manual work.
Regenerate when: a new database is added, or a new CAS appears in an updated database release.
How:

    scripts/build_chem_synonyms.py \
      --extract-cas-from /path/to/bafu /path/to/agribalyse \
      --out data/chem_synonyms.csv

The script politely paces itself against PubChem’s REST API
(~5 requests/second). For a few thousand CAS that’s a few minutes. The
output is deterministic given the same input set, so commit it as a
single atomic change with a message such as
"data: regenerate chem_synonyms snapshot for <reason>". Scoring stays
reproducible from the vendored CSV, with no runtime network call.
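The two properties mentioned here, pacing and deterministic output, can be illustrated with a short Python sketch. It is not taken from scripts/build_chem_synonyms.py: the `paced` and `sorted_rows` helpers are hypothetical, and no PubChem call is made. Sorting is one assumed way to get the same file from the same input set.

```python
import time


def paced(items, per_second=5.0, clock=time.monotonic, sleep=time.sleep):
    """Yield items no faster than `per_second`, sleeping between them as needed."""
    interval = 1.0 / per_second
    next_slot = clock()
    for item in items:
        delay = next_slot - clock()
        if delay > 0:
            sleep(delay)
        # Schedule the next slot one interval after this one was released.
        next_slot = max(next_slot, clock()) + interval
        yield item


def sorted_rows(synonyms_by_cas):
    # Sorting by CAS, then by synonym, makes the row order independent of
    # the order in which responses arrived.
    return [(cas, syn)
            for cas in sorted(synonyms_by_cas)
            for syn in sorted(set(synonyms_by_cas[cas]))]
```

A fetch loop would wrap its list of CAS numbers in `paced(...)` and write `sorted_rows(...)` to the CSV at the end, so a re-run over the same databases yields an empty diff.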
The audit loop
The structural changes (suggester, diagnostics, comparison tool) get
committed once. The recurring work is enriching data/flows.csv with
hand-curated synonym pairs PubChem doesn’t capture (LCA-specific
phrasings, package-naming quirks). Run this loop per database/method pair
you care about:
1. Baseline the gap. Pick a representative activity that exists in both the trusted database (e.g. SimaPro / Agribalyse) and the database under test (e.g. BAFU):

       compare_impacts
         database_a=BAFU process_id_a=<bafu-pid> method_id_a=<EF3.1>
         database_b=Agribalyse process_id_b=<simapro-pid> method_id_b=<EF3.1>

   Record delta.relative_pct. That’s the metric you’re driving down.

2. Audit the unmatched flows.

       get_flow_mapping
         database=BAFU
         method_id=<EF3.1>
         verbose=true
         process_id=<bafu-pid>

   The unmatched_db_flows list is ranked by inventory contribution, so the top entries are where your effort matters most.

3. Confirm semantically. For each top entry, look at the candidates list. Open the BAFU flow and the suggested CF in their source files (XML, CSV) and verify they’re the same substance: same CAS, same compartment if specified, same chemistry. Don’t trust the score blindly: a high jaccard score on short names can mislead.

4. Add a synonym pair. Append one row to data/flows.csv:

       "BAFU exact name","EF3.1 exact name"

   Commit each pair (or a small thematic batch, e.g. all radon isotopes) as its own commit:

       flows: link "<bafu name>" ↔ "<ef name>"

   Atomic commits keep git bisect and revert trivial when a pair turns out to be wrong.

5. Verify the delta moved. Re-run step 1. The delta.relative_pct should shrink. If it didn’t, the synonym was wrong: revert.

6. Stop when the residual delta.relative_pct is within an acceptable band (e.g. < 5% per impact category) or when loUncharacterized (via get_impacts include_diagnostics=true) contains only entries with similar_cfs == []. Those are genuine method gaps, not mapping bugs.
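The stopping rule of the loop can be written down as a small Python check. This is a sketch under stated assumptions: the formula used for the relative delta (signed difference relative to the trusted score) is a guess at what delta.relative_pct reports, and the input shapes (a dictionary of deltas per impact category, a list of dictionaries carrying a similar_cfs list) are hypothetical.

```python
def relative_pct(score_test, score_ref):
    # Assumed definition: signed difference relative to the trusted score.
    # A zero reference score would need separate handling.
    return 100.0 * (score_test - score_ref) / score_ref


def should_stop(deltas_pct, uncharacterized, band_pct=5.0):
    """Decide whether the audit loop can end.

    deltas_pct: {impact category: relative delta in percent}
    uncharacterized: list of dicts, each with a 'similar_cfs' list
    """
    within_band = all(abs(d) < band_pct for d in deltas_pct.values())
    only_genuine_gaps = all(not f["similar_cfs"] for f in uncharacterized)
    return within_band or only_genuine_gaps
```

Either condition alone ends the loop: a large residual delta is acceptable once every remaining uncharacterized flow has an empty similar_cfs list, because no synonym pair could close it.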
When to add a flows.csv pair vs regenerate chem_synonyms
- Use data/flows.csv for LCA-specific synonyms PubChem doesn’t have: packaging-naming variants, region-specific abbreviations, dataset authors’ personal phrasings.
- Use scripts/build_chem_synonyms.py for chemical synonyms: formulas, IUPAC variants, common trade names. These belong in PubChem and the snapshot picks them up automatically.
The two lists are not redundant. flows.csv is small and hand-curated;
chem_synonyms.csv is large and machine-generated. A pair that
“feels chemical” probably belongs in PubChem; a pair that “feels
process-specific” belongs in flows.csv.
Configuration
Add to your TOML config (path resolved relative to the config file):

    chem-synonyms = "data/chem_synonyms.csv"

Without that line, the snapshot is treated as empty and the suggester degrades to plain Jaccard: still useful, just blind to formula↔name pairs like CO2↔Carbon dioxide.
Scope note
Today this procedure targets EF3.1 adapted, the SimaPro CSV export.
The corresponding data/chem_synonyms.csv and data/flows.csv entries
are scoped to that pair. The EF3.1 original (ILCD XMLs) is a
distinct path with its own UUID-from-XML provenance and is not yet
covered by this audit workflow.