Flow Mapping Audit

When a database (e.g. BAFU in EcoSpold1) is characterized by a method (e.g. EF3.1 adapted, exported from SimaPro), the engine has to link each characterization factor to the matching elementary flow. This linking relies on a cascade — UUID → CAS → normalized name → hand-curated synonym pair — and is never perfect. This guide describes how to detect, diagnose, and close those gaps.

For the cascade itself, see Flow Mapping.

Method.Mapping.lookupCFForFlow runs the cascade per flow. A flow that matches by UUID needs no further work; otherwise the engine tries CAS, then a normalized-name lookup, then the synonym pair table from data/flows.csv. When none of those produces a hit, the flow goes uncharacterized and does not contribute to the score.
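The cascade order can be pictured with a minimal Python sketch. This is not the engine's code: the Flow and CF records, normalize, and lookup_cf are hypothetical names for illustration, the normalization rule is a guess, the factor values are placeholders, and compartment handling is left out entirely.

```python
import re
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Flow:
    name: str
    uuid: Optional[str] = None
    cas: Optional[str] = None


@dataclass(frozen=True)
class CF:
    name: str
    factor: float  # placeholder value, not a real characterization factor
    uuid: Optional[str] = None
    cas: Optional[str] = None


def normalize(name: str) -> str:
    """Lowercase, turn punctuation into spaces, collapse whitespace (assumed rule)."""
    return " ".join(re.sub(r"[^a-z0-9]+", " ", name.lower()).split())


def lookup_cf(flow, cfs, synonym_pairs):
    """Return (cf, stage) for the first cascade stage that hits, else None.

    synonym_pairs is a set of (flow name, CF name) tuples, mirroring the
    two-column rows of data/flows.csv.
    """
    cfs = list(cfs)
    stages = [
        ("uuid", lambda cf: flow.uuid is not None and cf.uuid == flow.uuid),
        ("cas", lambda cf: flow.cas is not None and cf.cas == flow.cas),
        ("name", lambda cf: normalize(cf.name) == normalize(flow.name)),
        ("synonym", lambda cf: (flow.name, cf.name) in synonym_pairs),
    ]
    # Stages are tried in order; a later stage only runs if every earlier one missed.
    for stage, matches in stages:
        for cf in cfs:
            if matches(cf):
                return cf, stage
    return None  # uncharacterized: contributes nothing to the score
```

A None result corresponds to the uncharacterized case described above; the returned stage name shows which rung of the cascade produced the link.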

Two failure modes to keep separate in your head:

  1. Genuine method gap. The CF method has no characterization for the substance/compartment in question. The flow being uncharacterized is the correct outcome.
  2. Mapping bug. A homologous CF exists in the method, but the names don’t match closely enough for the cascade to bridge them. The flow should contribute. This is what we want to catch and fix.

The post-scoring suggester distinguishes the two by attaching a list of similar CFs (similar_cfs) to each uncharacterized flow. An empty list = genuine gap. A non-empty list = candidate mapping bug, ranked by similarity.
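As an illustration of that split, the sketch below partitions a list of uncharacterized flows. Only similar_cfs comes from this guide; the dict layout and the name, score, and reason keys are assumptions about what the diagnostics might look like.

```python
def triage(uncharacterized):
    """Split uncharacterized flows into genuine gaps and candidate mapping bugs.

    Each flow is assumed to be a dict with a "name" and a "similar_cfs" list
    whose entries carry "name", "score" and "reason" (hypothetical layout).
    Input order is preserved, so a contribution-ranked list stays ranked.
    """
    gaps, candidates = [], []
    for flow in uncharacterized:
        similar = flow.get("similar_cfs") or []
        if not similar:
            gaps.append(flow["name"])  # empty list: genuine method gap
        else:
            best = max(similar, key=lambda cf: cf["score"])
            candidates.append(
                (flow["name"], best["name"], best["reason"], best["score"])
            )
    return gaps, candidates
```

The candidates list is only a review queue: each entry still needs the semantic check before it becomes a synonym pair.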

Method.Mapping.findSimilarCFs stacks three signals; each candidate carries the reason that won, so you know what to verify:

  • jaccard — token overlap on the normalized names. Catches word-order and punctuation variants (e.g. "Methane, biogenic" ↔ "Methane biogenic").
  • synonym_expansion — token overlap after expanding both sides via the vendored PubChem snapshot. This is what bridges "CO2" ↔ "Carbon dioxide" — pure tokenization can never see they relate.
  • cas_bridge — when the flow’s CAS matches a CF’s CAS, the candidate is surfaced at score 0.95 regardless of name overlap. Highest-confidence reason; catches cases where one side has CAS and the other doesn’t.

Score range is [0, 1]; the headline value is the max of the three signals.
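A rough Python sketch of how three such signals could be combined follows. It is an illustration under assumptions, not Method.Mapping.findSimilarCFs itself: the tokenizer, the synonym table keyed by lowercased name, and the function names are all hypothetical; only the signal names, the 0.95 CAS score, and the max rule come from the description above.

```python
import re

CAS_BRIDGE_SCORE = 0.95  # fixed score stated in this guide


def tokens(name):
    """Lowercased alphanumeric tokens of a name (assumed tokenization)."""
    return set(re.sub(r"[^a-z0-9]+", " ", name.lower()).split())


def jaccard(a, b):
    """Size of the intersection over size of the union; 0.0 for two empty sets."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def expand(name, synonyms):
    """Tokens of the name plus tokens of each synonym listed for it."""
    out = tokens(name)
    for syn in synonyms.get(name.lower(), ()):
        out |= tokens(syn)
    return out


def score(flow_name, flow_cas, cf_name, cf_cas, synonyms):
    """Return (headline score, winning reason) for one flow/CF pair."""
    signals = {
        "jaccard": jaccard(tokens(flow_name), tokens(cf_name)),
        "synonym_expansion": jaccard(
            expand(flow_name, synonyms), expand(cf_name, synonyms)
        ),
        "cas_bridge": CAS_BRIDGE_SCORE if flow_cas and flow_cas == cf_cas else 0.0,
    }
    # On ties, max() keeps the first key in dict order, so plain jaccard wins.
    reason = max(signals, key=signals.get)
    return signals[reason], reason
```

With an empty synonym table the synonym_expansion signal collapses to plain Jaccard, so "CO2" against "Carbon dioxide" scores 0 until the table lists one as a synonym of the other.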

data/chem_synonyms.csv is a vendored snapshot of PubChem synonyms, filtered down to the CAS numbers actually present in the loaded databases. It feeds the synonym_expansion signal and is the most effective lever to widen the suggester’s recall without manual work.

Regenerate when: a new database is added, or a new CAS appears in an updated database release.

How:

scripts/build_chem_synonyms.py \
--extract-cas-from /path/to/bafu /path/to/agribalyse \
--out data/chem_synonyms.csv

The script politely paces itself against PubChem’s REST API (~5 requests/second). For a few thousand CAS that’s a few minutes. The output is deterministic given the same input set, so commit it as a single atomic change: data: regenerate chem_synonyms snapshot for <reason>. Scoring stays reproducible from the vendored CSV — no runtime network call.
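The two properties mentioned here, pacing and deterministic output, can be sketched in a few lines of Python. This is not the contents of scripts/build_chem_synonyms.py: Pacer, build_rows, and the fetch_synonyms callback (a stand-in for whatever performs the PubChem request) are hypothetical names, and the two-column row layout is an assumption.

```python
import time


class Pacer:
    """Keep successive calls at least 1 / max_per_second seconds apart."""

    def __init__(self, max_per_second=5.0, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = 1.0 / max_per_second
        self.clock = clock
        self.sleep = sleep
        self._last = None

    def wait(self):
        now = self.clock()
        if self._last is not None:
            remaining = self.min_interval - (now - self._last)
            if remaining > 0:
                self.sleep(remaining)
                now = self.clock()
        self._last = now


def build_rows(cas_numbers, fetch_synonyms, pacer):
    """Return (cas, synonym) rows in a stable order for a given input set.

    fetch_synonyms is a caller-supplied function (hypothetical) that returns
    the synonyms for one CAS number.
    """
    rows = []
    for cas in sorted(set(cas_numbers)):  # dedupe and sort: stable across runs
        pacer.wait()  # one paced request per CAS number
        for synonym in sorted(set(fetch_synonyms(cas))):
            rows.append((cas, synonym))
    return rows
```

Sorting both the CAS numbers and the synonyms is what makes a regenerated file diff cleanly against the committed one, assuming the upstream data has not changed.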

The structural changes (suggester, diagnostics, comparison tool) get committed once. The recurring work is enriching data/flows.csv with hand-curated synonym pairs PubChem doesn’t capture (LCA-specific phrasings, package-naming quirks). Run this loop per database/method pair you care about:

  1. Baseline the gap. Pick a representative activity that exists in both the trusted database (e.g. SimaPro / Agribalyse) and the database under test (e.g. BAFU):

    compare_impacts
    database_a=BAFU process_id_a=<bafu-pid> method_id_a=<EF3.1>
    database_b=Agribalyse process_id_b=<simapro-pid> method_id_b=<EF3.1>

    Record delta.relative_pct. That’s the metric you’re driving down.

  2. Audit the unmatched flows.

    get_flow_mapping
    database=BAFU
    method_id=<EF3.1>
    verbose=true
    process_id=<bafu-pid>

    The unmatched_db_flows list is ranked by inventory contribution, so the top entries are where your effort matters most.

  3. Confirm semantically. For each top entry, look at the candidates list. Open the BAFU flow and the suggested CF in their source files (XML, CSV) and verify they’re the same substance — same CAS, same compartment if specified, same chemistry. Don’t trust the score blindly: a high jaccard score on short names can mislead.

  4. Add a synonym pair. Append one row to data/flows.csv:

    "BAFU exact name","EF3.1 exact name"

    Commit each pair (or a small thematic batch — e.g. all radon isotopes) as its own commit:

    flows: link "<bafu name>" ↔ "<ef name>"

    Atomic commits keep git bisect and revert trivial when a pair turns out to be wrong.

  5. Verify the delta moved. Re-run step 1. The delta.relative_pct should shrink. If it did not, treat the pair as wrong and revert it.

  6. Stop when the residual delta.relative_pct is within an acceptable band (e.g. < 5% per impact category) or when loUncharacterized (via get_impacts include_diagnostics=true) contains only entries with similar_cfs == [] — those are genuine method gaps, not mapping bugs.
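The stop rule in step 6 can be written down as a small Python check. The input shapes are assumptions: a dict of delta.relative_pct values keyed by impact category, and a list of uncharacterized-flow dicts carrying similar_cfs. The 5% default mirrors the example band above and is not a recommendation.

```python
def should_stop(relative_pct_by_category, uncharacterized, band_pct=5.0):
    """True when either stop condition from step 6 holds.

    relative_pct_by_category: {impact category: delta.relative_pct} (assumed shape)
    uncharacterized: list of dicts with a "similar_cfs" list (assumed shape)
    """
    # Condition 1: every category's residual delta is inside the band.
    within_band = bool(relative_pct_by_category) and all(
        abs(pct) < band_pct for pct in relative_pct_by_category.values()
    )
    # Condition 2: nothing left but genuine method gaps.
    only_gaps = all(not flow.get("similar_cfs") for flow in uncharacterized)
    return within_band or only_gaps
```

The absolute value matters because the delta can be negative when the database under test scores higher than the trusted one.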

When to add a flows.csv pair vs regenerate chem_synonyms

  • Use data/flows.csv for LCA-specific synonyms PubChem doesn’t have — packaging-naming variants, region-specific abbreviations, dataset authors’ personal phrasings.
  • Use scripts/build_chem_synonyms.py for chemical synonyms — formulas, IUPAC variants, common trade names. These belong in PubChem and the snapshot picks them up automatically.

The two lists are not redundant. flows.csv is small and hand-curated; chem_synonyms.csv is large and machine-generated. A pair that “feels chemical” probably belongs in the PubChem snapshot (chem_synonyms.csv); a pair that “feels process-specific” belongs in flows.csv.

To enable the snapshot, add this line to your TOML config (the path is resolved relative to the config file):

chem-synonyms = "data/chem_synonyms.csv"

Without that line, the snapshot is treated as empty and the suggester degrades to plain Jaccard — still useful, just blind to formula↔name pairs like CO2↔Carbon dioxide.

Today this procedure targets EF3.1 adapted — the SimaPro CSV export. The corresponding data/chem_synonyms.csv and data/flows.csv entries are scoped to that pair. The EF3.1 original (ILCD XMLs) is a distinct path with its own UUID-from-XML provenance and is not yet covered by this audit workflow.