Business Glossary
This plugin pulls business glossary metadata from a yaml-formatted file. An example of one such file is located in the examples directory here.
CLI based Ingestion
Starter Recipe
Check out the following recipe to get started with ingestion! See below for full configuration options.
For general pointers on writing and running a recipe, see our main recipe guide.
source:
type: datahub-business-glossary
config:
# Coordinates
file: /path/to/business_glossary_yaml
enable_auto_id: true # recommended to set to true so datahub will auto-generate guids from your term names
# sink configs if needed
Config Details
- Options
- Schema
Note that a .
is used to denote nested fields in the YAML recipe.
Field | Description |
---|---|
file ✅ One of string, string(path) | File path or URL to business glossary file to ingest. |
enable_auto_id boolean | Generate guid urns instead of a plaintext path urn with the node/term's hierarchy. Default: False |
The JSONSchema for this configuration is inlined below.
{
"title": "BusinessGlossarySourceConfig",
"type": "object",
"properties": {
"file": {
"title": "File",
"description": "File path or URL to business glossary file to ingest.",
"anyOf": [
{
"type": "string"
},
{
"type": "string",
"format": "path"
}
]
},
"enable_auto_id": {
"title": "Enable Auto Id",
"description": "Generate guid urns instead of a plaintext path urn with the node/term's hierarchy.",
"default": false,
"type": "boolean"
}
},
"required": [
"file"
],
"additionalProperties": false
}
Business Glossary File Format
The business glossary source file should be a .yml file with the following top-level keys:
Glossary: the top level keys of the business glossary file
Example Glossary:
version: "1" # the version of business glossary file config the config conforms to. Currently the only version released is `1`.
source: DataHub # the source format of the terms. Currently only supports `DataHub`
owners: # owners contains two nested fields
users: # (optional) a list of user IDs
- njones
groups: # (optional) a list of group IDs
- logistics
url: "https://github.com/datahub-project/datahub/" # (optional) external url pointing to where the glossary is defined externally, if applicable
nodes: # list of child **GlossaryNode** objects. See **GlossaryNode** section below
...
GlossaryNode: a container of GlossaryNode and GlossaryTerm objects
Example GlossaryNode:
- name: "Shipping" # name of the node
id: "Shipping-Logistics" # (optional) custom identifier for the node
description: Provides terms related to the shipping domain # description of the node
owners: # (optional) owners contains 2 nested fields
users: # (optional) a list of user IDs
- njones
groups: # (optional) a list of group IDs
- logistics
nodes: # list of child **GlossaryNode** objects
...
knowledge_links: # (optional) list of **KnowledgeCard** objects
- label: Wiki link for shipping
url: "https://en.wikipedia.org/wiki/Freight_transport"
GlossaryTerm: a term in your business glossary
Example GlossaryTerm:
- name: "Full Address" # name of the term
id: "Full-Address-Details" # (optional) custom identifier for the term
description: A collection of information to give the location of a building or plot of land. # description of the term
owners: # (optional) owners contains 2 nested fields
users: # (optional) a list of user IDs
- njones
groups: # (optional) a list of group IDs
- logistics
term_source: "EXTERNAL" # one of `EXTERNAL` or `INTERNAL`. Whether the term is coming from an external glossary or one defined in your organization.
source_ref: FIBO # (optional) if external, what is the name of the source the glossary term is coming from?
source_url: "https://www.google.com" # (optional) if external, what is the url of the source definition?
inherits: # (optional) list of **GlossaryTerm** that this term inherits from
- Privacy.PII
contains: # (optional) a list of **GlossaryTerm** that this term contains
- Shipping.ZipCode
- Shipping.CountryCode
- Shipping.StreetAddress
custom_properties: # (optional) a map of key/value pairs of arbitrary custom properties
- is_used_for_compliance_tracking: "true"
knowledge_links: # (optional) a list of **KnowledgeCard** related to this term. These appear as links on the glossary node's page
- url: "https://en.wikipedia.org/wiki/Address"
label: Wiki link
domain: "urn:li:domain:Logistics" # (optional) domain name or domain urn
ID Management and URL Generation
The business glossary provides two primary ways to manage term and node identifiers:
Custom IDs: You can explicitly specify an ID for any term or node using the
id
field. This is recommended for terms that need stable, predictable identifiers:terms:
- name: "Response Time"
id: "support-response-time" # Explicit ID
description: "Target time to respond to customer inquiries"Automatic ID Generation: When no ID is specified, the system will generate one based on the
enable_auto_id
setting:With
enable_auto_id: false
(default):- Node and term names are converted to URL-friendly format
- Spaces within names are replaced with hyphens
- Special characters are removed (except hyphens)
- Case is preserved
- Multiple hyphens are collapsed to single ones
- Path components (node/term hierarchy) are joined with periods
- Example: Node "Customer Support" with term "Response Time" → "Customer-Support.Response-Time"
With
enable_auto_id: true
:- Generates GUID-based IDs
- Recommended for guaranteed uniqueness
- Required for terms with non-ASCII characters
Here's how path-based ID generation works:
nodes:
- name: "Customer Support" # Node ID: Customer-Support
terms:
- name: "Response Time" # Term ID: Customer-Support.Response-Time
description: "Response SLA"
- name: "First Reply" # Term ID: Customer-Support.First-Reply
description: "Initial response"
- name: "Product Feedback" # Node ID: Product-Feedback
terms:
- name: "Response Time" # Term ID: Product-Feedback.Response-Time
description: "Feedback response"
Important Notes:
- Periods (.) are used exclusively as path separators between nodes and terms
- Periods in term or node names themselves will be removed
- Each component of the path (node names, term names) is cleaned independently:
- Spaces to hyphens
- Special characters removed
- Case preserved
- The cleaned components are then joined with periods to form the full path
- Non-ASCII characters in any component trigger automatic GUID generation
- Once an ID is created (either manually or automatically), it cannot be easily changed
- All references to a term (in
inherits
,contains
, etc.) must use its correct ID - Moving terms in the hierarchy does NOT update their IDs:
- The ID retains its original path components even after moving
- This can lead to IDs that don't match the current location
- Consider using
enable_auto_id: true
if you plan to reorganize your glossary
- For terms that other terms will reference, consider using explicit IDs or enable auto_id
Example of how different names are handled:
nodes:
- name: "Data Services" # Node ID: Data-Services
terms:
# Basic term name
- name: "Response Time" # Term ID: Data-Services.Response-Time
description: "SLA metrics"
# Term name with special characters
- name: "API @ Response" # Term ID: Data-Services.API-Response
description: "API metrics"
# Term with non-ASCII (triggers GUID)
- name: "パフォーマンス" # Term ID will be a 32-character GUID
description: "Performance"
To see how these all work together, check out this comprehensive example business glossary file below:
version: "1"
source: DataHub
owners:
users:
- mjames
url: "https://github.com/datahub-project/datahub/"
nodes:
- name: "Data Classification"
id: "Data-Classification" # Custom ID for stable references
description: A set of terms related to Data Classification
knowledge_links:
- label: Wiki link for classification
url: "https://en.wikipedia.org/wiki/Classification"
terms:
- name: "Sensitive Data" # Will generate: Data-Classification.Sensitive-Data
description: Sensitive Data
custom_properties:
is_confidential: "false"
- name: "Confidential Information" # Will generate: Data-Classification.Confidential-Information
description: Confidential Data
custom_properties:
is_confidential: "true"
- name: "Highly Confidential" # Will generate: Data-Classification.Highly-Confidential
description: Highly Confidential Data
custom_properties:
is_confidential: "true"
domain: Marketing
- name: "Personal Information"
description: All terms related to personal information
owners:
users:
- mjames
terms:
- name: "Email" # Will generate: Personal-Information.Email
description: An individual's email address
inherits:
- Data-Classification.Confidential # References parent node path
owners:
groups:
- Trust and Safety
- name: "Address" # Will generate: Personal-Information.Address
description: A physical address
- name: "Gender" # Will generate: Personal-Information.Gender
description: The gender identity of the individual
inherits:
- Data-Classification.Sensitive # References parent node path
- name: "Clients And Accounts"
description: Provides basic concepts such as account, account holder, account provider, relationship manager that are commonly used by financial services providers to describe customers and to determine counterparty identities
owners:
groups:
- finance
type: DATAOWNER
terms:
- name: "Account" # Will generate: Clients-And-Accounts.Account
description: Container for records associated with a business arrangement for regular transactions and services
term_source: "EXTERNAL"
source_ref: FIBO
source_url: "https://spec.edmcouncil.org/fibo/ontology/FBC/ProductsAndServices/ClientsAndAccounts/Account"
inherits:
- Data-Classification.Highly-Confidential # References parent node path
contains:
- Clients-And-Accounts.Balance # References term in same node
- name: "Balance" # Will generate: Clients-And-Accounts.Balance
description: Amount of money available or owed
term_source: "EXTERNAL"
source_ref: FIBO
source_url: "https://spec.edmcouncil.org/fibo/ontology/FBC/ProductsAndServices/ClientsAndAccounts/Balance"
- name: "KPIs"
description: Common Business KPIs
terms:
- name: "CSAT %" # Will generate: KPIs.CSAT
description: Customer Satisfaction Score
Custom ID Specification
Custom IDs can be specified in two ways, both of which are fully supported and acceptable:
- Just the ID portion (simpler approach):
terms:
- name: "Email"
id: "company-email" # Will become urn:li:glossaryTerm:company-email
description: "Company email address"
- Full URN format:
terms:
- name: "Email"
id: "urn:li:glossaryTerm:company-email"
description: "Company email address"
Both methods are valid and will work correctly. The system will automatically handle the URN prefix if you specify just the ID portion.
The same applies for nodes:
nodes:
- name: "Communications"
id: "internal-comms" # Will become urn:li:glossaryNode:internal-comms
description: "Internal communication methods"
Note: Once you select a custom ID, it cannot be easily changed.
Compatibility
Compatible with version 1 of business glossary format. The source will be evolved as newer versions of this format are published.
Code Coordinates
- Class Name:
datahub.ingestion.source.metadata.business_glossary.BusinessGlossaryFileSource
- Browse on GitHub
Questions
If you've got any questions on configuring ingestion for Business Glossary, feel free to ping us on our Slack.