Machine Readable Glossary Generation Tool

Editor's note

Documentation needs to be adjusted for:

Converting formPhrases: MRGT will write expanded formPhrase macros into MRGEntry formPhrases field

The Machine Readable Glossary Generation Tool (MRGT) generates Machine Readable Glossaries (MRGs) for one specific terminology version, or for all terminology versions, that are curated within a specific scope. MRGs come in a specific, well-defined format. They contain some meta-data, followed by a list of so-called MRG entries, one for every term in the scope, which represent concepts and other semantic units that are known within that scope.

The (newly generated) MRGs are meant to be processed by the other tools in the toolbox, regardless of whether such tools are called from within the context of another scope. Because an MRG contains every term that is used in the scope, and includes all the relevant meta-data, it serves as the single, authoritative source of that (version of the) scope's terminology.

Installing the Tool

The tool can be installed from the command line and made globally available by executing

npm install -g @tno-terminology-design/mrgt

Before running the tool from the command line, make sure you have met the necessary prerequisites depending on your operating environment.

CMD.exe (Windows)

Node.js and NPM: Ensure Node.js and NPM are installed.
Global Installation: If you have installed the package globally, confirm the global NPM modules path by running npm config get prefix. The global modules are usually stored under <prefix>/node_modules.
Environment Variables: Add the path to global NPM binaries to your system's PATH environment variable. This should be <prefix> on Windows. To add to PATH, you can edit your environment variables or run set PATH=%PATH%;<prefix> in the CMD.

PowerShell (Windows)

Node.js and NPM: Ensure Node.js and NPM are installed.
Global Installation: Check the global NPM modules path as in CMD.
Environment Variables: Update the PATH environment variable as in CMD. You can also use $env:Path += ";<prefix>" to update the PATH temporarily in the current PowerShell session.

Bash (Linux/Mac)

Node.js and NPM: Ensure Node.js and NPM are installed.
Global Installation: If globally installed, run npm config get prefix to get the global modules path, usually <prefix>/lib/node_modules.
Environment Variables: Add the <prefix>/bin directory to your PATH if it's not already there. You can do this by adding export PATH=$PATH:<prefix>/bin to your ~/.bashrc or ~/.zshrc file.

Calling the Tool

The behavior of the MRGT can be configured per call e.g. by a configuration file and/or command-line parameters. The command-line syntax is as follows:

mrgt [ <paramlist> ]

where <paramlist> is an (optional) list of parameters.

Legend

The columns in the following table are defined as follows:

Key specifies the parameter and further specifications.
Req'd specifies whether (Y) or not (n) the parameter is required to be present when the tool is being called. If required, it MUST be present either in the configuration file or as a command-line parameter.
Description specifies the meaning of the parameter's value, and other things you may need to know, e.g. why it is needed, a required syntax, etc.

If a configuration file is used, the long version of the parameter must be used (without the preceding --).

Key	Req'd	Description
`-c`, `--config <path>`	n	Path (including the filename) of the tool's (YAML) configuration file.
`-h`, `--help`	n	Display help for the command.
`-o`, `--onNotExist <action>`	n	The action in case a `vsntag` was specified, but wasn't found in the SAF.
`-s`, `--scopedir <path>`	n	Path of the scope directory from which the tool is called.
`-v`, `--vsntag <vsntag>`	n	Versiontag for which the MRG needs to be (re)generated.
`-V`, `--version`	n	Output the version number of the tool.

The <action> parameter can take the following values:

`<action>`	Description
`'throw'`	an error is thrown (an exception is raised), and processing will stop.
`'warn'`	a message is displayed (and logged) and processing continues.
`'log'`	a message is written to a log(file) and processing continues.
`'ignore'`	processing continues as if nothing happened.
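
The four `<action>` values can be read as a small dispatch. The following is a minimal sketch of that reading, not the MRGT's actual code: the function name `on_not_exist`, the exception class `VersionNotFound` and the logger name are hypothetical, and Python is used only for illustration.

```python
import logging
import warnings

logger = logging.getLogger("mrgt-sketch")  # hypothetical logger name


class VersionNotFound(Exception):
    """Raised when the requested vsntag is absent and the action is 'throw'."""


def on_not_exist(action: str, vsntag: str) -> bool:
    """Apply an <action> value; return True if processing may continue."""
    message = f"vsntag '{vsntag}' was not found in the SAF"
    if action == "throw":
        raise VersionNotFound(message)   # processing stops
    if action == "warn":
        warnings.warn(message)           # displayed ...
        logger.warning(message)          # ... and logged
        return True
    if action == "log":
        logger.info(message)             # logged only
        return True
    if action == "ignore":
        return True                      # continue as if nothing happened
    raise ValueError(f"unknown action: {action!r}")
```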

Running the Tool

One run of the MRGT either

generates an MRG for one specific terminology version within the current scope (which is the case when the vsntag parameter is specified), or it
generates multiple MRGs, i.e., one for every version of the terminology that is curated within the current scope (which is the case when the vsntag parameter is omitted).

Running the tool comprises the following phases:¹

Constructing a provisional MRG;
Post-processing the entries in that provisional MRG;
Creating/overwriting MRG file(s) in the glossarydir of the current scope.

Phase 1: constructing a provisional MRG

Generating an MRG for a particular version of a terminology starts by reading the SAF of the scope within which that terminology is curated, which exists in the scopedir that was provided as one of the calling parameters. If a vsntag argument is provided, the MRGT searches the versions section of the SAF to find the corresponding entry. This entry has the value of the vsntag parameter either in its vsntag field, or as one of the elements of its altvsntags field. If the SAF does not have a corresponding entry, the action specified in the onNotExist parameter determines whether or not (and how) to proceed.
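
The lookup described above can be sketched as follows. This is an illustration under assumptions: the SAF is taken to be already parsed into a dictionary, and the function name `find_version` and the sample tags are hypothetical.

```python
from typing import Optional


def find_version(saf: dict, vsntag: str) -> Optional[dict]:
    """Return the entry of the SAF's versions section that matches vsntag."""
    for entry in saf.get("versions", []):
        if entry.get("vsntag") == vsntag:
            return entry                       # match on the vsntag field
        if vsntag in (entry.get("altvsntags") or []):
            return entry                       # match on one of the altvsntags
    return None                                # caller applies onNotExist


# Hypothetical SAF content, reduced to the fields used here.
saf = {
    "versions": [
        {"vsntag": "v1.0", "altvsntags": ["latest", "stable"]},
        {"vsntag": "v0.9"},
    ]
}
```

When the function returns nothing, the caller would fall back on the action given by the onNotExist parameter.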

In this phase, for every terminology version that is to be created, one provisional MRG is created that contains a provisional MRG entry for every term contained in the particular version of the terminology. Each provisional MRG entry contains either:

all fields in the header of the curated text that documents its term, or
all fields in the MRG entry that comes from another MRG (typically, but not necessarily, from another scope).

The Term Selection Instruction syntax specifies precisely how provisional MRGs are created.

After a provisional MRG entry is created, the following modifications are made:

the formPhrases field is processed, which means that every element in that set (array) is subjected to the following processing steps:
1. if the element contains a form phrase macro, it is replaced by a set of form phrases that is constructed by processing that form phrase macro - see Form Phrase Macro Expansion for the details and examples. This step effectively enlarges the set (array) of form phrases.
2. the resulting form phrases are converted into a regularized text - that is, they become regularized form phrases - see Text Regularization for the details and examples.

The result is a set of regularized form phrases, which is then used to produce the formPhrases field in the MRG entry.
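
The two processing steps can be pictured as a small pipeline. The sketch below only shows the shape of that pipeline: the macro table and the regularization rule are stand-ins invented for illustration, and the authoritative rules are those in Form Phrase Macro Expansion and Text Regularization.

```python
import re

# Illustrative macro table only (an assumption); the real macros are
# defined in "Form Phrase Macro Expansion".
MACROS = {"{ss}": ["", "s"]}


def expand(phrase: str) -> list[str]:
    """Step 1: replace a macro by each of its alternatives, recursively."""
    for macro, alternatives in MACROS.items():
        if macro in phrase:
            expanded = []
            for alt in alternatives:
                expanded.extend(expand(phrase.replace(macro, alt, 1)))
            return expanded
    return [phrase]


def regularize(text: str) -> str:
    """Step 2 (assumed rule): lowercase; other character runs become '-'."""
    return re.sub(r"[^a-z0-9]+", "-", text.lower()).strip("-")


def process_form_phrases(form_phrases: list[str]) -> list[str]:
    """Expand, regularize, and drop duplicates while keeping the order."""
    result = []
    for phrase in form_phrases:
        for candidate in expand(phrase):
            regular = regularize(candidate)
            if regular and regular not in result:
                result.append(regular)
    return result
```

Under these stand-in rules, `process_form_phrases(["Term{ss}", "Glossary Entry"])` yields `["term", "terms", "glossary-entry"]`.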

tip

An MRG SHOULD NOT have two (or more) MRG entries that have the same element in their formPhrases field, because that would mean that the form phrase is ambiguous, as it refers to two different semantic units.

Storing a provisional MRG in the glossarydir

When the creation of a provisional MRG is complete, a filename mrg.<scopetag>.<vsntag>.yaml is constructed, where:

<scopetag> is the scopetag that is used within the current scope to refer to itself. Its value can be found in the scopetag-field in the scope section of the SAF.
<vsntag> is the versiontag that identifies the version of the terminology for which the MRG contains entries. Its value must be equal to that found in the vsntag-field of the element in the versions section of the SAF from which the MRG was generated.

If a file with that name already exists in the glossarydir of the current scope, it will be deleted. Then, a new file with that name will be created, which will contain:

a terminology section, the contents of which is obtained by copying relevant fields from the terminology section in the SAF;
a scopes section, the contents of which is obtained by copying relevant fields from the scopes section in the SAF;
an entries section, the contents of which consists of the provisional MRG entries of the provisional MRG.

Then, if the <vsntag> part of the filename equals the value of the defaultvsn field in the scope section of the SAF, a copy of that file is created in the glossarydir whose filename is mrg.<scopetag>.yaml, which is the name by which the default MRG of the current scope is referred to.

Next, the MRGT will create a copy of the MRG file for every versiontag that exists in the altvsntags-field of the element in the versions section of the SAF from which the MRG was generated. The copy will contain the same MRG as the file that has just been written. The name of this copied file is mrg.<scopetag>.<altvsntag>.yaml, which is the same name as the MRG file, except that the <vsntag> part of that filename is replaced with the value of the versiontag found in the altvsntags-field.
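
The naming rules above determine which files one versions entry produces. A minimal sketch, assuming the versions entry is available as a dictionary; the function name `mrg_filenames` and the sample tags are hypothetical.

```python
def mrg_filenames(scopetag: str, version: dict, defaultvsn: str) -> list[str]:
    """List the MRG filenames written for one versions entry of the SAF."""
    vsntag = version["vsntag"]
    names = [f"mrg.{scopetag}.{vsntag}.yaml"]        # the MRG file itself
    if vsntag == defaultvsn:
        names.append(f"mrg.{scopetag}.yaml")         # default MRG of the scope
    for alt in version.get("altvsntags") or []:
        names.append(f"mrg.{scopetag}.{alt}.yaml")   # one copy per altvsntag
    return names
```

For a scopetag `myscope`, a versions entry with vsntag `v1` and altvsntags `latest`, and defaultvsn `v1`, this gives `mrg.myscope.v1.yaml`, `mrg.myscope.yaml` and `mrg.myscope.latest.yaml`.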

Phase 2: post processing Synonyms

This phase starts only after all provisional MRGs that the MRGT was instructed to build in this run have been created, and the corresponding files have been added to the glossarydir of the current scope. This allows post processing, e.g. of synonyms, to use the newly generated provisional MRG entries.

When a provisional MRG entry in (one of) the created provisional MRGs has a synonymOf field that contains a term identifier, this identifier will now refer to either

an MRG entry in one of the MRGs that already existed, or
a provisional MRG entry in a provisional MRG that has just been created.

This (possibly provisional) MRG entry is then copied, after which all fields of the provisional MRG entry that contained the term identifier are added to the copy, overwriting any fields that already existed and adding fields that did not yet exist. The resulting data then replaces the provisional MRG entry that contained the term identifier.

Effectively, this means that whenever a term is defined as a synonym of some other term, the corresponding MRG entry will have all fields of this other term, except for those that were specified in the header of the term that is defined as a synonym of that other term.
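
In terms of data, this is a copy followed by an overlay. A minimal sketch, assuming MRG entries are plain dictionaries; the function name and all field names other than synonymOf are hypothetical.

```python
import copy


def resolve_synonym(synonym_entry: dict, target_entry: dict) -> dict:
    """Copy the target entry, then overlay the synonym's own fields."""
    merged = copy.deepcopy(target_entry)         # start from the other term
    merged.update(copy.deepcopy(synonym_entry))  # synonym's fields win
    return merged


# Hypothetical entries for illustration.
target = {"term": "party", "glossaryText": "An entity that ...", "termType": "concept"}
synonym = {"term": "participant", "synonymOf": "party"}

resolved = resolve_synonym(synonym, target)
```

Here `resolved` keeps the `term` and `synonymOf` values of the synonym and inherits `glossaryText` and `termType` from the target; the target entry itself is left unchanged.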

Phase 3: post processing other fields

Now, all provisional MRG entries in all provisional MRGs are processed so as to become usable from the context within which they have been selected. This means that every field in such a provisional MRG entry is discarded if its field name (when converted into lowercase) matches any of the field names in the table below, after which the fields in that table are added with the contents as specified. The MRGT run is concluded after all these modifications have been written to their appropriate MRG files.

Field	Value(s) that are assigned to the fields
`scopetag`	overwrite the `scopetag` field with the `scopetag` field as found in the `scope` section of the SAF.
`locator`	path, relative to `scopedir`/`curatedir`/, of the file that contains the (header of) the curated text.
`navurl`	(localized) path to which browsers navigate in order to see the rendered version of the curated text.
`headingids`	a list of the markdown headings and/or heading ids that are found in the body of the curated text. Note that this body can be either in the curated text file or in a separate body file.
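
The discard-then-add step can be sketched as below. It assumes entries are dictionaries and that the new values have already been computed; the function name `post_process` and the sample values are hypothetical.

```python
GENERATED_FIELDS = ("scopetag", "locator", "navurl", "headingids")


def post_process(entry: dict, generated: dict) -> dict:
    """Drop fields whose lowercased name is in the table, then add fresh ones."""
    kept = {
        name: value
        for name, value in entry.items()
        if name.lower() not in GENERATED_FIELDS   # case-insensitive discard
    }
    kept.update({name: generated[name] for name in GENERATED_FIELDS})
    return kept


# Hypothetical entry copied from another scope, with stale values.
entry = {"term": "party", "scopetag": "other", "NavURL": "https://old.example/x"}
fresh = {
    "scopetag": "myscope",
    "locator": "party.md",
    "navurl": "https://example.org/docs/terms/party",
    "headingids": ["summary"],
}
result = post_process(entry, fresh)
```

Note that the stale `NavURL` field disappears even though its spelling differs in case from `navurl`.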

The following sections elaborate on the construction of (the contents) of some of these fields.

Constructing the `navurl` field

The navurl field is constructed by concatenating website/navpath/curatedir/id, where website, navpath and curatedir are given by the contents of the respective fields in the scope section of the SAF.

The id part is one of the following:

if the scope section of the SAF contains the field bodyFileID, then its content specifies the name of the field that is expected to exist in the header of the curated text, and the value of that field will become the id part. Thus, static site generators such as Docusaurus, which uses the id field to specify this value, can be accommodated.
if the SAF does not specify the bodyFileID field, then id will become the name of the file that contains the rendered version of the body-file as specified in the bodyFile field in the header of the curated text file, or, if that field is empty or non-existent, the name of the curated text file itself.
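
The construction can be sketched as follows. Two points are assumptions made for this illustration: the name of the rendered file is taken to be the source file name without its extension, and the parts are joined with single slashes. The function name `build_navurl` and the sample values are hypothetical.

```python
from pathlib import PurePosixPath


def build_navurl(scope: dict, header: dict, curated_file: str) -> str:
    """Concatenate website/navpath/curatedir/id as described above."""
    body_file_id = scope.get("bodyFileID")
    if body_file_id:
        # bodyFileID names the header field that holds the id part.
        page_id = str(header[body_file_id])
    else:
        # Fall back on the body file, or on the curated text file itself.
        source = header.get("bodyFile") or curated_file
        page_id = PurePosixPath(source).stem   # assumed: name without extension
    parts = [scope["website"], scope["navpath"], scope["curatedir"], page_id]
    return "/".join(part.strip("/") for part in parts)


scope = {"website": "https://example.org", "navpath": "/docs", "curatedir": "terms"}
```

With `bodyFileID` set to `id` and a header containing `id: party`, the sketch returns `https://example.org/docs/terms/party`.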

Constructing the `headingids` field

The headingids field is constructed by finding all markdown headings in the body-file (or the curated text file if there is no separate body file), and making a list out of them.

Example of Markdown Headers and their `headingids` elements

Default Markdown Headers

Markdown headings are only recognized when they are preceded by number signs (#) at the beginning of a line. The alternative syntax, which uses sequences of = or - characters on the next line, is ignored.

Here is an example of a markdown header:

## This is a Markdown Header

This header will result in the text this-is-a-markdown-header being added as an element in the headingids field.

Custom Heading IDs

A markdown heading may also contain a (custom) heading id that allows you to link directly to headings and modify them with CSS.

Here is an example of a markdown header with a custom heading-id:

# This is a Markdown Header {#custom-id}

This header will result in the text custom-id being added as an element in the headingids field.
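
Both cases can be covered by a short line scanner. The sketch below is illustrative only: the function name is hypothetical, the slug rule (lowercase, other character runs become hyphens) is an assumption that merely reproduces the example above, and the sketch does not skip `#` lines inside fenced code blocks.

```python
import re

HEADING = re.compile(r"^#{1,6}\s+(.*?)\s*$")   # '#' headings at line start
CUSTOM_ID = re.compile(r"\{#([^}]+)\}\s*$")     # trailing {#custom-id}


def heading_ids(markdown: str) -> list[str]:
    """Collect one heading id per '#'-style markdown heading."""
    ids = []
    for line in markdown.splitlines():
        match = HEADING.match(line)
        if not match:
            continue                    # '=' / '-' underlined headings are ignored
        text = match.group(1)
        custom = CUSTOM_ID.search(text)
        if custom:
            ids.append(custom.group(1))  # explicit id wins
        else:
            ids.append(re.sub(r"[^a-z0-9]+", "-", text.lower()).strip("-"))
    return ids


body = """## This is a Markdown Header

Some text.

# This is a Markdown Header {#custom-id}

Ignored Heading
===============
"""
```

For the sample `body`, the result is `["this-is-a-markdown-header", "custom-id"]`.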

Phase 4: checking the result

The last step consists of checking crucial properties that MRGs are relied on to have, and raising appropriate exceptions in case something is wrong. This helps curators that check the log outputs to become aware of things they may need to fix before these MRGs are further used (or published).

In this step, the following checks are done (as a minimum):

The value of the termid field in one MRG Entry differs from the value of the termid field of all other MRG Entries. This ensures that termid contains a unique identifier (primary key) within the context of the MRG.
When a regularized form phrase is an element of the formPhrases field of an MRG entry, there MUST NOT be another MRG entry in the same MRG that has this regularized form phrase in its formPhrases field.
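
The two checks can be sketched as one pass over the entries. This is an illustration under assumptions: entries are dictionaries, the function name `check_mrg` is hypothetical, and the sketch collects problem messages in a list so that the caller can decide how to raise or log them.

```python
from collections import defaultdict


def check_mrg(entries: list[dict]) -> list[str]:
    """Return messages for duplicate termids and shared form phrases."""
    termid_counts = defaultdict(int)
    phrase_owners = defaultdict(set)           # phrase -> indexes of entries
    for index, entry in enumerate(entries):
        termid_counts[entry["termid"]] += 1
        for phrase in entry.get("formPhrases") or []:
            phrase_owners[phrase].add(index)

    problems = []
    for termid, count in termid_counts.items():
        if count > 1:
            problems.append(f"termid '{termid}' occurs {count} times")
    for phrase, owners in phrase_owners.items():
        if len(owners) > 1:
            termids = sorted(entries[i]["termid"] for i in owners)
            problems.append(f"form phrase '{phrase}' is shared by {termids}")
    return problems


# Hypothetical entries: 'party' appears in two formPhrases fields.
entries = [
    {"termid": "concept:party", "formPhrases": ["party", "parties"]},
    {"termid": "concept:actor", "formPhrases": ["actor", "party"]},
]
```

Calling `check_mrg(entries)` reports one problem, for the form phrase `party`.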

Exceptions, Warnings, and Logging

Editor's note

This section needs to be reviewed/revised so as to enable a consistent way of error checking and logging, similar to what is done in the TRRT.

The general principle is that the MRGT helps its users to do their jobs. This means that errors that terminate processing are kept to a minimum, that warnings (perhaps at different 'levels' of detail/severity) are output whenever possible (yet may be limited by command-line options), and that message texts are tailored to the envisaged users of the tool.

The MRGT logs conditions that prevent it from properly:

obtaining the scopedir from a scopetag;
parsing a curated text (e.g. because it is not in the expected format);
resolving terms, scope tags, group tags, or version tags;
writing the output (e.g. because it has no write-permission for the designated location);
etc.;

Also, the MRGT provides suggestions that help tool-operators (curators) to not only identify, but also fix any problems.

The MRGT comes with documentation that enables developers to ascertain its correct functioning (e.g. by using a test set of files, test scripts that exercise its parameters, etc.), and also enables them to deploy the tool in a git repo and author/modify CI-pipes to use that deployment.

Notes

¹ The MRGT MUST NOT start by overwriting files that contain an MRG, as they should remain available as a (possible) source for copying MRG entries from during the construction of one or more provisional MRGs. Writing the actual files should be done after all provisional MRGs have been constructed.
