Machine Readable Glossary Generation Tool
Documentation needs to be adjusted for:
- Converting formPhrases: MRGT will write expanded formPhrase macros into the MRGEntry `formPhrases` field.
The Machine Readable Glossary generation Tool (MRGT) generates Machine Readable Glossaries (MRGs) for either one specific terminology version or all terminology versions that are curated within a specific scope. MRGs come in a specific, well-defined format. They contain some metadata, followed by a list of so-called MRG entries, one for every term in the scope. These entries represent the concepts and other semantic units that are known within that scope.
The (newly generated) MRGs are meant to be processed by the other tools in the toolbox, regardless of whether such tools are called from within the context of another scope. Because an MRG contains every term that is used in the scope and includes all the relevant metadata, it serves as the single, authoritative source of that (version of the) scope's terminology.
Installing the Tool
The tool can be installed from the command line and made globally available by executing
npm install -g @tno-terminology-design/mrgt
Before running the tool from the command line, make sure you have met the necessary prerequisites depending on your operating environment.
CMD.exe (Windows)
- Node.js and NPM: Ensure Node.js and NPM are installed.
- Global Installation: If you have installed the package globally, confirm the global NPM modules path by running `npm config get prefix`. The global modules are usually stored under `<prefix>/node_modules`.
- Environment Variables: Add the path to global NPM binaries to your system's PATH environment variable. This should be `<prefix>` on Windows. To add to PATH, you can edit your environment variables or run `set PATH=%PATH%;<prefix>` in the CMD.
PowerShell (Windows)
- Node.js and NPM: Ensure Node.js and NPM are installed.
- Global Installation: Check the global NPM modules path as in CMD.
- Environment Variables: Update the PATH environment variable as in CMD. You can also use `$env:Path += ";<prefix>"` to update the PATH temporarily in the current PowerShell session.
Bash (Linux/Mac)
- Node.js and NPM: Ensure Node.js and NPM are installed.
- Global Installation: If globally installed, run `npm config get prefix` to get the global modules path, usually `<prefix>/lib/node_modules`.
- Environment Variables: Add the `<prefix>/bin` directory to your `PATH` if it's not already there. You can do this by adding `export PATH=$PATH:<prefix>/bin` to your `~/.bashrc` or `~/.zshrc` file.
Calling the Tool
The behavior of the MRGT can be configured per call e.g. by a configuration file and/or command-line parameters. The command-line syntax is as follows:
mrgt [ <paramlist> ]
where `<paramlist>` is an (optional) list of parameters.
Legend
The columns in the following table are defined as follows:
- `Parameter` specifies the parameter and further specifications.
- `Req'd` specifies whether (`Y`) or not (`n`) the field is required to be present when the tool is being called. If required, it MUST be present either in the configuration file or as a command-line parameter.
- `Description` specifies the meaning of the `Value` field, and other things you may need to know, e.g. why it is needed, a required syntax, etc.
If a configuration file is used, the long version of the parameter must be used (without the preceding `--`).
Key | Req'd | Description |
---|---|---|
-c, --config <path> | n | Path (including the filename) of the tool's (YAML) configuration file. |
-h, --help | n | Display help for the command. |
-o, --onNotExist <action> | n | The action to take in case a vsntag was specified, but wasn't found in the SAF. |
-s, --scopedir <path> | n | Path of the scope directory from which the tool is called. |
-v, --vsntag <vsntag> | n | Versiontag for which the MRG needs to be (re)generated. |
-V, --version | n | Output the version number of the tool. |
The `<action>` parameter can take the following values:
<action> | Description |
---|---|
'throw' | an error is thrown (an exception is raised), and processing will stop. |
'warn' | a message is displayed (and logged) and processing continues. |
'log' | a message is written to a log(file) and processing continues. |
'ignore' | processing continues as if nothing happened. |
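
The four actions in the table above can be sketched as a small dispatch function. This is only an illustration, not the MRGT's actual implementation: the names `OnNotExist`, `Sink` and `handleNotExist` are hypothetical, and the assumption that 'warn' both displays and logs the message is taken from the table's wording.

```typescript
// Hypothetical sketch: only the four action values come from the table above.
type OnNotExist = "throw" | "warn" | "log" | "ignore";

// Assumed output channels: one shown to the user, one written to a log(file).
interface Sink {
  warn: (msg: string) => void;
  log: (msg: string) => void;
}

// Returns true when processing may continue; throws for the 'throw' action.
function handleNotExist(action: OnNotExist, vsntag: string, sink: Sink): boolean {
  const msg = `vsntag '${vsntag}' was not found in the SAF`;
  if (action === "throw") {
    throw new Error(msg);
  }
  if (action === "warn") {
    sink.warn(msg); // displayed ...
  }
  if (action === "warn" || action === "log") {
    sink.log(msg); // ... and/or logged
  }
  return true; // 'ignore' falls through silently
}
```

Passing the output channels in as a parameter keeps the function free of side effects on the console, which makes each action easy to exercise in a test.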
Running the Tool
One run of the MRGT either
- generates an MRG for one specific terminology version within the current scope (which is the case when the `vsntag` parameter is specified), or
- generates multiple MRGs, i.e., one for every version of the terminology that is curated within the current scope (which is the case when the `vsntag` parameter is omitted).
Running the tool comprises the following phases (see Notes):
- Constructing a provisional MRG;
- Post-processing the entries in that provisional MRG;
- Creating/overwriting MRG file(s) in the glossarydir of the current scope.
Phase 1: constructing a provisional MRG
Generating an MRG for a particular version of a terminology starts by reading the SAF of the scope within which that terminology is curated. The SAF resides in the scopedir that was provided as one of the calling parameters. If a `vsntag` argument is provided, the MRGT searches the versions section of the SAF for the corresponding entry, i.e. the entry that has the value of the `vsntag` parameter either in its `vsntag` field or as one of the elements of its `altvsntags` field. If the SAF does not have a corresponding entry, the action specified in the `onNotExist` parameter determines whether or not (and how) to proceed.
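
The lookup described above can be sketched as follows. The field names `vsntag` and `altvsntags` come from the text; the `VersionEntry` type and the `findVersion` function are hypothetical names, and the SAF is assumed to have been parsed into plain objects already.

```typescript
// Assumed shape of one element in the versions section of the SAF.
interface VersionEntry {
  vsntag: string;
  altvsntags?: string[];
}

// Returns the entry whose vsntag or one of whose altvsntags equals the
// requested tag, or undefined when the SAF has no corresponding entry.
function findVersion(versions: VersionEntry[], vsntag: string): VersionEntry | undefined {
  return versions.find(
    (v) => v.vsntag === vsntag || (v.altvsntags ?? []).includes(vsntag)
  );
}
```

An `undefined` result is the situation in which the `onNotExist` action would apply.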
In this phase, one provisional MRG is created for every terminology version that is to be generated. It contains a provisional MRG entry for every term in that particular version of the terminology. Each provisional MRG entry contains either:
- all fields in the header of the curated text that documents its term, or
- all fields in the MRG entry that comes from another MRG (typically, but not necessarily, from another scope).
The Term Selection Instruction syntax specifies precisely how provisional MRGs are created.
After a provisional MRG entry is created, the following modifications are made:
- the `formPhrases` field is processed, which means that every element in that set (array) is subjected to the following processing steps:
  - if the element contains a form phrase macro, it is replaced by a set of form phrases that is constructed by processing that form phrase macro (see Form Phrase Macro Expansion for the details and examples). This step effectively enlarges the set (array) of form phrases.
  - the resulting form phrases are converted into a regularized text, that is, they become regularized form phrases (see Text Regularization for the details and examples).
The result is a set of regularized form phrases, which is then used to produce the `formPhrases` field in the MRG entry.
An MRG SHOULD NOT have two (or more) MRG entries that have the same element in their `formPhrases` field, because that would mean that the form phrase is ambiguous, as it refers to two different semantic units.
Storing a provisional MRG in the glossarydir
When the creation of a provisional MRG is complete, a filename `mrg.<scopetag>.<vsntag>.yaml` is constructed, where:
- `<scopetag>` is the scopetag that is used within the current scope to refer to itself. Its value can be found in the `scopetag` field in the `scope` section of the SAF.
- `<vsntag>` is the versiontag that identifies the version of the terminology for which the MRG contains entries. Its value must be equal to that found in the `vsntag` field of the element in the versions section of the SAF from which the MRG was generated.
If a file with that name already exists in the glossarydir of the current scope, it will be deleted. Then, a new file with that name will be created, which will contain:
- a `terminology` section, the contents of which are obtained by copying relevant fields from the `terminology` section in the SAF;
- a `scopes` section, the contents of which are obtained by copying relevant fields from the `scopes` section in the SAF;
- an `entries` section, the contents of which consist of the provisional MRG entries of the provisional MRG.
Then, if the `<vsntag>` part of the filename equals the value of the `defaultvsn` field in the `scope` section of the SAF, a copy of that file is created in the glossarydir whose filename is `mrg.<scopetag>.yaml`, which is the name by which the default MRG of the current scope is referred to.
Next, the MRGT will create a copy of the MRG file for every versiontag that exists in the `altvsntags` field of the element in the versions section of the SAF from which the MRG was generated. The copy will contain the same MRG as the file that has just been written. The name of this copied file is `mrg.<scopetag>.<altvsntag>.yaml`, which is the same name as the MRG file, except that the `<vsntag>` part of that filename is replaced with the value of the versiontag found in the `altvsntags` field.
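
The naming rules of this section can be summarized in a short sketch that lists every file that would be written for one version. The filename patterns and the fields `vsntag`, `altvsntags` and `defaultvsn` come from the text; the function name `mrgFilenames` and the `VersionEntry` type are hypothetical.

```typescript
// Assumed shape of one element in the versions section of the SAF.
interface VersionEntry {
  vsntag: string;
  altvsntags?: string[];
}

// Lists the MRG filenames for one version: the main file, the default-MRG
// copy (only if this version is the default), and one copy per altvsntag.
function mrgFilenames(scopetag: string, version: VersionEntry, defaultvsn?: string): string[] {
  const names = [`mrg.${scopetag}.${version.vsntag}.yaml`];
  if (version.vsntag === defaultvsn) {
    names.push(`mrg.${scopetag}.yaml`);
  }
  for (const alt of version.altvsntags ?? []) {
    names.push(`mrg.${scopetag}.${alt}.yaml`);
  }
  return names;
}
```

For example, with the made-up values `scopetag = "tev2"`, `vsntag = "v1"`, `altvsntags = ["latest"]` and `defaultvsn = "v1"`, the function lists `mrg.tev2.v1.yaml`, `mrg.tev2.yaml` and `mrg.tev2.latest.yaml`.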
Phase 2: post-processing synonyms
This phase starts only after all provisional MRGs that the MRGT was instructed to build in this run have been created, and the corresponding files have been added to the glossarydir of the current scope. This allows post-processing, e.g. of synonyms, to use the newly generated provisional MRG entries.
When a provisional MRG entry in (one of) the created provisional MRGs has a `synonymOf` field that contains a term identifier, this identifier now refers to either
- an MRG entry in one of the MRGs that already existed, or
- a provisional MRG entry in a provisional MRG that has just been created.
This (possibly provisional) MRG entry is then copied, after which all fields of the provisional MRG entry that contained the term identifier are added to the copy, overwriting any fields that already existed and adding fields that did not yet exist. The resulting data then replaces the provisional MRG entry that contained the term identifier.
Effectively, this means that whenever a term is defined as a synonym of some other term, the corresponding MRG entry will have all fields of this other term, except for those that were specified in the header of the term that is defined as a synonym of that other term.
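
The copy-then-overlay behavior can be sketched with an object spread. The names `MrgEntry` and `resolveSynonym` are hypothetical, and the example field values are made up; only the merge order (fields of the referenced entry first, fields of the synonym on top) is taken from the text.

```typescript
// An MRG entry is treated here as a plain bag of fields.
type MrgEntry = Record<string, unknown>;

// Copies the referenced entry, then adds the synonym's own fields on top,
// overwriting fields that already exist and adding those that do not.
function resolveSynonym(synonym: MrgEntry, target: MrgEntry): MrgEntry {
  return { ...target, ...synonym };
}

// Illustrative data only.
const referenced: MrgEntry = { term: "party", termType: "concept", glossaryText: "An entity that ..." };
const provisional: MrgEntry = { term: "participant", synonymOf: "party" };
const resolved = resolveSynonym(provisional, referenced);
// resolved has term "participant" and synonymOf "party" (from the synonym),
// and termType and glossaryText copied from the referenced entry.
```

Because the spread creates a new object, neither the referenced entry nor the original provisional entry is modified.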
Phase 3: post-processing other fields
Now, all provisional MRG entries in all provisional MRGs are processed so as to become usable from the context within which they have been selected. This means that every field in such a provisional MRG entry is discarded if its field name (when converted into lowercase) matches any of the field names in the table below, after which the fields in that table are added with the contents as specified. The MRGT run is concluded after all these modifications have been written to their appropriate MRG files.
Field | Value(s) that are assigned to the fields |
---|---|
scopetag | overwrite the scopetag field with the scopetag field as found in the scope section of the SAF. |
locator | path, relative to scopedir/curatedir/, of the file that contains the (header of) the curated text. |
navurl | (localized) path to which browsers navigate in order to see the rendered version of the curated text. |
headingids | a list of the markdown headings and/or heading ids that are found in the body of the curated text. Note that this body can be either in the curated text file or in a separate body file. |
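
The discard-then-add step for the fields in this table can be sketched as follows. The four field names come from the table; `MrgEntry`, `ContextFields` and `applyContextFields` are hypothetical names, and the values to assign are assumed to have been computed elsewhere.

```typescript
type MrgEntry = Record<string, unknown>;

// The fields from the table, with values assumed to be computed beforehand.
interface ContextFields {
  scopetag: string;
  locator: string;
  navurl: string;
  headingids: string[];
}

// Drops every field whose lowercased name matches a table field,
// then adds the table fields with their newly computed values.
function applyContextFields(entry: MrgEntry, fields: ContextFields): MrgEntry {
  const reserved = new Set(Object.keys(fields));
  const result: MrgEntry = {};
  for (const [name, value] of Object.entries(entry)) {
    if (!reserved.has(name.toLowerCase())) {
      result[name] = value;
    }
  }
  return { ...result, ...fields };
}
```

Comparing on the lowercased name means that a header field spelled, say, `NavURL` would be discarded as well, rather than surviving next to the generated `navurl`.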
The following sections elaborate on the construction of (the contents of) some of these fields.
Constructing the `navurl` field
The `navurl` field is constructed by concatenating `website`/`navpath`/`curatedir`/`id`, where `website`, `navpath` and `curatedir` are given by the contents of the respective fields in the `scope` section of the SAF.
The `id` part is one of the following:
- if the `scope` section of the SAF contains the field `bodyFileID`, then its contents specify the name of the field that is expected to exist in the header of the curated text, and the value of that field will become the `id` part. Thus, static site generators such as Docusaurus, which uses the `id` field to specify this value, can be accommodated.
- if the SAF does not specify the `bodyFileID` field, then `id` will become the name of the file that contains the rendered version of the body-file as specified in the `bodyFile` field in the header of the curated text file, or, if that field is empty or non-existent, the name of the curated text file itself.
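
A minimal sketch of this construction is given below. The field names `website`, `navpath`, `curatedir`, `bodyFileID` and `bodyFile` come from the text; everything else is an assumption made for illustration: the names `ScopeSection`, `Header`, `baseName` and `buildNavurl`, the choice to strip directories and the file extension to obtain the "name of the file", and the trimming of surplus slashes when joining the parts.

```typescript
interface ScopeSection {
  website: string;
  navpath: string;
  curatedir: string;
  bodyFileID?: string;
}
type Header = Record<string, string | undefined>;

// Assumption: the id is the file name without directories and extension,
// e.g. "bodies/party-body.md" becomes "party-body".
function baseName(path: string): string {
  const file = path.split("/").pop() ?? path;
  const dot = file.lastIndexOf(".");
  return dot > 0 ? file.slice(0, dot) : file;
}

function buildNavurl(scope: ScopeSection, header: Header, curatedFile: string): string {
  let id: string;
  if (scope.bodyFileID !== undefined) {
    // bodyFileID names the header field that holds the id.
    const value = header[scope.bodyFileID];
    if (value === undefined) {
      throw new Error(`header field '${scope.bodyFileID}' is missing`);
    }
    id = value;
  } else {
    // Fall back to the body file, or to the curated text file itself.
    id = baseName(header["bodyFile"] || curatedFile);
  }
  return [scope.website, scope.navpath, scope.curatedir, id]
    .map((part) => part.replace(/^\/+|\/+$/g, "")) // avoid doubled slashes
    .filter((part) => part.length > 0)
    .join("/");
}
```

With the made-up scope `{ website: "https://example.org", navpath: "/docs", curatedir: "terms", bodyFileID: "id" }` and a header containing `id: party`, this yields `https://example.org/docs/terms/party`.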
Constructing the `headingids` field
The `headingids` field is constructed by finding all markdown headings in the body-file (or in the curated text file if there is no separate body file), and making a list out of them.
Example of Markdown Headers and their `headingids` fields
Default Markdown Headers
Markdown headings are only recognized when they are preceded by number signs (#) at the beginning of a line. The alternative syntax, which uses sequences of `=` or `-` characters on the next line, is ignored.
Here is an example of a markdown header:
`## This is a Markdown Header`
This header will result in the text `this-is-a-markdown-header` being added as an element in the `headingids` field.
Custom Heading IDs
A markdown heading may also contain a (custom) heading id that allows you to link directly to headings and modify them with CSS.
Here is an example of a markdown header with a custom heading-id:
`# This is a Markdown Header {#custom-id}`
This header will result in the text `custom-id` being added as an element in the `headingids` field.
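
The two examples above can be reproduced with the sketch below. It is not the MRGT's actual code: the function names `slugify` and `headingIds` are hypothetical, and the slug rules (lowercase, punctuation dropped, whitespace replaced by hyphens) are an assumption chosen to match the example. The sketch also assumes a space after the number signs and does not skip headings that appear inside code blocks.

```typescript
// Assumed slug rules: lowercase, drop punctuation, whitespace becomes "-".
function slugify(text: string): string {
  return text
    .toLowerCase()
    .replace(/[^a-z0-9\s-]/g, "")
    .trim()
    .replace(/\s+/g, "-");
}

// Collects one id per heading that starts with number signs.
// Underlined (= or -) headings are deliberately not matched.
function headingIds(body: string): string[] {
  const ids: string[] = [];
  for (const line of body.split(/\r?\n/)) {
    const heading = /^#{1,6}\s+(.*)$/.exec(line);
    if (heading === null) {
      continue;
    }
    const text = heading[1] ?? "";
    // A trailing {#custom-id} takes precedence over the generated slug.
    const custom = /\{#([^}\s]+)\}\s*$/.exec(text);
    ids.push(custom?.[1] ?? slugify(text));
  }
  return ids;
}
```

Applied to a body that contains the two example headers from this section, `headingIds` returns `this-is-a-markdown-header` followed by `custom-id`.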
Phase 4: checking the result
The last step consists of checking crucial properties that MRGs are relied on to have, and raising appropriate exceptions in case something is wrong. This helps curators that check the log outputs to become aware of things they may need to fix before these MRGs are further used (or published).
In this step, the following checks are done (as a minimum):
- The value of the `termid` field in one MRG entry differs from the value of the `termid` field of all other MRG entries. This ensures that `termid` contains a unique identifier (primary key) within the context of the MRG.
- When a regularized form phrase is an element of the `formPhrases` field of an MRG entry, there MUST NOT be another MRG entry in the same MRG that has this regularized form phrase in its `formPhrases` field.
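
Both checks can be sketched in a single pass over the entries of one MRG. The field names `termid` and `formPhrases` come from the text; the `MrgEntry` shape and the function name `findDuplicates` are hypothetical, and the form phrases are assumed to be regularized already.

```typescript
interface MrgEntry {
  termid: string;
  formPhrases?: string[];
}

// Returns the termids that occur in more than one entry, and the form
// phrases that occur in the formPhrases field of more than one entry.
function findDuplicates(entries: MrgEntry[]): { termids: string[]; formPhrases: string[] } {
  const seenIds = new Set<string>();
  const dupIds = new Set<string>();
  const phraseOwner = new Map<string, number>(); // phrase -> index of first entry using it
  const dupPhrases = new Set<string>();

  entries.forEach((entry, index) => {
    if (seenIds.has(entry.termid)) {
      dupIds.add(entry.termid);
    }
    seenIds.add(entry.termid);

    for (const phrase of entry.formPhrases ?? []) {
      const owner = phraseOwner.get(phrase);
      if (owner === undefined) {
        phraseOwner.set(phrase, index);
      } else if (owner !== index) {
        // Same phrase in a different entry: the phrase is ambiguous.
        dupPhrases.add(phrase);
      }
    }
  });

  return { termids: Array.from(dupIds), formPhrases: Array.from(dupPhrases) };
}
```

A caller could turn non-empty results into the exceptions or log messages mentioned above; how the MRGT reports them is not specified here.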
Exceptions, Warnings, and Logging
This section needs to be reviewed/revised so as to enable a consistent way of error checking and logging, similar to what is done in the TRRT.
The general principle is that the MRGT helps its users to do their jobs. This means that errors that terminate the processing are kept to a minimum, that warnings (perhaps at different 'levels' of detail/severity) are output whenever possible (yet may be limited by command-line options), and that texts are tailored for the envisaged users of the tool.
The MRGT logs conditions that prevent it from properly:
- obtaining the scopedir from a scopetag;
- parsing a curated text (e.g. because it is not in the expected format);
- resolving terms, scope tags, group tags, or version tags;
- writing the output (e.g. because it has no write-permission for the designated location);
- etc.
Also, the MRGT provides suggestions that help tool-operators (curators) to not only identify, but also fix any problems.
The MRGT comes with documentation that enables developers to ascertain its correct functioning (e.g. by using a test set of files, test scripts that exercise its parameters, etc.), and also enables them to deploy the tool in a git repo and author/modify CI-pipes to use that deployment.
Notes
- The MRGT MUST NOT start by overwriting files that contain an MRG, as they should remain available as a (possible) source for copying MRG entries from during the construction of one or more provisional MRGs. Writing the actual files should be done after all provisional MRGs have been constructed.