Friday, September 30, 2016

Directories and Databases of Published Controlled Vocabularies

A source of published controlled vocabularies (taxonomies, thesauri, ontologies, etc.) can be useful for different purposes. Sometimes, finding a vocabulary to license and reuse is the objective, whereas in other cases finding a vocabulary to consult as a source for confirming individual terms and relationships is the goal. Thus, different kinds of directories or databases of controlled vocabularies may be of interest.

In some cases, an individual or organization has a project involving a set of content that would benefit from controlled vocabulary tagging to make it findable/retrievable/discoverable, but lacks the time or resources to build a taxonomy from scratch. Licensing an existing controlled vocabulary may seem like a preferable option. This can be a reasonable solution, depending on the content and scope of the controlled vocabulary in question. In many cases, what is desired is the use of an existing controlled vocabulary as a starting point that can then be edited and expanded to customize it for a specific use. Either case involves the licensing of a controlled vocabulary.

Taxonomists who build taxonomies from scratch or edit proprietary taxonomies like to consult available controlled vocabularies on the same subject to help determine the ideal wording of a term, the inclusion of additional synonyms, and the relationship of the term to others. The results may vary in different controlled vocabularies because they serve different purposes and audiences, but taxonomists know to take that into consideration. In these cases, licensing an entire controlled vocabulary is not needed. Simply viewing a controlled vocabulary and its term relationships is adequate.

If you are interested in licensing a complete controlled vocabulary, you will need to consider both commercial/proprietary controlled vocabularies that require a fee for a license, and public/open source controlled vocabularies that are available for free. Some collections comprise only open source vocabularies. While free is nice, the free license may carry a restriction of no commercial use and/or no modifications in use. So, for commercial or for modified reuse, make sure you consult a controlled vocabulary directory that includes proprietary controlled vocabularies.

If you are interested in merely looking up terms in a controlled vocabulary, it is the public, not proprietary, controlled vocabularies that are fully accessible. Therefore, it’s more convenient to consult a database/directory of controlled vocabularies that includes only public vocabularies and preferably either hosts those vocabularies or directly links to the browsable/searchable vocabulary, rather than simply redirecting to the controlled vocabulary publisher’s website, where you may have to hunt around to find access to the controlled vocabulary, if it is even accessible at all.

Comprehensive database-directories of controlled vocabularies

Comprehensive directories are large, listing hundreds of controlled vocabularies, so they are managed as databases, with database records for each referenced controlled vocabulary and search filters, such as vocabulary name, publisher’s name, and subject. There are database record pages with more details for each named controlled vocabulary, including a link to the publisher’s website. The link may or may not be to a navigable controlled vocabulary. The most comprehensive such databases are the following two:

Basel Register of Thesauri, Ontologies & Classifications (BARTOC)
BARTOC, launched in 2013, is a comprehensive database/registry of controlled vocabularies (“knowledge organization systems”) created and managed by the Basel University Library (Switzerland). The database currently lists 1,948 vocabularies of all kinds, in all languages, in all subject areas, and in various publication formats. The database of vocabularies is hosted on Drupal, and advanced search filters comprise the top 10 Dewey Decimal Classification categories, 568 hierarchical Topics, Language, Location, and Type (categorization scheme, classification scheme, dictionary, gazetteer, glossary, list, name authority list, ontology, semantic network, subject heading scheme, synonym ring, taxonomy, terminology, and thesaurus). Links are to the publisher site and sometimes directly to a navigable thesaurus. Despite its comprehensiveness, BARTOC has only a few commercial, proprietary controlled vocabularies.

Taxonomy Warehouse
Taxonomy Warehouse, launched in 1999, is a comprehensive database of varied controlled vocabularies created and managed by the thesaurus management software vendor Synaptica LLC. The database currently lists 763 vocabularies of all kinds in all subject areas from 330 organizations in various publication formats. The database of vocabularies is hosted on the Synaptica software platform, and it can be searched, browsed alphabetically (one page per letter of the alphabet), or browsed by 225 hierarchical subject categories. Unlike BARTOC, it includes only vocabularies that are in English or are multilingual including English. Taxonomy Warehouse is much more comprehensive than BARTOC in its inclusion of entries for proprietary controlled vocabularies, such as those of Gale and WAND, which may be licensed for a fee from their publishers and, once licensed, do not preclude modification or commercial reuse. While a number of links are dead, an overhaul is planned in the coming months.

Hosted vocabulary registries

A hosted “repository” of vocabularies can be useful, because all the vocabularies are navigable through the same user interface on the same site. You can even search for a vocabulary term across multiple controlled vocabularies at once. As publicly accessible vocabularies, many of these can also be downloaded from the site for noncommercial use. This type of database exists mostly for ontologies, because they conform to Semantic Web standards for exchange of information over the web and thus don’t require a lot of data conversion to be hosted, but Linked Data SKOS vocabulary collections are starting to appear. (Note that ontologies are structured and displayed slightly differently from taxonomies or thesauri, so they may not be as useful as reference sources for editing taxonomies or thesauri.) Publicly accessible ontologies tend to be in the biomedical sciences, so the subject area is also more limited and the ontology databases are aimed at subject matter experts. Vocabulary repositories of this kind include the following, among many others:

Research Vocabularies Australia
Research Vocabularies Australia is a controlled vocabulary "discovery service" of the Australian National Data Service (ANDS), launched in September 2015. It currently comprises 74 vocabularies, mostly in the sciences, and is intended to grow. About half of the vocabularies are hosted on the ANDS website, where their hierarchies can be browsed and their terms searched in a common user interface. These are Linked Data SKOS vocabularies, not ontologies, and include taxonomies, thesauri, and simple term lists. Vocabulary publishers comprise 33 governmental and nongovernmental organizations, Australian and other. The collection of vocabularies can be searched and can be filtered by Subject, Publisher, Language, and License. Although not as large as ontology-only repositories, Research Vocabularies Australia is a sizable repository of easy-to-access controlled vocabularies all in one place, and thus is a good source for researching terms or for downloading noncommercial-use vocabularies.

BioPortal
BioPortal is a biomedical ontology repository service of the National Center for Biomedical Ontology (NCBO) comprising 516 ontologies, many of which can be downloaded directly from the site. The vocabularies can be searched or browsed, with search filters including controlled fields for Category, Group, and Format. The list of ontologies can be sorted by Popular, Size, Projects, Notes, and Upload date. One can also search for a class (term) within multiple ontologies. A great deal of metadata and summary information is provided for each vocabulary, including history of uploads, a graph of downloads, and a table of metrics, which includes the number of classes, individuals, properties, maximum depth, children, etc.

Ontobee
Ontobee, hosted by the He Group (of Dr. Yongqun “Oliver” He) of the University of Michigan Medical School, provides a sortable tabular list of 181 biomedical ontologies, which can each be individually searched and browsed directly on the Ontobee website. Furthermore, terms can be searched in the Ontobee linked data server across all 181 ontologies. The ontologies (with the OWL file extension) can be downloaded, and lists of terms (more useful references for taxonomists) can be downloaded as Excel or text files.
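
For readers who download term lists from repositories like these and want to check a term across several vocabularies at once, the lookup can be sketched in a few lines of Python. This is only an illustrative sketch: the vocabulary names and terms below are made-up sample data, and the function name search_term is hypothetical, not part of any repository's tools. In practice the lists would be read from the downloaded text files.

```python
from typing import Dict, List, Tuple

# Hypothetical sample data: vocabulary name -> list of term labels.
# Real lists would be loaded from term files downloaded from a repository.
VOCABULARIES: Dict[str, List[str]] = {
    "Vocabulary A": ["Cell", "Cell membrane", "Nucleus"],
    "Vocabulary B": ["Stem cell", "Tissue"],
    "Vocabulary C": ["Organ", "Tissue sample"],
}


def search_term(query: str, vocabularies: Dict[str, List[str]]) -> List[Tuple[str, str]]:
    """Return (vocabulary, term) pairs whose label contains the query, ignoring case."""
    needle = query.casefold()
    matches: List[Tuple[str, str]] = []
    for vocab_name, terms in vocabularies.items():
        for term in terms:
            if needle in term.casefold():
                matches.append((vocab_name, term))
    return matches


# List every vocabulary in which a label containing "cell" appears.
for vocab_name, term in search_term("cell", VOCABULARIES):
    print(f"{vocab_name}: {term}")
```

A simple substring match like this finds variant wordings of a term in each vocabulary, which is usually what a taxonomist comparing term forms wants. It does not follow synonyms or hierarchical relationships.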

Vocabularies listed on educational or professional organization sites

Some organizations list a sampling of vocabularies in all subject areas to serve as educational examples of different kinds of vocabularies, aimed more at students and professionals in the area of library and information science than for subject matter experts. These vocabulary collections tend to include only vocabularies that can be accessed and navigated on a public website, so they are a good source when researching individual terms. Examples of vocabulary collections of this type include those on the following sites:

American Society for Indexing
“Online Thesauri and Authority Files” is a webpage with an alphabetical list of about 25 vocabularies, mostly thesauri, in varied subjects, with links directly to the browsable vocabularies. While the number of vocabularies is not large, the list is maintained, and it is a practical resource for looking up terms in varied thesauri. The vocabularies are meant to be examples for professional indexers who are also interested in working in thesaurus construction.

Charles Sturt University - School of Information Studies
“Information Organisation Vocabularies” is a webpage under the section “Links and Resources for Students” on this Australian university site. It comprises an alphabetical, sortable table of 328 vocabularies, although there is no explanatory text. The list can be sorted by column headers: Name, Author, Year, Publisher, and Keywords (uncontrolled). There is also a filter-search feature, which aids in finding a desired subset of vocabularies. The links lead to the navigable vocabulary on the publisher site or, in the case of a few older vocabularies, to a PDF print thesaurus. Although there are a number of dead links and nothing has been added since 2014, the number of correct links directly into navigable vocabularies is large, so this is a useful resource.

Vocabularies listed on software vendor sites

Some thesaurus/ontology management software vendors provide a sampling of vocabularies in various subject areas created in their tools, aimed at users or potential users of the tool. These vocabularies tend to be directly browsable, but the qualities of the vocabularies may be inconsistent, so care should be taken in using them as an authoritative source. Vocabulary collections of this type include those created in the following software tools:

PoolParty (The Semantic Web Company)
Has an alphabetical list of about 30 web-browsable Linked Data vocabularies, most of which are in English and almost all of which are hosted by the Semantic Web Company. Some are very small and were built by Semantic Web Company staff as examples, and some are public thesauri that were imported into PoolParty. In addition to being browsed, almost all of the thesauri can be downloaded, too.

TemaTres
Has a tabular list of "known cases" of over 400 vocabularies managed in TemaTres, some of which are hosted on the TemaTres site and some of which link to vocabularies on the owner's server. Table columns are for title, scope (either the number of terms or a description), language, and URL. The vocabularies are in many languages. A somewhat higher proportion are in Spanish, because TemaTres is developed in Argentina, and only a minority are in English. The search feature has limitations due to the inconsistent use of scope descriptions and the fact that titles and descriptions are in different languages.

MultiTes
Has a small sampling of 10 web-browsable thesauri in varied subject areas, of which 7 are in English. Some are hosted on the MultiTes site and some are hosted on the thesaurus publisher sites.

VocBench
Has links on its VocBench “community” page to about a dozen national and international organizations and two higher education institutions with VocBench-created vocabularies. In some cases the links are to the browsable thesauri, but in other cases the links are just to the organization websites, and the thesauri, if available, are not so easily found.

Protégé
Has a wiki page that lists and links to websites of ontology publishers in three categories: 80 OWL ontologies, 19 Frame-based ontologies (those ontologies that were developed using the Protégé-Frames editor), and 8 in other ontology formats. Some of the links are dead, some are to the websites of the ontology owners, and some are directly to the XML file. Since the links are not to the navigable ontology in a browser, this list of ontologies is not useful as a source for checking terms, but it is a good source for downloading ontologies, if you have the right software to read them.

7 comments:

  1. Dear Heather,

    thank you very much for mentioning BARTOC. I just have one minor correction: it does include commercial, proprietary vocabularies, see e.g. Ovid Nursing Subject Thesaurus, Thomson Reuters Business Classification etc. But I agree that the number could be much higher. Maybe this could be achieved through a cooperation with Taxonomy Warehouse/Synaptica?

    Best, Andreas

  2. Thank you, Andreas, for your comment. I had skimmed through a number of the vocabularies on BARTOC but obviously did not look at all of them. I have now updated this post in both the sections on BARTOC and Synaptica.

  3. LOV (Linked Open Vocabularies), hosted by the Open Knowledge Foundation, currently contains 576 vocabularies, their metadata, and linkage statistics: http://lov.okfn.org/dataset/lov/

    LOV supports SPARQL, has a "suggest" feature, and displays a nice bubble chart of the listed vocabularies.

    LOV is an Open Source web application: https://github.com/pyvandenbussche/lov

  4. Thank you, Wes, for sharing this. I had come across LOV last year, but did not think to include it in this list now. Some of the vocabularies are very small, and as such may be suitable for a single facet within a faceted taxonomy. Also, the vocabularies are in machine-readable form only, not human-readable, so this is not so useful as a source for merely consulting another taxonomy to check and validate select terms and relationships. Nevertheless, it is an interesting source to note.

  5. On the topic of licensing taxonomies and other types of controlled vocabularies, I'd like to explore this further. I have created a short 8-question multiple-choice survey to get an idea of the trends in interest in licensing vocabularies. It will be open through the first half of 2019. https://www.surveymonkey.com/r/DWQW5TY

  6. Hi Heather - thanks very much for this page and the references, it's really helped in some research!
