The Research, Condition, and Disease Categorization (RCDC) system is a computer-based process that sorts NIH-funded projects into scientific categories for reporting to the public. It relies on an automated indexing tool that reads the text of a project’s title, abstract, public health relevance section, and specific aims.

The indexing tool matches project text against scientific terminology, referred to as concepts, from the RCDC Thesaurus. If the project text has enough overlap with the concepts used to create the category, the project will be reported in the category. A single project can be categorized in multiple categories.

Categories: Categorical Spending Topics

A category can be a research area such as neuroscience, a disease such as diabetes, or a condition such as chronic pain. For each category, the RCDC system compiles a list of all the funded projects, which is called the project listing. The categories, project listings, project text, and other data are available on the Categorical Spending Page. Categories are developed at the request of Congress, the White House, advocacy groups, and NIH leadership.

RCDC Thesaurus

The RCDC Thesaurus is a collection of biomedical concepts that are used to describe scientific research areas. Concepts are terms or keywords found in project text and are used to create categories. They range from specific diseases, gene names, types of diagnostic tools, and therapeutic techniques to patient populations. A concept can have multiple synonyms, which are words or phrases that mean the same thing as the concept.

Examples:

Concept | Synonym(s)
Acquired Immunodeficiency Syndrome | Acquired Immune Deficiency Syndrome; AIDS
House Dust | home dust; household dust; residential dust
Joint Prosthesis | artificial joint

The RCDC Thesaurus was built from several vocabularies, such as the Medical Subject Headings (MeSH), plus content contributed by subject matter experts across the NIH. As science evolves, staff review and update the concepts.
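The concept-and-synonym structure described above can be sketched as a simple lookup table. This is only an illustration: the dictionary layout and the names `THESAURUS`, `build_synonym_lookup`, and `resolve` are assumptions, not part of the RCDC system. The entries come from the examples above.

```python
# Hypothetical in-memory thesaurus: concept -> list of synonyms.
# The entries are the examples from the text; the layout is an assumption.
THESAURUS = {
    "Acquired Immunodeficiency Syndrome": [
        "Acquired Immune Deficiency Syndrome",
        "AIDS",
    ],
    "House Dust": ["home dust", "household dust", "residential dust"],
    "Joint Prosthesis": ["artificial joint"],
}


def build_synonym_lookup(thesaurus):
    """Map every lower-cased concept name and synonym to its concept."""
    lookup = {}
    for concept, synonyms in thesaurus.items():
        lookup[concept.lower()] = concept
        for synonym in synonyms:
            lookup[synonym.lower()] = concept
    return lookup


LOOKUP = build_synonym_lookup(THESAURUS)


def resolve(term):
    """Return the concept a term belongs to, or None if it is unknown."""
    return LOOKUP.get(term.strip().lower())
```

With this sketch, `resolve("household dust")` and `resolve("House Dust")` both point to the same concept, which is the behavior the synonym lists are meant to provide.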

Project Indexing: Identifying concepts in project text

The RCDC system uses the NIH Automated Indexing Service (NAIS), text mining software that applies natural language processing techniques. NAIS scans project text to find RCDC Thesaurus concepts in the project's title, abstract, specific aims, and public health relevance section. The concepts found in the project text, together with the number of times each concept appears, make up the project index. The categorization process begins once project indexing is complete.
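To make the idea of a project index concrete, here is a minimal sketch that counts how often each concept, or one of its synonyms, appears in a piece of text. It is not how NAIS works internally, which the text does not describe. It assumes plain case-insensitive phrase matching, and the names `index_project` and `THESAURUS` are hypothetical.

```python
import re
from collections import Counter

# Hypothetical thesaurus excerpt: concept -> list of synonyms.
THESAURUS = {
    "House Dust": ["home dust", "household dust", "residential dust"],
    "Joint Prosthesis": ["artificial joint"],
}


def index_project(text, thesaurus):
    """Count occurrences of each concept (or any synonym) in the text.

    Assumption: whole-phrase, case-insensitive matching stands in for
    the natural language processing that the real service applies.
    """
    counts = Counter()
    for concept, synonyms in thesaurus.items():
        for phrase in [concept, *synonyms]:
            pattern = r"\b" + re.escape(phrase) + r"\b"
            hits = len(re.findall(pattern, text, flags=re.IGNORECASE))
            if hits:
                counts[concept] += hits
    return dict(counts)


sample_text = (
    "We study household dust and home dust exposure "
    "in patients with an artificial joint."
)
project_index = index_project(sample_text, THESAURUS)
```

Simple string matching like this has obvious limits. For example, a case-insensitive match on a short synonym such as "AIDS" would also match the ordinary word "aids", which is one reason a real indexing service needs more than pattern matching.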

Categorization Logic: Creating project listings

Category Parameters

Categories are built through NIH-wide discussions with subject-matter experts, who determine the scientific areas to include in or exclude from the category. These decisions are called category parameters. Category parameters are periodically reviewed and updated to account for new or evolving scientific topics.

Category Fingerprint

Based on the category parameters, a category fingerprint is developed using a list of concepts from the RCDC Thesaurus that are relevant to the category topic. Each concept in the fingerprint is weighted based on how closely it relates to the topic of the category. For example, "e-cigarette," "nicotine abuse," and "oral tobacco" are concepts used in the Tobacco category.

Category fingerprints use a list of concepts from the RCDC Thesaurus to capture projects relevant to the category topic. The image shows only a selection of the concepts contained in the Tobacco, Back Pain, and Obesity categories. The Tobacco category has concepts such as e-cigarette, nicotine abuse, nicotine gum, nicotine patch, oral tobacco, tobacco abuse, tobacco smoke, and tobacco user. The Back Pain category has concepts such as back injuries, chronic back pain, low back pain, recurrent back pain, sciatica, spine pain, spine surgery, and tailbone pain. The Obesity category has concepts such as emotional eating, fast food, gastric bypass, obese patients, obesity prevention, morbid obesity, and weight loss drugs.

Categorization occurs when the concepts in the project text are compared to the concepts in the category fingerprint to identify the degree of overlap. When the project text has enough overlap with the concepts in the category fingerprint, the project will be included in the category. A single project can be categorized in multiple categories.
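The comparison between a project index and a category fingerprint can be sketched as a weighted overlap score checked against a cutoff. The text does not state the actual scoring formula, weights, or cutoff, so everything below is an assumption made for illustration: the integer weights, the `threshold` parameter, and the function names are all hypothetical.

```python
# Hypothetical fingerprints: category -> {concept: weight}.
# Concept names come from the text; the weights are made up.
FINGERPRINTS = {
    "Tobacco": {"e-cigarette": 3, "nicotine abuse": 2, "oral tobacco": 2},
    "Back Pain": {"low back pain": 3, "sciatica": 2, "spine surgery": 1},
}


def overlap_score(project_index, fingerprint):
    """Sum the weights of fingerprint concepts present in the project.

    Assumption: a concept contributes its weight once if it appears at
    all; the real system may also use the occurrence counts.
    """
    return sum(
        weight
        for concept, weight in fingerprint.items()
        if project_index.get(concept, 0) > 0
    )


def categorize(project_index, fingerprints, threshold):
    """Return every category whose score reaches the assumed threshold."""
    return sorted(
        name
        for name, fingerprint in fingerprints.items()
        if overlap_score(project_index, fingerprint) >= threshold
    )


# A project index maps each concept found in the text to its count.
project_index = {"e-cigarette": 4, "nicotine abuse": 1, "sciatica": 1}
categories = categorize(project_index, FINGERPRINTS, threshold=3)
```

Because each category is scored independently, lowering the cutoff in this sketch to 2 would place the same project in both Tobacco and Back Pain, matching the point that a single project can fall into multiple categories.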

Business Rules

Some RCDC categories have rules that capture additional relevant projects or limit the projects captured based on project metadata (i.e., information about the project) rather than project text. The rules can be based on grant activity codes, funding opportunity numbers, or the funding NIH institute or center.
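A metadata rule of this kind can be pictured as an include or exclude condition applied after text matching. The sketch below is an assumption about how such rules might be represented. The `Project` and `Rule` types, the rule ordering, and the funding opportunity number `FOA-EXAMPLE-001` are all hypothetical and not taken from the RCDC system.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Project:
    project_id: str
    activity_code: str        # grant activity code
    funding_opportunity: str  # funding opportunity number
    institute: str            # funding NIH institute or center


@dataclass(frozen=True)
class Rule:
    action: str        # "include" or "exclude"
    field: str         # name of the Project attribute to test
    values: frozenset  # metadata values that trigger the rule


def apply_rules(project, matched_by_text, rules):
    """Start from the text-based result, then let metadata rules adjust it.

    Assumption: rules run in order and a later matching rule overrides
    an earlier one.
    """
    in_category = matched_by_text
    for rule in rules:
        if getattr(project, rule.field) in rule.values:
            in_category = rule.action == "include"
    return in_category


# Hypothetical rules for one category; the values are placeholders.
rules = [
    Rule("include", "funding_opportunity", frozenset({"FOA-EXAMPLE-001"})),
    Rule("exclude", "activity_code", frozenset({"T32"})),
]

added = Project("P1", "R01", "FOA-EXAMPLE-001", "NCI")
removed = Project("P2", "T32", "FOA-OTHER", "NCI")
```

Here `apply_rules(added, False, rules)` brings in a project that text matching missed, while `apply_rules(removed, True, rules)` drops one that text matching captured.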

Category Relationships

Category relationships are another way to supplement project listings for categories with directly related research. A category with a narrow topic can be used to add projects to a category with a broader topic by reporting the narrow category’s project listing into the broader category, without the need to adjust the broader category’s fingerprint. For example, all projects in the Back Pain category are included in the Pain Research category.
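In data terms, a category relationship amounts to merging the narrow category's project listing into the broader one. The following sketch assumes project listings are sets of project identifiers and that relationships are a simple narrow-to-broad mapping. The function name `roll_up` and the identifiers are hypothetical.

```python
def roll_up(listings, relationships):
    """Add each narrow category's projects to its broader category.

    listings:      category name -> set of project ids
    relationships: narrow category name -> broader category name
    Assumption: one level of relationships only; chained relationships
    would need repeated passes.
    """
    result = {name: set(ids) for name, ids in listings.items()}
    for narrow, broad in relationships.items():
        result.setdefault(broad, set()).update(listings.get(narrow, set()))
    return result


# Placeholder listings built from text matching alone.
listings = {
    "Back Pain": {"P1", "P2"},
    "Pain Research": {"P3"},
}
relationships = {"Back Pain": "Pain Research"}

final_listings = roll_up(listings, relationships)
```

The broader category's fingerprint is never touched in this sketch. Only its project listing grows, which mirrors the Back Pain and Pain Research example above.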

Category Review and Finalization

After initial development, NIH staff review RCDC categories to determine if the indexed projects are scientifically relevant and if additional work is needed to capture additional projects or remove any errors. Once category review is complete, category data is frozen for the fiscal year and reported on the Categorical Spending Page.