HPO gold standard corpus
The HPO gold standard corpus consists of a collection of 228 manually annotated abstracts cited by the Online Mendelian Inheritance in Man (OMIM) database.
All annotations are stored in a stand-off, tab-separated format, in files carrying the PMIDs of the corresponding abstracts listed in the original corpus.
The stand-off annotation format is: startOffset::endOffset [tab] HPO URI | original text span (for example, [86::103] HP_0001792 | hypoplastic nails).
Archives containing the XML and text versions of the abstracts, as well as the stand-off annotations, can be downloaded using the links below:
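A minimal sketch of how one annotation line in the format described above might be parsed in Python. The names `Annotation`, `parse_annotation_line`, and `span_matches` are hypothetical and not part of the corpus distribution. The sketch assumes that the square brackets around the offsets are optional, that the separator after the offsets is a tab or other whitespace, and that offsets are character positions with an exclusive end (in the example above, 103 - 86 equals the length of "hypoplastic nails", which is consistent with that reading but does not prove it).

```python
import re
from typing import NamedTuple


class Annotation(NamedTuple):
    start: int   # start offset in the abstract text
    end: int     # end offset (assumed exclusive)
    hpo_id: str  # HPO identifier, e.g. "HP_0001792"
    text: str    # original text span


# Assumption: brackets around the offsets are optional and the separator
# after them is a tab or other whitespace.
_LINE_RE = re.compile(r"^\[?(\d+)::(\d+)\]?\s+(\S+)\s*\|\s*(.*?)\s*$")


def parse_annotation_line(line: str) -> Annotation:
    """Parse one stand-off annotation line into an Annotation."""
    match = _LINE_RE.match(line)
    if match is None:
        raise ValueError(f"Unrecognized annotation line: {line!r}")
    start, end, hpo_id, text = match.groups()
    return Annotation(int(start), int(end), hpo_id, text)


def span_matches(abstract: str, ann: Annotation) -> bool:
    """Check that the offsets select the recorded text span.

    Assumes character offsets with an exclusive end offset.
    """
    return abstract[ann.start:ann.end] == ann.text
```

Running `span_matches` over every annotation of an abstract is a quick way to find out whether the offset convention assumed here agrees with the files actually downloaded.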
HPO test suites corpus
Cohen et al. [Cohen 2010] adopted the test suite methodology from software engineering and proposed a stratified approach to data sampling based on several criteria.
Each criterion focuses on a set of concepts that share a particular property, such as length in tokens, presence of punctuation, or coordination.
This leads to a framework able to characterize the strengths of the linguistic patterns used within each concept recognition system and, moreover, to a platform that can be applied and shared to perform standardized error analysis.
This framework has been applied to HPO and has led to 32 manually crafted criteria (or types of test suites) comprising 2,164 entries; each entry corresponds to the label of an HPO concept.
In addition to being structured by type, the test suites have also been structured according to the 21 top-level abnormalities present in HPO. The list of criteria is given below.
The archive comprising all test suites can be downloaded from:
Test suite list:
- Numerals isolated (Roman)
- Numerals isolated (Arabic)
- Non-English canonical
- Metaphoric constructs
- Lexical variation
- Canonical ordering
- Canonical ordering (transformed)
- Containing punctuation
- Containing stop words (WITH)
- Containing stop words (TO)
- Containing stop words (OF)
- Containing stop words (IN)
- Containing stop words (BY)
- Containing stop words (FROM)
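The test suites themselves were crafted manually. Purely as an illustration of what a few of the surface-level criteria above look for, the following sketch tags a concept label with the criteria it appears to satisfy. The function `criteria_for_label`, the simplified Roman numeral pattern, and the whitespace tokenization are assumptions made for this example and do not reproduce how the original suites were built.

```python
import re
import string

# Stop words named in the criteria list above.
STOP_WORDS = ("with", "to", "of", "in", "by", "from")

# Simplified pattern for upper-case Roman numerals I to XXXIX
# (an assumption for illustration only).
ROMAN_RE = re.compile(r"^(?=[IVX])X{0,3}(IX|IV|V?I{0,3})$")


def criteria_for_label(label: str) -> list[str]:
    """Return the surface-level criteria that a concept label satisfies."""
    tokens = label.split()  # naive whitespace tokenization
    found = []
    if any(ROMAN_RE.match(tok) for tok in tokens):
        found.append("Numerals isolated (Roman)")
    if any(tok.isdigit() for tok in tokens):
        found.append("Numerals isolated (Arabic)")
    if any(ch in string.punctuation for ch in label):
        found.append("Containing punctuation")
    lowered = [tok.lower() for tok in tokens]
    for word in STOP_WORDS:
        if word in lowered:
            found.append(f"Containing stop words ({word.upper()})")
    return found
```

Criteria such as metaphoric constructs, lexical variation, or canonical ordering depend on linguistic judgment and are not covered by simple checks of this kind.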
[Cohen 2010] Cohen, K. B., Roeder, C., Baumgartner Jr., W. A., Hunter, L. E., & Verspoor, K. (2010). Test suite design for biomedical ontology concept recognition systems. In Proc. of the Seventh International Conference on Language Resources and Evaluation (LREC'10), Valletta, Malta, pp. 441-446.