A predefined set of substructures whose presence or absence in a molecule is recorded to describe that molecule. The resulting description is used during substructure searching as a screen to eliminate molecules that cannot match a query, as a basis for clustering or similarity searching, and as input for developing classification or regression models.
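As a rough illustration of the idea, the sketch below encodes presence or absence of each substructure as one bit, then uses the bits for screening and for similarity. Everything named here is an assumption made for the example: the five-entry `KEYS` list is hypothetical (real key sets are far larger), and each molecule's fragment annotations are written by hand rather than computed by a substructure-matching toolkit.

```python
# Hypothetical key set, chosen only for illustration.
KEYS = ["carboxylic_acid", "aromatic_ring", "amine", "hydroxyl", "halogen"]


def fingerprint(fragments):
    """Encode presence/absence of each key as one bit of an integer."""
    bits = 0
    for i, key in enumerate(KEYS):
        if key in fragments:
            bits |= 1 << i
    return bits


def may_contain(molecule_fp, query_fp):
    """Screen: a molecule can match only if it has every bit the query has."""
    return (molecule_fp & query_fp) == query_fp


def tanimoto(fp_a, fp_b):
    """Tanimoto similarity: shared set bits divided by bits set in either."""
    common = bin(fp_a & fp_b).count("1")
    total = bin(fp_a | fp_b).count("1")
    # Returning 0.0 for two empty fingerprints is a convention chosen here.
    return common / total if total else 0.0


# Hand-written fragment annotations (assumed, not computed from structures).
MOLECULES = {
    "benzoic_acid": {"carboxylic_acid", "aromatic_ring"},
    "aminobenzoic_acid": {"carboxylic_acid", "aromatic_ring", "amine"},
    "aniline": {"aromatic_ring", "amine"},
    "glycine": {"carboxylic_acid", "amine"},
    "ethanol": {"hydroxyl"},
}

fps = {name: fingerprint(frags) for name, frags in MOLECULES.items()}

# Query: an aromatic carboxylic acid. Molecules missing either key are
# eliminated before any expensive atom-by-atom matching would be attempted.
query_fp = fingerprint({"carboxylic_acid", "aromatic_ring"})
candidates = sorted(n for n, fp in fps.items() if may_contain(fp, query_fp))

similarity = tanimoto(fps["benzoic_acid"], fps["aminobenzoic_acid"])
```

The screen can only rule molecules out. A molecule that passes still needs a full substructure match, because having all the required keys does not guarantee that they are arranged as the query requires.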