Can you give more details about the probabilistic species identification algorithm you use? Are your methods public?
Our identification pipeline uses a published probabilistic algorithm. The algorithm accounts for gaps in the sequence reference database by comparing the contents of that database with the expected taxonomic diversity. This allows it to calculate the probability that a query sequence comes from a species that is not in the reference database, which reduces the likelihood of overconfident species-level assignments caused by database gaps. Because the taxonomy is a key input to the algorithm, the probability of assignment can be estimated at each taxonomic level (for example species, genus or family), and an acceptance threshold can be applied so that only high-confidence assignments are retained.
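To make the idea concrete, here is a minimal sketch in Python. It is a toy illustration of the principle and not the published algorithm. The function name `assign_taxon`, the uniform prior over expected species, the fixed `unknown_likelihood` for unreferenced species, the taxon names and all numbers are assumptions made for this example only.

```python
from collections import defaultdict


def assign_taxon(likelihoods, genus_of, expected_species,
                 unknown_likelihood, threshold=0.9):
    """Toy taxonomic assignment that reserves probability for database gaps.

    likelihoods        -- {species: P(query | species)} for referenced species
    genus_of           -- {species: genus}
    expected_species   -- {genus: expected number of species in that genus}
    unknown_likelihood -- assumed P(query | an unreferenced species)
    threshold          -- minimum posterior probability needed to accept
    """
    # Count how many species of each genus the reference database holds.
    referenced = defaultdict(int)
    for species in likelihoods:
        referenced[genus_of[species]] += 1

    # Every expected but missing species contributes some weight
    # (assumption: uniform prior over all expected species).
    unreferenced_weight = {}
    for genus, n_expected in expected_species.items():
        n_missing = max(n_expected - referenced[genus], 0)
        unreferenced_weight[genus] = n_missing * unknown_likelihood

    total = sum(likelihoods.values()) + sum(unreferenced_weight.values())
    if total == 0:
        return ("unassigned", None, 0.0)

    # Posterior probability for each referenced species.
    species_post = {sp: lk / total for sp, lk in likelihoods.items()}

    # Genus posterior: referenced species plus unreferenced ones.
    genus_post = defaultdict(float)
    for sp, p in species_post.items():
        genus_post[genus_of[sp]] += p
    for genus, weight in unreferenced_weight.items():
        genus_post[genus] += weight / total

    # Accept the most specific level that clears the threshold.
    if species_post:
        best_species = max(species_post, key=species_post.get)
        if species_post[best_species] >= threshold:
            return ("species", best_species, species_post[best_species])
    best_genus = max(genus_post, key=genus_post.get)
    if genus_post[best_genus] >= threshold:
        return ("genus", best_genus, genus_post[best_genus])
    return ("unassigned", None, genus_post[best_genus])


# Hypothetical data: two referenced species in one made-up genus.
likelihoods = {"sp_A": 0.9, "sp_B": 0.05}
genus_of = {"sp_A": "GenusX", "sp_B": "GenusX"}

# Database covers the whole genus: the best match is accepted as a species.
complete = assign_taxon(likelihoods, genus_of, {"GenusX": 2}, 0.5)

# Four of six expected species are missing: only the genus is accepted.
gappy = assign_taxon(likelihoods, genus_of, {"GenusX": 6}, 0.5)
```

In this sketch the same query is accepted at species level when the database is complete, and only at genus level when most of the genus is unreferenced, because the missing species take up part of the probability.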