dnamatch2 {dnamatch2}    R Documentation

dnamatch2

Description

dnamatch2 performs large-scale DNA database searches between trace samples (including mixture profiles), and between trace samples and reference samples.

Usage

dnamatch2(evidfold, freqfile, reffold = NULL, sameCID = FALSE,
  betweensamples = TRUE, Thist = Inf, threshMAC = 0.75,
  threshLR = c(10, 100), threshHeight = 200, threshStutt = 0.1,
  threshMaj = 0.6, minLocStain = 3, minLocMaj = 3, pC = 0.05,
  lambda = 0.01, kit = "ESX17", minFreq = 0.001,
  searchtime = Sys.time(), SIDvec = NULL, BIDvec = NULL,
  CIDvec = NULL, timediff = NULL, BIDptrn = NULL, SIDptrn = NULL,
  printHistPlots = FALSE, writeScores = FALSE, maxK = c(4, 3),
  matchfile = "matchfile.csv", sessionfold = "sessions")

Arguments

evidfold

A folder with stain files (possibly a vector of folders). The full path must be given.

freqfile

A file containing population allele frequencies. The full path must be given.

reffold

A folder with reference files (possibly a vector of folders). Default is NULL (no references). The full path must be given.

sameCID

Boolean indicating whether matches within the same case ID number are allowed.

betweensamples

Boolean indicating whether matches between trace samples are searched. Default is TRUE.

Thist

Number of days back in time for a Batch-file to be imported (stains).

threshMAC

Threshold for an allele match. A proportion in [0,1].

threshLR

LR threshold for a match. Can be a vector (qual,quan), giving the thresholds for the qualitative and quantitative models.

threshHeight

Acceptable peak height (rfu) in stains.

threshStutt

Acceptable stutter ratio in stains (relative peak heights).

threshMaj

If second largest allele has ratio (relative to the largest allele) above this threshold, then second allele is part of the major profile (used for extracting major from mixture). If the relative peak height between second and third largest allele has ratio greater than this threshold, no major is assigned.

minLocStain

Minimum number of loci a stain profile must have in order to be evaluated.

minLocMaj

Minimum number of loci required for a major component (extracted from each stain profile) to be considered.

pC

Assumed drop-in rate per marker (parameter in the LR models).

lambda

Assumed hyperparameter lambda for the peak height drop-in model (parameter in the quantitative model).

kit

The short name of the kit used for the samples. This allows degradation to be taken into account. See the getKit function in the euroformix R package for the kit short names.

minFreq

Minimum frequency for rare alleles. Used to assign new allele frequencies. Default is 0.001.

searchtime

An object with format as returned from Sys.time(). Default is Sys.time().

SIDvec

A vector with Sample-ID numbers which are considered in the search (i.e. a search filter).

BIDvec

A vector with Batch-files, which are considered in the search (i.e. a search filter).

CIDvec

A vector with Case-ID numbers which are considered in the search (i.e. a search filter).

timediff

Time difference (in days) allowed between matching reference and target. Default is NULL (not used).

BIDptrn

Filename structure of stain files to evaluate. For instance BIDptrn="TA-".

SIDptrn

Pattern in the sample name that identifies the sample ID. For instance, with the name (SID_RID_CID) = "-S0001_BES00001-14_2014234231", SIDptrn="-S" gives SID="0001". Can also be a vector.

printHistPlots

Boolean indicating whether to show plots of the scores (number of matching alleles or LRs) for each comparison.

writeScores

Writes detailed score information to file. Default is FALSE.

maxK

The maximum number of contributors used in the search. Can be a vector for the (qual,quan) methods. Default is c(4,3).

matchfile

The name of the matchfile where multiple searches are stored in one place. Default is "matchfile.csv".

sessionfold

The name of the folder with run sessions. Default is "sessions".
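
As an illustration of how the arguments fit together, the following minimal sketch builds an argument list and calls dnamatch2 only when the package is installed and the stain folder exists. The three paths are placeholders (assumptions, not values from this documentation); the remaining values are the defaults or examples given above, except betweensamples, which is switched off here.

```r
# Hypothetical paths: replace them with full paths on your own system.
search_args <- list(
  evidfold = "/full/path/to/stains",
  freqfile = "/full/path/to/frequencies.csv",
  reffold = "/full/path/to/references",
  sameCID = FALSE,          # do not allow matches within the same case ID
  betweensamples = FALSE,   # only compare references against traces
  threshMAC = 0.75,         # allele match threshold (proportion)
  threshLR = c(10, 100),    # LR thresholds (qual, quan)
  kit = "ESX17",
  BIDptrn = "TA-",          # stain file names must start with this pattern
  SIDptrn = "-S",           # pattern identifying the sample ID
  maxK = c(4, 3)            # max contributors (qual, quan)
)

# Run the search only if the package and the stain folder are available.
if (requireNamespace("dnamatch2", quietly = TRUE) &&
    dir.exists(search_args$evidfold)) {
  do.call(dnamatch2::dnamatch2, search_args)
}
```

Keeping the arguments in a list makes it easy to store the settings of a search next to its results.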

Details

dnamatch2 automatically imports DNA profiles in GeneMapper format and feeds them into a structure designed for efficient comparison matching. Before the comparisons are carried out, every trace sample is optionally filtered: alleles with peak heights below a specified threshold (threshHeight), or alleles in stutter position with relative peak heights below a specified threshold (threshStutt), are removed.

The comparison match algorithm first counts the number of alleles of each reference profile that are included in each trace sample. A candidate list is then created from the comparisons that exceed a given threshold (threshMAC). Candidate matches having the same CID (case ID) can optionally be removed here. If wanted, candidate matches whose time difference (based on the last-edited dates of the files) falls outside a specified limit (timediff) can also be removed.

The second part of the algorithm first estimates the number of contributors of the remaining trace samples in the candidate list, using the likelihood function of the qualitative model (likEvid in forensim) together with the AIC. Next, the likelihood ratio (LR) of each comparison (between reference and trace samples) is calculated based on the same qualitative model, and an updated candidate list is created from all comparisons with an LR greater than a specified threshold (threshLR[1]). Last, the LR of each remaining comparison in the candidate list is calculated based on the quantitative model (likEvidMLE in euroformix). The final match candidates are the comparisons with an LR greater than a specified threshold (threshLR[2]). Finally, detailed results for these match candidates are automatically written to files.
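
The filtering step and the allele-counting step described above can be sketched in base R as follows. This is an illustration only, not the package's implementation. The function names filter_marker and mac_score are hypothetical, and the sketch makes three assumptions: a stutter sits one repeat unit below its parent allele, the height filter is applied before the stutter filter, and the allele match score is the proportion of reference alleles found in the trace.

```r
# Filter one marker of a trace sample (numeric alleles, heights in RFU).
filter_marker <- function(alleles, heights, threshHeight = 200, threshStutt = 0.1) {
  # Remove alleles with peak heights below threshHeight.
  keep <- heights >= threshHeight
  alleles <- alleles[keep]
  heights <- heights[keep]
  # Assumption: the parent of a stutter is the allele one repeat unit longer.
  parent <- match(round(alleles + 1, 1), round(alleles, 1))
  ratio <- heights / heights[parent]          # NA when no parent is present
  stutter <- !is.na(ratio) & ratio < threshStutt
  list(alleles = alleles[!stutter], heights = heights[!stutter])
}

# Proportion of reference alleles included in the trace sample.
# ref and evid are named lists: marker name -> character vector of alleles.
mac_score <- function(ref, evid) {
  markers <- intersect(names(ref), names(evid))
  n_ref <- sum(lengths(ref[markers]))
  n_hit <- sum(vapply(markers, function(m) sum(ref[[m]] %in% evid[[m]]), numeric(1)))
  if (n_ref == 0) NA_real_ else n_hit / n_ref
}

filtered <- filter_marker(c(11, 12, 14, 15), c(150, 2500, 210, 3000))

evid <- list(D3S1358 = c("14", "15", "16"), TH01 = c("6", "7", "9.3"))
refs <- list(
  refA = list(D3S1358 = c("15", "16"), TH01 = c("6", "9.3")),
  refB = list(D3S1358 = c("17", "18"), TH01 = c("6", "8"))
)
scores <- vapply(refs, mac_score, numeric(1), evid = evid)
candidates <- names(scores)[scores >= 0.75]   # comparison with threshMAC
```

In this toy data, allele 11 is dropped by the height filter and allele 14 by the stutter filter, and only refA reaches the allele match threshold.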

To search between trace samples, a major component is extracted from each trace sample based on relative peak heights. The allele with the largest peak height belongs to the major. If the second largest allele has a ratio (relative to the largest allele) above a specified threshold (threshMaj), the second allele is part of the major profile as well. If the ratio between the second and third largest peak heights is greater than this threshold, no major profile is assigned.
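
A minimal sketch of the major extraction rule for a single marker is given below. The function name extract_major is hypothetical, and the sketch assumes that the second ratio is the third largest height divided by the second largest, and that this check is applied regardless of whether the second allele was included. The text above does not state either point, so the package may behave differently.

```r
# Extract the major component for one marker from alleles and peak heights.
# Returns character(0) when no major profile is assigned.
extract_major <- function(alleles, heights, threshMaj = 0.6) {
  stopifnot(length(alleles) == length(heights))
  if (length(alleles) == 0) return(character(0))
  ord <- order(heights, decreasing = TRUE)
  a <- as.character(alleles[ord])
  h <- heights[ord]
  major <- a[1]                       # the largest peak belongs to the major
  if (length(h) >= 2 && h[2] / h[1] > threshMaj) {
    major <- c(major, a[2])           # second peak is close enough to the first
  }
  if (length(h) >= 3 && h[3] / h[2] > threshMaj) {
    return(character(0))              # third peak too close to the second
  }
  major
}

extract_major(c("12", "14", "15"), c(1500, 1200, 200))   # two-allele major
extract_major(c("12", "14", "15"), c(1500, 300, 100))    # one-allele major
extract_major(c("12", "14", "15"), c(1500, 1400, 1300))  # no major assigned
```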

Whether matches between trace samples are considered is optional (boolean betweensamples). Turning this off increases the speed drastically, because far fewer comparisons are made.

Encoding idea: references are coded with prime numbers; stain alleles and heights keep their original values but are collapsed into "/"-separated strings per marker.
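
One way such an encoding could work is sketched below. The details are assumptions, since the text only states the idea: each allele of a marker is mapped to a distinct prime, a reference genotype is stored as the product of its primes, and a trace allele is contained in the reference exactly when its prime divides that product. All names (allele_primes, encode_ref and so on) are hypothetical.

```r
# Assumed allele-to-prime lookup for one marker.
allele_primes <- c("6" = 2, "7" = 3, "8" = 5, "9" = 7, "9.3" = 11)

# A reference genotype becomes the product of its allele primes.
encode_ref <- function(genotype) prod(allele_primes[genotype])

# Stain alleles keep their original values, collapsed per marker with "/".
stain_alleles <- c("6", "7", "9.3")
stain_string <- paste(stain_alleles, collapse = "/")

# Decode the stain string and count how many stain alleles occur in the reference.
count_shared <- function(ref_code, stain_string) {
  stain <- strsplit(stain_string, "/", fixed = TRUE)[[1]]
  sum(ref_code %% allele_primes[stain] == 0)
}

ref_code <- encode_ref(c("6", "9.3"))          # 2 * 11
shared <- count_shared(ref_code, stain_string) # alleles "6" and "9.3" are shared
```

With this kind of coding a whole reference genotype is a single number per marker, so the inclusion check reduces to vectorised modulo arithmetic.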

Note 1: Trace sample names must have the format "SID_RID_CID", separated by underscores. The trace name must be unique and consist of SID=Sample ID, RID=an ID that is unique together with SID for all traces, and CID=Case ID. Hence the IDs must not contain underscores themselves.

Note 2: The names of the files containing trace samples must start with the pattern specified by the variable BIDptrn (BatchID pattern). The variable SIDptrn is required to recognize a specific sample type (i.e. it can be used as a filter).

Note 3: Marker names for Amelogenin must contain the name "AM" (not case-sensitive).
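
The naming rules in these notes can be checked with a few lines of base R. The helpers parse_trace_name and is_amelogenin are hypothetical and only illustrate the conventions; the sample name is the one used in the description of SIDptrn.

```r
# Split a trace name of the form "SID_RID_CID" and extract the sample ID
# that follows the pattern SIDptrn.
parse_trace_name <- function(name, SIDptrn = "-S") {
  parts <- strsplit(name, "_", fixed = TRUE)[[1]]
  if (length(parts) != 3) {
    stop("trace name must have the form SID_RID_CID: ", name)
  }
  pos <- regexpr(SIDptrn, parts[1], fixed = TRUE)
  sid <- if (pos > 0) substring(parts[1], pos + nchar(SIDptrn)) else NA_character_
  list(SID = sid, RID = parts[2], CID = parts[3])
}

# Amelogenin markers are recognised by "AM" in the marker name, ignoring case.
is_amelogenin <- function(marker) grepl("AM", marker, ignore.case = TRUE)

ids <- parse_trace_name("-S0001_BES00001-14_2014234231")
is_amelogenin(c("AMEL", "Amelogenin", "D3S1358"))
```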

Timestamp="YY-MM-DD-HH-MM-SS" is used as the identification key for when the dnamatch2 function was run. Match results are stored as a table in the match file (matchfile, default "matchfile.csv") together with the timestamp. Matches already found in the match file are not stored again. The column "Checked" can be used for comments. More details about a given dnamatch2 run are stored in the session folder (sessionfold, default "sessions") under the corresponding timestamp.
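
The bookkeeping described above can be sketched as follows. The helper names and the column names Ref and Target are assumptions made for the illustration; the actual layout of the match file is not specified here, apart from the timestamp and the "Checked" column.

```r
# Timestamp key in the form "YY-MM-DD-HH-MM-SS".
make_timestamp <- function(t = Sys.time()) format(t, "%y-%m-%d-%H-%M-%S")

# Append only those matches that are not already stored.
# Assumption: a match is identified by the columns named in 'key'.
append_new_matches <- function(old, new, key = c("Ref", "Target")) {
  make_key <- function(df) do.call(paste, c(unname(as.list(df[key])), sep = "\r"))
  is_new <- !(make_key(new) %in% make_key(old))
  rbind(old, new[is_new, , drop = FALSE])
}

stamp <- make_timestamp(as.POSIXct("2024-03-05 14:07:09", tz = "UTC"))

stored <- data.frame(Ref = "R1", Target = "T1", Timestamp = "24-01-01-00-00-00",
                     Checked = "", stringsAsFactors = FALSE)
found <- data.frame(Ref = c("R1", "R2"), Target = c("T1", "T1"), Timestamp = stamp,
                    Checked = "", stringsAsFactors = FALSE)
updated <- append_new_matches(stored, found)   # only the R2/T1 match is added
```

The earlier R1/T1 entry keeps its original timestamp, so the table records when each match was first found.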

Author(s)

Oyvind Bleka <oyvble.at.hotmail.com>

References

dnamatch2: An open source software to carry out large scale database searches of mixtures using qualitative and quantitative models.


[Package dnamatch2 version 2.0.0 Index]