[R] ROCR predictions

Wed Aug 18 07:55:03 CEST 2010

Dear Assa,

you need to call prediction() with continuous predictions and a _binary_ true 
class label.
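
As a minimal sketch of such a call: the data below are a few rows copied from 
your table, and I am only assuming that the p-value column is the classifier 
score and that "Is Expected" is the true class label. The variable names are 
made up for illustration. ROCR treats larger scores as more "positive", so 
the p-values are put on a -log10 scale first.

```r
library(ROCR)

# Assumption: p-value = score, "Is Expected" = true label (not confirmed).
p_value     <- c(1.15e-05, 0.0173, 9.40e-08, 3.83e-06, 0.009, 0.3975)
is_expected <- c(TRUE, FALSE, TRUE, TRUE, FALSE, FALSE)

# Small p-values should count as strong predictions, so flip the scale.
score  <- -log10(p_value)
labels <- as.integer(is_expected)   # binary labels: 1 = positive, 0 = negative

pred <- prediction(predictions = score, labels = labels)
perf <- performance(pred, measure = "tpr", x.measure = "fpr")
auc  <- performance(pred, measure = "auc")@y.values[[1]]

plot(perf)   # ROC curve; ROCR walks through all cutoffs itself
```

If the p-values are not the score, only the first two assignments change, the 
call itself stays the same.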

You are the only one who can tell whether the p-values are actually predictions 
and what the class labels are. To the list readers, p is just the name of some 
variable, and you have said neither what you are trying to classify nor what 
the columns mean.

The only information we get from your table is that the p-value column holds 
small, continuous values. From what I see, the p-values could just as well be 
fitting errors of the predictions (e.g. expressed as the probability that the 
similarity to the predicted class is random).
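
Should the p-values turn out to be the score after all, the characteristics 
for a handful of fixed cutoffs can also be computed and plotted by hand in 
base R. This is only a sketch: the column interpretation, the cutoffs and the 
helper roc_point() are my assumptions, not something your table tells me.

```r
# Assumption: p-value = score (smaller = more confident), is_expected = true label.
p_value     <- c(1.15e-05, 0.0173, 9.40e-08, 3.83e-06, 0.009, 0.3975)
is_expected <- c(TRUE, FALSE, TRUE, TRUE, FALSE, FALSE)

cutoffs <- c(1e-6, 1e-4, 1e-2, 1)   # hypothetical p-value cutoffs

# Harden the predictions at one cutoff and return the resulting rates.
roc_point <- function(cutoff) {
  predicted <- p_value <= cutoff
  c(cutoff = cutoff,
    tpr = sum(predicted & is_expected)  / sum(is_expected),
    fpr = sum(predicted & !is_expected) / sum(!is_expected))
}

# One row per cutoff, columns cutoff / tpr / fpr.
roc <- as.data.frame(t(sapply(cutoffs, roc_point)))

plot(roc$fpr, roc$tpr, type = "b", xlim = c(0, 1), ylim = c(0, 1),
     xlab = "false positive rate", ylab = "true positive rate")
```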

Claudia

Assa Yeroslaviz wrote:
> Dear Claudia,
> 
> thank you for your fast answer.
> I add again the table of the data as an example.
> 
> Protein ID 	Pfam Domain 	p-value 	Expected 	Is Expected 	True Positive 	False Negative 	False Positive 	True Negative
> NP_000011.2 	APH 	1.15E-05 	APH 	TRUE 	1 	0 	0 	0
> NP_000011.2 	MutS_V 	0.0173 	APH 	FALSE 	0 	0 	1 	0
> NP_000062.1 	CBS 	9.40E-08 	CBS 	TRUE 	1 	0 	0 	0
> NP_000066.1 	APH 	3.83E-06 	APH 	TRUE 	1 	0 	0 	0
> NP_000066.1 	CobU 	0.009 	APH 	FALSE 	0 	0 	1 	0
> NP_000066.1 	FeoA 	0.3975 	APH 	FALSE 	0 	0 	1 	0
> NP_000066.1 	Phage_integr_N 	0.0219 	APH 	FALSE 	0 	0 	1 	0
> NP_000161.2 	Beta_elim_lyase 	6.25E-12 	Beta_elim_lyase 	TRUE 	1 	0 	0 	0
> NP_000161.2 	Glyco_hydro_6 	0.002 	Beta_elim_lyase 	FALSE 	0 	0 	1 	0
> NP_000161.2 	SurE 	0.0059 	Beta_elim_lyase 	FALSE 	0 	0 	1 	0
> NP_000161.2 	SapB_2 	0.0547 	Beta_elim_lyase 	FALSE 	0 	0 	1 	0
> NP_000161.2 	Runt 	0.1034 	Beta_elim_lyase 	FALSE 	0 	0 	1 	0
> NP_000204.3 	EGF 	0.004666118 	EGF 	TRUE 	1 	0 	0 	0
> NP_000229.1 	PAS 	3.13E-06 	PAS 	TRUE 	1 	0 	0 	0
> NP_000229.1 	zf-CCCH 	0.2067 	PAS 	FALSE 	0 	1 	1 	0
> NP_000229.1 	E_raikovi_mat 	0.0206 	PAS 	FALSE 	0 	0 	0 	0
> NP_000388.2 	NAD_binding_1 	8.21E-24 	NAD_binding_1 	TRUE 	1 	0 	0 	0
> NP_000388.2 	ABM 	1.40E-08 	NAD_binding_1 	FALSE 	0 	0 	1 	0
> NP_000483.3 	MMR_HSR1 	1.98E-05 	MMR_HSR1 	TRUE 	1 	0 	0 	0
> NP_000483.3 	DEAD 	2.30E-05 	MMR_HSR1 	FALSE 	0 	0 	1 	0
> NP_000483.3 	APS_kinase 	1.80E-09 	MMR_HSR1 	FALSE 	0 	0 	1 	0
> NP_000483.3 	CbiA 	0.0003 	MMR_HSR1 	FALSE 	0 	0 	1 	0
> NP_000483.3 	CoaE 	1.28E-07 	MMR_HSR1 	FALSE 	0 	0 	1 	0
> NP_000483.3 	FMN_red 	4.61E-08 	MMR_HSR1 	FALSE 	0 	0 	1 	0
> NP_000483.3 	Fn_bind 	0.3855 	MMR_HSR1 	FALSE 	0 	0 	1 	0
> NP_000483.3 	Invas_SpaK 	0.2431 	MMR_HSR1 	FALSE 	0 	0 	1 	0
> NP_000483.3 	PEP-utilizers 	0.127 	MMR_HSR1 	FALSE 	0 	0 	1 	0
> NP_000483.3 	NIR_SIR_ferr 	0.1661 	MMR_HSR1 	FALSE 	0 	0 	1 	0
> NP_000483.3 	AAA 	0.0031 	MMR_HSR1 	FALSE 	0 	0 	1 	0
> NP_000483.3 	DUF448 	0.0021 	MMR_HSR1 	FALSE 	0 	0 	1 	0
> NP_000483.3 	CBF_beta 	0.1201 	MMR_HSR1 	FALSE 	0 	0 	1 	0
> NP_000483.3 	zf-C3HC4 	0.0959 	MMR_HSR1 	FALSE 	0 	0 	1 	0
> NP_000560.5 	ig 	5.69E-39 	ig 	TRUE 	1 	0 	0 	0
> NP_000704.1 	Epimerase 	4.40E-21 	Epimerase 	TRUE 	1 	0 	0 	0
> NP_000704.1 	Lipase_GDSL 	6.63E-11 	Epimerase 	FALSE 	0 	0 	1 	0
> 
> ...
> 
> this is a shortened list from one of the 10 lists I have for different 
> p-values.
> 
> As you can see I have separate p-value experiments and probably need to 
> calculate for each of them a separate ROC. But I don't know how to 
> calculate these characteristics for the p-values.
> How do I assign the predictions to each of the single p-value experiments?
> 
> I would appreciate any help
> 
> Thanks
> Assa
> 
> 
> On Tue, Aug 17, 2010 at 12:55, Claudia Beleites <cbeleites at units.it> wrote:
> 
>     Dear Assa,
> 
> 
> 
>         I am having a problem building a ROC curve with my data using
>         the ROCR
>         package.
> 
>         I have 10 lists of proteins such as attached (proteinlist.xls).
>         each of the
> 
>     your file didn't make it to the list.
> 
> 
> 
>         lists was calculated with a different p-value.
>         The goal is to find the optimal p-value for the highest number
>         of true
>         positives as well as the lowest number of false positives.
> 
> 
>         As far as I understood the explanations from the vignette of
>         ROCR, my data
>         of TP and FP are the labels of the prediction function. But I
>         don't know how
>         to assign the right predictions to these labels.
> 
> 
>     I assume the p-values are different cutoffs that you use for
>     "hardening" (= making yes/no predictions) from some soft (=
>     continuous class membership) output of your classifier.
> 
>     Usually, ROCR calculates the curves as a function of the
>     cutoff/threshold itself from the continuous predictions. If you have
>     these soft predictions, let ROCR do the calculation for you.
> 
>     If you don't have them, ROCR can calculate your characteristics
>     (sens, spec, precision, recall, whatever) for each of the p-values.
>     While you could combine the results "by hand" into a
>     ROCR-performance object and let ROCR do the plotting, it is then
>     probably easier if you plot directly yourself.
> 
>     Don't be shy to look into the prediction and performance objects, I
>     find them pretty obvious. Maybe start with the objects produced by
>     the examples.
> 
>     Also, note that ROCR works with binary validation data only. If your
>     data has more than two classes, you need to turn it into two-class
>     problems first (e.g. protein xy vs. not protein xy).
> 
> 
> 
>         BTW, Is there a way of finding the optimum in the curve? I mean
>         to find the
>         exact value in the ROC curve (see sheet 2 in the excel file for
>         the ROC
>         curve).
> 
> 
>     Someone asked for optimum on ROC a couple of months ago, RSiteSearch
>     on the mailing list with ROC and optimal or optimum should get you
>     answers.
> 
> 
> 
>         I would like to thank you in advance for any help
> 
>     You're welcome.
> 
>     Claudia

-- 
Claudia Beleites
Dipartimento dei Materiali e delle Risorse Naturali
Università degli Studi di Trieste
Via Alfonso Valerio 6/a
I-34127 Trieste

phone: +39 0 40 5 58-37 68
email: cbeleites at units.it