Program of the 2009 Israel Statistical Association Conference

Israel Statistical Association Conference: 17.6.09, Ben-Gurion University

The conference program was organized by Saharon Rosset

9:30-10:00	Gathering, registration and light refreshments. Building 98, Auditorium 001
10:10-11:00	Opening remarks: Yoav Benjamini, President of the Association; Jimmy Weinblatt, Rector of Ben-Gurion University; Gadi Rabinowitz, Head of the Department of Industrial Engineering and Management, Ben-Gurion University
	Opening lecture: Uzi Motro, The Hebrew University. DNA in Court: From the Experiences of an Expert Witness
11:00-11:20	Coffee break and move to classrooms
11:20-13:00	Parallel sessions: Invited talks
	Session I: Statistics in Medicine
	Chair: Yossi Levy, Teva. Location: Building 90, Room 325
	- Neuroprotective and disease modifying clinical trial design in Parkinson Disease. Eli Eyal, Teva Pharmaceutical Industries
	- Statistician’s role in a pharmaceutical industry. Daniel Rothenstein, Quark Pharmaceuticals
	- Alternative implementations of Whitehead's methodology for blinded sample size reassessment in survival studies. Yossi Levy, Teva Pharmaceutical Industries
	- Estimating and testing interactions in linear regression models when explanatory variables are subject to non-classical measurement error. Havi Murad, Gertner Institute
	Session II: Official Statistics
	Chair: Tzahi Makovky, Central Bureau of Statistics (CBS). Location: Building 90, Room 326
	- Employment and wages of bachelor's degree recipients: follow-up of the 2000-2004 graduates. Dmitri Romanov, Chief Scientist, CBS. Note: 30 minutes (instead of 25)
	- Integrated Income Survey: is there justification for combining data from different sources? Noam Cohen, CBS. Note: 20 minutes (instead of 25)
	- Methodology of record linkage between household surveys and the Population Register. Theodor Itzkov, CBS
	- Nonparametric estimation of non-response distribution in the Israeli Social Survey. Yury Gubman, Central Bureau of Statistics
13:00-14:00	Lunch; annual meeting of the Israel Statistical Association
14:00-15:00	Parallel sessions: Contributed talks
	Session III: Methodology and Theory
	Chair: Felix Abramovich, Tel Aviv University. Location: Building 90, Room 325
	- Maximum Likelihood Approach under Indirect Setup. Luba Sapir, BGU
	- Adjusted Bayesian Inference for Selected Parameters. Daniel Yekutieli, Tel Aviv University
	- Simultaneous testing of several families of hypotheses. Marina Bogomolov, Tel Aviv University
	Session IV: Applications
	Chair: Edna Schechtman, Ben-Gurion University. Location: Building 90, Room 326
	- Electronic Records Of Undesirable Driving Events. Oren Musicant, BGU
	- Stochastic Dynamic Allocation of Kidneys Based on Historical Data Logs. Inbal Yahav, University of Maryland
	- How to Dig for DEGs. Lena Granovsky, Technion
15:00-16:40	Parallel sessions: Invited talks
	Session V: Bioinformatics and Biostatistics
	Chair: Anat Reiner-Benaim, University of Haifa. Location: Building 90, Room 325
	- The Quality Preserving Database: A Computational Framework for Encouraging Collaboration, Enhancing Power and Controlling False Discovery. Ehud Aharoni, IBM Research
	- Improving the Performance of the FDR Procedure Using an Estimator for the Number of True Null Hypotheses. Amit Zeisel, Weizmann Institute
	- Matrices, Modules and Motifs for Understanding Gene Regulation. Ron Shamir, Tel Aviv University
	- On Combining and Contrasting Brains. Nicole Lazar, University of Georgia
	Session VI: Industrial Statistics
	Chair: Tamar Gadrich, ORT Braude. Location: Building 90, Room 326
	- Process and Service Optimization via Bayesian Networks. Aviv Gruber, Tel Aviv University
	- On Non-Homogeneous Markov Reward Models for Reliability Measures Estimation of Aging Multi-State System. Ilia Frenkel, Sami Shamoon College, Beer Sheva
	- Sensitivity Analysis, Computational Models And Statistics: A Case Study Of A Nuclear Waste Repository. David Steinberg, Tel Aviv University
	- Developing a sampling plan based on the process capability index Cpk. Itai Negrin, Ben-Gurion University
16:40-17:00	Coffee break and move to Auditorium 001 in Building 98
17:00-18:10	Presentation of the Association's awards
	Panel discussion: Statistics at Israeli universities: where to?
	Moderator: Yoav Benjamini, Chair of the Association
	Participants:
	- Avner Halevy, University of Haifa
	- Danny Pfeffermann, The Hebrew University
	- Ephraim Goldin, G-Stat
	- Saharon Rosset, Tel Aviv University

Lecture abstracts

Opening lecture: DNA in Court: From the Experiences of an Expert Witness
Uzi Motro
Department of Statistics and Department of Evolution, Systematics and Ecology
The Hebrew University of Jerusalem

Recently we have witnessed the growing use of forensic identification based on DNA markers. What are these DNA markers, how is identification carried out in different cases, and what are the probabilistic and statistical aspects of identification? These are the questions we will address, and we will not skip real-life examples.

Session I: Statistics in Medicine
Chair: Yossi Levy, Teva

Neuroprotective and disease modifying clinical trial design in Parkinson Disease

Eli Eyal, Teva Pharmaceutical Industries

A neuroprotective therapy is the most important unmet medical need in Parkinson's Disease. Several agents that were promising in the laboratory have been tested in the clinic, but none has been established in clinical trials to have a disease modifying effect. The delayed start design was developed to try to avoid a symptomatic confound when testing a potential neuroprotective therapy. In this study design, patients are randomly assigned to study drug or placebo in the first phase of the study, and both groups receive the active drug in the second phase. If benefits seen at the end of phase I persist through the end of phase II, they cannot be explained by a symptomatic effect (as patients in both groups are receiving the same medication) and benefits in the early start group must relate to the early initiation of the treatment. The delayed start design was used to assess the potential disease modifying effects of rasagiline in a prospective double blind controlled trial (the ADAGIO study). The hypothesis of a disease modifying effect is tested by examining the change from baseline in the Unified Parkinson Disease Rating Scale (UPDRS), comparing slope estimates and end-of-trial estimates from Mixed Models with Repeated Measures (MMRM) analyses. In this talk I will describe the ADAGIO trial design, the primary efficacy hypotheses, the statistical models, and the challenges we faced in power calculation, model assumptions, and sensitivity analyses for missing observations.

Statistician’s role in a pharmaceutical industry

Daniel Rothenstein, Quark Pharmaceuticals

Chronic renal failure (CRF) is a long-standing, progressive deterioration of renal function. CRF can be roughly categorized as diminished renal reserve, renal insufficiency, and renal failure (end-stage renal disease). Chronic loss of function causes generalized wasting (shrinking in kidney size / mass) and progressive scarring within all parts of the kidneys.

Quark Pharmaceuticals developed a Monoclonal Antibody (MAB) to a proprietary target gene as a potential therapeutic agent against renal fibrosis. The rat 5/6 Nephrectomy model was selected for testing the efficacy of the MAB.

This presentation will emphasize the statistician’s role and contribution throughout all the study stages: (a) selection and characterization of the animal model used to assess the effect of the MAB in preventing the development of chronic renal failure (CRF), (b) development of the MAB, and (c) testing MAB delivery efficiency in the animal model in order to determine a therapeutic regime (dosage and schedule).

Alternative implementations of Whitehead's methodology for blinded sample size reassessment in survival studies

Yossi Levy, Teva Pharmaceutical Industries

Whitehead (2001) proposed a general framework for a blinded mid-trial design review of a survival clinical trial, and a particular implementation example was given by Whitehead et al. (2001). However, there are other possible ways to implement this framework. In this talk, I will describe alternative implementations of the framework and compare them using simulations.

Estimating and testing interactions in linear regression models when explanatory variables are subject to non-classical measurement error

H. Murad¹, V. Kipnis², D. Midthune², O. Kalter-Leibovici³, L. S. Freedman¹

¹Biostatistics Unit and ³Unit for Cardiovascular Epidemiology, Gertner Institute for Epidemiology and Health Policy Research, Tel-Hashomer, Israel

²Biometry Research Group, Division of Cancer Prevention, National Cancer Institute, Rockville, MD, USA

Estimating and testing interactions in a linear regression model when covariates are subject to measurement error (ME) is complex, since the interaction term is a product of two or more covariates and involves errors of more complex structure.

We have described in recent work [1] how to estimate and test for interactions when covariates are normally distributed and subject to classical ME. We proposed a version of regression calibration (RC), and showed that it yields consistent estimators and standard errors, and that its test for interaction appears to perform well. In many applications the classical ME model does not hold, particularly when measurements are based on self-report. In this paper, we generalize our RC approach for interaction models to a class of non-classical ME models. Motivated by an application that includes a sub-sample calibration with imperfect reference instruments, we account for the extra uncertainty involved in estimating the ME parameters, using the stacking equations method. We apply our method to data from the Observing Protein and Energy Nutrition (OPEN) study, where the level of errors is high, and find that RC does not work well for estimation, yielding inflated parameter estimates and standard errors. This problem is overcome by extending the method to efficient RC [2], where the estimator combines the information about the parameters of interest contained in the sub-sample calibration study with our RC estimator from the main study, using a generalized inverse-variance weighted average. We also developed an efficient version of the method of moments (MM) that does not assume normally distributed covariates. Using simulations based on the design of the OPEN study, we show that with sub-samples of 100-500 both efficient RC and efficient MM perform well. In another set of simulations, we investigate the Type I error rate of the interaction test and show that the efficient RC-based test is slightly liberal when the level of ME is high.
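
The inverse-variance weighting mentioned above can be illustrated with a minimal sketch. It is a scalar simplification, assuming two independent estimates of the same parameter; the "generalized" version in the abstract would presumably weight vector estimates by inverse covariance matrices. The function name and all numbers are hypothetical and do not come from the OPEN study.

```python
def inverse_variance_combine(estimates, variances):
    """Combine independent estimates of one scalar parameter.

    Each estimate is weighted by the reciprocal of its variance, so the
    more precise estimate dominates the combined value.
    """
    if not estimates or len(estimates) != len(variances):
        raise ValueError("need equally many estimates and variances")
    weights = [1.0 / v for v in variances]
    total_weight = sum(weights)
    combined = sum(w * e for w, e in zip(weights, estimates)) / total_weight
    combined_variance = 1.0 / total_weight
    return combined, combined_variance


# Hypothetical inputs: an estimate from the main study and one from the
# calibration sub-study (illustrative values only).
main_estimate, main_variance = 0.80, 0.04
calib_estimate, calib_variance = 0.50, 0.16

combined, combined_variance = inverse_variance_combine(
    [main_estimate, calib_estimate], [main_variance, calib_variance]
)
print(combined, combined_variance)
```

The combined variance is never larger than the smaller of the input variances, which is the sense in which such a combination is "efficient" under the independence assumption.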

1. Murad, H. and Freedman, L. S. (2007). Estimating and testing interactions in linear regression models when explanatory variables are subject to classical measurement error. Statistics in Medicine 26, 4293-4310.

2. Spiegelman, D., Carroll, R. J., and Kipnis, V. (2001). Efficient regression calibration for logistic regression in main study/internal validation study designs with an imperfect reference instrument. Statistics in Medicine 20, 139-160.

Session II: Official Statistics
Chair: Tzahi Makovky, Central Bureau of Statistics (CBS)

Employment and wages of bachelor's degree recipients: follow-up of the 2000-2004 graduates
Dr. Dmitri Romanov, Chief Scientist, CBS

This publication presents data on trends in employment and income from work among recipients of a bachelor's degree in 2000-2004. The report is based on administrative data collected by the Central Bureau of Statistics and the Israel Tax Authority, and is intended to complement existing publications on the income of bachelor's degree recipients by presenting a broad and detailed picture of their entry into the professional labor market. The report follows the population of bachelor's degree recipients registered in income tax files over a period of up to five years, and examines the course of their employment by many educational and demographic characteristics. Because this publication is also based on administrative data, it provides almost complete coverage of official employment data in the economy (as reported to the income tax authority), and thereby overcomes traditional limitations of survey-based data, such as sampling errors and bias in small groups.
This is the first time that a detailed report of this kind has been prepared in Israel, and its data can assist a large number of users. From a human capital perspective, this publication provides important information on the income values associated with particular study programs. For those who wish to continue their studies, the data can indicate the effect that their decisions about different study programs will have, both on future income and on employment opportunities. In addition, with the help of the study's results, potential employers will be able to make their economic activity more efficient with regard to wages by specific fields of education (that is, the type of degree and its level).
Policy makers and researchers, in both education and employment, can also benefit from an analysis of the income and employment patterns of recent bachelor's degree recipients. These data can provide, indirectly, information about the quality of particular study programs and institutions, as well as their contribution to the labor market.
Finally, given the important role of higher education in determining income and employment over a lifetime, these data can assist groups interested in promoting equal opportunities for disadvantaged populations and minority groups, both in acquiring higher education and in entering the labor market.

Integrated Income Survey: is there justification for combining data from different sources?

Noam Cohen, CBS

The Bureau conducts two surveys that collect information on household income: the Income Survey and the Household Expenditure Survey (HES).

The Income Survey, conducted on an ongoing basis since 1965, is a supplement to the Labour Force Survey (LFS), in which those belonging to the Income Survey population are asked to report their income. The HES has been conducted on an ongoing basis since 1997, and one of its main goals is to determine the weight of each component in the consumption basket of the Consumer Price Index, based on the components of household budgets. The sampling process is similar in the two surveys and is carried out separately and independently for each. The final sampling unit is a dwelling, and all households living in it at the time of the interview are surveyed (usually a single household). Since 1997, the income data from the two surveys have been combined into one survey called the "Integrated Income Survey".

In 2002 a study was conducted on non-response and the resulting biases in income data from the different sources, based on data for 1997-1999. The study presents the different non-response patterns of the two surveys across a range of demographic characteristics, and the resulting biases in various income estimates. The study concluded that the different non-response patterns bias the income variables of the two surveys in opposite directions, and that combining them in the "Integrated Income Survey" substantially reduces these biases and covers the target population better.

With the planning of a new format for the Labour Force Survey, a decision is needed on the structure of the future Income Survey. A series of checks was carried out to determine whether the non-response patterns of the two surveys still "complement each other", and whether the biases arising from non-response offset each other in a way that justifies combining the surveys today as well. For this purpose an up-to-date statistical analysis was performed on the survey data for 2004-2007, including an analysis parallel to the one that appears in the paper.

Methodology of record linkage between family surveys and the Population Register

Theodor Itzkov, Head of Sector, CBS

Liat Nukrian, Head of Branch, CBS

The record linkage process between respondents to the Bureau's family surveys and the Population Register is intended to add an identity number field. With a valid identity number, respondents to the household surveys can be linked to many other sources, serving various goals of the Bureau, such as assessing the quality of data collected in the field against the same data in administrative sources, adding variables from administrative sources to the survey database, and reducing response burden by dropping variables that already exist in administrative sources.

In the Bureau's family surveys, interviewees are asked to report the identity numbers of household members, but the response rate is low. Only about 50% of respondents to the Labour Force Survey (LFS) and about 70% of respondents to the Household Expenditure Survey (HES) provided identity numbers, hence the need for linkage.

The methodology for linking the family survey files to the Population Register is based on the record linkage system of the Integrated Census. In addition, algorithms and programs were developed that exploit the family information available in the survey by: (a) attaching the details of first-degree relatives to each individual; (b) matching survey households to administrative families (or to residents of a building) in the register (quasi-block variables).

Two types of linkage are implemented in the system: exact linkage, which requires full agreement between the linkage variables, and probabilistic linkage, which requires partial agreement between the variables. Exact linkage is fully automatic; probabilistic linkage is mostly automatic and partly manual with computer support. The linkage process is multi-stage: at each stage a different set of linkage variables is examined, and a record that has been linked is flagged and does not continue in the process. The family survey linkage methodology comprises 19 stages of exact linkage, 15 stages of regular probabilistic linkage, and one stage of family/group probabilistic linkage based on quasi-block variables.
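
The multi-stage logic described above (exact stages first, then probabilistic stages, with linked records leaving the process) can be sketched roughly as follows. This is an illustration under loud assumptions, not the Bureau's system: the field names (`sid`, `id_number`, `first`, `last`, `birth_year`), the string-similarity score, the threshold and all records are hypothetical, and a production system would more likely use agreement weights estimated from the data.

```python
from difflib import SequenceMatcher


def exact_stage(survey, register, keys):
    """Link survey records that agree with exactly one register record on all keys."""
    index = {}
    for rec in register:
        index.setdefault(tuple(rec[k] for k in keys), []).append(rec)
    links = {}
    for rec in survey:
        candidates = index.get(tuple(rec[k] for k in keys), [])
        if len(candidates) == 1:  # ambiguous matches are left for later stages
            links[rec["sid"]] = candidates[0]["id_number"]
    return links


def probabilistic_stage(survey, register, block_key, compare_keys, threshold):
    """Link on partial agreement: best average string similarity within a block."""
    links = {}
    for rec in survey:
        best_score, best = 0.0, None
        for cand in register:
            if cand[block_key] != rec[block_key]:
                continue  # only compare records inside the same block
            score = sum(
                SequenceMatcher(None, str(rec[k]), str(cand[k])).ratio()
                for k in compare_keys
            ) / len(compare_keys)
            if score > best_score:
                best_score, best = score, cand
        if best is not None and best_score >= threshold:
            links[rec["sid"]] = best["id_number"]
    return links


def multi_stage_link(survey, register, exact_stages, prob_stages):
    """Run the stages in order; a linked record does not continue in the process."""
    links, remaining = {}, list(survey)
    for keys in exact_stages:
        links.update(exact_stage(remaining, register, keys))
        remaining = [r for r in remaining if r["sid"] not in links]
    for block_key, compare_keys, threshold in prob_stages:
        links.update(
            probabilistic_stage(remaining, register, block_key, compare_keys, threshold)
        )
        remaining = [r for r in remaining if r["sid"] not in links]
    return links, remaining


# Hypothetical toy data.
register = [
    {"id_number": "001", "first": "David", "last": "Cohen", "birth_year": 1970},
    {"id_number": "002", "first": "Sarah", "last": "Levi", "birth_year": 1982},
    {"id_number": "003", "first": "Yosef", "last": "Mizrahi", "birth_year": 1965},
]
survey = [
    {"sid": "s1", "first": "David", "last": "Cohen", "birth_year": 1970},
    {"sid": "s2", "first": "Sara", "last": "Levy", "birth_year": 1982},
    {"sid": "s3", "first": "Miriam", "last": "Peretz", "birth_year": 1990},
]

links, remaining = multi_stage_link(
    survey,
    register,
    exact_stages=[("first", "last", "birth_year")],
    prob_stages=[("birth_year", ("first", "last"), 0.8)],
)
print(links, [r["sid"] for r in remaining])
```

In this toy run the first record links exactly, the second links only in the probabilistic stage because of spelling variants, and the third stays unlinked, which is the kind of record that would be reviewed as a possible missed link.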

In an experiment conducted with the system, a quarterly LFS file (second quarter of 2006) and the annual HES file for 2006 were linked, and a link was found for about 95% of those sampled in both surveys. Manual linkage accounted for 2.5% in the HES and 4.8% in the LFS; all the rest were linked automatically. The links found were checked on a sample basis, and the rate of false links (type I linkage error) does not exceed 0.5%. In addition, records for which no link was found, some of which constitute type II linkage error, were examined and analyzed.

Nonparametric estimation of non-response distribution in the Israeli Social Survey

Yury Gubman¹, Charles F. Manski², John V. Pepper³, Dmitri Romanov¹

Non-response adjustment in the Israeli Social Survey (ISS) is based on the missing at random (MAR) assumption. An association of non-response with key socio-economic characteristics (an individual's economic status and degree of religiosity) that do not correlate strongly with the standard survey design and calibration variables may undermine the validity of the MAR assumption. We analyze survey and item non-response in the ISS by estimating non-parametric sharp bounds on the conditional means of key ISS variables. Statistical tests of the validity of the missing completely at random (MCAR) and MAR assumptions are proposed, where the test statistics are based on the width of the interval between the estimated bounds. We find significant departures from the MAR assumption in the ISS data. Non-response propensity varies significantly between population groups assumed to be homogeneous according to the survey design. We propose to utilize information about income and religiosity, available at the individual or neighborhood level, to improve the ISS design.
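
As a rough illustration of what sharp bounds on a mean look like under non-response, the sketch below computes the standard worst-case bounds for a variable with known range: non-respondents are set first to the minimum and then to the maximum possible value. It assumes a bounded outcome and a single cell of the survey design; the function name and the data are hypothetical, and the estimators and tests actually used in the talk are not specified in the abstract.

```python
def worst_case_mean_bounds(responses, n_sampled, lower, upper):
    """Bounds on a population mean when non-respondents' values are unknown.

    responses : observed values, each within [lower, upper]
    n_sampled : number of sampled units (respondents plus non-respondents)
    """
    n_resp = len(responses)
    if not 0 < n_resp <= n_sampled:
        raise ValueError("need between 1 and n_sampled respondents")
    p_resp = n_resp / n_sampled
    mean_resp = sum(responses) / n_resp
    low = p_resp * mean_resp + (1.0 - p_resp) * lower
    high = p_resp * mean_resp + (1.0 - p_resp) * upper
    return low, high


# Hypothetical cell: 10 sampled persons, 8 respond to a yes/no item, 6 say yes.
responses = [1, 1, 1, 1, 1, 1, 0, 0]
low, high = worst_case_mean_bounds(responses, n_sampled=10, lower=0.0, upper=1.0)
width = high - low
print(low, high, width)
```

The width equals the non-response rate times the range of the variable, so it grows with non-response; under MCAR within the cell the respondent mean (0.75 here) would be the point estimate, and it always lies inside the bounds.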

Keywords: survey non-response, item non-response, MAR assumption, sharp bounds.

¹ Israeli Central Bureau of Statistics, Jerusalem, Israel

² Northwestern University, Illinois, USA

³ University of Virginia, Virginia, USA

Session III: Methodology and Theory
Chair: Felix Abramovich, Tel Aviv University

Maximum Likelihood Approach under Indirect Setup

Daniel Berend, Luba Sapir

In the classical setup of point estimation problems, one has observations of a random sample from a known density function with an unknown parameter. In various applications, though, the observations themselves are unknown. Instead, we have some information related to these "unobserved observations". Namely, there is a second distribution, also depending on some parameter, and a table of observations from this distribution; each row of the table arises from an instance of the second distribution, where the parameter is one of the observations in the first stage. Our purpose is to estimate the parameter of the distribution from the first stage in the best way. Note that, in the case where the second distribution is constant, the problem reduces to the classical estimation problem. The setup is motivated by practical contemporary applications from various domains, such as classification, reputation systems in e-commerce, survey analysis, etc.

We propose for this setup an approach for estimating the unknown parameter, based on the data of the second level. We try to obtain an analogue of the maximum likelihood estimator, which in our case may be termed the indirect maximum likelihood estimator. We illustrate it in two situations. In one of these, the maximum likelihood estimator can be obtained in closed form; in the other it can be obtained only implicitly as a solution of a certain equation. In the first case we discuss the properties and intuitive meaning of our estimator, and show its advantages vis-a-vis other natural estimators. In the second, we suggest an iterative scheme, and under certain conditions prove its convergence to the maximum likelihood estimate.

Adjusted Bayesian inference for selected parameters

Daniel Yekutieli, Tel Aviv University

We address the problem of providing inference for parameters selected after viewing the data. A frequentist solution to this problem is False Discovery Rate adjusted inference. We explain the role of selection in controlling the occurrence of false discoveries in Bayesian analysis, and argue that Bayesian inference may also be affected by selection. We introduce selection-adjusted Bayesian methodology based on the conditional posterior distribution of the parameters given selection; show how it can be used to specify selection criteria; explain how it relates to the Bayesian FDR approach; and apply it to microarray data.

Simultaneous testing of several families of hypotheses

Marina Bogomolov and Yoav Benjamini, Tel Aviv University

Modern statisticians are often challenged with a large set of hypotheses that are divided into families. For example, in fMRI research we wish to test hypotheses about the activation of voxels, where each voxel belongs to some anatomic brain region. These brain regions define families of hypotheses. In some applications it is desirable to achieve FDR control within each family along with overall FDR control. We will show that combined testing of the entire set of hypotheses does not offer FDR control within each family, whereas separate testing of each family of hypotheses does not offer overall FDR control. We will propose some in-between approaches: several variations of hierarchical testing methodologies and a multi-stage hierarchical testing methodology. We will support, with theoretical and simulation results, the conjecture that these methodologies offer both within-family and overall FDR control. The overall power of these methodologies will also be considered.
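
The two baseline strategies contrasted above, combined testing of all hypotheses versus separate testing of each family, can be sketched with the Benjamini-Hochberg (BH) step-up procedure. The sketch only shows that the two strategies can reach different decisions on the same data; it does not implement the hierarchical methodologies proposed in the talk, and the family names and p-values are hypothetical.

```python
def benjamini_hochberg(pvalues, q):
    """Return the set of indices rejected by the BH step-up procedure at level q."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    k = 0
    for rank, i in enumerate(order, start=1):
        if pvalues[i] <= rank * q / m:
            k = rank  # largest rank whose p-value is under its threshold
    return set(order[:k])


# Hypothetical p-values for voxels in two brain regions (families).
families = {
    "region_A": [0.001, 0.002, 0.003, 0.004],
    "region_B": [0.03, 0.20, 0.50, 0.90],
}
q = 0.05

# Separate testing: BH applied within each family.
separate = {name: benjamini_hochberg(p, q) for name, p in families.items()}

# Combined testing: BH applied once to the pooled set of hypotheses.
pooled_pvalues = [p for pvals in families.values() for p in pvals]
pooled = benjamini_hochberg(pooled_pvalues, q)

print(separate)
print(pooled)
```

Here the strong signals in region_A raise the pooled thresholds enough for the p-value 0.03 from region_B to be rejected in the combined analysis, although nothing is rejected in region_B when it is tested on its own.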

Session IV: Applications
Chair: Edna Schechtman, Ben-Gurion University

Electronic Records Of Undesirable Driving Events

Oren Musicant, Hillel Bar-Gera, Edna Schechtman

Department of Industrial Engineering and Management

Ben-Gurion University of the Negev

The majority of road crashes can be attributed to drivers' behavior. Recent in-vehicle monitoring technologies enable continuous and detailed measurements of certain driver behaviors. We analyzed the information received from a novel in-vehicle technology which identifies the occurrence of undesirable driving events such as extreme braking and accelerating, sharp cornering and sudden lane changing. Previous studies demonstrated the connection between these events and accident records, suggesting that event frequency (EF) is an appropriate surrogate for safety.

We undertook an exploratory analysis to provide a better understanding of the statistical properties of EF. Some findings are in line with the driving safety literature, such as differences between males and females and between night and day. Other findings were somewhat less expected, such as differences in mean EF between trip edges (trip beginning and trip end) and the middle of the trip.

The continuous, high-resolution measurements of the in-vehicle technology enabled interesting advanced statistical analyses. Future research can use our findings to build similar statistical models that predict the occurrence of undesirable driving events from other independent variables.

Stochastic Dynamic Allocation of Kidneys Based on Historical Data Logs

Inbal Yahav, Gisela Bardossy

R.H. Smith School of Business, University of Maryland

iyahav@rhsmith.umd.edu

According to the annual statistics of the Scientific Registry of Transplant Recipients (SRTR), more than 79,000 candidates with End Stage Renal Disease (ESRD) are waiting for a kidney transplant in the U.S. Only about 10,000 deceased donor kidneys become available for transplantation each year, whereas more than 20,000 new candidates are added to the waiting list annually.

For the last thirty years, the allocation policy has revolved around a set of priority points that seek to favor tissue matching between donor and recipient while prioritizing patients who have spent the most time on the waiting list. As the waiting list grew and the population of candidates aged, the allocation process came to be dominated by waiting time and became merely a first-come, first-transplanted system.

In this paper we propose a stochastic dynamic programming approach to the kidney allocation problem. Our method seeks to maximize a multicriteria objective that balances the efficiency of the allocations with equity in the system. The novelty of our method is that it incorporates future uncertain allocations (extracted from historical data logs) into the decision process, yet remains computationally feasible despite the challenges posed by the large dimensionality of the problem. We show that the resulting policy is incentive compatible, in the sense that neither the policy planner nor the patients have an incentive to deviate from the proposed allocation.

How to Dig for DEGs

Lena Granovsky and Paul Feigin

Permutation methods are commonly used to estimate a (null) distribution for nondifferentially expressed genes in microarray experiments. However, different permutation methods can lead to different estimates of the null distribution and consequently can result in substantially different lists of differentially expressed genes (DEGs).

This article explores the problem of choosing the appropriate permutation procedure in order to find the right null hypothesis density. Our method modifies the nonparametric empirical Bayes approach proposed by Efron et al. (2001) by suggesting a number of permutation procedures, which are used for an empirical test of the appropriate null hypothesis. We compute the a-posteriori probabilities of each gene to be differentially expressed. Analogous to Efron’s method, we assume that the observed expression scores are generated from a mixture of a distribution for the affected genes and distribution for the unaffected genes. This mixture density, as well as the density of the unaffected genes (the null hypothesis density), are estimated from the empirical distribution of the observed expression levels using different permutation procedures. The desired inference about the differential expression for a particular gene is then referred to these estimated densities.
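
The two-component mixture described above can be written compactly. The symbols below are assumptions introduced for illustration, following the usual notation of Efron et al. (2001): $z$ is a gene's expression score, $f_0$ and $f_1$ are the densities for unaffected and affected genes, and $p_0$, $p_1 = 1 - p_0$ are their prior proportions.

```latex
\[
  f(z) = p_0 f_0(z) + p_1 f_1(z),
  \qquad
  \Pr(\text{affected} \mid z) = 1 - \frac{p_0 f_0(z)}{f(z)} .
\]
```

Because $f$ is estimated from the observed scores and $f_0$ from permuted data, a different permutation scheme changes the estimate of $f_0$ and therefore every posterior probability.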

The methods proposed in this study are applied to the data produced by the Affymetrix GeneChip and Illumina BeadChip one-color microarray platforms. We provide a comparison of a number of permutation methods applied to the data from two differently designed experiments: Microarray Quality Control (MAQC) (2006) and a pharmacogenetics study of multiple sclerosis treatment. We use these data sets because they allow us to confirm whether the proposed permutation procedures are appropriate for use in the most common designs for microarray experiments. We show that the proposed permutation techniques may be used when the experimental design includes technical variation within the same treatment group (as in the MAQC data) or biological variation (as in the multiple sclerosis experiment). Although we focus on two specific sets of experimental data, the proposed methods can be quite generally applied to many kinds of comparative microarray experiments with an arbitrary number of samples. In addition, robustness of the MAQC study may be examined by comparing its results to the methods proposed in this study.

In this work we show that the choice of the permutation method should depend on the experimental design. Different choices of the null distribution result in substantially different lists of differentially expressed genes. We compare our models with a two-sample t-test adjusted for multiple testing using the most widely used procedure, Benjamini and Hochberg’s FDR procedure. We show that in many applications the usual estimate of the null distribution is incorrect and might produce large differences in a list of differentially expressed genes. We also draw attention to the issue of adding a constant to the denominator of the t-statistic to improve the estimate of the standard deviation for genes at low expression levels. We show that in some cases, different choices of this constant influence which genes are identified as significant.
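
The effect of adding a constant to the denominator of the t-statistic can be seen in a small sketch. It assumes a pooled-variance two-sample statistic with a constant `s0` (a name chosen here for illustration) added to the standard error; the expression values are hypothetical and the way the authors choose the constant is not described in the abstract.

```python
import math
import statistics


def t_with_constant(x, y, s0=0.0):
    """Pooled-variance two-sample t-like statistic with s0 added to the denominator.

    With s0 = 0 this is the ordinary two-sample t-statistic.
    """
    nx, ny = len(x), len(y)
    pooled_var = (
        (nx - 1) * statistics.variance(x) + (ny - 1) * statistics.variance(y)
    ) / (nx + ny - 2)
    std_err = math.sqrt(pooled_var * (1.0 / nx + 1.0 / ny))
    return (statistics.mean(x) - statistics.mean(y)) / (std_err + s0)


# Hypothetical expression values for two genes measured in two groups.
low_gene = ([1.00, 1.01, 0.99], [1.05, 1.06, 1.04])   # tiny effect, tiny variance
high_gene = ([10.0, 11.0, 9.0], [13.0, 14.0, 12.0])   # large effect, larger variance

for s0 in (0.0, 0.1):
    t_low = t_with_constant(*low_gene, s0=s0)
    t_high = t_with_constant(*high_gene, s0=s0)
    print(f"s0={s0}: |t| low gene = {abs(t_low):.2f}, high gene = {abs(t_high):.2f}")
```

Without the constant, the low-expression gene has the larger statistic only because its estimated standard deviation is very small; with a modest constant the ranking of the two genes reverses, which is how the choice of constant can change the list of significant genes.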

In addition, the results of our study are consistent with the results of the MAQC study, indicating that for the chosen sample types and laboratories, microarray results are reproducible between test sites and comparable across the Affymetrix and Illumina platforms. However, variation in the lengths of differentially expressed genes lists is found between the two companies. Significant differences are also found between the list sizes of the different sites, especially for the Illumina platform. These results indicate that the platform and site affect the outcome probabilities and should be taken into account when producing a list of differentially expressed genes.

Session V: Bioinformatics and Biostatistics
Chair: Anat Reiner-Benaim, University of Haifa

The Quality Preserving Database: A Computational Framework for Encouraging Collaboration, Enhancing Power and Controlling False Discovery

Ehud Aharoni¹, Hani Neuvirth-Telem¹ and Saharon Rosset²

¹IBM Haifa Research Laboratories

²Tel Aviv University

The common scenario in computational biology in which a community of researchers conduct multiple statistical tests on one shared database gives rise to the multiple hypotheses testing problem. Conventional procedures for solving this problem control the probability of false discovery by sacrificing some of the power of the tests.

We suggest a scheme for controlling false discovery without any power loss by adding new samples for each use of the database and charging the user with the expenses.

The crux of the scheme is a carefully crafted pricing system that fairly prices different user requests based on their demands while keeping the probability of false discovery bounded.

We demonstrate this idea in the context of HIV treatment research, where multiple researchers conduct tests on a repository of HIV samples.

Improving the Performance of the FDR Procedure Using an Estimator for the Number of True Null Hypotheses

Amit Zeisel¹, Or Zuk² and Eytan Domany¹

¹Department of Physics of Complex Systems, The Weizmann Institute of Science, Rehovot, Israel

²Broad Institute of MIT and Harvard, Cambridge, MA, USA

The steep rise in the availability and usage of high-throughput technologies in biology has brought with it a clear need for methods to control the False Discovery Rate (FDR) in multiple tests. In 1995 Benjamini and Hochberg (BH) introduced a simple procedure and proved that it provides a bound on the expected value, FDR ≤ q. Since then, many authors have tried to improve the BH bound. We present two very simple estimators for m0, the number of true null hypotheses. Using these estimators we propose improved bounds on the FDR, and demonstrate their advantages over several available bounds, on simulated data and on a large number of gene expression datasets. Using the original BH bound as reference, our method can be used in two ways: (1) for FDR estimation, i.e. to keep the number of discoveries (e.g. differentiating genes) but provide a lower estimate of the bound on their FDR, or (2) for FDR control, i.e. to keep the desired FDR level and return a larger number of discoveries. In order to prove the control property for our two proposed procedures, we first proved a more general theorem, which provides a bound on the FDR for any procedure that uses an estimator for the number of true null hypotheses. Both applications involve the same amount of computation as the BH procedure.
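
How an estimate of m0 can increase the number of discoveries may be sketched as follows. The two estimators proposed in the talk are not given in the abstract, so the sketch swaps in a simple Storey-type estimator based on the count of p-values above a cutoff `lam`; that estimator, the cutoff and the p-values are assumptions made only for illustration.

```python
def step_up(pvalues, q, m_effective):
    """BH-style step-up rule with thresholds rank * q / m_effective."""
    order = sorted(range(len(pvalues)), key=lambda i: pvalues[i])
    k = 0
    for rank, i in enumerate(order, start=1):
        if pvalues[i] <= rank * q / m_effective:
            k = rank
    return set(order[:k])


def estimate_m0(pvalues, lam=0.5):
    """Storey-type estimate of the number of true nulls (stand-in estimator).

    True nulls have roughly uniform p-values, so the count above lam,
    rescaled by 1 / (1 - lam), approximates m0. The +1 is a common
    finite-sample adjustment; the result is capped at m.
    """
    m = len(pvalues)
    above = sum(1 for p in pvalues if p > lam)
    return min(float(m), (above + 1) / (1.0 - lam))


# Hypothetical p-values.
pvalues = [0.001, 0.004, 0.008, 0.012, 0.018, 0.028, 0.04, 0.3, 0.6, 0.9]
q = 0.05

bh_rejections = step_up(pvalues, q, m_effective=len(pvalues))
m0_hat = estimate_m0(pvalues)
adaptive_rejections = step_up(pvalues, q, m_effective=m0_hat)

print(len(bh_rejections), m0_hat, len(adaptive_rejections))
```

Replacing m by the smaller estimate m0_hat loosens every threshold by the factor m / m0_hat, so at the same nominal level q the adaptive rule rejects at least as many hypotheses as plain BH.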

Matrices, Modules and Motifs for Understanding Gene Regulation

Ron Shamir

Tel Aviv University

High throughput experimental techniques are generating larger and larger biomedical datasets, with great potential to further our understanding of basic life and disease processes. However, developing integrated analysis methods of these data that provide focused biomedical insights is still a major challenge, especially when the data are heterogeneous. This talk will describe some of our tools for understanding gene regulation, by joint analysis of gene expression, protein interaction networks, sequence and clinical data from patient records. Our methods combine graph algorithms with probabilistic modeling and heuristics. I will demonstrate the methods on specific biological applications related to stem cells and cancer.

Joint work, in parts, with Igor Ulitsky (TAU), Richard M. Karp (UC Berkeley), Adi Maron-Katz, Y. Halperin, C. Linhart (TAU), R. Elkon (NCI Netherland), and Y. Shiloh (TAU).

On Combining and Contrasting Brains

Nicole A. Lazar

Department of Statistics, University of Georgia, Athens, GA USA

A challenging problem in the statistical analysis of human brain function via functional magnetic resonance imaging (fMRI) is that of comparing activation across groups of subjects. In the first part of this talk, I will discuss methods for "combining" brains, that is, creating a map that summarizes the overall activity pattern of a group of subjects. This can be analogized to the old problem of combining information from independent studies, and I draw on techniques historically used for that problem, to solve the current one.
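
One classical technique for combining information from independent studies is Fisher's method for combining p-values, and a group map could in principle be built by applying it voxel by voxel. Whether the talk uses this particular method is an assumption; the sketch below, with hypothetical subject maps, only illustrates the analogy.

```python
import math


def chi2_sf_even_df(x, k):
    """Survival function of a chi-square variable with 2k degrees of freedom.

    For even degrees of freedom it has the closed form
    exp(-x/2) * sum_{i<k} (x/2)^i / i!.
    """
    half = x / 2.0
    term, total = 1.0, 1.0
    for i in range(1, k):
        term *= half / i
        total += term
    return math.exp(-half) * total


def fisher_combine(pvalues):
    """Fisher's method: -2 * sum(log p) is chi-square with 2k df under the null."""
    statistic = -2.0 * sum(math.log(p) for p in pvalues)
    return statistic, chi2_sf_even_df(statistic, len(pvalues))


# Hypothetical p-value maps for three subjects over four voxels.
subject_maps = [
    [0.04, 0.50, 0.20, 0.01],
    [0.06, 0.70, 0.15, 0.03],
    [0.05, 0.40, 0.30, 0.02],
]

# Combine across subjects separately at each voxel.
group_map = [fisher_combine(voxel_ps)[1] for voxel_ps in zip(*subject_maps)]
print([round(p, 4) for p in group_map])
```

Several moderately small p-values at the same voxel combine into stronger group-level evidence than any single subject provides, which is the motivation for borrowing such combination rules.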

Once a map has been created for a single group of subjects, we can think about "contrasting", or comparing, the maps for multiple groups. While group comparisons are often accomplished via such standard techniques as the random effects linear model, I will argue that this approach is potentially over-conservative, impairing the ability to detect differences of interest, which may be differences of extent, of magnitude, or both. Instead, I propose extending several of the methods used in the first part of the talk for making group maps, via a combination of statistical distribution theory and computational procedures (bootstrap and permutation). In the second part of this talk, I will discuss some of the issues that arise in extending group maps in this way, and some possible solutions.
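
The "combining" step described in the first part of the abstract can be illustrated with one classical technique for pooling independent studies, Fisher's method. That this particular method is among those used in the talk is an assumption; the function name is illustrative. The sketch combines the p-values that k subjects produce at a single voxel into one group-level p-value, using the closed form of the chi-square survival function for an even number of degrees of freedom.

```python
import math
from typing import Sequence


def fisher_combined_pvalue(pvalues: Sequence[float]) -> float:
    """Fisher's method for k independent p-values, each in (0, 1].

    The statistic -2 * sum(log p_i) is chi-square with 2k degrees of
    freedom under the global null. For even degrees of freedom:
        P(X > x) = exp(-x/2) * sum_{i=0}^{k-1} (x/2)^i / i!
    """
    k = len(pvalues)
    if k == 0:
        raise ValueError("need at least one p-value")
    half_stat = -sum(math.log(p) for p in pvalues)  # equals statistic / 2
    term = 1.0
    total = 1.0
    for i in range(1, k):
        term *= half_stat / i
        total += term
    return math.exp(-half_stat) * total


if __name__ == "__main__":
    # Hypothetical p-values from four subjects at one voxel.
    print(fisher_combined_pvalue([0.04, 0.20, 0.07, 0.11]))
```

Applying the function voxel by voxel would give a group map of combined p-values; the result is only valid when the subjects' p-values are independent.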

Session VI: Industrial Statistics
Chair: Tamar Gadrich, ORT Braude

Process and Service Optimization via Bayesian Networks

Aviv Gruber, Irad Ben-Gal

Department of Industrial Engineering

Tel-Aviv University

Ramat-Aviv, Tel-Aviv 69978, Israel

We focus on the development of an automated Data Mining tool that supports optimization processes in industrial and service systems. The tool should be implemented on the organization database and analyze it with respect to some target function(s). For example, it should analyze the effects of different marketing indicators on the organization revenue or, similarly, the effects of different maintenance actions on the overall reliability of the system.

Our baseline reference is a hybrid optimization method that uses Monte Carlo calculation points from which an analytic model learns its essential optimization parameters. Yet this method was developed within the context of specific problems and lacks the ability to tackle generic complex stochastic systems.

On Non-Homogeneous Markov Reward Models for Reliability Measures Estimation of Aging Multi-State System

Ilia Frenkel, Lev Khvatskin

Center for Reliability and Risk Management, Industrial Engineering and Management Department, Sami Shamoon College of Engineering, Beer Sheva, Israel

iliaf@sce.ac.il / khvat@sce.ac.il

Anatoly Lisnianski

The Israel Electric Corporation Ltd., Haifa, Israel

anatoly-l@iec.co.il

This paper considers reliability measures for an aging multi-state system, where the system and its components can have different performance levels ranging from perfect functioning to complete failure. Aging is modeled as an increasing failure rate function. The suggested approach uses a non-homogeneous Markov reward model to compute commonly used reliability measures for an aging multi-state system, such as mean accumulated performance deficiency, mean number of failures and average availability. Corresponding procedures for defining the reward matrix are suggested for the different reliability measures. A numerical example is presented to illustrate the approach.
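
For orientation, a Markov reward model of this kind is commonly written as the following system of differential equations (Howard's equations with time-dependent intensities). That the talk uses exactly this form is an assumption, and the symbols K, a_{ij}(t), r_{ij} and V_i(t) are introduced here for illustration only.

```latex
% V_i(t): expected total reward accumulated up to time t, starting in state i
% a_{ij}(t): transition intensity from state i to state j (time-dependent
%            because of aging), with a_{ii}(t) = -\sum_{j \neq i} a_{ij}(t)
% r_{ii}: reward per unit time while in state i
% r_{ij}: reward paid at each transition from i to j (j \neq i)
\frac{dV_i(t)}{dt}
  = r_{ii}
  + \sum_{\substack{j=1 \\ j \neq i}}^{K} a_{ij}(t)\, r_{ij}
  + \sum_{j=1}^{K} a_{ij}(t)\, V_j(t),
\qquad V_i(0) = 0, \quad i = 1, \dots, K.
```

Different reliability measures then correspond to different reward matrices. For example, setting r_{ii} = 1 for acceptable states and all other rewards to zero makes V_i(T)/T the average availability over [0, T], while setting r_{ij} = 1 only for transitions from acceptable to unacceptable states makes V_i(T) the mean number of failures.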

Sensitivity Analysis, Computational Models And Statistics: A Case Study Of A Nuclear Waste Repository

David M. Steinberg

Department of Statistics and Operations Research, Tel Aviv University

For many scientific and engineering research problems, the computer terminal has replaced the laboratory and the test bench for conducting experiments. Many phenomena are too expensive for physical testing (e.g. automobile crash resistance tests) or impossible due to time scales (e.g. climate change).

One important use of computational models is for sensitivity analysis, which is aimed at identifying the most important inputs to the model. As such, sensitivity analysis has much in common with traditional experimental design.

This talk will discuss statistical methods being used for sensitivity analysis in the context of a case study, using the RESRAD simulator, developed at Argonne National Laboratory, to identify critical parameters for a nuclear waste repository. Typical risk analyses consider the migration of radionuclides into, for example, the food and water supply during tens of thousands of years. Among numerous outcomes associated with risk, we focus on the maximal equivalent annual dose in the drinking water during a 10,000-year time frame. There is good understanding of the physics that govern decay and migration and that is reflected in the computational model. Inputs to the model include parameters that govern the interaction between the isotopes and the repository site (for example, level of precipitation, pumping rate and distribution coefficients). The exact values of the parameters depend on the specifics of the repository site and the isotopes themselves. Prior knowledge of many of the parameter values is often weak. A concern in planning a repository is to identify the parameters that are most influential in controlling the risk. We will describe some possible approaches to designing the study and to the statistical analysis of the outcome data.

The case study was carried out in collaboration with Tamir Reisin, Eyal Hashavia and Gideon Leonard from the Israel Atomic Energy Commission.

Developing a Sampling Plan Based on the Process Capability Index Cpk

Itai Negrin, Israel Parmet, Edna Schechtman

Ben-Gurion University of the Negev

The variables sampling plans commonly used in industry today consider only process performance and ignore the engineering requirements when determining the required sample size. When the process capability Cpk is good, the limits of the lot distribution (the (1-γ/2) percentile) lie well inside the engineering specification limit, the tail area of the normal distribution beyond the specification limit is close to zero, and the effect of the sampling error (which arises from using an estimate of the distribution limit) on this area is negligible. In contrast, when the lot distribution limit is close to the specification limits (marginal process capability), the tail area of the normal distribution beyond the specification limit is substantial, so the sampling error is expected to translate directly into a deviation in the estimated fraction of defectives in the lot. In this study, two multi-stage sampling plans based on the process capability index Cpk have been developed so far. They are designed for acceptance inspection of quantitative variables taken as a simple random sample from a lot of given size N that comes from a normal distribution. The first sampling plan was developed under the assumption of known variance and was therefore based on the sampling error arising from estimating the mean only. In the second sampling plan the known-variance assumption was removed, so its development was based on the sampling error arising from estimating [...] by means of the estimator [...]. The first sampling plan was compared (by simulation) with the MIL-STD-414 sampling plan and was found to have a lower probability of rejecting good lots (0%-0.03% versus 0%-0.27% for MIL-STD-414) and a lower probability of accepting defective lots (0%-0.01% versus 5.2%-8.4% for MIL-STD-414). In addition, the sampling plan was found to have an advantage in sample size for processes with high process capability. After the simulations were run, an operating characteristic curve was developed that describes the relationship between the acceptance probability and the fraction defective. This curve was compared with the operating characteristic curve of the MIL-STD-414 sampling plan and was found to have a discriminating ability (between acceptable and defective lots) an order of magnitude greater than that of MIL-STD-414.
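
A minimal Python sketch of the quantities the abstract reasons about, using the standard textbook definition of Cpk and the normal tail area beyond the specification limits. The function names and the numerical values are illustrative assumptions, not taken from the study. The sketch shows why the same error in the estimated mean matters little for a highly capable process and a great deal for a marginal one.

```python
from statistics import NormalDist


def cpk(mean: float, sigma: float, lsl: float, usl: float) -> float:
    """Process capability index: distance from the mean to the nearest
    specification limit, in units of three standard deviations."""
    return min(usl - mean, mean - lsl) / (3.0 * sigma)


def fraction_nonconforming(mean: float, sigma: float,
                           lsl: float, usl: float) -> float:
    """Normal tail area falling outside the specification limits."""
    dist = NormalDist(mean, sigma)
    return dist.cdf(lsl) + (1.0 - dist.cdf(usl))


if __name__ == "__main__":
    lsl, usl = 7.0, 13.0          # hypothetical specification limits
    shift = 0.3                   # hypothetical error in the estimated mean
    for sigma in (0.5, 1.0):      # capable process vs marginal process
        base = fraction_nonconforming(10.0, sigma, lsl, usl)
        off = fraction_nonconforming(10.0 + shift, sigma, lsl, usl)
        print(f"Cpk={cpk(10.0, sigma, lsl, usl):.2f}  "
              f"p={base:.2e}  p with shifted mean={off:.2e}")
```

With these illustrative numbers, the absolute change in the nonconforming fraction caused by the shift is far larger for the marginal process (Cpk = 1) than for the capable one (Cpk = 2), which is the motivation the abstract gives for tying the sample size to Cpk.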