Responsible Conduct in Data Management

Data selection is defined as the process of determining the appropriate data type and source, as well as suitable instruments to collect data. Data selection precedes the actual practice of data collection. This definition distinguishes data selection from selective data reporting (selectively excluding data that is not supportive of a research hypothesis) and interactive/active data selection (using collected data for monitoring activities/events, or conducting secondary data analyses). The process of selecting suitable data for a research project can impact data integrity.

The primary objective of data selection is the determination of appropriate data type, source, and instrument(s) that allow investigators to adequately answer research questions. This determination is often discipline-specific and is primarily driven by the nature of the investigation, existing literature, and accessibility to necessary data sources.

Integrity issues can arise when decisions about which ‘appropriate’ data to collect are based primarily on cost and convenience rather than on the ability of the data to adequately answer research questions. Certainly, cost and convenience are valid factors in the decision-making process. However, researchers should assess to what degree these factors might compromise the integrity of the research endeavor.

Considerations/issues in data selection

There are a number of issues that researchers should be aware of when selecting data. These include determining:
  • the appropriate type and sources of data which permit investigators to adequately answer the stated research questions,
  • suitable procedures to obtain a representative sample, and
  • the proper instruments to collect data. There should be compatibility between the type/source of data and the mechanisms to collect it. It is difficult to extricate the selection of the type/source of data from instruments used to collect the data.

Types/Sources of Data

Depending on the discipline, data types and sources can be represented in a variety of ways. The two primary data types are quantitative (represented as numerical figures: interval- and ratio-level measurements) and qualitative (text, images, audio/video, etc.). Although scientific disciplines differ in their preference for one type over another, some investigators use both quantitative and qualitative data with the expectation of developing a richer understanding of a targeted phenomenon. Data sources can include field notes, journals, laboratory notes/specimens, or direct observations of humans, animals, or plants.

Data type and source frequently interact. Researchers collect information from human beings that can be qualitative (e.g., observing child-rearing practices) or quantitative (e.g., recording biochemical markers or anthropometric measurements). Determining appropriate data is discipline-specific and is primarily driven by the nature of the investigation, existing literature, and accessibility to data sources.

Questions that need to be addressed when selecting data type and source include:

  1. What is (are) the research question(s)?
  2. What is the scope of the investigation? (This defines the parameters of any study. Selected data should not extend beyond the scope of the study).
  3. What has the literature (previous research) determined to be the most appropriate data to collect?
  4. What type of data should be considered: quantitative, qualitative, or a composite of both?

Methodological Procedures to Obtain a Representative Sample

The goal of sampling is to select a data source that is representative of the entire data universe of interest. Depending on discipline, samples can be drawn from human or animal populations, laboratory specimens, observations, or historical documents. Failure to ensure representativeness may introduce bias, and thus compromise data integrity.

It is one thing to have a sampling methodology designed for representativeness and yet another thing for the data sample to actually be representative. Thus, data sample representativeness should be tested and/or verified before use of those data.
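One way such a check might look is a chi-square goodness-of-fit comparison between the sample and known population shares for a single characteristic. The sketch below is illustrative only: the age groups, the population shares, the two example samples, and the function names are all hypothetical, and the 0.05 significance level is an assumed convention rather than something the module prescribes.

```python
from collections import Counter

# Chi-square critical values at alpha = 0.05, keyed by degrees of freedom.
CHI2_CRITICAL_05 = {1: 3.841, 2: 5.991, 3: 7.815, 4: 9.488}


def chi_square_statistic(sample_labels, population_shares):
    """Goodness-of-fit statistic: sum of (observed - expected)^2 / expected."""
    counts = Counter(sample_labels)
    n = len(sample_labels)
    stat = 0.0
    for category, share in population_shares.items():
        expected = n * share
        observed = counts.get(category, 0)
        stat += (observed - expected) ** 2 / expected
    return stat


def looks_representative(sample_labels, population_shares):
    """True when the sample does not differ significantly at alpha = 0.05."""
    df = len(population_shares) - 1
    stat = chi_square_statistic(sample_labels, population_shares)
    return stat <= CHI2_CRITICAL_05[df]


# Hypothetical population shares for one characteristic (age group).
population_shares = {"18-39": 0.40, "40-64": 0.40, "65+": 0.20}

# Two hypothetical samples of 100 participants each.
balanced = ["18-39"] * 42 + ["40-64"] * 38 + ["65+"] * 20
skewed = ["18-39"] * 70 + ["40-64"] * 25 + ["65+"] * 5

print(looks_representative(balanced, population_shares))
print(looks_representative(skewed, population_shares))
```

A check like this covers only the characteristics for which population figures are available; passing it does not rule out bias on unmeasured characteristics.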

Potential biases limit the ability to draw inferences about larger populations. Characteristics on which a sample may be biased include, among others, sex, age, race, height, and geographical locale.

A variety of sampling procedures are available to reduce the likelihood of drawing a biased sample, and some of them are listed below:

  1. Simple random sampling
  2. Stratified sampling
  3. Cluster sampling
  4. Systematic sampling

These sampling methods try to ensure that the sample is representative of the entire population by incorporating an element of ‘randomness’ into the selection procedure, which gives researchers a greater ability to generalize findings to the targeted population. These methods contrast sharply with the convenience sample, where little or no attempt is made to ensure representativeness.
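The four procedures can be sketched in a few lines each. This is a minimal illustration, not a substitute for a research methodology textbook: the population of 100 units, the `sex` and `site` attributes, and all function names are hypothetical, and details such as how stratum sizes are rounded are assumptions made for the example.

```python
import random
from collections import defaultdict


def simple_random_sample(units, n, rng):
    """Every unit has the same chance of being selected."""
    return rng.sample(units, n)


def stratified_sample(units, stratum_of, fraction, rng):
    """Sample the same fraction independently within each stratum."""
    strata = defaultdict(list)
    for unit in units:
        strata[stratum_of(unit)].append(unit)
    chosen = []
    for members in strata.values():
        k = max(1, round(len(members) * fraction))
        chosen.extend(rng.sample(members, k))
    return chosen


def cluster_sample(units, cluster_of, n_clusters, rng):
    """Randomly pick whole clusters and keep every unit inside them."""
    clusters = defaultdict(list)
    for unit in units:
        clusters[cluster_of(unit)].append(unit)
    picked = rng.sample(sorted(clusters), n_clusters)
    return [unit for name in picked for unit in clusters[name]]


def systematic_sample(units, step, rng):
    """Choose a random starting point, then take every step-th unit."""
    start = rng.randrange(step)
    return units[start::step]


# Hypothetical population: (id, sex, site) for 100 units across 5 sites.
population = [(i, "F" if i % 2 == 0 else "M", f"site{i % 5}") for i in range(100)]
rng = random.Random(42)  # fixed seed so the draw can be reproduced and audited

srs = simple_random_sample(population, 10, rng)
strat = stratified_sample(population, lambda u: u[1], 0.10, rng)
clus = cluster_sample(population, lambda u: u[2], 2, rng)
syst = systematic_sample(population, 10, rng)

print(len(srs), len(strat), len(clus), len(syst))
```

In each function the random element is explicit: which units, which units within each stratum, which clusters, or which starting point. A convenience sample has no such step.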

The random sampling procedures common in quantitative research contrast with the predominant type of sampling conducted in qualitative research. Since investigators may be focusing on a small number of cases, sampling procedures are often purposive or theoretical rather than random. According to Savenye and Robinson (2004), “For the study to be valid, the reader should be able to believe that a representative sample of involved individuals was observed. The ‘multiple realities’ of any cultural context should be represented.”

Each strategy has its appropriate application for specific scenarios (the reader is advised to review research methodology textbooks for detailed information on each sampling procedure). Selection bias can occur when a selected sampling procedure is not implemented properly. The resulting non-representative sample may exhibit disproportionate numbers of participants sharing characteristics (e.g., race, gender, age, geographic locale) that could interact with main effect variables (Skodol, Bender, 2003; Robinson, Woerner, Pollack, Lerner, 1996; Maynard, Selker, Beshansky, Griffith, Schmid, Califf, D’Agostino, Laks, Lee, Wagner, 1995; Fourcroy, 1994; Gurwitz, Col, Avorn, 1992). Use of homogeneous samples in clinical trials may limit the ability of researchers to generalize findings to a broader population (Sharpe, 2002; Dowd, Recker, Heaney, 2000; Johnson, 1990). The issues of sampling procedures apply to both quantitative and qualitative research areas.

Savenye and Robinson (2004) contrast this approach with qualitative researchers’ tendency to interpret results of an investigation or draw conclusions based on specific details of a particular study, rather than in terms of generalizability to other situations and settings. While findings from a case study cannot be generalized, this data may be used to develop research questions later to be investigated in an experiment (Savenye, Robinson, 2004).

Selection of Proper Instrument

The potential for compromising data integrity also exists in the selection of instruments to measure targeted data. Typically, researchers are familiar with the range of instruments that are conventionally used in a specialized field of study. Challenges occur when researchers fail to keep abreast of critiques of existing instruments or diagnostic tests (Goehring, Perrier, Morabia, 2004; Walter, Irwig, Glasziou, 1999; Khan, Khan, Nwosu, Arnott, Chien, 1999). Furthermore, researchers may:

  • be unaware of the development of more refined instruments
  • use instruments that have not been field-tested, calibrated, validated or measured for reliability
  • apply instruments to populations for which they were not originally intended

Questions that should be addressed in the selection of instruments include:

  1. How was data collected in the past?
  2. Is (are) the instrument(s) appropriate for the type of data sought?
  3. Will the instrument(s) be adequate to collect all necessary data to the degree needed?
  4. Is the instrument current, properly field-tested, calibrated, validated, and reliable?
  5. Is the instrument appropriate for collecting data from a source other than the one originally intended? Should the instrument be modified?

Attention to the data selection process is crucial in supporting the research steps that follow. Despite efforts to maintain strict adherence to data collection protocols, selection of fitting statistical analyses, accurate data reporting, and an unbiased write-up, scientific findings will have questionable value if the data selection process is flawed.


References

Dowd, R., Recker, R.R., Heaney, R.P. (2000). Study subjects and ordinary patients. Osteoporos Int. 11(6): 533-6.

Fourcroy, J.L. (1994). Women and the development of drugs: why can’t a woman be more like a man? Ann N Y Acad Sci, 736: 174-95.

Goehring, C., Perrier, A., Morabia, A. (2004). Spectrum Bias: a quantitative and graphical analysis of the variability of medical diagnostic test performance. Statistics in Medicine, 23(1):125-35.

Gurwitz, J.H., Col, N.F., Avorn, J. (1992). The exclusion of the elderly and women from clinical trials in acute myocardial infarction. JAMA, 268(11): 1417-22.

Hartt, J., Waller, G. (2002). Child abuse, dissociation, and core beliefs in bulimic disorders. Child Abuse Negl. 26(9): 923-38.

Khan, K.S., Khan, S.F., Nwosu, C.R., Arnott, N., Chien, P.F. (1999). Misleading authors’ inferences in obstetric diagnostic test literature. American Journal of Obstetrics and Gynecology, 181(1): 112-5.

Maynard, C., Selker, H.P., Beshansky, J.R., Griffith, J.L., Schmid, C.H., Califf, R.M., D’Agostino, R.B., Laks, M.M., Lee, K.L., Wagner, G.S., et al. (1995). The exclusion of women from clinical trials of thrombolytic therapy: implications for developing the thrombolytic predictive instrument database. Medical Decision Making, 15(1): 38-43.

Robinson, D., Woerner, M.G., Pollack, S., Lerner, G. (1996). Subject selection bias in clinical trials: data from a multicenter schizophrenia treatment center. Journal of Clinical Psychopharmacology, 16(2): 170-6.

Sharpe, N. (2002). Clinical trials and the real world: selection bias and generalisability of trial results. Cardiovascular Drugs and Therapy, 16(1): 75-7.

Walter, S.D., Irwig, L., Glasziou, P.P. (1999). Meta-analysis of diagnostic tests with imperfect reference standards. J Clin Epidemiol., 52(10): 943-51.

Whitney, C.W., Lind, B.K., Wahl, P.W. (1998). Quality assurance and quality control in longitudinal studies. Epidemiologic Reviews, 20(1): 71-80.