Machine Learning · December 26, 2023

Key Data Collection Methods

In the model-building process, the third crucial step is collecting the required data. There are two main approaches to data collection: indirect and direct.

1. Indirect: Leverage existing data from third parties, following their rules and licenses. Ideal when time or resources are tight.

2. Direct: Conduct your own survey by creating a questionnaire. Questionnaire design matters, and you can collect responses through online or offline platforms. An essential aspect of direct data collection is selecting the right respondents, which requires understanding the concepts of population and sample: the population includes all individuals of interest, and the sample is a representative subset of it. This method involves:

  • Questionnaire Design: Craft questions covering key demographic and research-related areas.
  • Respondent Selection: Choose the right population and draw a representative sample using various methods like:
    • Probabilistic:
      • Simple Random Sampling:
        • Purely random selection, like drawing names from a hat.
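        • On any single draw, each respondent’s probability of being selected is 1/N, where N is the population size; for a sample of n respondents, the overall chance of being included is n/N.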
        • For this method we can use Excel functions such as RAND() and RANDBETWEEN(), or any other function that generates a random number within a specified range.
      • Systematic Sampling:
        • Select every kth individual from the population.
        • Determine the interval k with the formula k = N/n, where N = Population Size and n = Sample Size.
        • For example, to draw a sample of 5,000 from a population of 25,000, the N/n rule gives 25000/5000 = 5, which means selecting every 5th individual from the population.
          • In this case, the researcher picks any starting number between 1 and 5 and thereafter selects every 5th respondent from the population.
        • Let’s say the researcher starts with the 5th respondent; the next selected respondents will then be the 10th, 15th, 20th, 25th, and so on, giving data from 5,000 respondents.
          • If the researcher starts with the 2nd respondent, the next selected respondents will be the 7th, 12th, 17th, 22nd, and so on.
      • Stratified Sampling:
        • Divide the population into groups (strata) based on shared characteristics and then randomly select individuals from each.
        • For example, the “Male” stratum contains males only, whereas the “Female” stratum contains females only.
        • After the strata have been formed, select respondents randomly from each stratum.
        • For example, if we have created 25 strata, each made up of people with similar characteristics, we will select respondents randomly from each of these 25 strata.
        • Note that each stratum differs from every other stratum.
      • Cluster Sampling:
        • Form clusters, each containing respondents with mixed characteristics.
        • For example, a cluster can contain respondents of different ages, genders, income levels, etc.
        • After the clusters have been created, randomly select some of them and collect data from the selected clusters.
        • For example, if we have created 700 clusters, we will randomly select 100 of those 700 clusters.
      • The two main differences between cluster sampling and stratified sampling are:
        • A cluster contains respondents with heterogeneous characteristics, whereas a stratum contains respondents with homogeneous characteristics.
        • In cluster sampling, entire clusters are selected randomly from the set of clusters, whereas in stratified sampling, individual respondents are selected randomly from within every stratum.
    • Non-Probabilistic:
      • Convenience Sampling:
        • Select readily available respondents, like your students or hospital staff.
        • For example:
          • If a doctor is doing a survey, he/she will ask his/her hospital staff to fill out the questionnaire.
          • If a professor is doing a survey, he/she will ask his/her students to fill out the questionnaire.
      • Quota Sampling:
        • Select respondents to meet specific quotas for different attributes (e.g., age, income).
        • For example, take only x respondents [quota] aged 20–30, only y respondents [quota] aged 40–50, only z respondents [quota] with an income of 20K–40K, etc.
      • Snowball Sampling:
        • Ask initial respondents to recommend others with similar characteristics.
        • Chain selection from one respondent to another.
        • First find a respondent with the required area of interest, then ask him/her to find other respondents with similar characteristics.
        • The technique is mainly useful when it is difficult to find respondents with the required area of interest.
        • For example, if we want to do a survey on old programming languages like ALGOL, it will be difficult today to find someone who knows it. In this case, we first find one respondent who knows ALGOL and then ask him/her to refer other respondents who know ALGOL as well.
      • Judgmental Sampling:
        • Rely on an expert’s judgment to choose relevant respondents.
        • If we want to select respondents from a hospital for a medical survey, a medical expert will first advise the researchers on whom to consider for the survey.
        • The researcher does not always have to select respondents alone; sometimes a researcher can ask a subject matter expert for help in deciding on respondents.
  • Example Scenario: Consider setting up a restaurant. Design a questionnaire covering demographics, dining habits, preferences, and more.
    • Let’s say we want to set up a new restaurant in a city, and we want to know which type of restaurant the people of the city love the most. For this purpose, we will design a questionnaire with questions such as:
      • Age
      • Gender
      • Income
      • How frequently a person visits a restaurant
      • Which cuisine a person likes the most
      • Average spending on dining out
      • Whether a person goes to a restaurant with family or friends
      • Whether a person prefers outdoor or indoor seating
    • After the questionnaire is ready, the next step is to decide who should fill it out. As we are setting up a restaurant in a city, the people of that city who have visited a restaurant at least once will be considered our population.
    • As we cannot go to each and every individual to fill out the questionnaire, we will use one of the sampling methods to select a few individuals to fill it out and gather the required data.
    • The selected individuals form the sample. In statistics, we assume that a well-drawn sample represents the population and shares its characteristics.
    • To check how well a sample represents the population, statistics offers methods such as hypothesis testing, which we will cover later.
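
The four probabilistic methods described above can be sketched in Python roughly as follows. This is only a minimal illustration using the standard random module, not a definitive implementation: the population of 25,000 people, the "gender" attribute, the helper function names, and the way the 700 clusters are formed are all assumptions made for the example.

```python
import random
from collections import defaultdict


def simple_random_sample(population, n, rng):
    """Draw n individuals so that each one is equally likely to be chosen."""
    return rng.sample(population, n)


def systematic_sample(population, n, rng):
    """Pick a random start within the first k individuals, then every k-th one."""
    k = len(population) // n      # sampling interval from the N/n rule
    start = rng.randrange(k)      # 0-based index, i.e. position 1..k in the text
    return population[start::k][:n]


def stratified_sample(population, key, per_stratum, rng):
    """Group individuals by a shared characteristic, then sample inside each group."""
    strata = defaultdict(list)
    for person in population:
        strata[key(person)].append(person)
    sample = []
    for members in strata.values():
        sample.extend(rng.sample(members, min(per_stratum, len(members))))
    return sample


def cluster_sample(clusters, n_clusters, rng):
    """Pick whole clusters at random and keep every member of the chosen ones."""
    chosen = rng.sample(clusters, n_clusters)
    return [person for cluster in chosen for person in cluster]


rng = random.Random(42)  # fixed seed so the sketch is repeatable

# Hypothetical population: 25,000 people with an id and a randomly assigned gender.
population = [
    {"id": i, "gender": rng.choice(["Male", "Female"])}
    for i in range(1, 25001)
]

simple = simple_random_sample(population, 5000, rng)
systematic = systematic_sample(population, 5000, rng)
stratified = stratified_sample(population, lambda p: p["gender"], 2500, rng)

# Hypothetical clusters: 700 mixed groups built by dealing people out in turn.
clusters = [population[i::700] for i in range(700)]
clustered = cluster_sample(clusters, 100, rng)

print(len(simple), len(systematic), len(stratified), len(clustered))
print([p["id"] for p in systematic[:5]])
```

In this sketch the stratified sample takes the same number of people from each stratum, which is just one possible choice; a real study might instead sample each stratum in proportion to its size. The clusters here are formed artificially, whereas in practice they would usually be existing groupings.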

Remember, the ideal method depends on your research project and resources.

Understanding these methods is crucial for effective data collection. Stay tuned for the next article where we delve into determining the sample size, a critical aspect of the research design. For a comprehensive understanding, explore more about research design.

You can watch the video on this topic here.