## Concept of Data and Data Collection

Data are the numerical facts that result from the counting or enumerating process. The first stage of any statistical investigation is data collection, and the main challenge is to gather data related to the subject. The term “**investigator**” refers to the person who runs the statistical investigation. **“Informants” or “respondents”** are the people from whom the data or information is gathered, and “**statistical units**” are the things that are counted.

The systematic process of counting or enumerating the items is known as **data collection**. Data may be quantitative or qualitative. Quantitative data is information that can be counted or measured numerically.

Qualitative data describe attributes or qualities, such as an individual’s character, that cannot be measured numerically. Investigators decide how to proceed with the investigation based on the data they gather. The outcome will be accurate if the data are collected without bias or error, and inaccurate otherwise. As a result, the investigator should exercise caution when gathering data.

**According to Creswell (2014)**, *data collection is the process of gathering information or data for research purposes. This process can involve various methods, including surveys, interviews, observations, and document analysis.*

**Bryman (2016)** defines *data collection as the systematic gathering of information from various sources using various research methods such as questionnaires, interviews, and observations.*

**Yin (2018)** describes *data collection as the process of systematically gathering information from various sources to answer a research question. This can involve a variety of methods, including interviews, observations, and document analysis.*

**Patton (2015)** defines *data collection as the process of gathering information or data through various methods such as interviews, surveys, focus groups, and observations. This information is then analyzed to understand patterns, trends, and relationships.*

**Denzin and Lincoln (2018)** describe *data collection as the process of gathering and recording information for research purposes using various techniques such as interviews, observations, and surveys. The data collected is then analyzed to draw conclusions and answer research questions.*

To summarize, **data collection** is the systematic and purposeful gathering of information or data for research purposes using various methods and techniques.

## Prerequisites of Data Collection

The methods used to collect data vary. The nature, goal, and scope of the investigation all influence the data collection strategy. False data will produce inaccurate and deceptive results. Consequently, the following factors should be taken into account during the data collection process:

**The objective of the inquiry:** The survey’s goals should be clearly stated before data collection to guarantee accurate information. If there is no clear objective, the information gathered could be completely pointless or could even obscure the issue.

**Scope of inquiry:** The scope of the investigation should be precisely defined after the objective is established. The scope is the area from which data is to be collected. If the scope is not established, useless data may be collected, wasting time and resources.

**Source of information:** After deciding on the goal and scope of the investigation, the next step is to find the source of information. Data sources can be internal or external. Internal data is gathered from an organization’s internal records and is related to the operation of that organization. External data is obtained from an outside source or through the intermediary of an outside agency. The information gathered from outside sources could be primary or secondary. The use of either or both will be determined by the objective and scope of the investigation. Statistical research is typically conducted when the sources of information are outside of the organization.

**Method of data collection:** The data can be collected using a census or sampling. Whether an investigator uses a census or a sampling method depends on the situation, the goal, the scope, the accuracy required, the resources needed, and the amount of time available.

**Units of data collection:** Before starting to collect data, the unit of measurement or counting must be clearly defined. The use of ambiguous or incorrect units results in the collection of misleading statistical data.

**Degree of accuracy:** It is impossible to achieve 100% accuracy in any statistical investigation. As a result, the desired accuracy level must be determined before collecting data. The purpose and scope of the investigation determine the desired level of accuracy, which in turn helps to determine the method of data collection.

**Types of inquiry:** The type of inquiry to be used should be decided before collecting data. The types of inquiries used in statistical research are as follows.

- Direct and indirect,
- Official, semi-official, or non-official,
- Original or repetitive,
- Confidential or open, and
- Regular or ad hoc

**Types of data**

Based on the source, there are two types of data:

**1) Primary data**

Primary data is information gathered for the first time by an investigator or an agent of the investigator. Primary data are original, and their main purposes are to reduce the complexity of an investigation and provide accurate information. Such data are collected from surveys, institutions, government departments, research teams, and individuals.

For example, suppose an investigator wants to study the household income and expenditure of a fixed-income group. In that case, he or she will collect data directly from the field, i.e. the households. In short, primary data are data collected for the first time to meet the objectives of the investigation.

**Problems in collecting primary data**

The following problems may arise when collecting primary data:

- The informant may not be available on the spot, or may refuse to answer or provide an incorrect answer.
- Illiterate informants may not understand the questions, making it difficult to obtain accurate information from them. Where many respondents are illiterate, the mailed questionnaire method has a high rate of non-response, and informants may not return the questionnaire or may send back an incomplete one.
- Informants may not provide accurate responses to sensitive questions.
- Enumerators may be deceitful. They may complete the questionnaire without conducting interviews with the respondents. In addition, an untrained or inexperienced enumerator cannot collect unbiased data.

**Methods of collecting primary data**

The different methods of collecting primary data are as follows:

- Direct personal interview,
- Indirect oral interview,
- Information through correspondents,
- Mailed questionnaire method, and
- Questionnaire sent through enumerators.

**a. Direct Personal Interview**

In this method, the investigator goes into the field and collects data directly from respondents by asking them questions related to the investigation. For example, an investigator who wants to know the economics marks obtained by Grade 11 students would go to the school, contact the students, and obtain the desired information.

**Merits**

- In general, this method ensures greater accuracy.
- This method typically elicits a higher level of response from the respondent.
- Aside from the required information, the investigator can collect various supplementary data.
- Accurate information can be obtained by speaking to respondents in their native language and at their level of education.

**Demerits**

- This method usually requires more time, labor, and money.
- This method does not apply to a broader field of study.
- There is a possibility of receiving biased information.

**b. Indirect oral interview**

Getting correct information directly from a respondent about sensitive matters such as drinking, smoking, and gambling is often very difficult. In these situations, the respondent may refuse to answer or may give incorrect information. So, to get more accurate information about the respondent, questions are put to third parties who know him or her, such as friends, relatives, and neighbors. Such a third party is called a “witness”.

**Merits**

- This method is appropriate for a wide range of data collection tasks.
- This method requires less time, money, and labor.
- This method is free of the investigator’s and the informant’s bias.
- The opinion and suggestions of experts can be solicited using this method.

**Demerits**

- The investigation’s findings could be skewed.
- Witnesses may color the information according to their own interests.
- A third party or witness may be unwilling to provide information about the respondent.

**c. Information through correspondents**

In this method, the investigator or agency has local correspondents in various parts of the field of investigation. These correspondents gather information and send it to the headquarters, where it is processed. Newspapers, radio, and television commonly use this method.

**Merits**

- This method is useful when the subject matter is extremely broad.
- This method is cost-effective and time-saving.
- Information is obtained on a regular and convenient basis.

**Demerits**

- The results of this method are not very reliable.
- This method of gathering information may not produce consistent results.

**d. Mailed Questionnaire**

In this method, a list of questions (known as a questionnaire) is prepared and mailed to various informants with a request for a prompt and thorough response within a specified time frame. The questionnaire contains questions and space for responses. Respondents complete the questionnaire and return it to the investigator.

**Merits**

- This method is cost-effective in terms of money, labor, and time.
- This method is preferable when the scope of the investigation is broad.
- The results are free of the investigator’s bias.

**Demerits**

- When the informants are illiterate, this method cannot be used.
- This method has a high rate of non-response.
- The questionnaire may not be filled out completely or carefully.

**e. A questionnaire sent through enumerators**

In this method, the questionnaire (also called a schedule) is taken to the informants by trained enumerators. The enumerator contacts the informants, asks the questions on the questionnaire, and records the responses in his or her own handwriting.

**Merits**

- It is useful when dealing with illiterate informants.
- There is a very low rate of non-response.
- The information received is accurate.

**Demerits**

- This method requires more time, money, and labor.
- The enumerator may have personal biases.
- Variations may occur in the information obtained from different enumerators.

**2) Secondary data**

Secondary data is information that was originally gathered by someone for their own purposes and is later used by another person or organization. As a result, what is primary data for one person may be secondary data for another. Secondary data are therefore data that have already been collected and are obtained from published or unpublished sources. For example, the records published by the Central Bureau of Statistics, Nepal Rastra Bank, NGOs, INGOs, and other official publications are primary data for those organizations but secondary data for others.

**Problems in collecting secondary data**

The following problems may arise when collecting secondary data:

- Data that has not been published can be difficult to obtain.
- Data obtained from published or unpublished sources may or may not be suitable and reliable for further investigation.
- The required information may not have been published at the time of the investigation.
- Because secondary data inherit the limitations of the original primary data, they may be inaccurate.

**Sources of Secondary Data**

*There are two sources of secondary data:*

- Published sources

- Unpublished sources

**A) Published Sources**

*The main published sources of secondary data are as follows:*

**a. Government Publications**

The main government publications are listed below:

- Statistical pocketbooks, statistical yearbooks, population monographs of Nepal, and other publications of the Central Bureau of Statistics (CBS),
- Economic Survey published by the Ministry of Finance,
- The National Planning Commission’s Plan document and various publications from various ministries, departments, and governmental offices.

**b. Semi-government Publications**

*The semi-governmental publications in Nepal are as follows:*

- Reports published by Nepal Rastriya Banijya Bank, Nepal Bank Limited, and Nepal Industrial Development Corporation (NIDC),
- Reports published by Nepal Agriculture Research Centre (NARC), Industrial Service Centre, Nepal Food Corporation, and other corporations.

**c. Private or Non-government Publications**

*The private or non-governmental publications in Nepal are as follows:*

- Reports published by research institutes and universities,
- Reports published by the Nepal Chamber of Commerce and Industry,
- Reports of different companies and financial reports published by prestigious journals.

**d. International Publications**

*Publications of international agencies provide various statistical information, as follows:*

- Publications of the United Nations and related agencies, such as the UN Statistical Yearbook, Demographic Yearbook, World Development Report (WDR), Human Development Report (HDR), etc., and
- Publications of other international bodies such as ILO, IMF, ADB, etc.

**B) Unpublished Sources**

The unpublished data are also used for statistical investigations. The unpublished sources of data are as follows:

- Data collected by various government departments and offices, hospital patient records, and reports from private organizations, businesses, industries, etc.
- School, college, and university administration records, as well as students’ research projects, such as master’s and doctoral theses.

**Precautions in the use of secondary data**

Before using secondary data, the investigator should think carefully about whether or not it is useful. The following precautions must therefore be taken when using secondary data.

**Reliability of data:** The reliability of secondary data should be checked before it is used. The previously collected and processed data should be assessed with regard to the original survey’s objective, the method of data collection, the agency that conducted the survey, the representativeness of the sample size and sample design, the degree of accuracy desired, the data collection period, and so on.

**Suitability of data:** The available secondary data may or may not be useful for the current investigation. Suitability is judged by the nature, goal, and scope of the investigation. If the nature, goal, and scope of the available data are the same as those of the current investigation, the data are said to be suitable. If they are not, the data are unsuitable. For example, data collected for the study of urban household income is unsuitable for the study of rural household income.

**Adequacy of data:** Secondary data must be adequate, as well as reliable and suitable, for the current investigation. If the scope of the original inquiry is much smaller or larger than the scope of the current inquiry, the data are not adequate. For example, data on household income in Kathmandu Valley is insufficient to study the entire country. Another factor in determining adequacy is time. If secondary data on agricultural production is available only for the last four years, it will be insufficient to study the trend of agricultural production over the last ten years.

### Techniques of Data Collection

**a. Census Method:**

The census method is a way to collect information from every unit of the population being studied. For example, to study the annual income of a country’s households, we collect data on the annual income of every household. This kind of investigation is only useful when the study area is small, and each unit needs to be looked at in depth.

Unlike sampling, a census leaves no unit out. A census of a country’s population, for example, would entail gathering data from every single person living within its borders. Because it provides information on every member of the population, the census method is considered the most accurate and comprehensive method of gathering data about a population.

When the population is small, or the cost of collecting data from every individual is feasible, this method is typically used. A census may be legally required in some cases, such as for government purposes or social, economic, or health-related surveys. However, the census method can be costly and time-consuming, particularly in large populations. Collecting data from every member of a population necessitates a significant investment of time, money, and personnel.

In addition, some population members may be unwilling or unable to participate, resulting in incomplete or biased data. The sheer volume of data collected can also make accurate analysis and interpretation difficult. As a result, the census method may not always be feasible or practical for research purposes.

**b. Sample Method (Sampling):**

The ‘sample’ is a small subset of the population that contains a finite number of elements and can represent the population’s characteristics. The number of elements or objects in a sample determines the sample size. The method or process of data collection in which data is collected from a representative sample of the entire population is known as sampling. Because only a small portion of the population is studied, this method is less expensive than the census method.
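To make the contrast between a census and a sample concrete, here is a minimal Python sketch. The household incomes are synthetic values generated only for illustration, and the names `population`, `census_mean`, and `sample_mean` are assumptions, not figures from any real survey.

```python
import random
import statistics

random.seed(42)  # fixed seed so the illustration is repeatable

# Hypothetical population: annual incomes of 10,000 households (synthetic data).
population = [random.gauss(500_000, 120_000) for _ in range(10_000)]

# Census: every unit of the population is measured.
census_mean = statistics.mean(population)

# Sampling: only a randomly chosen subset of 400 households is measured.
sample = random.sample(population, 400)
sample_mean = statistics.mean(sample)

# Relative gap between the sample estimate and the census value.
relative_error = abs(sample_mean - census_mean) / census_mean
print(f"census mean:    {census_mean:.0f}")
print(f"sample mean:    {sample_mean:.0f}")
print(f"relative error: {relative_error:.2%}")
```

The sample measures only 4% of the households, which is where the saving in cost and time comes from, at the price of a small estimation error.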

### Methods of Sampling

**Simple Random Sampling:**

In simple random sampling, every member of the population has an equal chance of being selected for the sample. The selection is made randomly, without any bias or pattern.

**The formula for sample size:**

$$n = \frac{z^{2} \, p \, q}{e^{2}}$$

Where,

*n = sample size*

*z = z-score based on the desired level of confidence (e.g. 1.96 for 95% confidence)*

*p = estimated proportion of the population with the characteristic of interest*

*q = 1 – p (complement of p)*

*e = margin of error*
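As a rough illustration, the formula above can be evaluated and then used to draw a simple random sample. This is a sketch under stated assumptions: the function name `sample_size_for_proportion`, the choice p = 0.5, the 5% margin of error, and the list of 5,000 household IDs are all hypothetical.

```python
import math
import random


def sample_size_for_proportion(z: float, p: float, e: float) -> int:
    """Evaluate n = z^2 * p * q / e^2 and round up to a whole unit."""
    if not 0 < p < 1:
        raise ValueError("p must be strictly between 0 and 1")
    if e <= 0:
        raise ValueError("e must be positive")
    q = 1 - p  # complement of p
    return math.ceil((z ** 2) * p * q / (e ** 2))


# 95% confidence (z = 1.96), p = 0.5, margin of error 5%.
n = sample_size_for_proportion(z=1.96, p=0.5, e=0.05)
print(n)  # 385

# Hypothetical sampling frame: household IDs 1..5000.
households = list(range(1, 5001))
rng = random.Random(7)            # seeded only to make the example repeatable
chosen = rng.sample(households, n)  # every household is equally likely to be picked
print(len(chosen))  # 385
```

Using p = 0.5 is a common conservative choice when the true proportion is unknown, because it maximizes p × q and therefore the required sample size.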

**Merits:**

- Easy to implement
- Each member of the population has an equal chance of being selected
- Free from bias

**Demerits:**

- Requires a complete list of the population
- It may not be representative of the population if the sample size is too small

**Stratified Sampling:**

Stratified sampling involves dividing the population into subgroups or strata based on certain characteristics such as age, gender, or income. Samples are then randomly selected from each stratum in proportion to the stratum’s size.

**The formula for sample size:**

$$n = \frac{N \, p \, q \, \mathrm{DEFF}}{(N - 1) \, e^{2} + p \, q \, \mathrm{DEFF}}$$

Where,

*n = sample size*

*N = population size*

*p = proportion of the population in the stratum*

*q = 1 – p (complement of p)*

*DEFF = design effect (accounts for the effect of clustering)*

*e = margin of error*

**Merits:**

- Ensures each subgroup is well represented in the sample
- Allows for comparisons between subgroups

**Demerits:**

- Requires prior knowledge of the population to identify relevant subgroups
- It may be costly and time-consuming if there are many subgroups
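The proportional selection described for stratified sampling can be sketched as follows. This is a minimal illustration, not a definitive implementation: the function `proportional_stratified_sample`, the three income strata, and their sizes are hypothetical.

```python
import random


def proportional_stratified_sample(strata, total_n, seed=None):
    """Draw a random sample from each stratum in proportion to its size.

    strata: dict mapping stratum name -> list of units.
    Because of rounding, the allocations may not sum to exactly total_n.
    """
    rng = random.Random(seed)
    population_size = sum(len(units) for units in strata.values())
    sample = {}
    for name, units in strata.items():
        share = len(units) / population_size        # stratum's share of the population
        k = min(len(units), round(total_n * share))  # units to draw from this stratum
        sample[name] = rng.sample(units, k)
    return sample


# Hypothetical population of 1,000 households split into three income strata.
strata = {
    "low_income": [f"L{i}" for i in range(600)],
    "middle_income": [f"M{i}" for i in range(300)],
    "high_income": [f"H{i}" for i in range(100)],
}
sample = proportional_stratified_sample(strata, total_n=50, seed=1)
sizes = {name: len(units) for name, units in sample.items()}
print(sizes)  # {'low_income': 30, 'middle_income': 15, 'high_income': 5}
```

Each stratum contributes the same fraction of the sample as it does of the population, so no subgroup is left out by chance.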

**Cluster Sampling:**

Cluster sampling involves dividing the population into clusters or groups based on characteristics such as geography or administrative division. A number of clusters are then randomly selected, and all units (or a random sample of units) within the selected clusters are studied.

**The formula for sample size:**

$$n = \frac{N \, n_c \, \mathrm{DEFF}}{(N - 1) \, e^{2} + n_c \, \mathrm{DEFF}}$$

Where,

*n = sample size*

*N = population size*

*n_c = average cluster size*

*DEFF = design effect (accounts for the effect of clustering)*

*e = margin of error*

**Merits:**

- It can be more efficient than simple random sampling when the population is geographically dispersed
- Useful when there is a clear geographic or administrative structure to the population

**Demerits:**

- Requires accurate information on the clusters
- This may lead to a less diverse sample if the clusters are homogeneous
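In the usual one-stage form of cluster sampling, whole clusters are drawn at random and every unit inside a chosen cluster is surveyed. The sketch below illustrates that idea; the function `one_stage_cluster_sample` and the eight wards of 25 households each are hypothetical.

```python
import random


def one_stage_cluster_sample(clusters, n_clusters, seed=None):
    """Randomly pick whole clusters, then include every unit inside them."""
    rng = random.Random(seed)
    chosen = rng.sample(sorted(clusters), n_clusters)  # random draw of cluster names
    units = [unit for name in chosen for unit in clusters[name]]
    return chosen, units


# Hypothetical clusters: 8 wards, each containing 25 households.
clusters = {
    f"ward_{w}": [f"ward_{w}_house_{h}" for h in range(25)]
    for w in range(1, 9)
}
chosen, units = one_stage_cluster_sample(clusters, n_clusters=3, seed=3)
print(chosen)
print(len(units))  # 3 clusters x 25 households = 75
```

Only the three selected wards need to be visited, which is why this design can save travel time and cost when the population is geographically dispersed.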

**Systematic Sampling:**

Systematic sampling involves selecting population members from a list at regular intervals, for example every k-th member after a random starting point.

**The formula for sample size:**

$$n = \frac{N}{1 + N e^{2}}$$

Where,

*n = sample size*

*N = population size*

*e = margin of error*
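The formula above, followed by selection at a fixed interval, can be sketched as below. The helper names `sample_size_from_margin` and `systematic_sample`, the list of 1,000 student roll numbers, and the choice of interval k = N // n are assumptions made for illustration.

```python
import math
import random


def sample_size_from_margin(N: int, e: float) -> int:
    """Evaluate n = N / (1 + N * e^2) and round up to a whole unit."""
    return math.ceil(N / (1 + N * e ** 2))


def systematic_sample(frame, n, seed=None):
    """Pick a random start, then take every k-th unit, with k = len(frame) // n."""
    k = len(frame) // n
    if k < 1:
        raise ValueError("sample size cannot exceed the size of the list")
    start = random.Random(seed).randrange(k)  # random start within the first interval
    return frame[start::k][:n]                # keep exactly n units


# Hypothetical list of 1,000 student roll numbers.
frame = list(range(1, 1001))
n = sample_size_from_margin(N=len(frame), e=0.05)
sample = systematic_sample(frame, n, seed=5)
print(n, len(sample))  # 286 286
```

Because k is rounded down, units near the end of the list are never reached when N is not a multiple of n; this is a simplification of the sketch rather than a property of systematic sampling itself.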

**Merits:**

- Simple and easy to implement
- It can be more efficient than simple random sampling if the population is large and the list is available

**Demerits:**

- Requires a complete list of the population
- It can introduce bias if there is a pattern in the list related to the variable being studied
- May miss important subgroups if the list is not exhaustive

**Convenience Sampling:**

Convenience sampling is a non-probability sampling in which individuals are selected based on availability and accessibility. The sample is not randomly selected and may not represent the population. For example, a researcher might survey students in a particular classroom because they are easily accessible without considering whether they are representative of the broader student population.

Convenience sampling is quick, easy, and inexpensive, but it is prone to bias and may not provide valid or reliable results. As such, it is often used in exploratory research or pilot studies, where the focus is on generating ideas and hypotheses rather than testing them.

There is no formula to find the sample size of convenience sampling as the sample is not randomly selected or representative of the population.

**Merits:**

- Quick and easy to implement
- Useful when the population is difficult to access

**Demerits:**

- Prone to bias because the sample is not representative of the population
- Results may not be generalizable to the population

**Snowball Sampling:**

Snowball sampling is a non-probability sampling in which participants are recruited based on referrals from other participants. The sample grows through a chain of referrals, like a snowball rolling downhill. This sampling method is useful when the population is difficult to access or locate, such as members of a hidden or stigmatized community.

For example, a researcher might start by interviewing one member of a specific group and ask them to refer others who might be willing to participate. This sampling method is not representative of the population and may be biased toward those who are more connected or have stronger social networks. As such, it is often used when studying hard-to-reach subgroups or in qualitative research, where the focus is on exploring experiences and perceptions rather than generalizing results to the broader population.

There is no formula to find the sample size of snowball sampling as the sample size is not predetermined and can vary.
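The chain of referrals can be pictured as a walk through a network of contacts. The following is only a sketch: the function `snowball_sample`, the referral map, and the cap of five participants are hypothetical, and real recruitment depends on who actually agrees to take part.

```python
from collections import deque


def snowball_sample(referrals, seeds, max_size):
    """Grow a sample by following referrals outward from the seed participants."""
    recruited = []
    seen = set()
    queue = deque(seeds)
    while queue and len(recruited) < max_size:
        person = queue.popleft()
        if person in seen:
            continue  # already recruited through another referral
        seen.add(person)
        recruited.append(person)
        # Each recruited participant refers the people they know.
        queue.extend(referrals.get(person, []))
    return recruited


# Hypothetical referral map: who each participant can refer.
referrals = {
    "A": ["B", "C"],
    "B": ["D"],
    "C": ["D", "E"],
    "D": [],
    "E": ["F"],
    "F": [],
    "G": ["H"],  # G and H are never reached: nobody in the chain knows them
}
sample = snowball_sample(referrals, seeds=["A"], max_size=5)
print(sample)  # ['A', 'B', 'C', 'D', 'E']
```

Participants G and H never enter the sample because no one in the chain refers them, which illustrates why snowball samples tend to favor well-connected members of the population.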

**Merits:**

- Useful when the population is hard to reach or hidden
- Can be cost-effective

**Demerits:**

- Prone to bias because the sample is not representative of the population
- Results may not be generalizable to the population
- It can be time-consuming and may require a large sample size to obtain enough data

### Difference between the Census Method and Sampling Method of Data Collection

Here’s a table comparing the census method and sampling method:

| Basis of Difference | Census Method | Sampling Method |
|---|---|---|
| Definition | Collecting data from every member of the population. | Selecting a representative subset of the population for data collection. |
| Population | Every individual or unit that is part of the target population. | A subset of the target population. |
| Sample Size | Entire population. | Subset of the population. |
| Cost | High: requires resources, time, money, and personnel to collect data from every individual in the population. | Lower cost than the census method but may still require resources for selecting and contacting participants. |
| Representativeness | High: includes every member of the population, providing accurate and comprehensive data. | Depends on the sampling method used: some methods may not be representative or may be prone to bias. |
| Feasibility | Suitable for small populations or when the cost of data collection from every individual is feasible. | Suitable for larger populations when it is not feasible or practical to collect data from every individual. |
| Examples | Census of a country’s population. | Simple random sampling, stratified sampling, cluster sampling, etc. |

In summary, the main difference between the two methods is that the census method collects data from every member of the population, whereas the sampling method selects a representative subset of the population for data collection. The choice between these methods depends on factors such as the population size, the available resources, the representativeness required, and the research question.
