Analyzing control questions data for a survey

I have an experimental study with a list of demographic and related questions, and in order to identify participants who were potentially just answering the questions at random (to get through them more quickly, I would assume), I've included two very similar 7-point Likert-scale questions at different points in the survey. My assumption is that since the questions are reflective, participants' answers should be at least somewhat similar between the two questions (e.g., it should be very unlikely that a participant answers 7 to one question yet 1 to the other).

I haven't yet collected the data; however, I would like to have a method for determining which sets of data are suspicious (and might be considered for exclusion from analysis) based on these control questions. One method might be to simply determine where the data fall on a Gaussian distribution. However, I think the limited discriminating power of a 7-point scale would make this an improper test. My other idea was to run a cluster analysis on the data, looking for five groups: three along the line of correlation between the questions, and two to capture unusually high/low and low/high pairs. I thought this could provide better suggestions for which data sets might be unusual, since it wouldn't rely on somewhat arbitrary comparisons; it would only use the data given.

I'd really appreciate any suggestions for a better method, or improvements I could make as well as any comments toward more "standard" practices in this area, since I'm somewhat new to research.
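One simple way to operationalise the two-item consistency check described above is an absolute-difference flag. A minimal SPSS sketch, assuming the two control items are named ctrl1 and ctrl2 (hypothetical names) and are worded in the same direction:

  * Flag large disagreement between the two similar control items.
  COMPUTE ctrl_diff = ABS(ctrl1 - ctrl2).
  * A gap of 4 or more points on a 7-point scale is one possible cut-off.
  COMPUTE ctrl_flag = (ctrl_diff >= 4).
  EXECUTE.
  FREQUENCIES VARIABLES=ctrl_flag.

Flagged cases are candidates for closer inspection rather than automatic exclusion.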


You seem to be concerned with reliability, and more specifically internal reliability. Internal reliability is the degree to which different questions are measuring the same construct. This concept is used often in psychology and is usually measured using Cronbach's alpha. However, it is typically used to measure the reliability of a test, and not the reliability of an individual.

As Jeromy Anglim points out, it's important to consider the goal here. A two-question Likert-scale check is probably not enough to reliably detect careless responders: what if a respondent checked all '4s' on the 7-point scale? Reversing the scale would have no effect.

One alternative approach is to employ an instructional manipulation check (Oppenheimer et al., 2009). The gist of the technique is to trap participants into answering a question in a specific way that they could only have managed by reading the instructions carefully. One example comes from a survey administered by Facebook.

While this technique may throw out a few good participants, it will almost certainly raise the signal-to-noise ratio of your data by only including participants who followed instructions and read questions before answering.

Another tried-and-true technique is to use a computer-administered test and look at reaction times. You may be able to throw out a few responses (or whole participants) by simply looking for outliers at the fast end of the response-time distribution, i.e., responses too quick for the question to have been read.

Oppenheimer, D. M., Meyvis, T., & Davidenko, N. (2009). Instructional manipulation checks: Detecting satisficing to increase statistical power. Journal of Experimental Social Psychology, 45(4), 867-872.


Preventing random responding: An important first step is to think about ways to prevent random responding from occurring in the first place. A few ideas include: administer the survey face to face; have an experimental invigilator present; communicate the importance of the research to participants and the importance of participants taking the research seriously; use financial remuneration.

That said, there are situations where participants do not take a study seriously, responding randomly, for example. This seems to be a particular issue when collecting data online.

General approach: My overall approach to this is to develop multiple indicators of problematic participation. I'll then assign penalty points to each participant based on the severity of the indicators. Participants with penalty points above a threshold are excluded from analyses.

The choice of what counts as problematic depends on the type of study:

  • If a study is performed in a face to face setting, the experimenter can take notes recording when participants engage in problematic behaviour.
  • In online survey-style studies I record the reaction time for each item. I then see how many items are answered more quickly than the person could conceivably read and respond to the item. For example, answering a personality test item in less than about 600 or even 800 milliseconds indicates that the participant has effectively skipped the item. I then count up the number of times this occurs and set a cut-off (see the sketch after this list).
  • In performance based tasks, other participant actions may imply distraction or not taking the task seriously. I'll try to develop indicators for this.
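For the response-time indicator above, a minimal SPSS sketch (rt1 to rt50 are hypothetical variables holding per-item response times in milliseconds):

  * Count how many items were answered faster than an 800 ms cut-off.
  COUNT fastcount = rt1 TO rt50 (LOWEST THRU 800).
  * Convert the count into a penalty flag; the threshold of 10 items is illustrative.
  COMPUTE rt_penalty = (fastcount >= 10).
  EXECUTE.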

Mahalanobis distance is often a useful tool to flag multivariate outliers. You can further inspect the cases with the largest values to think about whether they make sense. There is a bit of an art in deciding which variables to include in the distance calculation. In particular, if you have a mix of positively and negatively worded items, carelessness is often indicated by a lack of movement between the poles of a scale as you move from positively to negatively worded items.
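One common SPSS idiom for obtaining the distances is to regress an arbitrary variable (such as a case ID) on the items of interest and save the Mahalanobis values; a sketch with hypothetical variable names:

  * /SAVE MAHAL creates a new variable MAH_1 holding each case's distance.
  REGRESSION
    /DEPENDENT id
    /METHOD=ENTER item1 TO item20
    /SAVE MAHAL.
  * Sort descending and inspect the cases with the largest distances.
  SORT CASES BY MAH_1 (D).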

In general, I also often include items at the end of the test asking the participant whether they took the experiment seriously.

Discussion in the Literature

Osborne and Blanchard (2010) discuss random responding in the context of multiple-choice tests. They mention the strategy of including items that all participants should answer correctly. To quote:

These can be content that should not be missed (e.g., 2+2=__), behavioral/attitudinal questions (e.g., I weave the fabric for all my clothes), nonsense items (e.g., there are 30 days in February), or targeted multiple-choice test items [e.g., “How do you spell 'forensics'?” (a) fornsis, (b) forensics, (c) phorensicks, (d) forensix].

References

  • Osborne, J. W., & Blanchard, M. R. (2010). Random responding from participants is a threat to the validity of social science research results. Frontiers in Psychology, 1. http://www.ncbi.nlm.nih.gov/pmc/articles/PMC3153825/

This is not directly an answer to your question but, in line with my comments to another answer, my main advice would be “don't worry about it”.

Jeromy Anglim's tips are all good but I am still unconvinced that this is an important issue for most people. Since you are new to research, there are probably dozens of other things you should worry about.

Furthermore, if you do see evidence of a problem (extremely short response times, contradictory answers, a large number of respondents providing absurd answers to open-ended questions), I would argue that you should first step back and ask yourself whether what you are asking is reasonable (Does the task make sense? Can people be expected to have an opinion about the topic you are investigating? Are you demanding too much effort?) rather than trying to sort out “bad” respondents.

If you really want to dig into the subject and look up some literature, another name for this phenomenon is “satisficing”. “Response set” is a related idea that might be of interest.


What are behavioral surveys?

As you have probably already guessed after reading some of the other sections in this chapter, asking people questions can be immensely valuable for gaining insight and information into various questions, puzzles, and problems that may exist in your community.

Another type of survey, the behavioral survey, asks people to respond to questions about certain actions or behaviors that affect their physical, emotional, or mental well-being. These behaviors might include cigarette use, unprotected sexual activity, or habits that might increase the chance for cardiovascular diseases.

Unlike the constituent surveys of goals, process, and outcomes, behavioral surveys do not try to determine what people think; rather, they focus on what people do. But one important distinction to make with behavioral surveys is this: these surveys will tell you what people say they do. Consequently, the surveys must be taken as self-reports. That is, your group should recognize that the results will be subjective accounts of individual actions. This doesn't diminish the value of behavioral surveys; rather, it simply must be taken into consideration when you analyze the data.


3 Answers

A binary logistic regression would work, for the same reason that you can use a regression when you want to compare two sample means.

However, it may be more apparatus than is required.

With a small sample, one might consider a binomial test. In this case, your sample sizes are nice and big, so a straight out proportions test should be effectively indistinguishable from it and a little simpler to deal with.

However, since you have three experimental runs, it might be worth including experimental run as a covariate (even though, if all is well, it should have no effect). In that case, you could use logistic regression to do the comparison incorporating the experimental run variable, or you could look at a chi-square test to achieve basically the same thing (though if you want a one-tailed test, the first option would be better).
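A hedged SPSS sketch of the logistic-regression option, with hypothetical variable names (success coded 0/1, condition for the comparison of interest, run for the three experimental runs):

  LOGISTIC REGRESSION VARIABLES success
    /METHOD=ENTER condition run
    /CONTRAST (run)=INDICATOR
    /PRINT=CI(95).

If the run coefficients are negligible, the simple proportions test and this model should tell the same story.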


Appendix

What is survey data collection?

Survey data collection uses surveys to gather information from specific respondents. Survey data collection can replace or supplement other data collection types, including interviews, focus groups and more. The data collected from surveys can be used to boost employee engagement, understand buyer behaviour and improve customer experiences.

What is longitudinal analysis?

Longitudinal data analysis (often called “trend analysis”) is basically tracking how findings for specific questions change over time. Once a benchmark is established, you can determine whether and how numbers shift. Let's suppose that the satisfaction rate for your conference was 50% three years ago, 55% two years ago, 65% last year and 75% this year. In this case, congratulations are in order because your longitudinal data analysis shows a solid, upward trend in satisfaction.

What is the difference between correlation and causation?

Causation is when one factor causes another, whereas correlation is when two variables move together but one does not influence or cause the other. For example, drinking hot chocolate and wearing a woolly hat are two variables that are correlated, in that they tend to go up and down together; however, one does not cause the other. In fact, they are both caused by a third factor: cold weather. Cold weather influences both hot chocolate consumption and the likelihood of wearing a woolly hat. Cold weather is the independent variable, and hot chocolate consumption and the likelihood of wearing a woolly hat are the dependent variables. In the case of our conference feedback survey, cold weather most probably influenced attendees' dissatisfaction with the conference city and the conference overall. Finally, to further examine the relationship between variables in your survey, you might need to perform a regression analysis.

What is regression analysis?

Regression analysis is an advanced method of data visualisation and analysis that allows you to look at the relationship between two or more variables. There are many types of regression analysis and the one(s) a survey scientist chooses will depend on the variables he or she is examining. What all types of regression analysis have in common is that they look at the influence of one or more independent variables on a dependent variable. In analysing our survey data, we might be interested in knowing what factors have the greatest impact on attendees’ satisfaction with the conference. Is it a matter of the number of sessions? The keynote speaker? The social events? The site? Using regression analysis, a survey scientist can determine whether and to what extent satisfaction with these different attributes of the conference contribute to overall satisfaction.

This, in turn, provides insight into which aspects of the conference you might want to alter next time around. Let's suppose, for example, that you paid a hefty fee to secure the services of a top-flight keynote speaker for your opening session. Participants gave this speaker and the conference overall high marks. Based on these two facts, you might think that securing the services of a fabulous (and expensive) keynote speaker is the key to conference success. Regression analysis can help you determine whether this is indeed the case. You might find that the popularity of the keynote speaker was a major driver of satisfaction with the conference. If so, next year you’ll want to secure the services of a great keynote speaker again. However, if for example, the regression shows that although everyone liked the speaker, this did not contribute much to attendees’ satisfaction with the conference, the large sum of money spent on the speaker might be better spent elsewhere. If you take the time to carefully analyse the soundness of your survey data, you’ll be on your way to using the answers to help you make informed decisions.
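To make the conference example concrete, here is a minimal SPSS sketch; all variable names are hypothetical ratings taken from the feedback survey:

  * Regress overall satisfaction on satisfaction with individual conference attributes.
  REGRESSION
    /STATISTICS COEFF R ANOVA
    /DEPENDENT overall_sat
    /METHOD=ENTER sessions keynote social_events site.

The coefficients indicate how much each attribute contributes to overall satisfaction, holding the others constant.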


Exporting data

There are several options for exporting data from Survey Monkey. My suggestion is to export the data in several different formats, as some formats will require less preparation for analysis than other formats. If you are able to export the data in SPSS format, then most of the work will be done for you (except variable names for your survey items). Also, you will almost always want to export the individual rows of data rather than the summary data. If you want summaries of your data, you can create them from the individual rows of data. However, you may not be able to do all of the analyses that you want from the summary data that you download from Survey Monkey, and there is no way to get back to the individual rows of data if you downloaded only the summary data.

Below we will show four ways that you could export the individual rows of data. No matter which of these four methods of data extraction you use, you will probably want to rename the variables. The SPSS command rename variables can be used in a few different ways to rename variables, and you can rename multiple variables in a single call to the command. Also, many of the variables come into SPSS as string variables (even if the variable contains numbers), so you will need to convert those to numeric variables to use them in analyses such as ANOVA and regression. Again, there are many ways to do this in SPSS. After you have renamed your variables and converted them into numeric form, you may want to recode or collapse some variables. We will show some examples of this. Also, if a variable has no responses, it will be a numeric variable. We will see an example of this (in question 7, for example).

Let’s see a few examples of some of the commands used most commonly to prepare the data for analysis. Please remember that all commands in SPSS MUST end in a period. Also, think about how you would prefer to organize your SPSS syntax file. One way to organize the file is to work with each variable in turn. For example, you might rename, make a numeric version of the variable, add a variable label and then a value label to the first variable, and then go through the same process for the second variable, and so on. Another way is to do each type of task for all of the variables at once. For example, you might recode all of the variables in one section, and then add value labels in another section. You may find a different way to organize your syntax file. How you organize the file doesn’t really matter, but having some form of organization does matter. This will help ensure that you don’t forget to do something. It will also make the file more understandable if you need to share it with someone else. You will also want to add comments to the syntax file. You can do this either by using the comment command or by starting the line with an asterisk. As with all other commands in SPSS, you need to end your comment with a period. In my experience, these types of files tend to get “inherited”, meaning that the person who wrote the syntax file is no longer working on the project and someone else needs to step in and take over. You should try to write a syntax file that you would want to inherit.

The get file command opens an existing SPSS data file. I would suggest that this be the first command in your syntax file: it is one way to ensure that you are running your syntax on the correct data file. Here is an example of the command:
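(A minimal sketch; the folder and file name are hypothetical:)

  GET FILE='C:\my survey\survey data.sav'.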

Although sometimes not technically necessary, I would suggest that you enclose the path specification and data file name in quotes. This is necessary if you have blanks in a folder name or the data file name.

The get data command is used to import data into SPSS. For example, you would use this command if you were trying to import data in an Excel file into SPSS.
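For example, a sketch for importing an Excel workbook (the path and sheet name are assumptions):

  GET DATA
    /TYPE=XLSX
    /FILE='C:\my survey\survey data.xlsx'
    /SHEET=NAME 'Sheet1'
    /READNAMES=ON.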

Once you have finished your data cleaning tasks, I suggest that you save your dataset with a new name. When you do your analyses, you can open and use this dataset rather than re-running all of the data cleaning commands that are necessary to transform your raw dataset into an analysis-ready dataset. Also, it is possible that your analysis-ready dataset may contain only the numeric versions of your variables. Here is an example of this command:
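(The path and file name here are hypothetical:)

  SAVE OUTFILE='C:\my survey\survey data analysis ready.sav'.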

The rename variables command

The rename variables command does exactly what you think it should do: it renames variables.

When Survey Monkey names variables, it often uses the first part of the question. Here are a few examples from my survey: WhatyearofschoolareyouinEnteranumber, DidyouattendanothercollegeuniversitybeforecomingtoUCLA, Pleasetelluswhereyouprefertostudyandwhereyouactually, and WhatkindofnoiseleveldoyoupreferwhenstudyingPleasera. Some of the variable names are more than 50 characters long. Now you can see why you would want to rename your variables! You have some choices as to how you write your rename variables syntax, as well as how you organize your syntax file. One possibility is that you rename each of your variables one at a time. If you do this, you may want to do any other operations regarding that variable next, so that your syntax file is organized by variables. Alternatively, you could rename all of your variables in a single call to the rename variables command.
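For example, using two of the exported names shown above (the new, shorter names are of course up to you):

  RENAME VARIABLES
    (WhatyearofschoolareyouinEnteranumber = year)
    (DidyouattendanothercollegeuniversitybeforecomingtoUCLA = othercollege).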

The compute command is one of the commands that can be used to create a new variable. When cleaning the data, you can use the compute command to create a numeric version of a string variable, or you can use it to create a collapsed version of a multi-category variable. Here are some examples:
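(The variable names in these sketches are hypothetical:)

  * Create a numeric version of a string variable; F2.0 describes the expected format.
  COMPUTE year_n = NUMBER(year, F2.0).
  * Collapse a multi-category variable into a binary one (1 = year 5 or above, else 0).
  COMPUTE grad = (year_n >= 5).
  EXECUTE.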

The if command is another command that you can use to create a new variable. You can also use the if command to recode the value of one variable based on the value of another variable. This command can also be used to create a numeric version of a string variable.
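(Hypothetical sketches of both uses:)

  * Recode the value of one variable based on the value of another.
  IF (year_n >= 5) status = 2.
  IF (year_n < 5) status = 1.
  * Create a numeric version of a string variable, one value at a time.
  IF (noise = 'Quiet') noise_n = 1.
  IF (noise = 'Some background noise') noise_n = 2.
  EXECUTE.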

The recode command is used, of course, to recode variables. It can also be used to create a new variable (with the into keyword) and convert certain string variables into numeric variables (with the convert keyword).
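(Hypothetical sketches of all three uses:)

  * Recode values in place.
  RECODE year_n (7=6).
  * Collapse categories into a new variable with the into keyword.
  RECODE year_n (1,2=1) (3,4=2) (5 THRU 6=3) INTO year3.
  * Turn a string of digits into a numeric variable with the convert keyword.
  RECODE idstring (CONVERT) INTO id.
  EXECUTE.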

The autorecode command does exactly what its name suggests: it automatically recodes variables. Sometimes this command is very useful, and other times it produces undesired (or unexpected) results. Care should be taken when one or more of the variables has missing values.
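(A sketch with a hypothetical variable; /PRINT shows the mapping from original values to recoded values so that you can check it:)

  AUTORECODE VARIABLES=major /INTO major_n /PRINT.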

The crosstabs command

The crosstabs command is very useful for ensuring that the recoding of a variable went as planned. You can have one or more tables subcommands in your call to the crosstabs command. I strongly recommend that all recoded variables be checked against the original variable to be certain that the recode worked as intended. I understand that this can quickly become tedious, but making a mistake when recoding variables can cause lots of problems when you use that recoded variable in analyses, and at that point, the error may be quite difficult to uncover. Also, if you eliminate the original version of the variable from your dataset before you checked the recode, you may have considerable trouble finding the error.
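For example, checking the hypothetical recode of year_n into year3 shown earlier:

  CROSSTABS /TABLES=year_n BY year3.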

The alter type command has at least three purposes. It can change some string variables into numeric variables, it can alter the length of string variables, and it can change the format of numeric variables.
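(Hypothetical sketches of all three uses:)

  * Change a string variable containing only digits into a numeric variable.
  ALTER TYPE year (F2.0).
  * Lengthen a string variable to 100 characters.
  ALTER TYPE comments (A100).
  * Change the display format of a numeric variable.
  ALTER TYPE income (DOLLAR10.2).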

The value labels command associates descriptive text with the values of categorical variables. This is very useful, because it reminds you what the values of the variable mean. The descriptive text is also shown in output involving the variable, which often makes the output easier to interpret.
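(A sketch using the hypothetical year3 variable from the recode example:)

  VALUE LABELS year3
    1 'Years 1-2'
    2 'Years 3-4'
    3 'Year 5 or higher'.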

The variable labels command

The variable labels command allows you to associate descriptive text with a variable. For example, you may make the actual question from your questionnaire the variable label.
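(For example, attaching the original questionnaire wording to the renamed year variable from earlier:)

  VARIABLE LABELS year 'What year of school are you in? Enter a number.'.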

The delete variables command

The delete variables command does exactly what it says it does: it deletes variables from your dataset. You may find that some of the variables in your dataset have nothing but missing values; in other words, they are useless. You can use the delete variables command to remove these variables from the dataset. In general, we do not suggest that researchers remove string variables from the dataset once a numeric version of them has been created. Rather, we suggest that after all of the data cleaning has been done, that dataset gets saved. Then you can make a copy of that dataset and remove unneeded string variables from it. This will result in a dataset that is cleaned and rid of unnecessary variables; in other words, a dataset that is ready for analysis.
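(A sketch; the variable names are hypothetical:)

  DELETE VARIABLES q7 q7text.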

The document command can be used to associate text with a dataset. You can use the add document command to make additional notes as needed. These commands are very useful for keeping important information with the dataset, as opposed to writing the important information in a notebook that may get separated from the dataset.
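(A sketch; the note text is, of course, whatever you need to record:)

  DOCUMENT Data exported from SurveyMonkey and cleaned for analysis.
  ADD DOCUMENT 'Recoded year into year3. String originals retained until the final save.'.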

Below are the SPSS syntax files that I wrote to prepare the data for analysis. Although these files have an .sps extension, they are simply text files. This means that you do not need SPSS to open them; you can open them with any text editor, such as NotePad or WordPad. If you look at each of these files, you will notice that when the data were extracted using “SPSS format”, the least amount of data cleaning was necessary. The other extraction methods required more steps to prepare the data for analysis. Furthermore, there are many differences between the extraction methods in terms of what needed to be done to prepare the data for analysis.

If you download the data using this method, most of the variables will be string variables and the values will be words associated with the options. You can access the SPSS syntax used to clean the data for the example questionnaire used in this workshop here. The original comma-separated values file is here.

If you download the data using this method, most of the variables will be string variables and the values will be words associated with the options. You can access the SPSS syntax used to clean the data for the example questionnaire used in this workshop here. The original comma-separated values file is here.

If you download the data using this method, most of the variables will be numeric variables and the values will be numbers associated with the options. You will not know from the information in the dataset which value labels should be associated with each numeric value, but you can probably get that information from the questionnaire. You can access the SPSS syntax used to clean the data for the example questionnaire used in this workshop here. The original comma-separated values file is here.

If you download the data using this method, most of the variables will be numeric variables and the value labels will be correctly associated with the numeric values. Also, most of the variables will have variable labels. The amount of data cleaning needed should be much less than with the other possible methods of downloading the data. You can access the SPSS syntax used to clean the data for the example questionnaire used in this workshop here. The original SPSS data file is here.

Improvements that could (should!) be made to this questionnaire

Notice that there is a problem with the question about the year in school. It is a good thing that we caught this error in our pilot testing. Also, question 9 about when people study is difficult to analyze because we specified it as “choose all that apply”. While in theory we may want to know all of the times that people study, we need to consider how we are going to analyze that variable if we allow respondents to choose multiple responses. The point is that piloting your questionnaire is important for working out problems with items, but it also allows you to discover potential difficulties with analyses. Remember that you want to know how you are going to analyze data before you collect them.


SDA Features

Documentation:

  • Codebooks: SDA can produce both HTML and print-format codebooks. The documentation for each study contains a full description of each variable, indexes to the variables, and links to study-level information.

Analysis:

  • Various analysis types are available: frequencies and crosstabulation, comparison of means, correlation matrix, comparison of correlations, multiple regression, logit/probit regression.

Other Capabilities:

  • Subsetting: Users can generate and download a customized subset of an SDA dataset. In addition to generating a data file, the subset procedure produces a codebook for the subset and data definitions for SAS, SPSS, Stata and DDI. The subset can include both the original dataset variables and new variables created with recode or compute.

SDA Manager:

  • Import SPSS .sav files, Stata .dta files, CSV files and TSV files and automatically convert them into SDA datasets.
  • Create and configure personal user workspaces. These user workspaces enable analysts to create and store recoded and computed variables in their own private storage areas. For groups -- such as college classes -- the leader or instructor can make their created variables accessible to the group to use in their own analysis projects.
  • Configure dataset-level access control -- specifying which users can access which datasets.
  • Generate reports on usage of the datasets in the archive.
  • Troubleshoot problems.

E. Data Collection

I. Sampling Method

In line with the research objectives and the issues to be investigated, it would have been most suitable if all recruitment employees in the organization were interviewed. However, because of the time and resource limitations inherent in this study, a non-probability sample of participants was selected. Saunders et al. (2007) assert that non-probability sampling is most frequently used when adopting a case study strategy. A non-probability sample, as explained by Oppenheim (2000), is a sample in which the probability of each case being selected from the population is not known.

The samples of graduates who were selected to take part in the quantitative study cannot constitute a probability sample of graduates within London or the United Kingdom. Also, the number of employees within Lloyds who took part in the qualitative study was not sufficient to constitute a significant proportion of the recruitment department within Lloyds TSB. The study therefore focused more on the quantitative details of perceptions of recruitment within the organization, rather than on theories expressed in the literature review, and on what graduates outside the organization thought of online recruitment.

II. Primary Data Collection

In collecting data that can be analysed using quantitative means, Easterby-Smith et al. (2008) state that researchers can collect either primary or secondary data. They further state that, though each of these approaches has its merits and demerits, collecting one's own data gives control over the structure of the sample and the data obtained from each respondent. It also gives greater confidence that the data collected will match the study objectives.

The researcher therefore decided to collect primary data from 20 graduates using questionnaires distributed in person to each respondent. This was done among friends and colleagues within the university who have used online recruitment systems. Data from the semi-structured interviews would be collected using a tape recorder, and the conversations with all four employees would be transcribed sentence by sentence and expression by expression. The benefit inherent in this approach is that it enables the researcher to document and find patterns in words and feelings that would not be available if other types of interviews were conducted.


Discussion

This paper was based on a narrative review of systematic and non-systematic searches of the literature on the effects of mode of questionnaire administration on data quality. The review showed that, while some studies were inconsistent or inconclusive, different modes of questionnaire administration are likely to affect the quality of the data collected. The effects appeared to be more marked between interview and self-administration modes than within modes. It was often difficult to isolate the effects of the method of contact from the other differences between the data collection methods, and this limits knowledge about how the mode of administration alters the process of answering questions [4]. A main problem with the literature elicited is that most of the studies did not use experimental or randomization methods to allocate the different questionnaire modes to participants. Thus, differences detected in responses between different modes could be due to differences between settings, or to genuine differences between respondents.

Explanatory models that have been proposed for the effects of data collection mode on data quality include the impersonality of the method of contacting respondents and in delivering and administering the questionnaire (highest in self-administration methods), the cognitive burden imposed on respondents by the method (greatest in self-administration methods), the legitimacy of the study (it is more difficult to establish the credentials of some surveys in telephone contacts), the control over the questionnaire (interviewers have the highest level of control over the order and completion of the questions), the rapport between respondent and interviewer (lowest in self-administration settings where there is no visual contact), and communication style (an interviewer can be motivating and clarify questions, but can lead to interviewer and social desirability bias) [4, 8] (see Box 5). These models need to be fully tested in experimental designs.

Explanations for effects of data collection mode on data quality.

1. The impersonality of the method: while an interviewer can enhance motivation to respond as well as response accuracy, self-administration methods increase perceived impersonality and may encourage reporting of some sensitive information (e.g. in interview situations there may be fear of embarrassment with the exposure of weakness, failure or deviancy in the presence of a stranger).
2. The cognitive burden imposed by the method: different methods make different demands on respondents, including reading, listening, following instructions, recognising numbers and keying in responses. Face-to-face interviews make the least demands, while the lack of visual support in telephone interviews may make the task more complex.
3. The legitimacy of the study: this may be more difficult to establish with some methods than others. In contrast to paper or electronic communications, telephone contacts limit the possibilities for establishing the survey’s credentials. This might affect initial response and the importance respondents place on the study, and their motivation to answer questions accurately.
4. The control over the questionnaire varies: interviewers have the highest level of control over question order; in self-administered paper questionnaire modes there is little control over question order.
5. Rapport: rapport between respondent and interviewer may be more difficult to establish in self-administration and telephone interview than in face-to-face modes, as there is no visual contact. This can adversely affect motivation to respond, although social desirability bias may be reduced as there is less need for approval.
6. Communication style: more information may be obtained in interview than in other situations, as interviewers can motivate respondents, pause to encourage (more, longer) responses, and clarify questions; interviewers can also lead to interviewer and social desirability bias.

This topic has important implications for research methodology, the validity of the results of research, the soundness of evidence-based public policy, and for clinicians who wish to screen their patients using questionnaires [51]. All users of questionnaires need to be aware of the potential effects of mode of administration on their data. The validity of the common research practice of comparing data from dual modes of administration within studies is also called into question. While calls have been made for greater attention to questionnaire development in epidemiology [77], there has been less focus on the wide range of different biases, at different levels, stemming from the various modes of administering questionnaires.


Data files and exercises

Throughout the SPSS Survival Manual you will see examples of research that is taken from a number of different data files, survey.zip, error.zip, experim.zip, depress.zip, sleep.zip and staffsurvey.zip. To use these files, which are available here, you will need to download them to your hard drive or memory stick. Once downloaded you'll need to unzip the files. To do this, right click on the downloaded zip file and select 'extract all' from the menu. You can then open them within SPSS.

(To do this, start SPSS, click on the Open an existing data source button from the opening screen and then on More Files. This will allow you to search through the various directories on your computer to find where you have stored your data files. Find the file you wish to use and click Open.)

Survey.sav

This is a real data file, condensed from a study that was conducted by my Graduate Diploma in Educational Psychology students. The study was designed to explore the factors that impact on respondents' psychological adjustment and wellbeing. The survey contained a variety of validated scales measuring constructs that the extensive literature on stress and coping suggest influence people's experience of stress. The scales measured self-esteem, optimism, perceptions of control, perceived stress, positive and negative affect, and life satisfaction. A scale was also included that measured people's tendency to present themselves in a favourable or socially desirable manner. The survey was distributed to members of the general public in Melbourne, Australia and surrounding districts. The final sample size was 439, consisting of 42 per cent males and 58 per cent females, with ages ranging from 18 to 82 (mean=37.4).

Error.sav

The data in this file has been modified from the survey.zip file to incorporate some deliberate errors to be identified using the procedures covered in Chapter 5. For information on the variables etc. see details on survey.zip.

Experim.sav

This is a manufactured data set that was created to provide suitable data for the demonstration of statistical techniques such as t-test for repeated measures, and one-way ANOVA for repeated measures. This data set refers to a fictitious study that involves testing the impact of two different types of interventions in helping students cope with their anxiety concerning a forthcoming statistics course. Students were divided into two equal groups and asked to complete a number of scales (Time 1). These included a Fear of Statistics test, Confidence in Coping with Statistics scale and Depression scale. One group (Group 1) was given a number of sessions designed to improve mathematical skills, the second group (Group 2) was subjected to a program designed to build confidence in the ability to cope with statistics. After the program (Time 2) they were again asked to complete the same scales that they completed before the program. They were also followed up three months later (Time 3). Their performance on a statistics exam was also measured.

Manipulate.sav

This file contains data extracted from hospital records which allows you to try using some of the SPSS data manipulation procedures covered in Chapter 8 Manipulating the data. This includes converting text data (Male, Female) to numbers (1, 2) that can be used in statistical analyses and manipulating dates to create new variables (e.g. length of time between two dates).

Depress.sav

This file has been included to allow the demonstration of some specific techniques in Chapter 16. It includes just a few of the key variables from a real study conducted by one of my postgraduate students on the factors impacting on wellbeing in first time mothers. It includes scores from a number of different psychological scales designed to assess depression (details in Chapter 16 on Kappa Measure of Agreement).

Sleep.sav

This is a real data file condensed from a study conducted to explore the prevalence and impact of sleep problems on various aspects of people's lives. Staff from a university in Melbourne, Australia were invited to complete a questionnaire containing questions about their sleep behaviour (e.g. hours slept per night), sleep problems (e.g. difficulty getting to sleep) and the impact that these problems have on aspects of their lives (work, driving, relationships). The sample consisted of 271 respondents (55% female, 45% male) ranging in age from 18 to 84 years (mean=44yrs).

Staffsurvey.sav

This is a real data file condensed from a study conducted to assess the satisfaction levels of staff from an educational institution with branches in a number of locations across Australia. Staff were asked to complete a short, anonymous questionnaire (shown later in this Appendix) containing questions about their opinion of various aspects of the organisation and the treatment they have received as employees.


You’ll often find the most useful insights by analyzing your open-ended survey questions (such as free-text responses to your Net Promoter Score survey questions). But, what if you’re faced with hundreds or thousands of answers? It takes too long and is a huge mental tax!

The answer: coding open-ended questions.

Summarizing open-ended survey questions? Start here.

Manual, or automated coding?

There’s ample debate about whether to go for manual or automated coding.

You can do automated coding with the help of text analytics software (such as Thematic), which is a lot simpler. But if you decide to go for manual coding, you'll want to learn best practices from the people who have been dealing with text for decades: qualitative researchers.

For this post, I’ve dived into how manual coding works.

What is coding and why does it matter?

When you hear a term like ‘big data’ it almost always refers to quantitative data: numbers or categories. Statistical and machine learning techniques love numbers. Free text is an example of qualitative data. Dealing with it is difficult, but it's crucial for finding the customer insights you're after.

By nature, qualitative researchers believe that numbers won’t get you very far. They believe that by interviewing (or surveying) your customers and asking them to answer open-ended questions, you can gain much deeper learnings.

The value in Net Promoter Score surveys

Let's take, for example, Net Promoter Score (NPS) surveys. The score, calculated from numeric answers to the question ‘How likely, on a scale from 0 to 10, are you to recommend us to a friend or family member?’, gives a single measure of a company's performance.

Let’s dig a bit deeper. It’s actually the open-ended answers to the question ‘Why did you give us that score?’ that will teach you how to improve that measure in the future.
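As an aside, the score itself is simple arithmetic: the percentage of promoters (those answering 9 or 10) minus the percentage of detractors (those answering 0 through 6). A hedged SPSS sketch with a hypothetical variable named likely:

  * Promoters score +1, passives 0, detractors -1; the mean of this times 100 is the NPS.
  RECODE likely (9,10=1) (7,8=0) (0 THRU 6=-1) INTO npsgroup.
  COMPUTE npspoints = npsgroup * 100.
  DESCRIPTIVES VARIABLES=npspoints /STATISTICS=MEAN.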

As you know, qualitative research produces a lot of text.

Survey questions where respondents are free to write whatever they like are also called open-ended questions. A response is known as a verbatim.

Researchers use coding to draw conclusions from this data with the objective of making data-driven decisions. ‘Coding’ or ‘tagging’ each response with one or more codes helps capture what the response is about, and in turn, summarise the results of the entire survey effectively. If we compare coding to Natural Language Processing (NLP) methods for analyzing text, in some cases coding can be similar to text categorization and in other ways to keyword extraction.

Now, let’s look at coding and the different methodologies in more detail.

Coding frames

We often refer to how to perform the task manually, but if you are looking at using an automated solution, this knowledge will help you understand what matters and how to choose an effective approach.

What’s a coding frame?

When creating codes, they’re put into what we call a coding frame. The coding frame is important because it represents the organizational structure and influences how useful the coded results will be. There are two types of frames: ‘flat’ and ‘hierarchical’:

  • A flat frame means that all codes have the same level of specificity and importance. That's easy to understand, but if the frame gets large, organizing and navigating it becomes difficult.
  • Hierarchical frames capture a taxonomy of how the codes relate to one another. They allow you to apply a different level of granularity during the coding and the analysis of the results.

One interesting application of a hierarchical frame is to support differences in sentiment. If the top-level code describes what the response is about, a mid-level one can specify if it’s positive or negative and a third level can specify the attribute or specific theme.

You can see an example of this type of coding frame below.

Example of a Coding Frame

Coding frames – pros and cons

Flat code frame:

  • Supports fewer codes
  • (+) Easier and faster to manually code with
  • (+) Easy to provide consistent coding
  • (-) Difficult to capture answers that aren't common, leading to a large ‘other’ category
  • (-) Doesn't differentiate between the importance and levels of specificity of themes

Hierarchical code frame:

  • Supports a larger code frame
  • (-) Requires navigating the code frame to find the right code
  • (-) Prone to a subjective opinion of how each answer is coded
  • (+) Can be organized on the basis of organizational structure
  • (+) Allows for different levels of granularity

Two critical things to consider when coding open-ended questions

A couple of critical things to consider when coding open-ended questions are the size and the coverage of the frame.

Coverage

Make sure to group responses with the same themes, disregarding wording, under the same code. For example, a code such as ‘cleanliness’ could cover responses mentioning words like ‘clean’, ‘tidy’, ‘dirty’, ‘dusty’ and phrases like ‘looked like a dump’, ‘could eat off the floor’. The coder needs a good understanding of each code and its coverage.

Having only a few codes and a fixed frame makes the decision easier. If you have many codes, particularly in a flat frame, this makes it harder as there can be ambiguity and sometimes it isn’t clear what exactly a response means. Manual coding also requires the coder to remember or be able to find all of the relevant codes, which is harder with a large coding frame.

Flexibility

Coding frames should be flexible. Coding a survey is a costly task, especially if done manually, and so the results should be usable in different contexts. Imagine this: you are trying to answer the question ‘what do people think about customer service’ and create codes capturing key answers. Then you find that the same survey responses also have many comments about your company's products.

If you need to answer “what do people say about our products?” you may find yourself having to code from scratch! Creating a coding frame that is flexible and has good coverage (see the Inductive Style below) is a good way to ensure value in the future.


E. Data Collection

I. Sampling Method

In line with the research objectives and also the issues to become investigated, it could have been most suitable if all recruitment employees inside the organization were interviewed. However, because of the time limitations and resource limitations natural within this study, a non-probability sample of people was selected. Saunders et al (2007) asserts that the non-probability sample is most frequently used when adopting a situation study strategy. A non-probability sample, as explained (Oppenheim, 2000), is really a sample where the possibility of each situation being selected in the people in this country isn’t known.

The examples of graduates which were selected to take part in the quantitative study are they canrrrt constitute a probability sample of graduates within London or United kingdom. Also, the amount of employees within Lloyds who required part within the qualitative study wasn’t sufficient to constitute a substantial area of the recruitment department within Lloyds TSB. And so the study focused more about the quantitative details from the thought of recruitment inside the organization, instead of theories expressed within the literature review, and just what graduates around the outdoors considered online recruitment.

Ii. Primary Data Collection

In collecting data that may be analysed using quantitative means, Easterby-Cruz et al (2008) claims that researchers could collect either primary or secondary data. He further claims that though all these means get their merits and demerits, the gathering of one’s own data gives control of the dwelling from the sample and also the data acquired from each respondent. Additionally, it gives greater confidence the data collected would match the study objectives.

The investigator therefore made a decision to collect primary data from 20 graduates using questionnaires distributed-in-person to every respondent. It was done among buddies and colleagues inside the college who’ve utilized online recruitment systems. Data in the semi-structured interviews could be collected utilizing a tape recorder, and also the conversations with all of four employees could be transcribed sentence after sentence, and expression for expression. The benefits natural within this approach is it enables the investigator to document and find out patterns in words and feelings that wouldn’t be available if other kinds of interviews were conducted.


Data files and exercises

Throughout the SPSS Survival Manual you will see examples of research that is taken from a number of different data files, survey.zip, error.zip, experim.zip, depress.zip, sleep.zip and staffsurvey.zip. To use these files, which are available here, you will need to download them to your hard drive or memory stick. Once downloaded you'll need to unzip the files. To do this, right click on the downloaded zip file and select 'extract all' from the menu. You can then open them within SPSS.

(To do this, start SPSS, click on the Open an existing data source button from the opening screen and then on More Files. This will allow you to search through the various directories on your computer to find where you have stored your data files. Find the file you wish to use and click Open.)

Survey.sav

This is a real data file, condensed from a study that was conducted by my Graduate Diploma in Educational Psychology students. The study was designed to explore the factors that impact on respondents' psychological adjustment and wellbeing. The survey contained a variety of validated scales measuring constructs that the extensive literature on stress and coping suggest influence people's experience of stress. The scales measured self-esteem, optimism, perceptions of control, perceived stress, positive and negative affect, and life satisfaction. A scale was also included that measured people's tendency to present themselves in a favourable or socially desirable manner. The survey was distributed to members of the general public in Melbourne, Australia and surrounding districts. The final sample size was 439, consisting of 42 per cent males and 58 per cent females, with ages ranging from 18 to 82 (mean=37.4).

Error.sav

The data in this file has been modified from the survey.zip file to incorporate some deliberate errors to be identified using the procedures covered in Chapter 5. For information on the variables etc. see details on survey.zip.

Experim.sav

This is a manufactured data set that was created to provide suitable data for the demonstration of statistical techniques such as t-test for repeated measures, and one-way ANOVA for repeated measures. This data set refers to a fictitious study that involves testing the impact of two different types of interventions in helping students cope with their anxiety concerning a forthcoming statistics course. Students were divided into two equal groups and asked to complete a number of scales (Time 1). These included a Fear of Statistics test, Confidence in Coping with Statistics scale and Depression scale. One group (Group 1) was given a number of sessions designed to improve mathematical skills, the second group (Group 2) was subjected to a program designed to build confidence in the ability to cope with statistics. After the program (Time 2) they were again asked to complete the same scales that they completed before the program. They were also followed up three months later (Time 3). Their performance on a statistics exam was also measured.

Manipulate.sav

This file contains data extracted from hospital records which allows you to try using some of the SPSS data manipulation procedures covered in Chapter 8 Manipulating the data. This includes converting text data (Male, Female) to numbers (1, 2) that can be used in statistical analyses and manipulating dates to create new variables (e.g. length of time between two dates).

Depress.sav

This file has been included to allow the demonstration of some specific techniques in Chapter 16. It includes just a few of the key variables from a real study conducted by one of my postgraduate students on the factors impacting on wellbeing in first time mothers. It includes scores from a number of different psychological scales designed to assess depression (details in Chapter 16 on Kappa Measure of Agreement).

Sleep.sav

This is a real data file condensed from a study conducted to explore the prevalence and impact of sleep problems on various aspects of people's lives. Staff from a university in Melbourne, Australia were invited to complete a questionnaire containing questions about their sleep behaviour (e.g. hours slept per night), sleep problems (e.g. difficulty getting to sleep) and the impact that these problems have on aspects of their lives (work, driving, relationships). The sample consisted of 271 respondents (55% female, 45% male) ranging in age from 18 to 84 years (mean = 44 years).

Staffsurvey.sav

This is a real data file condensed from a study conducted to assess the satisfaction levels of staff from an educational institution with branches in a number of locations across Australia. Staff were asked to complete a short, anonymous questionnaire (shown later in this Appendix) containing questions about their opinion of various aspects of the organisation and the treatment they have received as employees.


Discussion

This paper was based on a narrative review of systematic and non-systematic searches of the literature on the effects of mode of questionnaire administration on data quality. The review showed that, while some studies were inconsistent or inconclusive, different modes of questionnaire administration are likely to affect the quality of the data collected. The effects appeared to be more marked between interview and self-administration modes than within modes. It was often difficult to isolate the effects of the method of contact from the other differences between the data collection methods, and this limits knowledge about how the mode of administration alters the process of answering questions.4 A main problem with the literature identified is that most of the studies did not use experimental or randomization methods to allocate the different questionnaire modes to participants. Thus, differences detected in responses between different modes could be due to differences between settings, or to genuine differences between respondents.

Explanatory models that have been proposed for the effects of data collection mode on data quality include the impersonality of the method of contacting respondents and in delivering and administering the questionnaire (highest in self-administration methods), the cognitive burden imposed on respondents by the method (greatest in self-administration methods), the legitimacy of the study (it is more difficult to establish the credentials of some surveys in telephone contacts), the control over the questionnaire (interviewers have the highest level of control over the order and completion of the questions), the rapport between respondent and interviewer (lowest in self-administration settings where there is no visual contact), and communication style (an interviewer can be motivating and clarify questions, but can lead to interviewer and social desirability bias)4,8 (see Box 5). These models need to be fully tested in experimental designs.

Explanations for effects of data collection mode on data quality.

1. The impersonality of the method: while an interviewer can enhance motivation to respond as well as response accuracy, self-administration methods increase perceived impersonality and may encourage reporting of some sensitive information (e.g. in interview situations there may be fear of embarrassment with the exposure of weakness, failure or deviancy in the presence of a stranger).
2. The cognitive burden imposed by the method: different methods make different demands on respondents, including reading, listening, following instructions, recognising numbers and keying in responses. Face-to-face interviews make the least demands, while the lack of visual support in telephone interviews may make the task more complex.
3. The legitimacy of the study: this may be more difficult to establish with some methods than others. In contrast to paper or electronic communications, telephone contacts limit the possibilities for establishing the survey’s credentials. This might affect initial response and the importance respondents place on the study, and their motivation to answer questions accurately.
4. The control over the questionnaire varies: interviewers have the highest level of control over question order, whereas in self-administered paper questionnaire modes there is little control over question order.
5. Rapport: rapport between respondent and interviewer may be more difficult to establish in self-administration and telephone interview than in face-to-face modes, as there is no visual contact. This can adversely affect motivation to respond, although social desirability bias may be reduced as there is less need for approval.
6. Communication style: more information may be obtained in interview than in other situations, as interviewers can motivate respondents, pause to encourage (more, longer) responses, and clarify questions; however, interviewers can also introduce interviewer and social desirability bias.

This topic has important implications for research methodology, the validity of the results of research, the soundness of evidence-based public policy, and for clinicians who wish to screen their patients using questionnaires.51 All users of questionnaires need to be aware of the potential effects of mode of administration on their data. The validity of the common research practice of comparing data from dual modes of administration within studies is also called into question. While calls have been made for greater attention to questionnaire development in epidemiology,77 there has been less focus on the wide range of different biases, at different levels, stemming from the various modes of administering questionnaires.


SDA Features

Documentation:

  • Codebooks: SDA can produce both HTML and print-format codebooks. The documentation for each study contains a full description of each variable, indexes to the variables, and links to study-level information.

Analysis:

  • Various analysis types are available: frequencies and crosstabulation, comparison of means, correlation matrix, comparison of correlations, multiple regression, logit/probit regression.

Other Capabilities:

  • Subsetting: Users can generate and download a customized subset of an SDA dataset. In addition to generating a data file, the subset procedure produces a codebook for the subset and data definitions for SAS, SPSS, Stata and DDI. The subset can include both the original dataset variables and new variables created with recode or compute.

SDA Manager:

  • Import SPSS .sav files, Stata .dta files, CSV files and TSV files and automatically convert them into SDA datasets.
  • Create and configure personal user workspaces. These user workspaces enable analysts to create and store recoded and computed variables in their own private storage areas. For groups -- such as college classes -- the leader or instructor can make their created variables accessible to the group to use in their own analysis projects.
  • Configure dataset-level access control -- specifying which users can access which datasets.
  • Generate reports on usage of the datasets in the archive.
  • Troubleshoot problems.

What are behavioral surveys?

As you have probably already guessed after reading some of the other sections in this chapter, asking people questions can be immensely valuable for gaining insight into the various questions, puzzles, and problems that may exist in your community.

Another type of survey, the behavioral survey, asks people to respond to questions about certain actions or behaviors that affect their physical, emotional, or mental well-being. These behaviors might include cigarette use, unprotected sexual activity, or habits that might increase the chance for cardiovascular diseases.

Unlike the constituent surveys of goals, process, and outcomes, behavioral surveys do not try to determine what people think; rather, they focus on what people do. But one important distinction to make with behavioral surveys is this: these surveys will tell you what people say they do. Consequently, the surveys must be taken as self-reports. That is, your group should recognize that the results will be subjective accounts of individual actions. This doesn't diminish the value of behavioral surveys; rather, it simply must be taken into consideration when you analyze the data.


You’ll often find the most useful insights by analyzing your open-ended survey questions (such as free-text responses to your Net Promoter Score survey questions). But, what if you’re faced with hundreds or thousands of answers? It takes too long and is a huge mental tax!

The answer: coding open-ended questions.

Summarizing open-ended survey questions? Start here.

Manual, or automated coding?

There’s ample debate about whether to go for manual or automated coding.

You can do automated coding with the help of text analytics software (such as Thematic), which is a lot simpler. But if you decide to go for manual coding, you’ll want to learn best practices from the people who have been dealing with text for decades: qualitative researchers.

For this post, I’ve dived into how manual coding works.

What is coding and why does it matter?

When you hear a term like ‘big data’ it almost always refers to quantitative data: numbers or categories. Statistical and machine learning techniques love numbers. Free text is an example of qualitative data. Dealing with it is difficult, but it’s crucial to find the customer insights you’re after.

By nature, qualitative researchers believe that numbers won’t get you very far. They believe that by interviewing (or surveying) your customers and asking them to answer open-ended questions, you can gain much deeper learnings.

The value in Net Promoter Score surveys

Let’s take, for example, Net Promoter Score (NPS) surveys. The score, calculated from numeric answers to the question ‘How likely, on a scale from 0 to 10, are you to recommend us to a friend or family member?’, results in a single measure of a company’s performance. (The standard calculation counts answers of 9 or 10 as promoters and 0 to 6 as detractors; the score is the percentage of promoters minus the percentage of detractors.)

Let’s dig a bit deeper. It’s actually the open-ended answers to the question ‘Why did you give us that score?’ that will teach you how to improve that measure in the future.

As you know, qualitative research produces a lot of text.

Survey questions where respondents are free to write whatever they like are also called open-ended questions. A response is known as a verbatim.

Researchers use coding to draw conclusions from this data with the objective of making data-driven decisions. ‘Coding’ or ‘tagging’ each response with one or more codes helps capture what the response is about, and in turn, summarise the results of the entire survey effectively. If we compare coding to Natural Language Processing (NLP) methods for analyzing text, in some cases coding can be similar to text categorization and in other ways to keyword extraction.

Now, let’s look at coding and the different methodologies in more detail.

Coding frames

We often refer to how to perform the task manually, but if you are looking at using an automated solution, this knowledge will help you understand what matters and how to choose an effective approach.

What’s a coding frame?

The codes you create are put into what we call a coding frame. The coding frame is important because it represents the organizational structure and influences how useful the coded results will be. There are two types of frames: ‘flat’ and ‘hierarchical’:

  • A flat frame means that all codes are of the same level of specificity and importance. That’s easy to understand, but if the frame gets large, organizing and navigating it becomes difficult.
  • Hierarchical frames capture a taxonomy of how the codes relate to one another. They allow you to apply a different level of granularity during the coding and the analysis of the results.

One interesting application of a hierarchical frame is to support differences in sentiment. If the top-level code describes what the response is about, a mid-level one can specify if it’s positive or negative and a third level can specify the attribute or specific theme.

You can see an example of this type of coding frame below.

Example of a Coding Frame
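
As a rough illustration of the about/sentiment/theme structure described above (the codes here are hypothetical, not taken from the original figure):

  Customer service
    Negative
      Long wait times
      Unhelpful responses
    Positive
      Quick resolution
      Friendly staff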

Coding frames – pros and cons

Flat code frame (supports fewer codes):

  • (+) Easier and faster to manually code with
  • (+) Easy to provide consistent coding
  • (-) Difficult to capture answers that aren’t common, leading to a large ‘other’ category
  • (-) Doesn’t differentiate between the importance and levels of specificity of themes

Hierarchical code frame (supports a larger code frame):

  • (-) Requires navigating the code frame to find the right one
  • (-) Prone to a subjective opinion of how each answer is coded
  • (+) Can organize codes on the basis of organizational structure
  • (+) Allows for different levels of granularity

Two critical things to consider when coding open-ended questions

A couple of critical things to consider when coding open-ended questions are the size and the coverage of the frame.

Coverage

Make sure to group responses with the same themes, disregarding wording, under the same code. For example, a code such as ‘cleanliness’ could cover responses mentioning words like ‘clean’, ‘tidy’, ‘dirty’, ‘dusty’ and phrases like ‘looked like a dump’, ‘could eat off the floor’. The coder needs a good understanding of each code and its coverage.

Having only a few codes and a fixed frame makes the decision easier. If you have many codes, particularly in a flat frame, this makes it harder as there can be ambiguity and sometimes it isn’t clear what exactly a response means. Manual coding also requires the coder to remember or be able to find all of the relevant codes, which is harder with a large coding frame.

Flexibility

Coding frames should be flexible. Coding a survey is a costly task, especially if done manually, and so the results should be usable in different contexts. Imagine this: You are trying to answer the question ‘what do people think about customer service’ and create codes capturing key answers. Then you find that the same survey responses also have many comments about your company’s products.

If you need to answer “what do people say about our products?” you may find yourself having to code from scratch! Creating a coding frame that is flexible and has good coverage (see the Inductive Style below) is a good way to ensure value in the future.



A binary logistic regression would work, for the same reason that you can use a regression when you want to compare two sample means.

However, it may be more apparatus than is required.

With a small sample, one might consider a binomial test. In this case, your sample sizes are nice and big, so a straight out proportions test should be effectively indistinguishable from it and a little simpler to deal with.

However, since you have three experimental runs it might be worth including experimental run as a covariate (even though, if all is well, it should have no effect). In that case, you could use logistic regression to do the comparison incorporating the experimental-run variable, or you could look at a chi-square test to achieve basically the same thing (though if you want a one-tailed test, the first option would be better).
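
To stay consistent with the rest of this document, here is a minimal sketch of that last suggestion in SPSS syntax, assuming a binary outcome variable success, a two-level group variable and a three-level run variable (all names hypothetical):

  * Compare groups on a binary outcome, adjusting for experimental run.
  logistic regression variables success
    /method = enter group run
    /contrast (run) = indicator.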


Exporting data

There are several options for exporting data from Survey Monkey. My suggestion is to export the data in several different formats, as some formats will require less preparation for analysis than other formats. If you are able to export the data in SPSS format, then most of the work will be done for you (except variable names for your survey items). Also, you will almost always want to export the individual rows of data rather than the summary data. If you want summaries of your data, you can create them from the individual rows of data. However, you may not be able to do all of the analyses that you want from the summary data that you download from Survey Monkey, and there is no way to get back to the individual rows of data if you downloaded only the summary data.

Below we will show four ways that you could export the individual rows of data. No matter which of these four methods of data extraction you use, you will probably want to rename the variables. The SPSS command rename variables can be used in a few different ways to rename variables, and you can rename multiple variables in a single call to the command. Also, many of the variables come into SPSS as string variables (even if the variable contains numbers), so you will need to convert those to numeric variables to use them in analyses such as ANOVA and regression. Again, there are many ways to do this in SPSS. After you have renamed your variables and converted them into numeric form, you may want to recode or collapse some variables. We will show some examples of this. Also, if a variable has no responses, it will be a numeric variable. We will see an example of this (in question 7, for example).

Let’s see a few examples of some of the commands used most commonly to prepare the data for analysis. Please remember that all commands in SPSS MUST end in a period.

Also, think about how you would prefer to organize your SPSS syntax file. One way to organize the file is to work with each variable in turn. For example, you might rename the first variable, make a numeric version of it, add a variable label and then value labels, and then go through the same process for the second variable, and so on. Another way is to do each type of task for all variables. For example, you might recode all of the variables in one section, and then add value labels in another section. You may find a different way to organize your syntax file. How you organize the file doesn’t really matter, but having some form of organization does matter. This will help ensure that you don’t forget to do something. It will also make the file more understandable if you need to share it with someone else.

You will also want to add comments to the syntax file. You can do this either by using the comment command or by starting the line with an asterisk. As with all other commands in SPSS, you need to end your comment with a period. In my experience, these types of files tend to get “inherited”, meaning that the person who wrote the syntax file is no longer working on the project and someone else needs to step in and take over. You should try to write a syntax file that you would want to inherit.

I would suggest that this be the first command in your syntax file. It is one way to ensure that you are running your syntax on the correct data file. Here is an example of the command:
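
A minimal sketch using the get file command; the folder and file name are hypothetical:

  * Open the raw data file before any cleaning commands run.
  get file = 'C:\Survey Data\survey_raw.sav'.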

Although sometimes not technically necessary, I would suggest that you enclose the path specification and data file name in quotes. This is necessary if you have blanks in a folder name or the data file name.

The get data command is used to import data into SPSS. For example, you would use this command if you were trying to import data in an Excel file into SPSS.
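
A minimal sketch for an Excel file, assuming a reasonably recent version of SPSS (the path, sheet name and options are hypothetical):

  * Import an Excel workbook, reading variable names from the first row.
  get data
    /type = xlsx
    /file = 'C:\Survey Data\survey_raw.xlsx'
    /sheet = name 'Sheet1'
    /cellrange = full
    /readnames = on.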

Once you have finished your data cleaning tasks, I suggest that you save your dataset with a new name. When you do your analyses, you can open and use this dataset rather than re-running all of the data cleaning commands that are necessary to transform your raw dataset into an analysis-ready dataset. Also, it is possible that your analysis-ready dataset may contain only the numeric versions of your variables. Here is an example of this command:
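
A minimal sketch using the save command; the path and the dropped string variables are hypothetical:

  * Save the cleaned dataset under a new name, keeping only what analysis needs.
  save outfile = 'C:\Survey Data\survey_clean.sav'
    /drop = year othercoll studyplace.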

The rename variables command

The rename variables command does exactly what you think it should do: it renames variables.

When Survey Monkey names variables, it often uses the first part of the question. Here are a few examples from my survey: WhatyearofschoolareyouinEnteranumber, DidyouattendanothercollegeuniversitybeforecomingtoUCLA, Pleasetelluswhereyouprefertostudyandwhereyouactually, and WhatkindofnoiseleveldoyoupreferwhenstudyingPleasera. Some of the variable names are more than 50 characters long. Now you can see why you would want to rename your variables! You have some choices as to how you write your rename variables syntax, as well as how you organize your syntax file. One possibility is that you rename each of your variables one at a time. If you do this, you may want to do any other operations regarding that variable next, so that your syntax file is organized by variables. Alternatively, you could rename all of your variables in a single call to the rename variables command.
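
For example, using two of the Survey Monkey names above (the new, shorter names are just suggestions):

  rename variables
    (WhatyearofschoolareyouinEnteranumber = year)
    (DidyouattendanothercollegeuniversitybeforecomingtoUCLA = othercoll).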

The compute command is one of the commands that can be used to create a new variable. When cleaning the data, you can use the compute command to create a numeric version of a string variable, or you can use it to create a collapsed version of a multi-category variable. Here are some examples:
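
A minimal sketch of both uses; the variable names and the cut point are hypothetical:

  * Make a numeric version of a string variable that holds digits.
  compute year_num = number(year, f2.0).
  * Collapse the result into a 0/1 indicator (lower vs. upper division).
  compute upper_division = (year_num >= 3).
  execute.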

The if command is another command that you can use to create a new variable. You can also use the if command to recode the value of one variable based on the value of another variable. This command can also be used to create a numeric version of a string variable.
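
A minimal sketch, assuming a string variable othercoll with values 'Yes' and 'No' (hypothetical):

  * Create a numeric version of a string variable, one value at a time.
  if (othercoll = 'Yes') othercoll_num = 1.
  if (othercoll = 'No') othercoll_num = 0.
  execute.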

The recode command is used, of course, to recode variables. It can also be used to create a new variable (with the into keyword) and convert certain string variables into numeric variables (with the convert keyword).
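
A minimal sketch of both keywords; the variable names and category groupings are hypothetical:

  * Collapse a multi-category variable into a new variable with into.
  recode year_num (1 thru 2 = 1) (3 thru 4 = 2) (else = sysmis) into year_grp.
  * Turn a string of digits into a numeric variable with convert.
  recode zip (convert) into zip_num.
  execute.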

The autorecode command does exactly what its name suggests: it automatically recodes variables. Sometimes this command is very useful, and other times it produces undesired (or unexpected) results. Care should be taken when one or more of the variables has missing values.
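
A minimal sketch (variable names hypothetical); the blank subcommand guards against empty strings being recoded as if they were valid values:

  autorecode variables = studyplace
    /into studyplace_num
    /blank = missing
    /print.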

The crosstabs command

The crosstabs command is very useful for ensuring that the recoding of a variable went as planned. You can have one or more tables subcommands in your call to the crosstabs command. I strongly recommend that all recoded variables be checked against the original variable to be certain that the recode worked as intended. I understand that this can quickly become tedious, but making a mistake when recoding variables can cause lots of problems when you use that recoded variable in analyses, and at that point, the error may be quite difficult to uncover. Also, if you eliminate the original version of the variable from your dataset before you checked the recode, you may have considerable trouble finding the error.
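
For example, to check the collapsed variable from the recode example against its source (names hypothetical):

  * Every cell should fall where the recoding rules say it should.
  crosstabs
    /tables = year_num by year_grp.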

The alter type command has at least three purposes. It can change some string variables into numeric variables, it can alter the length of string variables, and it can change the format of numeric variables.
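
A minimal sketch of the three uses; the variable names and formats are hypothetical:

  * Change a string of digits into a numeric variable.
  alter type age (f3.0).
  * Change the length of a string variable.
  alter type comments (a500).
  * Change the display format of a numeric variable.
  alter type income (dollar10.0).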

The value labels command associates descriptive text with the values of categorical variables. This is very useful, because it reminds you what the values of the variable mean. The descriptive text is also given in output involving the variable, which often makes the output easier to interpret.
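
For example, assuming the 0/1 variable created earlier:

  value labels othercoll_num 0 'No' 1 'Yes'.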

The variable labels command

The variable labels command allows you to associate descriptive text with a variable. For example, you may make the actual question from your questionnaire the variable label.
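
For example (the label is a plausible reconstruction of the question behind the long Survey Monkey name shown earlier):

  variable labels year 'What year of school are you in? Enter a number.'.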

The delete variables command

The delete variables command does exactly what it says it does: it deletes variables from your dataset. You may find that some of the variables in your dataset have nothing but missing values; in other words, they are useless. You can use the delete variables command to remove these variables from the dataset. In general, we do not suggest that researchers remove string variables from the dataset once a numeric version of them has been created. Rather, we suggest that after all of the data cleaning has been done, that dataset gets saved. Then you can make a copy of that dataset and remove unneeded string variables from it. This will result in a dataset that is cleaned and rid of unnecessary variables; in other words, a dataset that is ready for analysis.
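
For example, to drop the string originals once their numeric versions exist (names hypothetical):

  delete variables year othercoll studyplace.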

The document command can be used to associate text with a dataset. You can use the add document command to make additional notes as needed. These commands are very useful for keeping important information with the dataset, as opposed to writing the important information in a notebook that may get separated from the dataset.
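
A minimal sketch; the wording of the notes is, of course, up to you:

  document Data exported from Survey Monkey and cleaned with this syntax file.
  add document 'Collapsed year into year_grp; string originals dropped.'.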

Below are the SPSS syntax files that I wrote to prepare the data for analysis. Although these files have an .sps extension, they are simply text files. This means that you do not need SPSS to open the files; you can open them with any text editor, such as NotePad or WordPad. If you look at each of these files, you will notice that when the data were extracted using “SPSS format”, the least amount of data cleaning was necessary. The other extraction methods required more steps to prepare the data for analysis. Furthermore, there are many differences between the extraction methods in terms of what needed to be done to prepare the data for analysis.

If you download the data using this method, most of the variables will be string variables and the values will be words associated with the options. You can access the SPSS syntax used to clean the data for the example questionnaire used in this workshop here. The original comma-separated values file is here.

If you download the data using this method, most of the variables will be string variables and the values will be words associated with the options. You can access the SPSS syntax used to clean the data for the example questionnaire used in this workshop here. The original comma-separated values file is here.

If you download the data using this method, most of the variables will be numeric variables and the values will be numbers associated with the options. You will not know from the information in the dataset which value labels should be associated with each numeric value, but you can probably get that information from the questionnaire. You can access the SPSS syntax used to clean the data for the example questionnaire used in this workshop here. The original comma-separated values file is here.

If you download the data using this method, most of the variables will be numeric variables and the value labels will be correctly associated with the numeric values. Also, most of the variables will have variable labels. The amount of data cleaning needed should be much less than with the other possible methods of downloading the data. You can access the SPSS syntax used to clean the data for the example questionnaire used in this workshop here. The original SPSS data file is here.

Improvements that could (should!) be made to this questionnaire

Notice that there is a problem with the question about the year in school. It is a good thing that we caught this error in our pilot testing. Also, question 9 about when people study is difficult to analyze because we specified it as “choose all that apply”. While in theory we may want to know all of the times that people study, we need to consider how we are going to analyze that variable if we allow respondents to choose multiple responses. The point is that piloting your questionnaire is important for working out problems with items, but it also allows you to discover potential difficulties with analyses. Remember that you want to know how you are going to analyze data before you collect them.


Appendix

What is survey data collection?

Survey data collection uses surveys to gather information from specific respondents. Survey data collection can replace or supplement other data collection types, including interviews, focus groups and more. The data collected from surveys can be used to boost employee engagement, understand buyer behaviour and improve customer experiences.

What is longitudinal analysis?

Longitudinal data analysis (often called “trend analysis”) is basically tracking how findings for specific questions change over time. Once a benchmark is established, you can determine whether and how numbers shift. Let's suppose that the satisfaction rate for your conference was 50% three years ago, 55% two years ago, 65% last year and 75% this year. In this case, congratulations are in order because your longitudinal data analysis shows a solid, upward trend in satisfaction.

What is the difference between correlation and causation?

Causation is when one factor causes another, whereas correlation is when two variables move together but one does not influence or cause the other. For example, drinking hot chocolate and wearing a woolly hat are two variables that are correlated, in that they tend to go up and down together; however, one does not cause the other. In fact, they are both caused by a third factor: cold weather. Cold weather influences both hot chocolate consumption and the likelihood of wearing a woolly hat. Cold weather is the independent variable, and hot chocolate consumption and the likelihood of wearing a woolly hat are the dependent variables. In the case of our conference feedback survey, cold weather most probably influenced attendees' dissatisfaction with the conference city and the conference overall. Finally, to further examine the relationship between variables in your survey, you might need to perform a regression analysis.

What is regression analysis?

Regression analysis is an advanced method of data visualisation and analysis that allows you to look at the relationship between two or more variables. There are many types of regression analysis and the one(s) a survey scientist chooses will depend on the variables he or she is examining. What all types of regression analysis have in common is that they look at the influence of one or more independent variables on a dependent variable. In analysing our survey data, we might be interested in knowing what factors have the greatest impact on attendees’ satisfaction with the conference. Is it a matter of the number of sessions? The keynote speaker? The social events? The site? Using regression analysis, a survey scientist can determine whether and to what extent satisfaction with these different attributes of the conference contribute to overall satisfaction.
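
Sticking with SPSS syntax, a minimal sketch of such a driver analysis, with hypothetical variable names for overall and attribute-level satisfaction:

  * Which attribute ratings best predict overall satisfaction?
  regression
    /dependent overall_sat
    /method = enter sessions_sat keynote_sat social_sat site_sat.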

This, in turn, provides insight into which aspects of the conference you might want to alter next time around. Let's suppose, for example, that you paid a hefty fee to secure the services of a top-flight keynote speaker for your opening session. Participants gave this speaker and the conference overall high marks. Based on these two facts, you might think that securing the services of a fabulous (and expensive) keynote speaker is the key to conference success. Regression analysis can help you determine whether this is indeed the case. You might find that the popularity of the keynote speaker was a major driver of satisfaction with the conference. If so, next year you’ll want to secure the services of a great keynote speaker again. However, if, for example, the regression shows that although everyone liked the speaker, this did not contribute much to attendees’ satisfaction with the conference, the large sum of money spent on the speaker might be better spent elsewhere. If you take the time to carefully analyse the soundness of your survey data, you’ll be on your way to using the answers to help you make informed decisions.

