Google and Twitter Cross-National Religion Measures
Citation
Adamczyk, A., Hitlin, S., & Scott, J. (2021, September 13). Google and Twitter Cross-National Religion Measures.
Summary
This dataset's Google Trends (GT) measures comprise data aggregated from 2012 to 2018. They were collected with the Python library pytrends, which can access GT's API, through either regional searches (interest by region) or time-based searches (interest over time). A regional search for a term or set of terms returns relative search interest between countries. A time-based search for a set of terms returns relative search interest over time for each term within each country; in this dataset, interest in each term is averaged across all time periods to provide an overall raw number reflecting search interest. Both are included here. Twitter measures were collected using the Python library tweepy, which can access the Twitter API. In this case, the Search API was used, which matches queries against a sampling of recent Tweets published in the past seven days.
Other variables were collected from several official publicly available sources, and their corresponding websites are listed in the codebook. These variables comprise sources of official data, which can be used to validate the collected GT and Twitter measures as well as to control for related phenomena such as access to computers, Internet use, wealth/development and other factors.
Data File
Cases: 230
Weight Variable: None
Data Collection
Google Trends data were collected periodically throughout the study period, from September 2019 to June 2020, and cover the period of January 1, 2012 to June 30, 2018. Twitter data were collected periodically from January 2020 to June 2020, for varying periods which are outlined in the codebook. Collection dates for all other data sources are also listed in the codebook.
Original Survey (Instrument)
Google and Twitter Cross-National Religion Measures
Funded By
Funding provided by Templeton Religion Trust of Nassau, Bahamas, through the Global Religion Research Initiative at the University of Notre Dame.
Collection Procedures
Google Trends (GT) is the platform Google provides to produce information on people's search interests across time and geographical areas. In this database, the GT and Twitter measures were collected in order to validate them against established measures such as the World Values Survey, Pew population data and other official measures. Any measures in this dataset that were not derived from GT or Twitter can be used either to validate these measures or as control variables.
There are several methods of conducting term or topic interest searches on the GT platform, but the two used in this database are 1) regional interest and 2) paired interest-over-time. The codebook indicates which type of search yielded each variable, and both types of searches are included. For regional interest term and topic searches, Google Trends returns a list of the chosen geographical areas (DMAs, countries), each with a standardized measure: a ratio that ranks the designated areas from the highest to the lowest 'search rate'. It first calculates the original 'search rate' in each designated geographical area, defined as the search volume of an inquiry keyword in a geographic unit divided by the total search volume of the same geographic unit. It then assigns the number 100 to the area with the highest search rate, and assigns each other area a number from 0 to 100 in proportion to its search rate relative to the highest one (see www.google.com/trends). A higher value denotes that the topic or term made up a higher proportion of all the queries in that area.
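The regional scaling described above can be sketched in a few lines of Python. This is only an illustration of the arithmetic, not Google's implementation: the function name, the rounding to whole numbers, and the search volumes are all assumptions made up for the example, since Google does not publish raw volumes.

```python
def regional_index(keyword_volume, total_volume):
    """Scale per-area search rates so the top area scores 100.

    Both arguments map an area name to a search count. The counts used
    below are invented for illustration; Google Trends does not expose them.
    """
    # Search rate: keyword searches divided by all searches in the same area.
    rates = {
        area: keyword_volume[area] / total_volume[area]
        for area in keyword_volume
    }
    peak = max(rates.values())
    if peak == 0:
        # No searches anywhere: every area gets 0.
        return {area: 0 for area in rates}
    # Rounding to integers is an assumption about presentation only.
    return {area: round(100 * rate / peak) for area, rate in rates.items()}


# Hypothetical areas and volumes.
keyword_volume = {"Area A": 50, "Area B": 30, "Area C": 5}
total_volume = {"Area A": 1000, "Area B": 1200, "Area C": 1000}

index = regional_index(keyword_volume, total_volume)
print(index)
```

Area B has more total searches than Area A but a lower search rate, so it scores below 100. The index reflects the share of queries in an area, not the absolute number of searches.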
For interest-over-time searches, the platform returns a similar list but with time periods (weeks or months) as the unit of analysis for each geographical area. Terms and topics can be paired in order to gauge relative interest in each word for each specified time period for each area; these results can then be averaged to provide an overall relative interest in each term. In this case, Google returns a list of each term by week/month and by geographical area, and calculates a search rate, similar to the regional searches. However, instead of volume by geography, it returns volume by time period for each geographical area. The week/month with the highest search rate for any of the terms is assigned the number 100, and every other combination of time period and term is assigned a number from 0 to 100 in proportion to its search rate relative to the highest one. A higher value denotes that the term made up a higher proportion of all the queries in that geographical area for that week or month. In this database, these numbers are then averaged over time. This paired time series averaging returns, for each word or topic, an average of all search activity across the periods in question, so the averages for the paired words or topics do not add up to 100 but instead depict the raw relative search activity for each term over time.
Word pairings depend on the topic. For example, it makes sense to pair 'Islam' with 'Christianity', but it is not as clear what less-specific terms such as 'worship' or 'God' could be paired with. In this database, religion-related topics which did not have a clear pairing were paired with the topic 'weather,' which is a passive and non-religious term and one of the most commonly searched words in the USA and globally. All word pairings are noted in the codebook.
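The paired scaling and averaging described above can be illustrated with a short Python sketch. It assumes hypothetical per-period search rates for two paired topics in one geographical area. The function names, the rates, and the integer rounding are illustrative assumptions, not values or code from the dataset.

```python
from statistics import mean


def paired_time_index(rates_by_term):
    """Scale all terms jointly so the single highest period/term equals 100.

    rates_by_term maps each term to a list of search rates, one per period.
    """
    peak = max(max(series) for series in rates_by_term.values())
    return {
        term: [round(100 * rate / peak) for rate in series]
        for term, series in rates_by_term.items()
    }


def average_interest(index_by_term):
    """Average each term's scaled values over all periods."""
    return {term: mean(series) for term, series in index_by_term.items()}


# Hypothetical search rates for three periods in one area.
rates = {
    "religion_topic": [0.02, 0.04, 0.03],
    "weather": [0.08, 0.10, 0.06],
}

scaled = paired_time_index(rates)
averages = average_interest(scaled)
print(scaled)
print(averages)
```

Because both topics are scaled against the same peak, the two averages are directly comparable, but they are not shares of a whole and need not sum to 100.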
Single-term regional searches provide a measure of comparative interest across geographies, while paired time searches provide relative interest in two or more search terms for each country over time.
Google Trends can produce estimates for 'topic' or 'term' searches. With 'term' searches, Google Trends matches the terms that people search in the language given. With 'topic' searches, however, Google Trends considers a group of terms that share the same concept in any given language. Thus, using Google Trends by topic is more useful when researchers want to obtain search estimates among different countries, or in a country that has more than one major language. This dataset contains only topic-based searches. All Google Trends data were derived from the same time period, January 1, 2012 to June 30, 2018. The end point is when we started the study, and we use 2012 as our beginning point because we felt that Internet usage would be well-established not only in the United States, but across the world, providing stable estimates.
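As a rough sketch of how such queries might be issued with pytrends for the study window, consider the following. It is not the authors' collection script: the function name, the `hl` and `tz` settings, and the plain-text keywords are assumptions, and the topic identifiers actually used for the dataset's topic-based searches are not given here. Running it requires pytrends and network access, so the function is defined but not called.

```python
# Study window from the text, in the "YYYY-MM-DD YYYY-MM-DD" form pytrends accepts.
STUDY_TIMEFRAME = "2012-01-01 2018-06-30"


def fetch_interest(keywords, geo="", timeframe=STUDY_TIMEFRAME):
    """Return (interest_by_region, interest_over_time) DataFrames.

    keywords: list of search terms or topic identifiers (placeholders here).
    geo: two-letter country code, or "" for worldwide.
    """
    # Imported inside the function so this sketch loads without pytrends installed.
    from pytrends.request import TrendReq

    client = TrendReq(hl="en-US", tz=360)  # language and timezone are assumptions
    client.build_payload(kw_list=keywords, timeframe=timeframe, geo=geo)

    # Regional search: one row per country, scaled 0 to 100.
    by_region = client.interest_by_region(resolution="COUNTRY", inc_low_vol=True)

    # Time-based search: one row per period, one column per keyword.
    over_time = client.interest_over_time()
    return by_region, over_time


# Example call (not executed here; needs network access):
# by_region, over_time = fetch_interest(["Islam", "Christianity"], geo="US")
```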
Using a Python program to access Twitter's API, Tweets containing the words of interest can be downloaded for a specified time frame, and the proportion of Tweets referring to each topic can be calculated. GT data demonstrate interrogatory, information-seeking interest, while Twitter data demonstrate self-presentational, expressive processes. Our search terms were queried in pairs or groups against the available Twitter sample for the dates specified in the codebook, and a proportion of 'interest' in each term was determined as the ratio of Tweets relating to each subject to the Tweets relating to all of its paired or grouped terms. For example, we searched for terms relating to all five of the major world religions during the four-day period of May 25 to 29 and calculated a ratio of interest in each by country, using available user-generated location data. In this case, 93 percent of collected Tweets from Afghanistan, a majority-Muslim country, referred to Islam in some way, with a low proportion of Christianity-related (3.4 percent) or Judaism-related (3.4 percent) Tweets and none related to Hinduism or Buddhism. Conversely, 70 percent of Tweets from the Philippines, a majority-Christian country, were related to Christianity, another 23 percent were related to Islam, and under 3 percent were related to each of the other religions.
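The per-country proportions described above amount to a grouped count followed by a division. The sketch below assumes each collected Tweet has already been assigned a country (from user-supplied location) and one topic. How Tweets were matched to topics is not specified here, and the function name and sample records are hypothetical.

```python
from collections import Counter, defaultdict


def topic_shares(tweets):
    """Return {country: {topic: percent of that country's Tweets}}.

    tweets: iterable of (country, topic) pairs, one per collected Tweet.
    """
    counts = defaultdict(Counter)
    for country, topic in tweets:
        counts[country][topic] += 1

    shares = {}
    for country, topic_counts in counts.items():
        total = sum(topic_counts.values())
        shares[country] = {
            topic: round(100 * n / total, 1)
            for topic, n in topic_counts.items()
        }
    return shares


# Hypothetical sample: ten Tweets from one made-up country.
sample = (
    [("Country A", "Islam")] * 7
    + [("Country A", "Christianity")] * 2
    + [("Country A", "Judaism")] * 1
)

shares = topic_shares(sample)
print(shares)
```

Topics with no matching Tweets in a country simply do not appear in that country's result, which corresponds to a share of zero.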
Sampling Procedures
Samples from Twitter were collected using the publicly available standard search Application Programming Interface (API). Twitter's Standard Search allows queries against a sample of Tweets from the past week. It does not return a complete dataset, but rather a representative sample of recent Tweets which can be extrapolated to a larger population. In the case of the measures provided here, our search terms were queried in pairs or groups against the available Twitter sample for the dates specified in the codebook, and a proportion of 'interest' in each term was determined as the ratio of Tweets relating to each subject to the Tweets relating to all of its paired or grouped terms.
Samples from Google Trends (GT) are based on both regional and time-based search interest. Again, using the GT API, search terms are queried against the available sample to provide a general idea of search interest. Interest-over-time searches provide numbers which represent interest relative to the highest peak of search activity for a given topic. Interest-by-region searches provide numbers which represent interest relative to the highest peak of search activity by location. Higher values represent a higher proportion of all queries for each location. All relevant available data from the official sources used in the dataset were included.
Principal Investigators
Dr. Amy Adamczyk, John Jay College of Criminal Justice and The Graduate Center, City University of New York
Dr. Steven Hitlin, University of Iowa
Jacqueline Scott, John Jay College of Criminal Justice and The Graduate Center, City University of New York