1. Basic Article Information
- Title: Fields of Gold: Scraping Web Data for Marketing Insights
- Authors: Johannes Boegershausen, Andrew T. Stephen, Hannes Datta, Abhishek Borah
- Journal: Journal of Marketing, 2022, Vol. 86(5), 1-20
- Main idea: The article introduces how marketing scholars use web scraping and application programming interfaces (APIs) to collect data from the internet, proposes a methodological framework for improving data validity, reviews and classifies related articles, and outlines directions for future research.
2. Web Data in Marketing Research
- Growth in Use
- The share of marketing publications using web data rose from about 4% in 2010 to 15% in 2020. Most of these rely on web scraping (59%); fewer use APIs (12%), some combine both (9%), and the remaining 20% extract web data manually. Web data have been adopted across subfields. Online word of mouth and social media are the areas where web scraping is most prominent, and Amazon is the most commonly used data source.
- Four Pathways for Advancing Marketing Knowledge
- Studying New Phenomena: Web data let marketing scholars study new phenomena, such as online conversations and the effect of consumer reviews on sales in the early 2000s. They also make it possible to study contemporary issues before traditional data become available, such as the effect of pandemic lockdown policies on consumption.
- Boosting Ecological Value: Web data bring researchers closer to the "natural habitat" of marketing and complement more controlled data collection methods. Because they can be collected without the direct involvement of data providers, the societal relevance of a research question can take precedence over commercial objectives.
- Facilitating Methodological Advancement: Much of the data produced by consumers and firms is unstructured. Web data have therefore driven the development of methods for extracting insights from unstructured data, such as automated text analysis and image and video content analysis, and have promoted the use and advancement of network analysis methods.
- Improving Measurement: Web data allow researchers to measure constructs more precisely and draw more valid inferences. They can be used to collect control variables, operationalize new measures efficiently, and collect data at higher frequencies, which increases statistical power and helps identify causal effects.

3. A Methodological Framework for Web Data Collection
- Framework Overview
- The framework focuses on validity concerns in web data collection across three stages: data source selection, data collection design, and data extraction. At each stage, researchers need to weigh technical, legal/ethical, and validity considerations together so that research findings become more credible.
- Data Source Selection
- Exploring the Universe of Potential Sources: Avoid focusing only on familiar platforms and actively consider a range of websites and APIs. Sources can be approached from different perspectives, and deciding when to stop exploring requires assessing the strengths of the sources selected so far. Data can also be collected from multiple sources.
- Considering Alternatives to Web Scraping: APIs are an alternative to web scraping. Extracting data via APIs scales better and carries lower legal risk. Other documented datasets can also be used; when searching for such datasets, the relevant terms should be included explicitly in the query.
- Mapping the Data Context: Researchers need to identify contextual developments that may affect the validity of the research, including changes in the data structure, information related to the focal data, and potential research opportunities. The data context can be explored and mapped in several ways.

- Data Collection Design
- Which Information to Extract: When no downloadable dataset exists, researchers must decide which information to extract from the source. Relevant factors include how many times data will be collected, best practices for setting the extraction frequency, technical parameters, and the site's robots file (robots.txt). Algorithmic interference and the temporal stability of the information also need attention. Collecting metadata can strengthen validity.
- How to Sample: When the entire database cannot be accessed, a sampling frame needs to be designed. Determining a sufficient sample size is crucial. The frame can draw on external sources or on the source's internal data, and the sampling method should be chosen carefully to avoid systematic bias.
- At Which Frequency to Extract Information: A single extraction has advantages but can raise validity problems. Multiple extractions help detect changes over time, and automated scheduling keeps the collection consistent. Researchers should also decide whether to set an end date for extraction.
- How to Process the Information During the Extraction: Processing during collection involves a trade-off between efficiency and validity. Ideally, the raw data are retained because they support later reprocessing and help reduce errors, although storing them carries technical and ethical risks.
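
The design decisions above (respecting the robots file, pacing requests, and drawing a sample without systematic bias) can be illustrated with a minimal Python sketch. It is an illustration under stated assumptions, not the procedure from the article: the robots.txt content, the `example.com` URLs, the `research-bot` user agent, and the helper names `build_parser`, `allowed_urls`, and `draw_sample` are all hypothetical. Only the standard-library modules `urllib.robotparser` and `random` are used.

```python
import random
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content; a real project would fetch it from the site.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Crawl-delay: 5
"""


def build_parser(robots_txt: str) -> RobotFileParser:
    """Parse robots.txt text that has already been downloaded."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


def allowed_urls(parser: RobotFileParser, urls, user_agent="research-bot"):
    """Keep only the URLs that the robots file permits for this user agent."""
    return [u for u in urls if parser.can_fetch(user_agent, u)]


def draw_sample(frame, n, seed):
    """Draw a simple random sample; the fixed seed makes the draw reproducible."""
    rng = random.Random(seed)
    return sorted(rng.sample(frame, n))


parser = build_parser(ROBOTS_TXT)

# Hypothetical sampling frame: 100 product pages plus one disallowed page.
frame = [f"https://example.com/products/{i}" for i in range(1, 101)]
frame.append("https://example.com/private/admin")

eligible = allowed_urls(parser, frame)
sample = draw_sample(eligible, n=10, seed=42)

# Seconds to wait between requests, as declared by the site (None if absent).
delay = parser.crawl_delay("research-bot")
print(len(eligible), len(sample), delay)
```

Recording the seed and the sampling frame alongside the data is one way to document how the sample was drawn, so that the draw can be repeated later.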
- Data Extraction
- Improving the Performance of Extraction: Large-scale data collection can run into technical problems. These can be addressed in several ways, such as using stable selectors, choosing a stable API version, and re-parsing the stored data.
- Monitoring Data Quality: A monitoring system makes it possible to diagnose data quality problems in real time. Performance should be considered at different levels, and automated reports and alerts help with monitoring during long-running collections.
- Documenting Data: Relevant information needs to be recorded in real time during extraction. Accurate and comprehensive documentation is crucial for future use of the data. Templates can be used, and the institutional background of the source should be recorded.
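
As a rough illustration of the extraction steps above, the following Python sketch re-parses a stored raw page using a selector that is assumed to be stable, then checks the share of missing values as a simple quality monitor. Everything specific here is an assumption for illustration: the HTML snippet, the `data-testid="price"` attribute (treated as more stable than auto-generated class names), the 20% alert threshold, and the names `PriceParser`, `missing_share`, and `quality_alert`. Only the standard-library `html.parser` module is used.

```python
from html.parser import HTMLParser

# Hypothetical raw page saved during extraction. The class names look
# auto-generated, so the sketch selects on the data-testid attribute instead.
RAW_PAGE = """
<div class="css-1x9z"><span data-testid="price">19.99</span></div>
<div class="css-8k2q"><span data-testid="price"></span></div>
<div class="css-77ab"><span data-testid="price">5.00</span></div>
"""


class PriceParser(HTMLParser):
    """Collect one entry per price element; None marks a missing value."""

    def __init__(self):
        super().__init__()
        self._in_price = False
        self.prices = []

    def handle_starttag(self, tag, attrs):
        if dict(attrs).get("data-testid") == "price":
            self._in_price = True
            self.prices.append(None)  # placeholder until text is seen

    def handle_data(self, data):
        if self._in_price and data.strip():
            self.prices[-1] = float(data)

    def handle_endtag(self, tag):
        self._in_price = False


def missing_share(values):
    """Share of missing values; an empty extraction counts as fully missing."""
    if not values:
        return 1.0
    return sum(v is None for v in values) / len(values)


def quality_alert(values, threshold=0.2):
    """Return True when the missing share exceeds the (assumed) threshold."""
    return missing_share(values) > threshold


price_parser = PriceParser()
price_parser.feed(RAW_PAGE)  # re-parsing works because the raw page was kept
prices = price_parser.prices

print(prices, quality_alert(prices))
```

Because the raw page is retained, the parser can be corrected and rerun later without contacting the website again, which is the main benefit of re-parsing.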
4. Future Research Directions
- Identifying New Web Data Sources
- Drawing from Underutilized Sources: Researchers are encouraged to study sources that have received little attention or to examine new dependent variables; for example, web data related to the US ******** industry could be used to study marketing questions. Websites from regions such as Africa also deserve attention, and unique, rich datasets can be built by combining multiple sources.
- Rediscovering Frequently Used Sources: Looking at different information on frequently used sources may reveal new phenomena, such as gender or racial issues on platforms like TripAdvisor.
- Altering the Extraction Frequency: Extracting data multiple times may reveal new marketing phenomena, for example in the study of "fake" reviews on Amazon.
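
One way repeated extraction could surface such phenomena is by comparing snapshots of the same page taken at different times. The sketch below is only an assumption about how this might be done, not the article's method: the review IDs, review texts, snapshot names, and the `diff_snapshots` helper are hypothetical.

```python
def diff_snapshots(earlier, later):
    """Compare two {review_id: text} snapshots of the same product page."""
    earlier_ids, later_ids = set(earlier), set(later)
    shared = earlier_ids & later_ids
    return {
        "removed": sorted(earlier_ids - later_ids),  # present before, gone now
        "added": sorted(later_ids - earlier_ids),    # new since the last visit
        "changed": sorted(i for i in shared if earlier[i] != later[i]),
    }


# Hypothetical snapshots taken one week apart.
snapshot_day1 = {
    "r1": "Great product",
    "r2": "Five stars!!!",
    "r3": "Broke quickly",
}
snapshot_day8 = {
    "r1": "Great product",
    "r3": "Broke after a week",
    "r4": "Works fine",
}

changes = diff_snapshots(snapshot_day1, snapshot_day8)
print(changes)
```

A single extraction on day 8 would never show that review `r2` had existed, which is why the extraction frequency shapes what can be observed.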
- Using the Versatility of Web Data to Boost Ecological Value
- Infusing Ecological Validity into Experimental Stimuli: Carefully selected websites and APIs can increase the ecological validity of experiments, as applications by social psychologists show. Core marketing topics and methods can benefit as well.
- Running Self-Administered Field Experiments via APIs: Conducting field experiments through APIs gives researchers more control over the experimental process and lets them collect high-frequency data to analyze treatment effects. More applications may follow in the future.
- Adopting New Metrics and Methods to Generate Marketing Insights
- Exploring Web Sources that Provide Better Marketing Metrics: Researchers should examine which web sources yield better marketing metrics. For example, Google search data and Twitter data each serve different purposes, but over-reliance on easily accessible data should be avoided.
- Operating API-based Microservices: By offering microservices via APIs, researchers can study emerging topics such as recommendation systems and gain access to unique company data, which promotes knowledge discovery.
- Using Efficiency Gains to Improve Measurement
- Leveraging Web Sources to Describe Diverse Online and Offline Behaviors: Researchers are encouraged to use websites and APIs to describe a broad range of behaviors, not only online ones. Examples include a firm's service orientation and its environmental credentials.
- Embracing APIs for Better Measurement: APIs can improve measurement, for example by automating participant management to reduce costs. They can also be used to send monitoring alerts, orchestrate virtual computing infrastructure, and facilitate stimulus selection.

