News articles and front pages from 19 Swedish news sites during the covid-19/corona pandemic 2020–2021

SND-ID: 2021-256

Creator/Principal investigator(s)

Peter M. Dahlgren - University of Gothenburg, Department of Journalism, Media and Communication (JMG)

Description

This dataset contains news articles from Swedish news sites during the covid-19/corona pandemic 2020–2021. The purpose was to develop and test new methods for collecting and analysing large news corpora by computational means. In total, 677,151 articles were collected from 19 news sites between 2020-01-01 and 2021-04-26. The articles were collected by scraping all links on the homepages and main sections of each site every two hours, day and night.

The dataset also includes about 45 million timestamps recording when the articles were present on the front pages (the homepage and main sections of each news site, such as domestic news, sports, and editorials). This allows for detailed analysis of which articles a reader was likely exposed to when visiting a news site. The time resolution is two hours, so changes in which articles appeared on the front pages can be detected at two-hour intervals.
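As an illustration of how the two-hour resolution might be used, the sketch below groups an article's front-page sightings into continuous spells of exposure. It is not taken from the dataset documentation: the row layout (article id plus ISO timestamp), the sample values, and the function name `exposure_spells` are assumptions made for the example, and the real column names may differ.

```python
from collections import defaultdict
from datetime import datetime, timedelta

# Hypothetical rows: (article_id, ISO timestamp of a scrape in which the
# article was seen on a front page). Real column names may differ.
rows = [
    (1, "2020-03-01T00:00:00"),
    (1, "2020-03-01T02:00:00"),
    (1, "2020-03-01T04:00:00"),
    (1, "2020-03-01T10:00:00"),
    (2, "2020-03-01T02:00:00"),
]

# Scrapes ran every two hours according to the description above.
SCRAPE_INTERVAL = timedelta(hours=2)


def exposure_spells(rows, interval=SCRAPE_INTERVAL):
    """Group each article's sightings into runs of consecutive scrapes.

    Two sightings belong to the same spell when they are at most
    `interval` apart. Returns {article_id: [(first_seen, last_seen), ...]}.
    """
    seen = defaultdict(list)
    for article_id, ts in rows:
        seen[article_id].append(datetime.fromisoformat(ts))

    spells = {}
    for article_id, times in seen.items():
        times.sort()
        runs = [[times[0], times[0]]]
        for t in times[1:]:
            if t - runs[-1][1] <= interval:
                runs[-1][1] = t          # extend the current spell
            else:
                runs.append([t, t])      # gap found: start a new spell
        spells[article_id] = [(start, end) for start, end in runs]
    return spells


spells = exposure_spells(rows)
```

In practice the scrape times are unlikely to be exactly two hours apart, so a slightly larger `interval` may be needed to avoid splitting spells artificially.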

The 19 news sites are aftonbladet.se, arbetet.se, da.se, di.se, dn.se, etc.se, expressen.se, feministisktperspektiv.se, friatider.se, gp.se, nyatider.se, nyheteridag.se, samnytt.se, ...

Language

English

Swedish

Research principal, contributors, and funding

Research principal

University of Gothenburg

Responsible department/unit

Department of Journalism, Media and Communication (JMG)

Funding

  • Funding agency: The Swedish Civil Contingencies Agency (MSB)

Protection and ethical review

Data contains personal data

No

Method and time period

Unit of analysis

Population

News articles

Time Method

Sampling procedure

Total universe/Complete enumeration

An open source web scraper collected news articles from 19 Swedish news sites every two hours. The Python source code for the scraper is available at: https://github.com/peterdalle/mechanicalnews
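The link-collection step described above can be sketched with the Python standard library alone. This is not the mechanicalnews implementation, whose internals are not described here: the class `LinkCollector`, the function `extract_links`, the same-host filter, and the sample HTML and URLs are all assumptions made for illustration.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse


class LinkCollector(HTMLParser):
    """Collect absolute, same-host links from the <a href> tags of a page."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = set()

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href")
        if not href:
            return
        # Resolve relative links against the page URL.
        absolute = urljoin(self.base_url, href)
        # Assumption: only links on the same host count as candidate articles.
        if urlparse(absolute).netloc == urlparse(self.base_url).netloc:
            self.links.add(absolute)


def extract_links(html, base_url):
    """Return the sorted, de-duplicated same-host links found in `html`."""
    parser = LinkCollector(base_url)
    parser.feed(html)
    return sorted(parser.links)


# Hypothetical front-page fragment, used instead of a live request.
sample_html = (
    '<a href="/nyheter/a1">A</a>'
    '<a href="https://example.org/x">B</a>'
    '<a>no href</a>'
)
links = extract_links(sample_html, "https://www.example.com/")
```

A scheduler that fetches each homepage and section page every two hours and stores the resulting links with a timestamp would sit around this function; that part is omitted here.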

Time period(s) investigated

2020-01-01 – 2021-04-26

Geographic coverage

Geographic spread

Geographic location: Sweden

Topic and keywords

Research area

Media, Language and linguistics, Public health (CESSDA Topic Classification)
Language Technology (Computational Linguistics), Media Studies (The Swedish standard of fields of research 2011)

Publications

Dahlgren, P. M. (2021). Svenskar eller utrikesfödda i medierna? – att identifiera födelseland från namn. In L. Truedson & J. Lundqvist (Eds.), Vitt eller brett? – vilka får ta plats i medier och på redaktioner. Stockholm: Institutet för mediestudier.
ISBN: 978-91-987098-0-3

Dahlgren, P. M. (2021). Medieinnehåll och mediekonsumtion under coronapandemin: Datoriserade metoder för insamling och analys av stora mängder text- och mediedata. Göteborg: Institutionen för journalistik, medier och kommunikation (JMG), Göteborgs universitet.
ISSN: 1101-4679

Dataset
News articles and front pages from 19 Swedish news sites during the covid-19/corona pandemic 2020–2021

Associated documentation

Description

The dataset consists of the following:
article_metadata.csv (53 MB): The file contains information about each news article, one article per row. In total, there are 677,151 observations and 17 variables.

article_text.csv (236 MB): The file contains the id of each news article and how many times (count) a specific word occurs in the news article. The file contains 80,090,784 observations and 3 variables in long format.

frontpage_timestamps.csv (175 MB): The file contains when each news article ...
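A long-format word-count file such as article_text.csv can be folded back into one bag of words per article. The sketch below uses a small in-memory sample instead of the real file; the column names `id`, `word`, and `count` are assumptions inferred from the description above and should be checked against the actual header row, and `load_word_counts` is a hypothetical helper.

```python
import csv
import io
from collections import Counter, defaultdict

# Hypothetical sample in the long format described above:
# one row per (article id, word) pair with its occurrence count.
sample_csv = """id,word,count
1,corona,3
1,vaccin,1
2,corona,2
2,sport,5
"""


def load_word_counts(handle):
    """Read long-format rows into {article_id: Counter({word: count})}."""
    per_article = defaultdict(Counter)
    for row in csv.DictReader(handle):
        per_article[int(row["id"])][row["word"]] += int(row["count"])
    return per_article


per_article = load_word_counts(io.StringIO(sample_csv))

# Corpus-wide totals, obtained by summing the per-article counters.
corpus_totals = sum(per_article.values(), Counter())
```

For the full file (about 80 million rows), reading in chunks or using a dataframe library would likely be more practical than holding every counter in memory; to read from disk, pass `open("article_text.csv", newline="", encoding="utf-8")` in place of the `io.StringIO` object, assuming the file is UTF-8 encoded.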

Version 1

Citation

Peter M. Dahlgren. University of Gothenburg (2021). News articles and front pages from 19 Swedish news sites during the covid-19/corona pandemic 2020–2021. Swedish National Data Service. Version 1. https://doi.org/10.5878/d18f-q220

Data format / data structure

Text

Data collection

  • Time period(s) for data collection: 2020-01-01 – 2021-04-26
  • Source of the data: Communications: Public

Variables

17

Number of individuals/objects

677151

License

Creative Commons Attribution 4.0 International (CC BY 4.0)

Published: 2021-11-02