22.01.2018

Expert: The need to archive online resources is a real problem

Photo: Fotolia

The average website exists for 40 to 100 days. In many European countries, national libraries archive websites. Poland has no such initiative, although scientists and lawyers, among others, have noticed the problem - an expert from the University of Warsaw told PAP.

"We are not able to archive all the Internet resources, there are just too many of them, but attempts are made to preserve some websites for future generations. Some institutions also archive the contents of their mailboxes and official accounts on social media" - told PAP expert from the Digital Humanities Lab at the University of Warsaw Marcin Wilkowski.

Every minute, Internet users share more than half a million photos on Snapchat and post nearly half a million tweets on Twitter and almost 50,000 photos on Instagram; Americans alone use more than 2.6 million GB of data. 90 percent of all available data has been created in the last two years, and 2.5 quintillion bytes of data appear every day, according to the Data Never Sleeps 5.0 report prepared by DOMO.

In many European countries, including Germany, Austria, the Czech Republic, Finland and Great Britain, national libraries archive the content of websites. However, they do not record all sites, only those published in the country's own domain (in Germany, sites with the country code ".de"; in the UK, ".uk" and ".co.uk"). Other countries also archive important Internet publications that concern them, in addition to pages in the national domain. Portugal, for example, does this in its web archive, created 10 years ago - said Wilkowski.

"Poland lacks a similar initiative" - said the expert. "Entire collection of archival websites is not and will never be available anywhere, so it is not true that nothing is ever lost on the Internet" - Wilkowski added. He reminded that the average website lifetime is from 40 to 100 days, and in addition some website design solutions, such as JavaScript, make archiving difficult.

Malwina Rozwadowska from the National Digital Archives (NDA) told PAP that in 2009-2010 the NDA carried out a project to archive websites in the "gov.pl" domain. "It was a one-off project. Talks are ongoing with the Ministry of Digital Affairs to continue archiving the Internet, but at this stage we are not able to provide specific information on when and in what form the data will be made available to a wider audience" - emphasised Rozwadowska.

More than 279 billion web pages archived since 1996 are available today thanks to the Internet Archive, an American foundation that also digitises materials and offers access to multimedia collections and old computer games.

"The founder of the Internet Archive, Brewster Kahle, compared its activity to the Library of Alexandria, whose goal was to collect all available written texts from around the world. Kahle considered websites a part of the digital heritage already in the mid-1990s" - said Wilkowski. In 2002, the Internet Archive signed an agreement with the modern Library of Alexandria, under which the latter undertook to create a backup copy of archived Internet collections.

Lawyers are also involved in the discussion about archiving. In court hearings, references to websites, including archived ones that no longer exist, are increasingly used as evidence. The problem, especially in the latter case, is establishing their credibility in court proceedings. Scientists face a similar problem with disappearing pages that authors of scientific papers cite in footnotes.

"While archiving a website is a relatively simple task, is definitely more difficult to archive the content of social media, which today become the space of official communication of public institutions" - said Wilkowski. That\'s why ministerial tweets are archived in some countries. This is the case, for example, in the UK.

Another big challenge for archival science is that the Internet today is largely personalized, and the content of many websites is dynamically adapted to the user's earlier choices. This means that different people may receive different content at the same URL. In that case, what is the original that should be preserved? - Wilkowski wonders.

In Wilkowski's opinion, attempts to archive posts on social media are severely limited by the rules of the platforms and by caps on the amount of data that can be obtained through their programming interfaces. The sheer scale of the data to be recorded adds to the difficulty.

The US Library of Congress began to archive all Twitter posts in April 2010 and has so far archived several billion tweets. In December 2017, it announced that it would no longer collect all entries on this social network. Since January 1, 2018, the Library of Congress has been selecting which tweets to keep for the future. These will include tweets on important social events and trends.

The Internet also includes e-mail accounts, among them those used by official state institutions and heads of state. In the US, for example, presidents' mailboxes are archived.

"Unfortunately, Polish public institutions do not disclose the principles and methods of archiving their online resources" - said Wilkowski.

Malwina Rozwadowska from the NDA told PAP that this institution is not involved in archiving mailboxes of Polish public institutions.

Asked by PAP whether the official mailboxes of the President and his ministers are archived, the Chancellery of the President of the Republic of Poland replied that there are no grounds for doing so.

"A given message is archived if it initiates a case of complaint or application, or constitutes a part of a case, as a result of which it is included in the case file. Archiving messages that are not part of any case would be pointless" - informs the Chancellery of the President of the Republic of Poland.

"It should also be remembered that as a rule, each e-mail user personally manages the content of their mailbox" - the Chancellery added.

By the publication date, PAP had not received a response from the Chancellery of the Prime Minister to its question about the possible archiving of government mailboxes - those of the Prime Minister, ministers or province governors.

According to Marcin Wilkowski, the most rational way to archive the Internet is to secure the content of websites created in the national domain, together with social media resources documenting important events. In the case of Poland, these could be entries concerning World Youth Days or elections. Such material must be collected while the events are taking place.

Asked about the future of research on Internet resources, Wilkowski said that the relevant methods have been under development for years. "However, we cannot do historical research concerning the Internet without programming and digital tools. We are trying to develop these competences at the Digital Humanities Lab of the University of Warsaw, established in 2015" - he said. (PAP)

Author: Szymon Zdziebłowski

szz/ ekr/ zan/ kap/

tr. RL

Copyright © Foundation PAP 2018