The Internet Archive aims to preserve 100 terabytes of government website data… just in case

Internet Archive / WayBack Machine

Image Credit: Paul Sawers / VentureBeat

A little over a month since Donald Trump became President-Elect of the United States of America, there is an almost tangible tension in the air about what life will be like under a Trump regime.

In some cases, there’s blind panic.

News emerged yesterday that environmentalists, academics, and climate scientists were frantically working to preserve U.S. government climate data in a Canadian archive. The reason? Well, let’s just say that Trump, and some of his team, have a track record of disputing climate change.

https://twitter.com/realdonaldtrump/status/265895292191248385?lang=en-gb

But while some are concerned about the preservation of data under Trump’s stewardship, there has been a campaign over the past decade that has sought to preserve the government’s website data as a president’s tenure nears its conclusion.

Introduced initially as George W. Bush’s time in office was coming to an end in 2008, the End of Term Web Archive is a collaboration between the Internet Archive, the Library of Congress, University of North Texas, George Washington University, Stanford University, and California Digital Library, among other libraries, and is designed to serve as a permanent record of government-related communications during presidential transitions.

By way of example, an estimated 83 percent of PDF documents on .gov domains vanished during President Obama’s first term in the White House.

Above: Missing PDFs: 2008 – 2012

The disappearance of online content from government websites isn’t always sinister — departments may merge, or projects and programs may simply become obsolete. It’s difficult to maintain millions of web pages. But with the impending Trump presidency, the four-yearly end-of-term archiving ritual perhaps takes on a greater degree of urgency, and this Saturday a team of volunteers will kickstart a guerilla archiving project to “save environmental data from Trump.”

The broader End of Term Web Archive initiative expands far beyond scientific climate change data, and will look to preserve key documents and data from across the .gov and .mil domains, alongside “federal websites on other domains and official government social media accounts,” explained Jefferson Bailey, director of web archiving programs at the Internet Archive, in a statement.

The Internet Archive, for the uninitiated, has been documenting the web’s evolution for two decades, letting anyone revisit the Apple homepage in 1998 or VentureBeat in 2006 by plugging their desired URL into the Wayback Machine. Preserving government websites is part of that work, and the not-for-profit runs other programs designed to record government-related content. Since its inception, the Internet Archive says it has preserved more than 3.5 billion .gov webpages, including in excess of 45 million PDFs.
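For readers who want to try this from a script rather than the search box, here is a minimal sketch of how a Wayback Machine address for a given page and date can be put together. It assumes the public `https://web.archive.org/web/<timestamp>/<url>` address pattern, where a partial timestamp such as `19980101` leads to the capture nearest that date. The function name `wayback_url` and its parameters are hypothetical, chosen for illustration only. The code builds the address and makes no network request.

```python
# Illustrative sketch only: builds a Wayback Machine address for a page and date.
# Assumes the public web.archive.org/web/<timestamp>/<url> address pattern.
# wayback_url is a hypothetical helper name, not part of any official library.

WAYBACK_PREFIX = "https://web.archive.org/web"


def wayback_url(page_url: str, year: int, month: int = 1, day: int = 1) -> str:
    """Return a Wayback Machine address for page_url near the given date."""
    if not page_url:
        raise ValueError("page_url must not be empty")
    # Timestamp in YYYYMMDD form; the Wayback Machine serves the capture
    # closest to this date when no exact match exists.
    timestamp = f"{year:04d}{month:02d}{day:02d}"
    return f"{WAYBACK_PREFIX}/{timestamp}/{page_url}"


if __name__ == "__main__":
    # The two examples mentioned in the article.
    print(wayback_url("apple.com", 1998))
    print(wayback_url("venturebeat.com", 2006))
```

Opening either printed address in a browser should land on the archived copy closest to the requested date, if one exists.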

But the End of Term Web Archive is expansive and specific in its purpose. In total, the Internet Archive and its partners will collect webpages from more than 6,000 domains, 200,000 hosts, and 10,000 federal social media accounts, amounting to “hundreds of millions of individual government webpages,” with the accumulated data expected to hit more than 100 terabytes.

“No single government entity is responsible for archiving the entire federal government’s web presence,” added Bailey. “Web data is already highly ephemeral and websites without a mandated custodian are even more imperiled. These sites include significant amounts of publicly funded federal research, data, projects, and reporting that may only exist or be published on the web. This is tremendously important historical information. It also creates an amazing opportunity for libraries and archives to join forces and resources and collaborate to archive and provide permanent access to this material.”

The Internet Archive’s political role has grown in recent times. It launched the Political TV Ad Archive in January to help journalists fact-check claims made during political campaigning. And last month, it revealed it was seeking to build a backup of its gargantuan database in Canada, to prepare for a web “that may face greater restrictions,” it said.

The End of Term Web Archive wasn’t built with Donald Trump in mind, but its existence will provide some comfort for those concerned about how data, scientific or otherwise, will be managed under a Trump presidency.
