It’s an experience we have all endured at some point: you visit a website and click a link, but instead of the page you were expecting, this message stares back at you: “Error 404 – Page Not Found.”
The Internet, which turned 40 this year, is now littered with millions of dead links. In fact, one study found that, over a 20-year span, more than 98% of web links suffered from “link rot,” a phenomenon where hyperlinks break over time.
Link rot can wreak havoc online, cause frustration, and deny users access to important information. One study found that about half of the links in the United States Supreme Court opinions it analyzed were rotten. According to Harvard Law School research, a quarter of the deep links in New York Times articles are now dead.
While several things can cause a URL to stop working, a 404 error often appears when a webpage no longer exists. In many cases, however, the error is caused by a website reorganization. In other words, the page still exists, but at a different URL.
Now, a research team led by Harsha Madhyastha, a computer science associate professor at USC, has developed a system that, given a broken link to a page, automatically discovers the page’s new URL. Four years in the making, the system, named FABLE (for Finding Aliases for Broken Links Efficiently), is outlined in a paper presented at the ACM Internet Measurement Conference this week.
The paper’s lead author is Jingyuan Zhu, a doctoral student at the University of Michigan who is advised by Madhyastha. The work is supported by a grant from the Alfred P. Sloan Foundation.
Finding patterns
When a webpage moves during a website reorganization, it is almost always done “programmatically,” or using software, said Madhyastha. As a result, the new URLs tend to follow a predictable pattern.
For instance, a tutorial page that was previously at http://ruby.railstutorials.org/chapters/following-users
is now at https://www.railstutorial.org/book/following_users
Following the same pattern, the page that was previously at http://ruby.railstutorials.org/chapters/static-pages
is now at https://www.railstutorial.org/book/static_pages
The researchers capitalized on this phenomenon to create FABLE.
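To make the idea of a predictable URL pattern concrete, here is a minimal Python sketch that learns a rewrite rule from one known old/new URL pair and applies it to another broken link from the same site. It is not FABLE's actual algorithm, which this article does not describe in detail. The function names learn_rule and apply_rule are hypothetical, and the sketch assumes that only the host, the directory, and individual characters in the last path segment change.

```python
from urllib.parse import urlsplit, urlunsplit


def learn_rule(old_url, new_url):
    """Derive a naive rewrite rule from one known (old, new) URL pair.

    Illustrative assumption: the last path segment keeps its length and
    differs only by character substitutions (for example '-' to '_').
    """
    old, new = urlsplit(old_url), urlsplit(new_url)
    old_dir, _, old_slug = old.path.rpartition("/")
    new_dir, _, new_slug = new.path.rpartition("/")
    if len(old_slug) != len(new_slug):
        raise ValueError("this sketch only handles same-length slugs")
    # Record which characters were replaced in the last path segment.
    char_map = {a: b for a, b in zip(old_slug, new_slug) if a != b}
    return {
        "old_host": old.netloc,
        "old_dir": old_dir,
        "scheme": new.scheme,
        "host": new.netloc,
        "dir": new_dir,
        "char_map": char_map,
    }


def apply_rule(rule, broken_url):
    """Guess the new URL for a broken link, or return None if the rule does not fit."""
    parts = urlsplit(broken_url)
    directory, _, slug = parts.path.rpartition("/")
    if parts.netloc != rule["old_host"] or directory != rule["old_dir"]:
        return None
    new_slug = slug.translate(str.maketrans(rule["char_map"]))
    new_path = f'{rule["dir"]}/{new_slug}'
    return urlunsplit((rule["scheme"], rule["host"], new_path, "", ""))


# Learn from the first example in the article, then apply it to the second.
rule = learn_rule(
    "http://ruby.railstutorials.org/chapters/following-users",
    "https://www.railstutorial.org/book/following_users",
)
print(apply_rule(rule, "http://ruby.railstutorials.org/chapters/static-pages"))
# prints https://www.railstutorial.org/book/static_pages
```

A candidate URL produced this way is only a guess. A real system would still need to confirm that the page at the guessed address exists and matches the original content.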
“When you reorganize a website, many pages are moved, not just one, and there is a pattern in how the old URLs are transformed to the new URLs,” said Madhyastha.
“A large number of broken links on the web can be fixed, because the pages at those links still exist at new URLs. We’re basically trying to reverse engineer the patterns underlying these changes in URLs.”
In this study, titled “Reviving Dead Links on the Web with FABLE,” the team crawled close to half a million pages from Wikipedia, Medium and StackOverflow to identify broken links. They ran FABLE on 20,000 broken links and found new URLs in about 25% of cases. They estimate that about 90% of these new URLs point to the correct webpage.
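As a rough back-of-the-envelope reading of those figures, the short calculation below turns the approximate percentages into counts. The inputs are the rounded numbers quoted above, so the results are estimates rather than values reported in the paper.

```python
# Approximate figures quoted in the article (rounded, not exact paper results).
broken_links = 20_000
found_pct = 25      # share of broken links for which a new URL was found
correct_pct = 90    # estimated share of those new URLs that are correct

found = broken_links * found_pct // 100
correct = found * correct_pct // 100

print(found)                             # prints 5000
print(correct)                           # prints 4500
print(f"{correct / broken_links:.1%}")   # prints 22.5%
```

Under these rounded figures, roughly 4,500 of the 20,000 tested links, or a little over one in five, would end up pointing to the correct page again.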
“Apart from the methods for accurately identifying URL transformation patterns and efficiently exploiting them, a lot of heavy engineering went into making this work,” said Madhyastha. “Nobody has built a similar system at this scale—and if it doesn’t work for millions of broken links, it’s simply not going to be useful.”
The researchers are currently working with Wikipedia to discover new URLs for thousands of broken external links on the site. Within a few months, they aim to offer the system as a browser extension, test its functionality on additional websites, and expand to other languages beyond English.
Published on October 25th, 2023
Last updated on May 16th, 2024