Trutz Fries

Automating 404 / 301 Redirects with URL Similarity Matching using Python

02/25/2024 • by Trutz Fries

Dealing with 404 errors is a common challenge for website owners and developers. Not only do these errors affect user experience, but they can also impact your site's SEO performance. In this blog post, we'll explore a sophisticated yet straightforward approach to redirecting 404 URLs to valid ones by leveraging URL similarity matching. We'll then discuss how to implement these redirects on Netlify, a popular web hosting and automation platform.

Use Case

Imagine you've recently restructured your website, resulting in changed or removed URLs. This situation often leads to visitors encountering 404 errors when accessing old links from bookmarks, search engines, or other sites. To solve this, we can redirect these 404 URLs to the most similar valid URLs, improving both user experience and SEO.

Preparing Your Files

Before diving into the script, you need two text files:

  • 404 URLs File: A list of URLs that currently return a 404 error. You can obtain this list from your website analytics tool, your server logs, or Google Search Console.
  • Valid URLs File: A list of valid URLs on your website. This list can come from your sitemap or a current crawl of your site.

Ensure each URL is on a new line within its respective file.
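If the valid URLs come from a sitemap, the list can be extracted with a few lines of Python. The following is a minimal sketch that assumes a standard sitemap using the sitemaps.org namespace; the function name urls_from_sitemap and the sample sitemap are illustrative and not part of the original script.

```python
import xml.etree.ElementTree as ET
from typing import List

# Namespace used by standard XML sitemaps (assumed; adjust if yours differs)
SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}

def urls_from_sitemap(xml_text: str) -> List[str]:
    """Return the content of every <loc> element in a sitemap."""
    root = ET.fromstring(xml_text)
    return [loc.text.strip() for loc in root.findall(".//sm:loc", SITEMAP_NS) if loc.text]

# Hypothetical sitemap content, for illustration only
sample_sitemap = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://example.com/blog/python-redirects/</loc></url>
  <url><loc>https://example.com/pricing/</loc></url>
</urlset>"""

valid_urls = urls_from_sitemap(sample_sitemap)
print("\n".join(valid_urls))  # one URL per line, ready to save as valid-urls.txt
```

In practice you would read the XML from your downloaded sitemap file instead of the sample string and save the printed lines as your valid URLs file.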

Running the Script

This Python script reads both files, matches each 404 URL to the closest valid URL using URL similarity, and outputs the pairs to a "redirects.txt" file. To run this script, follow these steps:

Environment Setup

Ensure you have Python installed on your computer. You will also need the PolyFuzz library, which you can install via pip:

pip install polyfuzz

# If you run this in a jupyter notebook
#!pip install polyfuzz

The Script

Copy the Python script below into a new Jupyter notebook cell (e.g. in Google Colab) or into a Python script file (.py). Make sure to adjust the file paths to point to your 404 URLs and valid URLs files.

from polyfuzz import PolyFuzz
from typing import List

# Function to read URLs from a file, skipping blank lines
def read_urls_from_file(file_path: str) -> List[str]:
    with open(file_path, 'r') as file:
        urls = [line.strip() for line in file if line.strip()]
    return urls

# Function to write the redirect pairs to a file
def write_redirects_to_file(redirects: List[tuple], file_path: str):
    with open(file_path, 'a') as file:  # 'a' to append to the file if it already exists
        for source, target in redirects:
            file.write(f"{source} {target}\n")

# Read the URLs
urls_404 = read_urls_from_file("404-urls.txt") # Get it from the search console
valid_urls = read_urls_from_file("valid-urls.txt") # Get it from your sitemap

# Initialize PolyFuzz model
model = PolyFuzz("TF-IDF")
model.match(urls_404, valid_urls)

# Get the best matches
matches = model.get_matches()

# Prepare the redirects, skipping 404 URLs for which no match was found
redirects = [(row['From'], row['To']) for _, row in matches.iterrows() if row['To']]

# Write redirects to file
write_redirects_to_file(redirects, "redirects.txt")

Execution

Run the script in your Jupyter notebook or Python environment. Upon completion, you'll find a new file named "redirects.txt" in your working directory.

Important note

Remember, while automation can handle a significant portion of the work, it's always a good idea to manually review the redirects to ensure accuracy and relevance.
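One way to focus the manual review is to separate confident matches from doubtful ones. The DataFrame returned by get_matches() contains a Similarity column next to From and To, so each match can be treated as a (source, target, similarity) triple. The sketch below assumes that shape; the function name split_by_confidence, the threshold of 0.8, and the sample values are illustrative choices, not recommendations from the original script.

```python
from typing import Iterable, List, Optional, Tuple

Match = Tuple[str, Optional[str], float]

def split_by_confidence(matches: Iterable[Match], threshold: float = 0.8
                        ) -> Tuple[List[Match], List[Match]]:
    """Split matches into those at or above the threshold and those needing review."""
    accepted: List[Match] = []
    needs_review: List[Match] = []
    for source, target, similarity in matches:
        # A missing target or a low score both call for a human decision
        if isinstance(target, str) and similarity >= threshold:
            accepted.append((source, target, similarity))
        else:
            needs_review.append((source, target, similarity))
    return accepted, needs_review

# Hypothetical matches, for illustration only
sample_matches = [
    ("/blog/python-redirect", "/blog/python-redirects/", 0.93),
    ("/old-shop", "/pricing/", 0.21),
    ("/xyz", None, 0.0),
]

accepted, needs_review = split_by_confidence(sample_matches)
print(len(accepted), "accepted,", len(needs_review), "to review")
```

With the script above, the triples could be produced with matches.itertuples(index=False, name=None), assuming the DataFrame holds exactly the columns From, To, and Similarity in that order. The right threshold depends on your URL structure, so treat 0.8 as a starting point.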

Script Key Functionality

The script's core functionality revolves around reading the input files, using PolyFuzz for URL similarity matching, and then writing the matched URL pairs to an output file. This process automates the identification of the most appropriate redirect targets for URLs that would otherwise lead to a 404 error, significantly reducing manual effort and potential errors.
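To get an intuition for what "URL similarity" means here, it helps to look at a stripped-down version of the idea: split each URL into overlapping character trigrams and compare the resulting count vectors with cosine similarity. The sketch below is a simplified illustration only. It omits the inverse document frequency weighting that TF-IDF applies and is not PolyFuzz's actual implementation; all function names are hypothetical.

```python
from collections import Counter
from math import sqrt
from typing import List

def char_ngrams(text: str, n: int = 3) -> Counter:
    """Count the overlapping character n-grams in a string."""
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))

def cosine_similarity(a: str, b: str) -> float:
    """Cosine similarity between the n-gram count vectors of two strings."""
    va, vb = char_ngrams(a), char_ngrams(b)
    dot = sum(count * vb[gram] for gram, count in va.items())
    norm = sqrt(sum(c * c for c in va.values())) * sqrt(sum(c * c for c in vb.values()))
    return dot / norm if norm else 0.0

def best_match(url: str, candidates: List[str]) -> str:
    """Return the candidate whose n-grams overlap most with the given URL."""
    return max(candidates, key=lambda candidate: cosine_similarity(url, candidate))

# Hypothetical paths, for illustration only
candidates = ["/blog/python-redirects/", "/pricing/", "/about-us/"]
print(best_match("/blog/python-redirect", candidates))
```

Because the comparison works on character fragments rather than whole words, small changes such as a trailing slash or a pluralized slug still produce a high score.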

Implementing Redirects on Netlify

With your "redirects.txt" file ready, the next step is to implement these redirects on your Netlify-hosted site. Netlify supports redirect rules defined in a "_redirects" file placed in your site's publish directory (for example, the project root or a "public" folder). Here's how to use your generated file:

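  1. Format Adjustment: Ensure your "redirects.txt" follows Netlify's redirect rules format, which typically looks like /old-path /new-path 301. The script writes the source and the target separated by a space, but no status code, so append 301 to each line if you want the status stated explicitly. If your input files contain absolute URLs, you may also need to reduce them to paths on your own domain.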
  2. Rename and Upload: Rename "redirects.txt" to "_redirects" and upload it to the root of your site directory in your repository or via Netlify's UI in the site settings.
  3. Deploy: Trigger a new deploy on Netlify for your changes to take effect. Netlify will automatically apply the redirects defined in your "_redirects" file.
  4. Monitor: Wait a couple of days, then check the results of your redirects in Google Search Console.
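The format adjustment from step 1 can also be scripted. The following sketch assumes that both the 404 URLs and the valid URLs belong to the same site, so each absolute URL can be reduced to its path, and it appends an explicit 301 status. The helper names to_path, netlify_rule, and build_rules are hypothetical and not part of the original script.

```python
from typing import Iterable, List, Tuple
from urllib.parse import urlparse

def to_path(url: str) -> str:
    """Reduce an absolute URL to its path; plain paths pass through unchanged."""
    return urlparse(url).path or "/"

def netlify_rule(source: str, target: str, status: int = 301) -> str:
    """Format one redirect as '/old-path /new-path 301'."""
    return f"{to_path(source)} {to_path(target)} {status}"

def build_rules(pairs: Iterable[Tuple[str, str]]) -> List[str]:
    """Build rules, skipping pairs that would redirect a path to itself."""
    return [netlify_rule(s, t) for s, t in pairs if to_path(s) != to_path(t)]

# Hypothetical redirect pairs, for illustration only
pairs = [
    ("https://example.com/old-page", "https://example.com/new-page/"),
    ("/already/a/path", "/target/"),
    ("https://example.com/same/", "https://example.com/same/"),
]

for rule in build_rules(pairs):
    print(rule)
```

Query strings are dropped in this sketch, so URLs that differ only in their parameters would need separate handling.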

Other static hosters like Vercel offer similar redirect options.

Conclusion

Automating the process of redirecting 404 URLs to valid ones based on similarity is a powerful technique to enhance your website's user experience and maintain SEO rankings. By following the steps outlined in this post, you can efficiently manage URL redirects, especially after major site updates or migrations. Implementing these redirects on Netlify further simplifies the process, allowing you to focus on creating great content and providing value to your visitors.
