Auto-linking Text URLs to HTML

Posted in Computing on Wednesday, November 19th, 2008 by Derek

Here is an implementation for converting plaintext URLs into links, a problem brought up in the post The Problem With URLs. This is useful for a custom comment or feedback system, though we’ll ignore gopher:// links (sorry).

import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class HTMLTools {
	public static String convertTextUrls(String html) {
		Pattern pattern = Pattern.compile("\\(?\\b(https?|ftp|file)://[-A-Za-z0-9+&@#/%?=~_()|!:,.;]*[-A-Za-z0-9+&@#/%=~_()|]");
		// The matcher keeps scanning the original string, even though
		// html is reassigned inside the loop.
		Matcher matcher = pattern.matcher(html);

		while(matcher.find()) {
			String url = matcher.group(0);
			// Unwrap a URL that was written inside parentheses.
			if(url.startsWith("(") && url.endsWith(")"))
				url = url.substring(1, url.length() - 1);
			// Naive: replaces every occurrence, even inside HTML tags.
			html = html.replace(url, "<a href=\"" + url + "\">" + url + "</a>");
		}

		return html;
	}
}

The problem with the regular expression provided in that post is that when it’s mixed with HTML content, matching the pattern

\(?\b(https?|ftp|file)://[-A-Za-z0-9+&@#/%?=~_()|!:,.;]*[-A-Za-z0-9+&@#/%=~_()|]

to

<img src="http://localhost/favicon.gif" />

will turn it into

<img src="<a href="http://localhost/favicon.gif">http://localhost/favicon.gif</a>" />

since " is a non-word character, so \b matches right before the URL. I found that post’s conclusion, to just not use parentheses in links, pretty defeatist, as common usages in MSDN and Wikipedia can be handled with relative ease. You must provide additional string processing, which is an absolute headache in Java, but here I go. 10 lines for the original outline turn into almost 50. (We are going to ignore that user-generated text shouldn’t allow the <img> tag. Sometimes the client wants it and you just roll with it, son.)
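To see the problem for yourself, here is a minimal sketch that runs the same pattern against the <img> tag and then does the same blind replacement as the short version above. The class name NaiveLinkDemo and its two helper methods are made up for this illustration; only the regular expression comes from the post.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class, not part of the original HTMLTools.
class NaiveLinkDemo {
    static final Pattern URL = Pattern.compile(
        "\\(?\\b(https?|ftp|file)://[-A-Za-z0-9+&@#/%?=~_()|!:,.;]*[-A-Za-z0-9+&@#/%=~_()|]");

    // Returns the first URL the pattern finds, or null if there is none.
    static String firstMatch(String text) {
        Matcher m = URL.matcher(text);
        return m.find() ? m.group(0) : null;
    }

    // Blindly wraps every occurrence of the first matched URL in a link.
    static String naiveAutoLink(String html) {
        String url = firstMatch(html);
        if (url == null) {
            return html;
        }
        return html.replace(url, "<a href=\"" + url + "\">" + url + "</a>");
    }

    public static void main(String[] args) {
        String img = "<img src=\"http://localhost/favicon.gif\" />";
        // The quote is not in the character class, so the match is the bare URL:
        // http://localhost/favicon.gif
        System.out.println(firstMatch(img));
        // The attribute value gets wrapped, which breaks the tag:
        // <img src="<a href="http://localhost/favicon.gif">http://localhost/favicon.gif</a>" />
        System.out.println(naiveAutoLink(img));
    }
}
```

The regular expression is doing its job here. What is missing is any check of what surrounds the match, which is what the extra string processing has to supply.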

import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class HTMLTools {
	public static String convertTextUrls(String html) {
		Pattern pattern = Pattern.compile("\\(?\\b(https?|ftp|file)://[-A-Za-z0-9+&@#/%?=~_()|!:,.;]*[-A-Za-z0-9+&@#/%=~_()|]");
		Matcher matcher = pattern.matcher(html);
		List<String> urls = new ArrayList<String>();

		// Get the list of URLs in the block of text.
		while(matcher.find()) {
			String url = matcher.group(0);
			if(url.endsWith(")")) {
				if(url.startsWith("(")) {
					// "(url)": drop the wrapping parentheses.
					url = url.substring(1, url.length() - 1);
				} else if(url.indexOf('(') == -1) {
					// "url)": drop the stray closing parenthesis.
					url = url.substring(0, url.length() - 1);
				}
			}

			// Only add unique URLs so there are no multiple
			// auto-link replacements.
			if(!urls.contains(url))
				urls.add(url);
		}

		// Auto-link only URLs that aren't contained within
		// an HTML tag (i.e., assume ="url" or ='url') or as a
		// child of a link, as a more selective take on
		// html.replace(url, "<a href=\"" + url + "\">" + url + "</a>");
		for(String url : urls) {
			int from = 0, j;

			while((j = html.indexOf(url, from)) >= 0) {
				int end = j + url.length();

				// Skip when preceded by =", =' or ">, or when followed by
				// anything other than whitespace, ")" or "<".
				if((j > 1 && html.substring(j - 2, j).matches("=['\"]|\">")) ||
				   (end + 1 < html.length() && !html.substring(end, end + 1).matches("\\s|\\)|<"))) {
					from = end;
				} else {
					String replaceWith = "<a href=\"" + url + "\">" + url + "</a>";
					html = html.substring(0, j) + replaceWith + html.substring(end);
					from = j + replaceWith.length();
				}
			}
		}

		return html;
	}
}
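The parenthesis handling in the first loop is the part that copes with MSDN and Wikipedia style links, and it can be tried in isolation. Below is a rough sketch that copies just that rule into a helper. The class name ParenTrim and the example.org URLs are made up for illustration.

```java
// Hypothetical helper class; the name ParenTrim is an assumption.
class ParenTrim {
    // Mirrors the trimming rule from the first loop of convertTextUrls.
    static String trim(String match) {
        if (match.endsWith(")")) {
            if (match.startsWith("(")) {
                // "(url)": both parentheses belong to the surrounding text.
                return match.substring(1, match.length() - 1);
            } else if (match.indexOf('(') == -1) {
                // "url)": a lone closing parenthesis is not part of the URL.
                return match.substring(0, match.length() - 1);
            }
        }
        // Anything else, including a URL with its own "(...)", is kept as-is.
        return match;
    }

    public static void main(String[] args) {
        // Prints http://example.org/page
        System.out.println(trim("(http://example.org/page)"));
        // Prints http://example.org/page
        System.out.println(trim("http://example.org/page)"));
        // Prints http://example.org/wiki/Foo_(bar)
        System.out.println(trim("http://example.org/wiki/Foo_(bar)"));
    }
}
```

The rule only asks whether a ( appears anywhere in the match, so a match like http://example.org/wiki/Foo_(bar)) from text such as "(see http://example.org/wiki/Foo_(bar))" keeps its extra closing parenthesis.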

The PHP take, using the preg_*() functions throughout:

<?php
class HTMLTools {
	public static function convertTextUrls($html) {
		$urlPattern = '/\(?\b(https?|ftp|file):\/\/[-A-Za-z0-9+&@#\/%?=~_()|!:,.;]*[-A-Za-z0-9+&@#\/%=~_()|]/';
		$urls = array();

		if(preg_match_all($urlPattern, $html, $matchedUrls)) {
			// Get the list of URLs in the block of text.
			foreach($matchedUrls[0] as $url) {
				if(substr($url, -1) == ')') {
					if($url[0] == '(')
						$url = substr($url, 1, strlen($url) - 2);
					else if(strpos($url, '(') === false)
						$url = substr($url, 0, strlen($url) - 1);
				}

				// Only add unique URLs so there are no multiple
				// auto-link replacements.
				if(!in_array($url, $urls))
					$urls[] = $url;
			}

			// Auto-link only URLs that aren't contained within
			// an HTML tag (i.e., assume ="url" or ='url') or as a
			// child of a link.
			foreach($urls as $url) {
				$from = 0;
				$urlLen = strlen($url);

				while(($i = strpos($html, $url, $from)) !== false) {
					// Skip when preceded by =", =' or ">, or when followed by
					// anything other than whitespace, ")" or "<".
					if(($i > 1 && preg_match('/=[\'"]|">/', substr($html, $i - 2, 2))) ||
					   ($i + $urlLen + 1 < strlen($html) && !preg_match('/[ \n\r\t)<]/', $html[$i + $urlLen]))) {
						$from = $i + $urlLen;
					} else {
						$replaceWith = '<a href="' . $url . '">' . $url . '</a>';
						$html = substr($html, 0, $i) . $replaceWith . substr($html, $i + $urlLen);
						$from = $i + strlen($replaceWith);
					}
				}
			}
		}

		return $html;
	}
}
?>

This is obviously not the most elegant solution: the code still makes plenty of assumptions (especially about where a URL sits inside an already existing <a …> tag), and it really just highlights that you can’t handle every situation where a user throws a link into their glob of unformatted text. But they’ll still yell at you when it doesn’t work right.
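As a rough illustration of that <a …> assumption, the sketch below pulls the "what comes right before the URL" test out of the Java version and tries it on three snippets. The class LinkContext, the method name and the example.com URL are hypothetical; the two-character check itself is the one used in convertTextUrls.

```java
// Hypothetical helper class; LinkContext is an illustrative name.
class LinkContext {
    // True when the two characters before index j are =" or =' (an
    // attribute value) or "> (the end of an opening tag). This is the
    // first half of the skip condition in convertTextUrls.
    static boolean looksAlreadyLinked(String html, int j) {
        return j > 1 && html.substring(j - 2, j).matches("=['\"]|\">");
    }

    public static void main(String[] args) {
        String url = "http://example.com/";
        String attr = "<img src=\"" + url + "\" />";
        String child = "<a href=\"" + url + "\">" + url + "</a>";
        String buried = "<a href=\"" + url + "\">Visit " + url + "</a>";

        // URL as an attribute value: prints true
        System.out.println(looksAlreadyLinked(attr, attr.indexOf(url)));
        // URL as the first thing inside a link: prints true
        System.out.println(looksAlreadyLinked(child, child.lastIndexOf(url)));
        // URL later in the link text: prints false
        System.out.println(looksAlreadyLinked(buried, buried.lastIndexOf(url)));
    }
}
```

In the last case the URL is followed by <, so the full method would go ahead and wrap it, leaving a link nested inside a link. Telling those cases apart reliably needs knowledge of the tag structure rather than the two characters before the match.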

  • Now Playing: The Constantines - Tournament of Hearts - 02 - Hotline Operator

5 Comments

  1. fraggle on November 19th, 2008

    Why on earth would you do this instead of just using the DOM?

  2. Afterglow on November 19th, 2008

    …because that’s coming in part 2?

  3. Jon on November 19th, 2008

    Oh man I can’t wait.

  4. Afterglow on November 19th, 2008

    Anticipation 2.0!!

  5. Afterglow on November 30th, 2008

    Looks like someone created a better implementation: http://josephscott.org/archives/2008/11/makeitlink-detecting-urls-in-text-and-making-them-links/ (which is probably being used in this very meta comment.)



Derek MacDonald



© Copyright Derek MacDonald. All rights reserved.