An implementation for converting plaintext URLs to links, built on the regular expression brought up in the post The Problem With URLs. This is useful for a custom comment or feedback system, but we’ll forget gopher:// links (sorry).
```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class HTMLTools {
    public static String convertTextUrls(String html) {
        Pattern pattern = Pattern.compile("\\(?\\b(https?|ftp|file)://[-A-Za-z0-9+&@#/%?=~_()|!:,.;]*[-A-Za-z0-9+&@#/%=~_()|]");
        Matcher matcher = pattern.matcher(html);

        while(matcher.find()) {
            String url = matcher.group(0);

            if(url.startsWith("(") && url.endsWith(")"))
                url = url.substring(1, url.length() -1);

            html = html.replace(url, "<a href=\"" + url + "\">" + url + "</a>");
        }

        return html;
    }
}
```
The problem with the regular expression provided in that post is that when it’s mixed with HTML content, matching the pattern
\(?\b(https?|ftp|file)://[-A-Za-z0-9+&@#/%?=~_()|!:,.;]*[-A-Za-z0-9+&@#/%=~_()|]
to
<img src="http://localhost/favicon.gif" />
will turn it into
<img src="<a href="http://localhost/favicon.gif">http://localhost/favicon.gif</a>" />
since " is a non-word character, so \b matches between the quote and the h. I found the conclusion to just not use parentheses in links pretty defeatist, as common usages in MSDN and Wikipedia can be handled with relative ease. You do have to provide additional string processing, which is an absolute headache in Java, but here I go. The 10 lines of the original outline turn into almost 50. (We are going to ignore that user-generated text shouldn’t allow the <img> tag. Sometimes the client wants it and you just roll with it, son.)
```java
import java.util.Vector;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class HTMLTools {
    public static String convertTextUrls(String html) {
        Pattern pattern = Pattern.compile("\\(?\\b(https?|ftp|file)://[-A-Za-z0-9+&@#/%?=~_()|!:,.;]*[-A-Za-z0-9+&@#/%=~_()|]");
        Matcher matcher = pattern.matcher(html);
        Vector<String> urls = new Vector<String>();

        // Get the list of URLs in the block of text.
        while(matcher.find()) {
            String url = matcher.group(0);

            if(url.endsWith(")")) {
                if(url.startsWith("("))
                    url = url.substring(1, url.length() - 1);
                else if(url.indexOf('(') == -1)
                    url = url.substring(0, url.length() - 1);
            }

            // Only add unique URLs so there are no multiple
            // auto-link replacements.
            if(!urls.contains(url))
                urls.add(url);
        }

        // Auto-link only URLs that aren't contained within
        // a HTML tag (i.e., assume ="url" or ='url') or as a
        // child of a link as a more selective take on
        // html.replace(url, "<a href=\"" + url + "\">" + url + "</a>");
        for(int i = 0; i < urls.size(); i++) {
            int from = 0, j;
            String url = urls.get(i);

            while((j = html.indexOf(urls.get(i), from)) >= 0) {
                if((j > 1 && html.substring(j-2, j).matches("=['\"]|\">")) ||
                   (j + url.length() + 1 < html.length() && !html.substring(j + url.length(), j + url.length() + 1).matches("\\s|\\)|<"))) {
                    from = j + url.length();
                }
                else {
                    String replaceWith = "<a href=\"" + url + "\">" + url + "</a>";
                    html = html.substring(0, j) + replaceWith + html.substring(j + url.length());
                    from = j + replaceWith.length();
                }
            }
        }

        return html;
    }
}
```
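The parenthesis handling is the part that covers the MSDN and Wikipedia cases, so here is a minimal sketch of it in isolation. It reuses the pattern and the trimming rules from convertTextUrls above; the class ParenTrimDemo and the helper firstUrl are names made up for illustration and are not part of HTMLTools.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class ParenTrimDemo {
    // Same pattern as in convertTextUrls.
    private static final Pattern URL = Pattern.compile(
        "\\(?\\b(https?|ftp|file)://[-A-Za-z0-9+&@#/%?=~_()|!:,.;]*[-A-Za-z0-9+&@#/%=~_()|]");

    // Hypothetical helper: first URL found in the text after trimming, or null.
    static String firstUrl(String text) {
        Matcher m = URL.matcher(text);
        if (!m.find())
            return null;

        String url = m.group(0);
        if (url.endsWith(")")) {
            if (url.startsWith("("))
                // "(url)": the whole URL was wrapped, drop both parentheses.
                url = url.substring(1, url.length() - 1);
            else if (url.indexOf('(') == -1)
                // "url)" with no "(" inside: the ")" belongs to the sentence.
                url = url.substring(0, url.length() - 1);
            // Otherwise the URL has its own "(", so the ")" is kept.
        }
        return url;
    }

    public static void main(String[] args) {
        // prints http://example.com/page
        System.out.println(firstUrl("(http://example.com/page)"));
        // prints http://example.com/page
        System.out.println(firstUrl("(see http://example.com/page) for details"));
        // prints http://en.wikipedia.org/wiki/Foo_(bar)
        System.out.println(firstUrl("Read http://en.wikipedia.org/wiki/Foo_(bar) first"));
        // prints http://en.wikipedia.org/wiki/Foo_(bar)) with the extra ")"
        System.out.println(firstUrl("(see http://en.wikipedia.org/wiki/Foo_(bar))"));
    }
}
```

The last call is the case these rules don't cover: a Wikipedia-style URL that sits at the end of a parenthetical keeps the sentence's closing parenthesis, because the URL already contains a "(" of its own.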
The PHP take, sticking to the preg*() functions since ereg*() was removed in PHP 7:

```php
<?php
class HTMLTools {
    public static function convertTextUrls($html) {
        $urlPattern = '/\(?\b(https?|ftp|file):\/\/[-A-Za-z0-9+&@#\/%?=~_()|!:,.;]*[-A-Za-z0-9+&@#\/%=~_()|]/';
        $urls = array();

        if(preg_match_all($urlPattern, $html, $matchedUrls)) {
            // Get the list of URLs in the block of text.
            foreach($matchedUrls[0] as $url) {
                if(substr($url, -1) == ')') {
                    if($url[0] == '(')
                        $url = substr($url, 1, strlen($url) - 2);
                    else if(strpos($url, '(') === false)
                        $url = substr($url, 0, strlen($url) - 1);
                }

                // Only add unique URLs so there are no multiple
                // auto-link replacements.
                if(!in_array($url, $urls))
                    array_push($urls, $url);
            }

            // Auto-link only URLs that aren't contained within
            // a HTML tag (i.e., assume ="url" or ='url') or as a
            // child of a link.
            foreach($urls as $url) {
                $from = 0;
                $urlLen = strlen($url);

                while(($i = strpos($html, $url, $from)) !== false) {
                    if(($i > 1 && preg_match('/=[\'"]|">/', substr($html, $i - 2, 2))) ||
                       ($i + $urlLen + 1 < strlen($html) && !preg_match('/[ \n\r\t]|\)|</', $html[$i + $urlLen]))) {
                        $from = $i + $urlLen;
                    }
                    else {
                        $replaceWith = '<a href="' . $url . '">' . $url . '</a>';
                        $html = substr($html, 0, $i) . $replaceWith . substr($html, $i + $urlLen);
                        $from = $i + strlen($replaceWith);
                    }
                }
            }
        }

        return $html;
    }
}
?>
```
This is obviously not the most elegant solution, as the code still makes plenty of assumptions (especially about where a URL sits inside an already existing <a …> tag), and it really just highlights how you can’t handle every situation where a user throws a link into their glob of unformatted text. But they’ll still yell at you when it doesn’t work right.
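To make those assumptions concrete, here is a rough sketch that pulls the skip condition out of the Java version into a helper and runs it on a few inputs. LinkContextDemo, shouldSkip and check are names invented for this example; the two regular expressions and the length check are copied from convertTextUrls.

```java
public class LinkContextDemo {
    // Hypothetical helper mirroring the skip condition in convertTextUrls:
    // returns true when the URL found at index j is left alone.
    static boolean shouldSkip(String html, int j, String url) {
        int end = j + url.length();

        // The two characters before the URL look like =" or =' (an attribute
        // value) or "> (the text of an existing link).
        boolean precededByTag = j > 1
            && html.substring(j - 2, j).matches("=['\"]|\">");

        // The character after the URL is something other than whitespace,
        // ')' or '<'.
        boolean badFollower = end + 1 < html.length()
            && !html.substring(end, end + 1).matches("\\s|\\)|<");

        return precededByTag || badFollower;
    }

    // Convenience wrapper: checks the first occurrence of url in html.
    static boolean check(String html, String url) {
        return shouldSkip(html, html.indexOf(url), url);
    }

    public static void main(String[] args) {
        String url = "http://localhost/favicon.gif";

        // true: attribute value, left alone
        System.out.println(check("<img src=\"" + url + "\" />", url));
        // false: plain text, gets linked
        System.out.println(check("Icon: " + url + " is here", url));
        // false: already inside a link, but the tag ends in '> rather than ">
        System.out.println(check("<a href='/icon'>" + url + "</a>", url));
        // true: followed by a comma, so a plain-text URL is skipped
        System.out.println(check("See " + url + ", then", url));
    }
}
```

The third and fourth calls are the kind of thing users will yell about: a link whose opening tag ends in a single-quoted attribute gets a second link nested inside it, and a URL followed directly by a comma is never linked at all.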

Why on earth would you do this instead of just using the DOM?
…because that’s coming in part 2?
Oh man I can’t wait.
Anticipation 2.0!!
Looks like someone created a better implementation: http://josephscott.org/archives/2008/11/makeitlink-detecting-urls-in-text-and-making-them-links/ (which is probably being used in this very meta comment.)