Excel - Extract domain names out of URLs

Asked By MikeB
21-Apr-08 01:44 AM
Hi, I have a list of URLs and I would like to find a way to extract
the domain name from the URL.

The URLs can be prefixed with www in some cases and I'd also like to
strip that off.

So, for the following URLs
http://www.wiseclerk.com/group-news/tag/libor
http://www.wiseclerk.com/group-news/2008/04/
http://www.wiseclerk.com/group-news/
http://www.tiscali.co.uk/search/results.php?section=&from=&query=setana
sports
http://www.techdirt.com/articles/20080408/223932792.shtml
http://www.techdirt.com/articles/20080408/223932792.shtml
http://www.techdirt.com/
http://www.techdirt.com/
http://techdirt.com/index.php
http://techdirt.com/blog.php?tag=loans



I'd like to extract

wiseclerk.com
tiscali.co.uk
techdirt.com

I'm struggling to find the correct algorithm to locate the starting
point (after the http:// or after the http://www.) and the ending
point (the first /) for my Mid function.

Any help appreciated.

I'm thinking that I might have to do a find/search function for either
of the two starting strings and then have the "max" value (for the
longer of the two strings) but that's a lot of extra columns in my
spreadsheet.

Is there perhaps a better way to do this?
HodgeTAKETHISOUT
(1)
Excel
(1)
SUBSTITUTE
(1)
Worksheet
(1)
ISERR
(1)
FIND
(1)
UNIX
(1)
CHAR
(1)
  Rick Rothstein \(MVP - VB\) replied...
20-Apr-08 01:52 PM
I think this formula does what you want and handles all the various
possibilities...

=IF(LEFT(LEFT(SUBSTITUTE(A1,"http://",""),FIND("/",SUBSTITUTE(A1,"http://","")&"/")-1),4)="www.",MID(LEFT(SUBSTITUTE(A1,"http://",""),FIND("/",SUBSTITUTE(A1,"http://","")&"/")-1),5,256),LEFT(SUBSTITUTE(A1,"http://",""),FIND("/",SUBSTITUTE(A1,"http://","")&"/")-1))

Rick
  Nick Hodge replied...
20-Apr-08 02:07 PM
Mike

Try something like

=IF(ISERR(FIND("www",A1,1)),MID(A1,8,FIND("/",A1,8)-8),MID(A1,12,FIND("/",A1,12)-12))

this works on the theory that there is a trailing / after the domain
somewhere, it will need amending if there is not, e.g
http://excelusergroup.org

--
HTH
Nick Hodge
Microsoft MVP - Excel
Southampton, England
nick_hodgeTAKETHISOUT@zen.co.ukANDTHIS
web: www.excelusergroup.org
web: www.nickhodge.co.uk
  MikeB replied...
21-Apr-08 01:44 AM
Hey Nick! Some years ago I had the pleasure of living in Romsey near
Southampton. I just saw your signature and that brought back fond
memories.
  Rick Rothstein \(MVP - VB\) replied...
20-Apr-08 02:17 PM
That does not seem to work for URLs like these...

www.techdirt.com/articles/20080408/223932792.shtml

http://www.com/

Rick
  Howard Kaikow replied...
20-Apr-08 03:48 PM
=IF(LEFT(LEFT(SUBSTITUTE(A1,"http://",""),FIND("/",SUBSTITUTE(A1,"http://","
E(A1,"http://","")&"/")-1),5,256),LEFT(SUBSTITUTE(A1,"http://",""),FIND("/",
SUBSTITUTE(A1,"http://","")&"/")-1))

That does not work for URLs that start with, say, ftp.

The type of the URL is relevant, e.g., what about ftp.xyz.com and
www.xyz.com, both  should result in same doughmain.

In this case, a worksheet function may be better..
  Rick Rothstein \(MVP - VB\) replied...
21-Apr-08 11:49 AM
Well, I could modify my formula to account for the FTP protocol (and, by
changing the approach, maybe so others even), but there seems to be so many
protocols possible... even if you wrote a function to handle them, which
protocols would you select to handle?

Rick
  Howard Kaikow replied...
22-Apr-08 04:50 PM
many

There is no simple answer.
It's up to the programmer.

A function may be easier because there are likely Windows APIs for parsing
URLs.
  Rick Rothstein \(MVP - VB\) replied...
22-Apr-08 05:43 PM
Fair enough. Here is a possible Function solution to this question which
makes use of the UrlGetPart API to get to the "company name". For those
following this thread, add a Module to the project (Insert/Module from the
VBA editor) and copy/paste the code after my signature into its code window.
Inside the GetCompanyName function, I have seeded the Protocols constant
with "www" and "ftp"... you can add any other protocols you think necessary
in this constant, just make sure each protocol is separated by a dot. You
would use this function just like any other worksheet function. So, for
example, if A1 contains the URL, then put this formula in the cell you want
to extract the company name to...

=GetCompanyName(A1)

It is that simple (assuming I didn't screw up my code that is<g>).

Rick

Private Declare Function UrlGetPart Lib "shlwapi" _
Alias "UrlGetPartA" _
(ByVal pszIn As String, _
ByVal pszOut As String, _
pcchOut As Long, _
ByVal dwPart As Long, _
ByVal Flags As Long) As Long

Private Const MAX_PATH As Long = 260
Private Const URL_PART_HOSTNAME As Long = 2

Public Function GetCompanyName(URL As String) As String
Dim dwPart As Long
Dim Flags As Long
Dim Part As String
Dim Size As Long
Dim Host As String
'  Protocols is a **dot** delimited string
Const Protocols As String = "www.ftp"
If Len(URL) > 0 Then
If InStr(URL, "//") = 0 Then URL = "http://" & URL
Part = Space$(MAX_PATH)
Size = Len(Part)
If UrlGetPart(URL, Part, Size, URL_PART_HOSTNAME, 0&) = 0 Then
GetCompanyName = Left$(Part, Size)
End If
Host = Left(GetCompanyName, InStr(GetCompanyName, "."))
If InStr("." & Protocols & ".", "." & Host) Then
GetCompanyName = Mid(GetCompanyName, Len(Host) + 1)
End If
End If
End Function
  Ron Rosenfeld replied...
22-Apr-08 10:08 PM
Since you indicated that these are URL's, I will assume they have a normal
structure, so the domain name will be the part between the first // and the
next /.  Then we just remove the www. if it exists.


=SUBSTITUTE(MID(A1,FIND("/",A1)+2,FIND(CHAR(1),
SUBSTITUTE(A1,"/",CHAR(1),3))-FIND("/",A1)-2),"www.","")

--ron
  Ron Rosenfeld replied...
22-Apr-08 10:15 PM
Missed something -- the case where there is no final "/".

So use this instead:

=SUBSTITUTE(MID(A1,FIND("/",A1)+2,IF((LEN(A1)-LEN(SUBSTITUTE(A1,"/","")))=2,
255,FIND(CHAR(1),SUBSTITUTE(A1,"/",CHAR(1),3))-FIND("/",A1)-2)),"www.","")

--ron
  Howard Kaikow replied...
23-Apr-08 02:25 AM
window.
necessary
want

It is almost always better to use the Unicode version of APIs.
  Rick Rothstein \(MVP - VB\) replied...
23-Apr-08 04:24 AM
You are probably right; but I have zero experience with international
programming issues (all my programming efforts for the past 27 years have
been US based), whether involving fonts or not. The end result is I have no
idea if there is anything special  that needs to be accounted for or not
when dealing with them. My gut feeling is simply using UrlGetPartW in place
of UrlGetPart is not the whole answer.

Rick
  Howard Kaikow replied...
23-Apr-08 05:17 AM
I really tied up now, but the following  will give you an idea on how  to
Convert to Unicode.
Apparently, the code is still not parsing as desired.

Option Explicit

Private Declare Function UrlGetPart Lib "shlwapi" _
Alias "UrlGetPartW" _
(ByVal pszIn As Long, _
ByVal pszOut As Long, _
pcchOut As Long, _
ByVal dwPart As Long, _
ByVal Flags As Long) As Long

Private Const MAX_PATH As Long = 260
Private Const URL_PART_HOSTNAME As Long = 2

Public Function GetCompanyName(URL As String) As String
Dim dwPart As Long
Dim Flags As Long
Dim Part As String
Dim Size As Long
Dim Host As String
'  Protocols is a **dot** delimited string
Const Protocols As String = "www.ftp"
If Len(URL) > 0 Then
If InStr(URL, "//") = 0 Then URL = "http://" & URL
Part = Space$(MAX_PATH)
Size = Len(Part)
If UrlGetPart(StrPtr(URL), StrPtr(Part), Size, URL_PART_HOSTNAME, 0&)
= 0 Then
GetCompanyName = Left$(Part, Size)
End If
Host = Left(GetCompanyName, InStr(GetCompanyName, "."))
If InStr("." & Protocols & ".", "." & Host) Then
GetCompanyName = Mid(GetCompanyName, Len(Host) + 1)
End If
End If
End Function

Private Sub btnByeBye_Click()
Unload Me
End Sub

Private Sub btnRunMe_Click()
With lstResult
.AddItem GetCompanyName("http://www.BagelsAndLox.com/")
.AddItem GetCompanyName("http://BagelsAndLox.com/")
.AddItem GetCompanyName("www.BagelsAndLox.com")
.AddItem GetCompanyName("BagelsAndLox.com")
.AddItem GetCompanyName("http://www.Bob.BagelsAndLox.com/")
.AddItem GetCompanyName("http://Carol.BagelsAndLox.com/")
.AddItem GetCompanyName("www.Ted.BagelsAndLox.com")
.AddItem GetCompanyName("Alice.BagelsAndLox.com")
End With
End Sub
  Howard Kaikow replied...
23-Apr-08 05:26 AM
A betta test is:


Private Sub btnRunMe_Click()
With lstResult
.AddItem GetCompanyName("http://www.BagelsAndLox.com/")
.AddItem GetCompanyName("http://BagelsAndLox.com/")
.AddItem GetCompanyName("www.BagelsAndLox.com")
.AddItem GetCompanyName("BagelsAndLox.com")
.AddItem GetCompanyName("http://www.Bob.BagelsAndLox.com/")
.AddItem GetCompanyName("http://Carol.BagelsAndLox.com/")
.AddItem GetCompanyName("www.Ted.BagelsAndLox.com")
.AddItem GetCompanyName("Alice.BagelsAndLox.com")

.AddItem GetCompanyName("http://www.BagelsAndLox.com/Alpha.htm")
.AddItem GetCompanyName("http://BagelsAndLox.com/Beta.html")
.AddItem GetCompanyName("http://www.Bob.BagelsAndLox.com/Gamma.doc")
.AddItem GetCompanyName("http://Carol.BagelsAndLox.com/Delta.jpg")
End With
End Sub
  Ron Rosenfeld replied...
23-Apr-08 09:38 AM
Here is a UDF that will probably do a better job, given all the variables that
have been proposed in this thread.

Note that the line that starts with  re.Pattern = and the subsequent line(s)
within quotation marks should be all on one line.

======================================
Option Explicit
Function ExtrURL(str As String) As String
Dim re As Object, mc As Object
Set re = CreateObject("vbscript.regexp")
re.IgnoreCase = True
re.Global = True
re.Pattern =
If re.test(str) = True Then
Set mc = re.Execute(str)
ExtrURL = mc(mc.Count - 1).submatches(2)
ExtrURL = Replace(ExtrURL, "www.", "")
If InStr(1, ExtrURL, ".") = 0 Then ExtrURL = ""
End If
End Function
==================================

--ron
  Rick Rothstein \(MVP - VB\) replied...
23-Apr-08 10:23 AM
Now that is what I miss about Regular Expressions from my days many years
ago working with them in the UNIX world... their clarity and readability.<g>

Rick
  Ron Rosenfeld replied...
23-Apr-08 02:33 PM
On Wed, 23 Apr 2008 10:23:45 -0400, "Rick Rothstein \(MVP - VB\)"



And even when you write out the explanation:

===============================
URL capturing

\b((https?|ftp)://)?([-A-Z0-9.]+)(/[-A-Z0-9+&@#/%=~_|!:,.;]*)?(\?[-A-Z0-9+&@#/%=~_|!:,.;]*)?

Options: case insensitive

Assert position at a word boundary «\b»
Match the regular expression below and capture its match into backreference
number 1 «((https?|ftp)://)?»
Between zero and one times, as many times as possible, giving back as needed
(greedy) «?»
Match the regular expression below and capture its match into backreference
number 2 «(https?|ftp)»
Match either the regular expression below (attempting the next
alternative only if this one fails) «https?»
Match the characters “http” literally «http»
Match the character “s” literally «s?»
Between zero and one times, as many times as possible, giving back
as needed (greedy) «?»
Or match regular expression number 2 below (the entire group fails if
this one fails to match) «ftp»
Match the characters “ftp” literally «ftp»
Match the characters “://” literally «://»
Match the regular expression below and capture its match into backreference
number 3 «([-A-Z0-9.]+)»
Match a single character present in the list below «[-A-Z0-9.]+»
Between one and unlimited times, as many times as possible, giving back
as needed (greedy) «+»
The character “-” «-»
A character in the range between “A” and “Z” «A-Z»
A character in the range between “0” and “9” «0-9»
The character “.” «.»
Match the regular expression below and capture its match into backreference
number 4 «(/[-A-Z0-9+&@#/%=~_|!:,.;]*)?»
Between zero and one times, as many times as possible, giving back as needed
(greedy) «?»
Match the character “/” literally «/»
Match a single character present in the list below
«[-A-Z0-9+&@#/%=~_|!:,.;]*»
Between zero and unlimited times, as many times as possible, giving back
as needed (greedy) «*»
The character “-” «-»
A character in the range between “A” and “Z” «A-Z»
A character in the range between “0” and “9” «0-9»
One of the characters “+&@#/%=~_|!:,.;” «+&@#/%=~_|!:,.;»
Match the regular expression below and capture its match into backreference
number 5 «(\?[-A-Z0-9+&@#/%=~_|!:,.;]*)?»
Between zero and one times, as many times as possible, giving back as needed
(greedy) «?»
Match the character “?” literally «\?»
Match a single character present in the list below
«[-A-Z0-9+&@#/%=~_|!:,.;]*»
Between zero and unlimited times, as many times as possible, giving back
as needed (greedy) «*»
The character “-” «-»
A character in the range between “A” and “Z” «A-Z»
A character in the range between “0” and “9” «0-9»
One of the characters “+&@#/%=~_|!:,.;” «+&@#/%=~_|!:,.;»


Created with RegexBuddy
======================================
--ron
  Howard Kaikow replied...
26-Apr-08 08:11 AM
Did this issue  ever get solved?
  Harlan Grove replied...
27-Apr-08 11:27 AM
Ron Rosenfeld <ronrosenf...@nospam.org> wrote...
...
[reformatted]
...

Why so verbose?

re.Pattern = "[^:]*:(//)?[^/:]*?([^./:]+\.[^./:]+(\.[a-z]{2})?)[:/].*"
ExtrURL = re.Replace(str, "$2")
  Ron Rosenfeld replied...
27-Apr-08 07:46 AM
On Sat, 26 Apr 2008 20:27:28 -0700 (PDT), Harlan Grove <hrlngrv@gmail.com>


It's a pattern (from a library) that captures the different URL parts into
different backreferences, so does more than what the OP requested.

But I did test against all the content mentioned in the thread.

Running a quick test, using
============================
Function Extr(str As String) As String
Dim re As Object
Set re = CreateObject("vbscript.regexp")
re.IgnoreCase = True
re.Global = True
re.Pattern = "[^:]*:(//)?[^/:]*?([^./:]+\.[^./:]+(\.[a-z]{2})?)[:/].*"
If re.Test(str) = True Then
Extr = re.Replace(str, "$2")
End If
End Function
================================


Your pattern doesn't seem to match:

http://excelusergroup.org
www.techdirt.com/articles/20080408/223932792.shtml

www.BagelsAndLox.com
BagelsAndLox.com
www.Ted.BagelsAndLox.com
Alice.BagelsAndLox.com

and won't extract the URL from
.AddItem GetCompanyName("www.BagelsAndLox.com")
.AddItem GetCompanyName("BagelsAndLox.com")
.AddItem GetCompanyName("www.Ted.BagelsAndLox.com")
.AddItem GetCompanyName("Alice.BagelsAndLox.com")


Granted, these kinds of examples were not all in the OP's specifications.
--ron
  Harlan Grove replied...
29-Apr-08 07:05 AM
Ron Rosenfeld <ronrosenf...@nospam.org> wrote...
...
...

The domain name in the url above should be

techdirt.com

and that's what my approach returns.

...

I don't consider these urls. They're missing a protocol specifier
(http, https, ftp, or mailto, news, gofer, etc.) All depends on how we
define urls, but there could be substrings in arbitrary text that
match \b[^. ]+\.[^. ]\b that aren't urls, e.g., section numbers like
2.34.5. How would one distinguish these from urls without making the
protocol specifiers mandatory?
  Ron Rosenfeld replied...
28-Apr-08 10:27 PM
On Mon, 28 Apr 2008 14:35:06 -0700 (PDT), Harlan Grove <hrlngrv@gmail.com>


A valid objection.

Of course, we could just go back to the OP's original request:


which can be easily handled with a worksheet function, and given his
description of "having a list of URL's" are probably not embedded in text.

--ron
  Harlan Grove replied...
02-May-08 12:47 AM
Ron Rosenfeld <ronrosenf...@nospam.org> wrote...
...

OP's often don't provide comprehensive examples, as you know. If the
urls always have protocol specifiers, and there's always 2 slashes
just after the protocol specifier and colon, then the domain name will
appear between :// and the subsequent /, but such urls *can* also
contain port number specifiers. For example,

http://www.foo.com:80/bar/

which your approach chokes on but mine parses as foo.com. Then there
slashes, but they're perhaps a digression.

The domain name will be the last 2 or 3 period-separated tokens
between the first colon, possibly followed by 2 slashes, and the first
subsequent colon or slash. The only characters you need to check for
as delimiters are colons and slashes. The domain name will contain 1
or 2 periods separating any other characters.
  Howard Kaikow replied...
30-Apr-08 12:19 AM
The inclusion of the whatever:// is irrelevant to the issue of extracting
the domain.
Proper code will work either way.
And do not forget about country codes at the end of the string/URL.
  Dave Mills replied...
01-May-08 12:25 PM
First you will need to answer how a human can tell what the domain part is for
these examples. The only way I could think of would be to query "Who Is" and
look at the registrant data.

The problem is com, nz, uk etc. are all domains
so are net.nz and ace.net.nz. ace.net.nz is a sub domain of net.nz but then net
is a sub domain of nz. Since the domain itself can have an IP and be used to
point to a web server there is no way you can extract what you have defined as
the domain part from the string programmatically. The solution needs additional
data about what you consider is the boundary point in each string.


On Wed, 30 Apr 2008 19:02:08 -0700 (PDT), Harlan Grove <hrlngrv@gmail.com>

--
Dave Mills
There are 10 type of people, those that understand binary and those that don't.
  Harlan Grove replied...
02-May-08 12:48 AM
Dave Mills <Ne...@nospam--djmills-dot-co.uk> wrote...
...
...

There are some rules. Maybe not complete, but they'll cover most
situations. Domains should be parsed right to left by token, and
tokens are period-delimited strings.

If the rightmost token is 2 chars,
it's a country top-level domain, and presumably more tokens wanted.
If the next token going left is also 2 chars or a common generic
top-level domain name (net, org, etc.), then it's presumably also
a higher level domain. Otherwise, the 2nd token from the right
would complete the domain name.
If the rightmost token is us, the next is 2 chars and the next is
k12, we'd need the 4th token from the right too; otherwise, the 3rd
token from the right would complete the domain name. Any further
tokens going left would be hostnames within domain.
Else (the rightmost token is 3 or more chars) the 2nd token from the
right would complete the domain name.

These rules would fail if www.foobar.museum.ru were a valid url, in
which case the domain name should be foobar.museum.ru. Perhaps what's
needed is a complete list of accepted top-level domain names, then the
domain name would stop at the first token going right to left that
isn't an accepted top-level domain name.

The joker in the set of urls I posted before was stats.ox.ac.uk.
Create New Account
help
Need help with if-then statement in formula Excel Hello, I need to create a formula that looks in cell B2 and B3 and if either cell B2 or B3 contains "TRUE", then it returns a "TRUE". Thanks! Jaret Excel Miscellaneous Discussions HodgeTAKETHISOUT (1) Excel (1) SUBSTITUTE (1) Worksheet (1) ISNA (1) HTHNick (1) ExcelSouthampton (1) HodgeMicrosoft (1) Try this = OR(B2 formula works. Mike Jaret Try = OR(B2 = TRUE, B3 = TRUE) - - HTH Nick Hodge Microsoft MVP - Excel Southampton, England nick_hodgeTAKETHISOUT@zen.co.ukANDTHIS web: www.excelusergroup.org web: www.nickhodge.co.uk produce the results for B2 and B3 which is really M2 and N2 on my excel worksheet but I'll give it a shot. The data in cell K2 is a or N2 is "TRUE". Cell M2 uses the following formula to produce a "TRUE" statement: = SUBSTITUTE(B2, "(null)", "TRUE") Cell N2 uses the following formula to produce a "TRUE": = ISNA(L46
non-VBA Alternative to SUBTITUTE function Excel I have a somewhat complex problem, so I will try to explain with enough detail a book about it. I am adapting a large spreadsheet that was not created in excel. Many of the functions work a little differently, so I was not able to simply to the cell that contains this formula. Here is my problem. I am needing to substitute a lot of values, and the formula is waaay to long. Here is what it looks like, and while it would work, excel will not accept something this long: IFERROR(if(F19, IF(backwards_compatible, "<description> <![CDATA SUBSTITUTE(SUBSTITUTE(SUBSTITUTE(SUBSTITUTE(SUBSTITUTE(SUBSTITUTE(SUBSTITUTE(SUBSTITUTE(SUBSTITUTE(SUBSTITUTE(SUBSTITUTE(SUBSTITUTE(SUBSTITUTE(SUBSTITUTE(SUBSTITUTE(SUBSTITUTE(SUBSTITUTE(SUBSTITUTE(SUBSTITUTE
limits to multiple subsitute funtion?? Excel Is there a limit to how many substitute functions I can have in one formula? It does not seem to let me have more than 8. If there is a limit, is there a work around? = SUBSTITUTE(SUBSTITUTE(SUBSTITUTE(SUBSTITUTE SUBSTITUTE(SUBSTITUTE(SUBSTITUTE(SUBSTITUTE(BM2, "Oscar Heyman", "16431685"), "Buccelati", "16431678"), "Cartier", "16431674"), "Chanel", "16431677"), "Chopard", "16431686"), "David
Substitute Function Question #2 Excel Help! I'm soooo close. My cell looks like this - .0470N.0180E.21 I want this - T47N R18E It currently looks like this - T47N.18E Here is my subsitute function: = SUBSTITUTE(SUBSTITUTE(SUBSTITUTE(SUBSTITUTE (F2, ".0", "T", 1), "0", ""), ".0", "R", 2), ".21", "") I just need the second .0 to be replaced by a <space> "R" Excel Miscellaneous Discussions SUBSTITUTE (1) LEFT (1) FIND (1) LEN (1) The Substitute functions are evaluated from the inside