Suppose you have an email that claims to be from a particular web destination (“Chase Bank”, “eBay”, “Middle of Nowhere Bank”, etc.) and directs you to a URL purportedly at that location. Suppose further that you can extract both of these pieces of information from any email that falls into this category. So you have:
A. Purported Web Destination of Email
B. URL Email is Instructing you to Follow
So here is an open-ended question: how can you use existing network services to determine whether B is an authentic location in A? A subset of existing spam filtering heuristics works quite well towards this end (visible text of an HTML link does not match the actual URL, href attribute is expressed as an IP address, etc.), but using network services opens up a new dimension of validation, one in which the data gathered for the heuristics is outside the control of the email’s sender. So post any ideas you have. Kurt asked a similar question at an SFS meeting last semester pertaining to the parasitic storage project. Whereas his aim was using network services for caching, my aim is using them for source authentication. Thanks, and please keep the discussion focused, at least primarily, on this particular method.
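For reference, the two sender-controlled heuristics mentioned above can be sketched roughly as follows. This is only an illustration in Python using the standard library; the names `LinkCollector`, `host_of`, `looks_like_url`, and `suspicious_links` are made up for this example, and the rule for deciding whether visible text "looks like a URL" is a simplifying assumption.

```python
from html.parser import HTMLParser
from urllib.parse import urlparse
import ipaddress


class LinkCollector(HTMLParser):
    """Collect (href, visible_text) pairs for every <a href=...> tag."""

    def __init__(self):
        super().__init__()
        self.links = []
        self._href = None
        self._text = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append((self._href, "".join(self._text).strip()))
            self._href = None


def host_of(url):
    """Return the lowercased hostname of a URL, or None if it cannot be parsed."""
    if "://" not in url:
        url = "http://" + url
    try:
        return urlparse(url).hostname
    except ValueError:
        return None


def looks_like_url(text):
    # Assumption: visible text counts as a URL if it has a dot and no spaces.
    return bool(text) and " " not in text and "." in text


def is_ip_host(host):
    try:
        ipaddress.ip_address(host)
        return True
    except ValueError:
        return False


def suspicious_links(html):
    """Return (href, reason) pairs for links that trip either heuristic."""
    parser = LinkCollector()
    parser.feed(html)
    findings = []
    for href, text in parser.links:
        href_host = host_of(href)
        if href_host and is_ip_host(href_host):
            findings.append((href, "href is an IP address"))
        if looks_like_url(text):
            text_host = host_of(text)
            if text_host and href_host and text_host != href_host:
                findings.append((href, "visible text names a different host"))
    return findings
```

Both checks only look at data the sender wrote, which is exactly the limitation the question is trying to get past.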

So let me get this straight. You’re asking whether it’s possible for a spam filter to look at the contents of an e-mail and determine whether a URL typed in it agrees with where it’s supposed to be coming from? So if the filter sees an e-mail coming from PayPal and it sees the URL 843jsjv.com/paypal/loginnow/ in the message, then it will mark the e-mail as phishing/spam because someone talking about PayPal is trying to send you somewhere else.
So the problem in this case is figuring out who the REAL PayPal is (at what domain), because your filter is going to see way more bogus phishing e-mails than legit e-mails from PayPal. And the network service part you’re talking about is you hoping that a directory exists that maps business names to the URLs they use. The only one I can think of is Google itself. Paypal.com is legitimate because when I ask Google for “PayPal”, most other people on the web think that means Paypal.com. You can also have users tag previous e-mails as legitimate and use that to develop an internal mapping in the spam filter (show the user a dialog box: “The business PayPal operates only on paypal.com, is that correct?”). Or you can input it yourself by hand.
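To make the internal-mapping idea concrete, here is a rough Python sketch. The table `KNOWN_DOMAINS` and the function `url_matches_brand` are hypothetical names for this example, and the table contents stand in for whatever users confirm or you enter by hand. The suffix match is deliberately naive and is not a full registered-domain check.

```python
from urllib.parse import urlparse

# Hypothetical hand-maintained mapping of business name -> domains it uses.
KNOWN_DOMAINS = {
    "paypal": {"paypal.com"},
    "ebay": {"ebay.com"},
}


def url_matches_brand(brand, url):
    """Return True/False if the brand is known, or None if it is not in the table."""
    domains = KNOWN_DOMAINS.get(brand.lower())
    if domains is None:
        return None  # no opinion: the filter has no mapping for this business
    if "://" not in url:
        url = "http://" + url
    host = urlparse(url).hostname or ""
    # Accept the domain itself or any subdomain of it.
    return any(host == d or host.endswith("." + d) for d in domains)
```

With this table, `url_matches_brand("PayPal", "843jsjv.com/paypal/loginnow/")` comes back `False`, while a link to `https://www.paypal.com/signin` comes back `True`. A host such as `paypal.com.evil.example` is rejected too, because the match is anchored at the end of the hostname.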
But then what happens when I send you an e-mail talking about PayPal and link back to ISIS Blogs? The above system will probably mark me as a phish and discard my obviously valid e-mail. Now you’ve got to look at the From: and Reply-to: headers to figure out whether one e-mail has bad intentions and the other does not. But a phisher can change those headers arbitrarily, and I’m sure people will still click links in the e-mail. I’d say the Subject line is the most important. I’m glad this is your problem and not mine :-)
Now originally, I thought you were asking “How can I be sure someone didn’t own the Paypal server?” as in “how do I prove the identity of the website I’m talking to?”, and “how do I prove that someone else hasn’t hijacked that identity?” That’s a much tougher problem, and one that you can fix by changing the rules of the game, notably not having a user account with a password at every freakin website on the internet! I think that http://uselessaccount.com/ sums up that feeling pretty well. God I hope that OpenID takes off…
Yes, Google is something that we are definitely using. A methodology using data gathered from WHOIS Registrant entries can also be applied. If the URL uses SSL, data gathered from the SSL certificate can also be applied. Perhaps something can even be done using an IP Region database (hhhmmm, another PayPal email from China?). Sorry I can’t be very specific. I’m sort of kind of under non-disclosure terms. But with those examples in mind, can anyone think of any other service out there that can be used to ascertain legitimacy? Of course, each heuristic has its downfall. Some downfalls include Google Bombing, WHOIS Registrant Information Spoofing, etc. But the idea is that the more independent heuristics you apply, the stronger your authentication is.
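A back-of-the-envelope way to see why independence matters: if each heuristic can be defeated with some probability, and the heuristics really are independent, the chance of defeating all of them at once is the product of those probabilities. The sketch below is plain Python; the function name and the numbers are invented purely for illustration and are not measurements of any real service.

```python
import math


def prob_all_fooled(fool_rates):
    """Probability that every heuristic is fooled, ASSUMING independence."""
    return math.prod(fool_rates)


# Made-up example rates: search ranking, WHOIS registrant, IP region.
example_rates = [0.2, 0.1, 0.5]
combined = prob_all_fooled(example_rates)  # 0.2 * 0.1 * 0.5
```

The catch is the independence assumption. If one attacker can game two of the services with the same effort, the real figure will be worse than the product suggests.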
Yes, there is also the issue of false positives. Fortunately, the scope is limited to the email contents. I was considering using a Natural Language Processing approach. The justification for this is that all Phishing emails come from the same semantic domain. This domain, summarized crudely, is:
“This is a message from HERE. Something has happened related to you over HERE. This something requires you to visit HERE by clicking on THIS LINK.”
Once identified as being from this semantic domain, an email can be checked for legitimacy. The problem is that NLP is not the easiest subject and Poly does not have any professors who specialize in it. If anyone can offer me guidance/counsel for my NLP woes, please go ahead.
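Short of real NLP, a very crude stand-in for the template above is to look for its three cues with keyword patterns: a claimed source, an account-related event, and a call to action with a link. The Python sketch below is only that; the word lists and the name `fits_template` are assumptions made up for this example, and it would miss anything phrased outside those lists.

```python
import re

# Assumed cue lists; a real system would need far better coverage.
ACCOUNT_EVENT = re.compile(
    r"\b(suspended|unusual activity|verify|confirm|limited|locked)\b", re.I)
CALL_TO_ACTION = re.compile(r"\b(click|follow|visit|log ?in|sign ?in)\b", re.I)
LINK = re.compile(r"https?://\S+", re.I)


def fits_template(body, brands):
    """Return the claimed brand if all three cues are present, else None."""
    lowered = body.lower()
    claimed = next((b for b in brands if b.lower() in lowered), None)
    if (claimed and ACCOUNT_EVENT.search(body)
            and CALL_TO_ACTION.search(body) and LINK.search(body)):
        return claimed
    return None
```

An email that passes this gate would then be handed to the legitimacy checks, with the returned brand as A and the extracted link as B.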
I don’t think a perfect method exists to flag emails for phishing (the same goes for spam). If the user is not careful, it does not matter how good your filter is. So the client program (in this case the email client) should provide as much information as possible about the URL and help the user decide.
Regards,
Open Source in Israel