How to Verify Email Address
http://www.ruanyifeng.com/blog/2017/06/smtp-protocol.html 如何验证 Email 地址:SMTP 协议入门教程
https://en.wikipedia.org/wiki/Simple_Mail_Transfer_Protocol Simple Mail Transfer Protocol
How to Verify Email Address
http://blog.online-domain-tools.com/2014/11/14/how-to-verify-email-address/
aving a long list of emails of your potential customers and partners is very common today. And if you run an online business email marketing is probably on of the tools that you often use. It is thus handy to keep your mailing list clean. Emails in your database might be collected months, or even years, ago and many of them might be invalid by now.
In this post we will uncover techniques and tricks that are used by email verification tools and scripts. Email address verification is not as simple as it might seem. Here are some of the questions that we will answer in this post: How to verify an email address using SMTP protocol? How to recognize a good email address, to which a message will be delivered? How to clean up a mailing list?
We mention the SMTP protocol a lot in this article. Its current specification is in RFC 5321. Do not hesitate to open and read this RFC while reading this post.
Definition, SMTP Basics and Core of Verification
What is a valid email address? Let’s define that an email is valid if there is a mail server that will accept a message to that address. By accepting a message we mean that the server read contents of the message from the sender, but this does not imply that the message will be delivered to the recipient’s mailbox. And of course, this says nothing about whether the message will be read by its recipient.
A verification of an email address is performed using SMTP protocol. There are several SMTP commands that can be used for this. These are EXPN, VRFY, and RCPT TO. EXPN and VRFY are rarely enabled in today’s mail systems, this is why the most common verification command is RCPT TO. There are several steps that have to be done before a client can use the RCPT TO command:
- Find IP address of the target mail server.
- Establish a connection to the target mail server.
- Send HELO or EHLO command.
- Send MAIL FROM command.
- Now we can send RCPT TO command.
We will discuss these steps in detail later in this post. Let’s focus on the RCPT TO command now. In our case, the only interesting syntax of the RCPT TO command is when its argument is the target email address. For example:
RCPT TO:<name@example.com>
When a client send this command to the server, the server will either accept the email or report an error. If the server recognizes the address and accept it, it will return either code 250 or 251. When we receive this code, it means the email address is valid. Otherwise, it may or may not be valid. Return codes starting with 4 – i.e. codes between 400 and 499, are considered to be temporary errors. These are usually not returned if an email address does not exist at all. Return codes starting with 5 – i.e. codes between 500 and 599, are reserved for permanent errors, which imply that the email address does not exist or that a message sent to it will not be accepted.
There are five possible results of email verification process:
- The email address is valid – The server returned 250 or 251 to RCPT TO command.
- The email address is probably valid, but a temporary error prevents immediate delivery – This usually occurs when a greylistingmechanism is installed on the target mail server and it is likely that repeated attempt to send a message to this email will succeed. It can also occur in case the target mailbox is full.
- The email address is valid, but every other email address on its domain is valid – There is a catch-all address enabled for the target domain. This means that we are unable to decide whether the email address represents a real user’s mailbox or whether messages to the address are sent to the catch-all address.
- It is not possible to determine whether the email address is valid, but messages will not be delivered – This happens in case that the target mail server does not respond.
- The email address is not valid – There is no mail server for the given domain or the mail server returned a permanent 5xx error to RCPT TO command.
This may not sound very complicated, but the real problem is to determine these result correctly and not to be misled by various anti-spam mechanisms that today’s mail servers implement.
Target Mail Servers
The first step in the process is to check whether MX records exist for the domain of the email address that we want to verify. MX records are DNS records that hold information about mail servers that handle emails for the particular domain. If there no MX records exist for a domain, no valid email addresses exist too. Let’s take a look at two examples using nslookup command. Note that you can similarly use other commands like dig.
Our sample emails to test will be test@gmail.com and test@nonexistingdomainname123456.com and we will be using Google’s public DNS server 8.8.8.8 to get DNS records.
> nslookup
> server 8.8.8.8
Default Server: google-public-dns-a.google.com
Address: 8.8.8.8
> set q=mx
> gmail.com
Server: google-public-dns-a.google.com
Address: 8.8.8.8
Non-authoritative answer:
gmail.com MX preference = 40, mail exchanger = alt4.gmail-smtp-in.l.google.com
gmail.com MX preference = 30, mail exchanger = alt3.gmail-smtp-in.l.google.com
gmail.com MX preference = 10, mail exchanger = alt1.gmail-smtp-in.l.google.com
gmail.com MX preference = 5, mail exchanger = gmail-smtp-in.l.google.com
gmail.com MX preference = 20, mail exchanger = alt2.gmail-smtp-in.l.google.com
> nonexistingdomainname123456.com
Server: google-public-dns-a.google.com
Address: 8.8.8.8
*** google-public-dns-a.google.com can't find nonexistingdomainname123456.com: Non-existent domain
For the email address on non-existing domain, there simply were not found any MX server. This automatically means that test@nonexistingdomainname123456.com is not a valid email address. For test@gmail.com, the situation is different. We can see that for gmail.com domain there 5 MX records. It is common to see only 1 MX records for domains with small number of emails, but larger services may have more than one mail server. It is important not to pick only a single mail server here. In order to verify email properly, we may have to work with all servers from all MX records that were found.
Why is this so important? Imagine a scenario that we picked up just a single MX server from the list. It could happen that when we try to connect to it in order to verify an email, the server would not respond for whatever reason, or its response will report a temporary error because of a recent problem with this particular server. If we do not try other available mail servers, we could end up with a wrong result. It might be easily possible that another server will be perfectly fine and give us correct answers. So, in case of any problems with one server, just try another one. The problems include an inability to connect to SMTP port, no greeting received, non-OK responses to HELO/EHLO or MAIL FROM commands.
Handling Greetdelay and HELO/EHLO
At this step, we assume that we have a mail server from DNS MX record of the target domain. Next step is to connect to the target mail server to its SMTP port, which is port 25/TCP. We just mentioned that if a connection cannot be established, then we should try another server if available.
If a TCP connection is established, it is now important to wait until the mail server sends us its greeting. We cannot simply send our commands to the server, because it is prescribed in the SMTP protocol that the server talks first. And some servers are configured to refuse to work with clients that do not respect that. This is one of the many anti-spam techniques. Spam sending programs are sometimes programmed to be as simple and as fast as possible. This is why they do not wait for the server’s greeting and start sending commands just after the connection is established. Servers that reject this kind of clients get rid of these spammers at the earliest stage possible.
Another anti-spam technique, which is relatively new, is called Greetdelay. The basic idea is based on the same fact that we already mentioned – many programs used by spammers to send emails are very simple and coded to work as fast as possible. And they have to be fast if they want to send millions of emails to get some hits and sales. The Greetdelay idea is to take the time allowed by RFC specification of the SMTP protocol to delay the message transfer agent (MTA, the software that wants to send an email) a bit. For example, the server implementing Greetdelay waits 30 seconds before it sends the greeting to the client. Or 60 seconds. Or 90. Spam software that does not fully comply with the protocol may give up before it is allowed to send a message. Similar strategy can be implemented in other stages of the SMTP communication. The before-greeting stage is just the most common.
So, in order to successfully pass the Greetdelay mechanism, the email verifier engine has to set up its timeouts to at least 5 minutes. The greeting should come with code SMTP response code 220. Other codes may mean that we should try another MX server.
After the greeting is received, the client (our email verification engine) should send HELO or EHLO command. Since HELO command exists only for compatibility with very old software, use EHLO everytime. Its only argument is the fully-qualified domain name (FQDN) of the client. We will cover the importance of having your own domain name for the server where you run the email verification client in the next section, just after we send MAIL FROM command. But before we can do that, we will wait for the reply to our EHLO. The SMTP code here should be 250, otherwise we can assume that something is wrong with this server.
Let’s see how this goes when we talk to one of the Gmail’s servers (lines starting with S: came from the server, lines starting with C:came from our client):
> nc alt4.gmail-smtp-in.l.google.com 25
S: 220 mx.google.com ESMTP m14si19812165icp.91 - gsmtp
C: EHLO mail.example.com
S: 250-mx.google.com at your service, [198.51.100.123]
S: 250-SIZE 35882577
S: 250-8BITMIME
S: 250-STARTTLS
S: 250-ENHANCEDSTATUSCODES
S: 250-PIPELINING
S: 250-CHUNKING
S: 250 SMTPUTF8
We used Netcat as a client that allowed us to type the SMTP commands manually. Our IP address was 198.51.100.123. The MX server sent us a correct greeting and replied with EHLO command. The server’s response code was 250 and it sent us a list of supported extensions, which are not interesting for us here. We can go on.
Continue to the second part of this article
How to Verify Email Address – Part 2
Go back to the first part of this article
http://blog.online-domain-tools.com/2014/11/14/how-to-verify-email-address-part-2/
MAIL FROM Implications to Source Server Environment
In order to be able to send RCPT TO command, which is our goal from the very beginning, we need to send MAIL FROM command to the server and receive OK response (code 250) from it. However, this action has big consequences. Here is where another bunch of anti-spam measures are implemented by many mail servers.
MAIL FROM command requires an argument, in which we specify the sender of an email we want to send. We will refer to the domain part of the sender email address as the source domain. Let’s assume that our source domain is example.com and our sender email address is verifier@example.com and see how our SMTP session continues:
C: MAIL FROM:<verifier@example.com>
S: 250 2.1.0 OK 91si19491992ioi.66 - gsmtp
This does not look very complicated, so what’s the catch? Once we have used the MAIL FROM command, we have gave away our source domain. This is what many spam-fighting techniques are waiting for. Once they got your source domain, they start to check it. Some of them perform very complex checks. Note that some mail servers may perform some of these checks (#5, #6, #7) even before MAIL FROMcommands is received.
Source Check #1 – Is There Mail Server?
We have identified our sender as verifier@example.com. The target mail server may thus want to know if there are MX records for domain example.com. The target mail server thus obtains DNS MX records of example.com. If there are no MX servers available, it may consider the incoming message as spam or fake and refuses to further communicate with our client.
Source Check #2 – Is It Alive?
Some servers do not consider the presence of MX records to be good enough. They attempt to connect to those MX servers to find out whether they are alive. If there are no running MX servers then again, they may refuse to talk to you again.
Source Check #3 – Is There Postmaster Account?
The SMTP RFC prescribes that every domain that accepts emails via SMTP must have the Postmaster account. In our example, this means that mailbox postmaster@example.com must exist. Some SMTP servers requires the source servers that attempt to send them emails to comply with RFC at this point. This is why they perform an email verification process to check that the mailbox postmaster@example.com really exists.
Source Check #4 – Is Sender Address Valid?
Another anti-spam technique is to verify the sender email address. This is done similarly to how Postmaster account is checked. If the sender’s mailbox does not exist on the source domain mail server then they reply with an error status to your MAIL FROM command.
Source Check #5 – Is Client Blacklisted?
A very common method of fighting spam is using one or more blacklists. The client’s IP (198.51.100.123 in our example) and/or the source domain’s IP addresses are checked and if they are blacklisted, further communication is disallowed. For a manual blacklists lookup, you can try online blacklist checker.
Source Check #6 – Is There Good Reverse Record?
This method checks the client’s IP address (198.51.100.123 in our example). The mail server attempts to obtain its reverse DNS record. Trustworthy mail servers operate on IP address with a valid reverse DNS record. If it is missing, it is more likely that a spam is incoming. Some mail servers go even further and only the presence of a rDNS record is not enough for them. They require the reverse domain to start with mail, mx, or smtp. In our example, mail.example.com would be a good domain name.
Source Check #7 – Is There Good SPF Record?
A very common technique is checking SPF records. SPF stands for Sender Policy Framework, which is a definition of who is allowed to send emails on behalf of a domain in question. SPF records are TXT records in DNS with a special syntax that define list of mail servers that are allowed to send mails, but they also define how non-listed servers should be treated if they attempt to send emails on behalf of the specific domain. In our example, we run our client from IP address 198.51.100.123. If this IP address is not found among servers allowed to send emails on behalf of the source domain example.com, then our attempt will be rejected.
Various mail servers will implement different subsets of these checks, so you may succeed to verify/send emails on/to one server and fail on another, if you do not have everything configured properly. If you want to pass all these checks, you will have to:
- Choose a sender email address such that its domain exists.
- Make sure there is an MX record for that domain.
- Make sure the MX records points to a running SMTP server.
- Have the SMTP server operating properly. It must accept emails for Postmaster mailbox and for the sender email address.
- Not to have the server, from which you are running your verification software, blacklisted.
- Make sure its IP address has a valid reverse record that matches the sender email address domain and its FQDN starts with mail, mx, or smtp.
If the target mail server implements any of these checks, it may or may not let you continue in the SMTP session. Some of the mail servers may let you allow to continue and just silently send emails from you to trash, if you send any. For email verification purposes, this situation would not be a problem. Other mail servers will report an error to the MAIL FROM command. In any case, if you send MAIL FROMand receive status code 250, you can finally send RCPT TO.
Handling Greylisting
Now we are finally allowed to send our RCPT TO command and there are several responses we can get. As we mentioned already, codes 250 and 251 means that the recipients address was accepted. In such a case, we can mark the email address as valid. We can also receive a permanent error in form of 5xx code. In this case, we can mark the email as invalid. However, we can also also receive a temporary error in form of 4xx code. What does this mean actually? It means that at this very moment, the server tells us that it will be unable to deliver our message for whatever reason. It is thus unlikely that the recipient’s mailbox is invalid, in which case the server could sent us a permanent error. So, it is likely that the server recognized that the mailbox is valid, but also detected a problem that prevents it to continue. We use the terms likely and unlikely on purpose – to really express uncertainty of these thoughts because we do not have any guarantee that the server performed validation of the mailbox before it run into the problem it reports. If greylisting did not exist, our verification work would be finished, and we would have to produce a verdict over the email address after we received a temporary error. In that case, we would say that email is probably valid.
Greylisting is a popular anti-spam method that exploits temporary error responses to RCPT TO command. Greylisting creates an artificial deliver problem to verify that the sender is a real message transfer agent that complies with the SMTP protocol, and not any simple spam sending program. When a proper MTA receives a temporary error, it does not give up entirely. It waits several minutes or hours and then tries again. This is something that spammers can rarely afford themselves to do. A mail server that implements greylisting maintains a database of triplets consisting of the client’s IP address, the sender’s address, and the recipient’s address. When a specific triplet is being seen for the first time, a temporary error is returned. If the client tries again after some time, it may be allowed, depending on how much time elapsed since the first attempt. Some servers are configured to allow attempts after a couple of minutes, some require longer periods of time, up to several hours.
If we want to support verification of emails managed by servers with greylisting, we have to react to temporary error response codes to RCPT TO by disconnecting from the server, waiting a certain period of time, and then try again. Note that trying another MX server, if it is available, will not help us here. Greylisting is usually implemented so that if one mail server has it enabled then all the domain’s mail servers have it. The database of triplets is shared among all the target servers, which means there is no benefit for us to try another server. We really need to wait and try again later. How long should we wait? A couple of minutes is reasonable before we try for the second time. But since some servers will require up to several hours of delay, it is a good idea to set a limit for ourselves after which we simply give up and finish with uncertain result of “email is probably valid”.
Many servers, when they return temporary error due to greylisting, provide an information that greylisting was performed and some servers even provide information on when they will accept a next attempt. However, these additional data may or may not be present, and most importantly, their format differs from one server software to another. So, it is not very beneficial for us to try to somehow parse them.
Here are some examples with RCPT TO commands:
C: RCPT TO:<test@gmail.com>
S: 550-5.1.1 The email account that you tried to reach does not exist. Please try
S: 550-5.1.1 double-checking the recipient's email address for typos or
S: 550-5.1.1 unnecessary spaces. Learn more at
S: 550 5.1.1 http://support.google.com/mail/bin/answer.py?answer=6596 q17si5372562igi.1 - gsmtp
C: RCPT TO:<goodaddress@gmail.com>
S: 250 2.1.5 OK q17si5372562igi.1 - gsmtp
In this example, we can see that test@gmail.com is not a valid email address, since permanent error code 550 is returned. Then we try goodaddress@gmail.com, which is confirmed to be a valid email address.
The second example show us a response of mail server that implements greylisting:
C: RCPT TO:<xxx@censored.pl>
S: 451 Temporary local problem - please try later
C: QUIT
... Connecting to server again after 2 minutes, sending EHLO, MAIL FROM ...
C: RCPT TO:<xxx@censored.pl>
S: 250 Accepted
In this case, an undisclosed Polish server implemented greylisting and reported temporary error 451 to our RCPT TO command. Then after 2 minutes the very same command was accepted.
Disposable Email Services and Handling Catch-all Address
If we determined that the given email address is valid, we might be interested to obtain a little more information about it. In our verifier we provide information whether the target email is hosted by disposable email service (also called 10-minute). This is done simply by looking up into a database of domains that these services use. Although this will never be 100% accurate as there always will be disposable email domains that we are not aware of, it is possible to cover the vast majority of these services.
Finally, we might want to recognize and report domains with enabled catch-all address. This is done simply using by sending an extra RCPT TO command with recipient address set to an email address that would not be valid unless catch-all mechanism is enabled. We simply generate a random string, long enough to avoid collision with any existing address. We use strings of 12 characters.
Here is an example of testing a normal email address followed by the test for catch-all address on mailinator.com, which is a disposable email service with catch-all address enabled.
S: 220 mail.mailinator.com ESMTP Postfix
C: EHLO mail.example.com
S: 250-mail.mailinator.com
S: 250-8BITMIME
S: 250-SIZE 150000
S: 250 Ok
C: MAIL FROM:<verifier@example.com>
S: 250 Ok
C: RCPT TO:<john@mailinator.com>
S: 250 Ok
C: RCPT TO:<ze6V7y6rcZEV@mailinator.com>
S: 250 Ok
C: QUIT
S: 221 Bye
At first the server confirmed that john@mailinator.com is a valid email address. Then we checked if catch-all address is enabled by verifying a random email address ze6V7y6rcZEV@mailinator.com. Since this also has been accepted, we conclude that catch-all address is enabled and thus we can not be sure whether john@mailinator.com was accepted because that mailbox existed or just because of the catch-all address.
Reference
The techniques described in this article were implemented in Bulk Email Verifier – a free email verification tool with API.