Spambots and Other Junk Traffic - What is it and how to Get Rid of it


    What is a Spambot

    You cannot deny the importance of Google Analytics for understanding and measuring your users’ behavior. Millions of people around the globe use it for good reason.

    In my experience, many sites of all sizes still forgo data filtering after installing the tracking code, even though clean data is a key factor in decision-making for many businesses.

    Since around 2013, referral spammers have been injecting data into Google Analytics (GA) without ever actually visiting our websites.

    Admins often see referral spam show up as a fake referral, a search term, or even a direct visit.

    Spammers hijack the referrer displayed in your GA referral traffic so the visit appears to come from a well-known website, when in fact it points to their own.

    It’s unlikely that referral spam will harm your site since it doesn’t actually trigger a fake visit (provided you don’t click on spam links).

    In order to make sense of Google Analytics data, marketers must filter out this type of traffic manually.

    Our major ongoing marketing decisions are based on GA, so clean data is of the utmost importance.

    Marketers may draw inaccurate conclusions based on bogus bot traffic if they do not know about referral spam and how to filter it.

    The purpose of this column is to teach marketers how to filter referral spam from Google Analytics data.

    Without filtered data, a Google Analytics property is like one of those display cakes made from styrofoam with a few edible parts: it might look real at first glance, and might even feel right when you cut a slice, but the deeper you go, the more artificial it turns out to be.

     

    Most people don’t pay close attention to the real user data in Google Analytics, or haven’t configured it properly. If you only look at the summary reports, you might not notice all the bogus data mixed in with your real site visitors.

     

    Because of this, you won’t realize that your time is spent analyzing data that isn’t representative of your site’s performance.

    We’ll use GA to eliminate the artificial excess that inflates your reports and corrupts your data so you get only the real ingredients and don’t eat that slice of styrofoam.

    How do spambots operate

    Spambots operate in different ways across different mediums. By creating accounts on various sites, they can post irrelevant comments in social groups, forums, and communities.

    On different forums and communities, these bots are programmed to interact with users as if they were humans. 


    How Do Spambots Perform Multiple Signups?

    Signup forms have only a few fields, and any hacker can write a script that programs a bot to fill them in. In this way, bots perform large numbers of bogus signups, resulting in a flood of spam accounts.

    These irrelevant signups also inflate interaction counts for genuine users and increase the likelihood of a higher bounce rate on the signup form.

    Spambots can also let attackers send unwanted spam from your platform.

    In most cases, spambots scrape information from around the web. After that, the data is sold on the dark web. Information such as sensitive financial data, phone numbers, and social accounts may be included in these data sets.

    Types of spambots

    Spambots come in different types based on their activity: some scrape data, some spam the comment sections of websites, and some send unwanted messages through email.

    • Email spam

    These bots crawl web pages and collect email addresses that match a pattern such as name@domain.com. Once the data is harvested by scraping, it is compiled into a database of email addresses.

    Attackers then email these users in large numbers. A malicious email will either contain malware or a link designed to collect your personal information (a phishing email).

    In addition to using harvested databases, spammers also purchase email lists from the dark web to spam.

    • Comment spam

    Comment spam is a form of automated posting usually found in open forums. Fake comments are typically created with the intent of selling a product or generating links to increase traffic.

    Many websites allow public commenting, which makes it easy for spambots to leave comments on your platform without creating an account or authenticating.

    • Social media bots

    Most bots are active on Facebook, Twitter, and Instagram. These bots generally post offers, deals, and products, and they will like, share, and comment on posts even when they have no relevance to them. Alternatively, a fake account can compromise a real user’s account and then appear to be legitimate. Twitter bots usually follow a set of rules that tell them when to tweet, retweet, and like posts.

    How to detect spambots

    What matters is identifying these bad bots and avoiding being influenced by them.

    There are many ways to detect bots, even though they often mimic human behavior to disguise bot traffic as real human traffic.

    Some methods of bot detection are relatively simple and require little technical knowledge; they let you easily check if and when bots visit your website.

    Some other methods can be more difficult to implement, as they require more technical expertise in order to analyze the data and apply the fixes accordingly.

    Having said that, here are some of the best ways to detect bot traffic on your website. 

    Direct Traffic Sources

    The majority of the site’s traffic comes from a variety of channels (or sources), including direct traffic, organic traffic, social traffic, and referral traffic.

    During a “bot attack”, direct traffic is often the only channel that grows. If your website shows this behaviour, it can be an obvious sign of bots. Bots can also spoof the referring URL, but usually this can be detected.

    Reduced Server Performance

    Bots are largely to blame for slowing down your hosting server.

    The best way to understand a server slowdown is from the bots’ perspective: a large number of bots hit the site in a short period of time, and the server cannot handle them all.

    As a result of a bot attack, your site’s performance is affected and directly impacts the user experience on your website for “normal” traffic, that is organic, referral, and social. In order to prevent any interruptions to legitimate users’ access to your website, it’s vital to block bot traffic on your site.

    Speed of your Website 

    Bot activities can also be detected in this way. A website’s performance will slow down when it experiences a massive influx of bots.

    It is unlikely that one bot will affect your website’s speed overall, but multiple bots entering your website simultaneously can affect site performance. Having many bots enter a website at once is a common way to overload the server, causing the site or server to crash.

    An attempt like this is also called a DDoS attack. Your website performance can be adversely affected by these attacks, damaging your brand and business. When your website is the only source for doing business and selling, the effects of such a bot invasion will be worse. 

    Faster Browsing Speed

    A computer can browse a website much faster than a human, and bots are the main reason for huge growth in traffic over a short period of time. A simple way to spot them is to monitor browsing-rate metrics on your website.
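    As a rough illustration of what a browsing-rate check looks like, here is a minimal Python sketch that flags client IPs requesting more pages within a short window than a human plausibly would. The log format, window, and threshold are illustrative assumptions, not part of any specific tool.

        from collections import defaultdict
        from datetime import datetime, timedelta

        # Hypothetical parsed access-log entries: (timestamp, client_ip).
        requests = [
            (datetime(2023, 5, 1, 10, 0, 0), "203.0.113.7"),
            (datetime(2023, 5, 1, 10, 0, 1), "203.0.113.7"),
            (datetime(2023, 5, 1, 10, 0, 2), "203.0.113.7"),
            (datetime(2023, 5, 1, 10, 0, 3), "203.0.113.7"),
            (datetime(2023, 5, 1, 10, 0, 30), "198.51.100.4"),
        ]

        WINDOW = timedelta(seconds=10)  # look at a 10-second window
        THRESHOLD = 3                   # more pages than a human would plausibly open

        def flag_fast_clients(entries):
            """Return IPs that requested more than THRESHOLD pages within WINDOW."""
            by_ip = defaultdict(list)
            for ts, ip in entries:
                by_ip[ip].append(ts)

            flagged = set()
            for ip, stamps in by_ip.items():
                stamps.sort()
                start = 0
                for end, ts in enumerate(stamps):
                    while ts - stamps[start] > WINDOW:
                        start += 1
                    if end - start + 1 > THRESHOLD:
                        flagged.add(ip)
                        break
            return flagged

        print(flag_fast_clients(requests))  # {'203.0.113.7'}

    In practice you would feed this from your real server logs and tune the window and threshold from your own traffic patterns rather than the values assumed here.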

    Nonetheless, sophisticated bot attackers can lower their bots’ speed so that it is more in line with human speed. They do this deliberately so that your system believes the requests are coming from different, valid sources.

    Using this method, many different IP addresses are used in a botnet to gain access to websites.

    Fortunately, many companies monitor these IP addresses and collect information about them; this information is then sold as “threat intelligence”.

    Junk User Information 

    The creation of strange or unusual accounts or the use of strange emails, in combination with potential fake names or phone numbers, is an indication that your website is under attack from bots. Form-filling bots or spambots perform this strange activity of filling out forms and sending strange submissions.

    Content Scraping 

    It is more convenient and cheaper to collect data using bots than to pay for subscription feeds or high-end databases. A website may use a bot to scrape coupon sites, stealing coupons and displaying them as its own. Subscriptions to coupon feeds often cost coupon sites thousands of dollars a month.

    It is not uncommon for bots to scrape IPQS results to avoid paying for the service; even IPQS gets scraped by bots. By targeting a wide range of data, bots can gain an edge over you.

    Inconsistent Page Views

    Check your Google Analytics statistics for inconsistencies in the visits to your pages. Examine your page views, referral traffic, and average session duration; by comparing these to your normal track record, you can easily recognize bot visits and how often they occur.

    One easy and obvious sign that bots are on your site is an unusual increase in page views: when a bot enters a website, many page views can appear all at once.

    Imagine, for example, that a visitor views 3 pages per session on average, but suddenly a single visit racks up 70 page views. In that case, there’s a good chance you’re dealing with a bot.

    Increasing Bounce Rate 

    When a user leaves the website without visiting another page or performing any additional action or interaction on the page, Google considers the visit as a “bounce”.

    Another thing you can do is check your website’s average page duration and bounce rate. If the average page duration (the amount of time visitors spend on a page) declines while the bounce rate increases (visitors do not view other pages or interact with the page), this is an indication that rogue bots are visiting your site.

    Bots are usually extremely fast, so they can perform a variety of actions within just a few seconds. Bots need only seconds to crawl through your whole website and collect all of the data they require.

    Due to this, when comparing the time spent on a page (the page duration) by users, a bot will take much less time than a typical user. A bot will leave a particular website after crawling all the pages it needs. Your bounce rate could significantly increase if this is done.

    In the long run, if this bounce rate increases, your Google metrics will become distorted. If you carefully observe the inconsistencies and changes in these metrics, it can give you an easy way to identify bots visiting your site.

    Spike in Traffic from Unexpected Location 

    New users may suddenly appear from a region, country, city, or other location you are unfamiliar with, or a large share of your visitors may come from a specific place that is not known to speak your website’s native language and where none of your customers cluster. Bots can cause such situations too.

    Passive Fingerprinting

    Here we’re talking about identifiable metadata. Specific browsers, for example, send certain headers as a means of identifying themselves. Bots from unsophisticated attackers or sources do not include these identifying headers, and many basic bots even use the name of the attack tool as their identifier. In such circumstances, the header information in a bot’s requests makes it easy to identify.
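    To make the idea concrete, here is a minimal Python sketch of this kind of header check. The expected-header list and tool names are illustrative assumptions, not a definitive signature database.

        # Passive-fingerprinting sketch: flag requests whose headers are missing
        # what a normal browser sends, or whose User-Agent names an automation tool.
        KNOWN_TOOL_NAMES = ("curl", "python-requests", "scrapy", "wget")  # illustrative
        EXPECTED_BROWSER_HEADERS = ("User-Agent", "Accept", "Accept-Language")

        def looks_like_basic_bot(headers: dict) -> bool:
            """Return True if the request lacks typical browser headers
            or identifies itself as a known automation tool."""
            if any(h not in headers for h in EXPECTED_BROWSER_HEADERS):
                return True
            user_agent = headers.get("User-Agent", "").lower()
            return any(tool in user_agent for tool in KNOWN_TOOL_NAMES)

        print(looks_like_basic_bot({"User-Agent": "python-requests/2.31"}))   # True
        print(looks_like_basic_bot({
            "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64)",
            "Accept": "text/html",
            "Accept-Language": "en-US",
        }))                                                                   # False

    Sophisticated bots forge these headers, which is exactly why the active fingerprinting described next goes further.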

    Active Fingerprinting

    Web browsers pack in a lot of features and information, which makes duplicating them difficult. For a bot attacker, building a bot that can reproduce all of a browser’s features and specifications is a complicated task – often too complex to even attempt.

    In order to identify each browser, your system will send a request to each one. An approach like device fingerprinting can provide the ability to easily identify bots by using deep information about the user.

    The browser would be required to perform a task that would identify your “fingerprint” attributes – this usually occurs in the background. You can tell if it’s a bot based on the corresponding browser response.

    Using this method, a system can determine whether a request is coming from a real person or a bot. Automated machine learning algorithms can stop bot traffic while keeping false positives from negatively impacting healthy users.

    If you notice any unusual metrics or slow page loading times in addition to these requests, we recommend checking your stats a little deeper to identify whether it’s a bot attack.

    It’s also important to note that even with these indicators, a sophisticated bot attacker will try to duplicate every possible attribute of a genuine browser, so bots can fool many systems even if you cover every potential angle.

    Additionally, companies that specialize in bot detection can provide their clients with a variety of other advanced tools, techniques, and services.

    Best Practices, tools and techniques to get rid of spambots

    a. Blocking Comment Spam 

    To keep your blog and tutorials valuable, we recommend moderating away vague comments. Spam comments can be filtered using a tool called Akismet, which can be implemented through its API service.
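    As a rough sketch of how such an integration might look, the snippet below posts a comment to Akismet’s comment-check endpoint and treats a “true” response as spam. The key, site URL, and field values are placeholders; check Akismet’s own documentation for the exact parameters your setup needs.

        import requests  # third-party HTTP client (pip install requests)

        AKISMET_KEY = "your-akismet-api-key"   # placeholder
        SITE_URL = "https://www.example.com"   # placeholder

        def is_spam_comment(user_ip: str, user_agent: str, content: str) -> bool:
            """Ask Akismet's comment-check endpoint whether a comment looks like spam.
            The endpoint answers with the plain text 'true' (spam) or 'false' (ham)."""
            resp = requests.post(
                f"https://{AKISMET_KEY}.rest.akismet.com/1.1/comment-check",
                data={
                    "blog": SITE_URL,
                    "user_ip": user_ip,
                    "user_agent": user_agent,
                    "comment_type": "comment",
                    "comment_content": content,
                },
                timeout=10,
            )
            return resp.text.strip() == "true"

        # Typical use: hold the submission for moderation instead of publishing it.
        # if is_spam_comment(client_ip, client_user_agent, comment_text):
        #     hold_for_moderation()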

    b. Time-analysis of forms 

    Registration forms have only a few fields. A human needs some time to fill them out, while a bot needs almost none, and the difference between the two is easy to measure. Track the average time it takes to complete the form; if a form is submitted well below that average, check whether it’s a bot.
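    Here is a minimal sketch of that timing check, assuming you record when the form was served (for example against a session ID) and compare it with the submission time. The threshold and in-memory storage are illustrative; a real setup would persist the timestamp in the session or a signed hidden field.

        import time

        MIN_HUMAN_SECONDS = 3.0                     # assumed floor; tune from your own averages
        form_render_times: dict[str, float] = {}    # session_id -> render timestamp

        def form_served(session_id: str) -> None:
            """Record when the form was rendered for this session."""
            form_render_times[session_id] = time.time()

        def submission_is_suspicious(session_id: str) -> bool:
            """Flag submissions completed faster than a human plausibly could."""
            started = form_render_times.get(session_id)
            if started is None:
                return True                         # no recorded render: treat as suspect
            return (time.time() - started) < MIN_HUMAN_SECONDS

        # form_served("abc123") when the page is rendered;
        # submission_is_suspicious("abc123") when the POST arrives.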

    c. Geolocation-based Blocking

    Geolocation blocking restricts bots coming from a particular region of the world. Keep in mind, however, that blocking bots in one location will also block real users in the same location. It is recommended to use this method only when you think the problems from that location outweigh the benefits.

    d. Blacklisting IPs 

    This is the easiest way to block spambots, but by the time you use it the damage has already been done. By blacklisting the IP or range of IPs on the firewall, you can prevent further spam. You can also limit the number of submissions allowed from an IP address and block it once that number has been exceeded.
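    A minimal sketch of that submission-count rule might look like the following; the limit and the in-memory storage are assumptions, and in production the block would be pushed to your firewall or WAF rather than kept in a Python set.

        from collections import Counter

        MAX_SUBMISSIONS = 5                    # allowed submissions before blocking
        submission_counts: Counter = Counter()
        blocked_ips: set[str] = set()

        def register_submission(ip: str) -> bool:
            """Count a form submission; return False if it should be rejected."""
            if ip in blocked_ips:
                return False
            submission_counts[ip] += 1
            if submission_counts[ip] > MAX_SUBMISSIONS:
                blocked_ips.add(ip)            # in practice: add a firewall rule here
                return False
            return True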

    e. Web Application Firewalls 

    You can use this tool to protect yourself from XSS attacks and SQL injection. An XSS attack injects JavaScript or a block of code into a website for the targeted browser in order to manipulate its contents, with the intent of stealing data and cookies. The same holds true for SQL injection, which injects a SQL query into an application. An attacker can use this injection to bypass authentication and gain direct access to the database to perform CRUD (Create, Read, Update, and Delete) operations. This is a very serious threat, so implementing a Web Application Firewall is very important.

    f. ReCAPTCHA 

    Humans and bots interact with form fields differently, and reCAPTCHA can pick up on this, which makes it a good choice.

    g. Confirmed or Double Opt-In 

    Sign-ups to your form should be confirmed through double opt-in. In other words, when someone enters an email address into your form, you send an automatic confirmation link to that address. The user has to open their inbox, click the link, and then return to your website; this verifies both the email address and the user. The chances of a bot completing this step are very slim. Anyone who does not complete this step should not be added to your list.
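    A minimal sketch of the confirmation flow is shown below; the token storage, domain, and endpoint are placeholders for whatever your stack provides.

        import secrets

        pending_confirmations: dict[str, str] = {}   # token -> email awaiting confirmation
        confirmed_emails: set[str] = set()

        def start_signup(email: str) -> str:
            """Generate a one-time token and return the confirmation link to email out."""
            token = secrets.token_urlsafe(32)
            pending_confirmations[token] = email
            return f"https://www.example.com/confirm?token={token}"   # placeholder domain

        def confirm(token: str) -> bool:
            """Called when the link is clicked; only then is the address added to the list."""
            email = pending_confirmations.pop(token, None)
            if email is None:
                return False
            confirmed_emails.add(email)
            return True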

    Filter spam & bots in your Google Analytics Traffic

    a. In which reports can you look for spam?

    Spam usually appears in your reports as referral traffic, but it can show up in unsuspected places as well, such as a language or a page title.

    It is not uncommon for spammers to use misleading URLs that look very similar to well-known websites, or they may use unusual characters and emojis in the source name to catch your attention.

    It doesn’t matter what type of spam you find in your reports, there are 3 things you should always do:

    1. Avoid visiting suspicious URLs. Most spammers try to promote their service or sell you something, but some spammers might use malicious scripts.
    2. Never install scripts from untrusted sites; if you have installed one, remove it immediately and scan your site for malware.
    3. Make sure your Google Analytics data is clean by filtering out spam.

    For instance, you can search for the URL in quotation marks (“example.com”) if you’re not sure whether it is a real entry. Instead of opening the site, your browser will show you search results; if the site is spam, you will typically find complaints on forums or posts.

    b. Bot traffic

    Bots are automated scripts that run across the Internet and serve a variety of purposes.

    There are many kinds of bots. Those that check for copyrighted content or index your website for search engines have good intentions; those that scrape your content, for example to generate clones of it, have bad intentions.

    Both the volume of this traffic and the difficulty of identifying (and thus filtering out) it make your reports less useful.

    It’s important to note that you can block bots at the server level to prevent them from reaching your website, but editing server configuration files is not for everyone, and as I said before, there are also good bots.

    It is therefore wise to filter them in Analytics unless you’re being directly attacked and it is skewing your statistics.

    c. In which reports can you look for bot traffic?

    Bots will usually show up as Direct traffic, so to filter them out in Google Analytics you’ll need to look for patterns in other dimensions. For example, the bots of a given company commonly navigate the Internet through a single service provider.

    This will be discussed in more detail below.

    Internal traffic

    Spam is frustrating for most users, who dislike weird URLs showing up in their reports. However, spam isn’t the biggest menace to your Google Analytics reports or traffic.

    Astonishingly, it’s you!

    Despite its large negative impact, the traffic generated by people working on the site (and by their bots) is often neglected. Internal traffic is so harmful because, unlike spam, it can easily be confused with real users’ data.

    Internal traffic comes in many forms and there are many ways to manage it.

    Direct internal traffic

    Your testing team, development team, marketing and distribution teams, customer care, outsourced members – and many more. Every team member can end up visiting the blog or website for any number of reasons.

    Reports to look at for direct internal traffic.

    In Google Analytics, this traffic will generally show up as Direct by default, unless your company uses a private domain.

    Sites/tools provided by third parties

    When you or your team use management tools like Trello or Asana to manage the site, this kind of internal traffic may be generated directly by you or your team.

    It also includes traffic generated by robots doing automated tasks for you, such as Pingdom or GTmetrix.

    Some types of tools you should consider:

    • Managing projects
    • Managing social media
    • Monitoring of performance/availability
    • Tools for SEO

    What are the recommended reports to view traffic from internal third-party tools?

    Google Analytics usually tracks this traffic as Referrals.

    Development environments/staging environments

    Changes to a website usually need to be tested in a development or staging environment first. Since these environments carry the same tracking code as production, if you don’t remove it, all the testing and its results will be tracked in your Google Analytics reports.

    For development/staging environments, which reports can you look at?

    This traffic usually shows up as Direct in Google Analytics, and you can also identify it by its hostname in the hostname report.

    Sites and services that archive and cache web content

    Wayback Machine, for example, offers a view of websites from the past. Even if your site did not host those visits, you can still see them in your analytics because Wayback Machine retrieved the tracking code when it copied the website to its archives.

    It’s safe to assume that someone checking what your site looked like in 2015 doesn’t intend to buy anything from you – they’re simply curious – so this is pretty useless traffic.

    In which reports can you look for traffic from web archive sites and cache services?

    Hostname reports provide information on this traffic as well.

    Google Analytics Filters that you can use to Filter Traffic

    a. When Using Filters, Consider These Factors

    1. Generate a view without filters.

    A view without filters is highly recommended before you do anything; it will help you track the effectiveness of your filters. In addition, it serves as a backup in case things don’t work out the very first time.

    2. Verify that the permissions are set correctly.

    You can create filters only if you have edit permission at the account level; it isn’t possible with only view or property permissions.

    3. Retroactive application of filters is impossible.

    At least for now, it is not possible to clean up historical aggregated data in GA. Because of this, it is better to apply filters as soon as possible.

    4. The changes made by filters are permanent!

    If you configure a filter incorrectly (missing relevant entries, an extra space, etc.), you risk losing valuable data for all time; it is impossible to recover filtered data.

    5. Be patient.

    It can take up to twenty-four hours for a filter to start showing its effect, so be patient – although often you can see results within minutes of applying it.

    b. Types of Filters

    Default filters and custom filters are the two main types.

    Default filters are rarely used because they are limited. Custom filters use regular expressions, which makes them much more flexible.

    In the custom filters, you can select between five categories: exclude, include, lowercase/uppercase, search and replace, and advanced.

    Let’s begin with the exclude and include filters. The rest we will discuss later. 

    Regular expressions: the basics

    It’s okay to jump to the next section if you already understand how regular expressions work.

    Regular expressions (or REGEX) are patterns that can be matched with special characters within text strings. Multiple entries can be matched with these characters in one filter.

    It’s ok if you have no experience with them. The basics are all you need, and some filters will only require copying and pasting expressions I set up.

    Characters that are special in REGEX

    Although REGEX has plenty of special characters, we can concentrate on three of them as the basis for GA expressions:

    • ^ The caret: indicates the beginning of a pattern,
    • $ The dollar sign: signifies the end of a pattern,
    • | The pipe or bar: represents “OR” and is used when you begin a new pattern.

    If you are using the “|” symbol, in no case should you ever:

    • Start the expression with it
    • End the expression with it
    • Use two or more of them in a row

    You probably won’t be able to filter correctly if you do any of these.

    Using REGEX in a simple way

    Say I want a fruit salad at a restaurant where an automatic machine picks the fruit for me, and regular expressions are how I tell it what to pick.

    The fruits available in the machine are: strawberry, banana, blueberry, apple, pineapple, kiwi, and watermelon.

    I have to create a REGEX that matches all the fruits I like (strawberry, blueberry, apple, and watermelon) to make the salad. No problem! To do this, I can use the pipe symbol “|” for OR:

    • REGEX 1: blueberry|apple|watermelon|strawberry

    This expression is problematic because REGEX also counts partial matches: since pineapple contains “apple,” it would be selected too. And I dislike pineapple.

    By using the two other special characters mentioned above, I can make a precise match for apple: the caret “^” (begins here) and the dollar sign “$” (ends here). This is what it looks like:

    • REGEX 2: strawberry|blueberry|^apple$|watermelon

    In my case, the expression selects exactly what I want.

    To illustrate, let’s say fewer characters means a cheaper salad. Using REGEX, I can optimize the expression by using partial matches.

    Since strawberry and blueberry both contain the word “berry,” and none of the other fruits do, I can rewrite the expression as follows:

    • Optimized REGEX: berry|^apple$|watermelon

    The fruit salad I want now has everything I want and is cheaper as well.
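    If you want to sanity-check an expression like this outside of GA, a few lines of Python are enough; GA filters match anywhere in the field unless the pattern is anchored, which is what re.search reproduces here.

        import re

        # The optimized expression from the example above.
        pattern = re.compile(r"berry|^apple$|watermelon")

        fruits = ["strawberry", "banana", "blueberry", "apple",
                  "pineapple", "kiwi", "watermelon"]

        selected = [f for f in fruits if pattern.search(f)]
        print(selected)   # ['strawberry', 'blueberry', 'apple', 'watermelon']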

    c. Test your Filters

    Because filter changes are permanent, you need to make sure your filters and REGEX are correct before relying on them. They can be tested in three different ways:

    • Immediately after you have set up the filter, click on “Verify this filter.” It is quick and easy, but because it uses only a small sample of data, it isn’t the most accurate.
    • Online REGEX testers are a good option since they are very accurate and colourful. They also let you learn since they display every matching part and explain why it matches.
    • You can test your filter by using an in-table temporary filter in GA; the filter will be applied to your entire historical data set. You surely will not miss anything by following this method.

    It is easy to use the built-in filter verification for a simple filter or if you have experience with filters. In order to be certain your REGEX is correct, my recommendation is to build it on the online tester and then retest it with the in-table filter.

     

    d. How to Create Filters

    To avoid being repetitive, here are the standard steps involved in creating every filter described below:

    1. Visit the administration section of your Google Analytics account (the gear icon), and then go to the configuration section.
    2. Then, click the “Filters” button under the View column (master view); it will say “All Filters” – click on that
    3. To add a filter, click the red button “+Add Filter” (if you don’t see it or can only apply/remove existing filters, then you don’t have edit permissions at the account level. Ask your admin to set this up for you)
    4. Afterwards, configure each filter according to its specific settings.

    It’s highly recommended that you get familiar with the filter window so that you can improve Analytics data quality.

    Valid hostname filter (covers ghost spam and development environments)

    This is a preventative measure against:

    • Spam that appears as ghosts
    • Hostnames for development
    • Sites that scrape data
    • Sites that serve as caches and archives

    This filter blocks spam effectively. Unlike other commonly shared solutions, the hostname filter is preventative and rarely needs updating.

    The term “ghost spam” refers to spam that never actually visits your site. Google Analytics has a feature that lets data be sent directly to its servers, which under normal circumstances permits tracking from often-forgotten devices, like coffee makers or refrigerators.

    Normally you collect data from real users and GA receives it, so you get valid information. With ghost spam, the spam is sent straight to GA’s servers without your site ever being visited, so all the data it leaves behind is fake.

    Spammers abuse this feature, sending traffic to indiscriminately created tracking codes (UA-0000000-1) to simulate visits to your site.

    Consequently, these spammers have no idea who they’re targeting, which is why ghost spam always leaves a fake or (not set) hostname. That means a filter that includes only valid hostnames will exclude all ghost spam.

    Hostnames and how to find them

    We’re getting to the “tricky” part now. You must compile a list of your valid hostnames in order to create this filter.

    A valid hostname is basically anywhere your tracking code appears. You can find them in the hostname report:

    • Go to Audience > Technology > Network and change the primary dimension in the report header to Hostname.

    Your domain name should appear at least once if your Analytics are active. You may find more than one, in which case, take a look at each one and select those that apply to you.

    There are different types of hostnames available

    The good Hostnames

    •  Domain and subdomains — yourURL.com
    • Connected tools for your analytics — MailChimp or YouTube
    • Gateways for payments — Shopify and other booking systems
    • Translations — Google translate
    • Mobile optimization service (to improve speed) — Google Weblight. 

    Those not suitable for reporting (by bad, I mean useless):

    • Staging/development environments — staging.yourURL.com
    • Archive sites on the internet — web.archive.org
    • Scraping sites that don’t bother to trim any content — the scraper’s URL
    • Sometimes they will use the name of a popular website to trick you, but most times they will show their own URL. Whenever you see a URL you are unfamiliar with, ask yourself, “Is this something I manage?” If not, that hostname is not one of yours.
    • (not set) hostname — this typically comes from spam; occasionally it is related to tracking code issues.
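    Once you have your list of good hostnames, building the include expression works just like the fruit-salad example: escape the dots and join the names with “|”. The sketch below shows the idea with placeholder domains; your own list will obviously differ.

        import re

        # Placeholder list of hostnames considered valid for this property.
        valid_hostnames = ["yoururl.com", "mailchimp.com", "youtube.com", "translate.google.com"]

        # Escape the dots (a bare "." matches any character in REGEX) and join with OR.
        hostname_regex = "|".join(re.escape(h) for h in valid_hostnames)
        print(hostname_regex)
        # yoururl\.com|mailchimp\.com|youtube\.com|translate\.google\.com

        pattern = re.compile(hostname_regex, re.IGNORECASE)
        for host in ["www.yoururl.com", "web.archive.org", "(not set)"]:
            print(host, "-> kept" if pattern.search(host) else "-> excluded")
        # www.yoururl.com -> kept, web.archive.org -> excluded, (not set) -> excluded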

    Campaign source filter (crawler spam, internal third-party tools)

    The following types of traffic are blocked:

    • Crawler spam
    • Toolkits (Trello, Pingdom, Asana) used internally

    Even though these hits show up as referrals, you should use the “Campaign Source” field in the filter – not the “Referral” field.

    Spam filter for crawlers

    Crawler spam is the second most common type. Like ghost spam, it leaves a fake URL, but unlike ghost spam it actually visits your site, so it leaves a valid hostname.

    The expression is built the same way as for the hostname filter, but this time you enter the source/URL of the spammy traffic. Unlike include filters, exclude filters can be created multiple times.

    Filtering out internal third-party tools

    Keeping the crawler spam filter and the internal third-party tools filter separate is just my preference; it makes them easier to organize and keeps them easily accessible for updating.

    Configuration of “internal tools filter”:

    • Name of the filter: Remove sources from internal tools
    • Pattern for filtering: [tool source REGEX]

    Internal tools REGEX (example):

    Trello|asana|redmine

    Be careful not to filter traffic from tools inside your company that also send you visits from real users; for those, use the internal URL query filter described later instead.

    Language spam and other types of spam filters

    The first two filters will stop most spam, but spammers may resort to other methods to get around them.

    Such spam may show one of your valid hostnames together with a seemingly reputable source such as Apple or Google. Spammers have even targeted my own site this way (apparently they don’t like it much; I’m not saying everyone knows my site).

    It is not uncommon for spammers to inject their messages into page titles, keywords and even the language of the report, even if they look fine in the host and source.

    If you find spam in any of those dimensions/reports, select the corresponding field in the filter. Keep in mind that the report name doesn’t always match the filter field name:

    • Language – Language settings
    • Referral – Campaign source
    • Organic keyword — Search term
    • Service provider — ISP organization
    • Network domain – ISP domain

    Filter for direct bot traffic

    Bot traffic doesn’t leave an obvious trace the way spam does, so it’s a bit harder to filter, but you can still do it if you’re patient.

    The first step is to activate GA’s built-in bot filtering. In my opinion, it should be enabled by default.

    You’ll find it in the View Settings of the Analytics administration section; next to the currency selector is the option “Exclude all hits from known bots and spiders”.

    It would be great if this took care of all bots – a dream come true. The catch is that it only covers bots on the IAB’s list of known spiders and bots. It’s certainly a start, but far from adequate.

    In addition to the known bots, there are many non-listed ones, so you’ll have to play detective and look through your different reports for patterns of direct bot traffic until you find something that can be safely filtered without compromising your user data.

    Choose the “Direct traffic” segment from the Segment list at the top of any report to begin your bot trail search.

    After that, scan different reports for anything suspicious.

    Some reports to begin with:

    • Service provider
    • Browser version
    • Network domain
    • Screen resolution
    • Flash version
    • Country/City

    Bot activity signs

    Bots are difficult to detect, but there are some signs to look for:

    • Increased direct traffic without a natural cause
    • Older versions of software (browsers, OS, Flash)
    • They only visit the homepage (represented by the slash “/” in GA)
    • Extreme metrics:
    • Nearly 100% bounce rate,
    • Session duration close to zero,
    • One page per session,
    • 100% new users.

    This is crucial: it is usually a combination of these signals that points to bot traffic. Not all bots exhibit these characteristics or follow the same patterns, so be vigilant.

    The “Service Provider” report has been the most helpful to me in identifying bot traffic, since bots frequently use ISPs named after large corporations.

    Similar to the crawler expressions, I have also built one for ISP bots.

    Moz suggests the following bot ISP filter configuration:

    • Filter Name: Exclude bots by ISP
    • Filter Type: Custom > Exclude
    • Filter Field: ISP organization
    • Filter Pattern: [ISP provider REGEX]

    ISP provider bots REGEX (prebuilt)

    hubspot|^google\sllc$|^google\sinc\.$|alibaba\.com\sllc|ovh\shosting\sinc\.
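    If you want to see what this prebuilt expression catches before applying it, a quick check along these lines works; the sample ISP names are illustrative, and the case-insensitive flag is used here purely for convenience when testing.

        import re

        isp_bots = re.compile(
            r"hubspot|^google\sllc$|^google\sinc\.$|alibaba\.com\sllc|ovh\shosting\sinc\.",
            re.IGNORECASE,
        )

        for isp in ["Google LLC", "Google Fiber Inc.", "OVH Hosting Inc.", "Comcast Cable"]:
            print(isp, "-> excluded" if isp_bots.search(isp) else "-> kept")
        # Google LLC -> excluded, Google Fiber Inc. -> kept,
        # OVH Hosting Inc. -> excluded, Comcast Cable -> kept

    Note how the anchored entries keep “Google Fiber Inc.” (real users) while still excluding “Google LLC”, a name that bots commonly appear under.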

    Internal IP filtering

    The hostname filter took care of internal traffic from development environments, and the campaign source filter covered internal third-party tools. That still leaves other types of internal traffic to deal with.

    The last and most disruptive type is the traffic generated directly by you or someone on your team while working on the site.

    You can exclude these locations by adding a filter with their public IP addresses (not their private/internal ones).

    Places and people to filter

    • Office
    • Support
    • Home
    • Developers
    • Hotel
    • Coffee shop
    • Bar
    • Shopping mall
    • Any place where you regularly work

    You can find the public IP address of your current location by searching Google for “what is my IP address”; it may appear in either IPv4 or IPv6 form.

    Make a list of all these IP addresses, then combine them with a REGEX, just as we did with the other filters (see the sketch after the configuration below).

    • IP address expression: IP1|IP2|IP3|IP4 and so on.

    The static IP filter configuration:

    • Name of filter: Remove  internal traffic (IP)
    • Type of filter: Custom > Exclude
    • Filter Field: IP Address
    • Filter Pattern: [The IP expression]
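    One detail worth noting: the dot in an IP address is itself a REGEX special character, so it should be escaped. The sketch below assembles the expression from placeholder (documentation-range) addresses.

        import re

        # Placeholder office/home addresses from the documentation ranges.
        internal_ips = ["203.0.113.10", "198.51.100.25", "192.0.2.44"]

        # Escape the dots so "203.0.113.10" only matches that exact address,
        # then join with "|" exactly as in the other filters.
        ip_expression = "|".join(re.escape(ip) for ip in internal_ips)
        print(ip_expression)
        # 203\.0\.113\.10|198\.51\.100\.25|192\.0\.2\.44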

    An IP filter like this can lose its effectiveness in several situations:

    • You anonymize IP addresses (as required by the GDPR). GA replaces the last octet of an anonymized IP with zero, so an address like 1.23.89.99 becomes 1.23.89.0; you would have to enter that in the filter, which risks excluding IPs that aren’t your own.
    • Your ISP gives you a dynamic IP address that changes frequently. The long IPv6 addresses have made this issue more common lately.
    • Your team works from multiple locations. Nowadays, most companies don’t have a single centralized office: some people work from home or the office, others from the train or a coffee shop. Those places can still be filtered, but maintaining the IP list to exclude them becomes a challenge.
    • You or your team travel frequently. Likewise, if you or your team are constantly on the move, you won’t be able to keep the IP filters up to date.

    If more than one of the scenarios above applies to you, the “advanced internal URL query filter” below is a better choice; you might want to try it.

    Filtering internal URL queries

    IP filtering can’t exclude team members when they travel, access the site from home, or use a mobile network while at the company.

    This is where URL queries come in handy. To use this filter, simply add the query parameter “?internal” to every link your team uses to reach the site:

    • Internal newsletters
    • Management tools (Trello, Redmine)
    • Emails to colleagues
    • It can even be typed directly into the browser’s address bar

    The internal URL query filter configuration

     

    Essentially, any URL containing the query “?internal” will be excluded.

    • Filter Name: Disallow internal traffic (URL Query)
    • Filter Type: Custom > Exclude
    • Filter Field: Request URI
    • Filter Pattern: \?internal
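    Tagging the links themselves can be automated with a tiny helper like the sketch below; it deliberately handles only URLs without an existing query string, since the filter pattern above looks specifically for “?internal”.

        def tag_internal(url: str) -> str:
            """Append the '?internal' marker the filter above looks for.
            Meant only for plain URLs without an existing query string."""
            return url if "?" in url else url + "?internal"

        print(tag_internal("https://www.example.com/new-post"))
        # https://www.example.com/new-post?internal

    Links that already carry a query string would need a slightly different marker and a matching filter pattern (for example one that also accepts “&internal”).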

    This solution is perfect, for instance, when you send out an employee newsletter asking everyone to look at new posts.

    Keep in mind, though, that if users browse beyond the landing page, the following pages will still be recorded.

    Advanced internal URL query filtering

    Filtering internal traffic has never been easier than with this solution!

    This more comprehensive approach uses Google Tag Manager, cookies, and a GA custom dimension to filter internal traffic dynamically.

    Although it is significantly more time-consuming to put in place, once the solution is running it has several advantages:

    • Maintenance is not required
    • It can be used by any member of the team, no technical knowledge needed
    • You can use it anywhere
    • Suitable for all devices and browsers

    To activate the filter, you just have to add the text “?internal” to any URL of the website.

    This places a small cookie in the browser, and GA then stops recording visits from that browser.

    The user doesn’t have to keep adding “?internal”, since the cookie lasts one year (unless it is manually deleted).

    Bonus filter: Include only traffic coming from within the company

    There are times when it’s useful to know what traffic your own people generate internally – either as part of a marketing effort or simply out of curiosity.

    To handle that situation, create a new view called “Internal Traffic Only” and apply one of the internal filters from above – but only one, because a hit must match every include filter to be counted.

    Use the “advanced internal URL query” filter if you have configured it; otherwise, select one of the other options.

    The configuration is the same – you only have to change “Exclude” to “Include.”

    Conclusion

    Google Analytics will report properly when it has real and accurate data.

    Data coming in from the Internet is full of junk and artificial information if it is not filtered properly.

    Even worse, if you don’t realize the data in your reports is bogus, you’ll likely make poor or incorrect decisions about the direction of your site or business.

    About the Author

    My name’s Semil Shah, and I pride myself on being the last digital marketer you’ll ever need. Having worked internationally across agile and disruptive teams from San Francisco to London, I can help you take what you’re doing in digital to a whole new level.
