Case Study: Using Google Analytics to Combat Fraud

Nasa Globe

Data is everywhere. When you step back and look at it from a holistic view, it can be quite overwhelming. Take a step closer, and you can begin to connect the dots.

As an E-Commerce Marketing Manager at Vaughan Premier, Ltd, my team leaned heavy on the creative side. I had two to three copywriters who built product pages and created content. I had a photographer and stylist who also acted as my creative lead. I had a graphic designer who also filmed and edited video and managed social media.

They did all of this under my direction, while I took on the roles that required more technical, analytical, and managerial expertise: SEO, SEM, Web Development, etc.

In late 2016 or early 2017, the Customer Service department caught wind of a concerning trend. A few customers called in claiming they had never purchased from us, yet their cards had been charged anywhere from $400 to $700 from our company.

using google analytics to combat fraud
Photo by Jefferson Santos on Unsplash

What was happening?!?

Early Fraudulent Findings

Our early findings found that third parties were using stolen card information to place an order using that individual’s card and billing address, and shipping it to a nearby location within the same zip code.

It was unlikely to be the work of one individual, given that the zip codes spanned the country.

It was possible that some sort of organized crime was taking place, given the breadth of the situation, but also seemed that the amounts were small fries for that type of operation.

No, it was most likely somewhat random individuals following instructions from the dark web to get free product using stolen credit card information. (NOTE: This is all alleged, as we never obtained verifiable proof.)

Using Google Analytics to Combat Fraud in E-Commerce

The Marketing department sat just opposite a cubicle wall from Customer Service. Overhearing the Customer Service manager discussing the issue peaked my interest when she noted that these orders were going to different locations around the country.

At the time, I spent a few hours a day in Google Analytics. While not the most robust analytics system out there, it provided a more than adequate glimpse of customer data.

In e-commerce, you can connect your store and cart to Google Analytics, and create goals to track conversion rate, product purchases, order data, and more. Knowing this, I asked the Customer Service Manager for a list of fraudulent orders.

We knew the fraud was stemming from orders with a different shipping and billing address. If we could find other common data points in how the orders are placed, we can predict whether or not an order is fraudulent.

Patterns Emerge in the Data

What constitutes a pattern? One or multiple data points that align across multiple seemingly random events.

We knew the following simply by knowing which orders were fraudulent:

  1. Only a handful of specific high dollar value items were targeted.
  2. Fraudulent orders would be billed to one address but shipped to another in the same zip code.
  3. Orders only contained one or two unit of those products.

Alright. That’s a starting point. But for a company that shipped out 200 to 300 orders a day minimum, it was far from enough to combat the issue. We needed more.

Knowing the order numbers, I could use Google Analytics to see how those orders were placed and determine whether or not it was fraudulent. Time to connect the dots.

using google analytics to combat fraud
Photo by NASA on Unsplash

Google Analytics allowed me to matrix order data from all of the known fraudulent orders. By digging into the specifics of how these orders were placed, and other data within the dashboard, I was able to find several trends:

  • These orders had no Source / Medium attribution. This means they either visited the site directly (example: a URL copied and pasted into the browser) or their source was masked and hidden (clicking a link from a site not visible to Google).
  • These orders landed directly on the Product Page purchased. This means that it was highly targeted. Suspect orders landed on the product page, made the purchase, and left in the matter of a minute or two.
  • Orders targeted similar high dollar value items within the same product category, and they always contained just one unit.

Now we’re getting somewhere. Other commonalities existed, but the above bullets were enough to allow us to flag any order that came in for those items, and check to see if it was fraudulent.

Flagging Potential Fraudulent Orders

Knowing there’s a pattern of fraud means nothing if you cannot develop a way to combat it. Discussing the issue as an interdepartmental management team, we developed a process in hopes of flagging potentially fraudulent orders.

Interdepartmental Monitoring:

The products in question were marked with instructions for warehouse pickers. If the item was the only on an order, these employees would take it to a supervisor for verification and next steps.

e-commerce warehouse
Photo by chuttersnap on Unsplash

Warehouse workers were in the loop regarding the type of fraud and the patterns we were seeing. This level of transparency was empowering and rewarding for them when they found such an order.

Being in Google Analytics so frequently, I checked incoming orders regularly, sorted by dollar value, to see if any jumped out as possible issues.

The Customer Service manager monitored incoming orders in NetSuite, our Enterprise Resource Planning system, for the same.

Three departments, all working together, and all in the loop to combat fraudulent activity. So we had a process to flag orders that might be fraudulent. But what next?

A Process for Verification

Potentially fraudulent orders would come to the Customer Service manager, who would work with me to confirm they fit the pattern before taking next steps.

If the data points aligned, we decided we need further verification. We would contact the phone number associated with the order, and request further verification the order was legitimate.

e-commerce customer service
Photo by Berkeley Communications on Unsplash

Given the type of product we sold and the dollar values involved, customers were always happy to provide further verification to confirm their order was legitimate. Due to the types of patterns involved, most orders that had these commonalities were fraudulent.

Thriving at the Intersection of Chaos and Innovation

The owners of the company often said “We thrive at the intersection of chaos and innovation.” It was a mantra we took to heart, and it’s one I continue to hold dear to this day as a personal mantra.

Throughout my career, I have found myself at fast paced, diverse organizations. I have been in roles that require me to pivot from one task to another without hesitation or complaint. Roles that require things to be done one way one day, and another way the next. Days that are truly never the same.

It could be stepping away to devise a plan to combat fraud, creating a contingency plan for large-scale events, or assisting on the sales floor on a busy day.

In today’s professional work environment, we have to be comfortable thriving at the intersection of chaos and innovation.

More posts about work I’ve done can be found soon. See them here.

Continue Reading

Analyzing Covid Data in Texas using Tableau

2020 has been rough, there’s no denying that. With COVID and the election, there has never been a better time to really dig into data and see what trends may exist.

Earlier this year, I got quite dismayed with data reporting by many news sources. One particular article on CNN noted that (and I paraphrase) “five states now make up a third of all Covid-19 cases.”

The problem with this statement is that it was blatant click bait, as the five states listed made up a third of the US population.

That brings me to one of the biggest issues I have seen with data-based reporting in 2020: context.

Before we get started, I’ll point out that there are plenty of pre-existing sources available to see data on Covid. After the fact, for example, I stumbled upon a few (now defunct) great dashboards on Covid stats in Texas.

Additionally, for the purpose of this document, I’m not linking to my own Tableau Public account. My laptop at the time of writing is too old to run Tableau Desktop..

A Look at Texas and COVID-19

Let’s take a look at Texas, a state that ranks in the top 3 for total population within the United States. For example, the Dallas-Fort Worth-Arlington area alone has approximately 7 million people, which is roughly the same as the entire population my previous home state of Washington.

To begin, I decided to use Tableau, a data analytics tool and dashboard that’s being used in 40 or more states to report on Covid. I have spent the past several months learning the program and have already used it in several presentations at work to generate and analyze data. But until now, I haven’t used its mapping capabilities.

First, here are the data sets I pulled:

Existing dashboards, most relevant being that of the State of Texas, show totals by county. Total cases by county. Total fatalities by county. So on.

But while those numbers are representative of overall and accumulative state-wide infection and mortality counts and totals by county, I wanted to dig a bit deeper and get a better idea of estimated infection and mortality rates.

Why does this matter?

Let’s take a look at two Texas counties. As of December 16, 2020, Dallas County has reported 147,591 confirmed cases of Covid. Childress County, for comparison, has reported just 1,087 cases.

No comparison, right?

But, if we look at it from a comparative perspective, in other words infection rate, it allows us to compare the two counties.

Dallas has 147,591 confirmed cases in a population of 2.647 million, which accounts for 5.57% of the population. Childress has 1,087 cases in a population of 7,052, which accounts for a whopping 15.41% of the county’s population.*

*County population uses estimated population as of January 1, 2020 as reported on the state demographics website.

Population Density of Texas Counties

Tableau allows users to plot data on maps when certain geographic-based fields exist. It does this using Mapbox. The beauty behind this is that it allows us to visually see trends geographically.

So, the first step was to take a look at the different counties in Texas and visually see population counts by county. I called this Population Density and color coded it on a green to blue gradient with darker green being lesser-dense rural counties and darker blue being higher populated, more urban counties.

I set the darkest threshold at 150,000 and the midpoint at 15,000. There are obviously more counties with populations greater than 150,000; the aforementioned Dallas being a prime example. The midpoint was set as a rough visual with approximately half of counties showing blue (populations greater than the midpoint) and roughly half showing green (below the midpoint).

Let’s look at the chart.

Based on the visual above, we can see that the more urban counties are in the eastern side of the state with the darkest trending along the border or coast and more central. West Texas is quite rural.

Great! Now we have an idea of where the Texas population resides.

Now let’s look at confirmed cases.

Cases & Fatalities by Texas County

The two charts within this section will demonstrate the folly behind much of the Covid reporting. We will look at a map of total cases and total fatalities by Texas County.

Here are the confirmed cases plotted by county:

Now here are the fatalities plotted by county:

Remember the population density chart? Look familiar? Obviously, higher populated counties will report more cases simply because there are more people. Likewise, disproportionately high case counts will lead to much higher fatality counts.

So, how do we even the playing field a little, and compare densely populated counties with sparsely populated ones?

Infection Rate

I generated the Infection Rate as a calculated field, simply taking the last reported Confirmed Cases number per county from the Texas Covid website data export and compared it to the estimated county population as of January 1, 2020 per the Texas Demographics website.

Mortality Rate

Johns-Hopkins calls this the “Observed Case-Fatality Ratio” and reports it at 1.8% for the United States as of December 18, 2020. For this chart, I set the max threshold to be reached at 5%. Among the highest is Kenedy County in Southeast Texas at 15.38% as of December 16, 2020.

We can see that the highest rates are reported in north-central Texas and along the Louisiana border. Does this mean that individuals in these counties are more likely to die of Covid if they get it? Not necessarily, and here’s why:

If we jump back to the infection rate chart, there have been fewer reported cases in many of the counties along the LA border. This could point to unreported spread with actual case count being a lot higher than reported in these counties.

Other factors that may come into play: testing capabilities and availability within these counties, perception of Covid as a threat within these counties, behavioral components like mask-wearing, etc.

Deviations & Outliers

When looking at any data, you must poke holes in it. Only by questioning the data do you glean better insight into why outliers and deviations exist. I’m going to highlight a few questions, and provide hypotheticals as to why.

Why do some counties with extremely high infection rates have extremely low mortality rates? And vice versa: why do some states with extremely high mortality rates have Extremely low infection rates?

Data is only as good as the data we’re given.

Let’s look at Childress, which has the highest infection rate in Texas at 15.41% but an extremely low mortality rate of 0.28%. If the infections are recent, it could be that data does not yet exist on the mortality within the county yet due to many of the cases being active, or having an overwhelmed mortician who has not yet had a chance to pass along findings.

Now take a look at Kenedy in SE Texas near the Mexico border. The county has a population of just 378. Thirteen have been confirmed as having contracted Covid (3.44% of the county population) and two have died for a mortality rate of 15.38%. The high mortality rate could simply be due to the limited data set, or it could point to higher unconfirmed / unreported cases.

What else do we have to take into account?

There are a multitude of other factors we have to take into account here are a few:

  • What are the perceptions of Covid as a threat between urban and rural locations given the politicized nature of the pandemic?
  • How available is testing across all counties, and how willing is the population to get tested?
  • What are differences in reporting between counties for confirmed cases and fatalities?
  • How do the trends look over time, given we are merely accounting for accumulative county data at this point in the pandemic? Could current spikes in cases be skewing mortality rates down given an overwhelmed system?

What are your thoughts? How would you look at the data differently?

Continue Reading

If It Bleeds It Leads

We are all sharing a collective trauma right now.

I’ve heard this conveyed time and time again, as COVID-19 continues to impact our daily lives. In fact, just last week, an officer at my wife’s work succumbed to the illness.

So, the gravity of the situation is grim; I don’t want to deny that with this post whatsoever.

The purpose of this post is highlight a growing polarization I’m seeing in the media, and it doesn’t matter what political leaning a particular publication has. The conservative-leaning Fox News does it just as much as the progressive-leaning CNBC.

Fear by Omission

Today’s highlight comes from CNN–an article title prone to shock value and fear generation.

US tops 5 million Covid-19 cases, with five states making up more than 40% of tally.”

CNN (August 9, 2002)

First thing to note is that US has now topped 5 million COVID-19 cases. That’s the point of the article: yet another terrifying milestone achieved.

But that’s not part that has me on edge.

It’s the second part. The part that notes five states make up about 40% of the total cases.

The article makes an important distinction, that raised my eyebrows:

“To put the number in perspective, that means the United States has had more Covid-19 cases than Ireland has people. The number of cases is also slightly higher than the entire population of Alabama.”

The article goes on to list precisely which states: California, New York, Texas, Florida, and Georgia.

Looking at the list, something immediately jumps out: those are some highly populated states.

This is NOT uncommon. I feel like I’ve seen it again and again in politically-charged generalizations from both sides, in news articles (both titles and content) regardless of source, in opinion articles (which the media is more and more presenting as fact), and in social media memes.

Ignoring the Obvious

So, let’s run some calculations to see why this terror-inducing statement…well…shouldn’t be scary. But first…

Don’t get me wrong. the US hitting 5 million reported cases of COVID-19 IS scary, and I don’t want to discount that statement.

Wikipedia provides a list of states in descending order, ranked by population. It also includes the percent that state has of the TOTAL United States population.

Let’s do the math.

  • California has 11.96% of the US population (rank #1)
  • Texas has 8.68% of the US population (rank #2)
  • Florida has 5.91% of the US population (rank #4)
  • New York has 6.44% of the US population (rank #3)
  • Georgia has 3.18% of the US population (rank #8)

These five states make up 36.17% of the total US population, so just over 40% isn’t nearly as scary when put into that perspective.

If anything, the headline should be “five states are managing the virus a little worse than the others.”

Why This Matters

Most people don’t think in numbers. Most people don’t think critically.

I do.

My first impression likely matched that of most people: fear. Continued fear that the virus is out of control in the states, and that the state where I live is part of that unfortunate few that make up 40% of all cases.

But, given the relational perspective of population by state, 40% of all reported US-based cases stem from a subsection of 35% of the overall population of the US… well, that just makes sense!

We can make wild claims when we don’t step back and question the statistics being touted.

All media does this: left, right, moderate. We just have to be grounded and aware enough to see through the B.S.

Continue Reading