9 Ways to Connect to Web-Based CSVs in Tableau

That title is a bit of a mouthful and this may seem like a somewhat niche topic, but if you’ve ever tried to connect to a csv from the web, you know how painful it can be. There are many public services, including government data portals, that make their data publicly-available in comma-separated-values (CSV) format on their websites. We can easily download these files then connect to them in Tableau, but what if you want to regularly refresh the data on Tableau Cloud or Server? That’s when things start to get tricky. So, in this blog, I’m going to share nine methods for connecting Tableau to a web-based CSV and refreshing them in Tableau Server or Cloud.

 

For this blog, I’m going to use a feed of earthquake data provided by USGS. The USGS provides several options for downloading this data, based on time as shown below.

 


To get the path of the csv file, we can right-click and link and choose “Copy link address” (the name of this option will differ from browser to browser). For example, if we click the “All Earthquakes” option under “Past 30 Days”, we’ll get the following csv: https://earthquake.usgs.gov/earthquakes/feed/v1.0/summary/all_month.csv

 

One last item before I jump into the nine methods is to acknowledge that the examples above are all publicly-available data files. You may have a scenario where your web-based csv requires some sort of authentication. That will complicate (and possibly eliminate) some of these methods, depending on the authentication protocols used.

 

1) The Manual Way

Let’s start with the manual method. In this case, we simply manually download the csv and save it to our computer. We then either manually refresh the Tableau data source and publish to Server or Cloud or, in the case of Cloud, we could use a local Bridge client to refresh the file in a more automated fashion.

 

An alternative is to save the file to a shared network storage location. The advantage of this method is that the file can be accessed by other resources on your network. Thus, you’ll be able to refresh automatically from Server or leverage an instance of Tableau Bridge that does not reside on your computer. For more details on refreshing network-based files, see Auto-Refreshing File-Based Data Sources (note: while focused on Tableau Server, many of the lessons would also apply to Bridge).

 

So why would you want to implement such a manual process for this? Well, perhaps you only need to refresh the data occasionally. In that case, it may not be worth the additional effort to implement a more automated solution, making the manual method good enough for your use case.

 

2) Semi-Automated using Cloud Storage

One of the biggest issues with the manual method is that, if you’re using Tableau Cloud, it requires you to leverage Tableau Bridge. I personally try to avoid Bridge whenever possible as it is just another component that must be maintained, upgraded, etc. We can eliminate this by using some sort of cloud-based storage such as Google Drive, OneDrive, Amazon S3, or any other service for which Tableau has a native connector.

 

Note: For data in Amazon S3, you can either use the new S3 connector or you can connect Amazon Athena to S3 and create a sort of “virtual” database, then use the Tableau Athena connector.

 

In this scenario, we’ll still manually download the csv, but then we’ll upload that file to our cloud service. We then use the appropriate cloud connector in Tableau to connect to the file. When that data source gets published, it will be able to refresh automatically—regardless of whether you’re using Server or Cloud—because the file is accessible via the internet.

 

Note: Be careful when using desktop cloud sync applications. In some cases, these applications will change the internal identifiers of files, causing Tableau to be unable to locate the file, even though the name is the same. Before using these apps, be sure to test to make sure everything works and refreshes as expected. If you do get missing file errors during refreshes, then manually upload the files to the cloud service (via the web) instead of using sync apps.

 

3) Semi-Automated using Code

Our next method automates the downloading of the file from the web using code. Python is my go-to coding language, but you could use the language of your choice. With Python, we can easily load the csv into a dataframe then write that dataframe to a csv file:

 

 

We’d then schedule the code to run on some regular basis using a scheduling tool so that it regularly downloads updated data. From here, we use one of the previously discussed methods to refresh the data on Tableau Cloud or Server. If paired with one of the methods that automatically performs the Cloud/Server refresh, then you could create a fully-automated solution.

 

4) Fully-Automated using Code & a Database

Okay, so the manual and semi-automated methods aren’t good enough for you? Fortunately, you’ve now arrived at the fully-automated section. For the next three methods, we’re going to continue with code-based solutions and simply take the next step towards a fully-automated solution.

 

The first of these options is to write code that will download the data from the csv then push that data into a database. I haven’t written sample code for this, but my general approach is to use the sqlalchemy library to connect to a database then use the dataframe’s to_sql method to insert data.

 

Ideally, we’d use a cloud-based database, such as those offered by AWS, Azure, Google, etc. By doing this, Tableau Cloud would be able to refresh automatically without the need for Tableau Bridge. If you did, however, use an on-premise database, then Bridge would be required for refreshing. If you’re using Server, then you’ll be able to refresh easily whether the database is on-prem or cloud.

 

5) Fully-Automated using Code & Cloud Storage

If you don’t have a database available or just don’t want to deal with the stringency of databases (particularly, the need for well-defined tables, columns, data types, etc.), then you could have your code upload the downloaded file to one of the cloud storage solutions we previously mentioned.

 

This is, unfortunately, somewhat complicated due to the need to build authentication mechanisms into your code. In my experience, many cloud services use different authentication methods and none of them are particularly straightforward. Of course, I’m just an amateur Python developer so folks with more experience will likely find this easier. My recommendation is to engage one of these developers as they’ll be able to quickly develop the code needed and do so in a secure manner that will prevent any unauthorized access to your cloud environment.

 

6) Fully-Automated using Table Extensions

How about a fully-automated method that can be built right into Tableau? Last year, Tableau introduced Table Extensions, a feature that allows you to write code directly in a data model. To add a table extension, we first need to connect to a data source. In this case, we’re not going to use that data source at all, so we can connect to anything—I’ll simply connect to the Superstore data source. I’ve used the “Saved Data Source” so my next step will be to remove all tables from the data model.

 



Next, we drag “New Table Extension” into our data model. This will expose a code window.

 



It will also create a new connection called “tableau-analytics-extension”. We’re only going to use this connection, so we can close the connection to Superstore.

 

Next, we write our code. The code sample below first loads the web-based csv into a dataframe. When using Python, the table extension script must return a Python dictionary, so the final step is to convert the dataframe to a dictionary then return it.




When done with your code, click Apply, then you can start using the data source.

 

Note: Table Extensions work similarly to Live connections in that they refresh each time you open a workbook or refresh a data source.

 

While this method seems pretty easy, the one drawback is that it requires that you have an analytics extension configured (in my case, a TabPy server) and that you’ve connected to it in Tableau Desktop. You can get to the Desktop setup by going to Help | Settings and Performance | Manage Analytics Extension Connection, then specifying the hostname and port of the server. In my case, I’ve configured TabPy to run on my local computer.

 



For more information on setting up a TabPy server, see TabPy Installation Instructions.

 

7) Fully-Automated using a Web Data Connector

What if you don’t want to mess with code at all? There has to be a non-code solution, right? Yes, there is!!! The first solution was developed by Keshia Rose, a former Senior Product Manger at Tableau. If you’re not familiar with Keshia, you should be as she did incredible things at Tableau. And you may remember her from Iron Viz as she acted as co-host for several years.

 

Keshia created a web data connector (WDC) that allows you to easily extract data from a web-based csv. You simply enter your URL, click Get Data! and the WDC does the rest.

 

 

While this method is quite easy to use, there are some drawbacks. First, this is a version 2.0 connector, which requires Tableau Bridge for Tableau Cloud refreshes. For Tableau Server, on the other hand, you can add the WDC to your safe list and it will refresh automatically. Additionally, version 2.0 is slated to be eliminated at some point in the future and replaced by 3.0, so this connector won’t work indefinitely. Hopefully someone creates a similar WDC for 3.0!!

 

8) Fully-Automated using Google Sheets

For our next code-free solution, we’re going to leverage a handy function, IMPORTDATA, in Google Sheets. IMPORTDATA is specifically built for importing data from a text file (csv or tsv) via URL. So, we’ll create a new Google Sheet then type the following formulate into the first cell hit Enter.

 

=IMPORTDATA("https://earthquake.usgs.gov/earthquakes/feed/v1.0/summary/all_day.csv")

 



Google Sheets will connect to the csv file then pull in the data.

 



We can then simply use Tableau’s Google Drive connector to get our data and refresh automatically on Cloud or Server. This command will automatically run every time you open the spreadsheet, including when Tableau Cloud/Server open it for a refresh. Thus, you’ll be sure to get updated data.

 

One drawback I’ve seen with this approach is that it can’t handle large data files. For example, I tried to import the 30-day earthquake file and that resulted in an error stating “Resource at url contents exceeded maximum size.” Google Sheets has a 5 million cell limit, but this data file is much smaller than that, so I suspect that Google has some additional limits on the amount of data that can be transferred using this command. For further discussion on this topic and alternatives, see this post on Stack Overflow.

 

Another drawback I’ve seen is that the automatic update of data can often take a bit of time to complete, particularly for slower websites. In those cases, I’ve seen situations where the data isn’t yet updated in the Google Sheet when Tableau tries to refresh, leaving incomplete data. So be sure to test everything before leveraging this method.

 

9) Fully-Automated using Excel & OneDrive

Finally, we can use Excel and OneDrive to do something similar. We start by creating an Excel file one OneDrive/SharePoint. Then go to the Data menu and, in the Get & Transform Data section, choose From Web. Enter your csv into the dialog box.

 



Then click OK then click Connect then Load. The data will be loaded into an Excel “table”.

 



Once loaded, click the Query menu then edit the Properties. In the dialog box, check the option to “Refresh data when opening the file”. This will ensure that the data gets refreshed when Tableau Cloud/Server attempts to refresh it.


 

We can then simply use Tableau’s OneDrive and SharePoint Online connector to get our data and refresh automatically on Cloud or Server.

 

I haven’t used this method as much as # 8, but I suspect it has some of the same limitations. However, as far as size is concerned, Excel can handle a lot more data than Google Sheets. Unlike Google Sheets, I was able to load the full 30-day earthquake file without any problems.

 

Closing

So, there are nine methods for connecting to web-based csv files and refreshing them. I suspect that there are several additional methods for doing this, so if you have any other ideas, feel free to share them in the comments.

 

While this is a somewhat niche use case, I think it’s important to note that some of these methods could be used in other scenarios as well. For example, perhaps you have a business system that doesn’t allow you to connect to its database or doesn’t have APIs. You might have no choice but to manually download the data from that system in order to use it in Tableau. While that’s a manual process, you could leverage some of the techniques shared above to automate the refresh part of it and to avoid use of extra components such as Tableau Bridge.

 

Thanks for reading!!

 

Header image: Photo by Mika Baumeister on Unsplash

 

Ken Flerlage, February 19, 2024

Twitter | LinkedIn | GitHub | Tableau Public



No comments:

Powered by Blogger.