December 23, 2021
In this article, we are going to learn about Amazon Redshift and how to work with CSV files. We will cover several ways to import data from an S3 bucket into a Redshift cluster, as well as how to export data from Redshift back to an S3 bucket. This article is written for beginners and intermediate users and assumes that you have some basic knowledge of AWS and Python.
Modern businesses tend to generate a lot of data every day. Once the data is generated, it is required to be stored and analyzed so that strategic business decisions can be made based on the insights gained. In today’s world, where more and more organizations are shifting their infrastructure to the cloud, Amazon Web Services, also known as AWS, provides a fully managed cloud data warehousing solution, which is Amazon Redshift.
Amazon Redshift is a fully managed data warehouse on the cloud. It supports a Massively Parallel Processing (MPP) architecture, which allows users to process data in parallel. It allows users to load and transform data within Redshift and then make it available to Business Intelligence tools.
CSV files are a very common and standard flat-file format in which columns and values are separated by a comma. Reading and storing data in CSV files is very simple, and they have been used in the industry for decades. You can see a sample CSV file below.
In this article, you will learn various ways of data import/export from CSV to Redshift and vice versa.
Since CSV is one of the most popular formats for flat-file data, there are many tools and options for working with CSV files, and accordingly there are several ways in which CSV files can be imported into and exported from Redshift. You will learn about these methods in the sections that follow.
One of the most common ways to import data from a CSV file to Redshift is the native COPY command. Redshift provides a COPY command with which you can directly import data from flat files into your Redshift data warehouse. For this, the CSV file needs to be stored in an S3 bucket in AWS. S3 stands for Simple Storage Service, a service in which you can store files of any type. The following steps need to be performed in order to import data from a CSV to Redshift using the COPY command:
In the next section, you will see a few examples of using the Redshift COPY command.
First, create a cluster in Redshift, and then create the schema as per your requirements. I will use the same sample CSV schema that you've seen in the previous section. In order to create the schema in Redshift, you can simply create a table with the following command.
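The statement below is a sketch: the column names and types are hypothetical placeholders for whatever your sample CSV contains, so adjust them to match your own file.

```sql
-- Hypothetical schema for the sample CSV; adjust names and types to your file.
CREATE TABLE test.sample_csv (
    id         INTEGER,
    first_name VARCHAR(100),
    last_name  VARCHAR(100),
    email      VARCHAR(255),
    created_at DATE
);
```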
The next step is to load data into an S3 bucket which can be done by either using the AWS CLI or the web console. If your file is large, you should consider using the AWS CLI.
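With the AWS CLI, a single `aws s3 cp` command uploads the file; this assumes the CLI is installed and configured with your credentials, and the bucket name below is a placeholder for your own bucket.

```shell
# Upload the local CSV to an S3 bucket (bucket name is a placeholder).
aws s3 cp sample.csv s3://csv-redshift-221/sample.csv
```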
Now when the CSV file is in S3, you can use the COPY command in Redshift to import the CSV file. Head over to your Redshift query window and type in the following command.
COPY table_name FROM 'path_to_csv_in_s3' credentials 'aws_access_key_id=YOUR_ACCESS_KEY;aws_secret_access_key=YOUR_ACCESS_SECRET_KEY' CSV;
Once the COPY command has been executed successfully, you receive the output as in the above screen print. Now, you can query your data using a simple select statement as follows.
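For instance, assuming the data was loaded into a table named test.sample_csv, a quick query looks like this:

```sql
-- Verify the load by inspecting the first few rows.
SELECT * FROM test.sample_csv LIMIT 10;
```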
Sometimes, it might be that you do not want to import all the columns from the CSV file into your Redshift table. In that case, you can specify the columns while using the COPY command, and data only from those columns will be imported into Redshift.
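For example, to load only two columns, list them after the table name; the column names here are illustrative, and the credentials are placeholders.

```sql
-- Import only the listed columns; the remaining columns get their defaults or NULL.
COPY test.sample_csv (first_name, last_name)
FROM 's3://csv-redshift-221/sample.csv'
credentials 'aws_access_key_id=YOUR_ACCESS_KEY;aws_secret_access_key=YOUR_ACCESS_SECRET_KEY'
CSV;
```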
As you can see in the above figure, you can explicitly mention the names of the columns that need to be imported to the Redshift table.
Another important scenario while importing data from CSV to Redshift using the COPY command is that your CSV file might contain a header and you do not want to import it. In other words, you want to ignore the header from the CSV file from being imported into the Redshift table. In such a case, you need to add a specific parameter IGNOREHEADER to the COPY command and specify the number of lines to be ignored. Usually, if you just want to ignore the header, which is the first line of the CSV file, you need to provide the number as 1.
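Building on the earlier COPY statement (bucket path and credentials are placeholders), the header is skipped like this:

```sql
-- Skip the first line (the header row) of the CSV.
COPY test.sample_csv
FROM 's3://csv-redshift-221/sample.csv'
credentials 'aws_access_key_id=YOUR_ACCESS_KEY;aws_secret_access_key=YOUR_ACCESS_SECRET_KEY'
CSV
IGNOREHEADER 1;
```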
To accomplish the process in Skyvia, follow these 3 simple steps:
Select Amazon Redshift among the list of data warehouses supported by Skyvia. In the opened Redshift connection window, enter the required parameters, which are Server, Port, User ID, Password and Database. You also need to click Advanced Settings and set parameters for connecting to Amazon S3 storage service. Among them are S3 region to use and either AWS Security Token or AWS Access Key ID and AWS Secret Key. Afterwards, check whether the connection is successful and click Create. You have completed the first step and connected to Amazon Redshift.
Automate uninterrupted data movement from CSV to Redshift on a regular basis by setting a schedule for your import package. Click Schedule and enter all required parameters in the Schedule window.
The first time, we recommend running your package manually to check whether it executes successfully. If some of your columns in source and target are mapped incorrectly, you will see errors in your runs and will be able to update the mapping settings. Moreover, Skyvia can send error notifications to your email.
Python is one of the most popular programming languages in the modern data world. Almost every service on AWS can be accessed from Python, and you can easily build your integrations with it using libraries that are already available. In the following section, you will learn more about loading data from S3 to Redshift using Python.
In order to connect to Redshift from Python, you need the “psycopg2” library. It can be installed by running the following command.
pip install psycopg2
Once the library is installed, you can start with your Python program. You need to import the library into your program and then prepare the connection object. The connection object is prepared by providing the hostname of the Redshift cluster, the port on which it is running, the name of the database, and the credentials to connect to the database.
Once the connection is established, you can create a cursor that will be used while executing the query on the Redshift cluster.
In the next step, you need to provide the query that needs to be executed to load the data to Redshift from S3. This is the same query that you have executed on Redshift previously.
Once the query is prepared, the next step is to execute it. You can execute and commit the query by using the following commands:
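The steps above can be sketched in a single Python script. Everything below is a sketch under assumptions: the cluster endpoint, database name, user, password, bucket path, and credentials are all placeholders you must replace with your own values, and the helper function name is ours, not part of any library.

```python
# Sketch: load a CSV from S3 into Redshift via psycopg2.
# All connection details, bucket paths, and keys below are placeholders.

def build_copy_sql(table, s3_path, access_key, secret_key):
    """Build the same COPY statement shown earlier in the article."""
    return (
        f"COPY {table} FROM '{s3_path}' "
        f"credentials 'aws_access_key_id={access_key};"
        f"aws_secret_access_key={secret_key}' "
        "CSV IGNOREHEADER 1;"
    )

def load_csv_from_s3():
    import psycopg2  # install first with: pip install psycopg2

    # Placeholder connection details for a hypothetical cluster.
    conn = psycopg2.connect(
        host="your-cluster.abc123.us-east-1.redshift.amazonaws.com",
        port=5439,
        dbname="dev",
        user="awsuser",
        password="your_password",
    )
    try:
        cur = conn.cursor()
        cur.execute(build_copy_sql(
            "test.sample_csv",
            "s3://csv-redshift-221/sample.csv",
            "YOUR_ACCESS_KEY",
            "YOUR_ACCESS_SECRET_KEY",
        ))
        conn.commit()  # commit so the loaded rows are persisted
    finally:
        conn.close()
```

Keeping the SQL construction in its own small function makes it easy to reuse the same statement for the column-list and header-skipping variants discussed earlier.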
Now, you can go back to your Redshift cluster and check if the data has been copied from the S3 bucket to the Redshift cluster.
Like loading data from external files into Redshift, there is also an option to export data out of Redshift.
Loading data out of Amazon Redshift can be done using the UNLOAD command. You can simply select the data from Redshift and then provide a valid path to the S3 bucket to migrate the data to. You can also filter the data in the select statement and then export your data as required. Once the query is ready, use the following command to unload data from Redshift to S3:
UNLOAD ('SELECT * FROM test.sample_csv') TO 's3://csv-redshift-221/Unload_' credentials 'aws_access_key_id=YOUR_ACCESS_KEY;aws_secret_access_key=YOUR_ACCESS_SECRET_KEY' CSV;
Once the UNLOAD command is executed successfully, you can view the new file created under the S3 bucket.
The file is now available in the S3 bucket and can be downloaded and opened in any text editor.
With Skyvia, you can export data from Redshift the same way as you imported data to it. For data migration from Redshift, sign in to Skyvia, open an export package, select Redshift as source, filter data you want to export, configure other package settings, create and run the package. Don't forget to set a schedule for your package. Read more about Redshift export to CSV.
In this article, we've described several ways to import CSV to Redshift and vice versa. For those users who need data import/export from CSV on schedule, Skyvia will be of help. For more information, contact Skyvia support team.