This brief how-to demonstrates how to use the Data Virtuality parquet connector to take any data source table and create a local parquet file. The second half of this tutorial will show how to upload these files to S3.
This article references many technologies, which are explained in further detail here. The primary purpose of this post is to demonstrate how Data Virtuality can facilitate the creation and transfer of parquet files to a remote S3 repository either one time, or automatically on a schedule.
First, create the three data sources that we will need build the process.
- File Data Source
- S3 Data Source
- Parquet Data Source
The File and S3 data sources can both be added using the “Add data source” wizard. Add both of these now and supply the correct credentials information.
Take note of the “Directory Path” of the File data source, as we will use this in a minute. The S3 “Bucket Name” will point to the location of whichever bucket you are going to upload your files to. Make sure that the IAM user, you are using for the S3 data source, also has the correct permissions to create files to this bucket.
Next, we will create the parquet data source.
- The data source can be created by executing the following script.
- Full documentation on this connector can be found here.
- Note, <PATH TO MY PARQUET FOLDER> should be the same “Directory Path” that we used for the File data source created above.
call SYSADMIN.createConnection ( name => 'parquet' ,jbossCLITemplateName => 'parquet' ,connectionOrResourceAdapterProperties => 'path="//"' );; call SYSADMIN.createDataSource ( name => 'parquet' ,translator => 'parquet' ,modelProperties => null ,translatorProperties => null );;
Now, we will need some data to convert into parquet format. If you do not have any data to practice on yet, you can take advantage of Data Virtuality’s virtual layer by creating a virtual view like the one below.
ALTER VIEW "views.my_parquet_test" ("id" STRING, "my_message" STRING, "created_at" TIMESTAMP) AS SELECT CAST(1000 + ROW_NUMBER() OVER(ORDER BY 1) AS STRING), x.*, NOW() FROM ( SELECT 'Hello World!' UNION SELECT 'Welcome to my tutorial.' UNION SELECT 'What is the average air speed of a laden swallow?' ) x;;
However, any table or view from any data source that you have will suffice. Use the SELECT INTO syntax to generate the file.
SELECT "id", "my_message", "created_at" INTO "parquet.message_test" FROM "views.my_parquet_test";;
We will now use the File data source named “parquet_files” and its stored procedure “getFiles”, to retrieve the BLOB content of the parquet file. Then, we will pass the contents of the file to the S3 “saveFile” stored procedure, like so:
/*The first step retrieves the file data as a BLOB object*/
DECLARE BLOB raw_file = SELECT file FROM (call "parquet_files.getFiles"(
"pathAndPattern" => '*.parquet'/* Optional: The path and pattern of what files to return. Currently the only pattern supported is *.<ext>, which returns only the files matching the given extension at the given path. */
/*The second step takes the BLOB object and saves it to a bucket or folder on S3*/
"filePath" => 'parquet-uploads/message_test.parquet'/* Mandatory */,
"file" => raw_file/* Mandatory: The contents to save. Can be one of CLOB, BLOB, or XML */
Note, that “filePath” will create the new parquet file with the name and path specified within the bucket set up during the data source creation.