<?xml version="1.0" encoding="UTF-8"?><rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>unload Archives - Albert Nogués</title>
	<atom:link href="https://www.albertnogues.com/tag/unload/feed/" rel="self" type="application/rss+xml" />
	<link>https://www.albertnogues.com/tag/unload/</link>
	<description>Data and Cloud Freelancer</description>
	<lastBuildDate>Thu, 31 Dec 2020 10:02:21 +0000</lastBuildDate>
	<language>en-US</language>
	<sy:updatePeriod>
	hourly	</sy:updatePeriod>
	<sy:updateFrequency>
	1	</sy:updateFrequency>
	

<image>
	<url>https://www.albertnogues.com/wp-content/uploads/2020/12/cropped-cropped-AlbertLogo2-32x32.png</url>
	<title>unload Archives - Albert Nogués</title>
	<link>https://www.albertnogues.com/tag/unload/</link>
	<width>32</width>
	<height>32</height>
</image> 
	<item>
		<title>Unload data from AWS Redshift to S3 in Parquet</title>
		<link>https://www.albertnogues.com/unload-data-from-aws-redshift-to-s3-in-parquet/?utm_source=rss&#038;utm_medium=rss&#038;utm_campaign=unload-data-from-aws-redshift-to-s3-in-parquet</link>
		
		<dc:creator><![CDATA[Albert]]></dc:creator>
		<pubDate>Thu, 31 Dec 2020 09:59:14 +0000</pubDate>
				<category><![CDATA[AWS]]></category>
		<category><![CDATA[BigData]]></category>
		<category><![CDATA[Blog]]></category>
		<category><![CDATA[Cloud]]></category>
		<category><![CDATA[SQL]]></category>
		<category><![CDATA[aws]]></category>
		<category><![CDATA[parquet]]></category>
		<category><![CDATA[redshift]]></category>
		<category><![CDATA[s3]]></category>
		<category><![CDATA[snappy]]></category>
		<category><![CDATA[unload]]></category>
		<guid isPermaLink="false">http://192.168.1.40/?p=1007</guid>

					<description><![CDATA[<p>Following the previous Redshift articles, in this one I will explain how to export data from Redshift to Parquet in S3. This can be useful when we want to archive infrequently queried data so it can be queried more cheaply with Spectrum, stored in an S3 archive tier, or moved to another storage solution such as Glacier. The &#8230; </p>
<p>The post <a href="https://www.albertnogues.com/unload-data-from-aws-redshift-to-s3-in-parquet/">Unload data from AWS Redshift to S3 in Parquet</a> appeared first on <a href="https://www.albertnogues.com">Albert Nogués</a>.</p>
]]></description>
										<content:encoded><![CDATA[
<p>Following the previous Redshift articles, in this one I will explain how to export data from Redshift to Parquet in S3. This can be useful when we want to archive infrequently queried data so it can be queried more cheaply with Spectrum, stored in an S3 archive tier, or moved to another storage solution such as Glacier.</p>



<p>The first thing we need to do is modify our Redshift cluster's IAM role to allow writes to S3. We go to our cluster in the Redshift console, click on Properties, and we will see the link to the IAM role attached to the cluster. We click on it and it opens the IAM role page.</p>



<figure class="wp-block-image size-large"><img fetchpriority="high" decoding="async" width="1024" height="183" src="https://www.albertnogues.com/wp-content/uploads/2020/12/RedsiftParquet1-1024x183.png" alt="" class="wp-image-1008" srcset="https://www.albertnogues.com/wp-content/uploads/2020/12/RedsiftParquet1-1024x183.png 1024w, https://www.albertnogues.com/wp-content/uploads/2020/12/RedsiftParquet1-300x54.png 300w, https://www.albertnogues.com/wp-content/uploads/2020/12/RedsiftParquet1-768x137.png 768w, https://www.albertnogues.com/wp-content/uploads/2020/12/RedsiftParquet1-1536x275.png 1536w, https://www.albertnogues.com/wp-content/uploads/2020/12/RedsiftParquet1-336x60.png 336w, https://www.albertnogues.com/wp-content/uploads/2020/12/RedsiftParquet1.png 1628w" sizes="(max-width: 1024px) 100vw, 1024px" /></figure>



<p>Then we add an S3 policy to the role, as shown in the following pic (AmazonS3ReadOnlyAccess). One caveat: since UNLOAD writes to the bucket, a read-only policy is not enough on its own; the role also needs write permission on the target bucket (for example AmazonS3FullAccess, or a custom policy granting s3:PutObject):</p>



<figure class="wp-block-image size-large"><img decoding="async" width="329" height="143" src="https://www.albertnogues.com/wp-content/uploads/2020/12/RedsiftParquet2.png" alt="" class="wp-image-1009" srcset="https://www.albertnogues.com/wp-content/uploads/2020/12/RedsiftParquet2.png 329w, https://www.albertnogues.com/wp-content/uploads/2020/12/RedsiftParquet2-300x130.png 300w, https://www.albertnogues.com/wp-content/uploads/2020/12/RedsiftParquet2-138x60.png 138w" sizes="(max-width: 329px) 100vw, 329px" /></figure>



<p>And with this, all the required permissions are ready. The next step is to make sure we have an available S3 bucket. I&#8217;ve created one for demo purposes with a folder called parquet_exports.</p>
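<p>For a least-privilege setup, instead of a broad managed policy, a custom policy scoped to the export bucket can be attached to the role. The following is an illustrative sketch, not the exact policy used in this demo; the bucket name matches the one used below, so adjust it to your own:</p>

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": ["s3:PutObject", "s3:GetObject", "s3:ListBucket"],
      "Resource": [
        "arn:aws:s3:::albertnogues-parquet",
        "arn:aws:s3:::albertnogues-parquet/*"
      ]
    }
  ]
}
```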



<figure class="wp-block-image size-large"><img decoding="async" width="889" height="422" src="https://www.albertnogues.com/wp-content/uploads/2020/12/RedsiftParquet3.png" alt="" class="wp-image-1010" srcset="https://www.albertnogues.com/wp-content/uploads/2020/12/RedsiftParquet3.png 889w, https://www.albertnogues.com/wp-content/uploads/2020/12/RedsiftParquet3-300x142.png 300w, https://www.albertnogues.com/wp-content/uploads/2020/12/RedsiftParquet3-768x365.png 768w, https://www.albertnogues.com/wp-content/uploads/2020/12/RedsiftParquet3-126x60.png 126w" sizes="(max-width: 889px) 100vw, 889px" /></figure>



<p>To start the extraction, we will use the customer table from the previous articles. That table was also loaded from the TPC-DS test data, which came as a gzip file in S3, but it now sits inside our Redshift node. The instruction to unload the data is called <a rel="noreferrer noopener" href="https://docs.aws.amazon.com/es_es/redshift/latest/dg/r_UNLOAD.html" data-type="URL" data-id="https://docs.aws.amazon.com/es_es/redshift/latest/dg/r_UNLOAD.html" target="_blank">UNLOAD</a> <img src="https://s.w.org/images/core/emoji/17.0.2/72x72/1f642.png" alt="🙂" class="wp-smiley" style="height: 1em; max-height: 1em;" /></p>



<p>Since we want our data in Parquet + Snappy format, which is usually the recommended combination (Avro is not supported by Redshift UNLOAD, only CSV and Parquet), we need to specify it in the UNLOAD statement.</p>



<p>Contrary to Spectrum, here we can unload data to buckets in another region, so if that is your need, make sure to fill in the bucket region in the UNLOAD statement as well. The syntax is as follows:</p>



<pre class="wp-block-code"><code>UNLOAD ('<em>select-statement</em>')
TO '<em>s3://object-path/name-prefix</em>'
<em>authorization</em>
&#91; <em>option</em> &#91; ... ] ]

where <em>option</em> is
{ &#91; FORMAT &#91; AS ] ] CSV | PARQUET
| PARTITION BY ( <em>column_name</em> &#91;, ... ] ) &#91; INCLUDE ]
| MANIFEST &#91; VERBOSE ] 
| HEADER           
| DELIMITER &#91; AS ] '<em>delimiter-char</em>' 
| FIXEDWIDTH &#91; AS ] '<em>fixedwidth-spec</em>'   
| ENCRYPTED &#91; AUTO ]
| BZIP2  
| GZIP 
| ZSTD
| ADDQUOTES 
| NULL &#91; AS ] '<em>null-string</em>'
| ESCAPE
| ALLOWOVERWRITE
| PARALLEL &#91; { ON | TRUE } | { OFF | FALSE } ]
| MAXFILESIZE &#91;AS] <em>max-size</em> &#91; MB | GB ] 
| REGION &#91;AS] 'aws-region' }</code></pre>



<p>As you can see, it allows passing a SELECT statement instead of a table name, so we can project only the required columns; there is no need to export the whole table if we do not need it. We can also specify the maximum file size: with Parquet, 256 MB is usually a good split size. Finally, with Parquet make sure not to specify any compression option, or the statement will fail, as Redshift already compresses the output with Snappy by default.</p>
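<p>As a quick sanity check of what MAXFILESIZE implies, we can estimate a lower bound on the number of output files from the table size reported later by the SVV_TABLE_INFO query (~2098 MB for customer). This is only a rough sketch: the Parquet output is Snappy-compressed so its on-disk size differs, and with PARALLEL ON Redshift writes at least one file per slice anyway:</p>

```python
import math

# Rough lower bound on the number of UNLOAD output files implied by
# MAXFILESIZE. 2098 MB is the customer table size reported by
# SVV_TABLE_INFO later in this article; the actual Parquet output is
# compressed with Snappy, so treat this purely as an estimate.
table_size_mb = 2098
max_file_size_mb = 256

min_files = math.ceil(table_size_mb / max_file_size_mb)
print(min_files)  # 9
```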



<p>To run the export we also need to fetch the ARN of our Redshift role and include it just after the bucket path:</p>



<pre class="wp-block-code"><code>UNLOAD ('select * from customer')
TO 's3://albertnogues-parquet/parquet_exports/'
iam_role 'arn:aws:iam::742123541312:role/Redshift_Albertnogues.com'
FORMAT AS PARQUET
MAXFILESIZE 256 MB</code></pre>
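<p>If the archived data will later be queried with Spectrum, it can also be worth partitioning the output so queries can prune whole prefixes. A variant of the statement above, assuming for illustration that we partition by the TPC-DS c_birth_country column (pick whatever low-cardinality column fits your queries):</p>

```sql
UNLOAD ('select * from customer')
TO 's3://albertnogues-parquet/parquet_exports/'
iam_role 'arn:aws:iam::742123541312:role/Redshift_Albertnogues.com'
FORMAT AS PARQUET
PARTITION BY (c_birth_country) INCLUDE
MAXFILESIZE 256 MB
```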



<p>After about two minutes, the query finished successfully. We can go to our S3 bucket to see the Parquet files there and check that the split file size is the one we requested.</p>



<figure class="wp-block-image size-large"><img loading="lazy" decoding="async" width="1024" height="430" src="https://www.albertnogues.com/wp-content/uploads/2020/12/RedsiftParquet4-1024x430.png" alt="" class="wp-image-1011" srcset="https://www.albertnogues.com/wp-content/uploads/2020/12/RedsiftParquet4-1024x430.png 1024w, https://www.albertnogues.com/wp-content/uploads/2020/12/RedsiftParquet4-300x126.png 300w, https://www.albertnogues.com/wp-content/uploads/2020/12/RedsiftParquet4-768x322.png 768w, https://www.albertnogues.com/wp-content/uploads/2020/12/RedsiftParquet4-1536x644.png 1536w, https://www.albertnogues.com/wp-content/uploads/2020/12/RedsiftParquet4-143x60.png 143w, https://www.albertnogues.com/wp-content/uploads/2020/12/RedsiftParquet4.png 1664w" sizes="auto, (max-width: 1024px) 100vw, 1024px" /></figure>



<p>We can also check the size of the table in Redshift with the following query:</p>



<pre class="wp-block-code"><code>SELECT "table", tbl_rows, size AS size_in_MB
FROM SVV_TABLE_INFO
ORDER BY 1</code></pre>



<figure class="wp-block-table"><table><tbody><tr><td><strong>Table</strong></td><td><strong>Num Rows</strong></td><td><strong>Size (MB)</strong></td></tr><tr><td>customer</td><td>30,000,000</td><td>2098</td></tr></tbody></table></figure>



<p>So it&#8217;s quite clear the export looks OK, as the sizes are similar. We can now download one of the Parquet files and inspect it with a Parquet analysis tool. I tend to use the Python version of parquet-tools, which is based on the Apache Arrow project. You can install it with:</p>



<pre class="wp-block-code"><code>pip install parquet-tools</code></pre>



<p>And then we inspect the file with the following command:</p>



<pre class="wp-block-code"><code>parquet-tools inspect 0001_part_03.parquet</code></pre>



<figure class="wp-block-image size-large"><img loading="lazy" decoding="async" width="627" height="620" src="https://www.albertnogues.com/wp-content/uploads/2020/12/RedsiftParquet5.png" alt="" class="wp-image-1012" srcset="https://www.albertnogues.com/wp-content/uploads/2020/12/RedsiftParquet5.png 627w, https://www.albertnogues.com/wp-content/uploads/2020/12/RedsiftParquet5-300x297.png 300w, https://www.albertnogues.com/wp-content/uploads/2020/12/RedsiftParquet5-61x60.png 61w" sizes="auto, (max-width: 627px) 100vw, 627px" /></figure>



<p>And if we scroll down a little bit, we can see the total number of rows in our Parquet file:</p>



<figure class="wp-block-image size-large"><img loading="lazy" decoding="async" width="531" height="614" src="https://www.albertnogues.com/wp-content/uploads/2020/12/RedsiftParquet6.png" alt="" class="wp-image-1013" srcset="https://www.albertnogues.com/wp-content/uploads/2020/12/RedsiftParquet6.png 531w, https://www.albertnogues.com/wp-content/uploads/2020/12/RedsiftParquet6-259x300.png 259w, https://www.albertnogues.com/wp-content/uploads/2020/12/RedsiftParquet6-52x60.png 52w" sizes="auto, (max-width: 531px) 100vw, 531px" /></figure>
<p>The post <a href="https://www.albertnogues.com/unload-data-from-aws-redshift-to-s3-in-parquet/">Unload data from AWS Redshift to S3 in Parquet</a> appeared first on <a href="https://www.albertnogues.com">Albert Nogués</a>.</p>
]]></content:encoded>
					
		
		
			</item>
	</channel>
</rss>
