Athena window functions

Athena window functions DEFAULT

DML Queries, Functions, and Operators

Athena DML query statements are based on Presto 0.172 for Athena engine version 1 and Presto 0.217 for Athena engine version 2. For information about Athena engine versions, see Athena Engine Versioning.

For links to subsections of the Presto function documentation, see Presto Functions.

Athena does not support all of Presto's features, and there are some significant differences. For more information, see the topics for specific statements in this section and Considerations and Limitations.

For information about DDL statements, see DDL Statements. For a list of unsupported DDL statements, see Unsupported DDL.

Sours: https://docs.aws.amazon.com/athena/latest/ug/functions-operators-reference-section.html

Amazon Athena

Amazon Athena is an interactive query service that makes it easy to analyze data in Amazon S3 using standard SQL. Athena is serverless, so there is no infrastructure to manage, and you pay only for the queries that you run.

Athena is easy to use. Simply point to your data in Amazon S3, define the schema, and start querying using standard SQL. Most results are delivered within seconds. With Athena, there’s no need for complex ETL jobs to prepare your data for analysis. This makes it easy for anyone with SQL skills to quickly analyze large-scale datasets.

Athena is out-of-the-box integrated with Amazon Glue Data Catalog, allowing you to create a unified metadata repository across various services, crawl data sources to discover schemas and populate your Catalog with new and modified table and partition definitions, and maintain schema versioning. You can also use Glue’s fully-managed ETL capabilities to transform data or convert it into columnar formats to optimize cost and improve performance.

Sours: https://www.amazonaws.cn/en/athena/
  1. Cameron realty texas
  2. Itd cameras 95
  3. Mastering biology quizlet chapter 8
  4. Ford bronco tumbler

Window Functions#

Window functions perform calculations across rows of the query result. They run after the clause but before the clause. Invoking a window function requires special syntax using the clause to specify the window as follows:

function(args)OVER([PARTITIONBYexpression][ORDERBYexpression[ASC|DESC]][frame])

A is one of:

{RANGE|ROWS} frame_start {RANGE|ROWS} BETWEEN frame_start AND frame_end

and can be any of:

UNBOUNDEDPRECEDINGexpressionPRECEDING-- only allowed in ROWS modeCURRENTROWexpressionFOLLOWING-- only allowed in ROWS modeUNBOUNDEDFOLLOWING

The window definition has 3 components:

  • The clause separates the input rows into different partitions. This is analogous to how the clause separates rows into different groups for aggregate functions. If is not specified, the entire input is treated as a single partition.

  • The clause determines the order in which input rows will be processed by the window function. If is not specified, the ordering is undefined. Note that the ORDER BY clause within window functions does not support ordinals. You need to use actual expressions

  • The clause specifies the sliding window of rows to be processed by the function for a given input row. A frame can be type or type, and it runs from to . If is not specified, a default value of is used.

    In mode, refers specifically to the current row. In mode, refers to any peer row of the current row for the purpose of the . If no is specified, all rows are considered peers of the current row. In mode a frame start of refers to the first peer row of the current row, while a frame end of refers to the last peer row of the current row.

    Frame starts and ends of or are currently only allowed in mode. They define the start or end of the frame as the specified number of rows before or after the current row. The must be of type .

    If no frame is specified, a default frame of is used.

Examples#

The following query ranks orders for each clerk by price:

SELECTorderkey,clerk,totalprice,rank()OVER(PARTITIONBYclerkORDERBYtotalpriceDESC)ASrnkFROMordersORDERBYclerk,rnk

Aggregate Functions#

All Aggregate Functions can be used as window functions by adding the clause. The aggregate function is computed for each row over the rows within the current row’s window frame.

For example, the following query produces a rolling sum of order prices by day for each clerk:

SELECTclerk,orderdate,orderkey,totalprice,sum(totalprice)OVER(PARTITIONBYclerkORDERBYorderdate)ASrolling_sumFROMordersORDERBYclerk,orderdate,orderkey

Ranking Functions#

() → bigint#

Returns the cumulative distribution of a value in a group of values. The result is the number of rows preceding or peer with the row in the window ordering of the window partition divided by the total number of rows in the window partition. Thus, any tie values in the ordering will evaluate to the same distribution value.

() → bigint#

Returns the rank of a value in a group of values. This is similar to , except that tie values do not produce gaps in the sequence.

(n) → bigint#

Divides the rows for each window partition into buckets ranging from to at most . Bucket values will differ by at most . If the number of rows in the partition does not divide evenly into the number of buckets, then the remainder values are distributed one per bucket, starting with the first bucket.

For example, with rows and buckets, the bucket values would be as follows:

() → double#

Returns the percentage ranking of a value in group of values. The result is where is the of the row and is the total number of rows in the window partition.

() → bigint#

Returns the rank of a value in a group of values. The rank is one plus the number of rows preceding the row that are not peer with the row. Thus, tie values in the ordering will produce gaps in the sequence. The ranking is performed for each window partition.

() → bigint#

Returns a unique, sequential number for each row, starting with one, according to the ordering of rows within the window partition.

Value Functions#

Value functions provide an option to specify how null values should be treated when evaluating the function. Nulls can either be ignored () or respected (). By default, null values are respected. If is specified, all rows where the value expresssion is null are excluded from the calculation. If is specified and the value expression is null for all rows, the is returned, or if it is not specified, is returned.

(x) → [same as input]#

Returns the first value of the window.

(x) → [same as input]#

Returns the last value of the window.

(x, offset) → [same as input]#

Returns the value at the specified offset from beginning the window. Offsets start at . The offset can be any scalar expression. If the offset is null or greater than the number of values in the window, null is returned. It is an error for the offset to be zero or negative.

(x[, offset[, default_value]]) → [same as input]#

Returns the value at rows after the current row in the window. Offsets start at , which is the current row. The offset can be any scalar expression. The default is . If the offset is null or larger than the window, the is returned, or if it is not specified is returned.

(x[, offset[, default_value]]) → [same as input]#

Returns the value at rows before the current row in the window Offsets start at , which is the current row. The offset can be any scalar expression. The default is . If the offset is null or larger than the window, the is returned, or if it is not specified is returned.

Sours: https://prestodb.io/docs/current/functions/window.html
AWS re:Invent 2019: [REPEAT 1] Deep dive into Amazon Athena (ANT307-R1)

This post is a lot different from our earlier entries. Think of it as a reference flag post for people interested in a quick lookup for advanced analytics functions and operators used in modern data lake operations based on Presto. So you could, of course, use it in Presto installations, but also in some other commercial products such as AWS Athena that is used widely these days to facilitate analytic operations on Enterprise Data Lakes built on top of Amazon S3.

Without further ado, let’s dive straight into the nuts and bolts of these queries for advanced analytics:

JSON Functions

is_json_scalar(json) → boolean

  • Determines if json is a scalar (i.e. a JSON number, a JSON string, true, false or null).
  • Example:
QueryOutput
SELECT is_json_scalar(‘1’)true
SELECT is_json_scalar(‘[1, 2, 3]’)false

json_array_contains(json, value) → boolean

  • Determines if value exists in json (a string containing a JSON array)
  • Example:
QueryOutput
SELECT json_array_contains(‘[1, 2, 3]’, 2)true

json_array_get(json_array, index) → json

QueryOutput
SELECT json_array_get(‘[“a”, [3, 9], “c”]’, 0)a
SELECT json_array_get(‘[“a”, [3, 9], “c”]’, 10)null
SELECT json_array_get(‘[“c”, [3, 9], “a”]’, -2)JSON ‘[3,9]’

json_array_length(json) → bigint

  • Returns the array length of json (a string containing a JSON array)
  • Example:
QueryOutput
SELECT json_array_length(‘[1, 2, 3]’)3

json_extract(json, json_path) → json

  • Evaluates the JSONPath-like expression json_path on json (a string containing JSON) and returns the result as a JSON string
  • Example:
QueryOutput
SELECT json_extract(json_parse(‘{“email”: {“abcd”: “[email protected]”}, “phone_numbers”: [5678908, 587578575, 668798]}’), ‘$.email.abc’)[email protected]

json_extract_scalar(json, json_path) → varchar

  • Just like json_extract(), but returns the result value as a string (as opposed to being encoded as JSON). The value referenced by json_path must be a scalar (boolean, number or string)
  • Example:
QueryOutput
SELECT json_extract_scalar(json_parse(‘{“email”: {“abcd”: “[email protected]”}, “phone_numbers”: [5678908, 587578575, 9999999]}’), ‘$.phone_numbers[0]’)9999999
SELECT json_extract_scalar(json_parse(‘{“email”: {“abcd”: “[email protected]”}, “phone_numbers”: [{“home”: 5678908}, {“mob”: 587578575}, {“cell”: 9999999}]}’), ‘$.phone_numbers[1].mob’)587578575
SELECT json_extract_scalar(json_parse(‘{“email”: “[email protected]”, “phone_numbers”: [{“home”: 5678908}, {“mob”: 587578575}, {“cell”: 9999999}]}’), ‘$[email]’)[email protected]

json_parse(string) → json

  • Returns the JSON value deserialized from the input JSON text. This is an inverse function to json_format()
  • Example:
QueryOutput
SELECT json_parse(‘{“email”: {“abcd”: “[email protected]”}, “phone_numbers”: [{“home”: 5678908}, {“mob”: 587578575}, {“cell”: 9999999}]}’){“email”: {“abcd”: “[email protected]”}, “phone_numbers”: [{“home”: 5678908}, {“mob”: 587578575}, {“cell”: 9999999}]}

json_size(json, json_path) → bigint

  • Just like json_extract(), but returns the size of the value. For objects or arrays, the size is the number of members, and the size of a scalar value is zero.
  • Example:
QueryOutput
SELECT json_size(json_parse(‘{“email”: “[email protected]”, “phone_numbers”: [{“home”: 5678908}, {“mob”: 587578575}, {“cell”: 9999999}]}’), ‘$.phone_numbers’)3
SELECT json_size(‘{“x”: {“a”: 1, “b”: 2}}’, ‘$.x.a’)0

Date and Time Functions and Operators

current_date → date

  • Returns the current date as of the start of the query.

current_time → time with time zone

  • Returns the current time as of the start of the query. (with UTC)

current_timestamp → timestamp with time zone

  • Returns the current timestamp as of the start of the query.

current_timezone() → varchar

  • Returns the current time zone in the format defined by IANA (e.g., America/Los_Angeles) or as fixed offset from UTC (Example: +08:35)

date(x) → date

  • This is an alias for CAST(x AS date).
  • Example:
QueryOutput
SELECT DATE(‘2019-08-07’)SELECT CAST(‘2019-08-07’ AS DATE);

now() → timestamp with time zone

  • This is an alias for current_timestamp.
  • SELECT now();

date_trunc(unit, x) → [same as input]

  • Returns x truncated to unit.
  • Example:
QueryOutput
SELECT date_trunc(‘second’, current_timestamp)2019-08-16 06:51:29.000 UTC (returns value upto unit)
SELECT date_trunc(‘minute’, current_timestamp)2019-08-16 06:54:00.000 UTC

Interval Functions

UnitDescription
millisecondMilliseconds
secondSeconds
minuteMinutes
hourHours
dayDays
weekWeeks
monthMonths
quarterQuarters of a year
yearYear

date_add(unit, value, timestamp) → [same as input]

  • Adds an interval value of type unit to timestamp. Subtraction can be performed by using a negative value.
  • Example:
QueryOutput
SELECT date_add(‘month’, 1, current_timestamp)2019-09-16 06:59:55.425 UTC
SELECT date_add(‘day’, 1, current_timestamp)2019-08-17 07:01:26.834 UTC

date_diff(unit, timestamp1, timestamp2) → bigint

  • Returns timestamp2 – timestamp1 expressed in terms of unit.
  • Example:
QueryOutput
SELECT date_diff(‘day’, current_timestamp, date_add(‘day’, 10, current_timestamp))10

parse_duration(string) → interval

  • Parses string of format value unit into an interval, where value is fractional number of unit values
  • Example:
QueryOutput
SELECT parse_duration(‘42.8ms’)0 00:00:00.043
SELECT parse_duration(‘3.81 d’)3 19:26:24.000
SELECT parse_duration(‘5m’)0 00:05:00.000

date_format(timestamp, format) → varchar

  • Formats timestamp as a string using format (converts timestamp to string)
  • Example:
QueryOutput
SELECT date_format(current_timestamp, ‘%Y-%m-%d’)2019-08-16

date_parse(string, format) → timestamp

  • Parses string into a timestamp using format. (converts string to timestamp)
  • Example:
QueryOutput
SELECT date_parse(‘2019-08-16’, ‘%Y-%m-%d’)2019-08-16 00:00:00.000

day(x) → bigint

  • Returns the day of the month from x.
  • Example:
QueryOutput
SELECT day(date_parse(‘2019-08-16’, ‘%Y-%m-%d’))16

day_of_month(x) → bigint

  • This is an alias for day().

day_of_week(x) → bigint

  • Returns the ISO day of the week from x. The value ranges from 1 (Monday) to 7 (Sunday).
  • Example:
QueryOutput
SELECT day_of_week(date_parse(‘2019-08-16’, ‘%Y-%m-%d’))5

year(x) → bigint

Aggregate Functions

array_agg(x) → array<[same as input]>

  • Returns an array created from the input x elements
  • The array_agg() function is an aggregate function that accepts a set of values and returns an array in which each value in the input set is assigned to an element of the array.
  • Syntax:
    array_agg(expression [ORDER BY [sort_expression {ASC | DESC}], […])
  • Example:
Query
SELECT title, array_agg (first_name || ‘ ‘ || last_name) actors FROM film
SELECT title, array_agg (first_name || ‘ ‘ || last_name ORDER BY first_name) actors FROM film

avg(x) → double

  • Returns the average (arithmetic mean) of all input values

bool_and(boolean) → boolean

  • Returns TRUE if every input value is TRUE, otherwise FALSE.

bool_or(boolean) → boolean

  • Returns TRUE if any input value is TRUE, otherwise FALSE.

count(*) → bigint

  • Returns the number of input rows.

count(x) → bigint

  • Returns the number of non-null input values.

count_if(x) → bigint

  • Returns the number of TRUE input values. This function is equivalent to count(CASE WHEN x THEN 1 END).

arbitrary(x) → [same as input]

  • Returns an arbitrary non-null value of x, if one exists.
  • Arbitrary chooses one value out of a set of values. arbitrary is useful for silencing warnings about values neither grouped by or aggregated over.

max_by(x, y) → [same as x]

  • Returns the value of x associated with the maximum value of y over all input values.
  • The max_by takes two arguments and returns the value of the first argument for which the value of the second argument is maximized.
  • If multiple rows maximize the result of the second value, and arbitrary first value is chosen from among them. max_by can be used with both numeric and non-numeric data.
  • Example:
QueryOutput
SELECT max_by(close_date, close_value) as date_of_max_sale FROM sales_pipelinequery returns the date where close_value is maximum

max_by(x, y, n) → array<[same as x]>

  • Returns n values of x associated with the n largest of all input values of y in descending order of y.

min_by(x, y) → [same as x]

  • Returns the value of x associated with the minimum value of y over all input values.

min_by(x, y, n) → array<[same as x]>

  • Returns n values of x associated with the n smallest of all input values of y in ascending order of y.

max(x, n) → array<[same as x]>

  • Returns n largest values of all input values of x.

min(x) → [same as input]

  • Returns the minimum value of all input values.

min(x, n) → array<[same as x]>

  • Returns n smallest values of all input values of x.

reduce_agg(inputValue T, initialState S, inputFunction(S, T, S), combineFunction(S, S, S)) → S

  • Reduces all input values into a single value. inputFunction will be invoked for each input value. In addition to taking the input value, inputFunction takes the current state, initially initialState, and returns the new state. combineFunction will be invoked to combine two states into a new state. The final state is returned
  • Example:
QueryOutput
SELECT id, reduce_agg(value, 0, (a, b) -> a + b, (a, b) -> a + b) FROM (VALUES (1, 2) (1, 3), (1, 4), (2, 20), (2, 30), (2, 40) ) AS t(id, value) GROUP BY id(1, 9), (2, 90)

Array Functions and Operators

Subscript Operator: [ ]

  • The [ ] operator is used to access an element of an array and is indexed starting from one:
  • If a column has values like [1, 3], [45, 46]
  • Example:
QueryOutput
SELECT column[1]1, 45

Concatenation Operator: ||

  • The || operator is used to concatenate an array with an array or an element of the same type.
  • Example:
QueryOutput
SELECT ARRAY [1] || ARRAY [2][1, 2]
SELECT ARRAY [1] || 2[1, 2]
SELECT 2 || ARRAY [1][2, 1]

array_distinct(x) → array

  • Remove duplicate values from the array x.
  • Example:
QueryOutput
SELECT array_distinct(ARRAY[1, 1, 1, 2])[1, 2]

array_intersect(x, y) → array

  • Returns an array of the elements in the intersection of x and y, without duplicates.
  • Common elements in two arrays without duplicates.
  • Example:
QueryOutput
SELECT array_intersect(ARRAY[1, 1, 1, 2], ARRAY[10, 15, 1, 100])[1]

array_union(x, y) → array

  • Returns an array of the elements in the union of x and y, without duplicates.
  • Example:
QueryOutput
SELECT array_union(ARRAY[1, 1, 1, 2], ARRAY[10, 15, 1, 100])[1, 2, 10, 15, 100]

array_except(x, y) → array

  • Returns an array of elements in x but not in y, without duplicates.
  • Example:
QueryOutput
SELECT array_except(ARRAY[1, 1, 1, 2], ARRAY[10, 15, 1, 100])[2]

array_join(x, delimiter, null_replacement) → varchar

  • Concatenates the elements of the given array using the delimiter and an optional string to replace nulls.
  • Example:
QueryOutput
SELECT array_join(ARRAY[1, 71, 81, 92, null], ‘/’, ‘abcd’)1/71/81/92/abcd

array_max(x) → x

  • Returns the maximum value of input array.
  • Example:
QueryOutput
SELECT array_max(ARRAY[1, 71, 81, 92, 100])100
SELECT array_max(ARRAY[1, 71, 81, 92, null])null

array_min(x) → x

  • Returns the minimum value of input array.
  • Example:
QueryOutput
SELECT array_min(ARRAY[1, 71, 81, 92, 100])1

array_position(x, element) → bigint

  • Returns the position of the first occurrence of the element in array x (or 0 if not found).
  • Example:
QueryOutput
SELECT array_position(ARRAY[1, 71, 81, 92, 100], 100)5

array_remove(x, element) → array

  • Remove all elements that equal element from array x.
  • Example:
QueryOutput
SELECT array_remove(ARRAY[1, 71, 81, 92, 100, 1], 1)[71, 81, 92, 100]

array_sort(x) → array

  • Sorts and returns the array x. The elements of x must be orderable. Null elements will be placed at the end of the returned array.
  • Example:
QueryOutput
SELECT array_sort(ARRAY[1, null, 8, 9, 71, 81, 92, 100, 12, 51, 10, 7, 1, null])[1, 1, 7, 8, 9, 10, 12, 51, 71, 81, 92, 100, null, null]

array_sort(array(T), function(T, T, int)) → array(T)

  • Sorts and returns the array based on the given comparator function. The comparator will take two nullable arguments representing two nullable elements of the array. It returns -1, 0, or 1 as the first nullable element is less than, equal to, or greater than the second nullable element. 
  • If the comparator function returns other values (including NULL), the query will fail and raise an error.
  • Example:
QueryOutput
SELECT array_sort(ARRAY [3, 2, 5, 1, 2], (x, y) -> IF(x < y, 1, IF(x = y, 0, -1)))[5, 3, 2, 2, 1] 

cardinality(x) → bigint

  • Returns the cardinality (size) of the array x.
  • Example:
QueryOutput
SELECT cardinality(ARRAY[1, 81, 92, 100, 12, 51, 10])7

arrays_overlap(x, y) → boolean

  • Tests if arrays x and y have any any non-null elements in common. Returns null if there are no non-null elements in common but either array contains null.
  • Example:
QueryOutput
SELECT arrays_overlap(ARRAY[1, 81, 92, 100, 12, 51, 10], ARRAY[101])false
SELECT arrays_overlap(ARRAY[1, 81, 92, 100, 12, 51, 10], ARRAY[101, 51])true

concat(array1, array2, …, arrayN) → array

  • Concatenates the arrays array1, array2, …, arrayN. This function provides the same functionality as the SQL-standard concatenation operator (||).

contains(x, element) → boolean

  • Returns true if the array x contains the element.
  • Example:
QueryOutput
SELECT contains(ARRAY[1, 81, 92, 100, 12, 51, 10], 1)true
SELECT contains(ARRAY[‘abcd’, ‘test’, ‘xyz’], ‘xyz’)true

element_at(array(E), index) → E

  • SQL array indices start at 1
  • Returns element of array at given index. If index > 0, this function provides the same functionality as the SQL-standard subscript operator ([]). If index < 0, element_at accesses elements from the last to the first.
  • Example:
QueryOutput
SELECT contains(ARRAY[1, 81, 92, 100, 12, 51, 10], 2)81

filter(array(T), function(T, boolean)) → array(T)

  • Constructs an array from those elements of array for which function returns true.
  • Example:
QueryOutput
SELECT filter(ARRAY [5, -6, NULL, 7],  x -> x > 0)[5, 7]

flatten(x) → array

  • Flattens an array(array(T)) to an array(T) by concatenating the contained arrays.

reduce(array(T), initialState S, inputFunction(S, T, S), outputFunction(S, R)) → R

  • Returns a single value reduced from array. inputFunction will be invoked for each element in array in order. 
  • In addition to taking the element, inputFunction takes the current state, initially initialState, and returns the new state. outputFunction will be invoked to turn the final state into the result value.
  • It may be the identity function (i -> i).
  • Example:
QueryOutput
SELECT reduce(ARRAY [5, 20, NULL, 50], 0, (s, x) -> s + COALESCE(x, 0), s -> s)75
SELECT reduce(ARRAY [5, 20, NULL, 50], 1, (s, x) -> s * COALESCE(x, 1), s -> s)5000

repeat(element, count) → array

  • Repeat element for count times

reverse(x) → array

  • Returns an array which has the reversed order of array x.
  • Example:
QueryOutput
SELECT reverse(ARRAY[2, 5])[5, 2]

sequence(start, stop) → array(bigint)

  • Generate a sequence of integers from start to stop, increments by 1 if start is less than or equal to stop, otherwise decrements by 1.
  • Example:
QueryOutput
SELECT sequence(2, 7)[2, 3, 4, 5, 6, 7]
SELECT sequence(7, 1)[7, 6, 5, 4, 3, 2, 1]

sequence(start, stop, step) → array(bigint)

  • Generate a sequence of integers from start to stop, incrementing by step.

sequence(start, stop) → array(date)

  • Generate a sequence of dates from start date to stop date, incrementing by 1 day if start date is less than or equal to stop date, otherwise -1 day.

sequence(start, stop, step) → array(date)

  • Generate a sequence of dates from start to stop, incrementing by step. The type of step can be either INTERVAL DAY TO SECOND or INTERVAL YEAR TO MONTH.

shuffle(x) → array

  • Generate a random permutation of the given array x.
  • Example:
QueryOutput
SELECT shuffle(ARRAY[1, 7, 2])[7, 1, 2]

slice(x, start, length) → array

  • Subsets array x starting from index start (or starting from the end if start is negative) with a length of length.
  • Example:
QueryOutput
SELECT slice(ARRAY[1, 7, 2, 87, 12, 09], 2, 4)[7, 2, 87, 12]

transform(array(T), function(T, U)) → array(U)

  • Returns an array that is the result of applying function to each element of array
  • Example:
QueryOutput
SELECT transform(ARRAY [5, NULL, 6], x -> COALESCE(x, 0) + 1)[6, 1, 7]
SELECT transform(ARRAY [‘x’, ‘abc’, ‘z’], x -> x || ‘0’)[‘x0’, ‘abc0’, ‘z0’]

zip(array1, array2[, …]) → array(row)

  • Merges the given arrays, element-wise, into a single array of rows. The M-th element of the N-th argument will be the N-th field of the M-th output element. If the arguments have an uneven length, missing values are filled with NULL.
  • Example:
QueryOutput
SELECT zip(ARRAY[1, 2], ARRAY[‘1b’, null, ‘3b’])[ROW(1, ‘1b’), ROW(2, null), ROW(null, ‘3b’)]

zip_with(array(T), array(U), function(T, U, R)) → array(R)

  • Merges the two given arrays, element-wise, into a single array using function. If one array is shorter, nulls are appended at the end to match the length of the longer array, before applying function:
  • Example:
QueryOutput
SELECT zip_with(ARRAY[1, 3, 5], ARRAY[‘a’, ‘b’, ‘c’], (x, y) -> (y, x))[ROW(‘a’, 1), ROW(‘b’, 3), ROW(‘c’, 5)]

Window functions

Window functions perform calculations across rows of the query result. They run after the HAVING clause but before the ORDER BY clause. Invoking a window function requires special syntax using the OVER clause to specify the window. A window has three components:

  1. The partition specification, which separates the input rows into different partitions. This is analogous to how the GROUP BY clause separates rows into different groups for aggregate functions.
  2. The ordering specification, which determines the order in which input rows will be processed by the window function.
  3. The window frame, which specifies a sliding window of rows to be processed by the function for a given row. If the frame is not specified, it defaults to RANGE UNBOUNDED PRECEDING, which is the same as RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW. This frame contains all rows from the start of the partition up to the last peer of the current row.

rank() → bigint

  • Returns the rank of a value in a group of values. The rank is one plus the number of rows preceding the row that are not peer with the row. Thus, tie values in the ordering will produce gaps in the sequence. The ranking is performed for each window partition.
  • As shown below, the rank function produces a numerical rank within the current row’s partition for each distinct ORDER BY value, in the order defined by the ORDER BY clause. rank needs no explicit parameter, because its behavior is entirely determined by the OVER clause.
  • Query:
 rowsvaluerank
1namea1
2namea1
3namea1
4nameb4
5nameb4
6namec6
7namec6
8named8
9namee9

row_number() → bigint

  • returns a unique, sequential number for each row, starting with one, according to the ordering of rows within the window partition.
  • Query:
 rowsvaluerow_num
1namea1
2namea2
3namea3
4nameb4
5nameb5
6namec6
7namec7
8named8
9namee9

dense_rank() → bigint

  • returns the rank of a value in a group of values. This is similar to rank(), except that tie values do not produce gaps in the sequence.
  • Query: 
 rowsvaluedense_ran
1namea1
2namea1
3namea1
4nameb2
5nameb2
6namec3
7namec3
8named4
9namee5

Map Functions and Operators

map(array(K), array(V)) → map(K, V)

  • Returns a map created using the given key/value arrays.
  • Example:
QueryOutput
SELECT map(ARRAY[1,3], ARRAY[2,4]) AS a{1=2, 3=4}
SELECT map(ARRAY[‘k1′,’k2’], ARRAY[‘value1’, ‘value2’]){k1=value1, k2=value2}

Subscript Operator: [ ]

  • The [ ] operator is used to retrieve the value corresponding to a given key from a map.
  • Example:
QueryOutput
SELECT map(ARRAY[1,3], ARRAY[2,4]) AS a{1=2, 3=4}
SELECT a[1] FROM (SELECT map(ARRAY[1,3], ARRAY[2,4]) AS a)2

element_at(map(K, V), key) → V

  • Returns value for given key, or NULL if the key is not contained in the map.
  • Example:
Query
SELECT element_at(a, ‘k1’) FROM (SELECT map(ARRAY[‘k1′,’k2’], ARRAY[‘value1’, ‘value2’]) AS a)

cardinality(x) → bigint

  • Returns the cardinality (size) of the map x.

map() → map<unknown, unknown>

  • Returns an empty map.
  • Example:
QueryOutput
SELECT map(){}

map_from_entries(array(row(K, V))) → map(K, V)

  • Returns a map created from the given array of entries.
  • Example:
QueryOutput
SELECT map_from_entries(ARRAY[(1, ‘x’), (2, ‘y’)]){1 -> ‘x’, 2 -> ‘y’}

map_agg(key, value) → map(K, V)

  • Returns a map created from the input key/value pairs.

multimap_from_entries(array(row(K, V))) → map(K, array(V))

  • Returns a multimap created from the given array of entries. Each key can be associated with multiple values.
  • Example:
QueryOutput
SELECT multimap_from_entries(ARRAY[(1, ‘x’), (2, ‘y’), (1, ‘z’)]){1 -> [‘x’, ‘z’], 2 -> [‘y’]}

map_concat(map1(K, V), map2(K, V), …, mapN(K, V)) → map(K, V)

  • Returns the union of all the given maps. If a key is found in multiple given maps, that key’s value in the resulting map comes from the last one of those maps.

map_filter(map(K, V), function(K, V, boolean)) → map(K, V)

  • Constructs a map from those entries of map for which function returns true.
  • Example:
QueryOutput
SELECT map_filter(MAP(ARRAY[], ARRAY[]), (k, v) -> true){}
SELECT map_filter(MAP(ARRAY[10, 20, 30], ARRAY[‘a’, NULL, ‘c’]), (k, v) -> v IS NOT NULL){10 -> a, 30 -> c}

map_keys(x(K, V)) → array(K)

  • Returns all the keys in the map x.
  • Example:
QueryOutput
SELECT map_keys(map(ARRAY[‘k1′,’k2’], ARRAY[‘value1’, ‘value2’]))[k1, k2]

map_values(x(K, V)) → array(V)

  • Returns all the values in the map x.
  • Example:
QueryOutput
SELECT map_values(map(ARRAY[‘k1′,’k2’], ARRAY[‘value1’, ‘value2’]))[value1, value2]

transform_keys(map(K1, V), function(K1, V, K2)) → map(K2, V)

  • Returns a map that applies function to each entry of map and transforms the keys.
  • Example:
QueryOutput
SELECT transform_keys(MAP(ARRAY[], ARRAY[]), (k, v) -> k + 1){}
SELECT transform_keys(MAP(ARRAY [1, 2, 3], ARRAY [‘a’, ‘b’, ‘c’]), (k, v) -> k + 1){2 -> a, 3 -> b, 4 -> c}
SELECT transform_keys(MAP(ARRAY [‘a’, ‘b’], ARRAY [1, 2]), (k, v) -> k || CAST(v as VARCHAR)){a1 -> 1, b2 -> 2}

transform_values(map(K, V1), function(K, V1, V2)) → map(K, V2)

  • Returns a map that applies function to each entry of map and transforms the values.
  • Example:
QueryOutput
SELECT transform_values(MAP(ARRAY[], ARRAY[]), (k, v) -> v + 1){}
SELECT transform_values(MAP(ARRAY [1, 2, 3], ARRAY [10, 20, 30]), (k, v) -> v + k){1 -> 11, 2 -> 22, 3 -> 33}

Thanks for the read. Please feel free to reach out with your comments.

This story is authored by PV Subbareddy.Subbareddy is a Big Data Engineer specializing on Cloud Big Data Services and Apache Spark Ecosystem.

Sours: http://blog.zenof.ai/advanced-analytics-presto-functions-and-operators-quick-review/

Functions athena window

Presto Functions in Amazon Athena

Athena supports some, but not all, of Presto's functions and features. For information, see Considerations and Limitations. For a list of the time zones that can be used with the operator, see Supported Time Zones.

Athena engine version 2

Athena engine version 2 is based on Presto 0.217. For information about related functions, operators, and expressions, see Presto 0.217 Functions and Operators and the following specific sections from the Presto documentation. For the geospatial functions in Athena engine version 2, see Geospatial Functions in Athena engine version 2.

Athena engine version 1

Athena engine version 1 is based on Presto 0.172. For information about related functions, operators, and expressions, see Presto 0.172 Functions and Operators and the following specific sections from the Presto documentation. For the geospatial functions in Athena engine version 1, see Geospatial Functions in Athena engine version 1.

Sours: https://docs.aws.amazon.com/athena/latest/ug/presto-functions.html
#19 Window functions in Redshift

Athena SQL basics – How to write SQL against files

The key difference, unlike traditional SQL queries that run against tables in a database Amazon Athena runs against files. Athena can analyze structured, unstructured and semi-structured data stored in an S3 bucket. It can read Apache Web Logs and data formatted in JSON, ORC, Parquet, TSV, CSV and text files with custom delimiters.

Secondly, Amazon Athena does not store the data being analyzed. Athena does have the concept of databases and tables, but they store metadata regarding the file location and the structure of the data.

Thirdly, Amazon Athena is serverless, which means provisioning capacity,  scaling, patching, and OS maintenance is handled by AWS. And finally, Athena executes SQL queries in parallel, which means faster outputs.

DDL stands for Data Definition Language, and is a part of the Structured Query Language (SQL) class. DDL statements are generally used to create or modify the structural metadata of the actual data. In Amazon Athena, objects such as Databases, Schemas, Tables, Views and Partitions are part of DDL.

Athena SQL DDL is based on Hive DDL, so if you have used the Hadoop framework, these DDL statements and syntax will be quite familiar. Key point to note, not all Hive DDL statements are supported in Amazon Athena SQL. This is because data in Athena is stored externally in S3, and not in a database. For example, DDL statements related to INDEXES, ROLES, LOCKS, IMPORT, EXPORT and COMMIT are not supported in Athena SQL.

The table below lists the 24 DDL statements supported in Athena SQL. For details on Athena DDL syntax, usage and parameters click here.

Sours: https://www.obstkel.com/amazon-athena-sql

You will also like:

Data tutorial: Using Window Functions

What are Window Functions?

Window functions enable users to perform calculations against partitions (i.e. subgroups or sections) of a result set, typically a table or the results from another query. Unlike traditional aggregation functions, which return only a single value for each group defined in the query, window functions return a single value for each input row.

How are window functions helpful?

Window functions increase the efficiency and reduce the complexity of queries that analyze partitions (windows) of a data set by providing an alternative to more complex SQL concepts, e.g. derived queries. Common use cases include:

  • Ranking results within a specific window (e.g. per-group ranking)
  • Accessing data from another row in a specific window (e.g. period-over-period reporting)
  • Aggregation within a specific window (e.g. running totals):

How to use Window Functions

Window functions are initiated with the clause, and are configured using three concepts:

  • window partition () - groups rows into partitions
  • window ordering () - defines the order or sequence of rows within each window
  • window frame () - defines the window by use of an offset from the specified row

Syntax Overview

Definitions

  • expression_list: comma separated list of expressions, typically column names
  • order_list: comma separated list of expressions, typically column names
  • frame_clause: define the offset using , , , ,

Examples

Ranking: per-group ranking

Previous Row: period-over-period analysis

Aggregation Within a Window: running totals

Documentation & Function List:

Please note that subtle differences in usage and syntax exist from database to database. For information specific to your implementation, please use the source-specific documentation provided below.

Postgres

Redshift

Oracle

SQLServer

BigQuery

Sours: https://chartio.com/resources/tutorials/using-window-functions/


4495 4496 4497 4498 4499