Spark SQL Reference（Spark SQL 参考指南）¶

This section covers some key differences between writing Spark SQL data transformations and other types of SQL queries. It also contains a list of the available Spark SQL functions.

:::callout{theme="success" title="Tip"} As you’re looking for resources, keep in mind that Spark SQL is actually based on the HiveQL dialect. You can find more information online about HiveQL. :::

Getting started¶

Basic query format¶

Each SQL data transformation query must create a table. The general format for your SQL query is:

CREATE TABLE _____ AS SELECT _____

Do not include semicolons at the end of your statements. Including semicolons will result in an error.

Comment syntax¶

You can include comments in your SQL code like this:

-- You can create comments using
-- this syntax

/* You can also create comments using
   this syntax */

Referencing datasets¶

To reference a dataset, provide the dataset path surrounded by back-ticks:

CREATE TABLE `/path/to/target/dataset` AS
SELECT * FROM `/path/to/source/dataset`

-- Alternative syntax

CREATE TABLE `/path/to/target/dataset` AS (
    SELECT * FROM `/path/to/source/dataset`
)

Note that dataset names are case-sensitive.

Referencing columns¶

To reference a specific column from a dataset, provide the column name:

SELECT Name FROM `/path/to/source/dataset`

Note that column names are case-sensitive.

Derived columns & aliases¶

A derived column is the result of calling a function on your dataset column(s). You must define an alias for any derived column:

SELECT Sum(Val) AS Total FROM `/path/to/source/dataset`

The following query will result in an error:

SELECT Sum(Val) FROM `/path/to/source/dataset`

Note that alias names are case-sensitive.

Derived columns in SQL clauses¶

You cannot use an alias within the following SQL clauses: WHERE and GROUP BY. Thus, you must refer to the actual function and dataset column(s) in a WHERE or GROUP BY clause:

SELECT Lower(Name) AS LowercaseName
FROM `/path/to/source/dataset` WHERE Lower(Name) = "sara"

SELECT Lower(Name) AS LowercaseName, Sum(Val) AS Total
FROM `/path/to/source/dataset` GROUP BY Lower(Name)

You can use an alias within the following SQL clauses: ORDER BY and HAVING. Thus, any of the following queries will work:

-- Both of the following queries are valid

SELECT Lower(Name) AS LowercaseName, Sum(Val) AS Total
FROM `/path/to/source/dataset` GROUP BY Lower(Name) ORDER BY Total

SELECT Lower(Name) AS LowercaseName, Sum(Val) AS Total
FROM `/path/to/source/dataset` GROUP BY Lower(Name) ORDER BY Sum(Val)

-- Both of the following queries are valid

SELECT Lower(Name) AS LowercaseName, Sum(Val) AS Total
FROM `/path/to/source/dataset` GROUP BY Lower(Name) HAVING Total > 100

SELECT Lower(Name) AS LowercaseName, Sum(Val) AS Total
FROM `/path/to/source/dataset` GROUP BY Lower(Name) HAVING Sum(Val) > 100

Type conversion¶

You can cast an expression to convert it from one type to another. The syntax for casting is:

CAST(expr AS <TYPE>)

Note that expr represents the expression you want to cast, and <TYPE> represents what you type you want to convert your expression to. If CAST(expr` `AS` `<TYPE>) does not succeed, it will return null. The available values for <TYPE> are:

boolean
tinyint
smallint
int
bigint
float
double
decimal
date
timestamp
binary
string

Formatting dates¶

You may want to reformat your date values or convert your strings into date format. A date has the form yyyy-MM-dd and doesn't have a time component. You can use the CAST function along with the available datetime functions to convert strings to dates. Here are some quick examples:

-- Convert a string with the format 'yyyy-MM-dd' to a date
CAST('2016-07-30' AS DATE)

-- Convert a string with the format 'yyyy-MM-dd' to a timestamp (the timestamp will be based on midnight of the date provided)
CAST('2016-07-30' AS TIMESTAMP)

-- Convert a date string with the format 'ddMMyyyy' to a date
TO_DATE(CAST(UNIX_TIMESTAMP('07302016', 'MMddyyyy') AS TIMESTAMP))

-- Convert a timestamp string with the format 'ddMMyyyy HH:mm:ss' to just a date
CAST('2016-07-30 11:29:27' AS DATE)

-- Extract just the date from an ISO 8601 timestamp
TO_DATE('2016-07-30T11:29:27.000+00:00')

-- Extract just the date from a timestamp with the format 'yyyy-MM-dd HH:mm:ss'
TO_DATE('2016-07-30 11:29:27')

Note that you cannot cast a string as a date if the string is not formatted as a date/timestamp. Thus, something like CAST('20160730' AS DATE) will return null.

Aggregate functions¶

Function	Description
APPROX_COUNT_DISTINCT	Returns the approximate number of distinct items in a group.
AVG	Returns the average of the values in a group.
COLLECT_LIST	Returns a list of objects with duplicates.
COLLECT_SET	Returns a set of objects with duplicate elements eliminated.
CORR	Returns the Pearson Correlation Coefficient for two columns.
COUNT	Returns the number of items in a group.
COVAR_POP	Returns the population covariance for two columns.
COVAR_SAMP	Returns the sample covariance for two columns.
FIRST	Returns the first value in a group.
GROUPING	Indicates whether a specified column in a GROUP BY list is aggregated or not.
KURTOSIS	Returns the kurtosis of the values in a group.
LAST	Returns the last value in a group.
MAX	Returns the maximum value of the expression in a group.
MEAN	Returns the average of the values in a group.
MIN	Returns the minimum value of the expression in a group.
SKEWNESS	Returns the skewness of the values in a group.
STDDEV	Alias for stddev_samp.
STDDEV_POP	Returns the population standard deviation of the expression in a group.
STDDEV_SAMP	Returns the sample standard deviation of the expression in a group.
SUM	Returns the sum of all values in the expression.
VARIANCE	Returns the unbiased variance of the values in a group.
VAR_POP	Returns the population variance of the values in a group.
VAR_SAMP	Returns the unbiased variance of the values in a group.

`APPROX_COUNT_DISTINCT`¶

Description: Aggregate function: returns the approximate number of distinct items in a group. The max estimation error allowed defaults to 0.05, unless rsd is provided.

Signature:

APPROXCOUNTDISTINCT(Column e)
APPROXCOUNTDISTINCT(Column e, double rsd)

`AVG`¶

Description: Aggregate function: returns the average of the values in a group.

Signature:

AVG(Column e)

`COLLECT_LIST`¶

Description: Aggregate function: returns a list of objects with duplicates.

Signature:

COLLECT_LIST(Column e)

`COLLECT_SET`¶

Description: Aggregate function: returns a set of objects with duplicate elements eliminated.

Signature:

COLLECT_SET(Column e)

`CORR`¶

Description: Aggregate function: returns the Pearson Correlation Coefficient for two columns.

Signature:

CORR(Column column1, Column column2)

`COUNT`¶

Description: Aggregate function: returns the number of items in a group.

Signature:

COUNT(Column e)

`COVAR_POP`¶

Description: Aggregate function: returns the population covariance for two columns. Signature:

COVAR_POP(Column column1, Column column2)

`COVAR_SAMP`¶

Description: Aggregate function: returns the sample covariance for two columns.

Signature:

COVAR_SAMP(Column column1, Column column2)

`FIRST`¶

Description: Aggregate function: returns the first value in a group. By default, returns the first value it sees. If ignoreNulls is set to true, it will return the first non-null value it sees. If all values are null, it returns null.

Signature:

FIRST(Column e)FIRST(Column e, boolean ignoreNulls)

`GROUPING`¶

Description: Aggregate function: indicates whether a specified column in a GROUP BY list is aggregated or not. Returns 1 for aggregated or 0 for not aggregated in the result set.

Signature:

GROUPING(Column e)

`KURTOSIS`¶

Description: Aggregate function: returns the kurtosis of the values in a group.

Signature:

KURTOSIS(Column e)

`LAST`¶

Description: Aggregate function: returns the last value in a group. By default, returns the last value it sees. If ignoreNulls is set to true, it will return the last non-null value it sees. If all values are null, it returns null.

Signature:

LAST(Column e)LAST(Column e, boolean ignoreNulls)

`MAX`¶

Description: Aggregate function: returns the maximum value of the expression in a group.

Signature:

MAX(Column e)

`MEAN`¶

Description: Aggregate function: returns the average of the values in a group.

Signature:

MEAN(Column e)

`MIN`¶

Description: Aggregate function: returns the minimum value of the expression in a group.

Signature:

MIN(Column e)

`SKEWNESS`¶

Description: Aggregate function: returns the skewness of the values in a group.

Signature:

SKEWNESS(Column e)

`STDDEV`¶

Description: Aggregate function: alias for stddev_samp.

Signature:

STDDEV(Column e)

`STDDEV_POP`¶

Description: Aggregate function: returns the population standard deviation of the expression in a group.

Signature:

STDDEV_POP(Column e)

`STDDEV_SAMP`¶

Description: Aggregate function: returns the sample standard deviation of the expression in a group.

Signature:

STDDEV_SAMP(Column e)

`SUM`¶

Description: Aggregate function: returns the sum of all values in the expression.

Signature:

SUM(Column e)

`VARIANCE`¶

Description: Aggregate function: returns the unbiased variance of the values in a group. Alias for the VAR_SAMP function.

Signature:

VARIANCE(Column e)

`VAR_POP`¶

Description: Aggregate function: returns the population variance of the values in a group.

Signature:

VAR_POP(Column e)

`VAR_SAMP`¶

Description: Aggregate function: returns the unbiased variance of the values in a group.

Signature:

VAR_SAMP(Column e)

String functions¶

Function	Description
ASCII	Computes the numeric value of the first character of the string column, and returns the result as an int column.
BASE64	Computes the BASE64 encoding of a binary column and returns it as a string column.
CONCAT	Concatenates multiple input string columns together into a single string column.
DECODE	Computes the first argument into a string from a binary using the provided character set.
ENCODE	Computes the first argument into a binary from a string using the provided character set.
FORMAT_NUMBER	Formats numeric column to a format like ‘#,###,###.##’ that’s rounded to `d` decimal places.
GET_JSON_OBJECT	Extracts json object from a json string based on json path specified, and returns json string of the extracted json object.
INSTR	Locate the position of the first occurrence of substring in the given string column.
JSON_TUPLE	Creates a new row for a json column according to the given field names.
LENGTH	Computes the length of a given string or binary column.
LEVENSHTEIN	Computes the Levenshtein distance of the two given string columns.
LOWER	Converts a string column to lower case.
LPAD	Left pad the string column with pad to a length of len.
LTRIM	Trim the spaces from left end for the specified string value.
REGEXP_EXTRACT	Extract a specific group matched by a Java regex, from the specified string column.
REGEXP_REPLACE	Replace all substrings of the specified string value that match regexp with rep.
REPEAT	Repeats a string column n times, and returns it as a new string column.
REVERSE	Reverses the string column and returns it as a new string column.
RPAD	Right pad the string column with pad to a length of len.
RTRIM	Trim the spaces from right end for the specified string value.
SOUNDEX	Return the soundex code for the specified expression.
SPLIT	Splits str around pattern (pattern is a regular expression).
SUBSTRING	Returns a substring for the string or binary type column.
SUBSTRING_INDEX	Returns the substring from the string before count occurrences of the delimiter.
TRANSLATE	Translate any character in the src by a character in replaceString.
TRIM	Trim the spaces from both ends for the specified string column.
UNBASE64	Decodes a BASE64 encoded string column and returns it as a binary column.
UNHEX	Inverse of hex.
UPPER	Converts a string column to upper case.

`ASCII`¶

Description: Computes the numeric value of the first character of the string column, and returns the result as an int column.

Signature:

ASCII(Column e)

`BASE64`¶

Description: Computes the BASE64 encoding of a binary column and returns it as a string column.

Signature:

BASE64(Column e)

`CONCAT`¶

Description: Concatenates multiple input string columns together into a single string column.

Signature:

CONCAT(Column... exprs)

`DECODE`¶

Description: Computes the first argument into a string from a binary using the provided charset, which is one of the following:

‘US-ASCII’
‘ISO-8859-1’
‘UTF-8’
‘UTF-16BE’
‘UTF-16LE’
‘UTF-16’

If either argument is null, the result will also be null.

Signature:

DECODE(Column value, String charset)

`ENCODE`¶

Description: Computes the first argument into a binary from a string using the provided charset, which is one of the following:

‘US-ASCII’
‘ISO-8859-1’
‘UTF-8’
‘UTF-16BE’
‘UTF-16LE’
‘UTF-16’

If either argument is null, the result will also be null.

Signature:

ENCODE(Column value, String charset)

`FORMAT_NUMBER`¶

Description: Formats numeric column x to a format like ‘#,###,###.##’ that’s rounded to d decimal places with HALF_EVEN round mode (also known as Gaussian rounding or bankers’ rounding). Returns the result as a string column. If d is 0, the result has no decimal point or fractional part. If d is less than 0, the result will be null.

Signature:

FORMAT_NUMBER(Column x, int d)

`GET_JSON_OBJECT`¶

Description: Extracts json object from a json string based on json path specified, and returns json string of the extracted json object. Returns null if the input json string is invalid.

Signature:

GET_JSON_OBJECT(Column e, String path)

`INSTR`¶

Description: Locate the position of the first occurrence of substring in the given string column. Returns null if either of the arguments are null, and returns 0 if substring could not be found in str. The resulting position is 1 based index, not zero based.

Signature:

INSTR(Column str, String substring)

`JSON_TUPLE`¶

Description: Creates a new row for a json column according to the given field names.

Signature:

JSON_TUPLE(Column json, scala.collection.Seq<String> fields)
JSON_TUPLE(Column json, String... fields)

`LENGTH`¶

Description: Computes the length of a given string or binary column.

Signature:

LENGTH(Column e)

`LEVENSHTEIN`¶

Description: Computes the Levenshtein distance of the two given string columns.

Signature:

LEVENSHTEIN(Column l, Column r)

`LOWER`¶

Description: Converts a string column to lower case.

Signature:

LOWER(Column e)

`LPAD`¶

Description: Left pad the string column with pad to a length of len.

Signature:

LPAD(Column str, int len, String pad)

`LTRIM`¶

Description: Trim the spaces from left end for the specified string value.

Signature:

LTRIM(Column e)

`REGEXP_EXTRACT`¶

Description: Extract a specific group (groupIdx) matched by a Java regex (exp), from the specified string column. If the regex did not match, or the specified group did not match, an empty string is returned.

Signature:

REGEXP_EXTRACT(Column e, String exp, int groupIdx)

`REGEXP_REPLACE`¶

Description: Replace all substrings of the specified string value that match the Java regex pattern with replacement.

Signature:

REGEXP_REPLACE(Column e, String pattern, String replacement)

`REPEAT`¶

Description: Repeats a string column n times, and returns it as a new string column.

Signature:

REPEAT(Column str, int n)

`REVERSE`¶

Description: Reverses the string column and returns it as a new string column.

Signature:

REVERSE(Column str)

`RPAD`¶

Description: Right pad the string column with pad to a length of len.

Signature:

RPAD(Column str, int len, String pad)

`RTRIM`¶

Description: Trim the spaces from right end for the specified string value.

Signature:

RTRIM(Column e)

`SOUNDEX`¶

Description: Return the soundex code for the specified expression.

Signature:

SOUNDEX(Column e)

`SPLIT`¶

Description: Splits str around pattern where pattern is a regular expression.

Signature:

SPLIT(Column str, String pattern)

`SUBSTRING`¶

Description: Returns the substring starting at pos with a length of len when str is String type. Returns the slice of byte array that starts at pos in byte and is of length len when str is Binary type.

Signature:

SUBSTRING(Column str, int pos, int len)

`SUBSTRING_INDEX`¶

Description: Returns the substring from string str before count occurrences of the delimiter delim. Performs a case-sensitive match when searching for delim. If count is positive, everything to the left of the final delimiter (counting from the left) is returned. If count is negative, everything to the right of the final delimiter (counting from the right) is returned.

Signature:

SUBSTRING_INDEX(Column str, String delim, int count)

`TRANSLATE`¶

Description: Translate any character in the src by a character in replaceString. The characters in replaceString correspond to the characters in matchingString. The translate will happen when any character in the string matches the character in the matchingString.

Signature:

TRANSLATE(Column src, String matchingString, String replaceString)

`TRIM`¶

Description: Trim the spaces from both ends for the specified string column.

Signature:

TRIM(Column e)

`UNBASE64`¶

Description: Decodes a BASE64 encoded string column and returns it as a binary column.

Signature:

UNBASE64(Column e)

`UNHEX`¶

Description: Inverse of hex. Interprets each pair of characters as a hexadecimal number and converts to the byte representation of the number.

Signature:

UNHEX(Column column)

`UPPER`¶

Description: Converts a string column to upper case.

Signature:

UPPER(Column e)

Datetime functions¶

Function	Description
ADD_MONTHS	Returns the date that is numMonths after startDate.
DATEDIFF	Returns the number of days from start to end.
DATE_ADD	Returns the date that is days days after start.
DATE_FORMAT	Converts a date/timestamp/string to a value of string in the format specified by the date format.
DATE_SUB	Returns the date that is days days before start.
DAYOFMONTH	Extracts the day of the month as an integer from a given date/timestamp/string.
DAYOFYEAR	Extracts the day of the year as an integer from a given date/timestamp/string.
FROM_UNIXTIME	Converts the number of seconds from unix epoch to a string representing the timestamp of that moment in the current system time zone in the given format.
FROM_UTC_TIMESTAMP	Assumes given timestamp is UTC and converts to given timezone.
HOUR	Extracts the hours as an integer from a given date/timestamp/string.
LAST_DAY	Given a date column, returns the last day of the month which the given date belongs to.
MINUTE	Extracts the minutes as an integer from a given date/timestamp/string.
MONTH	Extracts the month as an integer from a given date/timestamp/string.
MONTHS_BETWEEN	Returns number of months between dates date1 and date2.
NEXT_DAY	Given a date column, returns the first date which is later than the value of the date column that is on the specified day of the week.
QUARTER	Extracts the quarter as an integer from a given date/timestamp/string.
SECOND	Extracts the seconds as an integer from a given date/timestamp/string.
TO_DATE	Converts the column into type date.
TO_UTC_TIMESTAMP	Assumes given timestamp is in given timezone and converts to UTC.
TRUNC	Returns date truncated to the unit specified by the format.
UNIX_TIMESTAMP	Converts time string in format yyyy-MM-dd HH\:mm:ss to Unix timestamp (in seconds), using the default timezone and the default locale.
WEEKOFYEAR	Extracts the week number as an integer from a given date/timestamp/string.
WINDOW	Generates tumbling or sliding time windows given a timestamp specifying column.
YEAR	Extracts the year as an integer from a given date/timestamp/string.

`ADD_MONTHS`¶

Description: Returns the date that is numMonths after the date/timestamp/string specified by startDate. If startDate is a string, it must be in the format ‘yyyy-MM-dd’ or ‘yyyy-MM-dd HH\:mm:ss’.

If startDate has a time component, it’s ignored. The resulting month has the same day component as that of startDate. If startDate is the last day of the month or if the resulting month has fewer days than the day component of startDate, the last day of the resulting month is returned.

Signature:

ADD_MONTHS(Column startDate, int numMonths)

`DATEDIFF`¶

Description: Returns the number of days from start to end.

Signature:

DATEDIFF(Column end, Column start)

`DATE_ADD`¶

Description: Returns the date that is days days after start.

Signature:

DATE_ADD(Column start, int days)

`DATE_FORMAT`¶

Description: Converts a date/timestamp/string to a value of string in the format specified by the date format. If dateExpr is a string, it must be in the format ‘yyyy-MM-dd’ or ‘yyyy-MM-dd HH\:mm:ss’.

For the pattern string format, pattern letters of SimpleDateFormat can be used. Refer to the Java SimpleDateFormat docs ↗ for more information.

Signature:

DATE_FORMAT(Column dateExpr, String format)

`DATE_SUB`¶

Description: Returns the date that is days days before start.

Signature:

DATE_SUB(Column start, int days)

`DAYOFMONTH`¶

Description: Extracts the day of the month as an integer from a given date/timestamp/string. Strings must be in the format ‘yyyy-MM-dd’ or ‘yyyy-MM-dd HH\:mm:ss’.

Signature:

DAYOFMONTH(Column e)

`DAYOFYEAR`¶

Description: Extracts the day of the year as an integer from a given date/timestamp/string. Strings must be in the format ‘yyyy-MM-dd’ or ‘yyyy-MM-dd HH\:mm:ss’.

Signature:

DAYOFYEAR(Column e)

`FROM_UNIXTIME`¶

Description: Converts the number of seconds from unix epoch (1970-01-01 00:00:00 UTC) to a string representing the timestamp of that moment in the current system time zone in the given format. If no pattern string f is provided, the default format for the resulting string is yyyy-MM-dd HH\:mm:ss. To change the format of the resulting string, provide a pattern string f. Pattern letters of SimpleDateFormat can be used for the pattern string. Refer to the Java SimpleDateFormat docs ↗ for more information.

Signature:

FROM_UNIXTIME(Column ut)FROM_UNIXTIME(Column ut, String f)

`FROM_UTC_TIMESTAMP`¶

Description: Assumes given timestamp is UTC and converts it to given the timezone specified by tz. Given a timestamp (which corresponds to a certain time of day in UTC), returns another timestamp that corresponds to the same time of day in the given timezone specified by tz.

Signature:

FROM_UTC_TIMESTAMP(Column ts, String tz)

`HOUR`¶

Description: Extracts the hours as an integer from a given date/timestamp/string. Strings must be in the format ‘yyyy-MM-dd’ or ‘yyyy-MM-dd HH\:mm:ss’.

Signature:

HOUR(Column e)

`LAST_DAY`¶

Description: Given a date column, returns the last day of the month which the given date belongs to.

Signature:

LAST_DAY(Column e)

`MINUTE`¶

Description: Extracts the minutes as an integer from a given date/timestamp/string. Strings must be in the format ‘yyyy-MM-dd’ or ‘yyyy-MM-dd HH\:mm:ss’.

Signature:

MINUTE(Column e)

`MONTH`¶

Description: Extracts the month as an integer from a given date/timestamp/string. Strings must be in the format ‘yyyy-MM-dd’ or ‘yyyy-MM-dd HH\:mm:ss’.

Signature:

MONTH(Column e)

`MONTHS_BETWEEN`¶

Description: Returns number of months between dates date1 and date2. If date1 is later than date2, the result is positive. If date1 is earlier than date2, the result is negative. If the two dates are the saem days of the months or both the last days of months, the result is an integer. Otherwise, the result is rounded to 8 decimal places.

Signature:

MONTHS_BETWEEN(Column date1, Column date2)

`NEXT_DAY`¶

Description: Given a date column, returns the first date which is later than the value of the date column that is on the specified dayOfWeek. The dayOfWeek parameter is case insensitive and can be one of the following:

‘Mon’
‘Tue’
‘Wed’
‘Thu’
‘Fri’
‘Sat’
‘Sun’

Signature:

NEXT_DAY(Column date, String dayOfWeek)

`QUARTER`¶

Description: Extracts the quarter as an integer from a given date/timestamp/string. Strings must be in the format ‘yyyy-MM-dd’ or ‘yyyy-MM-dd HH\:mm:ss’. The resulting quarter is in the range 1 to 4.

Signature:

QUARTER(Column e)

`SECOND`¶

Description: Extracts the seconds as an integer from a given date/timestamp/string. Strings must be in the format ‘yyyy-MM-dd’ or ‘yyyy-MM-dd HH\:mm:ss’.

Signature:

SECOND(Column e)

`TO_DATE`¶

Description: Converts the column into type date.

Signature:

TO_DATE(Column e)

`TO_UTC_TIMESTAMP`¶

Description: Assumes the provided timestamp (ts) is in the provided timezone (tz) and converts to UTC.

Signature:

TO_UTC_TIMESTAMP(Column ts, String tz)

`TRUNC`¶

Description: Returns date truncated to the unit specified by the format. The format parameter is ‘year’, ‘yyyy’, ‘yy’ for truncating by year, or ‘month’, ‘mon’, ‘mm’ for truncating by month.

Signature:

TRUNC(Column date, String format)

`UNIX_TIMESTAMP`¶

Description: Converts time string to Unix timestamp (in seconds) using the default timezone and the default locale. Returns null if it fails to convert the string to a timestamp.

If no pattern string is provided, the default format for the input column s is yyyy-MM-dd HH\:mm:ss. If the input column is in a different format, provide a pattern string p. Pattern letters of SimpleDateFormat can be used for the pattern string. Refer to the Java SimpleDateFormat docs ↗ for more information.

Signature:

UNIX_TIMESTAMP(Column s)UNIX_TIMESTAMP(Column s, String p)

`WEEKOFYEAR`¶

Description: Extracts the week number as an integer from a given date/timestamp/string. Strings must be in the format ‘yyyy-MM-dd’ or ‘yyyy-MM-dd HH\:mm:ss’.

Signature:

WEEKOFYEAR(Column e)

`WINDOW`¶

Description: Generates tumbling or sliding time windows given a timestamp specifying column. If slideDurationis not provided, generates tumbling time windows. If slideDuration is provided, generate sliding time windows by bucketizing rows into one or more time windows.

Window starts are inclusive and window ends are exclusive, so 12:05 will be in the window [12:05,12:10) but not in [12:00,12:05). Windows can support microsecond precision. Windows in the order of months are not supported.

The windows start beginning at 1970-01-01 00:00:00 UTC, unless startTime is provided. This function accepts the following parameters:

timeColumn: The column to use as the timestamp for windowing by time. The time column must be of TimestampType.
windowDuration: A string specifying the width of the window, e.g. 10 minutes, 1 second. Note that the duration is a fixed length of time, and does not vary over time according to a calendar.
slideDuration: A string specifying the sliding interval of the window, e.g. 1 minute. A new window will be generated every slideDuration. Must be less than or equal to the windowDuration. This duration is likewise absolute, and does not vary according to a calendar.
startTime: The offset with respect to 1970-01-01 00:00:00 UTC with which to start window intervals. For example, in order to have hourly tumbling windows that start 15 minutes past the hour (e.g. 12:15-13:15, 13:15-14:15…) provide startTime as 15 minutes.

Signature:

WINDOW(Column timeColumn, String windowDuration)WINDOW(Column timeColumn, String windowDuration, String slideDuration)WINDOW(Column timeColumn, String windowDuration, String slideDuration, String startTime)

`YEAR`¶

Description: Extracts the year as an integer from a given date/timestamp/string. Strings must be in the format ‘yyyy-MM-dd’ or ‘yyyy-MM-dd HH\:mm:ss’.

Signature:

YEAR(Column e)

Math functions¶

Function	Description
ABS	Computes the absolute value.
ACOS	Computes the cosine inverse of the given value.
ASIN	Computes the sine inverse of the given value.
ATAN	Computes the tangent inverse of the given value.
ATAN2	Returns the angle theta from the conversion of rectangular coordinates (x, y) to polar coordinates (r, theta).
BIN	Returns the string representation of the binary value of the given long column.
BROUND	Returns the rounded value of the column e.
CBRT	Computes the cube-root of the given value.
CEIL	Computes the ceiling of the given value.
CONV	Convert a number in a string column from one base to another.
COS	Computes the cosine of the given value.
COSH	Computes the hyperbolic cosine of the given value.
EXP	Computes the exponential of the given value.
EXPM1	Computes the exponential of the given value minus one.
FACTORIAL	Computes the factorial of the given value.
FLOOR	Computes the floor of the given value.
HYPOT	Computes sqrt(a^2^ + b^2^) without intermediate overflow or underflow.
LOG	Computes the natural logarithm of the given value.
LOG10	Computes the logarithm of the given value in base 10.
LOG1P	Computes the natural logarithm of the given value plus one.
LOG2	Computes the logarithm of the given column in base 2.
NEGATIVE	Unary minus.
PMOD	Returns the positive value of dividend mod divisor.
POW	Returns the value of the first argument raised to the power of the second argument.
RINT	Returns the double value that is closest in value to the argument and is equal to a mathematical integer.
ROUND	Returns the value of the column e rounded to 0 decimal places.
SHIFTLEFT	Shift the given value numBits left.
SHIFTRIGHT	Shift the given value numBits right.
SHIFTRIGHTUNSIGNED	Unsigned shift the given value numBits right.
SIGNUM	Computes the signum of the given value.
SIN	Computes the sine of the given value.
SINH	Computes the hyperbolic sine of the given value.
SQRT	Computes the square root of the specified float value.
TAN	Computes the tangent of the given value.
TANH	Computes the hyperbolic tangent of the given value.
TODEGREES	Converts an angle measured in radians to an approximately equivalent angle measured in degrees.
TORADIANS	Converts an angle measured in degrees to an approximately equivalent angle measured in radians.

`ABS`¶

Description: Computes the absolute value.

Signature:

ABS(Column e)

`ACOS`¶

Description: Computes the cosine inverse of the given value. The returned angle is in the range 0.0 through pi.

Signature:

ACOS(Column e)

`ASIN`¶

Description: Computes the sine inverse of the given value. The returned angle is in the range -pi/2 through pi/2.

Signature:

ASIN(Column e)

`ATAN`¶

Description: Computes the tangent inverse of the given value.

Signature:

ATAN(Column e)

`ATAN2`¶

Description: Returns the angle theta from the conversion of rectangular coordinates (x, y) to polar coordinates (r, theta).

Signature:

ATAN2(Column l, Column r)ATAN2(Column l, double r)ATAN2(Column l, String rightName)

`BIN`¶

Description: Returns the string representation of the binary value of the given long column.

Signature:

BIN(Column e)

`BROUND`¶

Description: When just the column e is provided, returns the value of the column e rounded to 0 decimal places with HALF_EVEN round mode. This is also known as Gaussian rounding or bankers’ rounding.

When scale is also provided, returns the value of the column e rounded to scale decimal places (with HALF_EVEN round mode if scale >= 0 or at integral part when scale < 0).

Signature:

BROUND(Column e)BROUND(Column e, int scale)

`CBRT`¶

Description: Computes the cube-root of the given value.

Signature:

CBRT(Column e)

`CEIL`¶

Description: Computes the ceiling of the given value.

Signature:

CEIL(Column e)

`CONV`¶

Description: Convert a number in a string column from one base to another.

Signature:

CONV(Column num, int fromBase, int toBase)

`COS`¶

Description: Computes the cosine of the given value.

Signature:

COS(Column e)

`COSH`¶

Description: Computes the hyperbolic cosine of the given value.

Signature:

COSH(Column e)

`EXP`¶

Description: Computes the exponential of the given value.

Signature:

EXP(Column e)

`EXPM1`¶

Description: Computes the exponential of the given value minus one.

Signature:

EXPM1(Column e)

`FACTORIAL`¶

Description: Computes the factorial of the given value.

Signature:

FACTORIAL(Column e)

`FLOOR`¶

Description: Computes the floor of the given value.

Signature:

FLOOR(Column e)

`HYPOT`¶

Description: Computes sqrt(a^2^ + b^2^) without intermediate overflow or underflow.

Signature:

HYPOT(Column l, Column r)HYPOT(Column l, double r)HYPOT(Column l, String rightName)

`LOG`¶

Description: Computes the natural logarithm of the given value.

Signature:

LOG(Column e)

`LOG10`¶

Description: Computes the logarithm of the given value in base 10.

Signature:

LOG10(Column e)

`LOG1P`¶

Description: Computes the natural logarithm of the given value plus one.

Signature:

LOG1P(Column e)

`LOG2`¶

Description: Computes the logarithm of the given column in base 2.

Signature:

LOG2(Column expr)

`NEGATIVE`¶

Description: Unary minus (negate the expression).

Signature:

NEGATIVE(Column e)

`PMOD`¶

Description: Returns the positive value of dividend mod divisor.

Signature:

PMOD(Column dividend, Column divisor)

`POW`¶

Description: Returns the value of the first argument raised to the power of the second argument.

Signature:

POW(Column l, Column r)POW(Column l, double r)POW(Column l, String rightName)

`RINT`¶

Description: Returns the double value that is closest in value to the argument and is equal to a mathematical integer.

Signature:

RINT(Column e)

`ROUND`¶

Description: Returns the value of the column e rounded to 0 decimal places.

Signature:

ROUND(Column e)ROUND(Column e, int scale)

`SHIFTLEFT`¶

Description: Shift the given value numBits left.

Signature:

SHIFTLEFT(Column e, int numBits)

`SHIFTRIGHT`¶

Description: Shift the given value numBits right.

Signature:

SHIFTRIGHT(Column e, int numBits)

`SHIFTRIGHTUNSIGNED`¶

Description: Unsigned shift the given value numBits right.

Signature:

SHIFTRIGHTUNSIGNED(Column e, int numBits)

`SIGNUM`¶

Description: Computes the signum of the given value.

Signature:

SIGNUM(Column e)

`SIN`¶

Description: Computes the sine of the given value.

Signature:

SIN(Column e)

`SINH`¶

Description: Computes the hyperbolic sine of the given value.

Signature:

SINH(Column e)

`SQRT`¶

Description: Computes the square root of the specified float value.

Signature:

SQRT(Column e)

`TAN`¶

Description: Computes the tangent of the given value.

Signature:

TAN(Column e)

`TANH`¶

Description: Computes the hyperbolic tangent of the given value.

Signature:

TANH(Column e)

`TODEGREES`¶

Description: Converts an angle measured in radians to an approximately equivalent angle measured in degrees.

Signature:

TODEGREES(Column e)

`TORADIANS`¶

Description: Converts an angle measured in degrees to an approximately equivalent angle measured in radians.

Signature:

TORADIANS(Column e)

Null functions¶

Function	Description
COALESCE	Returns the first column that is not null, or null if all inputs are null.
ISNULL	Return true iff the column is null.
NULLIF	Returns null if both arguments are equal, otherwise returns the first argument.

`COALESCE`¶

Description: Returns the first column that is not null, or null if all inputs are null.

Signature:

COALESCE(Column... e)

`ISNULL`¶

Description: Return true iff the column is null.

Signature:

ISNULL(Column e)

`NULLIF`¶

Description: Returns null if both arguments are equal, otherwise returns the first argument.

Signature:

NULLIF(Column l, Column r)

Collection functions¶

Function	Description
ARRAY	Creates a new array column.
ARRAY_CONTAINS	Returns true if the array contains value.
EXPLODE	Creates a new row for each element in the given array or map column.
MAP	Creates a new map column.
POSEXPLODE	Creates a new row for each element with position in the given array or map column.
SIZE	Returns length of array or map.
SORT_ARRAY	Sorts the input array for the given column in ascending/descending order.
STRUCT	Creates a new struct column.

`ARRAY`¶

Description: Creates a new array column.

Signature:

ARRAY(Column... cols)

`ARRAY_CONTAINS`¶

Description: Returns true if the array contains value.

Signature:

ARRAY_CONTAINS(Column column, Object value)

`EXPLODE`¶

Description: Creates a new row for each element in the given array or map column.

Signature:

EXPLODE(Column e)

`MAP`¶

Description: Creates a new map column.

Signature:

MAP(Column... cols)

`POSEXPLODE`¶

Description: Creates a new row for each element with position in the given array or map column.

Signature:

POSEXPLODE(Column e)

`SIZE`¶

Description: Returns length of array or map.

Signature:

SIZE(Column e)

`SORT_ARRAY`¶

Description: Sorts the input array for the given column in ascending/descending order, according to the natural ordering of the array elements. The sorting defaults to ascending order, unless asc is set to false.

Signature:

SORT_ARRAY(Column e)SORT_ARRAY(Column e, boolean asc)

`STRUCT`¶

Description: Creates a new struct column.

Signature:

STRUCT(Column... cols)

Window functions¶

Function	Description
LAG	Returns the value that is offset rows before the current row, and null if there is less than offset rows before the current row.
LEAD	Returns the value that is offset rows after the current row, and null if there is less than offset rows after the current row.

`LAG`¶

Description: Window function: returns the value that is offset rows before the current row, and null if there is less than offset rows before the current row. For example, an offset of one will return the previous row at any given point in the window partition.

Signature:

LAG(Column e, int offset)LAG(Column e, int offset, Object defaultValue)

`LEAD`¶

Description: Window function: returns the value that is offset rows after the current row, and null if there is less than offset rows after the current row.

Signature:

LEAD(Column e, int offset)LEAD(Column e, int offset, Object defaultValue)

Other functions¶

Function	Description
CRC32	Calculates the cyclic redundancy check value of a binary column and returns the value as a bigint.
GREATEST	Returns the greatest value of the list of values, skipping null values.
HASH	Calculates the hash code of given columns, and returns the result as an int column.
HEX	Computes hex value of the given column.
ISNAN	Return true iff the column is NaN.
LEAST	Returns the least value of the list of values, skipping null values.
MD5	Calculates the MD5 digest of a binary column and returns the value as a 32 character hex string.
NANVL	Returns col1 if it is not NaN, or col2 if col1 is NaN.
NOT	Inversion of boolean expression.
SHA1	Calculates the SHA-1 digest of a binary column and returns the value as a 40 character hex string.
SHA2	Calculates the SHA-2 family of hash functions of a binary column and returns the value as a hex string.
WHEN	Evaluates a list of conditions and returns one of multiple possible result expressions.

`CRC32`¶

Description: Calculates the cyclic redundancy check value (CRC32) of a binary column and returns the value as a bigint.

Signature:

CRC32(Column e)

`GREATEST`¶

Description: Returns the greatest value of the list of values, skipping null values. Compares values using the “>” operator. This function takes at least 2 parameters. It will return null iff all parameters are null.

Signature:

GREATEST(Column... exprs)

`HASH`¶

Description: Calculates the hash code of given columns, and returns the result as an int column.

Signature:

HASH(Column... cols)

`HEX`¶

Description: Computes hex value of the given column. If column is of type int or binary, returns the number is a string in hexadecimal format. If column is of type string, converts each character into its hexadecimal representation and returns the resulting string.

Signature:

HEX(Column column)

`ISNAN`¶

Description: Return true iff the column is NaN.

Signature:

ISNAN(Column e)

`LEAST`¶

Description: Returns the least value of the list of values, skipping null values. Compares values using the “<” operator. This function takes at least 2 parameters. It will return null iff all parameters are null.

Signature:

LEAST(Column... exprs)

`MD5`¶

Description: Calculates the MD5 digest of a binary column and returns the value as a 32 character hex string.

Signature:

MD5(Column e)

`NANVL`¶

Description: Returns col1 if it is not NaN, or col2 if col1 is NaN.

Signature:

NANVL(Column col1, Column col2)

`NOT`¶

Description: Inversion of Boolean expression (that is, NOT).

Signature:

NOT(Column e)

`SHA1`¶

Description: Calculates the SHA-1 digest of a binary column and returns the value as a 40 character hex string.

Signature:

SHA1(Column e)

`SHA2`¶

Description: Calculates the SHA-2 family of hash functions of a binary column and returns the value as a hex string.

Signature:

SHA2(Column e, int numBits)

`WHEN`¶

Description: Evaluates a list of conditions and returns one of multiple possible result expressions.

Signature:

WHEN(Column condition, Object value)

中文翻译¶

Spark SQL 参考指南¶

本节涵盖编写 Spark SQL 数据转换与其他类型 SQL 查询之间的一些关键差异。它还包含可用 Spark SQL 函数的列表。

:::callout{theme="success" title="提示"} 在查找资源时，请记住 Spark SQL 实际上基于 HiveQL 方言。您可以在线找到更多关于 HiveQL 的信息。 :::

入门指南¶

基本查询格式¶

每个 SQL 数据转换查询必须创建一个表。SQL 查询的一般格式为：

CREATE TABLE _____ AS SELECT _____

不要在语句末尾包含分号。 包含分号将导致错误。

注释语法¶

您可以在 SQL 代码中包含注释，如下所示：

-- 您可以使用此语法创建注释

/* 您也可以使用此语法创建注释 */

引用数据集¶

要引用数据集，请提供用反引号括起来的数据集路径：

CREATE TABLE `/path/to/target/dataset` AS
SELECT * FROM `/path/to/source/dataset`

-- 替代语法

CREATE TABLE `/path/to/target/dataset` AS (
    SELECT * FROM `/path/to/source/dataset`
)

请注意，数据集名称区分大小写。

引用列¶

要引用数据集中的特定列，请提供列名：

SELECT Name FROM `/path/to/source/dataset`

请注意，列名区分大小写。

派生列与别名¶

派生列是对数据集列调用函数的结果。您必须为任何派生列定义别名：

SELECT Sum(Val) AS Total FROM `/path/to/source/dataset`

以下查询将导致错误：

SELECT Sum(Val) FROM `/path/to/source/dataset`

请注意，别名名称区分大小写。

SQL 子句中的派生列¶

您不能在以下 SQL 子句中使用别名：WHERE 和 GROUP BY。因此，您必须在 WHERE 或 GROUP BY 子句中引用实际的函数和数据集列：

SELECT Lower(Name) AS LowercaseName
FROM `/path/to/source/dataset` WHERE Lower(Name) = "sara"

SELECT Lower(Name) AS LowercaseName, Sum(Val) AS Total
FROM `/path/to/source/dataset` GROUP BY Lower(Name)

您可以在以下 SQL 子句中使用别名：ORDER BY 和 HAVING。因此，以下任何查询都可以正常工作：

-- 以下两个查询均有效

SELECT Lower(Name) AS LowercaseName, Sum(Val) AS Total
FROM `/path/to/source/dataset` GROUP BY Lower(Name) ORDER BY Total

SELECT Lower(Name) AS LowercaseName, Sum(Val) AS Total
FROM `/path/to/source/dataset` GROUP BY Lower(Name) ORDER BY Sum(Val)

-- 以下两个查询均有效

SELECT Lower(Name) AS LowercaseName, Sum(Val) AS Total
FROM `/path/to/source/dataset` GROUP BY Lower(Name) HAVING Total > 100

SELECT Lower(Name) AS LowercaseName, Sum(Val) AS Total
FROM `/path/to/source/dataset` GROUP BY Lower(Name) HAVING Sum(Val) > 100

类型转换¶

您可以转换表达式以将其从一种类型转换为另一种类型。转换的语法为：

CAST(expr AS <TYPE>)

请注意，expr 表示您要转换的表达式，<TYPE> 表示您要将表达式转换成的类型。如果 CAST(expr AS <TYPE>) 不成功，它将返回 null。<TYPE> 的可用值为：

boolean
tinyint
smallint
int
bigint
float
double
decimal
date
timestamp
binary
string

格式化日期¶

您可能希望重新格式化日期值或将字符串转换为日期格式。date 的格式为 yyyy-MM-dd，并且没有时间组件。您可以使用 CAST 函数以及可用的日期时间函数将字符串转换为日期。以下是一些快速示例：

-- 将格式为 'yyyy-MM-dd' 的字符串转换为日期
CAST('2016-07-30' AS DATE)

-- 将格式为 'yyyy-MM-dd' 的字符串转换为时间戳（时间戳将基于所提供日期的午夜）
CAST('2016-07-30' AS TIMESTAMP)

-- 将格式为 'ddMMyyyy' 的日期字符串转换为日期
TO_DATE(CAST(UNIX_TIMESTAMP('07302016', 'MMddyyyy') AS TIMESTAMP))

-- 将格式为 'ddMMyyyy HH:mm:ss' 的时间戳字符串转换为仅日期
CAST('2016-07-30 11:29:27' AS DATE)

-- 从 ISO 8601 时间戳中提取仅日期
TO_DATE('2016-07-30T11:29:27.000+00:00')

-- 从格式为 'yyyy-MM-dd HH:mm:ss' 的时间戳中提取仅日期
TO_DATE('2016-07-30 11:29:27')

请注意，如果字符串的格式不是日期/时间戳格式，则不能将其转换为日期。因此，像 CAST('20160730' AS DATE) 这样的操作将返回 null。

聚合函数¶

函数	描述
APPROX_COUNT_DISTINCT	返回组中不同项的近似数量。
AVG	返回组中值的平均值。
COLLECT_LIST	返回包含重复项的对象列表。
COLLECT_SET	返回消除重复元素后的对象集合。
CORR	返回两列的皮尔逊相关系数。
COUNT	返回组中的项数。
COVAR_POP	返回两列的总体协方差。
COVAR_SAMP	返回两列的样本协方差。
FIRST	返回组中的第一个值。
GROUPING	指示 GROUP BY 列表中的指定列是否被聚合。
KURTOSIS	返回组中值的峰度。
LAST	返回组中的最后一个值。
MAX	返回组中表达式的最大值。
MEAN	返回组中值的平均值。
MIN	返回组中表达式的最小值。
SKEWNESS	返回组中值的偏度。
STDDEV	stddev_samp 的别名。
STDDEV_POP	返回组中表达式的总体标准差。
STDDEV_SAMP	返回组中表达式的样本标准差。
SUM	返回表达式中所有值的总和。
VARIANCE	返回组中值的无偏方差。
VAR_POP	返回组中值的总体方差。
VAR_SAMP	返回组中值的无偏方差。

`APPROX_COUNT_DISTINCT`¶

描述： 聚合函数：返回组中不同项的近似数量。除非提供了 rsd，否则最大估计误差默认为 0.05。

签名：

APPROXCOUNTDISTINCT(Column e)
APPROXCOUNTDISTINCT(Column e, double rsd)

`AVG`¶

描述： 聚合函数：返回组中值的平均值。

签名：

AVG(Column e)

`COLLECT_LIST`¶

描述： 聚合函数：返回包含重复项的对象列表。

签名：

COLLECT_LIST(Column e)

`COLLECT_SET`¶

描述： 聚合函数：返回消除重复元素后的对象集合。

签名：

COLLECT_SET(Column e)

`CORR`¶

描述： 聚合函数：返回两列的皮尔逊相关系数。

签名：

CORR(Column column1, Column column2)

`COUNT`¶

描述： 聚合函数：返回组中的项数。

签名：

COUNT(Column e)

`COVAR_POP`¶

描述： 聚合函数：返回两列的总体协方差。

签名：

COVAR_POP(Column column1, Column column2)

`COVAR_SAMP`¶

描述： 聚合函数：返回两列的样本协方差。

签名：

COVAR_SAMP(Column column1, Column column2)

`FIRST`¶

描述： 聚合函数：返回组中的第一个值。默认情况下，返回它看到的第一个值。如果 ignoreNulls 设置为 true，它将返回它看到的第一个非空值。如果所有值都为空，则返回 null。

签名：

FIRST(Column e)FIRST(Column e, boolean ignoreNulls)

`GROUPING`¶

描述： 聚合函数：指示 GROUP BY 列表中的指定列是否被聚合。在结果集中，聚合返回 1，未聚合返回 0。

签名：

GROUPING(Column e)

`KURTOSIS`¶

描述： 聚合函数：返回组中值的峰度。

签名：

KURTOSIS(Column e)

`LAST`¶

描述： 聚合函数：返回组中的最后一个值。默认情况下，返回它看到的最后一个值。如果 ignoreNulls 设置为 true，它将返回它看到的最后一个非空值。如果所有值都为空，则返回 null。

签名：

LAST(Column e)LAST(Column e, boolean ignoreNulls)

`MAX`¶

描述： 聚合函数：返回组中表达式的最大值。

签名：

MAX(Column e)

`MEAN`¶

描述： 聚合函数：返回组中值的平均值。

签名：

MEAN(Column e)

`MIN`¶

描述： 聚合函数：返回组中表达式的最小值。

签名：

MIN(Column e)

`SKEWNESS`¶

描述： 聚合函数：返回组中值的偏度。

签名：

SKEWNESS(Column e)

`STDDEV`¶

描述： 聚合函数：stddev_samp 的别名。

签名：

STDDEV(Column e)

`STDDEV_POP`¶

描述： 聚合函数：返回组中表达式的总体标准差。

签名：

STDDEV_POP(Column e)

`STDDEV_SAMP`¶

描述： 聚合函数：返回组中表达式的样本标准差。

签名：

STDDEV_SAMP(Column e)

`SUM`¶

描述： 聚合函数：返回表达式中所有值的总和。

签名：

SUM(Column e)

`VARIANCE`¶

描述： 聚合函数：返回组中值的无偏方差。VAR_SAMP 函数的别名。

签名：

VARIANCE(Column e)

`VAR_POP`¶

描述： 聚合函数：返回组中值的总体方差。

签名：

VAR_POP(Column e)

`VAR_SAMP`¶

描述： 聚合函数：返回组中值的无偏方差。

签名：

VAR_SAMP(Column e)

字符串函数¶

函数	描述
ASCII	计算字符串列第一个字符的数值，并将结果作为 int 列返回。
BASE64	计算二进制列的 BASE64 编码，并将其作为字符串列返回。
CONCAT	将多个输入字符串列连接成一个字符串列。
DECODE	使用提供的字符集将第一个参数从二进制计算为字符串。
ENCODE	使用提供的字符集将第一个参数从字符串计算为二进制。
FORMAT_NUMBER	将数字列格式化为类似 '#,###,###.##' 的格式，并四舍五入到 `d` 位小数。
GET_JSON_OBJECT	根据指定的 json 路径从 json 字符串中提取 json 对象，并返回提取的 json 对象的 json 字符串。
INSTR	定位给定字符串列中子字符串第一次出现的位置。
JSON_TUPLE	根据给定的字段名称为 json 列创建新行。
LENGTH	计算给定字符串或二进制列的长度。
LEVENSHTEIN	计算两个给定字符串列的莱文斯坦距离。
LOWER	将字符串列转换为小写。
LPAD	使用 pad 对字符串列进行左填充，使其达到 len 长度。
LTRIM	去除指定字符串值左侧的空格。
REGEXP_EXTRACT	从指定的字符串列中提取由 Java 正则表达式匹配的特定组。
REGEXP_REPLACE	将指定字符串值中与正则表达式匹配的所有子字符串替换为 rep。
REPEAT	将字符串列重复 n 次，并将其作为新的字符串列返回。
REVERSE	反转字符串列并将其作为新的字符串列返回。
RPAD	使用 pad 对字符串列进行右填充，使其达到 len 长度。
RTRIM	去除指定字符串值右侧的空格。
SOUNDEX	返回指定表达式的 soundex 代码。
SPLIT	根据模式（模式是正则表达式）拆分字符串。
SUBSTRING	返回字符串或二进制类型列的子字符串。
SUBSTRING_INDEX	返回字符串中在分隔符出现 count 次之前的子字符串。
TRANSLATE	将 src 中的任何字符替换为 replaceString 中的字符。
TRIM	去除指定字符串列两端的空格。
UNBASE64	解码 BASE64 编码的字符串列，并将其作为二进制列返回。
UNHEX	hex 的反函数。
UPPER	将字符串列转换为大写。

`ASCII`¶

描述： 计算字符串列第一个字符的数值，并将结果作为 int 列返回。

签名：

ASCII(Column e)

`BASE64`¶

描述： 计算二进制列的 BASE64 编码，并将其作为字符串列返回。

签名：

BASE64(Column e)

`CONCAT`¶

描述： 将多个输入字符串列连接成一个字符串列。

签名：

CONCAT(Column... exprs)

`DECODE`¶

描述： 使用提供的 charset 将第一个参数从二进制计算为字符串，charset 可以是以下之一：

'US-ASCII'
'ISO-8859-1'
'UTF-8'
'UTF-16BE'
'UTF-16LE'
'UTF-16'

如果任一参数为 null，结果也将为 null。

签名：

DECODE(Column value, String charset)

`ENCODE`¶

描述： 使用提供的 charset 将第一个参数从字符串计算为二进制，charset 可以是以下之一：

'US-ASCII'
'ISO-8859-1'
'UTF-8'
'UTF-16BE'
'UTF-16LE'
'UTF-16'

如果任一参数为 null，结果也将为 null。

签名：

ENCODE(Column value, String charset)

`FORMAT_NUMBER`¶

描述： 将数字列 x 格式化为类似 '#,###,###.##' 的格式，并使用 HALF_EVEN 舍入模式（也称为高斯舍入或银行家舍入）四舍五入到 d 位小数。将结果作为字符串列返回。如果 d 为 0，则结果没有小数点或小数部分。如果 d 小于 0，则结果将为 null。

签名：

FORMAT_NUMBER(Column x, int d)

`GET_JSON_OBJECT`¶

描述： 根据指定的 json path 从 json 字符串中提取 json 对象，并返回提取的 json 对象的 json 字符串。如果输入的 json 字符串无效，则返回 null。

签名：

GET_JSON_OBJECT(Column e, String path)

`INSTR`¶

描述： 定位给定字符串列中 substring 第一次出现的位置。如果任一参数为 null，则返回 null；如果在 str 中找不到 substring，则返回 0。结果位置是基于 1 的索引，而不是基于 0 的索引。

签名：

INSTR(Column str, String substring)

`JSON_TUPLE`¶

描述： 根据给定的字段名称为 json 列创建新行。

签名：

JSON_TUPLE(Column json, scala.collection.Seq<String> fields)
JSON_TUPLE(Column json, String... fields)

`LENGTH`¶

描述： 计算给定字符串或二进制列的长度。

签名：

LENGTH(Column e)

`LEVENSHTEIN`¶

描述： 计算两个给定字符串列的莱文斯坦距离。

签名：

LEVENSHTEIN(Column l, Column r)

`LOWER`¶

描述： 将字符串列转换为小写。

签名：

LOWER(Column e)

`LPAD`¶

描述： 使用 pad 对字符串列进行左填充，使其达到 len 长度。

签名：

LPAD(Column str, int len, String pad)

`LTRIM`¶

描述： 去除指定字符串值左侧的空格。

签名：

LTRIM(Column e)

`REGEXP_EXTRACT`¶

描述： 从指定的字符串列中提取由 Java 正则表达式 (exp) 匹配的特定组 (groupIdx)。如果正则表达式不匹配，或指定的组不匹配，则返回空字符串。

签名：

REGEXP_EXTRACT(Column e, String exp, int groupIdx)

`REGEXP_REPLACE`¶

描述： 将指定字符串值中与 Java 正则表达式 pattern 匹配的所有子字符串替换为 replacement。

签名：

REGEXP_REPLACE(Column e, String pattern, String replacement)

`REPEAT`¶

描述： 将字符串列重复 n 次，并将其作为新的字符串列返回。

签名：

REPEAT(Column str, int n)

`REVERSE`¶

描述： 反转字符串列并将其作为新的字符串列返回。

签名：

REVERSE(Column str)

`RPAD`¶

描述： 使用 pad 对字符串列进行右填充，使其达到 len 长度。

签名：

RPAD(Column str, int len, String pad)

`RTRIM`¶

描述： 去除指定字符串值右侧的空格。

签名：

RTRIM(Column e)

`SOUNDEX`¶

描述： 返回指定表达式的 soundex 代码。

签名：

SOUNDEX(Column e)

`SPLIT`¶

描述： 根据 pattern 拆分 str，其中 pattern 是正则表达式。

签名：

SPLIT(Column str, String pattern)

`SUBSTRING`¶

描述： 当 str 为字符串类型时，返回从 pos 开始、长度为 len 的子字符串。当 str 为二进制类型时，返回从字节位置 pos 开始、长度为 len 的字节数组切片。

签名：

SUBSTRING(Column str, int pos, int len)

`SUBSTRING_INDEX`¶

描述： 返回字符串 str 中在分隔符 delim 出现 count 次之前的子字符串。搜索 delim 时执行区分大小写的匹配。如果 count 为正数，则返回最终分隔符（从左计数）左侧的所有内容。如果 count 为负数，则返回最终分隔符（从右计数）右侧的所有内容。

签名：

SUBSTRING_INDEX(Column str, String delim, int count)

`TRANSLATE`¶

描述： 将 src 中的任何字符替换为 replaceString 中的字符。replaceString 中的字符对应于 matchingString 中的字符。当字符串中的任何字符与 matchingString 中的字符匹配时，将进行替换。

签名：

TRANSLATE(Column src, String matchingString, String replaceString)

`TRIM`¶

描述： 去除指定字符串列两端的空格。

签名：

TRIM(Column e)

`UNBASE64`¶

描述： 解码 BASE64 编码的字符串列，并将其作为二进制列返回。

签名：

UNBASE64(Column e)

`UNHEX`¶

描述： hex 的反函数。将每对字符解释为十六进制数，并转换为该数字的字节表示形式。

签名：

UNHEX(Column column)

`UPPER`¶

描述： 将字符串列转换为大写。

签名：

UPPER(Column e)

日期时间函数¶

函数	描述
ADD_MONTHS	返回 startDate 之后 numMonths 个月的日期。
DATEDIFF	返回从 start 到 end 的天数。
DATE_ADD	返回 start 之后 days 天的日期。
DATE_FORMAT	将日期/时间戳/字符串转换为由日期格式指定的字符串值。
DATE_SUB	返回 start 之前 days 天的日期。
DAYOFMONTH	从给定的日期/时间戳/字符串中提取月份中的第几天作为整数。
DAYOFYEAR	从给定的日期/时间戳/字符串中提取年份中的第几天作为整数。
FROM_UNIXTIME	将自 Unix 纪元以来的秒数转换为表示该时刻在当前系统时区中的时间戳的字符串，格式为给定格式。
FROM_UTC_TIMESTAMP	假定给定的时间戳是 UTC 时间，并将其转换为给定的时区。
HOUR	从给定的日期/时间戳/字符串中提取小时作为整数。
LAST_DAY	给定日期列，返回该日期所属月份的最后一天。
MINUTE	从给定的日期/时间戳/字符串中提取分钟作为整数。
MONTH	从给定的日期/时间戳/字符串中提取月份作为整数。
MONTHS_BETWEEN	返回日期 date1 和 date2 之间的月数。
NEXT_DAY	给定日期列，返回该日期列值之后、位于指定星期几的第一个日期。
QUARTER	从给定的日期/时间戳/字符串中提取季度作为整数。
SECOND	从给定的日期/时间戳/字符串中提取秒作为整数。
TO_DATE	将列转换为日期类型。
TO_UTC_TIMESTAMP	假定给定的时间戳位于给定的时区，并将其转换为 UTC 时间。
TRUNC	返回按格式指定的单位截断的日期。
UNIX_TIMESTAMP	使用默认时区和默认区域设置，将格式为 yyyy-MM-dd HH\:mm:ss 的时间字符串转换为 Unix 时间戳（以秒为单位）。
WEEKOFYEAR	从给定的日期/时间戳/字符串中提取周数作为整数。
WINDOW	根据指定时间戳列生成滚动或滑动时间窗口。
YEAR	从给定的日期/时间戳/字符串中提取年份作为整数。

`ADD_MONTHS`¶

描述： 返回 startDate 指定的日期/时间戳/字符串之后 numMonths 个月的日期。如果 startDate 是字符串，则其格式必须为 'yyyy-MM-dd' 或 'yyyy-MM-dd HH\:mm:ss'。

如果 startDate 有时间组件，则忽略该组件。结果月份具有与 startDate 相同的日期组件。如果 startDate 是月份的最后一天，或者结果月份的天数少于 startDate 的日期组件，则返回结果月份的最后一天。

签名：

ADD_MONTHS(Column startDate, int numMonths)

`DATEDIFF`¶

描述： 返回从 start 到 end 的天数。

签名：

DATEDIFF(Column end, Column start)

`DATE_ADD`¶

描述： 返回 start 之后 days 天的日期。

签名：

DATE_ADD(Column start, int days)

`DATE_FORMAT`¶

描述： 将日期/时间戳/字符串转换为由日期 format 指定的字符串值。如果 dateExpr 是字符串，则其格式必须为 'yyyy-MM-dd' 或 'yyyy-MM-dd HH\:mm:ss'。

对于模式字符串 format，可以使用 SimpleDateFormat 的模式字母。有关更多信息，请参阅 Java SimpleDateFormat 文档 ↗。

签名：

DATE_FORMAT(Column dateExpr, String format)

`DATE_SUB`¶

描述： 返回 start 之前 days 天的日期。

签名：

DATE_SUB(Column start, int days)

`DAYOFMONTH`¶

描述： 从给定的日期/时间戳/字符串中提取月份中的第几天作为整数。字符串的格式必须为 'yyyy-MM-dd' 或 'yyyy-MM-dd HH\:mm:ss'。

签名：

DAYOFMONTH(Column e)

`DAYOFYEAR`¶

描述： 从给定的日期/时间戳/字符串中提取年份中的第几天作为整数。字符串的格式必须为 'yyyy-MM-dd' 或 'yyyy-MM-dd HH\:mm:ss'。

签名：

DAYOFYEAR(Column e)

`FROM_UNIXTIME`¶

描述： 将自 Unix 纪元（1970-01-01 00:00:00 UTC）以来的秒数转换为表示该时刻在当前系统时区中的时间戳的字符串，格式为给定格式。如果未提供模式字符串 f，则结果字符串的默认格式为 yyyy-MM-dd HH\:mm:ss。要更改结果字符串的格式，请提供模式字符串 f。模式字符串可以使用 SimpleDateFormat 的模式字母。有关更多信息，请参阅 Java SimpleDateFormat 文档 ↗。

签名：

FROM_UNIXTIME(Column ut)FROM_UNIXTIME(Column ut, String f)

`FROM_UTC_TIMESTAMP`¶

描述： 假定给定的时间戳是 UTC 时间，并将其转换为由 tz 指定的时区。给定一个时间戳（对应于 UTC 中的某个时间点），返回另一个时间戳，该时间戳对应于由 tz 指定的时区中的相同时间点。

签名：

FROM_UTC_TIMESTAMP(Column ts, String tz)

`HOUR`¶

描述： 从给定的日期/时间戳/字符串中提取小时作为整数。字符串的格式必须为 'yyyy-MM-dd' 或 'yyyy-MM-dd HH\:mm:ss'。

签名：

HOUR(Column e)

`LAST_DAY`¶

描述： 给定日期列，返回该日期所属月份的最后一天。

签名：

LAST_DAY(Column e)

`MINUTE`¶

描述： 从给定的日期/时间戳/字符串中提取分钟作为整数。字符串的格式必须为 'yyyy-MM-dd' 或 'yyyy-MM-dd HH\:mm:ss'。

签名：

MINUTE(Column e)

`MONTH`¶

描述： 从给定的日期/时间戳/字符串中提取月份作为整数。字符串的格式必须为 'yyyy-MM-dd' 或 'yyyy-MM-dd HH\:mm:ss'。

签名：

MONTH(Column e)

`MONTHS_BETWEEN`¶

描述： 返回日期 date1 和 date2 之间的月数。如果 date1 晚于 date2，则结果为正数。如果 date1 早于 date2，则结果为负数。如果两个日期是月份中的同一天，或者都是月份的最后一天，则结果为整数。否则，结果四舍五入到 8 位小数。

签名：

MONTHS_BETWEEN(Column date1, Column date2)

`NEXT_DAY`¶

描述： 给定 date 列，返回该日期列值之后、位于指定 dayOfWeek 的第一个日期。dayOfWeek 参数不区分大小写，可以是以下之一：

'Mon'
'Tue'
'Wed'
'Thu'
'Fri'
'Sat'
'Sun'

签名：

NEXT_DAY(Column date, String dayOfWeek)

`QUARTER`¶

描述： 从给定的日期/时间戳/字符串中提取季度作为整数。字符串的格式必须为 'yyyy-MM-dd' 或 'yyyy-MM-dd HH\:mm:ss'。结果季度范围为 1 到 4。

签名：

QUARTER(Column e)

`SECOND`¶

描述： 从给定的日期/时间戳/字符串中提取秒作为整数。字符串的格式必须为 'yyyy-MM-dd' 或 'yyyy-MM-dd HH\:mm:ss'。

签名：

SECOND(Column e)

`TO_DATE`¶

描述： 将列转换为日期类型。

签名：

TO_DATE(Column e)

`TO_UTC_TIMESTAMP`¶

描述： 假定提供的时间戳 (ts) 位于提供的时区 (tz)，并将其转换为 UTC 时间。

签名：

TO_UTC_TIMESTAMP(Column ts, String tz)

`TRUNC`¶

描述： 返回按 format 指定的单位截断的日期。format 参数为 'year'、'yyyy'、'yy' 表示按年截断，或 'month'、'mon'、'mm' 表示按月截断。

签名：

TRUNC(Column date, String format)

`UNIX_TIMESTAMP`¶

描述： 使用默认时区和默认区域设置，将时间字符串转换为 Unix 时间戳（以秒为单位）。如果无法将字符串转换为时间戳，则返回 null。

如果未提供模式字符串，则输入列 s 的默认格式为 yyyy-MM-dd HH\:mm:ss。如果输入列采用不同的格式，请提供模式字符串 p。模式字符串可以使用 SimpleDateFormat 的模式字母。有关更多信息，请参阅 Java SimpleDateFormat 文档 ↗。

签名：

UNIX_TIMESTAMP(Column s)UNIX_TIMESTAMP(Column s, String p)

`WEEKOFYEAR`¶

描述： 从给定的日期/时间戳/字符串中提取周数作为整数。字符串的格式必须为 'yyyy-MM-dd' 或 'yyyy-MM-dd HH\:mm:ss'。

签名：

WEEKOFYEAR(Column e)

`WINDOW`¶

描述： 根据指定时间戳列生成滚动或滑动时间窗口。如果未提供 slideDuration，则生成滚动时间窗口。如果提供了 slideDuration，则通过将行分桶到一个或多个时间窗口中来生成滑动时间窗口。

窗口开始包含在内，窗口结束不包含在内，因此 12:05 将位于窗口 [12:05,12:10) 中，但不在 [12:00,12:05) 中。窗口支持微秒精度。不支持按月为单位的窗口。

窗口从 1970-01-01 00:00:00 UTC 开始，除非提供了 startTime。此函数接受以下参数：

timeColumn：用作按时间分窗口的时间戳的列。时间列必须为 TimestampType 类型。
windowDuration：指定窗口宽度的字符串，例如 10 minutes、1 second。请注意，持续时间是固定的时间长度，不会根据日历随时间变化。
slideDuration：指定窗口滑动间隔的字符串，例如 1 minute。每经过一个 slideDuration 就会生成一个新窗口。必须小于或等于 windowDuration。此持续时间同样是绝对的，不会根据日历变化。
startTime：相对于 1970-01-01 00:00:00 UTC 的偏移量，用于开始窗口间隔。例如，要获得每小时滚动窗口，从整点过 15 分钟开始（例如 12:15-13:15、13:15-14:15...），请将 startTime 设置为 15 minutes。

签名：

WINDOW(Column timeColumn, String windowDuration)WINDOW(Column timeColumn, String windowDuration, String slideDuration)WINDOW(Column timeColumn, String windowDuration, String slideDuration, String startTime)

`YEAR`¶

描述： 从给定的日期/时间戳/字符串中提取年份作为整数。字符串的格式必须为 'yyyy-MM-dd' 或 'yyyy-MM-dd HH\:mm:ss'。

签名：

YEAR(Column e)

数学函数¶

函数	描述
ABS	计算绝对值。
ACOS	计算给定值的反余弦。
ASIN	计算给定值的反正弦。
ATAN	计算给定值的反正切。
ATAN2	返回从直角坐标 (x, y) 到极坐标 (r, theta) 转换的角度 theta。
BIN

Spark SQL Reference（Spark SQL 参考指南）¶

Getting started¶

Basic query format¶

Comment syntax¶

Referencing datasets¶

Referencing columns¶

Derived columns & aliases¶

Derived columns in SQL clauses¶

Type conversion¶

Formatting dates¶

Aggregate functions¶

APPROX_COUNT_DISTINCT¶

AVG¶

COLLECT_LIST¶

COLLECT_SET¶

CORR¶

COUNT¶

COVAR_POP¶

COVAR_SAMP¶

FIRST¶

GROUPING¶

KURTOSIS¶

LAST¶

MAX¶

MEAN¶

MIN¶

SKEWNESS¶

STDDEV¶

STDDEV_POP¶

STDDEV_SAMP¶

SUM¶

VARIANCE¶

VAR_POP¶

VAR_SAMP¶

String functions¶

ASCII¶

BASE64¶

CONCAT¶

DECODE¶

ENCODE¶

FORMAT_NUMBER¶

GET_JSON_OBJECT¶

INSTR¶

JSON_TUPLE¶

LENGTH¶

LEVENSHTEIN¶

LOWER¶

LPAD¶

LTRIM¶

REGEXP_EXTRACT¶

REGEXP_REPLACE¶

REPEAT¶

REVERSE¶

RPAD¶

RTRIM¶

SOUNDEX¶

SPLIT¶

SUBSTRING¶

SUBSTRING_INDEX¶

TRANSLATE¶

TRIM¶

UNBASE64¶

UNHEX¶

UPPER¶

Datetime functions¶

ADD_MONTHS¶

DATEDIFF¶

DATE_ADD¶

DATE_FORMAT¶

DATE_SUB¶

DAYOFMONTH¶

DAYOFYEAR¶

FROM_UNIXTIME¶

FROM_UTC_TIMESTAMP¶

HOUR¶

LAST_DAY¶

MINUTE¶

MONTH¶

MONTHS_BETWEEN¶

NEXT_DAY¶

`APPROX_COUNT_DISTINCT`¶

`AVG`¶

`COLLECT_LIST`¶

`COLLECT_SET`¶

`CORR`¶

`COUNT`¶

`COVAR_POP`¶

`COVAR_SAMP`¶

`FIRST`¶

`GROUPING`¶

`KURTOSIS`¶

`LAST`¶

`MAX`¶

`MEAN`¶

`MIN`¶

`SKEWNESS`¶

`STDDEV`¶

`STDDEV_POP`¶

`STDDEV_SAMP`¶

`SUM`¶

`VARIANCE`¶

`VAR_POP`¶

`VAR_SAMP`¶

`ASCII`¶

`BASE64`¶

`CONCAT`¶

`DECODE`¶

`ENCODE`¶

`FORMAT_NUMBER`¶

`GET_JSON_OBJECT`¶

`INSTR`¶

`JSON_TUPLE`¶

`LENGTH`¶

`LEVENSHTEIN`¶

`LOWER`¶

`LPAD`¶

`LTRIM`¶

`REGEXP_EXTRACT`¶

`REGEXP_REPLACE`¶

`REPEAT`¶

`REVERSE`¶

`RPAD`¶

`RTRIM`¶

`SOUNDEX`¶

`SPLIT`¶

`SUBSTRING`¶

`SUBSTRING_INDEX`¶

`TRANSLATE`¶

`TRIM`¶

`UNBASE64`¶

`UNHEX`¶

`UPPER`¶

`ADD_MONTHS`¶

`DATEDIFF`¶

`DATE_ADD`¶

`DATE_FORMAT`¶

`DATE_SUB`¶

`DAYOFMONTH`¶

`DAYOFYEAR`¶

`FROM_UNIXTIME`¶

`FROM_UTC_TIMESTAMP`¶

`HOUR`¶

`LAST_DAY`¶

`MINUTE`¶

`MONTH`¶

`MONTHS_BETWEEN`¶

`NEXT_DAY`¶

`QUARTER`¶

`SECOND`¶

`TO_DATE`¶

`TO_UTC_TIMESTAMP`¶

`TRUNC`¶

`UNIX_TIMESTAMP`¶

`WEEKOFYEAR`¶

`WINDOW`¶

`YEAR`¶

`ABS`¶

`ACOS`¶

`ASIN`¶

`ATAN`¶