跳转至

Dates and timestamps(日期与时间戳)

1. Add or subtract dates

Add or subtract days to date

  • F.date_add(start, days)
  • F.date_sub(start, days)

Add months to date

  • F.add_months(start, months)

Get number of days or months between two dates

  • F.datediff(end, start)
  • F.months_between(date1, date2)

Get the last day of the month

  • F.last_day(date)

Get date of the next dayOfWeek

  • F.next_day(date, dayOfWeek)

2. Date values

Get the year, month, day, minute, second

  • F.year(column)
  • F.month(column)
  • F.dayofmonth(column)
  • F.hour(column)
  • F.minute(column)
  • F.second(column)

Get business quarter from date

  • F.quarter(column)

Get day or week of year from date

  • F.dayofyear(column)
  • F.weekofyear(column)

3. Formatting

Date & time format syntax

Here's a quick reference:

Format Example
yyyy-MM-dd 1997-01-31
yyyy-MM-dd HH\:mm:ss 1997-01-31 23:59:59

Date formatting string patterns are based on the Java class java.text.SimpleDateFormat. The complete reference is available in the Date & Time Format Syntax Table ↗.

Converting from string

  • F.to_date(column, format=None)
  • F.to_timestamp(column, format=None)
  • F.to_utc_timestamp(timestamp, tz)
  • F.unix_timestamp(timestamp=None, format='yyyy-MM-dd HH:mm:ss')

Converting to string

  • F.date_format(date, format)
  • F.from_unixtime(timestamp, format='yyyy-MM-dd HH:mm:ss')
  • F.from_utc_timestamp(timestamp, tz)

Casting from long to timestamp

Some systems store timestamps as a long datatype, in milliseconds. PySpark SQL stores timestamps in seconds. We must divide the long version of the timestamp by 1000 to properly cast it to timestamp:

casted_timestamp = (F.col('timestamp') / 1000).cast("timestamp")
df = df.withColumn("timestamp", casted_timestamp)
# 1531860192661 =>  Tuesday, July 17, 2018 8:43:12 PM

We can also use F.from_unixtime(timestamp) for clarity:

timestamp = F.from_unixtime(F.col('timestamp') / 1000)
df = df.withColumn("timestamp", timestamp)

:::callout{theme="neutral"} When casting from long to timestamp, we lose a level of granularity. SQL cannot store percentages or decimals of seconds. :::

Truncating

  • F.trunc(date, format)
  • F.date_trunc(format, timestamp)

中文翻译

日期与时间戳

1. 日期的加减运算

对日期增加或减少天数

  • F.date_add(start, days)
  • F.date_sub(start, days)

对日期增加月份

  • F.add_months(start, months)

获取两个日期之间的天数或月份数

  • F.datediff(end, start)
  • F.months_between(date1, date2)

获取当月的最后一天

  • F.last_day(date)

获取下一个指定星期几的日期

  • F.next_day(date, dayOfWeek)

2. 日期值提取

获取年、月、日、分钟、秒

  • F.year(column)
  • F.month(column)
  • F.dayofmonth(column)
  • F.hour(column)
  • F.minute(column)
  • F.second(column)

从日期获取业务季度

  • F.quarter(column)

从日期获取一年中的第几天或第几周

  • F.dayofyear(column)
  • F.weekofyear(column)

3. 格式化

日期与时间格式语法

以下为快速参考:

格式 示例
yyyy-MM-dd 1997-01-31
yyyy-MM-dd HH\:mm:ss 1997-01-31 23:59:59

日期格式化字符串模式基于 Java 类 java.text.SimpleDateFormat。完整参考请参见日期与时间格式语法表 ↗

从字符串转换

  • F.to_date(column, format=None)
  • F.to_timestamp(column, format=None)
  • F.to_utc_timestamp(timestamp, tz)
  • F.unix_timestamp(timestamp=None, format='yyyy-MM-dd HH:mm:ss')

转换为字符串

  • F.date_format(date, format)
  • F.from_unixtime(timestamp, format='yyyy-MM-dd HH:mm:ss')
  • F.from_utc_timestamp(timestamp, tz)

long 类型转换为 timestamp 类型

某些系统以毫秒为单位的 long 数据类型存储时间戳。PySpark SQL 则以秒为单位存储时间戳。我们必须将 long 类型的时间戳除以 1000,才能正确将其转换为 timestamp 类型:

casted_timestamp = (F.col('timestamp') / 1000).cast("timestamp")
df = df.withColumn("timestamp", casted_timestamp)
# 1531860192661 => 2018年7月17日 星期二 晚上8:43:12

为清晰起见,我们也可以使用 F.from_unixtime(timestamp)

timestamp = F.from_unixtime(F.col('timestamp') / 1000)
df = df.withColumn("timestamp", timestamp)

:::callout{theme="neutral"} 从 long 类型转换为 timestamp 类型时,会丢失一定程度的精度。SQL 无法存储秒的百分比或小数部分。 :::

截断操作

  • F.trunc(date, format)
  • F.date_trunc(format, timestamp)