Dates and timestamps(日期与时间戳)¶
1. Add or subtract dates¶
Add or subtract days to date¶
F.date_add(start, days)F.date_sub(start, days)
Add months to date¶
F.add_months(start, months)
Get number of days or months between two dates¶
F.datediff(end, start)F.months_between(date1, date2)
Get the last day of the month¶
F.last_day(date)
Get date of the next dayOfWeek¶
F.next_day(date, dayOfWeek)
2. Date values¶
Get the year, month, day, minute, second¶
F.year(column)F.month(column)F.dayofmonth(column)F.hour(column)F.minute(column)F.second(column)
Get business quarter from date¶
F.quarter(column)
Get day or week of year from date¶
F.dayofyear(column)F.weekofyear(column)
3. Formatting¶
Date & time format syntax¶
Here's a quick reference:
| Format | Example |
|---|---|
| yyyy-MM-dd | 1997-01-31 |
| yyyy-MM-dd HH\:mm:ss | 1997-01-31 23:59:59 |
Date formatting string patterns are based on the Java class java.text.SimpleDateFormat. The complete reference is available in the Date & Time Format Syntax Table ↗.
Converting from string¶
F.to_date(column, format=None)F.to_timestamp(column, format=None)F.to_utc_timestamp(timestamp, tz)F.unix_timestamp(timestamp=None, format='yyyy-MM-dd HH:mm:ss')
Converting to string¶
F.date_format(date, format)F.from_unixtime(timestamp, format='yyyy-MM-dd HH:mm:ss')F.from_utc_timestamp(timestamp, tz)
Casting from long to timestamp¶
Some systems store timestamps as a long datatype, in milliseconds. PySpark SQL stores timestamps in seconds. We must divide the long version of the timestamp by 1000 to properly cast it to timestamp:
casted_timestamp = (F.col('timestamp') / 1000).cast("timestamp")
df = df.withColumn("timestamp", casted_timestamp)
# 1531860192661 => Tuesday, July 17, 2018 8:43:12 PM
We can also use F.from_unixtime(timestamp) for clarity:
timestamp = F.from_unixtime(F.col('timestamp') / 1000)
df = df.withColumn("timestamp", timestamp)
:::callout{theme="neutral"}
When casting from long to timestamp, we lose a level of granularity. SQL cannot store percentages or decimals of seconds.
:::
Truncating¶
F.trunc(date, format)F.date_trunc(format, timestamp)
中文翻译¶
日期与时间戳¶
1. 日期的加减运算¶
对日期增加或减少天数¶
F.date_add(start, days)F.date_sub(start, days)
对日期增加月份¶
F.add_months(start, months)
获取两个日期之间的天数或月份数¶
F.datediff(end, start)F.months_between(date1, date2)
获取当月的最后一天¶
F.last_day(date)
获取下一个指定星期几的日期¶
F.next_day(date, dayOfWeek)
2. 日期值提取¶
获取年、月、日、分钟、秒¶
F.year(column)F.month(column)F.dayofmonth(column)F.hour(column)F.minute(column)F.second(column)
从日期获取业务季度¶
F.quarter(column)
从日期获取一年中的第几天或第几周¶
F.dayofyear(column)F.weekofyear(column)
3. 格式化¶
日期与时间格式语法¶
以下为快速参考:
| 格式 | 示例 |
|---|---|
| yyyy-MM-dd | 1997-01-31 |
| yyyy-MM-dd HH\:mm:ss | 1997-01-31 23:59:59 |
日期格式化字符串模式基于 Java 类
java.text.SimpleDateFormat。完整参考请参见日期与时间格式语法表 ↗。
从字符串转换¶
F.to_date(column, format=None)F.to_timestamp(column, format=None)F.to_utc_timestamp(timestamp, tz)F.unix_timestamp(timestamp=None, format='yyyy-MM-dd HH:mm:ss')
转换为字符串¶
F.date_format(date, format)F.from_unixtime(timestamp, format='yyyy-MM-dd HH:mm:ss')F.from_utc_timestamp(timestamp, tz)
从 long 类型转换为 timestamp 类型¶
某些系统以毫秒为单位的 long 数据类型存储时间戳。PySpark SQL 则以秒为单位存储时间戳。我们必须将 long 类型的时间戳除以 1000,才能正确将其转换为 timestamp 类型:
casted_timestamp = (F.col('timestamp') / 1000).cast("timestamp")
df = df.withColumn("timestamp", casted_timestamp)
# 1531860192661 => 2018年7月17日 星期二 晚上8:43:12
为清晰起见,我们也可以使用 F.from_unixtime(timestamp):
timestamp = F.from_unixtime(F.col('timestamp') / 1000)
df = df.withColumn("timestamp", timestamp)
:::callout{theme="neutral"}
从 long 类型转换为 timestamp 类型时,会丢失一定程度的精度。SQL 无法存储秒的百分比或小数部分。
:::
截断操作¶
F.trunc(date, format)F.date_trunc(format, timestamp)