Coming from Python
If you have experience with Python, you may be accustomed to manipulating data procedurally or imperatively: providing the exact steps needed to transform your data from one state to another. SQL, in contrast, is declarative: you describe the result you are looking for, and the software handles generating that result. PySpark is a library for conveniently building complicated SQL queries via Python: it attempts to provide access to SQL concepts in Python's procedural syntax. This takes advantage of the flexibility of Python, the convenience of SQL, and the parallel processing power of Spark.
It will be helpful to evolve your conceptual model to think in terms of the dataset as a whole and to process the data by column rather than by row. Instead of manipulating data directly using variables, lists, dictionaries, loops, etc., we work in terms of DataFrames. This means that instead of using Python primitives and operators, we'll use Spark's built-in operators, which work on DataFrames at scale in a distributed fashion.
Examples
Suppose you have a list of numbers in Python and you want to add 5 to each.
old_list = [1, 2, 3]
new_list = []
for i in old_list:
    added_number = i + 5
    new_list.append(added_number)
print(new_list)
# [6, 7, 8]
In PySpark, this would resemble:
new_dataframe = old_dataframe.withColumn('added_number', old_dataframe.number + 5)
new_dataframe now represents the following:
| number | added_number |
|---|---|
| 1 | 6 |
| 2 | 7 |
| 3 | 8 |
Interestingly, the DataFrame object does not actually contain your data in memory: it is a reference to the data in Spark. DataFrames are lazily evaluated. When we ask Spark to actually do something with a DataFrame (for example, write it out to Foundry), it walks through all of the intermediate DataFrames we created, generates an optimized query plan, and executes it on the Spark cluster. This allows Foundry to scale beyond the amount of data that can fit in memory on a single server or on your laptop.
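To make the idea of lazy evaluation concrete, here is a toy sketch in plain Python. The `LazyFrame` class and its methods are hypothetical and are not Spark classes. The sketch only illustrates that transformations are recorded first and run later, and it does not attempt any query optimization or distribution.

```python
class LazyFrame:
    """Toy stand-in for a lazily evaluated DataFrame (hypothetical, not a Spark class)."""

    def __init__(self, rows, plan=()):
        self._rows = rows  # source data; a real DataFrame holds only a reference to it
        self._plan = plan  # recorded transformations, not yet executed

    def with_column(self, name, fn):
        # Record the step and return a new frame; no rows are touched here.
        return LazyFrame(self._rows, self._plan + ((name, fn),))

    def collect(self):
        # Only now is the recorded plan applied to every row.
        result = []
        for row in self._rows:
            row = dict(row)
            for name, fn in self._plan:
                row[name] = fn(row)
            result.append(row)
        return result


calls = []


def add_five(row):
    calls.append(row["number"])  # track when the work actually happens
    return row["number"] + 5


old_frame = LazyFrame([{"number": 1}, {"number": 2}, {"number": 3}])
new_frame = old_frame.with_column("added_number", add_five)
calls_before_collect = len(calls)  # still zero: defining the column ran nothing
result = new_frame.collect()       # the recorded plan executes here
```

Because `with_column` only records the step, `add_five` is not called until `collect()` runs. Spark's DataFrame transformations behave analogously, with the addition of a planner that can reorder and combine steps before executing them.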