Titanic Disaster Prediction

  • Use deep learning to build a prediction model that estimates which passengers survived the sinking of the Titanic.

(Image: Titanic)



Background

  • The sinking of the Titanic is one of the most infamous shipwrecks in history.
    On April 15, 1912, during her maiden voyage, the widely considered “unsinkable” RMS Titanic sank after colliding with an iceberg. Unfortunately, there weren’t enough lifeboats for everyone onboard, resulting in the death of 1502 out of 2224 passengers and crew. While there was some element of luck involved in surviving, it seems some groups of people were more likely to survive than others.

The Challenge

  • With this background in mind, I decided to use the PyTorch deep learning framework to build a binary classification model for Titanic survival from the passenger data (name, age, sex, socioeconomic class, and so on).
Data Preprocessing
import pandas as pd
import numpy as np


def data_preprocessed():
    # One-hot encode the selected training columns; 'Sex' expands into Sex_female/Sex_male.
    datas = pd.read_csv('data/train.csv', delimiter=',')
    data = pd.get_dummies(datas[['Pclass', 'Sex', 'SibSp', 'Parch', 'Survived']])
    # Column 3 is 'Survived' after get_dummies; pop it and re-append it so the label sits in the last column.
    survive = data.pop(data.columns[3])
    data['Survived'] = survive

    with open('train.csv', 'w') as f:
        np.savetxt(f, data, delimiter=',')


def testdata_preprocessed():
    # Same encoding for the test set, which has no 'Survived' column.
    datas = pd.read_csv('data/test.csv', delimiter=',')
    data = pd.get_dummies(datas[['Pclass', 'Sex', 'SibSp', 'Parch']])

    with open('test.csv', 'w') as f:
        np.savetxt(f, data, delimiter=',')


if __name__ == '__main__':
    testdata_preprocessed()
    data_preprocessed()

'''
Use pandas and numpy to preprocess the dataset and save the training and test data to their respective csv files.
'''
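
As a quick sanity check of the encoded layout (this snippet is an addition, not part of the original script), pandas' default get_dummies naming expands 'Sex' into Sex_female and Sex_male, so after 'Survived' is moved to the end there are exactly five feature columns, which matches the network's input size later on:

import pandas as pd

# Illustrative check of the dummified columns (assumes the same data/train.csv as above).
datas = pd.read_csv('data/train.csv', delimiter=',')
data = pd.get_dummies(datas[['Pclass', 'Sex', 'SibSp', 'Parch', 'Survived']])
print(list(data.columns))
# ['Pclass', 'SibSp', 'Parch', 'Survived', 'Sex_female', 'Sex_male']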

  • With the data preprocessed, the next step is the main program that builds the prediction model.
Import the required modules
import numpy as np
import pandas as pd
import torch
import matplotlib.pyplot as plt
Define the neural network model
class Net(torch.nn.Module):
    def __init__(self, n_feature, n_hidden, n_hidden2, n_hidden3, n_output):
        super().__init__()
        self.feature = torch.nn.Linear(n_feature, n_hidden)
        self.layer = torch.nn.Linear(n_hidden, n_hidden2)
        self.layer2 = torch.nn.Linear(n_hidden2, n_hidden3)
        self.predict = torch.nn.Linear(n_hidden3, n_output)

    def forward(self, x):
        # Sigmoid after every layer; the final sigmoid maps the output to a survival probability in (0, 1).
        x = torch.sigmoid(self.feature(x))
        x = torch.sigmoid(self.layer(x))
        x = torch.sigmoid(self.layer2(x))
        x = torch.sigmoid(self.predict(x))
        return x
'''
The network consists of an input layer, two hidden layers, and an output layer, with a sigmoid activation after every layer.
'''
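
As a small added sanity check (not in the original post), instantiating the network with the layer sizes used later in main() and pushing a dummy batch through it should yield one probability per sample:

# Quick forward-pass check using the sizes from main() below.
net = Net(n_feature=5, n_hidden=330, n_hidden2=300, n_hidden3=215, n_output=1)
dummy_batch = torch.rand(4, 5)   # 4 samples, 5 features each
out = net(dummy_batch)
print(out.shape)                 # torch.Size([4, 1]); values lie in (0, 1) because of the final sigmoid
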
Load the dataset
def data_handling():
    # data_preprocessed() is the function from the preprocessing script above;
    # it has to be imported into (or defined in) this file before running.
    data_preprocessed()
    data = np.loadtxt('train.csv', delimiter=',', dtype=np.float32)
    x = torch.from_numpy(data[:, :-1])   # feature columns
    y = torch.from_numpy(data[:, [-1]])  # 'Survived' label column

    test_x = np.loadtxt('test.csv', delimiter=',', dtype=np.float32)
    test_x = torch.from_numpy(test_x)

    # gender_submission.csv provides the reference labels used to validate the test predictions.
    test_y = pd.read_csv('data/gender_submission.csv', delimiter=',', index_col=0)
    test_y = test_y.values
    return x, y, test_x, test_y
'''
Load the preprocessed training set and the validation (test) data.
'''
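
For reference, assuming the standard Kaggle Titanic files (891 training rows, 418 test rows) and that the preprocessing above drops no rows, the loaded data should have the following shapes; this check is an addition, not part of the original code:

# Expected shapes under the assumptions stated above.
x, y, test_x, test_y = data_handling()
print(x.shape, y.shape)            # torch.Size([891, 5]) torch.Size([891, 1])
print(test_x.shape, test_y.shape)  # torch.Size([418, 5]) (418, 1)
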
Training loop
def main():
    all_data = data_handling()
    x = all_data[0]
    y = all_data[1]
    test_x = all_data[2]
    test_y = all_data[3]

    # 5 input features, hidden widths of 330/300/215, a single output unit.
    net = Net(5, 330, 300, 215, 1)
    optimizer = torch.optim.Adam(net.parameters(), lr=0.00015)
    loss_func = torch.nn.BCELoss()
    plt.ion()
    plt.show()

    epochs = 1000
    for epoch in range(epochs):
        y_train_hat = net(x)
        loss_train = loss_func(y_train_hat, y)

        optimizer.zero_grad()
        loss_train.backward()
        optimizer.step()

        y_test_hat = net(test_x)
        if epoch % 5 == 0:
            # Every 5 epochs, plot the reference labels against the current test-set predictions.
            plt.cla()
            spot_1, = plt.plot(test_y, 'r*', lw=2)
            spot_2, = plt.plot(y_test_hat.data.numpy(), 'bo', lw=2)
            plt.xlabel('epoch')
            plt.ylabel('result')
            plt.legend((spot_1, spot_2), ['real', 'test'])
            plt.pause(0.1)
        if epoch == epochs - 1:
            # After the final epoch: threshold the probabilities, measure accuracy and save the predictions.
            threshold = 0.5
            prediction = (y_test_hat >= threshold).int()
            prediction = prediction.data.numpy()
            result = np.equal(test_y, prediction)
            correct_ratio = np.mean(result)
            print(correct_ratio)
            with open('result.csv', 'w') as f:
                np.savetxt(f, prediction, delimiter=',', fmt='%d')
    plt.ioff()
    plt.show()
'''
The number of input units equals the number of feature variables, the hidden layers use the unit counts given above, and the output layer has a single unit. The model is trained with the Adam optimizer at a learning rate of 0.00015 and the BCELoss binary cross-entropy loss, for 1000 iterations over the dataset.
'''
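
If you also want to watch training accuracy while the loss falls, a small helper such as the hypothetical binary_accuracy below can be called on y_train_hat and y inside the loop; this is a sketch layered on top of the original code, not part of it:

def binary_accuracy(probabilities, targets, threshold=0.5):
    # Fraction of thresholded sigmoid outputs that match the 0/1 targets.
    predictions = (probabilities >= threshold).float()
    return (predictions == targets).float().mean().item()

# Example use inside the training loop:
#     print(epoch, loss_train.item(), binary_accuracy(y_train_hat.detach(), y))
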
Run the program
if __name__ == '__main__':
    main()
'''
Run the main program.
'''

Model accuracy and the result dataset

  • The deep learning model produces the survivor predictions in result.csv, which are checked against the gender_submission.csv dataset to verify the model's accuracy; a sketch of this check follows.
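
A minimal sketch of that check, assuming result.csv holds one 0/1 prediction per test passenger in the same row order as data/gender_submission.csv:

import numpy as np
import pandas as pd

# Compare the saved predictions against the gender_submission reference labels.
prediction = np.loadtxt('result.csv', delimiter=',', dtype=int).reshape(-1, 1)
reference = pd.read_csv('data/gender_submission.csv', index_col=0).values
print('accuracy vs. gender_submission:', np.mean(prediction == reference))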

(Figure: result)
(Figure: correct)

The prediction accuracy reached 97.85%, which shows the model trained reasonably well. (After submitting, I kept training it several more times and reached an accuracy as high as 98.80%!)


Submitting the results

  • Submit the result.csv predictions; a sketch of packaging them into Kaggle's expected format is shown below.
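
Kaggle's Titanic competition expects a submission csv with PassengerId and Survived columns; if result.csv only contains the raw 0/1 predictions, it can be packaged like this (a minimal sketch with assumed file names, not taken from the original post):

import numpy as np
import pandas as pd

# Pair the predictions with the passenger ids from the original test file.
prediction = np.loadtxt('result.csv', delimiter=',', dtype=int)
passenger_id = pd.read_csv('data/test.csv')['PassengerId']
submission = pd.DataFrame({'PassengerId': passenger_id, 'Survived': prediction})
submission.to_csv('submission.csv', index=False)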

(Figure: grade)

The submission received a score of 77.751, ranking 4502nd. A pretty good result! 😁