Abstract
Data-driven product development is a key technology for systems engineering, especially in consumer-oriented industries such as the automotive industry. The basic prerequisite for all data-driven approaches is data itself. Due to the increasing networking capabilities of modern vehicles, automotive manufacturers are able to record and store customer data in the form of internal vehicle bus signals. The challenge in using this data is that it is not designed for external use, but for internal communication to ensure the safety and functionality of the vehicle. Therefore, the main question is how to extract customer needs and consumer-relevant information from this data using the process of data mining (DM). Consequently, in this paper, a literature review on the aforementioned use case is conducted. Based on the literature research, a DM simulation game is conducted to determine the suitability of existing DM processes in the area of requirements elicitation. Finally, a process extension is proposed that helps to systematically focus the DM process on customer-relevant information and thus accelerate the overall process.
The development of modern systems faces several challenges across all industries. System Design often still emerges bottom up from individual pieces, instead of top down from an architectur
First steps in data-driven automotive systems engineering (ASE) have been made by Bach et a
First, the state of the art for ASE is briefly presented, followed by a more detailed report on DM that presents the common methodologies and the differences between them. Then, the methodology used to derive the basic approach for automotive DM is explained, followed by an exemplary application of said approach. Lastly, conclusions are drawn and an outlook for the use of DM in ASE is given.

Fig.1 General procedure from data collection to data driven requirements
The state of the art in systems engineering is first briefly presented, followed by the state of the art in DM in more detail.
Systems engineering in general and automotive systems engineering especially are key areas of expertise for managing complex systems in the automotive worl
Application assistance for systems engineering is provided by generic process models such as the V-model according to VDI 220
The term DM is used to refer to just one step of one of the earliest DM methodologies called Knowledge Discovery in Databases (KDD
The tools and methods used in DM are framed inside of systematic processes. Motivated by the fact that blind application of said techniques can lead to the discovery of meaningless and invalid knowledg
The main distinction among the models lies in the recommended number and scope of their specific step
Developed by an industry consortium CRISP-DM is designed to be domain independen

Fig.2 Process model Cross Industry Standard for DM (CRISP-DM
Business Understanding: This first phase focuses on understanding the project goals and requirements from a business perspective. Based on that knowledge the task gets translated into a DM problem definition including a preliminary plan for achieving the set goal
Data Understanding: This phase begins with an initial data collection and continues with activities to familiarize oneself with the data, identify data quality problems, gain first insights into the data or uncover interesting subsets in order to form hypotheses for hidden informatio
Data Preparation: The data preparation phase deals with all tasks to transform the initial raw data into the final dataset that is being fed to the modeling tools in the next phase. As indicated in
Modeling: In the modeling phase numerous modeling techniques are selected and utilized depending on the DM problem. The model parameters are optimized through iterative application. The selected modeling technique often determines the specific form of the fed data. Therefore, as mentioned above stepping one step back to data preparation is often unavoidabl
Evaluation: This second to last phase deals with evaluating the developed model (possibly models) including thoroughly reviewing the executed steps along the way. The main task is to determine whether all goals set during the business understanding phase are sufficiently met. The evaluation phase ends with a decision on how to use the DM result
Deployment: In the last phase, the evaluated model(s) are deployed to enable data driven decisions and support in the business process for the end customer. Depending on the end customers’ requirements deployment might be as straightforward as generating a report or as complicated as implementing a repeatable DM proces
In this section, the practical approach to finding a suitable DM methodology is described first. Subsequently, the DM framework proposed in this paper is presented, which results from the knowledge accumulated while applying the former. The goal is to find the most suitable DM methodology for ASE. As mentioned in the state-of-the-art section on DM, several models have been developed in recent years. Standardized processes such as CRISP-DM, KDD and SEMMA are designed to be industry agnostic and as such might not fulfill the requirements of each domai
3.1 Simulation game for processing different theoretical requirement elicitation examples through DM
In order to find the most suitable DM methodology for ASE, a simulation game on how to theoretically address three different DM problems is conducted, focusing on how to repeatedly and reliably discover hidden knowledge about how the customer uses the vehicle. Question 1 (Q1): How does the customer use the manual shift mode of the automatic transmission? Question 2 (Q2): How does the customer use the preinstalled navigation system? Question 3 (Q3): What is the customer’s maximum speed? Three different types of questions were specifically picked to yield a wide field of possible DM tasks. Each question was intended to focus on a different aspect of the customer’s interactions with the vehicle: Q1 on interactions with the mechanical part of the vehicle, in this case the different shift input methods (paddles / stick), Q2 on interactions with the digital part of the vehicle, and Q3 on the whole vehicle including its surroundings, such as traffic or weather conditions. Each of these questions was examined separately in a DM simulation game in which the individual steps were processed one after the other. Due to the abstract nature of a simulation game, the focus could not be on specific models, the associated parameter tuning, data cleaning or similar details. Instead, the focus was primarily on the domain understanding, data understanding and evaluation phases.
In order not to get tangled up in the specifics of each individual DM process, the simulation game was performed using only the CRISP-DM methodology to first get a baseline of how to generally apply DM projects in the context of ASE. Documenting the steps along the way enabled the analysis of the three applications and the search for similarities as well as differences in order to deduce more general recommendations on automotive bus mining. Originally, the purpose of the analysis was to see whether any of the other techniques offered significant advantages over the CRISP-DM methodology. Partly during the application, but at the latest during the analysis, it became clear that most of the DM methods were applicable in principle, but only with the right framework supporting them. Three critical application weaknesses were discovered which, if circumvented, increase the chances of a successful DM project and thus the discovery of new insights into customer behavior.
The first and biggest weakness lies right at the start of the process, in the business understanding step. The simulation game shows that the instructions, due to their industry-agnostic design, are too broad and too unspecific for a simple, reproducible application. In particular, if the goal of the DM project is to discover new knowledge, the business understanding should be based on a system which ensures that no aspect of the customer behavior regarding the initial question is excluded. Without some sort of systematic approach, discovering hidden knowledge is like searching for a needle in a haystack.
The second issue discovered is that the data preprocessing of the vehicle bus signals is in some cases so complex that a separate DM process is necessary. The main reason for this is how the data is sent over the bus. Since the data is designed for internal communication and not for retrospective analysis, theoretically simple data preparation tasks require a great deal of effort. An example of this is the determination of the type of road the vehicle is on (e.g., motorway, country road or city road). Depending on the data protection regulations, it may be forbidden to save the GPS position of the vehicle or to use the navigation system data, as these also contain personal data. The only way around these regulations is to design models that can determine the type of road based on other signals that are not relevant to data protection. This can quickly escalate in complexity if not only simple road types are to be determined, but more detailed levels are desired, for example, to distinguish between freeway entrances and exits or country roads with and without serpentines.
The last issue results from the previous one. Since the data preprocessing step can in some cases lead to separate DM tasks, which in turn require their own data preprocessing steps, the time required for the entire DM project is greatly underestimated. This requires some sort of representation, since the time spent on the entire project as well as on the individual steps is an important aspect of the overall acceptance and transparency of such DM projects.
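One conceivable way to represent the effort of nested DM tasks is a simple task tree in which every task reports its own effort plus that of the DM tasks spawned by its preprocessing. The following is a minimal sketch only; the class `DMTask`, the task names and all hour values are hypothetical and are not taken from the simulation game.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class DMTask:
    """A DM task whose preprocessing may spawn further DM tasks (hypothetical model)."""
    name: str
    own_hours: float                      # effort spent on this task itself
    subtasks: List["DMTask"] = field(default_factory=list)

    def total_hours(self) -> float:
        # Roll up the effort of all nested tasks recursively.
        return self.own_hours + sum(t.total_hours() for t in self.subtasks)

    def report(self, depth: int = 0) -> List[str]:
        # One indented line per task, showing own and rolled-up effort.
        line = f"{'  ' * depth}{self.name}: own {self.own_hours:g} h, total {self.total_hours():g} h"
        lines = [line]
        for t in self.subtasks:
            lines.extend(t.report(depth + 1))
        return lines


# Illustrative values only.
project = DMTask("Q3 maximum speed analysis", 40, [
    DMTask("road type classification", 60, [
        DMTask("map matching preprocessing", 25),
    ]),
    DMTask("vehicle-ahead detection", 15),
])

print("\n".join(project.report()))
```

Such a roll-up makes visible how much of the total effort is hidden in nested preprocessing tasks rather than in the main analysis.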
The framework orients itself on the four fundamental steps of all DM processes, domain understanding, data understanding, modeling, and evaluation (compare Section 2.2). The overarching idea behind the proposed framework is to categorize customer behavior by means of use cases that are as detailed as possible and then to use them in the form of metadata for further analysis. The categorized metadata serve as a simplified representation of all vehicle bus signals relevant to the initial usage question. The whole framework is depicted in

Fig.3 Process model of proposed DM framework
The proposed DM framework starts with a usage question, similar to the questions Q1, Q2 and Q3 above. Based on this question, the first process step takes place: Use Case Derivation. As alluded to above, the task of this step is to map any user behavior related to the usage question in the form of use cases. In order to create as accurate a picture as possible of customer behavior, the simple but effective method of w-questions is recommended. This technique provides a repeatable system for categorization while allowing a differentiated view of customer usage. Furthermore, each question offers the possibility of further detailing. For example, the question “Where?” can be answered with an increasing level of detail: country, state, county, city, etc. If the entire fleet of customer vehicles of a manufacturer or even a group is examined, region-specific requirements might be derived as a result. The question of the type of road, which was already mentioned in the last section, forms another layer of the “Where?” question and can again be considered in various degrees of detail. Other notable examples are the questions “Why?” and “How?”. These are usually more complex and therefore require more effort than others, but they can also reveal more about customer behavior. An example is question Q3, where it can be more interesting to find out why customers did not drive faster than their actual speed. At the same time, this question is much more difficult to answer due to the numerous possible influences, such as weather conditions, surrounding traffic or, in the case of electric vehicles, the state of charge of the battery. Logically, depending on the questions to be investigated, differently detailed w-questions and consequently different use cases arise.
A further benefit of the proposed domain understanding method is that all questions and use cases can be recorded in the form of catalogues and reused for other analyses. This ensures a high degree of reproducibility.
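The use case derivation can be pictured as a small catalogue structure. The sketch below is an illustration under assumptions: the class `WQuestion`, the function `derive_use_cases` and the listed categories are hypothetical, and enumerating all category combinations is only one possible way of turning w-questions into use cases.

```python
from dataclasses import dataclass, field
from itertools import product
from typing import Dict, List


@dataclass
class WQuestion:
    """One w-question answered at a chosen level of detail (hypothetical model)."""
    name: str                              # e.g. "Where?"
    level: str                             # e.g. "road type"
    categories: List[str] = field(default_factory=list)


def derive_use_cases(usage_question: str, questions: List[WQuestion]) -> List[Dict[str, str]]:
    """Enumerate every combination of categories as one candidate use case."""
    keys = [f"{q.name} ({q.level})" for q in questions]
    combos = product(*(q.categories for q in questions))
    return [{"usage_question": usage_question, **dict(zip(keys, c))} for c in combos]


# Illustrative catalogue for Q3; the categories are assumptions.
catalogue = derive_use_cases(
    "Q3: What is the customer's maximum speed?",
    [
        WQuestion("Where?", "road type", ["motorway", "country road", "city road"]),
        WQuestion("Why not faster?", "constraint", ["speed limit", "vehicle ahead", "none"]),
    ],
)

print(len(catalogue))
print(catalogue[0])
```

Because the catalogue is plain data, it can be stored and reused for other usage questions, which is the reproducibility benefit described above.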
Once all use cases have been captured, the next process phase, Signal Abstraction (Signal Mining), begins. The goal of this phase is to map the previously derived use cases to suitable bus signals in order to record the occurrence of each use case in customer behavior in the form of scenario data or metadata. This is where the aforementioned varying complexity in mapping the use cases becomes apparent. Without privacy laws, for example, all of the “Where?” questions mentioned above could be answered simply by an accurate GPS position of the vehicle. If these laws apply, however, other solutions must be found, for example rule-based or classification algorithms that can map the use case in question through combinations of other signals. As soon as such a separate DM task arises, established DM methods such as CRISP-DM can be used.
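A rule-based signal abstraction of this kind could look as follows. This is a minimal sketch under stated assumptions: the signal names `radar_distance_m` and `speed_kmh`, the time-gap rule and the 3 s threshold are hypothetical and do not reproduce the algorithm used in this paper.

```python
from typing import Optional


def time_gap_s(distance_m: Optional[float], speed_kmh: float) -> Optional[float]:
    """Time gap to the object ahead; None if nothing is detected or the vehicle stands still."""
    if distance_m is None or speed_kmh <= 0:
        return None
    return distance_m / (speed_kmh / 3.6)   # km/h -> m/s


def vehicle_ahead(distance_m: Optional[float], speed_kmh: float, max_gap_s: float = 3.0) -> str:
    """Map two raw signals onto one categorical metadata value (assumed rule)."""
    gap = time_gap_s(distance_m, speed_kmh)
    return "vehicle_ahead" if gap is not None and gap <= max_gap_s else "free"


# Hypothetical bus signal samples.
samples = [
    {"speed_kmh": 120.0, "radar_distance_m": 40.0},
    {"speed_kmh": 120.0, "radar_distance_m": 200.0},
    {"speed_kmh": 90.0, "radar_distance_m": None},
]

metadata = [vehicle_ahead(s["radar_distance_m"], s["speed_kmh"]) for s in samples]
print(metadata)
```

The output of this step is a categorical label per sample, so later phases can work on the metadata without touching the raw signals again.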
With the generated metadata, the main analysis phase, Scenario Filtering (Scenario Mining), can start either exclusively on a metadata basis or in combination with raw data. The aim is to gain new insights and information about customer behavior through mining, for example in the form of structure-discovery processes or through an intelligent combination of metadata such as conditional scenarios. Processes like CRISP-DM can also be used here. Finally, in the last phase, Interpretation, the collected partial results are brought together and evaluated in relation to the initial question. The desired product of the overall process is an objective, data-driven requirements recommendation.
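Conditional scenarios on a metadata basis can be sketched as a chain of filters in which each scenario layer removes one further influence. The records, metadata fields and filter order below are invented for illustration and are not data from this study.

```python
from statistics import median

# Synthetic records: one speed sample plus categorical metadata (assumed fields).
records = [
    {"speed_kmh": 50, "road": "city", "limit": True, "vehicle_ahead": True},
    {"speed_kmh": 95, "road": "country", "limit": True, "vehicle_ahead": False},
    {"speed_kmh": 120, "road": "motorway", "limit": True, "vehicle_ahead": False},
    {"speed_kmh": 130, "road": "motorway", "limit": False, "vehicle_ahead": True},
    {"speed_kmh": 170, "road": "motorway", "limit": False, "vehicle_ahead": False},
    {"speed_kmh": 190, "road": "motorway", "limit": False, "vehicle_ahead": False},
]

# Each layer filters out exactly one potentially limiting factor.
filters = [
    ("motorway only", lambda r: r["road"] == "motorway"),
    ("no speed limit", lambda r: not r["limit"]),
    ("no vehicle ahead", lambda r: not r["vehicle_ahead"]),
]


def build_scenarios(records, filters):
    """Return cumulative scenarios: unfiltered first, then one extra filter per layer."""
    scenarios = [("unfiltered", list(records))]
    current = list(records)
    for name, keep in filters:
        current = [r for r in current if keep(r)]
        scenarios.append((name, current))
    return scenarios


scenarios = build_scenarios(records, filters)
for name, subset in scenarios:
    print(name, len(subset), median(r["speed_kmh"] for r in subset))
```

Since the filters only read metadata, the same scenario definitions can be reapplied to a different usage question or data set.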
In this section, part of the exemplary application of the proposed DM framework is presented based on question Q3: What is the customer’s maximum speed? As a preliminary disclaimer, the data used for the application was recorded from test vehicles and not from customer vehicles. Accordingly, this is only a presentation of the methodology and not the data or the knowledge gained as there are likely to be differences in the driving behavior of test drivers and normal customers. Nevertheless, the data are recordings from real road traffic on German roads, so except for the driver, there should be no difference to the data that could be collected from customers. Thus, there is nothing to prevent the methods developed from also being applicable to customer data.
The application focuses primarily on the middle two process phases and in particular on a question already mentioned in the previous section: Why did the driver not speed up? The identification and subsequent combination of the constraining influences allows a conditional statement about the speed chosen by drivers when they are not constrained by external influences such as surrounding traffic or speed limits. This can provide a new perspective on the same data, increasing the quality of the information and providing new knowledge about the data.
In the first phase of the process, the w-question method was used to identify numerous influences that prevent drivers from driving faster, such as the type of road (especially in Germany with its mostly unrestricted autobahns), speed limits, vehicles ahead, and weather conditions. The mapping of these identified influences onto simple categorical metadata in the second phase epitomizes the varying complexity mentioned in the introduction of the framework. Speed limits and temperature (as a simplified representation of weather conditions) could be mapped directly through simple bus signals. In order to be able to map the existence of a vehicle in front by means of a categorical variable, a more complex rule-based algorithm had to be used and, due to data protection regulations in Germany, even a machine learning classification algorithm for determining the road type. The classification model required extensive preprocessing to match the GPS positions recorded by test vehicles, for which other data protection regulations apply, to different road types with the help of OpenStreetMa

Fig.4 Exemplary representation of road classification, unclassified trip sections above and classified sections below
Both the rule-based vehicle detection and the road type classification are examples of complex preprocessing and separate DM, respectively. This underlines why the associated process step is called signal mining. Furthermore, both methods illustrate the reusability of the developed methods, as both can be used for other problems without further ado.
The derived metadata could be used in the next step of the process to make a conditional statement about the question listed at the beginning. Based on the individual metadata, scenarios of varying detail were formed, and one potentially limiting environmental factor was filtered per scenario layer.

Fig.5 Representation of the speed distribution per scenario, each scenario filters one factor influencing the choice of speed
It is probably not possible to identify all influences on the choice of speed, let alone filter them. Other influences such as temperature were not relevant for this evaluation, as all trips were recorded under similar conditions. Nevertheless, the scenarios illustrate a clear trend: the fewer factors constrain the choice of speed, the higher the speed that drivers choose. Also remarkable is the decreasing dispersion of the recorded speeds per scenario, which can easily be recognized by the shrinking interquartile range of the individual boxplots.
With regard to the issue under consideration, it becomes clear how important methods such as the scenario filtering shown here are for requirements elicitation in the automotive sector. Assume that for a new vehicle project, a requirement for the maximum speed of the vehicle is sought. Based on the unfiltered data (Scenario 1), it would be reasonable to set the requirement recommendation for the maximum speed to a value around the upper whisker of the boxplot (about 160 km/h). In this way, only about 1.5% of the recorded values would be unattainable for the new vehicle. However, if only the data in which drivers were not restricted in their choice of speed were consulted (Scenario 5), the same approach would result in a recommendation for a maximum speed of over 200 km/h. At the same time, the data show that with the first requirement recommendation, about 50% of all driving situations in which the driver has a free choice of speed would be unattainable for a vehicle developed according to this requirement. This could lead to customers either being dissatisfied with the product or even ruling out a purchase altogether. Alternatively, the data could be used to set a threshold that determines how much current customer behavior may be constrained by the requirements of the new product. For example, if the goal is to maintain 90% of the currently possible customer behavior in terms of the maximum speed driven, the maximum speed requirement may be reduced to 185 km/h.
The example illustrates the added value of conditional information for requirements elicitation. The method presented here allows users to compile their own brand-relevant scenarios via the collection of metadata and to design their products specifically according to a scenario-dependent product usage analysis, resulting in objective, data-driven and customer-centric products.
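The two ways of deriving a requirement recommendation discussed in this section, the upper whisker and a coverage threshold, can be sketched as follows. The speed values are synthetic and do not reproduce the figures reported above; the Tukey whisker convention (third quartile plus 1.5 times the interquartile range) is an assumption, since the boxplot convention is not stated here.

```python
import math
from statistics import quantiles


def upper_whisker(values):
    """Largest value not exceeding Q3 + 1.5 * IQR (assumed Tukey convention)."""
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    fence = q3 + 1.5 * (q3 - q1)
    return max(v for v in values if v <= fence)


def share_unattainable(values, v_max):
    """Fraction of recorded speeds above a candidate maximum speed requirement."""
    return sum(v > v_max for v in values) / len(values)


def speed_for_coverage(values, coverage):
    """Smallest recorded speed that covers at least `coverage` of all samples."""
    ordered = sorted(values)
    idx = max(math.ceil(coverage * len(ordered)) - 1, 0)
    return ordered[idx]


# Synthetic speeds (km/h) from an unconstrained scenario; illustrative only.
free_choice = [150, 160, 165, 170, 175, 180, 185, 190, 200, 210]

print(upper_whisker(free_choice))
print(share_unattainable(free_choice, 160))
print(speed_for_coverage(free_choice, 0.9))
```

Evaluating `share_unattainable` on the filtered scenario with a recommendation derived from the unfiltered one corresponds to the comparison made above between Scenario 1 and Scenario 5.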
Data-driven customer integration in the product development process is one of the key technologies that need to be mastered in order to remain competitive in a world of increasing product complexity and growing customer market power. One approach to mastering this challenge is customer DM to support the requirements elicitation process. On the basis of an intensive DM simulation game, this paper worked out why the existing DM processes are not suitable for this purpose. The identified weaknesses of the existing processes as well as the insights collected in the simulation game were used to design a suitable DM process for customer DM. The proposed methodology was explained in detail, and its added value was then presented through an application example. One of the key aspects of the proposed methodology is the categorization of customer behavior through metadata. The analysis of this metadata provides a reliable and repeatable system that facilitates the discovery of new knowledge in the data. For this reason, it can be stated that metadata categorization is a key technique for understanding the intricacies of the inner workings and interactions of modern vehicle bus systems, and thus for understanding and mapping customer behavior. In the long term, there is the possibility of categorizing every aspect of customer-vehicle interaction to enable true big data analytics.
A major downside of this approach lies in the fact that this type of data analysis is by definition reactive. Only the customer behavior that is possible in the context of the current vehicle can be analyzed. Therefore, this approach must always be used in conjunction with other, proactive customer analyses.
Research needs for the process model proposed here include the suitability of the developed methods in the real requirements derivation process. Subsequently, it would be particularly interesting to see to what extent currently existing requirements deviate from the purely data driven requirements. Furthermore, it should be investigated which methods are best suited to discover new patterns and thus new knowledge from large categorized customer databases through unsupervised learning methods.
References
MAURER M, WINNER H. Automotive systems engineering[J]. Berlin: Springer, 2013.
D'AMBROSIO J, SOREMEKUN G. Systems engineering challenges and MBSE opportunities for automotive system design[C]// 2017 IEEE International Conference on Systems, Man, and Cybernetics (SMC). Banff: IEEE Xplore, 2017: 2075.
MASCHOTTA R, WICHMANN A, ZIMMERMANN A, et al. Integrated automotive requirements engineering with a SysML-based domain-specific language[C]// 2019 IEEE International Conference on Mechatronics (ICM). Ilmenau, Germany: IEEE Xplore, 2019: 402. DOI: https://doi.org/10.1109/ICMECH.2019.8722951.
FANK P, BOJA D, ABTHOFF T. Big data driven vehicle development – Technology and potential[C]// 2020 Internationales Stuttgarter Symposium. Wiesbaden: Springer Fachmedien Wiesbaden, 2020: 315. DOI: https://doi.org/10.1007/978-3-658-30995-4_31.
FRAGNER A, KREIS A, HIRZ M. Virtual tools to support design and production engineering: Early detection of stone chips to optimize production processes[C]// 2020 IEEE 7th International Conference on Industrial Engineering and Applications (ICIEA). Paris: IEEE, 2020: 399. DOI: https://doi.org/10.1109/ICIEA49774.2020.9102004.
BACH J, LANGNER J, OTTEN S, et al. Data-driven development, a complementing approach for automotive systems engineering[C]// 2017 IEEE International Systems Engineering Symposium (ISSE). IEEE, 2017: 1. DOI: https://doi.org/10.1109/SysEng.2017.8088295.
BAJZEK M, FRITZ J, HICK H. Systems engineering principles[C]// HICK H, KÜPPER K, SORGER H. Systems Engineering for Automotive Powertrain Development. Cham: Springer International Publishing, 149. DOI: https://doi.org/10.1007/978-3-319-99629-5_7.
SILLITTO H, MARTIN J, MC KINNEY D, et al. Systems Engineering and System Definitions[Z]. San Diego: INCOSE — International Council on Systems Engineering, 2019.
VDI. Entwicklung cyber-physischer mechatronischer systeme (CPMS): VDI/VDE 2206[S]. 2020[2021-10-11]. https://www.vdi.de/richtlinien/details/vdivde-2206-entwicklung-cyber-physischer-mechatronischer-systeme-cpms.
VDI. Methodik zum Entwickeln und Konstruieren technischer Systeme und Produkte: VDI 2221[S]. 1993[2021-10-11]. https://www.vdi.de/richtlinien/details/vdi-2221-methodik-zum-entwickeln-und-konstruieren-technischer-systeme-und-produkte.
BAJZEK M, FRITZ J, HICK H. Systems engineering processes[C]// HICK H, KÜPPER K, SORGER H. Systems Engineering for Automotive Powertrain Development. Cham: Springer International Publishing, 2021. 235. DOI: https://doi.org/10.1007/978-3-319-99629-5_9.
ESCH J, RETTMANN A, MARZINEAK S. A systems engineering approach to electromagnetic compatibility[C]// LIEBL J. Der Antrieb von morgen 2021. Berlin: Springer Berlin Heidelberg, 2021: 167. DOI: https://doi.org/10.1007/978-3-662-63403-5_11.
International Organization for Standardization. Information technology — Process assessment: ISO/IEC 15504-5: 2012[S]. [2021-10-11]. https://www.iso.org/standard/60555.html.
FAYYAD U, PIATETSKY-SHAPIRO G, SMYTH P. From data mining to knowledge discovery in databases[J]. AI Magazine, 1996, 17: 37.
FAYYAD U, PIATETSKY-SHAPIRO G, SMYTH P. The KDD process for extracting useful knowledge from volumes of data[J]. Commun ACM, 1996, 39(11): 27. DOI: https://doi.org/10.1145/240455.240464.
KURGAN L A, MUSILEK P. A survey of knowledge discovery and data mining process models[J]. The Knowledge Engineering Review, 2006, 21(1): 1. DOI: https://doi.org/10.1017/S0269888906000737.
MARISCAL G, MARBÁN Ó, FERNÁNDEZ C. A survey of data mining and knowledge discovery process models and methodologies[J]. The Knowledge Engineering Review, 2010, 25(2): 137. DOI: https://doi.org/10.1017/S0269888910000032.
HAN J W, KAMBER M, PEI J. Data Mining: Concepts and Techniques[M]. [S. l.]: Elsevier professional, 2011.
ROTONDO A, QUILLIGAN F. Evolution Paths for Knowledge Discovery and Data Mining Process Models[J]. SN Computer Science, 2020, 1(2): 109. DOI: https://doi.org/10.1007/s42979-020-0117-6.
CIOS K J, KURGAN L A. Trends in Data Mining and Knowledge Discovery[C]// PAL N R, JAIN L. Advanced Techniques in Knowledge Discovery and Data Mining. London: Advanced Information and Knowledge Processing, Springer London, 2005: 1. DOI: https://doi.org/10.1007/1-84628-183-0_1.
PLOTNIKOVA V, DUMAS M, MILANI F. Adaptations of data mining methodologies: A systematic literature review[J]. PeerJ Computer Science, 2020, 6(2): e267. DOI: https://doi.org/10.7717/peerj-cs.267.
CHAPMAN P, CLINTON J, KERBER R, et al. CRISP-DM 1.0. Step-by-step data mining guide[A]. [S. l.]: CRISP-DM Consortium, 2000.
MARBÁN Ó, MARISCAL G, SEGOVI J. A data mining & knowledge discovery process model[C]// PONCE J, KARAHOC A. Data Mining and Knowledge Discovery in Real Life Applications. [S. l.]: I-Tech Education and Publishing, 2009. DOI: https://doi.org/10.5772/6438.
MARTÍNEZ-PLUMED F, CONTRERAS-OCHANDO L, FERRI C, et al. CASP-DM context aware standard process for data mining [DB/OL]. (2017-09-19) [2021-05-28]. https://doi.org/10.48550/arXiv.1709.09003.
PECHENIZKIY M, PUURONEN S, TSYMBAL A. Does relevance matter to data mining research?[C]// KACPRZYK J, LIN T Y, XIE Y, et al. Data Mining: Foundations and Practice: Part of the Studies in Computational Intelligence book series (SCI, volume 118). Berlin: Springer Berlin Heidelberg, 2008: 251. DOI: https://doi.org/10.1007/978-3-540-78488-3_15.
PLOTNIKOVA V, DUMAS M, MILANI F. Adapting the CRISP-DM data mining process: A case study in the financial services domain[C]// CHERFI S, PERINI A, NURCAN S. Research Challenges in Information Science: Lecture Notes in Business Information Processing. Cham: Springer International Publishing, 2021. 55. DOI: https://doi.org/10.1007/978-3-030-75018-3_4.
OpenStreetMap.org. Map data [DB/OL]. [2021-05-20]. https://www.openstreetmap.org.